Daan De Meyer [Thu, 18 Dec 2025 09:28:30 +0000 (10:28 +0100)]
process-util: Use pidref_wait_for_terminate_and_check in pidref_safe_fork()
Note that we still have to block SIGCHLD so that
we can be certain the process is not reaped before
we get the pidfd to it. safe_fork() and friends are
used in libsystemd where we don't control how the
SIGCHLD signal is configured. Specifically, kernel
autoreaping could be enabled which is why we have
to block SIGCHLD until we get the pidfd so that the
kernel cannot autoreap the process before we get
the pidfd.
Daan De Meyer [Thu, 18 Dec 2025 09:30:36 +0000 (10:30 +0100)]
sd-event: Clean up SIGCHLD conditions for sd_event_add_child()
First, don't require blocking SIGCHLD for WEXITED. We watch for WEXITED
via pidfd instead of signalfd, so no need to insist on blocking SIGCHLD
anymore if we're only watching for WEXITED.
Second, do a proper check to see if the kernel autoreaping logic is
enabled. That has nothing to do with SIGCHLD being blocked for the current
thread or not. Instead, the kernel autoreaping logic is enabled either if
the disposition is set to SIG_IGN or if the SA_NOCLDWAIT flag is enabled.
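To illustrate the second point, here is a minimal Python sketch (not the sd-event code). With the SIGCHLD disposition set to SIG_IGN the kernel reaps the child on its own, so waitpid() fails with ECHILD, independently of whether the signal is blocked. The helper name child_is_autoreaped() is made up for this example. The SA_NOCLDWAIT case is not covered, since Python's signal module does not expose sigaction flags.

```python
import os
import signal


def child_is_autoreaped() -> bool:
    """Fork a child while SIGCHLD is ignored and report whether the kernel reaped it."""
    previous = signal.signal(signal.SIGCHLD, signal.SIG_IGN)
    if previous is None:
        # Handler was not installed from Python; fall back to the default.
        previous = signal.SIG_DFL
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit right away
        try:
            # Blocks until the child is gone, then fails because nothing is left to reap.
            os.waitpid(pid, 0)
        except ChildProcessError:
            return True  # ECHILD: the kernel autoreaped the child
        return False
    finally:
        signal.signal(signal.SIGCHLD, previous)
```

This has to run on the main thread, as Python only allows changing signal dispositions there.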
Mike Yuan [Sun, 14 Dec 2025 15:02:17 +0000 (16:02 +0100)]
process-util: revamp flags handling in namespace_fork()
* Specifying all 3 of FORK_DEATHSIG_SIG{KILL,TERM,INT} for
the middle man makes zero sense. Use SIGKILL only.
* Make sure operations on except_fds work sensibly - close/pack/
de-ocloexecify fds only in the second level, so that the namespace
fds remain usable across first safe_fork().
* Fire FORK_NEW_*NS after attaching to the desired namespaces,
not already in the outer process.
* Insist on PDEATHSIG being enabled to ensure propagation of killing.
* Suppress more redundant flags.
Luca Boccassi [Fri, 19 Dec 2025 13:19:16 +0000 (13:19 +0000)]
core: reuse existing dm-verity device for single filesystem images pinned by policy (#40007)
Loading images is, generally speaking, the slowest part of sd-executor
when spawning a service. This is due to multiple factors. dm-verity is
obviously a big part of the cost, but dissecting in general via libblkid
also can take a lot of time, due to probing the images and their
filesystems.
A performance test doing service restarts in a row, run on a production
system (low-power, slow ARM64 SoC) with a real production service, shows
the following service interruption intervals:
One iteration is 507 restarts in this case, but this has been run hundreds
of times and the results are always in line within margin of error.
This also holds true for metrics from live systems, same numbers.
Between 1.0s and 1.2s can be attributed by profiling to the time needed
for the service code itself to start up and sd_notify, the rest is spent
inside systemd's code.
This means there is currently a tradeoff for services - either use secure
images, or make restarting fast. Downtime of services is a very important
metric, as for many cases this directly translates to outages, total or
partial (blackout or greyout).
In order to facilitate using secure images without downsides, skip the
slow dissect steps (probing, loop devices, etc) when the configured
image is a single filesystem dm-verity image with a policy that pins it
to a single filesystem, and an already existing and open dm-verity device
can be found and reused.
This allows orchestrators to pre-open images on download, before restarting
services, to minimize downtimes.
meson: avoid double compilation for standalone progs
So far we compiled the normal and standalone versions completely
independently. Let's use the 'extract' template pattern to avoid any
additional compilation and only require a single link to produce the
.standalone variants.
Unfortunately, as designed, the 'extract' framework only allows one set of
object files to be extracted. Since we need all the files for the
.standalone version, we cannot use 'extract' for other purposes. Thus, in
the two cases where 'extract' was used for the test binaries, this is now
changed to compile the files a second time. But the number of files in that
list is small, so this seems like a better option.
(If we weren't using the template system, we could easily extract just the
objects we need. But with the current system, at the point of the
definition, the binaries are not defined yet. We'd need to handle all of
this through sets of dictionaries, and that just seems like too much
trouble to avoid double compilation of a few small files.)
If we ever want to add it back, it should be with -DSTANDALONE=0|1, so
that #if instead of #ifdef can be used. We generally converted our internal
defines to that form.
sysusers,tmpfiles: make standalone versions full-featured
This effectively reverts 3537577c37d2c23a518540d36884a127aab944f8. Originally,
the #ifdefs were added because we didn't want to pull in the whole tree of
libmount and other dependencies in standalone versions. But dependencies are
now loaded through dlopen(), so this is not needed anymore. (And doesn't even
make much of a difference.)
meson: allow .standalone version to be always built
Allow .standalone version to be built on-demand, even if -Dstandalone=false
is configured. In other words, this changes the meson option from a hard
disablement to a soft "build is on/off by default".
The meson config was originally written in this way but we lost this feature
after the transition to templates. It is nice to build additional targets on
demand during development, so add this back.
meson: put src/import source lists directly in templates
The indirection through variables doesn't seem that useful here:
on the one hand, the lists are short, and on the other, there is a bunch of different
programs with similar names. Overall, it's all easier to follow if
the lists are inline.
* cac8dde28a test: Allow passing in extra tests to skip via TEST_SKIP
* 56377438ba Disable sysinit-path for upstream builds
* 0c8ea706f9 Fix links to patches
* 4f5b5a9615 Version 259
* bf8019c840 Version 259~rc3
* ef777d6572 Check if --max-lines is supported by meson
* b562e38e22 Fix use of removed $LOCAL_CONF variable
* 0289127dae Patch machined to continue after selinux denial
* 7e409130ee Version 259~rc2
* 33b38cdbc7 Suppress errors from tar
* ddb6474e94 Drop provides for removed sysvinit tools
* 9ac8c36307 Set meson auto features to auto when building for upstream
Luca Boccassi [Fri, 19 Dec 2025 11:33:13 +0000 (11:33 +0000)]
tools: use -f in mkosi summary in fetch-distro.py
$ ./tools/fetch-distro.py -u fedora
+ mkosi --json -d fedora summary
‣ Ignoring --distribution from the CLI. Run with -f to rebuild the image with this setting
Luca Boccassi [Thu, 27 Feb 2025 16:58:55 +0000 (16:58 +0000)]
core: reuse existing dm-verity device for single filesystem images pinned by policy
Loading images is, generally speaking, the slowest part of sd-executor
when spawning a service. This is due to multiple factors. dm-verity is
obviously a big part of the cost, but dissecting in general via libblkid
also can take a lot of time, due to probing the images and their filesystems.
A performance test doing service restarts in a row, run on a production
system (low-power, slow ARM64 SoC) with a real production service, shows
the following service interruption intervals:
One iteration is 507 restarts in this case, but this has been run hundreds
of times and the results are always in line within margin of error.
This also holds true for metrics from live systems, same numbers.
Between 1.0s and 1.2s can be attributed by profiling to the time needed
for the service code itself to start up and sd_notify, the rest is spent
inside systemd's code.
This means there is currently a tradeoff for services - either use secure
images, or make restarting fast. Downtime of services is a very important
metric, as for many cases this directly translates to outages, total or
partial (blackout or greyout).
In order to facilitate using secure images without downsides, skip the
slow dissect steps (probing, loop devices, etc) when the configured
image is a single filesystem dm-verity image with a policy that pins it
to a single filesystem, and an already existing and open dm-verity device
can be found and reused.
This allows orchestrators to pre-open images on download, before restarting
services, to minimize downtimes.
* d9f2aa1704 Install systemd-tpm2-generator.8 only for UEFI builds
* ac1c7d8048 Drop dependencies on libcap-dev, no longer used since v259
* c36e5871ca Do not install systemd-sysv-generator.8 in upstream build
* bac0cca0e8 Install new files for upstream build
* 2855fb1302 Update changelog for 259-1 release
Useful to extract a certificate from a hardware token to a file, for
example in mkosi to ship the certificate from a hardware token in
/usr/lib/verity.d in an image
repart: add basic support for LUKS2 integrity verification (#39295)
Authenticated disk encryption is experimentally supported by cryptsetup
since v2.0.0 and allows for automatic dm-integrity setup for LUKS
devices. Add support for this mode to systemd-repart. The PR adds support
for `cryptsetup luksFormat --integrity` to systemd-repart and an
"encryptedwithintegrity" dissection policy.
Limitations:
- No discard, online-only mode for repart.
Miao Wang [Thu, 13 Nov 2025 19:49:15 +0000 (03:49 +0800)]
ssh-proxy: expect OK PORT response from vsock-mux
The unix-domain socket to AF_VSOCK multiplexers in Firecracker and
vhost-device-vsock sends OK PORT response to the client, resulting
ssh clients to abort the connection with the additional response. This
patch addresses this issue by waiting and expecting the possible OK PORT
response from the multiplexer, if any, and then handover the socket fd
to the ssh client. It only checks if the response begins with OK and
consume the response till the first \n, for simplicity.
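The check described above can be sketched roughly as follows. This is a Python illustration, not the ssh-proxy code: the function name consume_ok_line() and the 64-byte limit are assumptions made for the example. It reads one byte at a time so that nothing past the first \n is taken away from the ssh client.

```python
import socket


def consume_ok_line(sock: socket.socket, limit: int = 64) -> bytes:
    """Read up to and including the first newline and require an 'OK' prefix."""
    line = bytearray()
    while not line.endswith(b"\n"):
        if len(line) >= limit:
            raise ValueError("response line too long")
        chunk = sock.recv(1)  # byte-wise, so the SSH banner stays in the socket
        if not chunk:
            raise ConnectionError("peer closed before end of response")
        line += chunk
    if not line.startswith(b"OK"):
        raise ValueError("unexpected multiplexer response: %r" % bytes(line))
    return bytes(line)
```

After the call returns, the socket is positioned at the first byte following the newline and can be handed over as-is.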
Signed-off-by: Miao Wang <shankerwangmiao@gmail.com>
Daan De Meyer [Mon, 1 Dec 2025 21:21:45 +0000 (22:21 +0100)]
keyutil: Add extract-certificate
Useful to extract a certificate from a hardware token to a file, for
example in mkosi to ship the certificate from a hardware token in
/usr/lib/verity.d in an image
Luca Boccassi [Mon, 24 Nov 2025 20:07:00 +0000 (20:07 +0000)]
fido2: fix enrolling when UV is required ('alwaysUv')
When a Yubikey or other fido2 device has FIPS mode enabled, UV will
always be required and cannot be disabled. Unhelpfully, when it is not
sent down, the hardware token (not the library) returns a generic
FIDO_ERR_MISSING_PARAMETER:
Jeremy Kerr [Fri, 11 Jul 2025 01:34:05 +0000 (09:34 +0800)]
udev-builtin-net_id: Extend persistent naming support to MCTP interfaces
Now that we have Management Component Transport Protocol (MCTP) transports
available over USB, it would be helpful to apply udev's persistent
naming rules to MCTP interfaces, to follow the USB hub/port topology.
Enable persistent naming for ARPHRD_MCTP-type devices, using a "mc" name
prefix, and add appropriate definitions for the v260 naming scheme.
Popax21 [Tue, 9 Dec 2025 01:56:01 +0000 (02:56 +0100)]
nss-resolve: add env var to specify resolved ifindex
Adds a new `SYSTEMD_NSS_RESOLVE_INTERFACE` environment variable to the nss-resolve module, whose value is subsequently passed down to the `ifindex` resolved lookup option.
This allows name lookups to be constrained to just a single interface, e.g. for captive portal browsers.
core: use terminal_get_size_by_csi18 to query terminal size
This allows us to query the window size without moving the cursor. We have
various reports about the cursor being in an unexpected position and/or state.
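For reference, the xterm window operation CSI 18 t is answered with CSI 8 ; rows ; columns t. A minimal Python sketch of parsing such a reply follows. It is not the terminal_get_size_by_csi18() implementation, and the function name parse_csi18_reply() is made up for this example.

```python
import re

# Reply to "ESC [ 18 t": "ESC [ 8 ; <rows> ; <columns> t"
_CSI18_REPLY = re.compile(rb"\x1b\[8;(\d+);(\d+)t")


def parse_csi18_reply(data: bytes):
    """Return (rows, columns) from a CSI 18 reply, or None if none is found."""
    match = _CSI18_REPLY.search(data)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Since the reply carries the size directly, the cursor does not have to be moved to measure the window.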
LUKS2 supports built-in integrity checking, which may come in very handy to
mitigate partial rollback attacks on the storage, where only some specific
parts are restored to some old encrypted state. Specific use-cases like
Confidential VMs may want to mandate the usage of the feature, e.g. on the root
volume. Introduce the "encryptedwithintegrity" image policy to support that.
Note, due to the current libcryptsetup limitations, checking whether the
feature is enabled or not for the 'file' case (e.g. DDI image as a raw file)
requires setting up a loop device. To avoid that and keep dissect fully
functional when working unprivileged, implement a minimal custom LUKS header
parser.
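As a rough idea of what such a check involves, here is a Python sketch, not the parser added by this commit. It assumes the LUKS2 on-disk layout: a 4096-byte binary header with magic, version and hdr_size as big-endian fields, followed by a NUL-padded JSON area in which a segment using authenticated encryption carries an "integrity" object. The function name luks2_has_integrity() is hypothetical, and the secondary header and checksums are ignored.

```python
import json
import struct

LUKS_MAGIC = b"LUKS\xba\xbe"
BINARY_HEADER_SIZE = 4096  # assumption: the JSON area starts right after it


def luks2_has_integrity(image: bytes) -> bool:
    """Report whether any segment in the primary LUKS2 header declares integrity."""
    # magic[6], version (u16), hdr_size (u64), all big-endian, no padding
    magic, version, hdr_size = struct.unpack_from(">6sHQ", image, 0)
    if magic != LUKS_MAGIC or version != 2:
        raise ValueError("not a LUKS2 header")
    if hdr_size <= BINARY_HEADER_SIZE or hdr_size > len(image):
        raise ValueError("implausible header size")
    # The JSON text is NUL-padded up to the end of the header area.
    raw = image[BINARY_HEADER_SIZE:hdr_size].split(b"\0", 1)[0]
    metadata = json.loads(raw)
    segments = metadata.get("segments", {})
    return any("integrity" in segment for segment in segments.values())
```

Only plain reads of the first header area are needed, so this works on a raw image file without a loop device.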
repart: add basic support for LUKS2 integrity verification
Authenticated disk encryption is experimentally supported by cryptsetup since
v2.0.0 and allows for automatic dm-integrity setup for LUKS devices. Add
support for the mode to systemd-repart. Currently, the option can only be used
in 'online' mode as libcryptsetup does not support creating integrity data
without the use of in-kernel dm-integrity infrastructure.
Integrity=/IntegrityAlgorithm= are added in anticipation of other integrity
protection options, e.g. enabling dm-integrity for a plain unencrypted
partition.
* 6f15bdaae7 Update architecture match for 50-pid-max.conf (v3)
* 333cc1fcc5 Downgrade depends to recommends for IPC endpoint of respective libnss modules
* ab99a1b51a Revert "Update architecture match for 50-pid-max.conf"
* b93d7f855a Update changelog for 259~rc3-1 release
* 95c7f8a3d6 Install new udev rule
* 89509d9692 d/t/tests-in-lxd: re-construct --pin-packages arguments for autopkgtest
* 6b77249c71 d/extra/dbus-1: rename systemd-localed-read-only.conf
* 819831c19a Update architecture match for 50-pid-max.conf
* 0ddff89e9d Mirror dmi_arches from meson.build into debian/udev.install
* 398e8791db d/t/control: pull in optional libs for boot-and-services too
* c727922ad5 Update changelog for 259~rc2-1 release
* 8faf105531 Install new varlinkctl bash completion script
* f4b4cea2be d/t/control: ensure unit-tests autopkgtest pulls in dlopened libraries for test
* 7e8aba9883 Update changelog for 259~rc1-1 release
* 5953c42402 Update symbols file for v259~rc1
* 353125ccfa Install new files for v259~rc1
* ca22d1ca4f Drop patches, all merged upstream
* 32c75efca2 d/t/unit-config: fix python decorator copypasta
* e32179d633 d/rules: disable sysv compat in upstream builds
* cf77bd44be Install new files for upstream build
* aa564e5d3b kernel-install: skip 55-initrd.install when an initrd is already staged
In setup_output() we assume stdout has been set up properly
before stderr, hence the stdout we're inheriting from must
be writable (or more precisely, would have been adjusted to be).
Hence no need to duplicate it.
Mike Yuan [Sat, 22 Nov 2025 18:23:53 +0000 (19:23 +0100)]
core/exec-invoke: split out maybe_inherit_stdout_from_stdin(), use exec_input_is_inheritable()
Note that exec_input_is_inheritable() rightfully refuses EXEC_INPUT_FILE,
in which case std_output would have been reset in service_fix_stdio()
already.
While at it, use the generic fallback logic of first trying user manager
stdout when stdin is not writable.
Mike Yuan [Sat, 22 Nov 2025 06:10:09 +0000 (07:10 +0100)]
core/execute-serialize: clean up stdio serialization
* Do not interleave root_directory_as_fd with stdio fields
* Do not use different serialization key for different modes
pointing to same path
* Escape stdio file paths (as per 9be46b1da8b01c3f47e6c050185f2b45484d6300)
Luca Boccassi [Wed, 3 Dec 2025 18:59:34 +0000 (18:59 +0000)]
core: set Result=start-limit-hit when a unit is rate limited
There is currently no way to figure out a rate limit was hit on a unit,
as the last result is stripped in order to keep reporting the first
result, which is useful in case of a watchdog failure, which is the
reason why it was changed as such.
But rate limiting is also important information to provide to
users, so allow the Result property to reflect it when it
happens.
man/systemd-boot: say that /EFI/systemd/drivers is for hardware
In aad0d11e7c6f1f7dcc7b00173140c74b8abf88cc we stopped supporting XBOOTLDR
with a different fs driver. This was the primary example that comes to mind
when we talk about loading filesystem drivers in the firmware. Since we don't
want people to load such drivers, use a different example.
docs/BOOT_LOADER_INTERFACE: use full variable names once
We said in the header that "all EFI variables use the vendor UUID 4a67b082-0a4c-41cf-b6c7-440b29bb8c4f", but people not familiar with
UEFI might not know that this is concatenated with the variable name.
Let's use the full form once — when introducing the variable — to
make it easier to grep and search for.
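For readers unfamiliar with the convention, the concatenation is simply name, dash, vendor UUID. A small Python sketch follows; the helper names are made up for illustration, LoaderInfo is used as the example variable, and /sys/firmware/efi/efivars is where Linux efivarfs is normally mounted.

```python
LOADER_VENDOR_UUID = "4a67b082-0a4c-41cf-b6c7-440b29bb8c4f"


def full_efi_variable_name(name: str) -> str:
    """Join a short variable name with the vendor UUID."""
    return f"{name}-{LOADER_VENDOR_UUID}"


def efivarfs_path(name: str) -> str:
    """Path under which efivarfs exposes the variable on Linux."""
    return "/sys/firmware/efi/efivars/" + full_efi_variable_name(name)
```

The full form is the string that shows up in efivarfs listings, which is why spelling it out once helps with grepping.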
While at it, use sembreaks in the document. This makes subsequent
changes much easier to review. (It also shows that some sentences
are rather long and thus hard to understand.)
Haiyue Wang [Wed, 17 Dec 2025 08:02:31 +0000 (16:02 +0800)]
meson: fix BPF build warnings due to MS extensions
Fix BPF program build warnings on Linux-6.19.0-rc1, see [1] for more detail:
A). clang-bpf
[781/2458] Generating src/network/bpf/sysctl-monitor/sysctl-monitor.bpf.unstripped.o with a custom command
In file included from ../src/network/bpf/sysctl-monitor/sysctl-monitor.bpf.c:3:
./vmlinux.h:60263:3: warning: declaration does not declare anything [-Wmissing-declarations]
60263 | struct ns_tree;
| ^~~~~~~~~~~~~~
./vmlinux.h:80251:2: warning: declaration does not declare anything [-Wmissing-declarations]
80251 | struct __fs_path;
| ^~~~~~~~~~~~~~~~
./vmlinux.h:96184:2: warning: declaration does not declare anything [-Wmissing-declarations]
96184 | struct freelist_tid;
| ^~~~~~~~~~~~~~~~~~~
./vmlinux.h:114441:2: warning: declaration does not declare anything [-Wmissing-declarations]
114441 | struct renamedata;
| ^~~~~~~~~~~~~~~~~
./vmlinux.h:118480:2: warning: declaration does not declare anything [-Wmissing-declarations]
118480 | union pipe_index;
| ^~~~~~~~~~~~~~~~
./vmlinux.h:130452:4: warning: declaration does not declare anything [-Wmissing-declarations]
130452 | struct freelist_counters;
| ^~~~~~~~~~~~~~~~~~~~~~~~
6 warnings generated.
B). gcc-bpf
meson setup -Dbpf-compiler=gcc build
[1040/2458] Generating src/network/bpf/sysctl-monitor/sysctl-monitor.bpf.unstripped.o with a custom command
In file included from ../src/network/bpf/sysctl-monitor/sysctl-monitor.bpf.c:3:
./vmlinux.h:60263:31: warning: declaration does not declare anything
60263 | struct ns_tree;
| ^
./vmlinux.h:80251:25: warning: declaration does not declare anything
80251 | struct __fs_path;
| ^
./vmlinux.h:96184:28: warning: declaration does not declare anything
96184 | struct freelist_tid;
| ^
./vmlinux.h:114441:26: warning: declaration does not declare anything
114441 | struct renamedata;
| ^
./vmlinux.h:118480:25: warning: declaration does not declare anything
118480 | union pipe_index;
| ^
./vmlinux.h:130452:49: warning: declaration does not declare anything
130452 | struct freelist_counters;
| ^
[1] https://git.kernel.org/torvalds/c/639f58a0f480
"bpftool: Fix build warnings due to MS extensions"
Andrew Halaney [Mon, 15 Dec 2025 21:47:17 +0000 (15:47 -0600)]
man/systemd.exec: Make EnvironmentFile error conditions more explicit
It is not entirely clear what happens when EnvironmentFile fails in the
prior wording. With the new wording it should now be clear that if it
fails to process the file the service will fail, and if it is prefixed
with "-" all errors are silently ignored.
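The two behaviours can be modelled with a short Python sketch. This is an illustration of the documented semantics only, not systemd's loader: the function name load_environment_file() is hypothetical, and the KEY=VALUE parsing is heavily simplified (no quoting or line continuation handling).

```python
def load_environment_file(spec: str) -> dict:
    """Load KEY=VALUE pairs; a leading '-' makes failures non-fatal."""
    optional = spec.startswith("-")
    path = spec[1:] if optional else spec
    try:
        with open(path, encoding="utf-8") as f:
            lines = f.read().splitlines()
    except OSError:
        if optional:
            return {}  # "-" prefix: errors are silently ignored
        raise  # no prefix: the failure propagates (the service would fail)
    env = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep:
            env[key.strip()] = value.strip()
    return env
```

So `load_environment_file("-/etc/default/foo")` yields an empty mapping when the file is missing, while the same path without the prefix raises.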
Signed-off-by: Andrew Halaney <ahalaney@netflix.com>
Luca Boccassi [Tue, 16 Dec 2025 21:44:57 +0000 (21:44 +0000)]
test: fix race condition in TEST-80-NOTIFYACCESS
In some cases systemd is faster to send the SIGHUP
than the script is to start the 'sleep' and background
it, so it never gets interrupted later and the test
is left hanging waiting for it.
[ 5028.410588] systemd[1]: Starting reload-timeout.service...
[ 5028.429544] reload-timeout.sh[165]: + set -o pipefail
[ 5028.429544] reload-timeout.sh[165]: + COUNTER=0
[ 5028.429841] reload-timeout.sh[165]: + trap sighup_handler SIGHUP
[ 5028.429841] reload-timeout.sh[165]: + export SYSTEMD_LOG_LEVEL=debug
[ 5028.429841] reload-timeout.sh[165]: + SYSTEMD_LOG_LEVEL=debug
[ 5028.429841] reload-timeout.sh[165]: + systemd-notify --ready
[ 5028.432891] systemd[1]: reload-timeout.service: Got notification message from PID 165: READY=1
[ 5028.432908] systemd[1]: reload-timeout.service: Changed start -> running
[ 5028.432983] systemd[1]: reload-timeout.service: Job 409 reload-timeout.service/start finished, result=done
[ 5028.432997] systemd[1]: Started reload-timeout.service.
[ 5028.433941] TEST-80-NOTIFYACCESS.sh[164]: Job for reload-timeout.service finished.
[ 5028.433941] TEST-80-NOTIFYACCESS.sh[164]: Got result done/Success for job reload-timeout.service.
[ 5028.433941] TEST-80-NOTIFYACCESS.sh[164]: Bus n/a: changing state RUNNING → CLOSED
[ 5028.436949] TEST-80-NOTIFYACCESS.sh[99]: + systemctl reload --no-block reload-timeout.service
[ 5028.444523] TEST-80-NOTIFYACCESS.sh[167]: Bus n/a: changing state UNSET → OPENING
[ 5028.444523] TEST-80-NOTIFYACCESS.sh[167]: sd-bus: starting bus by connecting to /run/systemd/private...
[ 5028.444523] TEST-80-NOTIFYACCESS.sh[167]: Bus n/a: changing state OPENING → AUTHENTICATING
[ 5028.444523] TEST-80-NOTIFYACCESS.sh[167]: Executing dbus call org.freedesktop.systemd1.Manager ReloadUnit(reload-timeout.service, replace)
[ 5028.444523] TEST-80-NOTIFYACCESS.sh[167]: Bus n/a: changing state AUTHENTICATING → RUNNING
[ 5028.445202] reload-timeout.sh[165]: + wait_for_signal
[ 5028.445586] reload-timeout.sh[169]: + sleep infinity
[ 5028.447285] reload-timeout.sh[165]: ++ sighup_handler
[ 5028.447285] reload-timeout.sh[165]: ++ echo hup1
[ 5028.444886] systemd[1]: reload-timeout.service: Trying to enqueue job reload-timeout.service/reload/replace
[ 5028.445228] systemd[1]: reload-timeout.service: Installed new job reload-timeout.service/reload as 491
[ 5028.445240] systemd[1]: reload-timeout.service: Enqueued job reload-timeout.service/reload as 491
[ 5028.446601] systemd[1]: reload-timeout.service: Service has no extensions to reload.
[ 5028.446799] systemd[1]: reload-timeout.service: Changed running -> reload-signal
[ 5028.446881] systemd[1]: Reloading reload-timeout.service...
[ 5028.451343] TEST-80-NOTIFYACCESS.sh[167]: Bus n/a: changing state RUNNING → CLOSED
[ 5028.452421] TEST-80-NOTIFYACCESS.sh[99]: + timeout 10 bash -c 'until [[ $(systemctl show reload-timeout.service -P SubState) == "reload-signal" ]]; do sleep .5; done'
[ 5028.460676] TEST-80-NOTIFYACCESS.sh[172]: Bus n/a: changing state UNSET → OPENING
[ 5028.460676] TEST-80-NOTIFYACCESS.sh[172]: sd-bus: starting bus by connecting to /run/systemd/private...
[ 5028.462029] TEST-80-NOTIFYACCESS.sh[172]: Bus n/a: changing state OPENING → AUTHENTICATING
[ 5028.462029] TEST-80-NOTIFYACCESS.sh[172]: Showing one /org/freedesktop/systemd1/unit/reload_2dtimeout_2eservice
[ 5028.463759] TEST-80-NOTIFYACCESS.sh[172]: Bus n/a: changing state AUTHENTICATING → RUNNING
[ 5028.470322] TEST-80-NOTIFYACCESS.sh[172]: Bus n/a: changing state RUNNING → CLOSED
[ 5028.472991] TEST-80-NOTIFYACCESS.sh[99]: + sync_in hup1
[ 5028.472991] TEST-80-NOTIFYACCESS.sh[99]: + read -r x
[ 5028.473839] reload-timeout.sh[165]: + wait 169
[ 5028.473996] TEST-80-NOTIFYACCESS.sh[99]: + test hup1 = hup1
[ 5028.473996] TEST-80-NOTIFYACCESS.sh[99]: + timeout 10 bash -c 'until [[ $(systemctl show reload-timeout.service -P SubState) == "reload-notify" ]]; do sleep .5; done'
[ 5038.477383] systemd[1]: TEST-80-NOTIFYACCESS.service: Failed with result 'exit-code'.
(note how the 'wait' is long after SIGHUP has been processed already)