Daan De Meyer [Thu, 14 May 2026 19:20:02 +0000 (19:20 +0000)]
libc: Make sure C23 versions of strtol(), sscanf() are not used
When _GNU_SOURCE is defined, glibc will always use the C23 versions
of strtol(), sscanf() and friends if available (introduced after
glibc 2.34). This means that any binaries built with headers
from a newer glibc won't load on glibc < 2.38. To work around this,
redefine the appropriate constants to zero to make sure the C99
versions are used instead.
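A minimal sketch of the workaround described above. The macro names are an assumption: they are the glibc-internal switches that, to my knowledge, select the __isoc23_* variants (spelled C2X in glibc 2.38/2.39 and C23 from 2.40 on), and they are not named in the commit message. parse_long() and parse_two_ints() are hypothetical helpers used only to exercise the calls.

```c
/* Assumption: these glibc-internal macros gate the C23 strtol()/sscanf()
 * variants. They must be overridden before <stdlib.h>/<stdio.h> are seen. */
#include <features.h>

#undef __GLIBC_USE_C2X_STRTOL
#define __GLIBC_USE_C2X_STRTOL 0
#undef __GLIBC_USE_C23_STRTOL
#define __GLIBC_USE_C23_STRTOL 0

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: with the override in place this should bind to the
 * plain strtol symbol rather than __isoc23_strtol. */
long parse_long(const char *s) {
        assert(s);
        return strtol(s, NULL, 10);
}

/* Hypothetical helper: likewise expected to bind to __isoc99_sscanf. */
int parse_two_ints(const char *s, int *a, int *b) {
        assert(s && a && b);
        return sscanf(s, "%d %d", a, b);
}
```

Whether the override took effect can be checked on the resulting binary by confirming that no __isoc23_* symbols appear in its undefined-symbol list.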
Daan De Meyer [Thu, 14 May 2026 19:20:02 +0000 (19:20 +0000)]
libc: Use dlsym() from a constructor instead of weak symbols
Weak symbols still introduce a version requirement on a newer libc.
Resolve each libc symbol via dlsym(RTLD_DEFAULT) from a per-shim
constructor and cache the result in a file-scope static instead. This
avoids the version requirement, keeps the call path free of atomics
(constructors run single-threaded before main() and before any signal
handler can fire), and keeps dlsym() out of contexts where it is not
async-signal-safe.
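The pattern described above can be sketched roughly as follows. This is an illustration, not the actual shim code: shim_getpid(), shim_missing() and the symbol name "hypothetical_libc_symbol" are made up for the example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* RTLD_DEFAULT is a GNU extension in <dlfcn.h> */
#endif
#include <assert.h>
#include <dlfcn.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback in case <dlfcn.h> was already included without _GNU_SOURCE. */
#ifndef RTLD_DEFAULT
#define RTLD_DEFAULT ((void *) 0)
#endif

/* File-scope caches: written once by the constructor, only read afterwards,
 * so the call path needs neither atomics nor dlsym(). */
static pid_t (*sym_getpid)(void);
static int (*sym_missing)(void);

__attribute__((constructor))
static void resolve_libc_symbols(void) {
        /* The symbol is named by a string, so the linker records no
         * versioned reference to it. */
        sym_getpid = (pid_t (*)(void)) dlsym(RTLD_DEFAULT, "getpid");
        sym_missing = (int (*)(void)) dlsym(RTLD_DEFAULT, "hypothetical_libc_symbol");
}

/* Hypothetical shim for a symbol that is expected to exist. */
pid_t shim_getpid(void) {
        assert(sym_getpid);
        return sym_getpid();
}

/* Hypothetical shim for a symbol the running libc may lack. */
int shim_missing(void) {
        return sym_missing ? sym_missing() : -ENOSYS;
}
```

On glibc older than 2.34 this would additionally need -ldl at link time.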
Daan De Meyer [Fri, 15 May 2026 09:54:53 +0000 (09:54 +0000)]
meson: drop libdl, threads, and librt dependencies
Our baseline glibc is 2.34, which merged libdl, libpthread (the
dependency('threads') target), and librt into libc. Empty .so/.a stubs
remain for backward compatibility with old binaries, but new builds
resolve dl_*, pthread_*, mq_*, timer_*, etc. directly from libc.
On musl the same libraries are likewise empty stubs.
Drop the libdl, threads, and librt entries from every meson.build, and
remove the now-stale 'Libs.private: -lrt -pthread' from libudev.pc.in
since both flags resolve to empty link-time stubs on glibc 2.34+ and
musl.
Verified with readelf -d that libsystemd.so, libudev.so, and systemd no
longer carry DT_NEEDED entries for libdl/libpthread/librt.
Daan De Meyer [Fri, 15 May 2026 18:33:43 +0000 (18:33 +0000)]
locale-util: dlopen() libintl instead of linking against it
dgettext() lives in libc on glibc and in libintl.so.8 on musl with
gettext. Resolve it via dlsym() so neither configuration produces a
hard link-time dependency on libintl: try libintl.so.8 first and fall
back to RTLD_DEFAULT (which finds dgettext in libc on glibc).
The _() macro now expands to a runtime check that returns the
untranslated string if dlopen_libintl() has not run successfully, so
callers don't have to gate every translatable message on a runtime
check. pam_systemd_home — currently the only consumer of _() — calls
dlopen_libintl() best-effort from each PAM entry point.
The meson find_library('intl') dance is replaced with a has_header()
check; the only thing we need at build time is the prototype.
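A rough sketch of the lookup order and the _() fallback described above. Only the names dlopen_libintl(), dgettext and libintl.so.8 come from the commit message; the signature of dlopen_libintl(), the translate() helper and the "demo" text domain are assumptions made for the example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for RTLD_DEFAULT */
#endif
#include <assert.h>
#include <dlfcn.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Fallback in case <dlfcn.h> was already included without _GNU_SOURCE. */
#ifndef RTLD_DEFAULT
#define RTLD_DEFAULT ((void *) 0)
#endif

#define GETTEXT_DOMAIN "demo" /* hypothetical text domain */

static char *(*sym_dgettext)(const char *domain, const char *msgid);

/* Try libintl.so.8 first (musl with gettext), then whatever is already
 * loaded (libc on glibc). The handle is intentionally never closed. */
int dlopen_libintl(void) {
        if (sym_dgettext)
                return 0;

        void *dl = dlopen("libintl.so.8", RTLD_NOW);
        sym_dgettext = (char *(*)(const char *, const char *))
                dlsym(dl ? dl : RTLD_DEFAULT, "dgettext");
        return sym_dgettext ? 0 : -ENOENT;
}

/* Returns the untranslated string until dlopen_libintl() has succeeded. */
static inline const char *translate(const char *msgid) {
        assert(msgid);
        return sym_dgettext ? sym_dgettext(GETTEXT_DOMAIN, msgid) : msgid;
}

#define _(s) translate(s)
```

Callers can use _("...") unconditionally; calling dlopen_libintl() best-effort beforehand only changes whether a translation is attempted.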
Daan De Meyer [Fri, 15 May 2026 11:06:21 +0000 (11:06 +0000)]
tree-wide: Use our own macros instead of fabs()/fmax()/fmin()
To make this work, ABS() is made generic so it also works on
floats and doubles.
While at it, fold the __ABS_INTEGER indirection and the
assert_cc(sizeof(long long) == sizeof(intmax_t)) away. The previous
form switched between __builtin_llabs (clang) and __builtin_imaxabs
(gcc), with the assert keeping the two paths behaviorally identical
on every platform we build for. imaxabs was originally chosen because
intmax_t is conceptually the widest signed integer type the platform
exposes, but the _Generic ABS already casts to (long long) before the
call, so the extra width imaxabs could in theory carry was being
narrowed away immediately anyway. With both paths collapsed to
__builtin_llabs((long long) (a)), the size relationship between
long long and intmax_t is no longer relevant.
Also add explicit unsigned long long / unsigned long / unsigned int
cases that pass the argument through unchanged. The previous default
branch cast unsigned values to (long long); for values above LLONG_MAX
this reinterprets them as negative, and __builtin_llabs(LLONG_MIN) is
UB. Unsigned values are already non-negative, so passing them through
is both correct and avoids the narrowing. Smaller unsigned types
(unsigned char, unsigned short) still go through the default branch
but promote to int first and fit in long long losslessly.
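For illustration, a generic ABS() along the lines described above might look like this. It is a sketch reconstructed from the description, not the macro from the tree, and it relies on the GCC/clang __builtin_* functions.

```c
#include <assert.h>
#include <limits.h>

/* _Generic type-checks every branch but evaluates only the selected one,
 * so each branch casts explicitly to stay valid for any arithmetic type. */
#define ABS(a)                                                          \
        _Generic((a),                                                   \
                 float:              __builtin_fabsf((float) (a)),      \
                 double:             __builtin_fabs((double) (a)),      \
                 long double:        __builtin_fabsl((long double) (a)),\
                 /* Unsigned values are already non-negative. */        \
                 unsigned int:       (a),                               \
                 unsigned long:      (a),                               \
                 unsigned long long: (a),                               \
                 /* Signed integers and the smaller unsigned types. */  \
                 default:            __builtin_llabs((long long) (a)))
```

With this shape ABS(-3) yields a long long, ABS(-2.5) a double, and ABS(ULLONG_MAX) is passed through unchanged instead of being narrowed to (long long).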
Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>
In hsv_to_rgb, restructure the conversion around the sector index
k = (int)(h/60) and fractional offset f = h/60 - k. The auxiliary
x value becomes c * (k & 1 ? 1.0 - f : f) and the six branches turn
into a switch on k. This drops the two xfmod() calls that were doing
the modulo work, in exchange for a single assert(h >= 0 && h < 360) —
all in-tree callers satisfy this and never relied on the wrap.
In rgb_to_hsv, the two xfmod() calls were no-ops (their arguments
were always within the divisor's magnitude). The trailing
xfmod(*ret_h, 360) appeared to be wrapping negative hues from the
r-max branch back into [0, 360), but fmod is sign-preserving so it
never did. Drop the no-ops and add an explicit +360 wrap so magenta
(1, 0, 1) now yields h ≈ 300 instead of -60.
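The restructuring can be sketched as below. This is a standalone illustration rather than the in-tree code: the demo_ prefix, the double-typed channels and the [0, 1] range for s, v, r, g and b are assumptions chosen to keep the example short.

```c
#include <assert.h>

/* Sector-based HSV to RGB; h in [0, 360), s and v in [0, 1]. */
void demo_hsv_to_rgb(double h, double s, double v,
                     double *r, double *g, double *b) {
        assert(h >= 0 && h < 360);

        double c = v * s;                        /* chroma */
        int k = (int) (h / 60);                  /* sector index, 0..5 */
        double f = h / 60 - k;                   /* offset within sector */
        double x = c * ((k & 1) ? 1.0 - f : f);  /* ramps up in even sectors */
        double m = v - c;

        switch (k) {
        case 0:  *r = c; *g = x; *b = 0; break;
        case 1:  *r = x; *g = c; *b = 0; break;
        case 2:  *r = 0; *g = c; *b = x; break;
        case 3:  *r = 0; *g = x; *b = c; break;
        case 4:  *r = x; *g = 0; *b = c; break;
        default: *r = c; *g = 0; *b = x; break;
        }
        *r += m; *g += m; *b += m;
}

/* RGB to HSV with an explicit wrap of negative hues into [0, 360). */
void demo_rgb_to_hsv(double r, double g, double b,
                     double *h, double *s, double *v) {
        double max = r > g ? (r > b ? r : b) : (g > b ? g : b);
        double min = r < g ? (r < b ? r : b) : (g < b ? g : b);
        double d = max - min;

        *v = max;
        *s = max > 0 ? d / max : 0;

        if (d == 0)
                *h = 0;
        else if (max == r)
                *h = 60 * ((g - b) / d);      /* may be negative */
        else if (max == g)
                *h = 60 * ((b - r) / d + 2);
        else
                *h = 60 * ((r - g) / d + 4);

        if (*h < 0)
                *h += 360;
}
```

In this sketch magenta takes the r-max branch, produces -60 and is wrapped to 300 by the final addition.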
Extend the tests to cover all six primary/secondary colors at sector
boundaries, all six sector midpoints (to catch any future inversion
of the ramp direction), the h-near-360 edge of the last sector, and
the rgb_to_hsv negative-wrap path via magenta. Switch the new and
existing integer-channel checks to ASSERT_EQ from tests.h; the
double-typed h/s/v range checks stay on ASSERT_TRUE since the
ASSERT_* comparison macros only support integer types.
Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>
Daan De Meyer [Fri, 15 May 2026 18:24:11 +0000 (18:24 +0000)]
test-bus-marshal: dlopen() glib and libdbus instead of linking directly
The test only uses 9 symbols (5 from glib, 4 from libdbus) for its
interop checks; dlopen them at runtime so the binary no longer carries a
hard link-time dependency on either library. Headers are still pulled in
through the *_cflags partial dependencies for the type declarations.
While we're at it, drop the compat glue for glib 2.36 which is long obsolete
at this point.
Daan De Meyer [Fri, 15 May 2026 14:54:04 +0000 (14:54 +0000)]
tree-wide: dlopen libpam in pam plugins
Same reasoning as for cryptsetup tokens. It means we can include
the pam plugins in the main systemd package without the package
manager introducing a dependency on libpam. It also makes things
more consistent and makes writing the upcoming linking test script
a lot simpler.
At the same time, we get rid of the libpam_misc dependency as the
one symbol we were using from it is trivial to reimplement ourselves.
Daan De Meyer [Fri, 15 May 2026 12:16:01 +0000 (12:16 +0000)]
cryptsetup: dlopen libcryptsetup in tokens
This avoids having to subpackage the tokens separately. If they link directly
against libcryptsetup, the package manager will automatically add a dependency on
libcryptsetup to the package containing the tokens. With this change, the tokens
can ship in the main systemd package without necessarily pulling in libcryptsetup.
It also makes things more consistent. Once we also do the same for pam, any direct
linking will be limited to just libc, which for example simplifies writing tests for
ensuring we don't link unnecessarily as we don't have to add exceptions for the
cryptsetup tokens.
This actually drops the dependency on cryptsetup-libs for the fedora/centos/opensuse
systemd-udev package so install it explicitly in the initrd now to keep the tests
working.
Daan De Meyer [Thu, 14 May 2026 17:13:06 +0000 (17:13 +0000)]
bpf-util: rename from bpf-dlopen, unify version-specific symbol handling
Renames src/shared/bpf-dlopen.{c,h} to src/shared/bpf-util.{c,h} and
folds the former src/shared/bpf-compat.h (struct forward decl and
compat_bpf_map_create() helper) into the new header.
Aligns dlopen_bpf() with the standard wrapper pattern: drops the
manual dlopen_safe()/dlsym_many_or_warn()/TAKE_PTR(dl) plumbing and
the bespoke 'cached' int in favor of dlopen_many_sym_or_warn() inside
a FOREACH_STRING() soname-fallback loop.
Unifies declaration of the version-specific symbols (bpf_create_map,
bpf_map_create, bpf_object__next_map, bpf_token_create) into a single
DISABLE_WARNING_REDUNDANT_DECLS block in the header, and alphabetically
merges the DLSYM_PROTOTYPE list. DLSYM_OPTIONAL is used to load each
one — call sites already handle NULL (compat_bpf_map_create() and the
sym_bpf_object__next_map guard in userns-restrict.c). bpf_token_create
additionally defaults to a missing_bpf_token_create() stub returning
-ENOSYS, so callers can branch on the errno instead of NULL-checking
the pointer.
Updates test-bpf-token to match: drops the compile-time
LIBBPF_MAJOR_VERSION ≥ 1.5 gate and the direct <bpf/bpf.h> include in
favor of dlopen_bpf() + sym_bpf_token_create(), and treats -ENOSYS as
the test-skip path (covering both 'libbpf too old' and 'kernel lacks
BPF_TOKEN_CREATE support').
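The stub-default idea for bpf_token_create can be illustrated as follows. The names sym_bpf_token_create and missing_bpf_token_create() come from the commit message; dlopen_bpf_sketch(), the soname list and the error code returned when no library is found are assumptions, and the real code goes through the DLSYM_OPTIONAL helpers instead of raw dlsym().

```c
#include <assert.h>
#include <dlfcn.h>
#include <errno.h>
#include <stddef.h>

/* Forward declaration only; the real definition lives in <bpf/bpf.h>. */
struct bpf_token_create_opts;

static int missing_bpf_token_create(int bpffs_fd, struct bpf_token_create_opts *opts) {
        (void) bpffs_fd;
        (void) opts;
        return -ENOSYS;
}

/* Never NULL: points at the stub until a real implementation is found. */
int (*sym_bpf_token_create)(int, struct bpf_token_create_opts *) = missing_bpf_token_create;

/* Hypothetical loader: try each soname in turn, keep the stub if the
 * library is present but too old to export the symbol. */
int dlopen_bpf_sketch(void) {
        static const char *const sonames[] = { "libbpf.so.1", "libbpf.so.0" };

        for (size_t i = 0; i < sizeof(sonames) / sizeof(sonames[0]); i++) {
                void *dl = dlopen(sonames[i], RTLD_NOW);
                if (!dl)
                        continue;

                void *p = dlsym(dl, "bpf_token_create");
                if (p)
                        sym_bpf_token_create = (int (*)(int, struct bpf_token_create_opts *)) p;
                return 0;
        }

        return -EOPNOTSUPP;
}
```

A caller can then invoke sym_bpf_token_create() unconditionally and treat -ENOSYS as "not available", without NULL-checking the pointer first.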
Yu Watanabe [Sat, 16 May 2026 18:20:45 +0000 (03:20 +0900)]
ci/alpine: do not install util-linux-login
For some reason, after util-linux was bumped from 2.41.4-r0 to 2.42-r0,
the 'su' command from util-linux-login seems to not correctly run commands in
https://github.com/jirutka/setup-alpine/blob/v1.4.1/alpine.sh
and causes the following spurious failure:
```
2026-05-15T21:19:15.6539432Z ##[group]Set up user runner
2026-05-15T21:19:15.6981963Z /bin/sh: line 0: ��: not found
2026-05-15T21:19:15.6982503Z /bin/sh: line 1: ␡ELF␂␁␁␃: not found
2026-05-15T21:19:15.6985788Z /bin/sh: line 10: ␒␐␆␒B␈␒�␄␒y␄␒�␁␒␞␇␒:␁␒�␃␒�␄␒@␁␒9␈␒?␆␒␚␈␒x: not found
2026-05-15T21:19:15.7010731Z /bin/sh: line 33: can't open ␂␒-␂␒�: no such file
2026-05-15T21:19:15.7016026Z /bin/sh: line 33: syntax error: unexpected word (expecting ")")
2026-05-15T21:19:15.7049583Z
2026-05-15T21:19:15.7050199Z ␛[1;31mError occurred at line 338:␛[0m
2026-05-15T21:19:15.7050830Z 335 | echo 'permit nopass keepenv $SUDO_USER' | tee /etc/doas.d/root.conf
2026-05-15T21:19:15.7051287Z 336 | fi
2026-05-15T21:19:15.7051549Z 337 | SHELL
2026-05-15T21:19:15.7052039Z ␛[1;31m> 338 | abin/"$INPUT_SHELL_NAME" --root /.setup.sh␛[0m
2026-05-15T21:19:15.7052506Z 339 |
2026-05-15T21:19:15.7052796Z 340 | rm .setup.sh
2026-05-15T21:19:15.7053172Z 341 | endgroup
2026-05-15T21:19:15.7096322Z ##[error]Error occurred at line 338: abin/"$INPUT_SHELL_NAME" --root /.setup.sh (see the job log for more information)
2026-05-15T21:19:15.7101400Z ##[error]Process completed with exit code 1.
```
Let's not install the package. It seems no command provided by the
package is used.
test-verbs: dispatch via _dispatch_verb_with_args() directly
Drops the global-optind dependency from the test helper. Verb fixtures
stay inline as static const Verb[] — the section-based VERB() macro
would force unique verb names across the three test cases, which they
deliberately share to exercise overlap.
Co-developed-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Place VERB() declarations above each dispatch function and use
verbs_get_help_table() in help(). run() switches to
dispatch_verb_with_args(); the argv_looks_like_help() shortcut is
preserved since this is an internal tool with no proper option parsing.
Co-developed-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There is no --help implemented, so neither verb gets a help string.
We should probably add --help + --version, and a proper description
of the program, but I'm leaving that for later.
Co-developed-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Place VERB() declarations above each dispatch function and use
verbs_get_help_table() in help() so the command listing stays in sync.
run() switches to dispatch_verb_with_args(); the argv_looks_like_help()
shortcut is preserved since this is an internal tool with no proper
option parsing.
Co-developed-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Place VERB() declarations directly above each dispatch function and use
verbs_get_help_table() in help() so the command listing stays in sync.
run() switches to dispatch_verb_with_args(); the argv_looks_like_help()
shortcut is preserved since this is an internal tool with no proper
option parsing.
Co-developed-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
storagectl: convert run_as_mount_helper to OPTION macros
This is the util-linux mount-helper interface (mount.storage), so all
options stay hidden via help=NULL — they are not user-facing. The
namespace "mount.storage" is distinct from the storagectl namespace
used for the user-facing CLI.
Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>
This bothered me for a while, but I didn't think too much about it and just
copied the existing usage pattern. But it really doesn't make sense. We expect
the compiler to align the section properly. But if it didn't align it, applying
alignment after the fact would just cause our pointer to point to the middle
of the structure. That'd be even worse than a misaligned pointer.
Similarly, when doing pointer arithmetic, p++ should really result in a value
with the appropriate alignment. This is the basic principle of C pointer
addition. So we really shouldn't try to adjust the pointer ourselves. At most,
we can assert that it is indeed aligned in tests.
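A small illustration of the point about pointer arithmetic. The Entry type, the table and both functions are hypothetical, and a plain array stands in for the linker-provided section bounds.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry type. */
typedef struct Entry {
        const char *name;
        uint64_t value;
} Entry;

static const Entry entries[] = {
        { "a", 1 },
        { "b", 2 },
        { "c", 3 },
};

/* Walk [start, end) with plain p++. Each step advances by sizeof(Entry),
 * so no manual realignment is applied; alignment is only asserted. */
size_t count_entries(const Entry *start, const Entry *end) {
        size_t n = 0;

        for (const Entry *p = start; p < end; p++) {
                assert((uintptr_t) p % alignof(Entry) == 0);
                n++;
        }

        return n;
}

size_t count_demo_entries(void) {
        return count_entries(entries, entries + sizeof(entries) / sizeof(entries[0]));
}
```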
Yu Watanabe [Sat, 16 May 2026 15:33:43 +0000 (00:33 +0900)]
sd-dhcp-client: use new message parser (#42123)
In 26b7c5ff3b944aa3a16d4e859e9c84ce7e968a5a, we introduced a new parser
for received DHCP messages, but it was not used at that time. This PR
replaces the legacy parser with the new one, and makes the fuzzer also
use the new parser.
For the shell verb we want switches specified after the program name to
be passed to the program to execute, not processed by us. Mirror the
approach in 'userdbctl ssh-authorized-keys': start with
OPTION_PARSER_RETURN_POSITIONAL_ARGS, then later switch to
STOP_AT_FIRST_NONOPTION for "shell" or NORMAL otherwise.
VERB declarations are placed directly above each function; functions
that dispatch multiple verb names get stacked VERB() declarations.
chainload_importctl() now takes the args strv instead of relying on the
global optind.
--help output is mostly the same.
--no-pager/--no-legend/--no-ask-password/-q/--quiet are now at the end.
bind-volume/unbind-volume are documented.
Also, if the fuzzing engine provides a valid message, then try to build
a JSON variant and a UDP payload from the parsed message. We will drop
dhcp_lease_save() and dhcp_lease_load(), hence the tests for them are
dropped.
Currently translated at 100.0% (266 of 266 strings)
Co-authored-by: Fco. Javier F. Serrador <fserrador@gmail.com>
Translate-URL: https://translate.fedoraproject.org/projects/systemd/main/es/
Translation: systemd/main
Yu Watanabe [Tue, 31 Mar 2026 22:56:09 +0000 (07:56 +0900)]
networkctl: load information about DHCP client from varlink
As of the previous commit, networkd exposes the received DHCP message
in the Describe() DBus/Varlink method. Let's make networkctl deserialize
the DHCP message and use it where necessary.
This internally uses the sd_dhcp_message object, and replaces the functions
for creating and sending DHCP messages.
By using sd_dhcp_message internally, we can now correctly send long
(> 255 bytes) option data that cannot fit in a single DHCP option TLV.
This also fixes the value in DHCP option 57 (Maximum Message Size).
Previously the IP and UDP header sizes were subtracted from the interface
MTU, but they should not be.
Except for the above, this should not change any effective behaviors.
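To illustrate the long-option case: a single DHCP option TLV has a one-byte length field, so data longer than 255 bytes has to be split across several options with the same code (the scheme described in RFC 3396). The helper below is a hypothetical standalone sketch of that splitting, not the sd_dhcp_message implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Appends option 'code' with 'len' bytes of data to 'buf', splitting it
 * into consecutive TLVs of at most 255 data bytes each. Returns the number
 * of bytes written, or 0 if 'cap' is too small (nothing is written then). */
size_t dhcp_append_long_option(uint8_t *buf, size_t cap, uint8_t code,
                               const uint8_t *data, size_t len) {
        assert(buf && (data || len == 0));

        /* An empty option still needs one TLV header. */
        size_t n_chunks = len == 0 ? 1 : (len + 254) / 255;
        size_t needed = len + 2 * n_chunks;
        if (cap < needed)
                return 0;

        size_t off = 0;
        do {
                size_t chunk = len > 255 ? 255 : len;

                buf[off++] = code;
                buf[off++] = (uint8_t) chunk;
                if (chunk > 0) {
                        memcpy(buf + off, data, chunk);
                        data += chunk;
                }
                off += chunk;
                len -= chunk;
        } while (len > 0);

        return off;
}
```

For 300 bytes of data this emits one TLV carrying 255 bytes followed by one carrying 45, for 304 bytes in total.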
Luca Boccassi [Fri, 15 May 2026 17:19:41 +0000 (18:19 +0100)]
test-network: retry networkctl status in wait_operstate()
networkctl status may transiently fail right after start_networkd() because networkd has not yet picked up the freshly-created link from the kernel. The retry loop in wait_operstate() did not catch the resulting subprocess.CalledProcessError, so the test aborted on the first attempt instead of retrying for the configured timeout.
Observed in TEST-85-NETWORK-NetworkdBridgeTests, subtest test_bridge_configure_without_carrier[no-slave]:
Daan De Meyer [Fri, 15 May 2026 18:51:30 +0000 (18:51 +0000)]
meson: drop vestigial libgpg-error dependency
libgpg-error was added in 2017 (commit 76c8741060, Michael Biebl) to
gate HAVE_GCRYPT on its presence because src/resolve referenced
libgpg-error directly at the time. That usage is long gone — no source
file references any gpg-error API today — so the dependency only served
to fail HAVE_GCRYPT detection when gpg-error-dev wasn't installed.
libgcrypt's pkg-config Requires already pulls in the gpg-error headers
(via the transitive #include <gpg-error.h> in <gcrypt.h>), so dropping
the dep doesn't break compilation.
machinectl: reorder verb functions to match --help
The net diff is negative because some spurious whitespace and forward
declarations were dropped. One new forward declaration was added. (For
verb_poweroff_machine. The func could be moved, but I think it's better
to keep it adjacent to verb_reboot_machine which is very similar.)
Daan De Meyer [Fri, 15 May 2026 19:19:15 +0000 (21:19 +0200)]
nsresourced: detect and clean up registry entries for dead user namespaces (#42070)
The BPF kprobe that fires on user namespace destruction is the only thing
that triggers registry cleanup, so any time it doesn't run — ring buffer
overflow, kprobe missing, fdstore entry dropped outside our cleanup path
— a registry entry is left behind forever.
Stamp each registry entry with the kernel's unique namespace identifier
(NS_GET_ID, kernel ≥ 6.13) at allocation time. At manager startup, after
the existing fdstore→registry sweep, walk the registry and ask the kernel
to look each namespace up by id via open_by_handle_at() on nsfs; if the
lookup returns -ESTALE the namespace is gone and we release the entry.
Old entries written before this change carry no identifier and are left
alone.
Add a namespace_open_by_id() helper for the lookup. The kernel restricts
open_by_handle_at() on nsfs to processes in the initial user namespace,
collapsing both permission denials and dead namespaces onto -ESTALE; the
helper refuses early with -EPERM outside the initial user namespace
so callers can tell the two apart.
Daan De Meyer [Wed, 13 May 2026 10:54:02 +0000 (12:54 +0200)]
nsresourced: detect and clean up registry entries for dead user namespaces
The BPF kprobe that fires on user namespace destruction is the only thing
that triggers registry cleanup, so any time it doesn't run — ring buffer
overflow, kprobe missing, fdstore entry dropped outside our cleanup path
— a registry entry is left behind forever.
Stamp each registry entry with the kernel's unique namespace identifier
(NS_GET_ID, kernel ≥ 6.13) at allocation time. At manager startup, after
the existing fdstore→registry sweep, walk the registry and ask the kernel
to look each namespace up by id via open_by_handle_at() on nsfs; if the
lookup returns -ESTALE the namespace is gone and we release the entry.
Old entries written before this change carry no identifier and are left
alone.
Add a namespace_open_by_id() helper for the lookup. The kernel restricts
open_by_handle_at() on nsfs to processes in the initial user namespace,
collapsing both permission denials and dead namespaces onto -ESTALE; the
helper refuses early with -EHOSTDOWN outside the initial user namespace
so callers can tell the two apart.
Rewrite help() with help-util.h primitives + option_parser_get_help_table_group
for each User Record Properties section. The verbs[] table stays
unchanged for now; run() switches from dispatch_verb() (which depended
on the global optind) to _dispatch_verb_with_args() fed by
option_parser_get_args().
Explanations are improved for --birth-date[=DATE] (correct placement of
'['), --skel=, --shell= (short options listed). Some minor rewordings
for other options. The explanation for -E and -EE is split.
(OPTION_HELP_ENTRY_VERBATIM is used for -EE.)
Co-developed-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
homectl: reorder verb functions to match order in --help
Just a hand-crafted moving of blocks of code up and down, no other
changes. The net diff is -2 because add_signing_keys_from_credentials
forward declaration was dropped.
Luca Boccassi [Fri, 15 May 2026 15:07:48 +0000 (16:07 +0100)]
core: support FD Store propagation through manager instances, and preservation through kexec via LUO (#41683)
First of all, FD store propagation is enabled between manager instances,
and nspawn instances, via LISTEN_FDS and sd_notify. These can be nested
arbitrarily deeply all the way up to the system manager, and on restart
will be propagated all the way down to the origin. The FDs payload of an
nspawn container running as a user unit will be preserved, all the way
up to the system manager, and then down again.
The kernel Live Update Orchestrator (LUO) exposes /dev/liveupdate, which
lets userspace hand a set of "preservable" kernel objects to the new
kernel across a kexec-based reboot. For now it only supports memfds,
with more object types (virtio devices, etc.) expected to be added
later.
This is a natural fit for systemd's FD Store feature: services hand
memfds (containing serialized state or other service data) to PID 1 via
FDSTORE=1 sd_notify() messages, and get them back on their next start.
Today this works across service restarts, soft reboots and initrd→rootfs
transitions. With LUO, this series extends the same mechanism to work
across kexec too. The preservation of nested FD stores is thus now
extended across kexec as well.
All preservable fds are collected into a single LUO session named
"systemd". Each fd is uploaded with an index (token). Token 0 is
reserved for a "mapping" memfd, which carries a JSON object describing
how to dispatch the other tokens back to units on the next boot.
Unit names are used as the unit identifier, as they are stable across
daemon-reexec, switch-root and kexec. token refers to the LUO index
assigned to the object in the session.
On shutdown for MANAGER_KEXEC, just before manager_free(), systemd walks
all services and serializes their persistent fd store contents (fds +
FDNAMEs + cgroup paths) into a JSON memfd. The fds themselves are
gathered into an FDSet. The fdset and the serialization memfd are passed
to systemd-shutdown via the SYSTEMD_LUO_SERIALIZE_FD environment
variable providing the fd number, so the actual LUO session creation and
ioctls happen as the very last step before kexec.
On boot, manager_luo_restore_fd_stores() opens /dev/liveupdate, tries to
retrieve the "systemd" session, reads the mapping memfd, then for each
entry retrieves the fd from the session and attempts to attach it to the
matching unit's fd store.
Because the initrd-stage PID 1 runs before the real rootfs units are
loaded, fds whose target unit is not (yet) known are not dropped: they
are stashed in a new luo_held_fds hashmap keyed by cgroup path. They are
re-tried in two places: after deserialization, and from unit_load(), so
fds land in the correct fd store as soon as the owning unit is parsed,
allowing units to be plugged in at runtime.
Non-kexec shutdown paths are unaffected: if MANAGER_KEXEC is not the
final objective, no serialization file is produced and no LUO session is
ever created. Likewise if /dev/liveupdate does not exist, nothing
happens.
The LUO session creation is performed by systemd-shutdown, rather than
by PID 1, deliberately: it is the last point where we can be sure all
other processes have already been killed, so nothing else can race us
into creating (or worse, hijacking) the "systemd" session.
/dev/liveupdate is a singleton and session names are global. In
addition, any kernel-visible side effects of preserving objects (memory
pinning, etc.) are delayed until the absolute last moment, minimizing
the window in which they could affect the running system. There is no
behaviour change for shutdown paths other than kexec, or for kexec when
systemd didn't hand over a serialization fd (e.g. because no service had
any fds stored, or because LUO wasn't supported at serialization time).
Finally, since LUO sessions cannot be nested under other sessions,
third-party sessions need to be handled explicitly and held open in the
shutdown binary alongside our own internal session, to allow services to
create and preserve their own sessions. The requirement comes from VMMs
that wish to preserve VM state across kexec: some file descriptors (e.g.
KVM's vmfd from the KVM_CREATE_VM ioctl) cannot be transferred between
processes via SCM_RIGHTS, so they cannot be stashed in the FD Store
directly. Additionally, some file descriptors must be handled
all-or-nothing, again tied to KVM, where a VM and its associated devices
are one indivisible group.
LUO: add support for preserving third party sessions
LUO sessions cannot be nested under other sessions. This means we need
to handle them explicitly, and hold them open in the shutdown binary
like we do with our own internal session, to allow services to create
their own.
The requirement to support third party sessions comes from VMMs that
wish to preserve VM state across kexec, as some file descriptors
(KVM's vmfd from the KVM_CREATE_VM ioctl) cannot be transferred between
processes via SCM_RIGHTS, so they cannot be stashed in the FD Store
directly. Also, some file descriptors have to be handled all together or
not at all, again because of KVM and devices that are all part of the
same VM.
Luca Boccassi [Mon, 30 Mar 2026 23:29:19 +0000 (00:29 +0100)]
shutdown: prepare LUO session for FD Stores before kexec
Wires up the systemd-shutdown side of the kexec-via-LUO fd store preservation.
When rebooting via kexec, systemd builds a JSON description of the fd
stores of all loaded services and passes it to systemd-shutdown through
the SYSTEMD_LUO_SERIALIZE_FD environment variable. The FDs themselves
come in as part of the normal shutdown FDSet. systemd-shutdown's job is
then, at the very last moment before invoking the kexec syscall, to
move that state into a kernel LUO session so it survives the reboot.
Doing the LUO session creation here, rather than in PID 1, is
deliberate:
* It's the last point where we can be sure all other processes have
already been killed, so nothing else can race us into creating (or
worse, hijacking) the "systemd" session, as /dev/liveupdate is a
singleton and a session name is global.
* Any kernel-visible side effects of preserving objects (memory
pinning etc.) are delayed until the absolute last moment, minimizing
the window in which they could affect the running system
No behaviour change for shutdown paths other than kexec, or for kexec
when systemd didn't hand over a serialization fd (e.g. because no
service had any fds stored, or because LUO wasn't supported at
serialization time).
Luca Boccassi [Fri, 1 May 2026 13:25:11 +0000 (14:25 +0100)]
core: support FD Store preservation through kexec via LUO
The kernel Live Update Orchestrator (LUO) exposes /dev/liveupdate, which
allows userspace to hand a set of "preservable" kernel objects to the
new kernel across a kexec-based reboot. For now it only supports memfds,
with more object types (virtio devices, etc.) expected to be added later.
This is a natural fit for systemd's FD Store feature: services hand
memfds (containing serialized state or other service data) to PID 1 via
FDSTORE=1 sd_notify() messages, and get them back on their next start.
Today this works across service restarts, soft reboots and
initrd→rootfs transitions. With LUO we can extend the same mechanism to
work across kexec, too.
The protocol on the PID 1 side works roughly as follows:
* All preservable fds are collected into a single LUO session named
"systemd". Each FD gets uploaded with a token. Token 0 in that session
is reserved for a "mapping" memfd, which carries a JSON object
describing how to dispatch the other tokens back to units on the next
boot:
unit IDs are used as the unit identifier, as they're stable
across daemon-reexec, switch-root and kexec. token refers to the
LUO token assigned to the object in the session.
* On shutdown for MANAGER_KEXEC, just before manager_free(), systemd
walks all services and serializes their persistent fd store contents
(fds + FDNAMEs + unit IDs) into a JSON memfd. The FDs themselves are
gathered into an FDSet to be kept around. The fdset and the
serialization memfd are passed to systemd-shutdown via the
SYSTEMD_LUO_SERIALIZE_FD environment variable providing the fd number,
so the actual LUO session creation and ioctls can happen as the very
last step before kexec (shutdown implementation is the next commit).
* On boot, manager_luo_restore_fd_stores() opens /dev/liveupdate,
tries to retrieve the "systemd" session, reads the mapping memfd,
then for each entry retrieves the fd from the session and attempts
to attach it to the matching unit's fd store.
* The FDs are injected in the appropriate unit's FD stores using the
same mechanism as the LISTEN_FDS propagation that was set up earlier.
Non-kexec shutdown paths are unaffected: if MANAGER_KEXEC is not the
final objective, no serialization file is produced and no LUO session
is ever created. Likewise if /dev/liveupdate does not exist, nothing
happens.