From: Luca Boccassi
Date: Fri, 15 May 2026 15:07:48 +0000 (+0100)
Subject: core: support FD Store propagation through manager instances, and preservation throug...
X-Git-Tag: v261-rc1~154
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=053470f2db9af16f3fa9bed10d0d98c00940c630;p=thirdparty%2Fsystemd.git

core: support FD Store propagation through manager instances, and preservation through kexec via LUO (#41683)

First of all, FD store propagation is enabled between manager instances, and nspawn instances, via LISTEN_FDS and sd_notify. These can be nested arbitrarily deeply all the way up to the system manager, and on restart will be propagated all the way down to the origin. The FD payload of an nspawn container running as a user unit will be preserved, all the way up to the system manager, and then down again.

The kernel Live Update Orchestrator (LUO) exposes /dev/liveupdate, which lets userspace hand a set of "preservable" kernel objects to the new kernel across a kexec-based reboot. For now it only supports memfds, with more object types (virtio devices, etc.) expected to be added later.

This is a natural fit for systemd's FD Store feature: services hand memfds (containing serialized state or other service data) to PID 1 via FDSTORE=1 sd_notify() messages, and get them back on their next start. Today this works across service restarts, soft reboots and initrd→rootfs transitions. With LUO, this series extends the same mechanism to work across kexec too. The nested preservation of FD stores is thus extended across kexec as well.

All preservable fds are collected into a single LUO session named "systemd". Each fd is uploaded with an index (token). Token 0 is reserved for a "mapping" memfd, which carries a JSON object describing how to dispatch the other tokens back to units on the next boot. Unit names are used as the unit identifier, as they are stable across daemon-reexec, switch-root and kexec.
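
As an illustration of the token scheme described above, the following Python sketch assigns tokens to stored fds, builds a mapping object, and dispatches tokens back to units. The JSON layout (a top-level object keyed by unit name, with entries holding "name" and "token") and the helper names build_mapping() and dispatch() are assumptions made for this example, not the format or functions the series actually uses.

```python
import json

MAPPING_TOKEN = 0  # token 0 is reserved for the mapping memfd


def build_mapping(fd_stores):
    """Assign tokens to stored fds and build the (hypothetical) mapping object.

    fd_stores: {unit_name: [(fdname, fd), ...]}
    Returns (mapping, uploads), where uploads is {token: fd}.
    """
    mapping = {}
    uploads = {}
    next_token = MAPPING_TOKEN + 1
    for unit in sorted(fd_stores):
        entries = []
        for fdname, fd in fd_stores[unit]:
            uploads[next_token] = fd
            entries.append({"name": fdname, "token": next_token})
            next_token += 1
        if entries:
            mapping[unit] = entries
    return mapping, uploads


def dispatch(mapping_json, retrieve):
    """Yield (unit, fdname, object), resolving every token via retrieve()."""
    mapping = json.loads(mapping_json)
    for unit, entries in mapping.items():
        for entry in entries:
            yield unit, entry["name"], retrieve(entry["token"])
```

In this sketch retrieve() stands in for whatever fetches an object from the session by token; a plain dictionary lookup is enough to exercise the round trip.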
token refers to the LUO index assigned to the object in the session.

On shutdown for MANAGER_KEXEC, just before manager_free(), systemd walks all services and serializes their persistent fd store contents (fds + FDNAMEs + cgroup paths) into a JSON memfd. The fds themselves are gathered into an FDSet. The fdset and the serialization memfd are passed to systemd-shutdown, with the SYSTEMD_LUO_SERIALIZE_FD environment variable providing the fd number, so the actual LUO session creation and ioctls happen as the very last step before kexec.

On boot, manager_luo_restore_fd_stores() opens /dev/liveupdate, tries to retrieve the "systemd" session, reads the mapping memfd, then for each entry retrieves the fd from the session and attempts to attach it to the matching unit's fd store.

Because the initrd-stage PID 1 runs before the real rootfs units are loaded, fds whose target unit is not (yet) known are not dropped: they are stashed in a new luo_held_fds hashmap keyed by cgroup path. Attaching them is retried in two places: after deserialization, and from unit_load(), so fds land in the correct fd store as soon as the owning unit is parsed, allowing units to be plugged in at runtime.

Non-kexec shutdown paths are unaffected: if MANAGER_KEXEC is not the final objective, no serialization file is produced and no LUO session is ever created. Likewise, if /dev/liveupdate does not exist, nothing happens.

The LUO session creation is performed by systemd-shutdown, rather than by PID 1, deliberately: it is the last point where we can be sure all other processes have already been killed, so nothing else can race us into creating (or worse, hijacking) the "systemd" session. /dev/liveupdate is a singleton and session names are global. In addition, any kernel-visible side effects of preserving objects (memory pinning, etc.) are delayed until the absolute last moment, minimizing the window in which they could affect the running system.
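
The stash-and-retry handling described above for fds whose unit is not yet loaded can be sketched in Python as follows. This is a toy model under stated assumptions: the Manager class, its method names and the string unit keys are hypothetical, and only the retry on unit load is modelled, not the one after deserialization.

```python
class Manager:
    """Toy stand-in for PID 1: loaded units plus a stash of unclaimed fds."""

    def __init__(self):
        self.fd_stores = {}  # unit key -> [(fdname, fd), ...] for loaded units
        self.held_fds = {}   # unit key -> fds waiting for their unit to load

    def restore(self, entries):
        """entries: iterable of (unit_key, fdname, fd) read back on boot."""
        for key, fdname, fd in entries:
            if key in self.fd_stores:
                self.fd_stores[key].append((fdname, fd))
            else:
                # Unit not known (yet): keep the fd instead of dropping it.
                self.held_fds.setdefault(key, []).append((fdname, fd))

    def load_unit(self, key):
        """Load a unit and hand over anything that was held for it."""
        store = self.fd_stores.setdefault(key, [])
        store.extend(self.held_fds.pop(key, []))
        return store
```

Keying the stash by the same identifier used at restore time means a unit that appears later, for example after the switch to the real rootfs, picks up its fds with a single lookup.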
There is no behaviour change for shutdown paths other than kexec, or for kexec when systemd didn't hand over a serialization fd (e.g. because no service had any fds stored, or because LUO wasn't supported at serialization time).

Finally, since LUO sessions cannot be nested under other sessions, third-party sessions need to be handled explicitly and held open in the shutdown binary alongside our own internal session, to allow services to create and preserve their own sessions. The requirement comes from VMMs that wish to preserve VM state across kexec: some file descriptors (e.g. KVM's vmfd from the KVM_CREATE_VM ioctl) cannot be transferred between processes via SCM_RIGHTS, so they cannot be stashed in the FD Store directly. Additionally, some file descriptors must be handled all-or-nothing, again tied to KVM, where a VM and its associated devices are one indivisible group.

https://docs.kernel.org/userspace-api/liveupdate.html
https://docs.kernel.org/core-api/liveupdate.html
https://docs.kernel.org/admin-guide/mm/kho.html
--- 053470f2db9af16f3fa9bed10d0d98c00940c630