LUO: add support for preserving third party sessions
LUO sessions cannot be nested under other sessions. This means we need
to handle them explicitly, and hold them open in the shutdown binary
like we do with our own internal session, to allow services to create
their own.
The requirement to support third party sessions comes from VMMs that
wish to preserve VM state across kexec, as some file descriptors
(KVM's vmfd from the KVM_CREATE_VM ioctl) cannot be transferred between
processes via SCM_RIGHTS, so they cannot be stashed in the FD Store
directly. Also, some file descriptors have to be handled all together or
not at all, again because of KVM and devices that are all part of the
same VM.
Luca Boccassi [Mon, 30 Mar 2026 23:29:19 +0000 (00:29 +0100)]
shutdown: prepare LUO session for FD Stores before kexec
Wires up the systemd-shutdown side of the kexec-via-LUO fd store preservation.
When rebooting via kexec, systemd builds a JSON description of the fd
stores of all loaded services and passes it to systemd-shutdown through
the SYSTEMD_LUO_SERIALIZE_FD environment variable. The FDs themselves
come in as part of the normal shutdown FDSet. systemd-shutdown's job is
then, at the very last moment before invoking the kexec syscall, to
move that state into a kernel LUO session so it survives the reboot.
Doing the LUO session creation here, rather than in PID 1, is
deliberate:
* It's the last point where we can be sure all other processes have
already been killed, so nothing else can race us into creating (or
worse, hijacking) the "systemd" session, as /dev/liveupdate is a
singleton and a session name is global.
* Any kernel-visible side effects of preserving objects (memory
pinning etc.) are delayed until the absolute last moment, minimizing
the window in which they could affect the running system.
No behaviour change for shutdown paths other than kexec, or for kexec
when systemd didn't hand over a serialization fd (e.g. because no
service had any fds stored, or because LUO wasn't supported at
serialization time).
Luca Boccassi [Fri, 1 May 2026 13:25:11 +0000 (14:25 +0100)]
core: support FD Store preservation through kexec via LUO
The kernel Live Update Orchestrator (LUO) exposes /dev/liveupdate, which
allows userspace to hand a set of "preservable" kernel objects to the
new kernel across a kexec-based reboot. For now it only supports memfds,
with more object types (virtio devices, etc.) expected to be added later.
This is a natural fit for systemd's FD Store feature: services hand
memfds (containing serialized state or other service data) to PID 1 via
FDSTORE=1 sd_notify() messages, and get them back on their next start.
Today this works across service restarts, soft reboots and
initrd→rootfs transitions. With LUO we can extend the same mechanism to
work across kexec, too.
The protocol on the PID 1 side works roughly as follows:
* All preservable fds are collected into a single LUO session named
"systemd". Each FD gets uploaded with a token. Token 0 in that session
is reserved for a "mapping" memfd, which carries a JSON object
describing how to dispatch the other tokens back to units on the next
boot:
unit IDs are used as the unit identifier, as they're stable
across daemon-reexec, switch-root and kexec. token refers to the
LUO token assigned to the object in the session.
* On shutdown for MANAGER_KEXEC, just before manager_free(), systemd
walks all services and serializes their persistent fd store contents
(fds + FDNAMEs + unit IDs) into a JSON memfd. The FDs themselves are
gathered into an FDSet to be kept around. The FDSet and the
serialization memfd are passed to systemd-shutdown via the
SYSTEMD_LUO_SERIALIZE_FD environment variable providing the fd number,
so the actual LUO session creation and ioctls can happen as the very
last step before kexec (shutdown implementation is the next commit).
* On boot, manager_luo_restore_fd_stores() opens /dev/liveupdate,
tries to retrieve the "systemd" session, reads the mapping memfd,
then for each entry retrieves the fd from the session and attempts
to attach it to the matching unit's fd store.
* The FDs are injected in the appropriate unit's FD stores using the
same mechanism as the LISTEN_FDS propagation that was set up earlier.
Non-kexec shutdown paths are unaffected: if MANAGER_KEXEC is not the
final objective, no serialization file is produced and no LUO session
is ever created. Likewise if /dev/liveupdate does not exist, nothing
happens.
Luca Boccassi [Fri, 1 May 2026 13:06:11 +0000 (14:06 +0100)]
nspawn: support forwarding FDs from payloads to managers
When there is a NOTIFY_SOCKET, and FDs are received from the
payload following the FD Store protocol, forward them up the
chain to the service manager that is managing nspawn.
This allows FD Store persistence across container restarts,
and can chain up for user managers as well to survive restarting
those, or reexecs, and in the future reboots too via LUO.
Add a new test case to exercise the PID1 -> user session -> nspawn -> payload
chain.
Luca Boccassi [Fri, 1 May 2026 13:19:33 +0000 (14:19 +0100)]
core: propagate FDs from store from user to system manager
In order to allow FD Stores of user units to survive a user
session restart, propagate FDs received via the protocol up one
level from user to system manager via sd_notify.
And the other way around, propagate them down via LISTEN_FDS
tagging them with the unit name so that the child manager can
inject them in the appropriate unit.
Ensure units that are dead or not loaded can get FDs added to
their stores, and that they are correctly propagated once the
unit is started or loaded. When the unit is not loaded we don't
know what the FD max limit is, so simply increase it for each FD
injected, and then when the unit is realised prune it down to
match the unit's now available config in case the limit is lower
than the number of FDs in the store.
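The grow-then-prune handling for units that are not loaded yet can be sketched as follows. This is a toy model rather than the actual PID 1 code: the class and method names are hypothetical, and which entries get dropped when pruning (here the most recently injected ones) is an assumption not stated above.

```python
class PendingFDStore:
    """Toy fd store for a unit whose configuration may not be loaded yet."""

    def __init__(self):
        self.fds = []      # (fdname, fd) pairs in arrival order
        self.max_fds = 0   # real limit is unknown until the unit is loaded

    def inject(self, fdname, fd):
        # The configured limit is unknown, so raise it by one per injected FD.
        self.max_fds += 1
        self.fds.append((fdname, fd))

    def realise(self, configured_max):
        # The unit's config is now available: adopt its limit, prune the excess.
        # Assumption: the newest entries are the ones dropped.
        self.max_fds = configured_max
        dropped = self.fds[configured_max:]
        self.fds = self.fds[:configured_max]
        return dropped


store = PendingFDStore()
for i, name in enumerate(["state", "socket", "cache"]):
    store.inject(name, 10 + i)
dropped = store.realise(configured_max=2)
print(store.fds, dropped)
```

In this model nothing is rejected while the unit is unknown; the limit is only enforced once the configuration can actually be read.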
Each FD sent up or down is assigned a monotonic index, and the manager
also sends a JSON map that associates the index with the original
unit and FDNAME:
core: add WorkingDirectory, Environment and SetCredential{,Encrypted} to io.systemd.Unit.StartTransient (#41874)
This PR adds some more properties to the io.systemd.Unit.StartTransient
varlink interface: WorkingDirectory, Environment and
SetCredential{,Encrypted}. It's also hopefully a useful starting point to
establish a pattern to add even more.
manager: skip reopening of console and signal reset when running as normal program
We want to reopen the console used for logging when running as PID1, but
also when running a user manager (c.f. 48a601fe5de8aa0d89ba6dadde168769fa7ce992
and 2a646b1d624e510a79785e1268b55a9c3a441db5). But this can cause
problems when the binary is invoked directly, e.g. to print --help.
E.g. if we ignore SIGPIPE, we'd remain running briefly after
'/usr/lib/systemd/systemd --help | head -n1'.
Previously, the getopt machinery would print to stderr unconditionally.
But after the rework of option parsing, which means that we use the
log_* functions to report errors, the test that checks if we print errors
to stderr started failing.
So let's skip some more of the setup if !invoked_by_systemd().
It'd be nice if we could not repeat the information about the option
list a second time. But I don't see a nice way to do this, since
(by design) with the macro approach, the macros must be intertwined
with the parse_argv() code. But that code in turn refers to a bunch
of variables, so lifting out the function is not immediately possible.
So I think it's best to keep the existing approach where we provide
a list of options, without additional context, and skip them using
a custom routine.
099663ff8c117303af369a4d412dafed0c5614c2 added "support" for
-b/-s/-z ARG with a comment of
> /* Just to eat away the sysvinit kernel cmdline args without getopt()
> * error messages that we'll parse in parse_proc_cmdline_word() or
> * ignore. */
And for PID1 that was valid. But when not running as PID1, those
options would be parsed as valid but then help() would immediately
return -EINVAL:
$ build-old/systemd -b; echo $?
1
At the same time, when running as PID1, if we encounter an error,
we shouldn't opine about the rest of the command line. So continuing
with the loop and the checks after the loop were iffy.
Later, cd57038a30aa9447bde3af7111ac8dc517b38bbf made a big refactoring,
and the 'break' (i.e. continuation of the loop) was changed to 'return 0',
making things even more confusing, since now we'd just silently stop in
the middle of the command line if -b/-s/-z were encountered.
So be more careful: when running as PID1, stop parsing on error
and return from the function. We didn't parse the full command line,
so the later checks are not useful. Silently ignore -b/-s/-z.
When not running as PID1, explicitly say that -b/-s/-z are not
supported, and propagate the error if parsing fails, e.g. with
an unknown option.
parse_argv() uses FOREACH_OPTION (not _OR_RETURN) so we can preserve the
existing PID 1 tolerance: an unknown option, or one of the sysvinit-style
-b/-s/-z catch-alls, returns 0 instead of -EINVAL when getpid_cached() == 1.
The docs documented --crash-vt=, but the code implemented "crash-chvt",
as introduced in b9e74c399458a1146894ce371e7d85c60658110c.
The output from --help is now modified to match code.
getopt-defs.h is intentionally left in place since
proc_cmdline_filter_pid1_args() in src/basic/proc-cmdline.c still uses
its COMMON_/SYSTEMD_/SHUTDOWN_GETOPT_* macros to walk the kernel
command line.
Previously, opterr=0 was used to suppress error messages about option
parsing in PID1. They are now logged at debug level (if debug logging
is enabled.)
Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>
Ivan Kruglov [Thu, 14 May 2026 16:41:24 +0000 (09:41 -0700)]
shared: add exec_command_status_build_json() and ExecCommandStatus varlink type to common
Add exec_command_status_build_json() and exec_command_status_list_build_json() to varlink-common, alongside exec_command_build_json() and exec_command_list_build_json(). The status list function is the runtime counterpart of the command list function — the two arrays are positionally aligned so index N in the status array corresponds to index N in the command array. Commands that have not yet run produce null entries to preserve alignment.
Add the ExecCommandStatus varlink struct type to varlink-idl-common next to ExecCommand. It contains PID, timestamps, and mutually exclusive ExitStatus (int, for normal exit) / ExitSignal (string, for signal kill).
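The positional alignment between the command list and the status list can be illustrated with a small sketch. The helper name and the sample values are hypothetical; only the field names PID, ExitStatus and ExitSignal come from the description above, and their exact JSON spelling is an assumption.

```python
import json


def build_status_list(commands, statuses):
    """Return one entry per command; None (JSON null) where nothing ran yet."""
    return [statuses.get(i) for i in range(len(commands))]


commands = ["/usr/bin/prepare", "/usr/bin/daemon", "/usr/bin/cleanup"]
statuses = {
    0: {"PID": 101, "ExitStatus": 0},           # exited normally
    1: {"PID": 102, "ExitSignal": "SIGTERM"},   # killed by a signal
    # index 2 has not run yet, so it has no status
}

status_list = build_status_list(commands, statuses)
encoded = json.dumps(status_list)
print(encoded)
```

Because the gaps are encoded as null instead of being skipped, a consumer can pair entry N of one array with entry N of the other without any extra key.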
Luca Boccassi [Thu, 14 May 2026 20:06:15 +0000 (21:06 +0100)]
ci: switch SUSE mkosi mirror to cdn.o.o
The cdn mirror is preferred by SUSE for clouds/CIs. There have been issues with some
mirrors, which fail to download from GHA quite often lately, so hopefully this will
make it reliable again.
Philip Withnall [Sun, 3 May 2026 21:36:32 +0000 (22:36 +0100)]
test: Add a sysupdate test for files which are a prefix match of each other
This tests whether the pattern matching code checks it’s matched the
whole string and not just a prefix (see commit 4ffb60319b).
In particular, this tests a setup which KDE currently use in their
sysupdate images, where two regular file transfers are done, one of a
`foo.erofs` file, and the other `foo.erofs.caibx`. As one is a prefix of
the other, they were hitting this bug.
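The difference between a prefix match and a whole-string match can be shown with a toy pattern matcher. This is only an illustration and not the sysupdate matching code: the translation of the "@v" version wildcard into a regular expression and the file names are assumptions made for the example.

```python
import re


def pattern_to_regex(pattern):
    # Toy translation: "@v" becomes a capture group, everything else is literal.
    return "(.+)".join(re.escape(part) for part in pattern.split("@v"))


def match_prefix(pattern, filename):
    # Buggy variant: accepts the file as soon as a prefix of it matches.
    return re.match(pattern_to_regex(pattern), filename) is not None


def match_whole(pattern, filename):
    # Fixed variant: the pattern has to consume the entire file name.
    return re.fullmatch(pattern_to_regex(pattern), filename) is not None


image = "foo_@v.erofs"
index = "foo_@v.erofs.caibx"
name = "foo_1.2.erofs.caibx"

print(match_prefix(image, name))  # the image pattern wrongly claims the index file
print(match_whole(image, name))
print(match_whole(index, name))
```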
When the periodic RA timer fires, any error returned by sendmsg()
currently propagates up through sd_radv_send() into radv_timeout(),
which then calls sd_radv_stop(). The RA engine is never started again
until the next carrier transition.
On an 802.3ad bond there is a window right after carrier-up where the
link is administratively up but no aggregator has been selected yet, so
sendmsg() returns ENOBUFS. If the very first RA after a flap lands in
that window, radv stops permanently and all clients lose their SLAAC
addresses, on-link/PD prefixes, and default router once the previously
advertised lifetimes expire, while IPv4 keeps working, leading to a very
confusing situation with v4 up and v6 down.
Handle this the same way solicited RAs already do (see
radv_process_packet()): log the failure and reschedule the timer instead
of giving up. ra_sent is left untouched on failure so we stay in the
fast initial-advertisement regime until a send actually succeeds.
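The retry behaviour described above can be modelled roughly like this. It is a simplified simulation, not the sd-radv code: the class, its attributes and the flaky sender are hypothetical, and only ra_sent and the ENOBUFS failure are taken from the text.

```python
import errno


class PeriodicRA:
    """Toy model of the periodic RA timer callback."""

    def __init__(self, send):
        self.send = send        # callable that raises OSError on failure
        self.ra_sent = 0        # counts successful sends only
        self.running = True
        self.rescheduled = 0
        self.log = []

    def on_timeout(self):
        try:
            self.send()
        except OSError as e:
            # Log and keep going instead of stopping the engine.
            self.log.append(f"RA send failed, will retry: {e.strerror}")
        else:
            self.ra_sent += 1
        # Re-arm the timer in both cases; ra_sent is untouched on failure.
        self.rescheduled += 1


def make_flaky_send(failures):
    remaining = [failures]

    def send():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise OSError(errno.ENOBUFS, "No buffer space available")

    return send


ra = PeriodicRA(make_flaky_send(1))
ra.on_timeout()   # lands in the ENOBUFS window
ra.on_timeout()   # aggregator is ready, send succeeds
print(ra.running, ra.ra_sent, ra.rescheduled, ra.log)
```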
network: honour static IPv6LL addresses in network_adjust_*()
link_radv_enabled() and link_ndisc_enabled() use
link_ipv6ll_enabled_harder(), which considers a static fe80:: address
in [Address] sufficient to run radv/ndisc even when LinkLocalAddressing=
(or IPv6LinkLocalAddressGenerationMode=none, which network_verify()
folds into the same flag) disables the kernel-generated link-local.
network_adjust_radv()/ndisc()/dhcp() however only check the raw
link_local flag and zero router_prefix_delegation / ndisc / dhcp&IPV6
at parse time, so the runtime gate never gets a chance to fire.
Factor the static-LL lookup out of link_ipv6ll_enabled_harder() into a
Network-level helper and use it in the three network_adjust_*()
functions, bringing parse-time and runtime behaviour in line.
[zjs: this was originally proposed in
https://github.com/systemd/systemd/commit/930fc9d6980f27b278527b0d6117f97296fcaf6a.
I'm rescuing one chunk from that patch.]
core: reorder cases in parse_argv() to match order in --help
The hidden-from-help options (--crash-reboot, --service-watchdogs,
--deserialize, --switched-root, --machine-id, -D, -b/-s/-z, ?)
move to the bottom. The 'b'/'s'/'z' → '?' fall-through is preserved.
Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>
我超厉害 [Thu, 14 May 2026 17:51:29 +0000 (01:51 +0800)]
sd-device: use ERRNO_IS_NEG_DEVICE_ABSENT() for device-id load failures (#41764)
Device enumeration may encounter transient errors such as ENXIO when devices
appear or disappear concurrently. These conditions represent expected "device absent"
races and should be treated uniformly across the enumeration logic.
This change replaces the ENODEV-specific check with ERRNO_IS_NEG_DEVICE_ABSENT(),
ensuring that all expected disappearance conditions are handled consistently.
Unexpected errors are still propagated, while expected races are ignored without
aborting the enumeration.
Yu Watanabe [Thu, 14 May 2026 17:47:23 +0000 (02:47 +0900)]
A few more conversions of options and verbs (#41795)
I had those prepared before but I didn't submit them because the
automatic layout didn't work well. In two cases now the sync of widths
between verbs and options is disabled and one case is left with the
automatic alignment. I think it's good enough to merge.
Ivan Kruglov [Thu, 14 May 2026 16:41:09 +0000 (09:41 -0700)]
core: move service_context_build_json() to varlink-service.c
Move the existing (partial) service context builder from varlink-unit.c into its own varlink-service.c file, following the pattern established by other unit type context builders (varlink-path.c, varlink-scope.c, etc.). No functional change.
Yu Watanabe [Thu, 14 May 2026 15:38:19 +0000 (00:38 +0900)]
meson: don't use Python module for host Python (#41959)
Checking for pefile required that module to be made available for the
Python used to build systemd, even though it's only used at runtime,
potentially via a different Python installation.
Furthermore, Meson's Python module doesn't do the right thing when cross
compiling and looking up a Python for the host system, so this would end
up uselessly checking whether the build Python had the pefile module,
which is not needed. Even if it were made to check the host Python using
find_program, it still relies on being able to run its Python, which in
a cross scenario it probably wouldn't be able to do.
All in all, this check does more harm than good, and prevents building
ukify in valid configurations, so remove it.
noxiouz [Tue, 17 Mar 2026 23:55:51 +0000 (23:55 +0000)]
coredump: add JSON output support to coredumpctl info
Implement support for the --json= flag in the info subcommand
(issue #38844). Previously, coredumpctl info always produced
human-readable text output regardless of --json=.
Add a CoredumpFields struct that holds all journal fields extracted
for a coredump entry, along with coredump_fields_done() to release
member resources and coredump_fields_load() to populate the struct
from a journal entry. Both print_info() and the new print_info_json()
use this shared loader, eliminating the duplicate RETRIEVE loop.
print_info_json() builds a JSON object with the same fields shown by
print_info(). Missing fields are omitted via SD_JSON_BUILD_PAIR_CONDITION,
matching the tolerant behavior of print_info() rather than skipping the
entry entirely. Signal/Reason handling mirrors print_info(): normal
coredumps (MESSAGE_ID == SD_MESSAGE_COREDUMP_STR) emit a numeric Signal
field; non-normal entries (kernel oops, etc.) emit a Reason field with
the raw text from COREDUMP_SIGNAL.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
Daan De Meyer [Tue, 12 May 2026 13:03:49 +0000 (13:03 +0000)]
nspawn: split boot parameters into env vars and argv
When the kernel hands the command line to PID 1, any KEY=VALUE assignment
whose KEY does not contain a '.' is exported as an environment variable
(with '-' replaced by '_') rather than passed as an argument. Mimic the
same split in --boot mode so kernel-cmdline-style arguments passed after
the container path behave as they would on a real boot.
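A minimal sketch of the split rule described above, assuming nothing beyond what the message states: KEY=VALUE with no '.' in KEY becomes an environment variable with '-' replaced by '_' in the key, and everything else stays in argv. The function name is hypothetical, and real kernel command line handling has further details that are not modelled here.

```python
def split_boot_parameters(params):
    """Split kernel-cmdline-style parameters into (environment, argv)."""
    env, argv = {}, []
    for param in params:
        key, sep, value = param.partition("=")
        if sep and key and "." not in key:
            # Plain KEY=VALUE: exported as an environment variable.
            env[key.replace("-", "_")] = value
        else:
            # Dotted keys and bare words are passed on as arguments.
            argv.append(param)
    return env, argv


env, argv = split_boot_parameters(
    ["systemd.unit=rescue.target", "my-var=1", "quiet", "TERM=vt220"]
)
print(env)
print(argv)
```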
various: fix duplicated logging from parse_path_argument
As pointed out in review, parse_path_argument can fail for non-oom reasons.
But the function already logs, so the correct thing to do is to just
propagate the error.
busctl: convert to the new option and verb parsers
The conversion doesn't work great, because some of the verbs take many
arguments and the first column is extremely wide. So similarly to
kernel-install, I dropped the sync of column widths. This allows the
help for options to use most of the available space.
localectl: convert to the new option and verb parsers
The verb synopses are long, so they got broken up:
===============================================================================
> localectl [OPTIONS…] COMMAND …
Query or change system locale and keyboard settings.
Commands:
[status] Show current locale settings
set-locale LOCALE... Set system locale
list-locales Show known locales
set-keymap MAP [MAP] Set console and X11 keyboard mappings
list-keymaps Show known virtual console keyboard mappings
set-x11-keymap LAYOUT [MODEL Set X11 and console keyboard mappings
[VARIANT [OPTIONS]]]
list-x11-keymap-models Show known X11 keyboard mapping models
list-x11-keymap-layouts Show known X11 keyboard mapping layouts
list-x11-keymap-variants Show known X11 keyboard mapping variants
[LAYOUT]
list-x11-keymap-options Show known X11 keyboard mapping options
Options:
-h --help Show this help
--version Show package version
-l --full Do not ellipsize output
--no-pager Do not start a pager
--no-ask-password Do not prompt for password
-H --host=[USER@]HOST Operate on remote host
-M --machine=CONTAINER Operate on local container
--no-convert Don't convert keyboard mappings
See the localectl(1) man page for details.
===============================================================================
But I think this is OK. Everything is readable. On a more normal terminal,
everything fits nicely.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
kernel-install: convert to the new option and verb parsers
The verb synopses are very long because of the many parameters.
Previously they were shown without help and occupied all available columns.
With the autogenerated help format, this doesn't work great. So the
verbs and options tables are not synced, so that help for options can
use more columns. I think in this case this is better than the
alternatives.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
Luca Boccassi [Thu, 14 May 2026 12:10:13 +0000 (13:10 +0100)]
cgroup: Add CPUSetPartition= setting (#42013)
Add support for configuring cpuset partition type via the
CPUSetPartition= unit file setting. This controls the kernel's
cpuset.cpus.partition cgroup attribute.
The setting takes one of "member", "root", or "isolated". This is
useful for real-time workloads that require dedicated CPU resources
without interference from other processes.
When set, systemd will write the partition type to the
cpuset.cpus.partition cgroup file. If the kernel rejects the value
(e.g., due to partition hierarchy rules), a warning is logged and the
unit continues with the kernel's default partition type.
Co-developed-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
shared/verbs: split verbs in two lines when the synopsis is > 25 characters
The help tests would not pass because in cases where the verb synopsis
is very long, we'd format the table badly if the terminal is fairly
narrow. I experimented with a few solutions, but overall, it's hard to
achieve very good layout with the automatic formatting. I think the
approach in this commit works the best: we end up with a two- or
three-line verb synopsis, which is similar to what we did manually
before.
$ COLUMNS=80 build/localectl -h
...
Commands:
[status] Show current locale settings
set-locale LOCALE... Set system locale
list-locales Show known locales
set-keymap MAP [MAP] Set console and X11 keyboard mappings
list-keymaps Show known virtual console keyboard mappings
set-x11-keymap LAYOUT [MODEL Set X11 and console keyboard mappings
[VARIANT [OPTIONS]]]
list-x11-keymap-models Show known X11 keyboard mapping models
list-x11-keymap-layouts Show known X11 keyboard mapping layouts
list-x11-keymap-variants Show known X11 keyboard mapping variants
[LAYOUT]
list-x11-keymap-options Show known X11 keyboard mapping options
I think that almost nobody actually uses an 80 column terminal, and if
they do, they probably don't spend too much time looking at our --help
output there. So the goal here is to do something reasonable and robust
and get the tests to pass.
We can use strjoina here because the strings are fully under our
control.
fuzz-systemctl-parse-argv: add two corpus files to test compat parsers
Looking at the corpus examples, I'm not sure the fuzzer even went into
the compat parsers. None of the files have argv[0] that'd cause
invoked_as() to go into the compat paths. So add the files to provide
a quick test and possibly bias the fuzzer search into the right
direction.
shared/options: implement the equivalent of 'opterr'
All log messages during option parsing are emitted using log_full,
and the level is set as LOG_ERR + state->log_level_shift. The default
shift is 0, but if set to e.g. 4, we log at LOG_DEBUG, and if set
to 5 or higher, logging is effectively suppressed. (Unless compiled
with LOG_TRACE, when it'd be suppressed if the shift is set to 6
or higher.) So this gives something like 'opterr', except
without global state and potentially more flexible.
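The arithmetic behind the shift can be checked with a short sketch. The function name is hypothetical; LOG_ERR = 3 and LOG_DEBUG = 7 are the standard syslog priorities, and treating the trace level as LOG_DEBUG + 1 is an assumption that matches the thresholds given above.

```python
LOG_ERR, LOG_DEBUG = 3, 7   # syslog(3) priorities


def option_log_level(shift, trace_build=False):
    """Return the effective log level, or None if the message is suppressed."""
    level = LOG_ERR + shift
    # Assumption: a LOG_TRACE build allows one extra level above LOG_DEBUG.
    max_level = LOG_DEBUG + 1 if trace_build else LOG_DEBUG
    return level if level <= max_level else None


for shift in (0, 4, 5, 6):
    print(shift, option_log_level(shift), option_log_level(shift, trace_build=True))
```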
systemctl_main() is moved to systemctl.c to allow fuzz-systemctl-parse-argv
to compile. It needs systemctl_help(), which needs the verb table, with the
expected groups. Once we provide that, the linker needs all the verb_*
functions. So add dummy implementations in fuzz-systemctl-parse-argv to
allow the link to happen.
The alternative would be to provide an empty option table, but that
seems to be more complicated, and also can simulate parsing of the whole
command line with the full verb set, so it seems better to test with the
real verb table.
The verbs[] table still lives in systemctl-main.c — only the option parsing
side is migrated. systemctl_dispatch_parse_argv() gains a remaining_args
out-param so run() can pass the parsed positional args to systemctl_main(),
which dispatches via _dispatch_verb_with_args() instead of dispatch_verb().
The Options section of --help now renders from the OPTION declarations; the
verb sections still use raw printfs and will be converted alongside the
verbs[] migration.
Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>
systemctl: reorder cases in parse_argv() to match order in --help
Compatibility-only options (--fail, --irreversible, --ignore-dependencies,
--no-legend) are grouped at the end alongside the '.' / '?' error handlers.
The case 'P': … _fallthrough_; case 'p': pair is kept intact and placed at
-p's slot in --help, so -P sits immediately before -p in the source.
Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>
systemctl: split out helper for --state= and allow resetting
So far we'd reject --state=, but it seems nicer to make it reset the
setting as we generally do. The output variable is modified in place…
Option parsing isn't atomic anyway, so I think it's fine to do that.
glemco [Sun, 10 May 2026 09:48:27 +0000 (11:48 +0200)]
cgroup: Add CPUSetPartition= setting
Add support for configuring cpuset partition type via the
CPUSetPartition= unit file setting. This controls the kernel's
cpuset.cpus.partition cgroup attribute.
The setting takes one of "member", "root", or "isolated". This is
useful for real-time workloads that require dedicated CPU resources
without interference from other processes.
When set, systemd will write the partition type to the
cpuset.cpus.partition cgroup file. If the kernel rejects the value
(e.g., due to partition hierarchy rules), a warning is logged and the
unit continues with the kernel's default partition type.
Co-developed-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
vmspawn: multifunction-pack pcie-root-ports on pcie.0 (#42077)
The pre-allocated pcie-root-port block in run_virtual_machine() places
every port directly on pcie.0 with an auto-assigned PCI address. A
minimal VM already costs 4 builtin + 10 hotplug spares = 14 pcie.0
slots, on top of 3 implicit virtio devices (virtio-rng-pci,
virtio-balloon, virtio-serial-pci) for another 3.
pcie.0 has 32 device-numbers; q35 reserves 0x00 (host bridge) and 0x1f
(ICH9 LPC), leaving ~30 auto-assignable slots. TEST-64-UDEV-STORAGE-
nvme_basic pushes 20 '-device nvme' lines through
$SYSTEMD_VMSPAWN_QEMU_EXTRA, which vmspawn does not see — total demand
14 + 3 + 20 = 37 > 30. Bus realization fails after QEMU's chardev has
already emitted the QMP greeting, and the monitor socket POLLHUPs while
we are mid-feature-probe, reported as 'QMP connection dropped during
feature probing'.
Pack the root ports as multifunction devices, 8 per pcie.0 device-
number (QEMU docs/pcie.txt:84, 117-120, 255-258). Function 0 of each
group carries multifunction=on; functions 1-7 ride the same slot via
addr=N.F. Each function remains independently hot-pluggable so vmspawn's
QMP device_add machinery is unaffected. 14 ports collapse to 2 pcie.0
slots; the nvme_basic budget becomes 2 + 3 + 20 = 25.
The chassis/slot properties (used for ACPI hotplug identity) stay as i+1
— they live in a uint8_t namespace independent of the PCI BDF and are
still unique. Base PCI slot 0x10 sits above the auto-assigned virtio
devices (which land at 0x01-0x03 in config order) and below the q35 LPC
reservation at 0x1f.
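The packing arithmetic above can be reproduced with a few lines. This is a sketch of the address calculation only, with hypothetical helper names; it does not show how vmspawn actually builds its QEMU configuration.

```python
BASE_SLOT = 0x10          # first pcie.0 device-number used for root ports
FUNCTIONS_PER_SLOT = 8    # PCI functions 0-7 share one device-number


def root_port_address(i):
    """Map root port index i to (device-number, function) on pcie.0."""
    return BASE_SLOT + i // FUNCTIONS_PER_SLOT, i % FUNCTIONS_PER_SLOT


def slots_used(n_ports):
    # Ceiling division: a partially filled group still occupies a slot.
    return -(-n_ports // FUNCTIONS_PER_SLOT)


for i in range(14):
    slot, function = root_port_address(i)
    flags = " multifunction=on" if function == 0 else ""
    print(f"port {i}: chassis={i + 1} addr={slot:x}.{function}{flags}")

ports, virtio, nvme = 14, 3, 20
print("before:", ports + virtio + nvme)            # one slot per port
print("after:", slots_used(ports) + virtio + nvme)
```

With 14 ports, indices 0-7 share device-number 0x10 and indices 8-13 share 0x11, which is where the two-slot figure comes from.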
While here, rebuild the slot-count formula to match what
assign_pcie_ports() actually allocates. The +1 'SCSI controller' term
was bogus — virtio-scsi-pci comes from the hotplug-spares pool via
hotplug_port_owner[] in vmspawn-qmp.c, never from a builtin port (see
the comment in assign_pcie_ports()). The +1 'network' and +1 'vsock'
terms are now conditional on arg_network_stack and use_vsock. Bind
volumes were missing entirely. And the per-drive accounting now mirrors
assign_pcie_ports()'s skip-SCSI behaviour: non-SCSI drives (root +
extras + bind volumes) take one builtin port each, SCSI drives take none
— they share a controller drawn from the hotplug pool at device-add
time. Tighten the cap from UINT8_MAX to 192 (24 packed device-numbers ×
8) so we cannot claim more than 24 slots on pcie.0 regardless of how
many extras/runtime-mounts a caller asks for.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Add a helper that tries to determine the number of installed CPUs. This
borrows heavily from physical_memory(), i.e. uses the physical number,
but caps by per-container cpuset.
Daan De Meyer [Wed, 13 May 2026 10:21:10 +0000 (12:21 +0200)]
nsresourced: re-link GID delegation file after atomic UID file write
userns_registry_remove() restores a sub-delegated UID range by writing
the previous owner's data to u<UID>.delegate with WRITE_STRING_FILE_ATOMIC.
Atomic writes go via a temp file and rename, which replaces the directory
entry with a fresh inode and severs the hardlink to g<GID>.delegate. The
stale GID side then keeps pointing at the prior inode with outdated owner
and ancestor data, so subsequent lookups via GID return wrong results.
Re-create the hardlink after the atomic write so the two views stay in
sync, matching what userns_registry_store() already does after writing
a new delegation.
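The way a write-to-temp-and-rename cycle severs a hardlink can be reproduced outside of nsresourced. The sketch below uses throwaway files in a temporary directory; the file contents and helper names are made up for the demonstration, and how the real code re-creates the link is not shown in the message above.

```python
import os
import tempfile


def atomic_write(path, data):
    # Write to a temp file, then rename over the target: new inode at `path`.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(data)
    os.replace(tmp, path)


def relink(src, dst):
    # Re-create dst as a hardlink to src, replacing the stale entry.
    tmp = dst + ".tmp"
    os.link(src, tmp)
    os.replace(tmp, dst)


with tempfile.TemporaryDirectory() as d:
    u = os.path.join(d, "u1000.delegate")
    g = os.path.join(d, "g1000.delegate")

    atomic_write(u, "owner=A")
    os.link(u, g)
    same_before = os.path.samefile(u, g)

    atomic_write(u, "owner=B")              # the rename severs the hardlink
    same_after_write = os.path.samefile(u, g)
    with open(g) as f:
        stale = f.read()                    # the GID side still sees old data

    relink(u, g)
    same_after_relink = os.path.samefile(u, g)
    with open(g) as f:
        fresh = f.read()

print(same_before, same_after_write, stale, same_after_relink, fresh)
```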
Daan De Meyer [Wed, 13 May 2026 20:21:57 +0000 (22:21 +0200)]
blockdev-util: Drop name argument from BLKPG functions
We don't use it, the kernel ignores it, let's just drop
the argument. Saves callers from having to ensure the name
they pass in fits in the 64 char buffer.
vmspawn: multifunction-pack pcie-root-ports on pcie.0
The pre-allocated pcie-root-port block in run_virtual_machine() places
every port directly on pcie.0 with an auto-assigned PCI address. A
minimal VM already costs 4 builtin + 10 hotplug spares = 14 pcie.0
slots, on top of 3 implicit virtio devices (virtio-rng-pci,
virtio-balloon, virtio-serial-pci) for another 3.
pcie.0 has 32 device-numbers; q35 reserves 0x00 (host bridge) and 0x1f
(ICH9 LPC), leaving ~30 auto-assignable slots. TEST-64-UDEV-STORAGE-
nvme_basic pushes 20 '-device nvme' lines through
$SYSTEMD_VMSPAWN_QEMU_EXTRA, which vmspawn does not see — total demand
14 + 3 + 20 = 37 > 30. Bus realization fails after QEMU's chardev has
already emitted the QMP greeting, and the monitor socket POLLHUPs
while we are mid-feature-probe, reported as 'QMP connection dropped
during feature probing'.
Pack the root ports as multifunction devices, 8 per pcie.0 device-
number (QEMU docs/pcie.txt:84, 117-120, 255-258). Function 0 of each
group carries multifunction=on; functions 1-7 ride the same slot via
addr=N.F. Each function remains independently hot-pluggable so
vmspawn's QMP device_add machinery is unaffected. 14 ports collapse to
2 pcie.0 slots; the nvme_basic budget becomes 2 + 3 + 20 = 25.
The chassis/slot properties (used for ACPI hotplug identity) stay as
i+1 — they live in a uint8_t namespace independent of the PCI BDF and
are still unique. Base PCI slot 0x10 sits above the auto-assigned
virtio devices (which land at 0x01-0x03 in config order) and below
the q35 LPC reservation at 0x1f.
While here, rebuild the slot-count formula to match what
assign_pcie_ports() actually allocates. The +1 'SCSI controller' term
was bogus — virtio-scsi-pci comes from the hotplug-spares pool via
hotplug_port_owner[] in vmspawn-qmp.c, never from a builtin port (see
the comment in assign_pcie_ports()). The +1 'network' and +1 'vsock'
terms are now conditional on arg_network_stack and use_vsock. Bind
volumes were missing entirely. And the per-drive accounting now
mirrors assign_pcie_ports()'s skip-SCSI behaviour: non-SCSI drives
(root + extras + bind volumes) take one builtin port each, SCSI
drives take none — they share a controller drawn from the hotplug
pool at device-add time. Cap at 120 ports (15 device-numbers × 8) so
we cannot run off the end of the 5-bit PCI device-number space — the
usable range starting at 0x10 ends at 0x1e because ICH9 LPC sits at
0x1f.0 single-function, blocking the rest of that slot for
multifunction packing.
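The 120-port cap follows directly from the slot range given above, as this small check shows. The constant and function names are hypothetical.

```python
BASE_SLOT = 0x10          # first device-number used for packed root ports
LPC_SLOT = 0x1F           # ICH9 LPC, single-function, not usable for packing
FUNCTIONS_PER_SLOT = 8

usable_slots = LPC_SLOT - BASE_SLOT             # 0x10 through 0x1e inclusive
max_ports = usable_slots * FUNCTIONS_PER_SLOT


def clamp_port_count(requested):
    """Never allocate more root ports than fit below the LPC slot."""
    return min(requested, max_ports)


print(usable_slots, max_ports, clamp_port_count(500))
```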
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
core: when figuring out whether to create orphanage units, consult vtable instead of allowlist
As per https://github.com/systemd/systemd/pull/41986#pullrequestreview-4281939586
This also corrects the list of unit types a bit:
1. this removes the mount/automount unit type from the list, since for these types
we do not allow aliases/renaming anyway.
2. this adds socket + swap units to the list, since they can change
name, and for both of them we actually do fork off processes hence
track resources.
* 8b9ea8981e Install new files for upstream build
* b230cf0490 use dh-cruft to register & purge volatile files
* 8f9b9952e1 Install new files for upstream build
Luca Boccassi [Wed, 13 May 2026 17:31:27 +0000 (18:31 +0100)]
import: do not create foreign ns on cleanup if not needed
The user ns is only used if the appropriate flag is set, so avoid
creating it unless it is. This avoids a spurious EPERM error in
TEST-13-NSPAWN.machined that is confusing when debugging failures:
[ 34.054] systemd-importd[504]: (transfer18) Imported 92%.
[ 34.118] systemd-importd[504]: (transfer18) Failed to decode and write: Broken pipe
[ 34.119] systemd-importd[504]: (transfer18) Exiting.
[ 34.121] systemd-importd[504]: (transfer18) Failed to allocate transient user namespace: Operation not permitted
[ 34.121] systemd-importd[504]: Transfer process failed with exit code 1.
Michael Vogt [Wed, 29 Apr 2026 13:45:10 +0000 (15:45 +0200)]
core: add support for Environment in io.systemd.Unit.StartTransient
This commit adds support to set `Environment` in the
`io.systemd.Unit.StartTransient` varlink call.
The behavior mirrors D-Bus: a `null` or `[]` clears the environment.
StartTransient() does not strictly need this, since the environment
always starts out empty there, but it is a good property to have in
case this code is reused.
Michael Vogt [Wed, 29 Apr 2026 08:56:41 +0000 (10:56 +0200)]
core: add WorkingDirectory to Unit.StartTransient
This commit adds setting the WorkingDirectory to the
`io.systemd.Unit.StartTransient` varlink call. This is a
first step towards more complete StartTransient in varlink.
The goal is to be as close as possible to the D-Bus parameters.
The exception is WorkingDirectory which is an object here so that
we avoid the `-` prefixes and use a more type-safe approach by
making it an explicit `missingOK` parameter.
The key names stay the same as the D-Bus properties (PascalCase).
If there are no equivalent D-Bus properties the native varlink
convention of camelCase is used.
Michael Vogt [Wed, 29 Apr 2026 14:52:15 +0000 (16:52 +0200)]
varlink: make fields of ExecContext nullable for partial input
We want to allow partial inputs for the ExecContext when we pass
that to io.systemd.Unit.StartTransient. So this commit makes the
fields nullable. Without this the varlink input validation will
reject partial ExecContext objects.
Michael Vogt [Wed, 29 Apr 2026 13:27:54 +0000 (15:27 +0200)]
core: tweak pattern of applying varlink properties for StartTransient
So far we had the pattern to check first if any property needs setting
and then have a function to set it. The downside is that when we add
a new property two different places need to change (once the `if`
in vl_method_start_transient_unit() and once in the specific helper
to do the actual setting). So instead this commit moves everything to
the helpers and tweaks the code so that we can always call the function.
copy: retire splice() use for copying files on disk
Apparently splice() is quite problematic, hence stop using it here. It is
also unnecessary these days: either copy_file_range() or sendfile()
typically works, so the splice() fallback doesn't give us much
anymore.
(At least I am not aware of a combination of fds where splice() would
work but neither copy_file_range() nor sendfile() would.)
This leaves one use of splice() in place, in
src/shared/socket-forward.c. We should probably kill that too, but
that'd require some reworking to use sendfile() I guess, and I am too
lazy for that right now. Moreover, in contrast to the other uses it's
probably even safe, since it uses an intermediary pipe always. But what
do I know...
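For illustration, a fallback chain without splice() might look like the sketch below. It is a Python model for Linux, not the code in src/shared/copy.c: the final read()/write() step and the "demote on any OSError" rule are assumptions of this sketch.

```python
import os

CHUNK = 1 << 20  # bytes requested per call

def _read_write(src_fd, dst_fd, count):
    """Plain read()/write() loop; returns the number of bytes moved."""
    buf = os.read(src_fd, count)
    view = memoryview(buf)
    while view:
        view = view[os.write(dst_fd, view):]
    return len(buf)

def copy_bytes(src_fd, dst_fd):
    """Copy from src_fd to dst_fd until EOF; return the bytes copied."""
    methods = []
    if hasattr(os, "copy_file_range"):
        methods.append(lambda n: os.copy_file_range(src_fd, dst_fd, n))
    if hasattr(os, "sendfile"):
        # offset=None: read from src_fd's current position (Linux).
        methods.append(lambda n: os.sendfile(dst_fd, src_fd, None, n))
    methods.append(lambda n: _read_write(src_fd, dst_fd, n))

    total = 0
    while True:
        try:
            n = methods[0](CHUNK)
        except OSError:
            if len(methods) == 1:
                raise              # the last method's error is the real one
            methods.pop(0)         # demote to the next method for good
            continue
        if n == 0:                 # EOF
            return total
        total += n
```

All three methods advance the file offsets of both descriptors, so a demotion partway through a file continues where the previous method stopped.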
This stuff is so useful, and should work out of the box I am sure. Given
that the metrics are only generated on request this shouldn't create any
additional burden by default.
Yes, this might enlarge reports a bit, if generated with everything on,
but we really should solve that at the report generation level, not at
the point where we make the metrics available.
Chris Down [Wed, 13 May 2026 12:25:08 +0000 (21:25 +0900)]
core: do not leak resources when handling stale alias state on reload (#41986)
The fix for the corrupted state when units become aliased on reload
leaks the now-aliased unit's resources, which become untracked and
essentially lost.
While fixing the state corruption is of course necessary, leaking
processes/etc. is not ideal for a system and service manager, so
instead attempt to keep track of them by creating stub units
on-the-fly.
This way resources are not leaked, there are clear indications of
where they moved, and all state can be tracked as expected.
RestrictFileSystemAccess= — dm-verity filesystem access enforcement via BPF LSM (#41340)
This series adds a new `RestrictFileSystemAccess=` setting in the
`[Manager]` section of `system.conf` that enforces a deny-default
execution policy: only binaries residing on signed dm-verity block
devices (and the initramfs during early boot) are permitted to execute.
Everything else — tmpfs, procfs, sysfs, anonymous executable mappings,
unsigned dm-verity devices — is denied.
The directive takes the values `no` (default), `exec` (lock down
execution), and accepts `yes` as an alias for `exec`. The name is
deliberately broader than what the initial values cover so the same
setting can grow to restrict other filesystem access categories in the
future (e.g. `any` to deny all access from untrusted filesystems, not
just execution).
### How it works
The BPF program is entirely self-contained; PID1 loads it and the kernel
does the rest. When dm-verity brings up a device, the kernel calls
`security_bdev_setintegrity()` twice during `verity_preresume()`: once
with the root hash and once with the signature validity status. Our
`lsm/bdev_setintegrity` hook captures the second call and records the
device number in a BPF hash map if the signature is valid. When a device
is torn down, `lsm/bdev_free_security` cleans up the map entry. No
userspace map population is needed at any point.
The enforcement side hooks `bprm_check_security` (execve), `mmap_file`
(PROT_EXEC mappings including shared libraries), and `file_mprotect`
(W→X transitions like JIT and libffi). Each hook resolves the file's
backing device via `file->f_inode->i_sb->s_dev` and looks it up in the
verity device map. For block-backed filesystems, `s_dev` equals
`s_bdev->bd_dev`, which avoids an extra pointer chase and NULL check on
`s_bdev` — non-block filesystems simply miss in the map and get denied
by the default policy.
During early boot the initramfs needs to be trusted as well, since it
runs before any dm-verity volume is mounted. PID1 writes the initramfs
superblock's device number into a BPF global before attaching the
programs, and clears it after `switch_root` to close the trust window.
As a prerequisite, PID1 also verifies that
`dm_verity.require_signatures=1` is active — without it, unsigned
dm-verity devices could be created, which would weaken the security
model even though the BPF program would correctly deny execution from
them.
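The decision logic described in this section can be modelled in a few lines of userspace Python. This is only a sketch of the policy, not the BPF program: the class and method names are hypothetical, and using 0 to mean "initramfs trust cleared" is an assumption.

```python
import os

class VerityExecPolicy:
    """Deny-default execution policy keyed on superblock device numbers."""

    def __init__(self, initramfs_s_dev=0):
        self.verity_devices = set()           # stands in for the BPF hash map
        self.initramfs_s_dev = initramfs_s_dev

    def bdev_setintegrity(self, dev, signature_valid):
        # Only devices with a valid signature are recorded as trusted.
        if signature_valid:
            self.verity_devices.add(dev)

    def bdev_free_security(self, dev):
        self.verity_devices.discard(dev)      # device torn down

    def switch_root(self):
        self.initramfs_s_dev = 0              # close the early-boot window

    def may_execute(self, s_dev):
        if self.initramfs_s_dev and s_dev == self.initramfs_s_dev:
            return True
        # Anything not in the map (tmpfs, procfs, unsigned verity) is denied.
        return s_dev in self.verity_devices

policy = VerityExecPolicy(initramfs_s_dev=os.makedev(0, 2))
policy.bdev_setintegrity(os.makedev(253, 0), signature_valid=True)
policy.bdev_setintegrity(os.makedev(253, 1), signature_valid=False)
print(policy.may_execute(os.makedev(253, 0)), policy.may_execute(os.makedev(253, 1)))
```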
### Surviving daemon-reexec
The BPF programs and their verity device map must survive PID1
re-execution (daemon-reexec, switch_root, soft-reboot). Without
preservation, `manager_free()` would destroy the skeleton, the link FDs
would close, programs would detach, and the map would be freed. After
exec, a fresh skeleton would have an empty map — but existing dm-verity
devices have already signaled their integrity and won't do so again. A
deny-default policy plus an empty map means all execution is denied and
the system is bricked.
We solve this by serializing the raw BPF link FDs and the `.bss` map FD
across exec using systemd's existing `serialize_fd` / `fdset_cloexec` /
`deserialize_fd` infrastructure. The kernel reference chain (link FD →
`struct bpf_link` → `struct bpf_prog` → `struct bpf_map`) keeps programs
attached and map data intact as long as the dup'd FDs survive. After
exec, PID1 detects the deserialized FDs and skips skeleton re-creation
entirely. If switching root, it uses the deserialized `.bss` map FD to
clear `initramfs_s_dev` via a targeted `mmap()` write, preserving the
other guard globals in `.bss`.
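The general idea of keeping a kernel object alive by carrying its FD across exec can be illustrated with a plain file in place of BPF objects. The sketch below is a Python analogy, not systemd's `serialize_fd` code: the `verity-map-fd` key and the `name=fd` line format are hypothetical.

```python
import subprocess
import sys
import tempfile

# Stand-in for the re-executed process: parse "name=fd" lines, then use the fd.
CHILD = """
import os, sys
fds = dict(line.split("=", 1) for line in sys.argv[1].splitlines())
fd = int(fds["verity-map-fd"])      # hypothetical serialization key
os.lseek(fd, 0, os.SEEK_SET)
sys.stdout.write(os.read(fd, 4096).decode())
"""

def reexec_with_state(state):
    """Hand an anonymous file holding `state` to a freshly exec'd child."""
    # The temporary file has no name on disk; only the open fd keeps it alive.
    with tempfile.TemporaryFile() as f:
        f.write(state)
        f.flush()
        fd = f.fileno()
        serialized = f"verity-map-fd={fd}"
        # pass_fds leaves this fd open, at the same number, across the exec.
        result = subprocess.run(
            [sys.executable, "-c", CHILD, serialized],
            pass_fds=(fd,), check=True, capture_output=True, text=True,
        )
        return result.stdout
```

As with the link and map FDs, nothing here is reachable by name from another process; the object exists only as long as some process holds the descriptor.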
We intentionally avoid bpffs pinning. Pinned objects are discoverable
and manipulable by any process with sufficient privileges
(`BPF_OBJ_GET`, unlink). FD serialization keeps everything private to
PID1 with no external attack surface.
### Self-protection
BPF LSM programs attached via the tracing trampoline (`BPF_LSM_MAC`) are
inherently tamper-resistant — `bpf_tracing_link_lops` has no
`.update_prog` and no `.detach` callbacks, so the kernel rejects
`BPF_LINK_UPDATE` with `-EINVAL` and `BPF_LINK_DETACH` with
`-EOPNOTSUPP`. Once attached, our programs cannot be modified or
detached through the `bpf()` syscall.
The remaining attack vector is map injection: `BPF_MAP_GET_FD_BY_ID` to
obtain an FD to `verity_devices`, then `BPF_MAP_UPDATE_ELEM` to insert a
fake trusted device. The self-protection guard blocks this with three
hooks. `lsm/bpf_map` fires inside `bpf_map_new_fd()`, the chokepoint for
all code paths that produce a map FD, and denies access to our map IDs
from any process other than PID1 (identified via `tgid == 1`, which is
unspoofable — `bpf_get_current_pid_tgid()` reads `current->tgid` from
`pid->numbers[0].nr`, the init-namespace PID). `lsm/bpf_prog` provides
analogous protection for program FDs as defense-in-depth. `lsm/bpf`
handles `BPF_LINK_GET_FD_BY_ID` at the command level since there is no
`security_bpf_link()` hook in the kernel.
The guard starts inactive — all protected IDs default to 0 in `.bss`,
and no real BPF object has ID 0 — so there is no window where it
interferes with PID1's own setup. After attaching all programs, PID1
queries the kernel-assigned IDs via `bpf_obj_get_info_by_fd()` and
writes them into the guard's globals. From that point on, the guard is
active. The guard has zero collateral damage: it only denies access to
our specific object IDs, leaving bpftrace, bpftool,
`RestrictFileSystems=`, and all other BPF usage completely unaffected.
Additionally, a ptrace guard (`lsm/ptrace_access_check`) blocks
`PTRACE_MODE_ATTACH` to PID1 from other processes, preventing extraction
of sensitive state from PID1's address space via ptrace, `/proc/1/mem`,
`process_vm_readv()`, or `pidfd_getfd()`. `PTRACE_MODE_READ` is allowed
so that monitoring tools and `systemctl` continue to work normally.
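A small model of the two guard decisions, assuming the semantics described above. It is a Python sketch rather than the BPF hooks: the class and function names are hypothetical, and the `PTRACE_MODE_*` values are those from the kernel's `include/linux/ptrace.h`.

```python
PTRACE_MODE_READ = 0x01
PTRACE_MODE_ATTACH = 0x02

class BpfObjectGuard:
    """Deny non-PID1 access to a fixed set of protected BPF object IDs."""

    def __init__(self, slots=3):
        # All IDs start at 0; no real BPF object has ID 0, so the guard
        # is inert until activate() is called.
        self.protected_ids = [0] * slots

    def activate(self, ids):
        self.protected_ids = list(ids)

    def allow(self, tgid, obj_id):
        if tgid == 1:
            return True
        return not (obj_id != 0 and obj_id in self.protected_ids)

def ptrace_allowed(tracer_tgid, target_tgid, mode):
    """Block ATTACH-level access to PID 1 from any other process."""
    if target_tgid == 1 and tracer_tgid != 1 and mode & PTRACE_MODE_ATTACH:
        return False
    return True
```

Unrelated object IDs always pass `allow()`, which is the "zero collateral damage" property: other BPF users never see a denial.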
### Limitations
- The enforcement hooks resolve trust by looking at
`file->f_inode->i_sb->s_dev` — the device number of the superblock that
owns the file's inode. This works correctly for files directly on a
dm-verity block device, but it does not see through overlayfs. When a
file is accessed on an overlay mount, `f_inode` points to the overlay
inode, and `i_sb->s_dev` is the overlay superblock's anonymous device
number — not the underlying dm-verity device. The overlay superblock has
no backing block device, so the lookup misses in the verity map and
execution is denied by the default policy.
This means that overlayfs mounts whose lower layers are on
dm-verity-protected volumes will currently have execution blocked, even
though the actual data is integrity-protected. The correct fix requires
a kernel extension that allows the BPF program to call something like
`d_real_inode()` to resolve through the overlay to the real inode on the
underlying filesystem, and then check that inode's superblock device
number against the verity map. I plan to add a BPF kfunc exposing this
functionality in a follow-up kernel series.
- Multi-device filesystems such as btrfs use entirely synthetic device
numbers and there is no way to reach the actual device backing the inode
from the inode itself. So `RestrictFileSystemAccess=` only works
reliably with a subset of filesystems. In practice this isn't a problem
because the feature is tailored to erofs; using it on arbitrary
filesystems requires careful vetting of the actual filesystem behaviour.
- The initial implementation also blocks JIT-style execution that relies
on mapping memory as executable. This is part of `exec` semantics today
and can be loosened later by introducing finer-grained values. This is a
common pattern in systemd: `ProtectSystem=` started as a boolean and
later grew `auto`/`yes`/`full`/`strict` semantics.
- The configuration is a system-wide setting with no per-unit opt-out.
This is intentional for the initial implementation: a global invariant
is easier to reason about and harder to accidentally weaken. Per-unit
relaxation can be added later if a concrete need arises.
### Testing
The series includes unit tests and integration tests covering both the
core enforcement logic and the self-protection guard. The unit test
loads the skeleton, attaches programs, populates guard globals, and
verifies that protected IDs are set correctly. The integration tests
exercise the guard by attempting `BPF_MAP_GET_FD_BY_ID` and
`BPF_PROG_GET_FD_BY_ID` from a non-PID1 process and verifying that
access is denied.
What we cannot currently test end-to-end is actual execution enforcement
against a dm-verity-signed root filesystem. The systemd test suite does
not yet have infrastructure for booting a VM with a signed dm-verity
rootfs image — the existing mkosi-based test framework lacks the ability
to produce and boot such images. This will hopefully change soon when
Daan integrates barrage into the test suite.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Daan De Meyer [Tue, 12 May 2026 07:41:01 +0000 (09:41 +0200)]
test: Modernize btrfs tests
Convert test-btrfs to use the test framework and
assertions, merge the physical offset test into it
and beef it up to include what TEST-83-BTRFS does and
finally get rid of TEST-83-BTRFS as it is unneeded now.
Daan De Meyer [Wed, 13 May 2026 11:06:35 +0000 (13:06 +0200)]
libc,shared: detect newer library symbols at runtime via weak references (#42065)
For libc syscall wrappers (pidfd_open, fsopen, openat2, etc.) we
previously gated the calls behind build-time HAVE_* checks. Replace
these with weak external references, falling back to the raw syscall at
runtime when the loaded glibc lacks the symbol. Drop the corresponding
cc.has_function() loop from meson.build and disable -Wredundant-decls /
readability-redundant-declaration for src/libc/ via meson c_args and a
local .clang-tidy.
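The change itself uses C weak references; the same "detect the symbol at runtime, otherwise fall back to the raw syscall" idea can be approximated in Python with ctypes. This is an analogy only: `resolve()` and `raw_pidfd_open()` are hypothetical helpers, and the syscall number 434 is the x86-64 / asm-generic value, an assumption for other ABIs.

```python
import ctypes
import os

SYS_PIDFD_OPEN = 434  # x86-64 and asm-generic number; assumption elsewhere

# dlopen(NULL): symbols of the already loaded C library.
libc = ctypes.CDLL(None, use_errno=True)

def resolve(lib, name, fallback):
    """Return lib's symbol `name` if it is exported, else `fallback`."""
    try:
        return getattr(lib, name)
    except AttributeError:      # symbol absent in the loaded library
        return fallback

def raw_pidfd_open(pid, flags):
    """Fallback path: invoke the syscall directly through syscall(2)."""
    return libc.syscall(ctypes.c_long(SYS_PIDFD_OPEN),
                        ctypes.c_long(pid), ctypes.c_long(flags))

# Decided once at runtime, not at build time.
pidfd_open = resolve(libc, "pidfd_open", raw_pidfd_open)
print("using libc wrapper" if pidfd_open is not raw_pidfd_open
      else "using raw syscall fallback")
```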
For optional libraries (libcryptsetup, libdw, libarchive), drop the
per-symbol HAVE_* checks. Always declare the prototypes, suppressing the
redundant-decl warnings via DISABLE_WARNING_REDUNDANT_DECLS and NOLINT,
and resolve the symbols after the main dlopen via a new DLSYM_OPTIONAL()
helper that only assigns on success. libarchive's *_is_set wrappers now
use fallback functions as their pointer initializers, so call sites
never need to NULL-check.
The same treatment applies to pidfd_spawn / posix_spawnattr_setcgroup_np
in process-util.c and epoll_pwait2 in sd-event.c. coredump-config and
coredump-submit get a dlopen_dw_has_dwfl_set_sysroot() helper. The kexec
arch gate now uses defined(__NR_kexec_file_load) directly; pidfd.h uses
__has_include_next() to decide whether to pull in glibc's header.
This lets binaries built against newer glibc / libcryptsetup / libdw /
libarchive headers still load and run on older targets where these
symbols are absent.