Ivan Kruglov [Fri, 15 May 2026 14:01:43 +0000 (07:01 -0700)]
test: split TEST-74-AUX-UTILS.varlinkctl.sh into per-interface subtests
Split the monolithic varlinkctl test script into separate files per varlink interface for better organization and easier maintenance:
- varlinkctl.sh: core varlinkctl tool tests (CLI, transports, socket discovery, upgrade/serve) and io.systemd.Manager
- varlinkctl-network.sh: io.systemd.Network
- varlinkctl-unit.sh: io.systemd.Unit (system + user manager)
- varlinkctl-metrics.sh: io.systemd.Metrics
No functional changes — the test content is moved as-is.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
Ivan Kruglov [Fri, 15 May 2026 10:16:00 +0000 (03:16 -0700)]
json-util: add json_dispatch_job_id() dispatcher for job IDs
Job IDs are uint32_t values that are always >= 1 (the manager's ID counter starts at 1 and wraps from UINT32_MAX back to 1, never assigning 0). Add a dedicated dispatch function that validates this constraint, rejecting 0 and treating null as "unset" (0).
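A minimal sketch of the validation rule described above, written in plain C so it stands alone. The `JsonNumber` type, both function names, and the choice of `-ERANGE` as the error code are assumptions made for illustration; they are not taken from the actual json_dispatch_job_id() implementation.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a parsed JSON value: either null or an unsigned number. */
typedef struct {
        bool is_null;
        uint64_t value;
} JsonNumber;

/* Validate a job ID: null means "unset" (stored as 0); anything else must fit
 * in uint32_t and be >= 1. Returns 0 on success, -ERANGE (assumed) otherwise. */
static int dispatch_job_id_sketch(const JsonNumber *v, uint32_t *ret) {
        if (v->is_null) {
                *ret = 0;
                return 0;
        }

        if (v->value < 1 || v->value > UINT32_MAX)
                return -ERANGE;

        *ret = (uint32_t) v->value;
        return 0;
}

/* ID counter as described: wraps from UINT32_MAX back to 1 and never yields 0. */
static uint32_t next_job_id_sketch(uint32_t current) {
        return current == UINT32_MAX ? 1 : current + 1;
}
```

Keeping 0 as the "unset" marker works only because the counter can never hand out 0, which is why the dispatcher has to reject an explicit 0 from the caller.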
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
Ivan Kruglov [Thu, 14 May 2026 16:41:24 +0000 (09:41 -0700)]
shared: add exec_command_status_build_json() and ExecCommandStatus varlink type to common
Add exec_command_status_build_json() and exec_command_status_list_build_json() to varlink-common, alongside exec_command_build_json() and exec_command_list_build_json(). The status list function is the runtime counterpart of the command list function — the two arrays are positionally aligned so index N in the status array corresponds to index N in the command array. Commands that have not yet run produce null entries to preserve alignment.
Add the ExecCommandStatus varlink struct type to varlink-idl-common next to ExecCommand. It contains PID, timestamps, and mutually exclusive ExitStatus (int, for normal exit) / ExitSignal (string, for signal kill).
Luca Boccassi [Thu, 14 May 2026 20:06:15 +0000 (21:06 +0100)]
ci: switch SUSE mkosi mirror to cdn.o.o
The cdn mirror is preferred by SUSE for clouds/CIs. There have been issues with some
mirrors, which fail to download from GHA quite often lately, so hopefully this will
make it reliable again.
Philip Withnall [Sun, 3 May 2026 21:36:32 +0000 (22:36 +0100)]
test: Add a sysupdate test for files which are a prefix match of each other
This tests whether the pattern matching code checks it’s matched the
whole string and not just a prefix (see commit 4ffb60319b).
In particular, this tests a setup which KDE currently use in their
sysupdate images, where two regular file transfers are done, one of a
`foo.erofs` file, and the other `foo.erofs.caibx`. As one is a prefix of
the other, they were hitting this bug.
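The class of bug being tested can be illustrated with a literal comparison. This is a simplified sketch: sysupdate matches names against patterns rather than literals, and both function names here are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

/* Buggy variant: accepts any name that merely starts with the expected text. */
static bool match_prefix_only(const char *name, const char *expected) {
        return strncmp(name, expected, strlen(expected)) == 0;
}

/* Fixed variant: the match has to consume the whole name. */
static bool match_whole_string(const char *name, const char *expected) {
        size_t n = strlen(expected);

        return strncmp(name, expected, n) == 0 && name[n] == '\0';
}
```

With the prefix-only check, `foo.erofs.caibx` is accepted where only `foo.erofs` was meant, so two transfers can claim the same file.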
When the periodic RA timer fires, any error returned by sendmsg()
currently propagates up through sd_radv_send() into radv_timeout(),
which then calls sd_radv_stop(). The RA engine is never started again
until the next carrier transition.
On an 802.3ad bond there is a window right after carrier-up where the
link is administratively up but no aggregator has been selected yet, so
sendmsg() returns ENOBUFS. If the very first RA after a flap lands in
that window, radv stops permanently and all clients lose their SLAAC
addresses, on-link/PD prefixes, and default router once the previously
advertised lifetimes expire, while IPv4 keeps working, leading to a very
confusing situation with v4 up and v6 down.
Handle this the same way solicited RAs already do (see
radv_process_packet()): log the failure and reschedule the timer instead
of giving up. ra_sent is left untouched on failure so we stay in the
fast initial-advertisement regime until a send actually succeeds.
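The control flow after the change can be modelled roughly as follows. This is a sketch only: `RadvSketch` and `radv_timeout_sketch()` are hypothetical names, and the real timer handling in sd-radv is more involved.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef struct {
        unsigned ra_sent;  /* number of advertisements sent successfully */
        bool running;      /* whether the RA engine is still active */
        unsigned rearmed;  /* how many times the periodic timer was rescheduled */
} RadvSketch;

/* Timer callback model. 'send_result' is what the send attempt returned:
 * 0 on success, a negative errno on failure. */
static void radv_timeout_sketch(RadvSketch *ra, int send_result) {
        if (send_result < 0)
                /* Previously this path stopped the engine. Now: log and carry on. */
                fprintf(stderr, "Failed to send router advertisement, will retry: %s\n",
                        strerror(-send_result));
        else
                ra->ra_sent++;

        /* Reschedule in both cases. Since ra_sent only grows on success, a failed
         * send keeps us in the fast initial-advertisement regime. */
        ra->rearmed++;
}
```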
network: honour static IPv6LL addresses in network_adjust_*()
link_radv_enabled() and link_ndisc_enabled() use
link_ipv6ll_enabled_harder(), which considers a static fe80:: address
in [Address] sufficient to run radv/ndisc even when LinkLocalAddressing=
(or IPv6LinkLocalAddressGenerationMode=none, which network_verify()
folds into the same flag) disables the kernel-generated link-local.
network_adjust_radv()/ndisc()/dhcp() however only check the raw
link_local flag and zero router_prefix_delegation / ndisc / dhcp&IPV6
at parse time, so the runtime gate never gets a chance to fire.
Factor the static-LL lookup out of link_ipv6ll_enabled_harder() into a
Network-level helper and use it in the three network_adjust_*()
functions, bringing parse-time and runtime behaviour in line.
我超厉害 [Thu, 14 May 2026 17:51:29 +0000 (01:51 +0800)]
sd-device: use ERRNO_IS_NEG_DEVICE_ABSENT() for device-id load failures (#41764)
Device enumeration may encounter transient errors such as ENXIO when devices
appear or disappear concurrently. These conditions represent expected "device absent"
races and should be treated uniformly across the enumeration logic.
This change replaces the ENODEV-specific check with ERRNO_IS_NEG_DEVICE_ABSENT(),
ensuring that all expected disappearance conditions are handled consistently.
Unexpected errors are still propagated, while expected races are ignored without
aborting the enumeration.
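A rough sketch of the resulting error handling. The helper below is a stand-in modelled on ERRNO_IS_NEG_DEVICE_ABSENT(); the exact set of errno values it checks is an assumption, and `enumerate_one_sketch()` is a hypothetical name.

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-in for the macro named above. The errno set is assumed, not copied. */
static bool errno_is_neg_device_absent_sketch(int r) {
        return r == -ENODEV || r == -ENXIO || r == -ENOENT;
}

/* Handle the result of loading one device during enumeration.
 * Returns 0 if the device was added or skipped, a negative errno on real failures. */
static int enumerate_one_sketch(int load_result, unsigned *n_added) {
        if (load_result < 0) {
                if (errno_is_neg_device_absent_sketch(load_result))
                        return 0;       /* device vanished under us: ignore, keep going */

                return load_result;     /* unexpected error: propagate */
        }

        (*n_added)++;
        return 0;
}
```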
Yu Watanabe [Thu, 14 May 2026 17:47:23 +0000 (02:47 +0900)]
A few more conversions of options and verbs (#41795)
I had those prepared before but I didn't submit them because the
automatic layout didn't work well. In two cases now the sync of widths
between verbs and options is disabled and one case is left with the
automatic alignment. I think it's good enough to merge.
Ivan Kruglov [Thu, 14 May 2026 16:41:09 +0000 (09:41 -0700)]
core: move service_context_build_json() to varlink-service.c
Move the existing (partial) service context builder from varlink-unit.c into its own varlink-service.c file, following the pattern established by other unit type context builders (varlink-path.c, varlink-scope.c, etc.). No functional change.
Yu Watanabe [Thu, 14 May 2026 15:38:19 +0000 (00:38 +0900)]
meson: don't use Python module for host Python (#41959)
Checking for pefile required that module to be made available for the
Python used to build systemd, even though it's only used at runtime,
potentially via a different Python installation.
Furthermore, Meson's Python module doesn't do the right thing when cross
compiling and looking up a Python for the host system, so this would end
up uselessly checking whether the build Python had the pefile module,
which is not needed. Even if it were made to check the host Python using
find_program, it still relies on being able to run its Python, which in
a cross scenario it probably wouldn't be able to do.
All in all, this check does more harm than good, and prevents building
ukify in valid configurations, so remove it.
noxiouz [Tue, 17 Mar 2026 23:55:51 +0000 (23:55 +0000)]
coredump: add JSON output support to coredumpctl info
Implement support for the --json= flag in the info subcommand
(issue #38844). Previously, coredumpctl info always produced
human-readable text output regardless of --json=.
Add a CoredumpFields struct that holds all journal fields extracted
for a coredump entry, along with coredump_fields_done() to release
member resources and coredump_fields_load() to populate the struct
from a journal entry. Both print_info() and the new print_info_json()
use this shared loader, eliminating the duplicate RETRIEVE loop.
print_info_json() builds a JSON object with the same fields shown by
print_info(). Missing fields are omitted via SD_JSON_BUILD_PAIR_CONDITION,
matching the tolerant behavior of print_info() rather than skipping the
entry entirely. Signal/Reason handling mirrors print_info(): normal
coredumps (MESSAGE_ID == SD_MESSAGE_COREDUMP_STR) emit a numeric Signal
field; non-normal entries (kernel oops, etc.) emit a Reason field with
the raw text from COREDUMP_SIGNAL.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
Daan De Meyer [Tue, 12 May 2026 13:03:49 +0000 (13:03 +0000)]
nspawn: split boot parameters into env vars and argv
When the kernel hands the command line to PID 1, any KEY=VALUE assignment
whose KEY does not contain a '.' is exported as an environment variable
(with '-' replaced by '_') rather than passed as an argument. Mimic the
same split in --boot mode so kernel-cmdline-style arguments passed after
the container path behave as they would on a real boot.
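The split rule can be sketched as a small classifier. The function name is hypothetical, and the sketch assumes that only the KEY part has '-' rewritten to '_', which the description above does not spell out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true and fills 'env' if 'arg' should become an environment variable,
 * false if it should stay in argv. */
static bool boot_param_to_env_sketch(const char *arg, char *env, size_t env_size) {
        const char *eq = strchr(arg, '=');

        if (!eq || eq == arg)
                return false;   /* not a KEY=VALUE assignment */

        size_t key_len = (size_t) (eq - arg);

        if (memchr(arg, '.', key_len))
                return false;   /* KEY contains '.': keep as an argument */

        if (strlen(arg) + 1 > env_size)
                return false;   /* does not fit: left in argv by this sketch */

        strcpy(env, arg);
        for (size_t i = 0; i < key_len; i++)
                if (env[i] == '-')
                        env[i] = '_';   /* assumption: only KEY is rewritten */

        return true;
}
```

Under these assumptions `my-var=a-b` becomes the environment entry `my_var=a-b`, while `systemd.unit=rescue.target` and `quiet` remain arguments.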
various: fix duplicated logging from parse_path_argument
As pointed out in review, parse_path_argument can fail for non-oom reasons.
But the function already logs, so the correct thing to do is to just
propagate the error.
busctl: convert to the new option and verb parsers
The conversion doesn't work great, because some of the verbs take many
arguments and the first column is extremely wide. So similarly to
kernel-install, I dropped the sync of column widths. This allows the
help for options to use most of the available space.
localectl: convert to the new option and verb parsers
The verb synopses are long, so they got broken up:
===============================================================================
> localectl [OPTIONS…] COMMAND …
Query or change system locale and keyboard settings.
Commands:
  [status]                      Show current locale settings
  set-locale LOCALE...          Set system locale
  list-locales                  Show known locales
  set-keymap MAP [MAP]          Set console and X11 keyboard mappings
  list-keymaps                  Show known virtual console keyboard mappings
  set-x11-keymap LAYOUT [MODEL  Set X11 and console keyboard mappings
          [VARIANT [OPTIONS]]]
  list-x11-keymap-models        Show known X11 keyboard mapping models
  list-x11-keymap-layouts       Show known X11 keyboard mapping layouts
  list-x11-keymap-variants      Show known X11 keyboard mapping variants
          [LAYOUT]
  list-x11-keymap-options       Show known X11 keyboard mapping options
Options:
  -h --help                     Show this help
     --version                  Show package version
  -l --full                     Do not ellipsize output
     --no-pager                 Do not start a pager
     --no-ask-password          Do not prompt for password
  -H --host=[USER@]HOST         Operate on remote host
  -M --machine=CONTAINER        Operate on local container
     --no-convert               Don't convert keyboard mappings
See the localectl(1) man page for details.
===============================================================================
But I think this is OK. Everything is readable. On a more normal terminal,
everything fits nicely.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
kernel-install: convert to the new option and verb parsers
The verb synopses are very long because of the many parameters.
Previously they were shown without help and occupied all available columns.
With the autogenerated help format, this doesn't work great. So the
verbs and options tables are not synced, so that help for options can
use more columns. I think in this case this is better than the
alternatives.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
Luca Boccassi [Thu, 14 May 2026 12:10:13 +0000 (13:10 +0100)]
cgroup: Add CPUSetPartition= setting (#42013)
Add support for configuring cpuset partition type via the
CPUSetPartition= unit file setting. This controls the kernel's
cpuset.cpus.partition cgroup attribute.
The setting takes one of "member", "root", or "isolated". This is
useful for real-time workloads that require dedicated CPU resources
without interference from other processes.
When set, systemd will write the partition type to the
cpuset.cpus.partition cgroup file. If the kernel rejects the value
(e.g., due to partition hierarchy rules), a warning is logged and the
unit continues with the kernel's default partition type.
Co-developed-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
shared/verbs: split verbs in two lines when the synopsis is > 25 characters
The help tests would not pass because in cases where the verb synopsis
is very long, we'd format the table badly if the terminal is fairly
narrow. I experimented with a few solutions, but overall, it's hard to
achieve very good layout with the automatic formatting. I think the
approach in this commit works the best: we end up with a two- or
three-line verb synopsis, which is similar to what we did manually
before.
$ COLUMNS=80 build/localectl -h
...
Commands:
  [status]                      Show current locale settings
  set-locale LOCALE...          Set system locale
  list-locales                  Show known locales
  set-keymap MAP [MAP]          Set console and X11 keyboard mappings
  list-keymaps                  Show known virtual console keyboard mappings
  set-x11-keymap LAYOUT [MODEL  Set X11 and console keyboard mappings
          [VARIANT [OPTIONS]]]
  list-x11-keymap-models        Show known X11 keyboard mapping models
  list-x11-keymap-layouts       Show known X11 keyboard mapping layouts
  list-x11-keymap-variants      Show known X11 keyboard mapping variants
          [LAYOUT]
  list-x11-keymap-options       Show known X11 keyboard mapping options
I think that almost nobody actually uses an 80 column terminal, and if
they do, they probably don't spend too much time looking at our --help
output there. So the goal here is to do something reasonable and robust
and get the tests to pass.
We can use strjoina here because the strings are fully under our
control.
fuzz-systemctl-parse-argv: add two corpus files to test compat parsers
Looking at the corpus examples, I'm not sure the fuzzer even went into
the compat parsers. None of the files have argv[0] that'd cause
invoked_as() to go into the compat paths. So add the files to provide
a quick test and possibly bias the fuzzer search into the right
direction.
shared/options: implement the equivalent of 'opterr'
All log messages during option parsing are emitted using log_full,
and the level is set as LOG_ERR + state->log_level_shift. The default
shift is 0, but if set to e.g. 4, we log at LOG_DEBUG, and if set
to 5 or higher, logging is effectively suppressed. (Unless compiled
with LOG_TRACE, when it'd be suppressed if the shift is set to 6
or higher.) So this gives something like 'opterr', except
without global state and potentially more flexible.
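The level arithmetic can be checked with a few lines of C. This sketch uses the standard syslog priorities (LOG_ERR is 3, LOG_DEBUG is 7). Both function names are hypothetical, and treating the trace level as LOG_DEBUG + 1 is an assumption.

```c
#include <stdbool.h>
#include <syslog.h>

/* Level at which option-parsing errors would be logged for a given shift. */
static int shifted_level_sketch(int shift) {
        return LOG_ERR + shift;
}

/* A message is emitted only if its level does not exceed the maximum level
 * the logger accepts (numerically larger means less important). */
static bool would_log_sketch(int shift, int max_level) {
        return shifted_level_sketch(shift) <= max_level;
}
```

With a maximum of LOG_DEBUG, a shift of 4 still logs (3 + 4 = 7) and a shift of 5 does not. With a maximum one step above LOG_DEBUG, the cutoff moves to a shift of 6.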
systemctl_main() is moved to systemctl.c to allow fuzz-systemctl-parse-argv
to compile. It needs systemctl_help(), which needs the verb table, with the
expected groups. Once we provide that, the linker needs all the verb_*
functions. So add dummy implementations in fuzz-systemctl-parse-argv to
allow the link to happen.
The alternative would be to provide an empty option table, but that
seems to be more complicated. With the real verb table we can also simulate
parsing of the whole command line with the full verb set, so it seems better
to test with that.
The verbs[] table still lives in systemctl-main.c — only the option parsing
side is migrated. systemctl_dispatch_parse_argv() gains a remaining_args
out-param so run() can pass the parsed positional args to systemctl_main(),
which dispatches via _dispatch_verb_with_args() instead of dispatch_verb().
The Options section of --help now renders from the OPTION declarations; the
verb sections still use raw printfs and will be converted alongside the
verbs[] migration.
Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>
systemctl: reorder cases in parse_argv() to match order in --help
Compatibility-only options (--fail, --irreversible, --ignore-dependencies,
--no-legend) are grouped at the end alongside the '.' / '?' error handlers.
The case 'P': … _fallthrough_; case 'p': pair is kept intact and placed at
-p's slot in --help, so -P sits immediately before -p in the source.
Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>
systemctl: split out helper for --state= and allow resetting
So far we'd reject --state=, but it seems nicer to make it reset the
setting as we generally do. The output variable is modified in place…
Option parsing isn't atomic anyway, so I think it's fine to do that.
glemco [Sun, 10 May 2026 09:48:27 +0000 (11:48 +0200)]
cgroup: Add CPUSetPartition= setting
Add support for configuring cpuset partition type via the
CPUSetPartition= unit file setting. This controls the kernel's
cpuset.cpus.partition cgroup attribute.
The setting takes one of "member", "root", or "isolated". This is
useful for real-time workloads that require dedicated CPU resources
without interference from other processes.
When set, systemd will write the partition type to the
cpuset.cpus.partition cgroup file. If the kernel rejects the value
(e.g., due to partition hierarchy rules), a warning is logged and the
unit continues with the kernel's default partition type.
Co-developed-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
vmspawn: multifunction-pack pcie-root-ports on pcie.0 (#42077)
The pre-allocated pcie-root-port block in run_virtual_machine() places
every port directly on pcie.0 with an auto-assigned PCI address. A
minimal VM already costs 4 builtin + 10 hotplug spares = 14 pcie.0
slots, on top of 3 implicit virtio devices (virtio-rng-pci,
virtio-balloon, virtio-serial-pci) for another 3.
pcie.0 has 32 device-numbers; q35 reserves 0x00 (host bridge) and 0x1f
(ICH9 LPC), leaving ~30 auto-assignable slots. TEST-64-UDEV-STORAGE-
nvme_basic pushes 20 '-device nvme' lines through
$SYSTEMD_VMSPAWN_QEMU_EXTRA, which vmspawn does not see — total demand
14 + 3 + 20 = 37 > 30. Bus realization fails after QEMU's chardev has
already emitted the QMP greeting, and the monitor socket POLLHUPs while
we are mid-feature-probe, reported as 'QMP connection dropped during
feature probing'.
Pack the root ports as multifunction devices, 8 per pcie.0 device-
number (QEMU docs/pcie.txt:84, 117-120, 255-258). Function 0 of each
group carries multifunction=on; functions 1-7 ride the same slot via
addr=N.F. Each function remains independently hot-pluggable so vmspawn's
QMP device_add machinery is unaffected. 14 ports collapse to 2 pcie.0
slots; the nvme_basic budget becomes 2 + 3 + 20 = 25.
The chassis/slot properties (used for ACPI hotplug identity) stay as i+1
— they live in a uint8_t namespace independent of the PCI BDF and are
still unique. Base PCI slot 0x10 sits above the auto-assigned virtio
devices (which land at 0x01-0x03 in config order) and below the q35 LPC
reservation at 0x1f.
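The packing arithmetic can be sketched as below. All names are hypothetical, and the textual `addr=` format produced by `format_addr_sketch()` is an assumption about how the value might be written; only the base slot 0x10, the 8 functions per device-number, and the i+1 chassis/slot rule come from the description above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define BASE_SLOT 0x10u      /* base PCI slot named in the text */
#define FUNCS_PER_SLOT 8u    /* functions 0-7 share one device-number */

typedef struct {
        unsigned slot;       /* pcie.0 device-number */
        unsigned function;   /* 0..7 */
        bool multifunction;  /* set on function 0 of each group only */
        unsigned chassis;    /* i + 1, independent of the PCI address */
} PortAddrSketch;

/* Map the i-th root port to its packed PCI address. */
static PortAddrSketch port_addr_sketch(unsigned i) {
        return (PortAddrSketch) {
                .slot = BASE_SLOT + i / FUNCS_PER_SLOT,
                .function = i % FUNCS_PER_SLOT,
                .multifunction = i % FUNCS_PER_SLOT == 0,
                .chassis = i + 1,
        };
}

/* Number of pcie.0 device-numbers consumed by n packed ports. */
static unsigned slots_used_sketch(unsigned n_ports) {
        return (n_ports + FUNCS_PER_SLOT - 1) / FUNCS_PER_SLOT;
}

/* Format the N.F address value, e.g. "0x11.5" (format assumed). */
static int format_addr_sketch(PortAddrSketch a, char *buf, size_t size) {
        return snprintf(buf, size, "0x%x.%u", a.slot, a.function);
}
```

For the 14 ports of a minimal VM this yields device-numbers 0x10 (functions 0-7) and 0x11 (functions 0-5), which is where the "14 ports collapse to 2 slots" figure comes from.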
While here, rebuild the slot-count formula to match what
assign_pcie_ports() actually allocates. The +1 'SCSI controller' term
was bogus — virtio-scsi-pci comes from the hotplug-spares pool via
hotplug_port_owner[] in vmspawn-qmp.c, never from a builtin port (see
the comment in assign_pcie_ports()). The +1 'network' and +1 'vsock'
terms are now conditional on arg_network_stack and use_vsock. Bind
volumes were missing entirely. And the per-drive accounting now mirrors
assign_pcie_ports()'s skip-SCSI behaviour: non-SCSI drives (root +
extras + bind volumes) take one builtin port each, SCSI drives take none
— they share a controller drawn from the hotplug pool at device-add
time. Tighten the cap from UINT8_MAX to 192 (24 packed device-numbers ×
8) so we cannot claim more than 24 slots on pcie.0 regardless of how
many extras/runtime-mounts a caller asks for.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Add a helper that tries to determine the number of installed CPUs. This
borrows heavily from physical_memory(), i.e. uses the physical number,
but caps by per-container cpuset.
Daan De Meyer [Wed, 13 May 2026 10:21:10 +0000 (12:21 +0200)]
nsresourced: re-link GID delegation file after atomic UID file write
userns_registry_remove() restores a sub-delegated UID range by writing
the previous owner's data to u<UID>.delegate with WRITE_STRING_FILE_ATOMIC.
Atomic writes go via a temp file and rename, which replaces the directory
entry with a fresh inode and severs the hardlink to g<GID>.delegate. The
stale GID side then keeps pointing at the prior inode with outdated owner
and ancestor data, so subsequent lookups via GID return wrong results.
Re-create the hardlink after the atomic write so the two views stay in
sync, matching what userns_registry_store() already does after writing
a new delegation.
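The effect is easy to reproduce with plain POSIX calls. The sketch below is not the nsresourced code: the helper names are hypothetical, and replacing the alias with unlink() followed by link() is an assumption about one way to restore the hardlink.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Replace 'path' atomically: write a temporary file next to it, then rename()
 * over it. The rename gives 'path' a fresh inode, so existing hardlinks to the
 * old inode keep the old content. */
static int write_atomic_sketch(const char *path, const char *data) {
        char tmp[4096];
        size_t len = strlen(data);
        int r = 0;

        int k = snprintf(tmp, sizeof tmp, "%s.tmp", path);
        if (k < 0 || (size_t) k >= sizeof tmp)
                return -ENAMETOOLONG;

        int fd = open(tmp, O_WRONLY|O_CREAT|O_TRUNC, 0644);
        if (fd < 0)
                return -errno;

        ssize_t n = write(fd, data, len);
        if (n < 0)
                r = -errno;
        else if ((size_t) n != len)
                r = -EIO;
        if (close(fd) < 0 && r == 0)
                r = -errno;
        if (r == 0 && rename(tmp, path) < 0)
                r = -errno;
        if (r < 0)
                unlink(tmp);
        return r;
}

/* Make 'alias' a hardlink to the current inode of 'path' again. */
static int relink_sketch(const char *path, const char *alias) {
        if (unlink(alias) < 0 && errno != ENOENT)
                return -errno;
        if (link(path, alias) < 0)
                return -errno;
        return 0;
}

/* Returns 1 if both paths refer to the same inode, 0 if not, negative errno on error. */
static int same_inode_sketch(const char *a, const char *b) {
        struct stat sa, sb;

        if (stat(a, &sa) < 0 || stat(b, &sb) < 0)
                return -errno;
        return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

After `write_atomic_sketch()` on the UID-side file, `same_inode_sketch()` reports 0 for the two names until `relink_sketch()` is called, which mirrors the stale GID view described above.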
Daan De Meyer [Wed, 13 May 2026 20:21:57 +0000 (22:21 +0200)]
blockdev-util: Drop name argument from BLKPG functions
We don't use it, the kernel ignores it, let's just drop
the argument. Saves callers from having to ensure the name
they pass in fits in the 64 char buffer.
vmspawn: multifunction-pack pcie-root-ports on pcie.0
The pre-allocated pcie-root-port block in run_virtual_machine() places
every port directly on pcie.0 with an auto-assigned PCI address. A
minimal VM already costs 4 builtin + 10 hotplug spares = 14 pcie.0
slots, on top of 3 implicit virtio devices (virtio-rng-pci,
virtio-balloon, virtio-serial-pci) for another 3.
pcie.0 has 32 device-numbers; q35 reserves 0x00 (host bridge) and 0x1f
(ICH9 LPC), leaving ~30 auto-assignable slots. TEST-64-UDEV-STORAGE-
nvme_basic pushes 20 '-device nvme' lines through
$SYSTEMD_VMSPAWN_QEMU_EXTRA, which vmspawn does not see — total demand
14 + 3 + 20 = 37 > 30. Bus realization fails after QEMU's chardev has
already emitted the QMP greeting, and the monitor socket POLLHUPs
while we are mid-feature-probe, reported as 'QMP connection dropped
during feature probing'.
Pack the root ports as multifunction devices, 8 per pcie.0 device-
number (QEMU docs/pcie.txt:84, 117-120, 255-258). Function 0 of each
group carries multifunction=on; functions 1-7 ride the same slot via
addr=N.F. Each function remains independently hot-pluggable so
vmspawn's QMP device_add machinery is unaffected. 14 ports collapse to
2 pcie.0 slots; the nvme_basic budget becomes 2 + 3 + 20 = 25.
The chassis/slot properties (used for ACPI hotplug identity) stay as
i+1 — they live in a uint8_t namespace independent of the PCI BDF and
are still unique. Base PCI slot 0x10 sits above the auto-assigned
virtio devices (which land at 0x01-0x03 in config order) and below
the q35 LPC reservation at 0x1f.
While here, rebuild the slot-count formula to match what
assign_pcie_ports() actually allocates. The +1 'SCSI controller' term
was bogus — virtio-scsi-pci comes from the hotplug-spares pool via
hotplug_port_owner[] in vmspawn-qmp.c, never from a builtin port (see
the comment in assign_pcie_ports()). The +1 'network' and +1 'vsock'
terms are now conditional on arg_network_stack and use_vsock. Bind
volumes were missing entirely. And the per-drive accounting now
mirrors assign_pcie_ports()'s skip-SCSI behaviour: non-SCSI drives
(root + extras + bind volumes) take one builtin port each, SCSI
drives take none — they share a controller drawn from the hotplug
pool at device-add time. Cap at 120 ports (15 device-numbers × 8) so
we cannot run off the end of the 5-bit PCI device-number space — the
usable range starting at 0x10 ends at 0x1e because ICH9 LPC sits at
0x1f.0 single-function, blocking the rest of that slot for
multifunction packing.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
core: when figuring out whether to create orphanage units, consult vtable instead of allowlist
As per https://github.com/systemd/systemd/pull/41986#pullrequestreview-4281939586
This also corrects the list of unit types a bit:
1. this removes the mount/automount unit type from the list, since for these types
we do not allow aliases/renaming anyway.
2. this adds socket + swap units to the list, since they can change
name, and for both of them we actually do fork off processes hence
track resources.
* 8b9ea8981e Install new files for upstream build
* b230cf0490 use dh-cruft to register & purge volatile files
* 8f9b9952e1 Install new files for upstream build
Luca Boccassi [Wed, 13 May 2026 17:31:27 +0000 (18:31 +0100)]
import: do not create foreign ns on cleanup if not needed
The user ns is only used if the appropriate flag is set, so avoid
creating it unless it is. This avoids a spurious EPERM error in
TEST-13-NSPAWN.machined that is confusing when debugging failures:
[ 34.054] systemd-importd[504]: (transfer18) Imported 92%.
[ 34.118] systemd-importd[504]: (transfer18) Failed to decode and write: Broken pipe
[ 34.119] systemd-importd[504]: (transfer18) Exiting.
[ 34.121] systemd-importd[504]: (transfer18) Failed to allocate transient user namespace: Operation not permitted
[ 34.121] systemd-importd[504]: Transfer process failed with exit code 1.
copy: retire splice() use for copying files on disk
Apparently splice() is quite problematic, hence just don't use it anymore.
It's also unnecessary these days: either copy_file_range() or sendfile()
typically works nowadays, so the splice() fallback doesn't give us much
anymore.
(At least I am not aware of a combo of fds where splice() would work
where neither copy_file_range() nor sendfile() would work).
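The remaining fallback chain might look roughly like this on Linux. This is a simplified sketch, not the code in systemd's copy helpers: the function name is hypothetical, and falling back to sendfile() on any copy_file_range() error is a simplification chosen for brevity.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stdbool.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy up to 'len' bytes from fd_in to fd_out, both at their current offsets.
 * Tries copy_file_range() first and switches to sendfile() if that fails;
 * there is no splice() fallback. Returns the number of bytes copied or a
 * negative errno. */
static ssize_t copy_bytes_sketch(int fd_in, int fd_out, size_t len) {
        size_t done = 0;
        bool try_cfr = true;

        while (done < len) {
                ssize_t n;

                if (try_cfr) {
                        n = copy_file_range(fd_in, NULL, fd_out, NULL, len - done, 0);
                        if (n < 0) {
                                /* Simplification: any failure switches to sendfile(). */
                                try_cfr = false;
                                continue;
                        }
                } else {
                        n = sendfile(fd_out, fd_in, NULL, len - done);
                        if (n < 0)
                                return -errno;
                }

                if (n == 0)
                        break;  /* end of input */

                done += (size_t) n;
        }

        return (ssize_t) done;
}
```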
This leaves one use of splice() in place, in
src/shared/socket-forward.c. We should probably kill that too, but
that'd require some reworking to use sendfile() I guess, and I am too
lazy for that right now. Moreover, in contrast to the other uses it's
probably even safe, since it uses an intermediary pipe always. But what
do I know...
This stuff is so useful, and should work out of the box I am sure. Given
that the metrics are only generated on request this shouldn't create any
additional burden by default.
Yes, this might enlarge reports a bit, if generated with everything on,
but we really should solve that at the report generation level, not at
the point where we make the metrics available.
Chris Down [Wed, 13 May 2026 12:25:08 +0000 (21:25 +0900)]
core: do not leak resources when handling stale alias state on reload (#41986)
The fix for the corrupted state when units become aliased on reload
leaks the now-aliased unit's resources, which become untracked and
essentially lost.
While fixing the state corruption is of course necessary, leaking
processes/etc. is not ideal for a system and service manager, so
instead attempt to keep track of them by creating stub units
on-the-fly.
This way resources are not leaked, there are clear indications of
where they moved, and all state can be tracked as expected.
RestrictFileSystemAccess= — dm-verity filesystem access enforcement via BPF LSM (#41340)
This series adds a new `RestrictFileSystemAccess=` setting in the
`[Manager]` section of `system.conf` that enforces a deny-default
execution policy: only binaries residing on signed dm-verity block
devices (and the initramfs during early boot) are permitted to execute.
Everything else — tmpfs, procfs, sysfs, anonymous executable mappings,
unsigned dm-verity devices — is denied.
The directive takes the values `no` (default), `exec` (lock down
execution), and accepts `yes` as an alias for `exec`. The name is
deliberately broader than what the initial values cover so the same
setting can grow to restrict other filesystem access categories in the
future (e.g. `any` to deny all access from untrusted filesystems, not
just execution).
### How it works
The BPF program is entirely self-contained; PID1 loads it and the kernel
does the rest. When dm-verity brings up a device, the kernel calls
`security_bdev_setintegrity()` twice during `verity_preresume()`: once
with the root hash and once with the signature validity status. Our
`lsm/bdev_setintegrity` hook captures the second call and records the
device number in a BPF hash map if the signature is valid. When a device
is torn down, `lsm/bdev_free_security` cleans up the map entry. No
userspace map population is needed at any point.
The enforcement side hooks `bprm_check_security` (execve), `mmap_file`
(PROT_EXEC mappings including shared libraries), and `file_mprotect`
(W→X transitions like JIT and libffi). Each hook resolves the file's
backing device via `file->f_inode->i_sb->s_dev` and looks it up in the
verity device map. For block-backed filesystems, `s_dev` equals
`s_bdev->bd_dev`, which avoids an extra pointer chase and NULL check on
`s_bdev` — non-block filesystems simply miss in the map and get denied
by the default policy.
During early boot the initramfs needs to be trusted as well, since it
runs before any dm-verity volume is mounted. PID1 writes the initramfs
superblock's device number into a BPF global before attaching the
programs, and clears it after `switch_root` to close the trust window.
As a prerequisite, PID1 also verifies that
`dm_verity.require_signatures=1` is active — without it, unsigned
dm-verity devices could be created, which would weaken the security
model even though the BPF program would correctly deny execution from
them.
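The deny-default decision can be modelled in ordinary userspace C to make the logic concrete. This is only a model, not BPF code: the fixed-size array stands in for the BPF hash map, every name except `initramfs_s_dev` is hypothetical, and `-EPERM` as the denial code is an assumption.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DEVICES 16

typedef struct {
        uint32_t verity_devices[MAX_DEVICES]; /* devices with a valid signature */
        size_t n_devices;
        uint32_t initramfs_s_dev;             /* 0 once the trust window is closed */
} PolicySketch;

/* Models the integrity hook: remember the device only if the signature is valid. */
static int policy_set_integrity(PolicySketch *p, uint32_t dev, bool signature_valid) {
        if (!signature_valid)
                return 0;

        for (size_t i = 0; i < p->n_devices; i++)
                if (p->verity_devices[i] == dev)
                        return 0;

        if (p->n_devices >= MAX_DEVICES)
                return -ENOSPC;

        p->verity_devices[p->n_devices++] = dev;
        return 0;
}

/* Models device teardown: forget the device. */
static void policy_free_device(PolicySketch *p, uint32_t dev) {
        for (size_t i = 0; i < p->n_devices; i++)
                if (p->verity_devices[i] == dev) {
                        p->verity_devices[i] = p->verity_devices[--p->n_devices];
                        return;
                }
}

/* Models the execve/mmap/mprotect hooks: 0 to allow, -EPERM (assumed) to deny. */
static int policy_check_exec(const PolicySketch *p, uint32_t s_dev) {
        if (p->initramfs_s_dev != 0 && s_dev == p->initramfs_s_dev)
                return 0;

        for (size_t i = 0; i < p->n_devices; i++)
                if (p->verity_devices[i] == s_dev)
                        return 0;

        return -EPERM; /* anything not known to be trusted is denied */
}
```

Clearing `initramfs_s_dev` after `switch_root` corresponds to setting the field back to 0, after which the initramfs device number is treated like any other unknown device.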
### Surviving daemon-reexec
The BPF programs and their verity device map must survive PID1
re-execution (daemon-reexec, switch_root, soft-reboot). Without
preservation, `manager_free()` would destroy the skeleton, the link FDs
would close, programs would detach, and the map would be freed. After
exec, a fresh skeleton would have an empty map — but existing dm-verity
devices have already signaled their integrity and won't do so again. A
deny-default policy plus an empty map means all execution is denied and the
system is bricked.
We solve this by serializing the raw BPF link FDs and the `.bss` map FD
across exec using systemd's existing `serialize_fd` / `fdset_cloexec` /
`deserialize_fd` infrastructure. The kernel reference chain (link FD →
`struct bpf_link` → `struct bpf_prog` → `struct bpf_map`) keeps programs
attached and map data intact as long as the dup'd FDs survive. After
exec, PID1 detects the deserialized FDs and skips skeleton re-creation
entirely. If switching root, it uses the deserialized `.bss` map FD to
clear `initramfs_s_dev` via a targeted `mmap()` write, preserving the
other guard globals in `.bss`.
We intentionally avoid bpffs pinning. Pinned objects are discoverable
and manipulable by any process with sufficient privileges
(`BPF_OBJ_GET`, unlink). FD serialization keeps everything private to
PID1 with no external attack surface.
### Self-protection
BPF LSM programs attached via the tracing trampoline (`BPF_LSM_MAC`) are
inherently tamper-resistant — `bpf_tracing_link_lops` has no
`.update_prog` and no `.detach` callbacks, so the kernel rejects
`BPF_LINK_UPDATE` with `-EINVAL` and `BPF_LINK_DETACH` with
`-EOPNOTSUPP`. Once attached, our programs cannot be modified or
detached through the `bpf()` syscall.
The remaining attack vector is map injection: `BPF_MAP_GET_FD_BY_ID` to
obtain an FD to `verity_devices`, then `BPF_MAP_UPDATE_ELEM` to insert a
fake trusted device. The self-protection guard blocks this with three
hooks. `lsm/bpf_map` fires inside `bpf_map_new_fd()`, the chokepoint for
all code paths that produce a map FD, and denies access to our map IDs
from any process other than PID1 (identified via `tgid == 1`, which is
unspoofable — `bpf_get_current_pid_tgid()` reads `current->tgid` from
`pid->numbers[0].nr`, the init-namespace PID). `lsm/bpf_prog` provides
analogous protection for program FDs as defense-in-depth. `lsm/bpf`
handles `BPF_LINK_GET_FD_BY_ID` at the command level since there is no
`security_bpf_link()` hook in the kernel.
The guard starts inactive — all protected IDs default to 0 in `.bss`,
and no real BPF object has ID 0 — so there is no window where it
interferes with PID1's own setup. After attaching all programs, PID1
queries the kernel-assigned IDs via `bpf_obj_get_info_by_fd()` and
writes them into the guard's globals. From that point on, the guard is
active. The guard has zero collateral damage: it only denies access to
our specific object IDs, leaving bpftrace, bpftool,
`RestrictFileSystems=`, and all other BPF usage completely unaffected.
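The guard's decision for map FDs can be modelled as follows. This is a userspace model rather than the BPF program itself: `GuardSketch` and `guard_check_map()` are hypothetical names, and `-EPERM` as the denial code is an assumption.

```c
#include <errno.h>
#include <stdint.h>

typedef struct {
        uint32_t protected_map_id; /* 0 while the guard is inactive */
} GuardSketch;

/* Models the map-FD hook: 0 to allow, -EPERM (assumed) to deny.
 * 'map_id' is the ID of the map an FD is being created for, 'tgid' identifies
 * the calling process. */
static int guard_check_map(const GuardSketch *g, uint32_t map_id, uint32_t tgid) {
        /* No real BPF object has ID 0, so an unset guard never matches. */
        if (map_id != g->protected_map_id)
                return 0;

        /* Only PID 1 may obtain an FD to the protected map. */
        if (tgid == 1)
                return 0;

        return -EPERM;
}
```

Because the comparison is against one specific ID, every other map stays accessible to every process, which is the "zero collateral damage" property described above.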
Additionally, a ptrace guard (`lsm/ptrace_access_check`) blocks
`PTRACE_MODE_ATTACH` to PID1 from other processes, preventing extraction
of sensitive state from PID1's address space via ptrace, `/proc/1/mem`,
`process_vm_readv()`, or `pidfd_getfd()`. `PTRACE_MODE_READ` is allowed
so that monitoring tools and `systemctl` continue to work normally.
### Limitations
- The enforcement hooks resolve trust by looking at
`file->f_inode->i_sb->s_dev` — the device number of the superblock that
owns the file's inode. This works correctly for files directly on a
dm-verity block device, but it does not see through overlayfs. When a
file is accessed on an overlay mount, `f_inode` points to the overlay
inode, and `i_sb->s_dev` is the overlay superblock's anonymous device
number — not the underlying dm-verity device. The overlay superblock has
no backing block device, so the lookup misses in the verity map and
execution is denied by the default policy.
This means that overlayfs mounts whose lower layers are on
dm-verity-protected volumes will currently have execution blocked, even
though the actual data is integrity-protected. The correct fix requires
a kernel extension that allows the BPF program to call something like
`d_real_inode()` to resolve through the overlay to the real inode on the
underlying filesystem, and then check that inode's superblock device
number against the verity map. I plan to add a BPF kfunc exposing this
functionality in a follow-up kernel series.
- Multi-device filesystems such as btrfs use entirely synthetic device
numbers and there is no way to reach the actual device backing the inode
from the inode itself. So `RestrictFileSystemAccess=` only works
reliably with a subset of filesystems. In practice this isn't a problem
because the feature is tailored to erofs; using it on arbitrary
filesystems requires careful vetting of the actual filesystem behaviour.
- The initial implementation also blocks JIT-style execution that relies
on mapping memory as executable. This is part of `exec` semantics today and
can be loosened later by introducing finer-grained values (a common
pattern in systemd — following the precedent of `ProtectSystem=`, which
started as a boolean and later grew `auto`/`yes`/`full`/`strict`
semantics).
- The configuration is a system-wide setting with no per-unit opt-out.
This is intentional for the initial implementation: a global invariant
is easier to reason about and harder to accidentally weaken. Per-unit
relaxation can be added later if a concrete need arises.
### Testing
The series includes unit tests and integration tests covering both the
core enforcement logic and the self-protection guard. The unit test
loads the skeleton, attaches programs, populates guard globals, and
verifies that protected IDs are set correctly. The integration tests
exercise the guard by attempting `BPF_MAP_GET_FD_BY_ID` and
`BPF_PROG_GET_FD_BY_ID` from a non-PID1 process and verifying that
access is denied.
What we cannot currently test end-to-end is actual execution enforcement
against a dm-verity-signed root filesystem. The systemd test suite does
not yet have infrastructure for booting a VM with a signed dm-verity
rootfs image — the existing mkosi-based test framework lacks the ability
to produce and boot such images. This will hopefully change soon when
Daan integrates barrage into the test suite.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Daan De Meyer [Tue, 12 May 2026 07:41:01 +0000 (09:41 +0200)]
test: Modernize btrfs tests
Convert test-btrfs to use the test framework and
assertions, merge the physical offset test into it
and beef it up to include what TEST-83-BTRFS does and
finally get rid of TEST-83-BTRFS as it is unneeded now.
Daan De Meyer [Wed, 13 May 2026 11:06:35 +0000 (13:06 +0200)]
libc,shared: detect newer library symbols at runtime via weak references (#42065)
For libc syscall wrappers (pidfd_open, fsopen, openat2, etc.) we previously
gated the calls behind build-time HAVE_* checks. Replace these with weak
external references, falling back to the raw syscall at runtime when the
loaded glibc lacks the symbol. Drop the corresponding cc.has_function() loop
from meson.build and disable -Wredundant-decls /
readability-redundant-declaration for src/libc/ via meson c_args and a local
.clang-tidy.
For optional libraries (libcryptsetup, libdw, libarchive), drop the per-symbol
HAVE_* checks. Always declare the prototypes, suppressing the redundant-decl
warnings via DISABLE_WARNING_REDUNDANT_DECLS and NOLINT, and resolve the
symbols after the main dlopen via a new DLSYM_OPTIONAL() helper that only
assigns on success. libarchive's *_is_set wrappers now use fallback functions
as their pointer initializers, so call sites never need to NULL-check.
The same treatment applies to pidfd_spawn / posix_spawnattr_setcgroup_np in
process-util.c and epoll_pwait2 in sd-event.c. coredump-config and
coredump-submit get a dlopen_dw_has_dwfl_set_sysroot() helper. The kexec
arch gate now uses defined(__NR_kexec_file_load) directly; pidfd.h uses
__has_include_next() to decide whether to pull in glibc's header.
This lets binaries built against newer glibc / libcryptsetup / libdw /
libarchive headers still load and run on older targets where these symbols
are absent.
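The weak-reference fallback described above can be sketched as follows. This is only an illustration under stated assumptions: gettid() stands in for the wrappers named in the message because it is harmless to call, and gettid_compat() is a hypothetical name, not a helper from this series.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() and the gettid() prototype */
#endif
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Weak reference: the address resolves to NULL at load time if the
 * C library we run against does not export the symbol. */
extern pid_t gettid(void) __attribute__((weak));

/* Hypothetical wrapper: prefer the libc symbol, else use the raw syscall. */
pid_t gettid_compat(void) {
        pid_t tid;

        if (gettid)
                tid = gettid();
        else
                tid = (pid_t) syscall(SYS_gettid);

        assert(tid > 0); /* thread IDs are always positive */
        return tid;
}
```

Because the decision is taken at runtime, the same binary works whether or not the loaded libc provides the symbol; no build-time HAVE_* check is involved.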
Jammy's kernel is too old at this point, and doesn't even provide a
vmlinux.h, so disable the feature in the build smoketests to let us
add new features.
Co-developed-by: Luca Boccassi <luca.boccassi@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
core: work around btf_ctx_access() rejection of const void * in BPF LSM
Kernels before v6.16 (missing commit 1271a40eeafa "bpf: Allow access to
const void pointer arguments in tracing programs") have a bug in
btf_ctx_access() where const void * parameters in LSM hook signatures
are not recognized as void pointers. The function checks t->type == 0
to detect void *, but for const void * the BTF chain is PTR -> CONST ->
void, so t->type points to the CONST node rather than directly to
type_id 0. This causes the verifier to reject any BPF program that
reads the const void *value argument of bdev_setintegrity:
func 'bpf_lsm_bdev_setintegrity' arg2 type UNKNOWN is not a struct
invalid bpf_context access off=16 size=8
Work around this by providing a compat variant of the
bdev_setintegrity BPF program that avoids reading the const void *value
argument entirely. Instead it reads the size argument (a scalar integer)
directly from the raw BPF context (ctx[3]), which is not subject to the
broken type check. This is safe because dm-verity guarantees that value
and size are always in lockstep: both NULL/0 for unsigned devices, both
non-zero for signed devices.
The loader tries the full version first (which reads both value and size
for defense-in-depth) and falls back to the compat variant if loading
fails. bpf_program__set_autoload(false) disables whichever variant is
not needed so the verifier never sees it.
This compat logic can be removed once the minimum kernel baseline
includes the 1271a40eeafa fix.
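To make the PTR -> CONST -> void chain concrete, here is a toy model in plain C. It is not the kernel's btf_ctx_access() code: the toy_type table, both function names, and the modifier-skipping variant are assumptions chosen only to show why a direct "type == 0" test misses const void *.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_kind { TOY_PTR, TOY_CONST, TOY_INT };

/* Toy type record: 'type' is the ID of the referenced type, 0 meaning void. */
struct toy_type {
        enum toy_kind kind;
        unsigned type;
};

/* The check as described above: only a pointer that refers directly to
 * type ID 0 is recognized as a void pointer. */
bool is_void_ptr_direct(const struct toy_type *types, unsigned id) {
        assert(types);
        return types[id].kind == TOY_PTR && types[id].type == 0;
}

/* One possible alternative: walk through CONST modifiers before comparing
 * against type ID 0. */
bool is_void_ptr_skip_const(const struct toy_type *types, unsigned id) {
        assert(types);

        if (types[id].kind != TOY_PTR)
                return false;

        unsigned next = types[id].type;
        while (next != 0 && types[next].kind == TOY_CONST)
                next = types[next].type;

        return next == 0;
}
```

With a table where ID 3 is PTR -> CONST -> void, the direct check returns false for ID 3 while the modifier-skipping check returns true, which mirrors the rejection described in the message.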
Signed-off-by: Christian Brauner <brauner@kernel.org>
test: add integration tests for RestrictFileSystemAccess= BPF LSM
Add TEST-90-RESTRICT-FSACCESS with two subtests:
config subtest — Tests PID1's RestrictFileSystemAccess= configuration parsing and
failure modes via system.conf drop-ins and daemon-reexec:
- Default RestrictFileSystemAccess=no produces no log messages
- RestrictFileSystemAccess=yes without BPF LSM logs appropriate warning
- RestrictFileSystemAccess=yes without require_signatures is correctly rejected
by the test helper binary's precondition check
enforce subtest — Tests actual BPF LSM enforcement using a test helper
binary (test-bpf-restrict-fsaccess) that loads the BPF skeleton with
initramfs_s_dev set to the rootfs s_dev, pins BPF links, and exits:
- Execution from rootfs continues to work (trusted via initramfs_s_dev)
- Execution from tmpfs is blocked with EPERM
- Execution from a signed dm-verity device is allowed, driven via
systemd-run -p RootImage= against the pre-built signed minimal_0
images that mkosi ships and signs at image build time (no on-the-fly
squashfs / verity hash tree / signature build required)
- After BPF detach, enforcement is lifted
All tests skip gracefully when prerequisites are not met (BPF LSM, BPF
framework, dm-verity tools, signing keys).
Signed-off-by: Christian Brauner <brauner@kernel.org>
core: expose internal helpers for test-bpf-restrict-fsaccess
Make dm_verity_require_signatures() non-static and declare it in the
header so the test helper binary can exercise the same precondition
checks that PID1 uses.
Signed-off-by: Christian Brauner <brauner@kernel.org>
core: add self-protection guard for RestrictFileSystemAccess= BPF LSM
Add self-protection guard programs to the RestrictFileSystemAccess= skeleton that
prevent non-PID1 processes from obtaining FDs to our maps, programs, or
links via the bpf() syscall.
This blocks the primary attack vector against the RestrictFileSystemAccess= policy:
using BPF_MAP_GET_FD_BY_ID to get an FD to the verity_devices map,
then BPF_MAP_UPDATE_ELEM to inject fake trusted devices. Protection of
program and link IDs is defense-in-depth (the kernel already blocks
BPF_LINK_UPDATE and BPF_LINK_DETACH for LSM tracing links).
Additionally, a ptrace guard (lsm/ptrace_access_check) blocks
PTRACE_MODE_ATTACH to PID1 from other processes, preventing
extraction of sensitive state from PID1's address space via
ptrace, /proc/1/mem, process_vm_readv(), or pidfd_getfd().
Guard logic:
1. Allow all BPF ops from PID1 (tgid == 1, unspoofable)
2. Deny BPF_MAP_GET_FD_BY_ID for our protected map IDs
3. Deny BPF_PROG_GET_FD_BY_ID for our program IDs
4. Deny BPF_LINK_GET_FD_BY_ID for our link IDs
5. Allow everything else (zero collateral damage)
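The five steps above can be modelled in userspace C as a single decision function. This is a sketch, not the BPF program from the series: the guard_cmd tags, the guard_ids layout, the two-slot arrays, and the choice of -EPERM as the denial value are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tags standing in for the bpf() commands named above. */
enum guard_cmd {
        CMD_MAP_GET_FD_BY_ID,
        CMD_PROG_GET_FD_BY_ID,
        CMD_LINK_GET_FD_BY_ID,
        CMD_OTHER,
};

/* Hypothetical guard state. A slot holding 0 is unused, since no real
 * BPF object has ID 0; an all-zero struct is the inactive guard. */
struct guard_ids {
        uint32_t map_ids[2];
        uint32_t prog_ids[2];
        uint32_t link_ids[2];
};

static bool id_listed(const uint32_t *ids, size_t n, uint32_t id) {
        if (id == 0)
                return false;
        for (size_t i = 0; i < n; i++)
                if (ids[i] == id)
                        return true;
        return false;
}

/* Returns 0 to allow the operation, -EPERM to deny it. */
int guard_decide(const struct guard_ids *g, uint32_t tgid, enum guard_cmd cmd, uint32_t id) {
        assert(g);

        if (tgid == 1) /* step 1: PID1 may do anything */
                return 0;

        switch (cmd) {
        case CMD_MAP_GET_FD_BY_ID:  /* step 2 */
                return id_listed(g->map_ids, 2, id) ? -EPERM : 0;
        case CMD_PROG_GET_FD_BY_ID: /* step 3 */
                return id_listed(g->prog_ids, 2, id) ? -EPERM : 0;
        case CMD_LINK_GET_FD_BY_ID: /* step 4 */
                return id_listed(g->link_ids, 2, id) ? -EPERM : 0;
        default:                    /* step 5: everything else is allowed */
                return 0;
        }
}
```

Only lookups of listed IDs from a non-PID1 caller are denied; any other ID or command passes through, which is the "zero collateral damage" property.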
The guard starts inactive (all protected IDs default to 0 in .bss).
After skeleton attach, PID1 queries kernel-assigned IDs via
bpf_obj_get_info_by_fd() and writes them into the guard globals via
the mmap'd .bss, then extracts owned FDs and destroys the skeleton.
Destroying the skeleton unmaps the .bss page from PID1's address
space, so no BPF state — guard globals, protected map/prog/link IDs,
initramfs_s_dev — remains readable via /proc/1/mem. The kernel map
data persists (held by the dup'd FDs) but is only accessible via
bpf_map_* syscalls, which the guard itself blocks for non-PID1.
Signed-off-by: Christian Brauner <brauner@kernel.org>
core: preserve RestrictFileSystemAccess= BPF state across daemon-reexec
The BPF link and .bss map FDs must survive PID1 re-execution
(daemon-reexec, switch_root, soft-reboot). Without serialization,
manager_free() closes them before execv, programs detach, and the
verity_devices map is freed. After exec a fresh skeleton would have
an empty map — but existing dm-verity devices have already called
bdev_setintegrity and won't call it again. The result would be a
deny-default policy with an empty map, i.e., all execution denied
and the system bricked.
Add serialize/deserialize support using systemd's existing
serialize_fd / fdset_cloexec / deserialize_fd infrastructure:
Before exec (in manager_serialize via bpf_restrict_fsaccess_serialize):
- Dup each link FD and the .bss map FD into the FDSet
- fdset_cloexec(fds, false) + execv() preserves them across exec
After exec (in manager_deserialize + bpf_restrict_fsaccess_setup):
- Deserialize the link FDs and .bss map FD into the Manager struct
- bpf_restrict_fsaccess_setup() detects the deserialized FDs and skips
skeleton re-creation entirely — the programs are already attached
- If no longer in initrd, clear initramfs_s_dev in the kernel map
No bpffs pinning is needed. This avoids a bpffs mount dependency and
eliminates the external attack surface that pinned objects would create
(discoverable/manipulable via unlink or BPF_OBJ_GET). The FDs remain
private to PID1.
Signed-off-by: Christian Brauner <brauner@kernel.org>
core: add RestrictFileSystemAccess= BPF LSM for dm-verity execution enforcement
Add a new RestrictFileSystemAccess= boolean setting in the [Manager] section of
system.conf that enforces execution only from signed dm-verity block
devices and the initramfs during early boot.
When RestrictFileSystemAccess=yes is set, PID1 loads a BPF LSM program early in boot
that:
Integrity tracking (self-populating, no userspace involvement):
- bdev_setintegrity: records dm-verity signature status in a BPF hash
map when the kernel signals device integrity via
security_bdev_setintegrity()
- bdev_free_security: removes devices from the map on teardown
Trust anchors:
- Signed dm-verity volumes (sig_valid flag in the BPF map)
- Initramfs (s_dev captured at load time, cleared after switch_root)
- Everything else is denied (tmpfs, procfs, sysfs, anonymous PROT_EXEC)
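The trust anchors above amount to a deny-by-default lookup keyed on the superblock device number. The following userspace model is a sketch only: the struct layouts and the exec_allowed() name are assumptions, and the real check runs as a BPF LSM program against a BPF hash map.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry of the verity device map. */
struct verity_entry {
        uint32_t dev;
        bool sig_valid;
};

/* Hypothetical policy state. */
struct policy {
        const struct verity_entry *devices;
        size_t n_devices;
        uint32_t initramfs_s_dev; /* 0 once cleared after switch_root */
};

/* Returns 0 if execution from a file on s_dev is allowed, -EPERM otherwise. */
int exec_allowed(const struct policy *p, uint32_t s_dev) {
        assert(p);

        /* Trust anchor: the initramfs, while it is still recorded. */
        if (p->initramfs_s_dev != 0 && s_dev == p->initramfs_s_dev)
                return 0;

        /* Trust anchor: a dm-verity device with a valid signature. */
        for (size_t i = 0; i < p->n_devices; i++)
                if (p->devices[i].dev == s_dev)
                        return p->devices[i].sig_valid ? 0 : -EPERM;

        /* Everything else misses the map and is denied. */
        return -EPERM;
}
```

A device number that is not in the map, such as the anonymous device of a tmpfs, falls through to the final denial.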
PID1 requires dm-verity require_signatures=1 to be enabled and refuses
to load the BPF program otherwise, ensuring the kernel enforces that all
dm-verity devices carry valid signatures.
After attach, PID1 extracts owned FDs from the skeleton (link FDs +
.bss map FD) and lets the skeleton be destroyed. The dup'd link FDs
keep programs attached via the kernel reference chain (link FD ->
bpf_link -> bpf_prog -> bpf_map). Destroying the skeleton unmaps the
.bss page from PID1's address space so no BPF state is readable via
/proc/1/mem. The .bss map FD is retained for targeted writes (clearing
initramfs_s_dev after switch_root via mmap).
Signed-off-by: Christian Brauner <brauner@kernel.org>
Daan De Meyer [Tue, 12 May 2026 14:29:18 +0000 (16:29 +0200)]
libc,shared: detect newer library symbols at runtime
For libc syscall wrappers (pidfd_open, fsopen, openat2, etc.) we previously
gated the calls behind build-time HAVE_* checks. Replace these with shim
functions in src/libc/ that fall back to the raw syscall at runtime when the
loaded glibc lacks the symbol. The infrastructure lives in src/libc/libc-shim.h:
DEFINE_SYSCALL_SHIM falls back to a direct syscall, DEFINE_LIBC_SHIM returns
ENOSYS (for posix_spawn-family helpers that have no corresponding syscall), and
DEFINE_LIBC_ERRNO_SHIM sets errno=ENOSYS and returns -1 (for read/write-style
helpers). The weak reference to the libc symbol is bound via __asm__("name")
rename so the bare libc identifier never appears as a C token — this avoids
both #undef boilerplate against override-header redirects and the resulting
-Wredundant-decls warning. Drop the corresponding cc.has_function() loop from
meson.build.
For optional libraries (libcryptsetup, libdw, libarchive), drop the per-symbol
HAVE_* checks. Always declare the prototypes, suppressing the redundant-decl
warnings via DISABLE_WARNING_REDUNDANT_DECLS and NOLINT, and resolve the symbols
after the main dlopen via a new DLSYM_OPTIONAL() helper that only assigns on
success. libcryptsetup's crypt_set_keyring_to_link / crypt_token_set_external_path
and libarchive's *_is_set wrappers use fallback functions as their pointer
initializers (returning -ENOSYS and 0 respectively), so call sites can invoke
the symbol unconditionally and just check for -ENOSYS where the "not supported"
distinction matters.
The same shim treatment applies to pidfd_spawn / posix_spawnattr_setcgroup_np
(src/libc/spawn.c) and epoll_pwait2 (src/libc/epoll.c), with corresponding
override headers in src/include/override/spawn.h and
src/include/override/sys/epoll.h. posix_spawn_wrapper() in process-util.c and
epoll_wait_usec() in sd-event.c now detect ENOSYS in the return value instead
of checking the function pointer, falling back to plain posix_spawn() and
epoll_wait() respectively. coredump-config and coredump-submit get a
dlopen_dw_has_dwfl_set_sysroot() helper. The kexec arch gate now uses
defined(__NR_kexec_file_load) directly; pidfd.h uses __has_include_next() to
decide whether to pull in glibc's header.
This lets binaries built against newer glibc / libcryptsetup / libdw /
libarchive headers still load and run on older targets where these symbols are
absent.
When verb groups were added, I assumed that the first group would always
be the unnamed group, or in other words, that a VERB_GROUP() line cannot
appear first. This provides an additional check on whether the verbs
have been reordered by the compiler or linker. But that check is weak
and we can do a better check anyway. And this limitation is unexpected,
since we allow that for OPTIONs. The code should all work without an
unnamed group, once this assertion is removed.
Daan De Meyer [Tue, 12 May 2026 19:54:06 +0000 (21:54 +0200)]
syscall: add kexec_file_load to the generated override header
This makes __NR_kexec_file_load available on architectures where the kernel
UAPI headers don't define it, matching the runtime fallback path in
src/libc/kexec.c which is gated on #ifdef __NR_kexec_file_load.
A follow-up to the AddStorage / RemoveStorage series. ReplaceStorage
swaps the *backing file* of an already-attached storage device on a
running vmspawn-managed VM, leaving the guest-visible device frontend
(virtio-blk, virtio-scsi, nvme, scsi-cd) and every other property of
the device untouched. The intended use is to point an existing disk
at a new image without the guest seeing a hot-unplug/hot-plug cycle.
The signature mirrors AddStorage minus the 'config' field: the
device frontend doesn't change, only the backing behind it.
Read-only / read-write is derived from the new fd's O_ACCMODE; scsi-cd is
forced read-only to match the boot-time policy. S_ISBLK on the new
fd selects host_device vs file driver, matching AddStorage.
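Deriving both properties from the fd alone could look roughly like this. It is a sketch under assumptions: backing_props and backing_props_from_fd() are hypothetical names, and treating only O_RDONLY as read-only is a guess at the policy, not taken from the vmspawn code.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>

/* Hypothetical result type for this illustration. */
struct backing_props {
        bool read_only;
        const char *driver; /* "host_device" or "file" */
};

/* Returns 0 on success, -1 with errno set on failure. */
int backing_props_from_fd(int fd, bool is_scsi_cd, struct backing_props *ret) {
        struct stat st;

        assert(fd >= 0);
        assert(ret);

        int flags = fcntl(fd, F_GETFL);
        if (flags < 0)
                return -1;
        if (fstat(fd, &st) < 0)
                return -1;

        /* scsi-cd is forced read-only regardless of how the fd was opened. */
        ret->read_only = is_scsi_cd || (flags & O_ACCMODE) == O_RDONLY;

        /* Block devices use the host_device driver, everything else "file". */
        ret->driver = S_ISBLK(st.st_mode) ? "host_device" : "file";
        return 0;
}
```

The caller therefore passes no explicit read-only flag; opening the image O_RDONLY or O_RDWR is what selects the mode.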
The QMP primitive is blockdev-reopen. It cannot change a file /
host_device node's 'filename' so we can't just point the existing
file node at a new fd, but it can swap a format node's 'file' child
to a different existing monitor-owned node by node-name reference
(case 3 in qemu/qapi/block-core.json:5034-5040). The chain is:
add-fd (host fd → new fdset)
blockdev-add (new file node, filename=/dev/fdset/N — fd-only)
remove-fd (release monitor's ref; new file holds the dup)
blockdev-reopen (format node, file = new file node-name)
blockdev-del (old file node; its dup release frees old fdset)
The reopen options must restate every option the original
blockdev-add emitted on the format node, because blockdev-reopen
resets any unspecified option to its driver default. The 'file' field
is a node-name string reference, never a path.
No new errors and no new IDL types beyond the method itself;
everything is built on the existing NoSuchStorage / StorageImmutable
/ NotConnected / EBUSY vocabulary.
The series is:
vmspawn: split blockdev-add into separate file and format calls
Preparatory refactor. qemu/blockdev.c:3440 only marks the
top-level BDS returned by blockdev-add as monitor-owned;
inline children are NOT, so blockdev-del later rejects them
with "Node X is not owned by the monitor". Split into two
blockdev-add calls so the file node is independently
deletable. DriveInfo gains qmp_file_node_name and a
file_generation counter; the teardown helper deletes format
then file (file-first is rejected as "node used as 'file'
of Y"). The ephemeral path was already structured this way;
only the regular add path changes. Drops the now-unused
qmp_build_blockdev_add_inline().
shared/varlink-io.systemd.MachineInstance: add ReplaceStorage method
IDL only: ReplaceStorage(fileDescriptorIndex, name). No new
errors.
vmspawn: implement io.systemd.MachineInstance.ReplaceStorage
vmspawn_qmp_replace_block_device() entry point, ReplaceCtx
(refcounted, ReplaceCtxStateFlags for partial-state tracking)
and four async callbacks plus an idempotent replace_fail.
file_generation is bumped before issuing blockdev-add so
retries don't collide on node-name.
BLOCK_DEVICE_STATE_REPLACE_PENDING gates concurrent
Replace / Remove on the same drive. On reopen success the
trailing blockdev-del of the old file node fires from the
reopen callback; its failure logs a warning and still replies
success (the swap already committed; the orphan resolves at VM
exit). QMP disconnect mid-replace routes via
qmp_client_fail_pending → replace_fail → NotConnected.
test: integration test for io.systemd.MachineInstance.ReplaceStorage
TEST-87-AUX-UTILS-VM.replace-storage covers happy-path replace,
successive replaces (file_generation rotation), StorageImmutable
rejection on the boot-time drive, NoSuchStorage on unknown
names, InvalidParameter on malformed names, and clean
RemoveStorage after a replace (proves the new file node is
monitor-owned and the teardown order works). Backing files are
passed via 'varlinkctl --push-fd'; no machinectl front-end is
added in this round.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
DHCP option 120 (SIP server) takes a list of addresses or domain
names, and the first byte of the data indicates which type is
stored. Let's extend _addresses() and _domains() to make them support
the SIP server option.
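For reference, RFC 3361 defines the leading encoding byte as 0 for DNS-encoded domain names and 1 for IPv4 addresses. The address branch could be parsed roughly as below; sip_option_parse_addresses() and its error values are hypothetical and are not the helpers touched by this commit.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
        SIP_ENC_DOMAINS   = 0, /* payload is a list of DNS-encoded names */
        SIP_ENC_ADDRESSES = 1, /* payload is a list of 4-byte IPv4 addresses */
};

/* Hypothetical helper: returns the number of addresses copied into ret,
 * or a negative errno-style value. */
int sip_option_parse_addresses(const uint8_t *data, size_t len, struct in_addr *ret, size_t n_max) {
        assert(data);
        assert(ret);

        if (len < 1)
                return -EBADMSG;
        if (data[0] != SIP_ENC_ADDRESSES)
                return -EINVAL; /* not the address form; try the domain parser */

        size_t payload = len - 1;
        if (payload == 0 || payload % 4 != 0)
                return -EBADMSG;

        size_t n = payload / 4;
        if (n > n_max)
                return -ENOBUFS;

        /* The wire format is already in network byte order. */
        for (size_t i = 0; i < n; i++)
                memcpy(&ret[i].s_addr, data + 1 + i * 4, 4);

        return (int) n;
}
```

A domain-list parser would branch on the same first byte and decode the remainder as DNS labels instead.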