This stuff is so useful, and should work out of the box I am sure. Given
that the metrics are only generated on request this shouldn't create any
additional burden by default.
Yes, this might enlarge reports a bit, if generated with everything on,
but we really should solve that at the report generation level, not at
the point where we make the metrics available.
Chris Down [Wed, 13 May 2026 12:25:08 +0000 (21:25 +0900)]
core: do not leak resources when handling stale alias state on reload (#41986)
The fix for the corrupted state when units become aliased on reload
leaks the now-aliased unit's resources, which become untracked and
essentially lost.
While fixing the state corruption is of course necessary, leaking
processes/etc. is not ideal for a system and service manager, so
instead attempt to keep track of them by creating stub units
on-the-fly.
This way resources are not leaked, there are clear indications of
where they moved, and all state can be tracked as expected.
RestrictFileSystemAccess= — dm-verity filesystem access enforcement via BPF LSM (#41340)
This series adds a new `RestrictFileSystemAccess=` setting in the
`[Manager]` section of `system.conf` that enforces a deny-default
execution policy: only binaries residing on signed dm-verity block
devices (and the initramfs during early boot) are permitted to execute.
Everything else — tmpfs, procfs, sysfs, anonymous executable mappings,
unsigned dm-verity devices — is denied.
The directive takes the values `no` (default), `exec` (lock down
execution), and accepts `yes` as an alias for `exec`. The name is
deliberately broader than what the initial values cover so the same
setting can grow to restrict other filesystem access categories in the
future (e.g. `any` to deny all access from untrusted filesystems, not
just execution).
### How it works
The BPF program is entirely self-contained; PID1 loads it and the kernel
does the rest. When dm-verity brings up a device, the kernel calls
`security_bdev_setintegrity()` twice during `verity_preresume()`: once
with the root hash and once with the signature validity status. Our
`lsm/bdev_setintegrity` hook captures the second call and records the
device number in a BPF hash map if the signature is valid. When a device
is torn down, `lsm/bdev_free_security` cleans up the map entry. No
userspace map population is needed at any point.
The enforcement side hooks `bprm_check_security` (execve), `mmap_file`
(PROT_EXEC mappings including shared libraries), and `file_mprotect`
(W→X transitions like JIT and libffi). Each hook resolves the file's
backing device via `file->f_inode->i_sb->s_dev` and looks it up in the
verity device map. For block-backed filesystems, `s_dev` equals
`s_bdev->bd_dev`, which avoids an extra pointer chase and NULL check on
`s_bdev` — non-block filesystems simply miss in the map and get denied
by the default policy.
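
The lookup described above can be modelled in plain userspace C. This is only a sketch of the deny-default decision, not the BPF program itself: the `verity_model` structure, its fixed-size array (standing in for the BPF hash map), and the function names are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model only: the real lookup is a BPF hash map keyed by device number. */
#define MODEL_MAX_DEVICES 8

struct verity_model {
        uint32_t trusted[MODEL_MAX_DEVICES]; /* devices whose signature was valid */
        size_t n_trusted;
        uint32_t initramfs_s_dev;            /* 0 once the trust window is closed */
};

/* Models the integrity hook recording a device with a valid signature. */
static void model_trust_device(struct verity_model *m, uint32_t dev) {
        assert(m->n_trusted < MODEL_MAX_DEVICES);
        m->trusted[m->n_trusted++] = dev;
}

/* Deny-default: any s_dev that is not found is refused, which covers
 * anonymous device numbers of non-block filesystems as well. */
static bool model_exec_allowed(const struct verity_model *m, uint32_t s_dev) {
        if (m->initramfs_s_dev != 0 && s_dev == m->initramfs_s_dev)
                return true;

        for (size_t i = 0; i < m->n_trusted; i++)
                if (m->trusted[i] == s_dev)
                        return true;

        return false;
}
```

Because the function only ever returns `true` on a positive match, a filesystem with no backing block device needs no special case: its `s_dev` simply never appears in the table.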
During early boot the initramfs needs to be trusted as well, since it
runs before any dm-verity volume is mounted. PID1 writes the initramfs
superblock's device number into a BPF global before attaching the
programs, and clears it after `switch_root` to close the trust window.
As a prerequisite, PID1 also verifies that
`dm_verity.require_signatures=1` is active — without it, unsigned
dm-verity devices could be created, which would weaken the security
model even though the BPF program would correctly deny execution from
them.
### Surviving daemon-reexec
The BPF programs and their verity device map must survive PID1
re-execution (daemon-reexec, switch_root, soft-reboot). Without
preservation, `manager_free()` would destroy the skeleton, the link FDs
would close, programs would detach, and the map would be freed. After
exec, a fresh skeleton would have an empty map — but existing dm-verity
devices have already signaled their integrity and won't do so again. A
deny-default policy plus an empty map means all execution is denied and the
system is bricked.
We solve this by serializing the raw BPF link FDs and the `.bss` map FD
across exec using systemd's existing `serialize_fd` / `fdset_cloexec` /
`deserialize_fd` infrastructure. The kernel reference chain (link FD →
`struct bpf_link` → `struct bpf_prog` → `struct bpf_map`) keeps programs
attached and map data intact as long as the dup'd FDs survive. After
exec, PID1 detects the deserialized FDs and skips skeleton re-creation
entirely. If switching root, it uses the deserialized `.bss` map FD to
clear `initramfs_s_dev` via a targeted `mmap()` write, preserving the
other guard globals in `.bss`.
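
Keeping an FD open across `execv()` comes down to clearing its close-on-exec flag. The sketch below shows that step for a single FD using plain `fcntl()`; `fd_set_cloexec()` is a hypothetical helper written for illustration and is not systemd's `fdset_cloexec()`.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

/* Set or clear FD_CLOEXEC on one fd.
 * Returns 0 on success or a negative errno-style value. */
static int fd_set_cloexec(int fd, int cloexec) {
        int flags, nflags;

        assert(fd >= 0);

        flags = fcntl(fd, F_GETFD);
        if (flags < 0)
                return -errno;

        nflags = cloexec ? (flags | FD_CLOEXEC) : (flags & ~FD_CLOEXEC);
        if (nflags == flags)
                return 0; /* already in the requested state */

        if (fcntl(fd, F_SETFD, nflags) < 0)
                return -errno;

        return 0;
}
```

An FD with the flag cleared stays open in the new process image, so the kernel objects it references stay alive as well.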
We intentionally avoid bpffs pinning. Pinned objects are discoverable
and manipulable by any process with sufficient privileges
(`BPF_OBJ_GET`, unlink). FD serialization keeps everything private to
PID1 with no external attack surface.
### Self-protection
BPF LSM programs attached via the tracing trampoline (`BPF_LSM_MAC`) are
inherently tamper-resistant — `bpf_tracing_link_lops` has no
`.update_prog` and no `.detach` callbacks, so the kernel rejects
`BPF_LINK_UPDATE` with `-EINVAL` and `BPF_LINK_DETACH` with
`-EOPNOTSUPP`. Once attached, our programs cannot be modified or
detached through the `bpf()` syscall.
The remaining attack vector is map injection: `BPF_MAP_GET_FD_BY_ID` to
obtain an FD to `verity_devices`, then `BPF_MAP_UPDATE_ELEM` to insert a
fake trusted device. The self-protection guard blocks this with three
hooks. `lsm/bpf_map` fires inside `bpf_map_new_fd()`, the chokepoint for
all code paths that produce a map FD, and denies access to our map IDs
from any process other than PID1 (identified via `tgid == 1`, which is
unspoofable — `bpf_get_current_pid_tgid()` reads `current->tgid` from
`pid->numbers[0].nr`, the init-namespace PID). `lsm/bpf_prog` provides
analogous protection for program FDs as defense-in-depth. `lsm/bpf`
handles `BPF_LINK_GET_FD_BY_ID` at the command level since there is no
`security_bpf_link()` hook in the kernel.
The guard starts inactive — all protected IDs default to 0 in `.bss`,
and no real BPF object has ID 0 — so there is no window where it
interferes with PID1's own setup. After attaching all programs, PID1
queries the kernel-assigned IDs via `bpf_obj_get_info_by_fd()` and
writes them into the guard's globals. From that point on, the guard is
active. The guard has zero collateral damage: it only denies access to
our specific object IDs, leaving bpftrace, bpftool,
`RestrictFileSystems=`, and all other BPF usage completely unaffected.
Additionally, a ptrace guard (`lsm/ptrace_access_check`) blocks
`PTRACE_MODE_ATTACH` to PID1 from other processes, preventing extraction
of sensitive state from PID1's address space via ptrace, `/proc/1/mem`,
`process_vm_readv()`, or `pidfd_getfd()`. `PTRACE_MODE_READ` is allowed
so that monitoring tools and `systemctl` continue to work normally.
### Limitations
- The enforcement hooks resolve trust by looking at
`file->f_inode->i_sb->s_dev` — the device number of the superblock that
owns the file's inode. This works correctly for files directly on a
dm-verity block device, but it does not see through overlayfs. When a
file is accessed on an overlay mount, `f_inode` points to the overlay
inode, and `i_sb->s_dev` is the overlay superblock's anonymous device
number — not the underlying dm-verity device. The overlay superblock has
no backing block device, so the lookup misses in the verity map and
execution is denied by the default policy.
This means that overlayfs mounts whose lower layers are on
dm-verity-protected volumes will currently have execution blocked, even
though the actual data is integrity-protected. The correct fix requires
a kernel extension that allows the BPF program to call something like
`d_real_inode()` to resolve through the overlay to the real inode on the
underlying filesystem, and then check that inode's superblock device
number against the verity map. I plan to add a BPF kfunc exposing this
functionality in a follow-up kernel series.
- Multi-device filesystems such as btrfs use entirely synthetic device
numbers and there is no way to reach the actual device backing the inode
from the inode itself. So `RestrictFileSystemAccess=` only works
reliably with a subset of filesystems. In practice this isn't a problem
because the feature is tailored to erofs; using it on arbitrary
filesystems requires careful vetting of the actual filesystem behaviour.
- The initial implementation also blocks JIT-style execution that relies
on executable memory mappings. This is part of `exec` semantics today and
can be loosened later by introducing finer-grained values (a common
pattern in systemd, following the precedent of `ProtectSystem=`, which
started as a boolean and later grew `auto`/`yes`/`full`/`strict`
semantics).
- The configuration is a system-wide setting with no per-unit opt-out.
This is intentional for the initial implementation: a global invariant
is easier to reason about and harder to accidentally weaken. Per-unit
relaxation can be added later if a concrete need arises.
### Testing
The series includes unit tests and integration tests covering both the
core enforcement logic and the self-protection guard. The unit test
loads the skeleton, attaches programs, populates guard globals, and
verifies that protected IDs are set correctly. The integration tests
exercise the guard by attempting `BPF_MAP_GET_FD_BY_ID` and
`BPF_PROG_GET_FD_BY_ID` from a non-PID1 process and verifying that
access is denied.
What we cannot currently test end-to-end is actual execution enforcement
against a dm-verity-signed root filesystem. The systemd test suite does
not yet have infrastructure for booting a VM with a signed dm-verity
rootfs image — the existing mkosi-based test framework lacks the ability
to produce and boot such images. This will hopefully change soon when
Daan integrates barrage into the test suite.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Daan De Meyer [Tue, 12 May 2026 07:41:01 +0000 (09:41 +0200)]
test: Modernize btrfs tests
Convert test-btrfs to use the test framework and
assertions, merge the physical offset test into it
and beef it up to include what TEST-83-BTRFS does and
finally get rid of TEST-83-BTRFS as it is unneeded now.
Daan De Meyer [Wed, 13 May 2026 11:06:35 +0000 (13:06 +0200)]
libc,shared: detect newer library symbols at runtime via weak references (#42065)
For libc syscall wrappers (pidfd_open, fsopen, openat2, etc.) we previously
gated the calls behind build-time HAVE_* checks. Replace these with weak
external references, falling back to the raw syscall at runtime when the
loaded glibc lacks the symbol. Drop the corresponding cc.has_function() loop
from meson.build and disable -Wredundant-decls /
readability-redundant-declaration for src/libc/ via meson c_args and a local
.clang-tidy.
For optional libraries (libcryptsetup, libdw, libarchive), drop the per-symbol
HAVE_* checks. Always declare the prototypes, suppressing the redundant-decl
warnings via DISABLE_WARNING_REDUNDANT_DECLS and NOLINT, and resolve the
symbols after the main dlopen via a new DLSYM_OPTIONAL() helper that only
assigns on success. libarchive's *_is_set wrappers now use fallback functions
as their pointer initializers, so call sites never need to NULL-check.
The same treatment applies to pidfd_spawn / posix_spawnattr_setcgroup_np in
process-util.c and epoll_pwait2 in sd-event.c. coredump-config and
coredump-submit get a dlopen_dw_has_dwfl_set_sysroot() helper. The kexec
arch gate now uses defined(__NR_kexec_file_load) directly; pidfd.h uses
__has_include_next() to decide whether to pull in glibc's header.
This lets binaries built against newer glibc / libcryptsetup / libdw /
libarchive headers still load and run on older targets where these symbols
are absent.
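
As a minimal sketch of the weak-reference technique, the example below uses gettid() purely as a stand-in for the wrappers named above. It assumes a GNU-compatible toolchain (the `weak` attribute) and dynamic linking; `gettid_compat()` is a hypothetical name, not a systemd function.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Weak reference: if the loaded C library does not export the symbol,
 * its address evaluates to NULL instead of causing a load failure. */
extern pid_t gettid(void) __attribute__((weak));

static pid_t gettid_compat(void) {
        pid_t tid;

        if (gettid)
                tid = gettid();                    /* libc provides the wrapper */
        else
                tid = (pid_t) syscall(SYS_gettid); /* fall back to the raw syscall */

        assert(tid > 0);
        return tid;
}
```

The local redeclaration is what would trigger -Wredundant-decls when the libc header already declares the function, which is presumably why the commit disables that warning for src/libc/.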
Jammy's kernel is too old at this point, and doesn't even provide a
vmlinux.h, so disable the feature in the build smoketests to let us
add new features.
Co-developed-by: Luca Boccassi <luca.boccassi@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
core: work around btf_ctx_access() rejection of const void * in BPF LSM
Kernels before v6.16 (missing commit 1271a40eeafa "bpf: Allow access to
const void pointer arguments in tracing programs") have a bug in
btf_ctx_access() where const void * parameters in LSM hook signatures
are not recognized as void pointers. The function checks t->type == 0
to detect void *, but for const void * the BTF chain is PTR -> CONST ->
void, so t->type points to the CONST node rather than directly to
type_id 0. This causes the verifier to reject any BPF program that
reads the const void *value argument of bdev_setintegrity:
func 'bpf_lsm_bdev_setintegrity' arg2 type UNKNOWN is not a struct
invalid bpf_context access off=16 size=8
Work around this by providing a compat variant of the
bdev_setintegrity BPF program that avoids reading the const void *value
argument entirely. Instead it reads the size argument (a scalar integer)
directly from the raw BPF context (ctx[3]), which is not subject to the
broken type check. This is safe because dm-verity guarantees that value
and size are always in lockstep: both NULL/0 for unsigned devices, both
non-zero for signed devices.
The loader tries the full version first (which reads both value and size
for defense-in-depth) and falls back to the compat variant if loading
fails. bpf_program__set_autoload(false) disables whichever variant is
not needed so the verifier never sees it.
This compat logic can be removed once the minimum kernel baseline
includes the 1271a40eeafa fix.
Signed-off-by: Christian Brauner <brauner@kernel.org>
test: add integration tests for RestrictFileSystemAccess= BPF LSM
Add TEST-90-RESTRICT-FSACCESS with two subtests:
config subtest — Tests PID1's RestrictFileSystemAccess= configuration parsing and
failure modes via system.conf drop-ins and daemon-reexec:
- Default RestrictFileSystemAccess=no produces no log messages
- RestrictFileSystemAccess=yes without BPF LSM logs appropriate warning
- RestrictFileSystemAccess=yes without require_signatures is correctly rejected
by the test helper binary's precondition check
enforce subtest — Tests actual BPF LSM enforcement using a test helper
binary (test-bpf-restrict-fsaccess) that loads the BPF skeleton with
initramfs_s_dev set to the rootfs s_dev, pins BPF links, and exits:
- Execution from rootfs continues to work (trusted via initramfs_s_dev)
- Execution from tmpfs is blocked with EPERM
- Execution from a signed dm-verity device is allowed, driven via
systemd-run -p RootImage= against the pre-built signed minimal_0
images that mkosi ships and signs at image build time (no on-the-fly
squashfs / verity hash tree / signature build required)
- After BPF detach, enforcement is lifted
All tests skip gracefully when prerequisites are not met (BPF LSM, BPF
framework, dm-verity tools, signing keys).
Signed-off-by: Christian Brauner <brauner@kernel.org>
core: expose internal helpers for test-bpf-restrict-fsaccess
Make dm_verity_require_signatures() non-static and declare it in the
header so the test helper binary can exercise the same precondition
checks that PID1 uses.
Signed-off-by: Christian Brauner <brauner@kernel.org>
core: add self-protection guard for RestrictFileSystemAccess= BPF LSM
Add self-protection guard programs to the RestrictFileSystemAccess= skeleton that
prevent non-PID1 processes from obtaining FDs to our maps, programs, or
links via the bpf() syscall.
This blocks the primary attack vector against the RestrictFileSystemAccess= policy:
using BPF_MAP_GET_FD_BY_ID to get an FD to the verity_devices map,
then BPF_MAP_UPDATE_ELEM to inject fake trusted devices. Protection of
program and link IDs is defense-in-depth (the kernel already blocks
BPF_LINK_UPDATE and BPF_LINK_DETACH for LSM tracing links).
Additionally, a ptrace guard (lsm/ptrace_access_check) blocks
PTRACE_MODE_ATTACH to PID1 from other processes, preventing
extraction of sensitive state from PID1's address space via
ptrace, /proc/1/mem, process_vm_readv(), or pidfd_getfd().
Guard logic:
1. Allow all BPF ops from PID1 (tgid == 1, unspoofable)
2. Deny BPF_MAP_GET_FD_BY_ID for our protected map IDs
3. Deny BPF_PROG_GET_FD_BY_ID for our program IDs
4. Deny BPF_LINK_GET_FD_BY_ID for our link IDs
5. Allow everything else (zero collateral damage)
The guard starts inactive (all protected IDs default to 0 in .bss).
After skeleton attach, PID1 queries kernel-assigned IDs via
bpf_obj_get_info_by_fd() and writes them into the guard globals via
the mmap'd .bss, then extracts owned FDs and destroys the skeleton.
Destroying the skeleton unmaps the .bss page from PID1's address
space, so no BPF state — guard globals, protected map/prog/link IDs,
initramfs_s_dev — remains readable via /proc/1/mem. The kernel map
data persists (held by the dup'd FDs) but is only accessible via
bpf_map_* syscalls, which the guard itself blocks for non-PID1.
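
The guard logic listed above can be modelled as a pure decision function. This is a simplified userspace sketch, not the BPF code: it tracks a single protected ID per object type, and the `guard_model` structure and `model_cmd` enum are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum model_cmd {
        MODEL_MAP_GET_FD_BY_ID,
        MODEL_PROG_GET_FD_BY_ID,
        MODEL_LINK_GET_FD_BY_ID,
        MODEL_OTHER,
};

struct guard_model {
        /* 0 means "not populated yet", i.e. the guard is inactive for that type */
        uint32_t map_id, prog_id, link_id;
};

static bool guard_model_allows(const struct guard_model *g, uint32_t tgid,
                               enum model_cmd cmd, uint32_t id) {
        assert(g);

        if (tgid == 1)                  /* rule 1: PID1 may do anything */
                return true;

        switch (cmd) {
        case MODEL_MAP_GET_FD_BY_ID:    /* rule 2 */
                return !(g->map_id != 0 && id == g->map_id);
        case MODEL_PROG_GET_FD_BY_ID:   /* rule 3 */
                return !(g->prog_id != 0 && id == g->prog_id);
        case MODEL_LINK_GET_FD_BY_ID:   /* rule 4 */
                return !(g->link_id != 0 && id == g->link_id);
        default:                        /* rule 5: everything else is allowed */
                return true;
        }
}
```

With all IDs left at 0 the function allows every request, which mirrors the inactive state before PID1 writes the kernel-assigned IDs.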
Signed-off-by: Christian Brauner <brauner@kernel.org>
core: preserve RestrictFileSystemAccess= BPF state across daemon-reexec
The BPF link and .bss map FDs must survive PID1 re-execution
(daemon-reexec, switch_root, soft-reboot). Without serialization,
manager_free() closes them before execv, programs detach, and the
verity_devices map is freed. After exec a fresh skeleton would have
an empty map — but existing dm-verity devices have already called
bdev_setintegrity and won't call it again. The result would be a
deny-default policy with an empty map, i.e., all execution denied
and the system bricked.
Add serialize/deserialize support using systemd's existing
serialize_fd / fdset_cloexec / deserialize_fd infrastructure:
Before exec (in manager_serialize via bpf_restrict_fsaccess_serialize):
- Dup each link FD and the .bss map FD into the FDSet
- fdset_cloexec(fds, false) + execv() preserves them across exec
After exec (in manager_deserialize + bpf_restrict_fsaccess_setup):
- Deserialize the link FDs and .bss map FD into the Manager struct
- bpf_restrict_fsaccess_setup() detects the deserialized FDs and skips
skeleton re-creation entirely — the programs are already attached
- If no longer in initrd, clear initramfs_s_dev in the kernel map
No bpffs pinning is needed. This avoids a bpffs mount dependency and
eliminates the external attack surface that pinned objects would create
(discoverable/manipulable via unlink or BPF_OBJ_GET). The FDs remain
private to PID1.
Signed-off-by: Christian Brauner <brauner@kernel.org>
core: add RestrictFileSystemAccess= BPF LSM for dm-verity execution enforcement
Add a new RestrictFileSystemAccess= boolean setting in the [Manager] section of
system.conf that enforces execution only from signed dm-verity block
devices and the initramfs during early boot.
When RestrictFileSystemAccess=yes is set, PID1 loads a BPF LSM program early in boot
that:
Integrity tracking (self-populating, no userspace involvement):
- bdev_setintegrity: records dm-verity signature status in a BPF hash
map when the kernel signals device integrity via
security_bdev_setintegrity()
- bdev_free_security: removes devices from the map on teardown
Trust anchors:
- Signed dm-verity volumes (sig_valid flag in the BPF map)
- Initramfs (s_dev captured at load time, cleared after switch_root)
- Everything else is denied (tmpfs, procfs, sysfs, anonymous PROT_EXEC)
PID1 requires dm-verity require_signatures=1 to be enabled and refuses
to load the BPF program otherwise, ensuring the kernel enforces that all
dm-verity devices carry valid signatures.
After attach, PID1 extracts owned FDs from the skeleton (link FDs +
.bss map FD) and lets the skeleton be destroyed. The dup'd link FDs
keep programs attached via the kernel reference chain (link FD ->
bpf_link -> bpf_prog -> bpf_map). Destroying the skeleton unmaps the
.bss page from PID1's address space so no BPF state is readable via
/proc/1/mem. The .bss map FD is retained for targeted writes (clearing
initramfs_s_dev after switch_root via mmap).
Signed-off-by: Christian Brauner <brauner@kernel.org>
Daan De Meyer [Tue, 12 May 2026 14:29:18 +0000 (16:29 +0200)]
libc,shared: detect newer library symbols at runtime
For libc syscall wrappers (pidfd_open, fsopen, openat2, etc.) we previously
gated the calls behind build-time HAVE_* checks. Replace these with shim
functions in src/libc/ that fall back to the raw syscall at runtime when the
loaded glibc lacks the symbol. The infrastructure lives in src/libc/libc-shim.h:
DEFINE_SYSCALL_SHIM falls back to a direct syscall, DEFINE_LIBC_SHIM returns
ENOSYS (for posix_spawn-family helpers that have no corresponding syscall), and
DEFINE_LIBC_ERRNO_SHIM sets errno=ENOSYS and returns -1 (for read/write-style
helpers). The weak reference to the libc symbol is bound via __asm__("name")
rename so the bare libc identifier never appears as a C token — this avoids
both #undef boilerplate against override-header redirects and the resulting
-Wredundant-decls warning. Drop the corresponding cc.has_function() loop from
meson.build.
For optional libraries (libcryptsetup, libdw, libarchive), drop the per-symbol
HAVE_* checks. Always declare the prototypes, suppressing the redundant-decl
warnings via DISABLE_WARNING_REDUNDANT_DECLS and NOLINT, and resolve the symbols
after the main dlopen via a new DLSYM_OPTIONAL() helper that only assigns on
success. libcryptsetup's crypt_set_keyring_to_link / crypt_token_set_external_path
and libarchive's *_is_set wrappers use fallback functions as their pointer
initializers (returning -ENOSYS and 0 respectively), so call sites can invoke
the symbol unconditionally and just check for -ENOSYS where the "not supported"
distinction matters.
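
The fallback-initializer pattern can be sketched without any real library. All names below are hypothetical: `optional_op_real()` stands in for whatever a lookup such as dlsym() would return, and `resolve_optional()` only illustrates the "assign on success" behaviour attributed to DLSYM_OPTIONAL(), not its actual definition.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef int (*optional_op_fn)(int arg);

/* Fallback used while the optional symbol is unresolved. */
static int optional_op_fallback(int arg) {
        (void) arg;
        return -ENOSYS;
}

/* Stand-in for a function that a symbol lookup might return. */
static int optional_op_real(int arg) {
        return arg + 1;
}

/* The pointer is never NULL, so call sites can invoke it unconditionally. */
static optional_op_fn sym_optional_op = optional_op_fallback;

/* Assign only on success; a failed lookup (NULL) leaves the fallback in place. */
static void resolve_optional(optional_op_fn *slot, optional_op_fn found) {
        assert(slot && *slot);

        if (found)
                *slot = found;
}
```

A caller that cares about the difference checks the return value for -ENOSYS; every other caller can ignore whether the symbol was found.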
The same shim treatment applies to pidfd_spawn / posix_spawnattr_setcgroup_np
(src/libc/spawn.c) and epoll_pwait2 (src/libc/epoll.c), with corresponding
override headers in src/include/override/spawn.h and
src/include/override/sys/epoll.h. posix_spawn_wrapper() in process-util.c and
epoll_wait_usec() in sd-event.c now detect ENOSYS in the return value instead
of checking the function pointer, falling back to plain posix_spawn() and
epoll_wait() respectively. coredump-config and coredump-submit get a
dlopen_dw_has_dwfl_set_sysroot() helper. The kexec arch gate now uses
defined(__NR_kexec_file_load) directly; pidfd.h uses __has_include_next() to
decide whether to pull in glibc's header.
This lets binaries built against newer glibc / libcryptsetup / libdw /
libarchive headers still load and run on older targets where these symbols are
absent.
When verb groups were added, I assumed that the first group would always
be the unnamed group, or in other words, that a VERB_GROUP() line cannot
appear first. This provided an additional check that the verbs
haven't been reordered by the compiler or linker. But that check is weak
and we can do a better check anyway. And this limitation is unexpected,
since we allow that for OPTIONs. The code should all work without an
unnamed group, once this assertion is removed.
Daan De Meyer [Tue, 12 May 2026 19:54:06 +0000 (21:54 +0200)]
syscall: add kexec_file_load to the generated override header
This makes __NR_kexec_file_load available on architectures where the kernel
UAPI headers don't define it, matching the runtime fallback path in
src/libc/kexec.c which is gated on #ifdef __NR_kexec_file_load.
A follow-up to the AddStorage / RemoveStorage series. ReplaceStorage
swaps the *backing file* of an already-attached storage device on a
running vmspawn-managed VM, leaving the guest-visible device frontend
(virtio-blk, virtio-scsi, nvme, scsi-cd) and every other property of
the device untouched. The intended use is to point an existing disk
at a new image without the guest seeing a hot-unplug/hot-plug cycle.
The signature mirrors AddStorage minus the 'config' field: the
device frontend doesn't change, only the backing behind it. Read-
only / read-write is derived from the new fd's O_ACCMODE; scsi-cd is
forced read-only to match the boot-time policy. S_ISBLK on the new
fd selects host_device vs file driver, matching AddStorage.
The QMP primitive is blockdev-reopen. It cannot change a file /
host_device node's 'filename' so we can't just point the existing
file node at a new fd, but it can swap a format node's 'file' child
to a different existing monitor-owned node by node-name reference
(case 3 in qemu/qapi/block-core.json:5034-5040). The chain is:
add-fd (host fd → new fdset)
blockdev-add (new file node, filename=/dev/fdset/N — fd-only)
remove-fd (release monitor's ref; new file holds the dup)
blockdev-reopen (format node, file = new file node-name)
blockdev-del (old file node; its dup release frees old fdset)
The reopen options must restate every option the original blockdev-
add emitted on the format node — blockdev-reopen resets any
unspecified option to its driver default. The 'file' field is a
node-name string reference, never a path.
No new errors and no new IDL types beyond the method itself;
everything is built on the existing NoSuchStorage / StorageImmutable
/ NotConnected / EBUSY vocabulary.
The series is:
vmspawn: split blockdev-add into separate file and format calls
Preparatory refactor. qemu/blockdev.c:3440 only marks the
top-level BDS returned by blockdev-add as monitor-owned;
inline children are NOT, so blockdev-del later rejects them
with "Node X is not owned by the monitor". Split into two
blockdev-add calls so the file node is independently
deletable. DriveInfo gains qmp_file_node_name and a
file_generation counter; the teardown helper deletes format
then file (file-first is rejected as "node used as 'file'
of Y"). The ephemeral path was already structured this way;
only the regular add path changes. Drops the now-unused
qmp_build_blockdev_add_inline().
shared/varlink-io.systemd.MachineInstance: add ReplaceStorage method
IDL only: ReplaceStorage(fileDescriptorIndex, name). No new
errors.
vmspawn: implement io.systemd.MachineInstance.ReplaceStorage
vmspawn_qmp_replace_block_device() entry point, ReplaceCtx
(refcounted, ReplaceCtxStateFlags for partial-state tracking)
and four async callbacks plus an idempotent replace_fail.
file_generation is bumped before issuing blockdev-add so
retries don't collide on node-name.
BLOCK_DEVICE_STATE_REPLACE_PENDING gates concurrent
Replace / Remove on the same drive. On reopen success the
trailing blockdev-del of the old file node fires from the
reopen callback; its failure logs a warning and still replies
success (the swap already committed; the orphan resolves at VM
exit). QMP disconnect mid-replace routes via
qmp_client_fail_pending → replace_fail → NotConnected.
test: integration test for io.systemd.MachineInstance.ReplaceStorage
TEST-87-AUX-UTILS-VM.replace-storage covers happy-path replace,
successive replaces (file_generation rotation), StorageImmutable
rejection on the boot-time drive, NoSuchStorage on unknown
names, InvalidParameter on malformed names, and clean
RemoveStorage after a replace (proves the new file node is
monitor-owned and the teardown order works). Backing files are
passed via 'varlinkctl --push-fd'; no machinectl front-end is
added in this round.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
DHCP option 120 (SIP server) takes a list of addresses or
domain names, and the first byte of the data indicates which type is
stored. Let's extend _addresses() and _domains() to make them support
the SIP server option.
Frantisek Sumsal [Tue, 12 May 2026 15:09:41 +0000 (17:09 +0200)]
sd-bus: handle non-string keys in dictionaries in JSON dump
JSON only supports string keys in objects, but D-Bus specification is a
bit more lenient and allows dict entries to have any basic type as key.
Let's stringify allowed non-string keys so that we can represent them as
JSON objects.
Relevant snippet from the D-Bus specification:
A DICT_ENTRY works exactly like a struct, but rather than parentheses
it uses curly braces, and it has more restrictions. The restrictions
are: it occurs only as an array element type; it has exactly two
single complete types inside the curly braces; the first single
complete type (the "key") must be a basic type rather than a container
type. Implementations must not accept dict entries outside of arrays,
must not accept dict entries with zero, one, or more than two fields,
and must not accept dict entries with non-basic-typed keys. A dict
entry is always a key-value pair.
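
A minimal sketch of stringifying a basic-typed key so it can serve as a JSON object name. The `dict_key` type and the chosen text formats (decimal integers, "true"/"false" for booleans) are assumptions made for illustration; the commit message does not spell out the exact representation sd-bus uses.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

enum key_type { KEY_STRING, KEY_INT64, KEY_UINT64, KEY_BOOLEAN };

struct dict_key {
        enum key_type type;
        union {
                const char *s;
                int64_t i;
                uint64_t u;
                bool b;
        } v;
};

/* Writes the object name for a dict key into buf.
 * Returns 0 on success, -1 if the type is unknown or buf is too small. */
static int dict_key_to_json_name(const struct dict_key *k, char *buf, size_t size) {
        int n;

        assert(k && buf && size > 0);

        switch (k->type) {
        case KEY_STRING:
                n = snprintf(buf, size, "%s", k->v.s);
                break;
        case KEY_INT64:
                n = snprintf(buf, size, "%" PRIi64, k->v.i);
                break;
        case KEY_UINT64:
                n = snprintf(buf, size, "%" PRIu64, k->v.u);
                break;
        case KEY_BOOLEAN:
                n = snprintf(buf, size, "%s", k->v.b ? "true" : "false");
                break;
        default:
                return -1;
        }

        return (n < 0 || (size_t) n >= size) ? -1 : 0;
}
```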
Yaping Li [Sun, 10 May 2026 14:50:13 +0000 (14:50 +0000)]
logind: zero-initialize dispatch struct in vl_method_release_session()
The local struct passed to sd_varlink_dispatch() was not
zero-initialized. Since sd_json_dispatch_full() does not call handlers
for absent optional fields, p.id could be left indeterminate when
the client omits the Id parameter, leading to use of uninitialized
memory.
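
The failure mode and the fix can be shown with a small model. `model_dispatch()` is a hypothetical stand-in for a dispatcher that only writes fields present in the request; it is not the sd_varlink_dispatch() API.

```c
#include <assert.h>
#include <stddef.h>

struct release_params {
        const char *id; /* optional: must read as NULL when the client omits it */
};

/* Stand-in for a dispatcher that only touches fields present in the request. */
static void model_dispatch(struct release_params *p, const char *id_or_null) {
        assert(p);

        if (id_or_null)
                p->id = id_or_null;
}

static const char *model_release_session_id(const char *id_or_null) {
        /* Zero-initialize so that an absent field is NULL rather than indeterminate. */
        struct release_params p = { 0 };

        model_dispatch(&p, id_or_null);
        return p.id;
}
```

Without the initializer, `p.id` would hold whatever happened to be on the stack whenever the field is absent.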
Luca Boccassi [Thu, 7 May 2026 19:02:57 +0000 (20:02 +0100)]
core: do not leak resources when handling stale alias state on reload
The fix for the corrupted state when units become aliased on reload
leaks the now-aliased unit's resources, which become untracked and
essentially lost.
While fixing the state corruption is of course necessary, leaking
processes/etc. is not ideal for a system and service manager, so
instead attempt to keep track of them by creating stub units
on-the-fly as replacements.
This way resources are not leaked, there are clear indications of
where they moved, and all state can be tracked as expected.
Ignore timers/sockets/paths as those are internal resources/triggers.
report-cgroup: use errno_or_else in one more place
Old gcc is confused about initialization:
In function ‘io_read_send’,
inlined from ‘walk_cgroups’ at ../src/report/report-cgroup.c:288:24:
../src/report/report-cgroup.c:167:21: error: ‘values[0]’ may be used uninitialized [-Werror=maybe-uninitialized]
167 | r = metric_build_send_unsigned(mf + i, link, unit, values[i], /* fields= */ NULL);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I picked the fields that contain useful information about the specific
version/image/variant/experiment/flavour of the system. Also, either
NAME or PRETTY_NAME is included. This one is intended for human readers
to be able to identify the OS version easily.
report: drop MetricsFamilyContext, CGroupContext, CGroupInfo
Previously, we passed around information about the MetricFamily entries
and the varlink connection in a helper structure. Having a hybrid of
const static and runtime stuff is iffy. Let's simplify things by passing
two separate parameters.
Also, in report-cgroup.c we built a cache of parsed values. This
requires additional storage and introduces complexity when
dealing with population of the cache at the appropriate time.
This cache is not useful: for each cgroup, we generate a list of
metrics, and we have all the information at hand. The only reason
why we'd create the cache and not generate all the relevant replies
at once was that the helper functions called the .generate function
for each MetricFamily separately.
The MetricFamily interface is changed, so that metrics can be
defined without a .generate function. This is understood to mean
that the preceding metric family's .generate function will also
generate this family. This allows us to define related metrics
nicely in a table:
{ METRIC_IO_SYSTEMD_CGROUP_PREFIX "CpuUsage", generate_func },
{ METRIC_IO_SYSTEMD_CGROUP_PREFIX "IOReadBytes", NULL },
{ METRIC_IO_SYSTEMD_CGROUP_PREFIX "IOReadOperations", NULL },
{ METRIC_IO_SYSTEMD_CGROUP_PREFIX "SomethingElse", generate_func2 },
...
When implementing .Describe, we list all the families. When implementing
.List, we only call those with .generate, and we get the same results
as before.
This allows the .generate functions to be simplified: instead of
keeping state, they just spit out all the metrics for a given
object in a tight loop.
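
The table convention can be sketched as follows. The structure, the shortened metric names, and the helper functions are hypothetical and only model the described behaviour: describing walks every entry, while listing calls only the entries that carry a generator.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*generate_fn)(void *userdata);

struct metric_family_model {
        const char *name;
        generate_fn generate; /* NULL: emitted by the preceding family's generator */
};

/* Sample generator that just counts how often it was invoked. */
static int count_call(void *userdata) {
        (*(int *) userdata)++;
        return 0;
}

static const struct metric_family_model model_table[] = {
        { "CpuUsage",         count_call },
        { "IOReadBytes",      NULL       },
        { "IOReadOperations", NULL       },
        { "SomethingElse",    count_call },
};
#define MODEL_TABLE_SIZE (sizeof(model_table) / sizeof(model_table[0]))

/* Describe: every family is reported, with or without a generator. */
static size_t model_describe(const struct metric_family_model *t, size_t n) {
        size_t described = 0;

        for (size_t i = 0; i < n; i++)
                if (t[i].name)
                        described++;
        return described;
}

/* List: only entries with a generator are called. */
static int model_list(const struct metric_family_model *t, size_t n, void *userdata) {
        /* The first entry has no predecessor, so it needs its own generator. */
        assert(n == 0 || t[0].generate);

        for (size_t i = 0; i < n; i++) {
                int r;

                if (!t[i].generate)
                        continue;
                r = t[i].generate(userdata);
                if (r < 0)
                        return r;
        }
        return 0;
}
```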
varlink-io.systemd.MachineInstance,vmspawn: treat AddStorage/RemoveStorage name as opaque
The 'name' field on AddStorage and RemoveStorage was documented as
'<provider>:<volume>' and enforced via machine_storage_name_split() at
the varlink boundary. That form is only the convention machinectl
inherits from the StorageProvider routing path; the API itself only
needs a unique identifier the caller can re-use to detach the binding.
Drop the strict format check, require only a non-empty string, and
update the IDL docs to describe the field as a caller-supplied
identifier with machinectl's convention as a non-normative example.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
vmspawn: reject O_PATH and O_WRONLY fds in AddStorage
An fd opened O_PATH cannot be read, and an O_WRONLY fd cannot serve as
a backing file for a virtual disk image. Reject both at the bind-volume
entry point with -EBADF instead of letting the request proceed to QMP
where QEMU's file backend would fail to read from the fd. The
ReplaceStorage entry point grew the same checks in parallel.
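A check of this kind can be sketched with fcntl(F_GETFL). The helper name fd_verify_readable() is hypothetical and the snippet is an illustration under that assumption, not the vmspawn code.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* needed for O_PATH with glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if the fd can be read from, -EBADF if it
 * was opened with O_PATH or O_WRONLY, or -errno if fcntl() itself fails. */
static int fd_verify_readable(int fd) {
        int flags = fcntl(fd, F_GETFL);
        if (flags < 0)
                return -errno;

        /* An O_PATH fd only names a file; it cannot be read. */
        if (flags & O_PATH)
                return -EBADF;

        /* A write-only fd cannot back a disk image that must also be read. */
        if ((flags & O_ACCMODE) == O_WRONLY)
                return -EBADF;

        return 0;
}
```

Doing the check before any QMP traffic means the caller gets an immediate error instead of a late failure from the file backend.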
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
test: integration test for io.systemd.MachineInstance.ReplaceStorage
Modelled on TEST-87-AUX-UTILS-VM.bind-volume.sh. Boots vmspawn with
one boot-time bind-volume, hot-adds a runtime volume via machinectl
bind-volume, then exercises ReplaceStorage:
1. happy-path replace of a runtime drive
2. successive replace (verify file_generation rotation — no
node-name collisions on the second swap)
3. replace of the boot-time drive must fail with StorageImmutable
4. replace of an unknown name must fail with NoSuchStorage
5. invalid name (no provider:volume separator) must fail with
InvalidParameter
6. unbind-volume after replace must succeed — proves the new file
node is monitor-owned and the format-then-file teardown order
in vmspawn_qmp_block_device_teardown() correctly cleans up both
blockdev nodes
Pushes the new backing file via varlinkctl --push-fd; the file is a
plain truncate'd image. Auto-discovered by run_subtests in
TEST-87-AUX-UTILS-VM.sh.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Wire up the runtime hot-swap Varlink method. The signature mirrors
AddStorage minus 'config': the device frontend (virtio-blk,
virtio-scsi, nvme, scsi-cd) doesn't change, only the backing file
behind it. Read-only/read-write may flip based on the new fd's
O_ACCMODE; scsi-cd is forced read-only to match the boot-time policy.
add-fd → on_replace_observe_stage
blockdev-add (new file) → on_replace_blockdev_add_complete
remove-fd (new fdset) → on_replace_observe_stage
blockdev-reopen (format) → on_replace_blockdev_reopen_complete
[commit + fire trailing del]
blockdev-del (old file) → on_replace_old_blockdev_del_complete
The reopen options must be a superset of every option that
qmp_build_blockdev_add_format() may emit, otherwise reopen fails with
'Cannot reset option X to default'. The 'file' field is a string
reference to the new file node — case 3 of the schema in
qemu/qapi/block-core.json:5034-5040 ("the current child is replaced
with that other node"). The format node's qmp_node_name is preserved
so the device frontend's drive=<X> binding does not move.
ReplaceCtx tracks the per-call state with a refcount mirroring the
add-stage drive-info pattern. On any pre-commit failure replace_fail
tears down whatever new-side state we created on the wire and replies
on drive->link via reply_qmp_error (disconnect → NotConnected). On
post-commit del failure we log a warning, leak the orphan, and reply
success — the swap itself succeeded and the leak resolves at VM exit.
file_generation is bumped before issuing blockdev-add so failed
attempts cannot collide on node-name when the user retries.
Errors:
NoSuchStorage - drive not in the registry
StorageImmutable - drive lacks QMP_DRIVE_REMOVABLE (boot-time)
EBUSY - add still pending or another replace/remove in flight
NotConnected - QMP transport disconnect during the chain
EIO - QEMU rejected blockdev-reopen
Also gates RemoveStorage on REPLACE_PENDING so a device_del cannot
race a mid-flight blockdev-reopen on the same drive.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Define the IDL for io.systemd.MachineInstance.ReplaceStorage, a
runtime hot-swap of an already-attached storage volume's backing
file. The signature mirrors AddStorage minus the 'config' field
because the device frontend (virtio-blk, virtio-scsi, nvme, scsi-cd)
does not change — only the backing file behind it.
The implementation lives in vmspawn (next commit) and uses QMP
blockdev-reopen to swap the file child of the existing format node.
The reused error vocabulary (NoSuchStorage, StorageImmutable,
NotConnected, plus the generic errno path) covers every failure
mode; no new errors are added.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
vmspawn: split blockdev-add into separate file and format calls
The current vmspawn_qmp_add_block_device() emits a single blockdev-add
that combines the format-level node ("vmspawn-N-storage") with an
inline file child. QEMU's qmp_blockdev_add() only marks the top-level
returned BDS as monitor-owned (qemu/blockdev.c:3440); inline children
are NOT, so qmp_blockdev_del() rejects them with "Node X is not owned
by the monitor" (qemu/blockdev.c:3513-3517).
To prepare for ReplaceStorage — which needs to swap the file child of
an existing format node via blockdev-reopen, and then blockdev-del the
old file node — make the file node monitor-owned by issuing it as its
own blockdev-add call. The 4-stage add chain becomes 5 stages:
DriveInfo gains qmp_file_node_name ("vmspawn-N-file-G", G a generation
counter bumped on every replace), file_generation, and a stashed
fdset_id so future ReplaceStorage can target both for cleanup.
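The naming scheme can be sketched as below. Only the "vmspawn-N-file-G" pattern comes from the description above; the struct layout, the helper name, and the assumption that the generation starts at zero are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, reduced stand-in for the per-drive state. */
typedef struct DriveInfoSketch {
        unsigned index;           /* the N in "vmspawn-N-file-G" */
        unsigned file_generation; /* the G, bumped on every replace attempt */
} DriveInfoSketch;

/* Bump the generation first, then format the name for the next file node.
 * Bumping before use means a failed attempt never reuses a node name.
 * Returns 0 on success, -1 if the buffer is too small. */
static int drive_next_file_node_name(DriveInfoSketch *d, char *buf, size_t size) {
        assert(d);
        assert(buf);

        d->file_generation++;

        int n = snprintf(buf, size, "vmspawn-%u-file-%u", d->index, d->file_generation);
        if (n < 0 || (size_t) n >= size)
                return -1;

        return 0;
}
```

The name is written into a caller-provided buffer so that the name of the currently attached file node stays available for the later blockdev-del.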
vmspawn_qmp_block_device_teardown() now deletes both nodes in order —
format first, then file — because the format holds a strong reference
to its file child and a file-first del is rejected with "Node X is
busy: node is used as 'file' of Y".
Folds bridge->features VMSPAWN_QMP_FEATURE_IO_URING into the file
node's flags so the new path inherits io_uring just like the old
inline form did. The format-level options (read-only, discard,
discard-no-unref) are unchanged.
The ephemeral path is structurally already separate file+format with
monitor-owned children; no behavioural change there beyond the
on_add_blockdev_stage → on_add_format_node_stage rename.
Drops the now-unused qmp_build_blockdev_add_inline() helper.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
loginctl: move options and verbs to match order in --help
First, "output modifier" options --no-pager/--no-legend/--no-ask-password are
moved to the end next to --output and --json. I think it makes sense to group
them. Then the implementing code is reordered to match the order in --help.
Daan De Meyer [Tue, 12 May 2026 20:24:51 +0000 (22:24 +0200)]
repart: Add BtrfsReplace= (#41109)
This is a series of commits which adds a feature needed by the GNOME OS
installer. It was shown during an All Systems Go 2025 talk:
https://cfp.all-systems-go.io/all-systems-go-2025/talk/QRJVL3/
To sum up, this PR first changes systemd-repart to use BLKPG
partitions instead of loop devices when possible. We then always need to
rescan the partitions, to try to remove them if the operation failed. We
allow encrypted partitions to stay activated under a chosen name. And we
add a new partition setting, `BtrfsReplace=`.
Note that "replace" comes from the command `btrfs replace`. But in the
case of systemd-repart, maybe "inplace" or "move" would make more sense.
I am open to suggestions.
If it is better I can split this into several PRs.
The commits:
## repart: Reuse the backing fd for fdisk
Because fdisk_assign_device tries to open block devices with O_EXCL,
when it does it blocks cryptsetup from using partition block devices for
the same disk.
Since we already have a file descriptor for the device, we can just
share it and use fdisk_assign_device_by_fd instead.
## repart: Use blkpg partitions instead of loop devices when possible
We will want to allow future features to keep some devices mounted or
active. So in order to avoid leaving a mess of many loop devices, we can
just use the partition block device directly.
## repart: Rescan disk on failure if we create blkpg partitions on the fly
Since we did not write the partition table, then the created partitions
should get removed on error.
## repart: Allow keeping luks2 volumes opened
## repart: Add BtrfsReplace=
BtrfsReplace=/mntpnt will move the btrfs filesystem from the mount point to
the newly created partition. After moving, the filesystem is resized to take
up the whole partition.
This is useful for OS installers that move a live system into a disk and
do not require a reboot.
## repart: Add VolumeName=
When a luks2 device mapper is to be kept alive after execution of
systemd-cryptsetup, the name of the volume will be taken from this
value.
Daan De Meyer [Tue, 12 May 2026 13:03:49 +0000 (13:03 +0000)]
vmspawn: Prefer systemd-journal-remote from $PATH
$PATH might point to a systemd checkout containing
a newer version of systemd-journal-remote which we
should use, hence prefer an executable from $PATH
over the one from /usr/lib/systemd.
Luca Boccassi [Tue, 12 May 2026 11:40:54 +0000 (12:40 +0100)]
test: make TEST-75-RESOLVED robust against journald metadata race
Even after switching the wait loop to a polling `journalctl --grep`, the
test still fails intermittently because the very first messages emitted by
the freshly-spawned systemd-networkd-wait-online process can carry stale
journald metadata. journald associates `_SYSTEMD_UNIT=` (and friends) with
each entry by reading `/proc/$pid/cgroup` of the originating PID; if those
messages are produced before journald notices the cgroup migration into the
new service, they get tagged with `_SYSTEMD_UNIT=init.scope`. The
`-u $unit` filter then fails to match them.
Capture a journal cursor before launching the unit, and grep using
`--after-cursor=` plus `SYSLOG_IDENTIFIER=systemd-networkd-wait-online`
instead of `-u $unit`. SYSLOG_IDENTIFIER is set by the program itself, so
it's not subject to the cgroup-discovery race. The cursor bounds the search
to entries produced by this invocation, so prior wait-online runs in
earlier testcases don't interfere.
Logs from the failing run showing the messages exist but are tagged with
the wrong unit:
_SYSTEMD_CGROUP=/init.scope
_SYSTEMD_UNIT=init.scope
_EXE=/usr/lib/systemd/systemd-executor
_CMDLINE=/usr/lib/systemd/systemd-executor --deserialize 68 ...
SYSLOG_IDENTIFIER=systemd-networkd-wait-online
MESSAGE=dns0: No DNS server is accessible.
Daan De Meyer [Tue, 12 May 2026 10:26:37 +0000 (12:26 +0200)]
json-stream: tolerate truncated SCM_RIGHTS on inbound messages
When an LSM (e.g. SELinux) denies an fd transfer or the receiver hits
RLIMIT_NOFILE, the kernel drops the fd(s) from the SCM_RIGHTS cmsg and
sets MSG_CTRUNC on the recvmsg(). recvmsg_safe() turns that into
-ECHRNG, which causes json_stream_read() to discard the data bytes
that were nevertheless received and the varlink server to silently
tear down the connection — leaving the caller waiting for a reply
that never comes.
Inline the recvmsg() call instead and, on MSG_CTRUNC, drop the partial
fds but keep the message data. The method handler will surface a clean
-ENXIO when it tries to peek the missing fd, which sd-varlink wraps as
io.systemd.System for the peer, instead of a hang. This matches the
recent sd-bus fix in 6c8de404c9 ('sd-bus: allow receiving messages with
MSG_CTRUNC set').
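The idea of keeping the payload while dropping a truncated fd set can be sketched as follows. The helper name and its single-fd interface are assumptions for illustration; this is not the json-stream implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: receive one message and at most one fd. If the kernel
 * flags MSG_CTRUNC, any fds that did arrive are closed and *ret_fd is -1, but
 * the data bytes are still returned. Returns bytes read or -errno. */
static ssize_t recv_keep_data_on_ctrunc(int sock, void *buf, size_t size, int *ret_fd) {
        union {
                struct cmsghdr hdr;
                char space[CMSG_SPACE(sizeof(int))];
        } control;
        struct iovec iov = { .iov_base = buf, .iov_len = size };
        struct msghdr mh = {
                .msg_iov = &iov,
                .msg_iovlen = 1,
                .msg_control = &control,
                .msg_controllen = sizeof(control),
        };
        int fd = -1;

        assert(ret_fd);

        ssize_t n = recvmsg(sock, &mh, MSG_CMSG_CLOEXEC);
        if (n < 0)
                return -errno;

        bool truncated = mh.msg_flags & MSG_CTRUNC;

        for (struct cmsghdr *c = CMSG_FIRSTHDR(&mh); c; c = CMSG_NXTHDR(&mh, c)) {
                if (c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
                        continue;

                size_t n_fds = (c->cmsg_len - CMSG_LEN(0)) / sizeof(int);
                for (size_t i = 0; i < n_fds; i++) {
                        int received;

                        memcpy(&received, CMSG_DATA(c) + i * sizeof(int), sizeof(int));
                        if (truncated || fd >= 0)
                                close(received); /* partial set, or more than we asked for */
                        else
                                fd = received;
                }
        }

        *ret_fd = fd;
        return n;
}
```

A caller that needs the fd then sees -1 and can report an error to the peer, rather than the connection being torn down without a reply.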
Valentin David [Thu, 12 Mar 2026 22:14:52 +0000 (23:14 +0100)]
repart: Add BlockDeviceReplace=
BlockDeviceReplace=/mntpnt will move the btrfs filesystem from the mount point
to the newly created partition. After moving, the filesystem is resized to take
up the whole partition.
This is useful for OS installers that move a live system into a disk and
do not require a reboot.
Valentin David [Thu, 12 Mar 2026 22:14:34 +0000 (23:14 +0100)]
repart: Use blkpg partitions instead of loop devices when possible
We will want to allow future features to keep some devices mounted or
active. So in order to avoid leaving a mess of many loop devices, we can
just use the partition block device directly.
Valentin David [Thu, 12 Mar 2026 22:14:23 +0000 (23:14 +0100)]
repart: Reuse the backing fd for fdisk
Because fdisk_assign_device tries to open block devices with O_EXCL, when it
does it blocks cryptsetup from using partition block devices for the same
disk.
Since we already have a file descriptor for the device, we can just share it
and use fdisk_assign_device_by_fd instead.
This requires at least libfdisk 2.35 (part of util-linux) which was
released in 2020.
core: when skipping state deserializing units, also skip job subsections (#41957)
If a unit has active jobs, when it gets serialized there are job
subsections, each with their own empty line marker. The skipping
function ignores this and skips until the marker, but then leaves
the job in place, breaking deserialization.
Consume jobs subsections too.
This shows up now that there's TEST-07-PID1.alias-corruption,
which occasionally fails when the aliased unit happens to
still have a job when the reexec happens.
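The skipping logic can be sketched like this. It assumes a line-based format in which a unit section ends with an empty line and each job subsection is introduced by a line reading "job" and ends with its own empty line; the helper name and the field names used to exercise it are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: skip one serialized unit section, including any job
 * subsections. Returns 0 once the unit's end marker has been consumed, or -1
 * if the stream ends first. */
static int skip_unit_section(FILE *f) {
        char line[256];

        assert(f);

        while (fgets(line, sizeof line, f)) {
                line[strcspn(line, "\n")] = '\0';

                if (line[0] == '\0')
                        return 0; /* empty line: end of the unit section */

                if (strcmp(line, "job") != 0)
                        continue; /* ordinary field, skip it */

                /* Job subsection: consume up to and including its own empty
                 * line, so it is not mistaken for the end of the unit. */
                while (fgets(line, sizeof line, f)) {
                        line[strcspn(line, "\n")] = '\0';
                        if (line[0] == '\0')
                                break;
                }
        }

        return -1; /* hit EOF before the end marker */
}
```

Without the inner loop, the job's empty line would end the skip early and the remaining fields would be parsed as if they belonged to the next unit.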
Implement Path/Scope/Swap/Timer Context/Runtime for `io.systemd.Unit.List` (#41980)
The PR implements the following objects + tests for
io.systemd.Unit.List:
* PathContext
* PathRuntime
* ScopeContext
* ScopeRuntime
* SwapContext
* SwapRuntime
* TimerContext
* TimerRuntime
It's a continuation of the following PRs:
* https://github.com/systemd/systemd/pull/37432
* https://github.com/systemd/systemd/pull/37646
* https://github.com/systemd/systemd/pull/38032
* https://github.com/systemd/systemd/pull/38212
* https://github.com/systemd/systemd/pull/39391
Daan De Meyer [Tue, 12 May 2026 11:19:18 +0000 (13:19 +0200)]
btrfs-util: clear RDONLY flag on subvolume before destroy ioctl
Without CAP_SYS_ADMIN, btrfs_ioctl_snap_destroy() runs an
inode_permission(MAY_WRITE) check against the target subvolume root, which
btrfs_permission() rejects with EROFS for a read-only subvolume. As a
result, unprivileged removal of a read-only subvolume fails — both via
btrfs_subvol_remove_at() directly and via the recursive cleanup path used
by rm_rf_subvolume_and_freep(), which propagates the EROFS up.
Detect EROFS after the destroy ioctl, clear the RDONLY flag (only inode
ownership is required for BTRFS_IOC_SUBVOL_SETFLAGS), and retry once.
While at it, fix the surrounding comments: BTRFS_IOC_SNAP_DESTROY drops the
entire subvolume tree, so regular files inside are irrelevant; ENOTEMPTY
from the ioctl indicates nested subvolumes (BTRFS_ROOT_REF_KEY entries) via
may_destroy_subvol(), not non-empty contents.
journalctl: move handling of --smart-relinquish-var to action logic
The help strings for --smart-relinquish-var and --relinquish-var
were in reversed order because of the _fallthrough_.
We would resolve the conditions for "smart relinquish" immediately
in parse_argv() and call 'return 0' if the conditions were wrong,
terminating option parsing and the program. It seems nicer to delay
action until later. This makes the logic flow more standard. This
also allows the option parsing cases to be exchanged, fixing the
issue with --help.
Two namespaces are used: "journalctl" and "journalctl-varlink". Help for
--user/--system in the latter is added, even though it is not used yet.
I think it'll be good to have this for introspection.
The four FSS-related options (--interval, --verify-key, --force,
--setup-keys) unfortunately each gain an inline #if HAVE_GCRYPT / #else;
the EOPNOTSUPP fallback is duplicated four times.
The metavar for --identifier/--exclude-identifier is changed to "ID"
to make the layout nicer. (And because that seems to make more sense.)
Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>
We have different help strings for --user/--system in different places, so this
only covers a subset of --system/--user instances. But this particular help
seems to be the most widely used.
(In a few cases, the help string is fixed: it should be "system mode", not
"per-system mode".)
journalctl: reorder parse_argv() cases to match --help
Pure reordering. ARG_SMART_RELINQUISH_VAR is kept immediately before
ARG_RELINQUISH_VAR because of the existing _fallthrough_; that's the
only deviation from strict --help order.
Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>
Yu Watanabe [Sun, 22 Mar 2026 08:00:33 +0000 (17:00 +0900)]
dhcp: use TLV object to manage extra and vendor options
Note that previously a new option replaced any existing option with the same
option code. But a DHCP message can have multiple options with the same option
code. Hence, this makes the conf parser append the new option instead of
replacing the old one.
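Append semantics for options sharing a code can be sketched as below. The TlvOption/TlvList types and the helper names are hypothetical and do not reflect the actual TLV object used by the DHCP code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical TLV entry: option code, payload length, payload bytes. */
typedef struct TlvOption {
        uint8_t code;
        uint8_t length;
        uint8_t *data;
} TlvOption;

typedef struct TlvList {
        TlvOption *items;
        size_t n_items;
} TlvList;

/* Always appends, even if an option with the same code already exists. */
static int tlv_append(TlvList *l, uint8_t code, const void *data, uint8_t length) {
        assert(l);
        assert(data || length == 0);

        uint8_t *copy = NULL;
        if (length > 0) {
                copy = malloc(length);
                if (!copy)
                        return -ENOMEM;
                memcpy(copy, data, length);
        }

        TlvOption *grown = realloc(l->items, (l->n_items + 1) * sizeof(TlvOption));
        if (!grown) {
                free(copy);
                return -ENOMEM;
        }

        l->items = grown;
        l->items[l->n_items++] = (TlvOption) { .code = code, .length = length, .data = copy };
        return 0;
}

/* Counts how many entries carry the given option code. */
static size_t tlv_count(const TlvList *l, uint8_t code) {
        size_t n = 0;

        for (size_t i = 0; i < l->n_items; i++)
                if (l->items[i].code == code)
                        n++;
        return n;
}

static void tlv_done(TlvList *l) {
        for (size_t i = 0; i < l->n_items; i++)
                free(l->items[i].data);
        free(l->items);
        *l = (TlvList) { 0 };
}
```

Appending the same code twice leaves both entries in the list, in the order they were configured.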