NEWS: pre-announce removal of /run/boot-loader-entries/ support in logind (#41622)
logind could read UAPI.1 Boot Loader Spec entries from
/run/boot-loader-entries/ in addition to ESP/XBOOTLDR. This was pretty
half-assed, and to my knowledge was never actually used much.
Let's remove support for it and simplify our codebase.
Let's schedule it for removal via NEWS in a future version, to give
people a chance to speak up.
journal-upload: also disable VERIFYHOST when --trust=all is used
When --trust=all disables CURLOPT_SSL_VERIFYPEER, the residual
CURLOPT_SSL_VERIFYHOST check is ineffective since an attacker can
present a self-signed certificate with the expected hostname. Disable
both for consistency and log that server certificate verification is
disabled.
machined: pass user as positional argument in machine_default_shell_args()
Instead of interpolating the user name directly into the sh -c script
body via asprintf %s, pass it as a positional parameter ($1) in a
separate argv entry. This avoids the user string being parsed as part
of the shell script syntax.
Also validate the user name in bus_machine_method_open_shell() with
valid_user_group_name(), matching the validation already done on the
Varlink path via json_dispatch_const_user_group_name().
logind: reject wall messages containing control characters
method_set_wall_message() and the property setter only checked the
message length but not its content. Since wall messages are broadcast
to all TTYs, control characters in the message could interfere with
terminal state. Reject messages containing control characters other
than newline and tab.
core: add missing SELinux access checks when listing units
Add mac_selinux_unit_access_check_varlink() to the unit enumeration
loop in vl_method_list_units(), silently skipping units the caller
is not permitted to see, matching the D-Bus ListUnits behavior.
Add mac_selinux_access_check_varlink() to vl_method_describe_manager().
- Use persist-credentials: false for actions/checkout, so we don't
leak the github token credentials to subsequent jobs.
- Remove one / from the Edit/Write permissions. Currently, with the
absolute path from github.workspace, we expand to three slashes while
we only need two.
Kai Lüke [Mon, 13 Apr 2026 12:21:39 +0000 (21:21 +0900)]
vmspawn: Support RUNTIME_DIRECTORY again
In ccecae0efd ("vmspawn: use machine name in runtime directory path")
support for RUNTIME_DIRECTORY was dropped which makes it difficult to
run systemd-vmspawn in a service unit which doesn't have write access
to the regular /run but should use its own managed RUNTIME_DIRECTORY.
What worked before was --keep-unit --system but we can't use
XDG_RUNTIME_DIR and --user because then --keep-unit breaks which
we need because it can't create a scope as there is no session.
Switch back to runtime_directory which handles RUNTIME_DIRECTORY and
tells us whether we should use it as is without later cleanup or if we
need to use the regular path where we create and delete the directory
ourselves.
many: final final set of coccinelle check-pointer-deref tweaks (#41595)
I promised in https://github.com/systemd/systemd/pull/41426 that it was the
final update for the coccinelle pointer-deref checks. However, it turned out
there is this coccinelle/parsing_hacks.h that I wasn't aware of. The
file was missing some important things like _cleanup_(x), which prevented
coccinelle from checking a bunch of functions.
This PR adds some missing defines to parsing_hacks.h and adds the
missing asserts(). I apologize that it's a bit long (and frankly boring)
and that I missed this earlier.
The last commit contains one small behavior change (ret in
sd_varlink_idl_parse() is now really optional) but the big one is very
mechanical.
This is useful when moving from `--pty` or `--pipe` to using
`--verbose`: you can use `--verbose-output=cat` to get similar output on
stdout while still having all of the advantages of `--verbose` over the
other options.
stat-util: always check S_ISDIR() before S_ISLNK()
Check S_ISDIR() before S_ISLNK() in all stat_verify_xyz() helpers that
check both, to ensure we systematically return the same errors.
Milan Kyselica [Sat, 11 Apr 2026 08:26:13 +0000 (10:26 +0200)]
boot: fix loop bound and OOB in devicetree_get_compatible()
The loop used the byte offset end (struct_off + struct_size) as the
iteration limit, but cursor[i] indexes uint32_t words. This reads
past the struct block when end > size_words.
Use size_words (struct_size / sizeof(uint32_t)) which is the correct
number of words to iterate over.
Also fix a pre-existing OOB in the FDT_BEGIN_NODE handler: the guard
i >= size_words is always false inside the loop (since the loop
condition already ensures i < size_words), so cursor[++i] at the
boundary reads one word past the struct block. Use i + 1 >= size_words
to check before incrementing.
Milan Kyselica [Sat, 11 Apr 2026 08:25:19 +0000 (10:25 +0200)]
boot: fix integer overflow and division by zero in BMP splash parser
Bound image dimensions before computing row_size to prevent overflow
in the depth * x multiplication on 32-bit. Without this, crafted
dimensions like depth=32 x=0x10000001 wrap to a small row_size that
passes all subsequent checks.
Reject channel masks where all bits are set (popcount == 32), since
1U << 32 is undefined behavior and causes division by zero on
architectures where it evaluates to zero. Move the validation before
computing derived values for clarity. Use unsigned 1U in shifts to
avoid signed integer overflow UB for popcount == 31.
journal: limit decompress_blob() output to DATA_SIZE_MAX (#41604)
We already have checks in place during compression that limit the data
we compress, so it shouldn't decompress to anything larger than
DATA_SIZE_MAX unless it's been tampered with. Let's make this
explicit and limit all our decompress_blob() calls in journal-handling
code to that limit.
One possible scenario this should prevent is when one tries to open and
verify a journal file that contains a compression bomb in its payload:
$ systemd-run --user --wait --pipe -- build-local/journalctl --verify --file=$PWD/test.journal
Running as unit: run-p682422-i4875779.service
000110: Invalid hash (00000000 vs. 11e4948d73bdafdd)
000110: Invalid object contents: Bad message
File corruption detected at /home/fsumsal/repos/@systemd/systemd/test.journal:272 (of 1249896 bytes, 0%).
FAIL: /home/fsumsal/repos/@systemd/systemd/test.journal (Bad message)
Finished with result: exit-code
Main processes terminated with: code=exited, status=1/FAILURE
Service runtime: 48.051s
CPU time consumed: 47.941s
Memory peak: 8G (swap: 0B)
Same could be, in theory, possible with just `journalctl --file=`, but
the reproducer would be a bit more complicated (haven't tried it, yet).
Lastly, the change in journal-remote is mostly hardening, as the maximum
input size to decompress_blob() there is mandated by MHD's connection
memory limit (set to JOURNAL_SERVER_MEMORY_MAX which is 128 KiB at the
time of writing), so the possible output size there is already quite
limited (e.g. ~800 - 900 MiB for xz-compressed data).
Daan De Meyer [Mon, 22 Dec 2025 10:22:34 +0000 (11:22 +0100)]
nspawn: Add --restrict-address-families= option
Add a new --restrict-address-families= command line option and
corresponding RestrictAddressFamilies= setting for .nspawn files to
restrict which socket address families may be used inside a container.
Many address families such as AF_VSOCK and AF_NETLINK are not
network-namespaced, so restricting access to them in containers
improves isolation. The option supports allowlist and denylist modes
(via ~ prefix), as well as "none" to block all families, matching the
semantics of RestrictAddressFamilies= in unit files.
The address family parsing logic is extracted into a shared
parse_address_families() helper in parse-helpers.c, which is now also
used by config_parse_address_families() in load-fragment.c.
This is currently opt-in. In a future version, the default will be
changed to restrict address families to AF_INET, AF_INET6 and AF_UNIX.
Daan De Meyer [Fri, 27 Mar 2026 22:03:14 +0000 (22:03 +0000)]
systemctl: replace kexec-tools dependency with direct kexec_file_load() syscall
Replace the fork+exec of /usr/bin/kexec in load_kexec_kernel() with a
direct kexec_file_load() syscall, removing the runtime dependency on
kexec-tools for systemctl kexec.
The kexec_file_load() syscall (available since Linux 3.17) accepts
kernel and initrd file descriptors directly, letting the kernel handle
image parsing, segment setup, and purgatory internally. This is much
simpler than the older kexec_load() syscall which requires complex
userspace setup of memory segments and boot protocol structures — that
complexity is the raison d'être of kexec-tools.
The implementation follows the established libc wrapper pattern: a
missing_kexec_file_load() fallback in src/libc/kexec.c calls the
syscall directly when glibc doesn't provide a wrapper (which is
currently always the case). The syscall is not available on all
architectures — alpha, i386, ia64, m68k, mips, sh, and sparc lack
__NR_kexec_file_load — so the wrapper and caller are guarded with
HAVE_KEXEC_FILE_LOAD_SYSCALL to compile cleanly everywhere.
When kexec_file_load() rejects the kernel image with ENOEXEC (e.g. the
image is compressed or wrapped in a PE container that the kernel's kexec
handler doesn't understand natively), we attempt to unwrap/decompress
and retry. This is effectively the same decompression and extraction
logic that already lives in src/ukify/ukify.py (maybe_decompress() and
get_zboot_kernel()), now implemented in C so that systemctl can handle
it natively without shelling out to external tools:
- Compressed kernels (Image.gz, Image.zst, Image.xz, Image.lz4): the
format is detected by magic bytes (per RFC 1952, RFC 8878,
tukaani.org xz spec, and lz4 frame format spec) and decompressed to
a memfd using the existing decompress_stream_*() infrastructure plus
the new gzip support from the previous commit. This is primarily
needed on arm64 where kexec_file_load() only accepts raw Image files.
On x86_64, bzImage is already the native format and works directly.
- EFI ZBOOT PE images (vmlinuz.efi): detected by "MZ" + "zimg" magic
at the start of the file. The compressed payload offset, size, and
compression type are read from the ZBOOT header defined in
linux/drivers/firmware/efi/libstub/zboot-header.S.
- Unified Kernel Images (UKI): detected as PE files with a .linux
section via the existing pe_is_uki() infrastructure. The .linux
section (kernel) and optionally .initrd section are extracted to
memfds. When a UKI provides an embedded initrd and the boot entry
doesn't specify one separately, the embedded initrd is used.
The try-first-then-decompress approach means we never decompress
unnecessarily: on x86_64 the first kexec_file_load() call succeeds
immediately with the raw bzImage, and on architectures where the
kernel's kexec handler natively understands PE (like LoongArch with
kexec_efi_ops), ZBOOT/UKI images work without decompression too.
If kexec_file_load() is unavailable (architectures without the syscall)
or all attempts fail, we fall back to forking+execing the kexec binary.
This preserves compatibility on architectures like i386 and mips where
only the older kexec_load() syscall exists and kexec-tools is needed to
handle the complex userspace setup.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
compress: rework decompressor_detect() on top of compression_detect_from_magic()
Replace the duplicated magic byte signatures in decompressor_detect()
with a call to the new compression_detect_from_magic() helper and use a
switch statement to initialize the appropriate decompression context.
time-util: encode our assumption that clock_gettime() never can return 0 or USEC_INFINITY
We generally assume that valid times returned by clock_gettime() are > 0
and < USEC_INFINITY. If this didn't hold, all kinds of things would
break, because we couldn't distinguish our niche values from regular
values anymore.
Let's hence encode our assumptions in C already, to help static
analyzers and LLMs.
One more round, this time with the help of the claudebot, especially for
spelunking in git blame to find the original commit and writing commit
messages from the list of warnings exported from Coverity.
Co-developed-by: Claude <claude@anthropic.com>
core: varlink enum for io.systemd.Unit interface (#40972)
Convert string fields to varlink enums in io.systemd.Unit
Following
https://github.com/systemd/systemd/pull/39391#discussion_r2489599449,
convert all configuration setting fields in the io.systemd.Unit varlink
interface from bare SD_VARLINK_STRING to proper enum types, adding type
safety to the IDL.
This converts ~30 fields across ExecContext, CGroupContext, and
UnitContext, adding 25 new varlink enum types.
Weak compatibility breakage (per
https://github.com/systemd/systemd/pull/40972#issuecomment-4222294318):
Varlink enum identifiers cannot contain - or +, so affected values are
underscorified on the wire. For example, "tty-force" becomes tty_force,
"kmsg+console" becomes kmsg_console.
Michael Vogt [Sun, 12 Apr 2026 13:47:48 +0000 (15:47 +0200)]
coccinelle: add SIZEOF() macro to work-around sizeof(*private)
We have code like `size_t max_size = sizeof(*private)` in three
places. This is evaluated at compile time, so it's safe to use. However,
the new pointer-deref checker in coccinelle is not smart enough to know
this and will flag those as errors. To avoid these false positives
we have some options:
1. Reorder so that we do:
```C
size_t max_size = 0;
assert(private);
max_size = sizeof(*private);
```
2. Use something like `size_t max_size = sizeof(*ASSERT_PTR(private));`
3. Place the assert before the declaration
4. Workaround coccinelle via SIZEOF(*private) that we can then hide
via parsing_hacks.h
5. Fix coccinelle (OCaml, hard)
6. ... something I missed?
None of these is very appealing. I went for (4), but I'm happy to hear
suggestions.
Michael Vogt [Sat, 11 Apr 2026 17:52:33 +0000 (19:52 +0200)]
sd-varlink: make ret optional in sd_varlink_idl_parse()
We have a test failure where the testsuite is calling
sd_varlink_idl_parse() with ret being NULL. This is now an
assert error, so we could either fix the test or fix the code.
Given that it seems genuinely useful to run sd_varlink_idl_parse()
without ret, e.g. to just check whether the IDL is valid, I opted to
fix the code.
test-json: add iszero_safe guards for float division at index 0 and 1
The existing iszero_safe guards at index 9 and 10 were added to
silence Coverity, but the same division-by-float-zero warning also
applies to the divisions at index 0 (DBL_MIN) and 1 (DBL_MAX).
debug-generator: assert breakpoint type is valid before bit shift
The BreakpointType enum includes _BREAKPOINT_TYPE_INVALID (-EINVAL),
so Coverity flags the bit shift as potentially using a negative shift
amount. Add an assert to verify the type is in valid range, since the
static table only contains valid entries.
uid-range: add assert to prevent underflow in coalesce loop
Coverity flags range->n_entries - j as a potential underflow
in the memmove size calculation. Add assert(range->n_entries > 0)
before decrementing n_entries, which holds since the loop condition
guarantees j < n_entries.
sd-varlink: scale down the limit of connections per UID to 128
1024 connections per UID is unnecessarily generous, so let's scale this
down a bit. D-Bus defaults to 256 connections per UID, but let's be even
more conservative and go with 128.
Michael Vogt [Tue, 31 Mar 2026 17:01:28 +0000 (19:01 +0200)]
tools: run check-coccinelle.sh with (updated) parsing_hacks.h
This commit runs the check-coccinelle checker scripts with
parsing_hacks.h. Because this was missing before, some
issues did not get flagged.
While at it, it also adds some missing cleanup attributes and
iterators to get better results. It's a bit sad that there is no
(easy/obvious) way to detect when new things are needed in
parsing_hacks.h.
Coverity was complaining that we were doing an integer division and then
casting the result to double. This was OK, but it was also a bit pointless.
An operation on a double and an unsigned promotes the unsigned to a double,
so it's enough to have a double somewhere as an argument early enough.
Drop the no-op casts and parens to make the formulas easier to read.
Sometimes we need to diff two unsigned numbers, which is awkward
because we need to cast them to something with a sign first if we want
to use abs(). Let's add a helper that avoids the function call
altogether.
Also drop unnecessary parens around args which are delimited by commas.
Coverity complains that r is overwritten. In fact it isn't, but
we shouldn't set it like this anyway. exec_with_listen_fds() already
logs, so we only need to call _exit() if it fails.
importctl: fix -N to actually clear keep-download flag
-N was clearing and re-setting the same bit in arg_import_flags_mask,
which is a no-op. It should clear the bit in arg_import_flags instead,
matching what --keep-download=no does via SET_FLAG().
shared/verbs: add _SCOPE variants of the verb macros
In some of the large programs, verbs are defined as non-static
functions. To support these cases, add variants of the VERB macros that
take an explicit scope parameter. The existing macros then call those
new macros with scope=static. The variant without static is the
exception, so the macros are "optimized" toward the static helpers.
I also considered allowing VERB macros to be used in different files,
i.e. in different compilation units. This would actually work without
too many changes, except for one caveat: the order in the array would be
unspecified, so we'd need to somehow order the verbs appropriately. This
most likely means that the verbs would need to be annotated with a
number. But that doesn't seem attractive at all: we'd need to coordinate
changes in different files. So just listing the verbs in one file seems
like the least bad option.
shared/options: add option to generate a help line for custom option format
Sometimes we want to document what -RR or -vv does or some other
special thing. Let's allow this by (ab-)using long_code pointer
to store a preformatted string.
json-stream: fix NULL pointer passed to memcpy on first read with INPUT_SENSITIVE
When JSON_STREAM_INPUT_SENSITIVE is set before the first read,
input_buffer is NULL, input_buffer_size is 0, and input_buffer_index
is 0. The old condition '!INPUT_SENSITIVE && index == 0' would route
this case into the else branch which calls memcpy() with a NULL source
pointer, which is undefined behavior even when the length is zero, and
is caught by UBSan.
Fix by checking input_buffer_index == 0 first, then allowing the
GREEDY_REALLOC fast path also when input_buffer_size == 0, since
there is no sensitive data to protect from realloc() copying in that
case. The else branch is now only entered when there is actual data
to copy (input_buffer_size > 0), guaranteeing input_buffer is
non-NULL.
core: fix EBUSY on restart and clean of delegated services
When a service is configured with Delegate=yes and DelegateSubgroup=sub,
the delegated container may write domain controllers (e.g. "pids") into the
service cgroup's cgroup.subtree_control via its cgroupns root. On container
exit the stale controllers remain, and on service restart clone3() with
CLONE_INTO_CGROUP fails with EBUSY because placing a process into a cgroup
that has domain controllers in subtree_control violates the no-internal-
processes rule. The same issue affects systemctl clean, where cg_attach()
fails with EBUSY for the same reason.
Add unit_cgroup_disable_all_controllers() helper in cgroup.c that clears
stale controllers via cg_enable(mask=0) and updates cgroup_enabled_mask to
keep internal tracking in sync. Call it from service_start() and
service_clean() right before spawning, so that resource control is preserved
for any lingering processes from the previous invocation as long as possible.
sd-json: add JsonStream transport-layer module and migrate sd-varlink
Introduces JsonStream, a generic transport layer for JSON-line message
exchange over a pair of file descriptors. It owns the input/output
buffers, SCM_RIGHTS fd passing, the deferred output queue, the
read/write/parse step functions, sd-event integration (input/output/time
event sources), the idle timeout machinery, and peer credential caching,
but knows nothing about the specific JSON protocol on top — the consumer
drives its state machine via phase/dispatch callbacks supplied at
construction.
sd-varlink is reworked to delegate the entire transport layer to a
JsonStream owned by sd_varlink. The varlink struct drops every
transport-related field (input/output buffers and fds, output queue,
fd-passing state, ucred/pidfd cache, prefer_read/write fallback, idle
timeout, description, event sources) — all of that lives in JsonStream
now. What remains in sd_varlink is the varlink-protocol state machine
(state, n_pending, current/previous/sentinel, server linkage, peer
credentials accounting, exec_pidref, the varlink-specific quit and defer
sources) and a thin wrapper layer over the JsonStream API. The
should_disconnect / get_timeout / get_events / wait helpers all live in
JsonStream now and are driven by a JsonStreamPhase the consumer reports
via its phase callback.
Ivan Kruglov [Thu, 5 Mar 2026 11:05:00 +0000 (03:05 -0800)]
test: add core-specific varlink enum sync test
Add test-varlink-idl-unit that validates all varlink enum types in
io.systemd.Unit match their corresponding C string tables. This
catches drift between varlink IDL enum definitions and internal
enum values.
Uses core_test_template since it links against libcore for access
to the string table lookup functions.
ExecOutput uses TEST_IDL_ENUM_TO_STRING only because the '+' in
'kmsg+console' doesn't survive the underscorify/dashify round-trip.
With yeswehack.com suspended while funding issues for triagers are being
worked out, reports on GH are starting to pile up. Explicitly define
some ground rules to avoid noise and time wasting.
Ivan Kruglov [Thu, 5 Mar 2026 10:31:24 +0000 (02:31 -0800)]
varlink: add enum types for configuration settings in io.systemd.Unit
Define proper varlink enum types for unit configuration settings that
are part of the user-facing API (values users/clients can select).
This replaces SD_VARLINK_STRING with SD_VARLINK_DEFINE_FIELD_BY_TYPE
for these fields, giving them strong type semantics in the IDL.
Engine-reported runtime state fields (Type, LoadState, ActiveState,
FreezerState, SubState, UnitFileState) remain as strings since only
the engine selects those values.
Add DEFINE_ARRAY_FREE_FUNC and mount_image_free_array
This is similar to DEFINE_POINTER_ARRAY_FREE_FUNC, but one
pointer chase less. The names of the outer and inner functions are
specified separately. The inner function does not free, so it'll
be generally something like 'foo_done', but the outer function
does free, so it can be called 'foo_array_free'.
Add DEFINE_POINTER_ARRAY_FREE_FUNC and conf_file_free_array
As mentioned in the grandfather commit, I want to use the _many
suffix for freeing the contents of an array, so the functions that
free the array itself get the suffix _array.
This is a helper macro that defines a function to drop elements of an
array but not the array itself. I used the "_many" suffix because it
most closely matches what happens here: we are calling the cleanup
function a bunch of times.
Ivan Kruglov [Thu, 5 Mar 2026 10:30:39 +0000 (02:30 -0800)]
test: extract varlink IDL test helpers into shared header
Move the TEST_IDL_ENUM_TO_STRING, TEST_IDL_ENUM_FROM_STRING, and
TEST_IDL_ENUM macros along with test_enum_to_string_name() from
test-varlink-idl.c into test-varlink-idl-util.h so they can be
reused by other test files.
newa(t, n) already allocates sizeof(t) * n bytes, so previously we'd
actually allocate sizeof(t) * sizeof(t) * n bytes, which is ~16x more
(on x86_64) than we actually needed.