bootctl: rework/modernize "unlink" and add Varlink API for it
Among other things this changes tracking of the location of resources
during GC to use the BootEntrySource enum rather than a path, since we
have that anyway and it is more efficient and easier to grok.
* 1302f123d9 Restrict wildcard for new files
* a6d0098d10 Install new files for upstream build
* ce07fd7616 d/t/boot-and-services: use coreutils tunable in apparmor test (LP: #2125614)
Yaping Li [Wed, 29 Apr 2026 22:17:22 +0000 (15:17 -0700)]
report: report user and system CPU time per cgroup
Extend io.systemd.CGroup.CpuUsage from a single per-unit nanosecond
counter to three rows distinguished by a "type" field of "total",
"user", or "system". The values come from cpu.stat's usage_usec,
user_usec and system_usec keys, read in a single keyed-attribute
fetch and cached on each CGroupInfo so each scrape only opens
cpu.stat once per cgroup.
options: get rid of "on_error" parameter to FOREACH_OPTION
I am really not a fan of full code lines passed to macros as parameters.
Hence, let's get rid of the 3rd parameter of FOREACH_OPTION():
1. Let's return errors just as a regular value (though a negative one),
that can be handled via an OPTION_ERROR case statement in the switch.
This normalizes handling of the error, just like any other event
returned by the option parser.
2. In order to avoid exploding the amount of boilerplate in each use
(that just propagates the error on OPTION_ERROR), let's then
introduce an explicit FOREACH_OPTION_OR_RETURN(), that returns from
the calling function on its own (and makes that clear in the name).
Together this cleans up, normalizes the logic and shortens the code.
dns-question: limit the number of questions per query
Let's cap the number of questions each query can have to something
reasonable - 128 questions per query should be more than enough for any
real-world scenario.
fundamental/cleanup: add CLEANUP_ELEMENTS() and DEFINE_POINTER_ARRAY_CLEAR_FUNC()
DEFINE_POINTER_ARRAY_CLEAR_FUNC() generates a helper of the form
helper_array_clear(T *array, size_t n) that drops each element but does
not free the array itself, parallel to DEFINE_POINTER_ARRAY_FREE_FUNC()
for cases where the array has automatic storage duration.
CLEANUP_ELEMENTS() pairs with these helpers to provide a _cleanup_-like
attribute for fixed-size arrays: the bound is taken from ELEMENTSOF(),
and the helper is invoked across the elements at scope exit. Compared to
CLEANUP_ARRAY(), the storage is neither freed nor zeroed.
Migrate various logic across the tree over to the new macros.
sd-device: use DEFINE_POINTER_ARRAY_CLEAR_FUNC() for sd_device_unref_array_clear()
Replace the local device_unref_many() helper with the macro-generated
equivalent.
format-table: switch help-table arrays to CLEANUP_ELEMENTS()
Generate table_unref_array_clear() via DEFINE_POINTER_ARRAY_CLEAR_FUNC()
and convert the help-table arrays in bootctl, cryptenroll, nspawn,
repart and vmspawn to CLEANUP_ELEMENTS(). The arrays no longer need a
trailing NULL slot, so the size matches ELEMENTSOF() of the groups
array.
firewall-util: switch netlink message arrays to CLEANUP_ELEMENTS()
Generate sd_netlink_message_unref_array_clear() via
DEFINE_POINTER_ARRAY_CLEAR_FUNC() in place of the NULL-terminated
sd_netlink_message_unref_many(), and convert the two stack arrays of
sd_netlink_message pointers to CLEANUP_ELEMENTS().
Dan Anderson [Thu, 30 Apr 2026 02:53:10 +0000 (22:53 -0400)]
Improve error logging for fstat failure
Small hygiene fix. r must be >= 0 as per the prior statement (otherwise
we would have returned). In practice this is only going to be r == 0,
which means "return r;" is "return 0;". Let's update this to use
log_debug_errno() instead.
Samuel Dainard [Tue, 28 Apr 2026 15:57:26 +0000 (15:57 +0000)]
binfmt-util: handle ELOOP/EACCES from automount in read-only bind mounts
When /proc is bind-mounted read-only (common in mock/Koji buildroots,
containers, and other sandboxed environments), opening
/proc/sys/fs/binfmt_misc returns ELOOP if it is an automount point
that cannot be triggered in the read-only context.
Currently binfmt_mounted_and_writable() only handles ENOENT, so ELOOP
propagates as an error. This causes test-binfmt-util to fail with
SIGABRT and disable_binfmt() to log a spurious warning at shutdown.
Treat ELOOP and EACCES the same as ENOENT: binfmt_misc is not usably
available, return false.
Note: PR #37006 (merged April 2025) addressed ELOOP in the xstatfsat()
path, but the open() call in binfmt_mounted_and_writable() remained
unhandled.
blockdev-list: fix per-element leak in block_device_array_free() (#41869)
FOREACH_ARRAY declares 'i' as the iterator but the body passed 'd' (the
array base) to block_device_done(). Since mfree() leaves the field NULL
after the first call, element 0 is freed repeatedly while elements
1..N-1 leak their node, symlinks strv, model, vendor and subsystem.
The bug predates the sanitizer-instrumented callers. PR #41776's new
systemd-storage-block daemon runs blockdev_list() under ASan/LSan in
TEST-87-AUX-UTILS-VM and exposes it (15 allocs / 804 bytes leaked per
ListVolumes request). The fix also benefits repart and blockdev_list's
internal CLEANUP_ARRAY cleanup.
volume: add an "io.systemd.StorageProvider" IPC API that is supposed to be used by vmspawn/nspawn/pid1 to provide storage volumes in a generic fashion (#41776)
BindPath= in unit files and --bind= in nspawn/vmspawn don't really cut
it for connecting arbitrary storage infra. Let's do something about it,
and implement a simple, light-weight API for acquiring an fd to a
storage volume. Benefits:
1. the interface can be implemented by anyone, connecting anything to
vmspawn/nspawn/service management
2. very loose coupling: just bind a socket into a well-known dir, done
3. mounting can happen on-demand
shared/options: add new helper option_parser_get_arg
option_parser_next_arg() is renamed to option_parser_peek_next_arg()
to match option_parser_consume_next_arg().
A new helper option_parser_get_arg(…, n) is added. It is a common
pattern to only need a single arg, and getting an array and extracting a
single item from it is too verbose.
It comes with a really thorough test suite matching our current level
of testing of systemd-boot (read: there is none, I ask you to trust me,
Claude, and your review on this one)...
boot: load extra files for UKIs into memory and register as initrds
This generates on-the-fly cpio initrds from 'extra' resources declared
in Type #1 entries and installs them via the Linux initrd protocol so
that they get passed to the Linux kernel.
The PCR to measure into is closely associated with where we place a
resource in the initrd cpios. Hence, let's also track it in CpioTarget,
thus simplifying our function parameter lists that way.
TODO: track StorageProvider follow-ups, sketch a NetworkProvider sibling
Records the still-missing StorageProvider integrations (nspawn,
vmspawn, service-manager BindVolume=) and replaces the now-obsolete
generic "storage API via varlink" entry with a NetworkProvider
proposal modelled on it.
test: add integration test for storagectl and storage providers
VM-only test that exercises both shipped providers through storagectl:
verifies the well-known sockets exist, lists providers/volumes/
templates, creates and acquires volumes from each template
(sparse-file, allocated-file, directory, subvolume), attaches a loop
device to cover the block provider, and exercises the mount.storage
helper.
CLI for inspecting and using storage providers. Scans
/run/systemd/io.systemd.StorageProvider/ (or the user-mode equivalent)
for AF_UNIX sockets and talks to each one over Varlink. Verbs:
"volumes" lists volumes across all providers, "templates" lists
supported creation templates, "providers" lists the endpoints
themselves.
Also installed as a mount.storage helper, so
'mount -t storage PROVIDER:VOLUME /mnt' (or 'mount -t storage.<fstype>'
to put a fresh filesystem on a block volume) acquires the volume and
mounts it. Ships with bash/zsh completions and a man page.
Second StorageProvider implementation, exposing regular files and
directories from a backing filesystem. In system mode the backing
directory is /var/lib/storage/, in user mode $XDG_STATE_HOME/storage/;
entries with a .volume suffix are exposed, with the inode type
determining whether the volume is reported as reg, dir or (via
symlinked/bind-mounted device node) blk.
Unlike the block provider, this one supports creating volumes
on-demand from a small set of built-in templates: sparse-file,
allocated-file, directory and subvolume.
First implementation of io.systemd.StorageProvider, exposing all block
devices known to udev (disks, partitions, dm nodes, …) as volumes of
type "blk". Names are picked from stable /dev/mapper and /dev/disk/by-*
symlinks; content-derived identifiers (by-uuid, by-label, …) are
intentionally avoided for security. Volume creation is not supported by
this backend.
Socket-activated via /run/systemd/io.systemd.StorageProvider/block.
Also adds shared storage-util.[ch] (VolumeType / CreateMode helpers)
that subsequent providers reuse.
Generic Varlink API for services that hand out file descriptors to
storage volumes. Three methods: Acquire() returns an fd for a named
volume (optionally creating it from a template), ListVolumes()
enumerates available volumes, ListTemplates() enumerates supported
creation templates. Volume types follow kernel inode-type naming:
blk (block device), reg (regular file), dir (directory).
Intent is that multiple providers can sit behind AF_UNIX sockets in a
well-known directory and be consumed uniformly by nspawn, vmspawn,
the service manager (BindVolume=) and similar tools.
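Put together, the interface might look roughly like this in Varlink IDL (a sketch reconstructed from the description above, not the shipped io.systemd.StorageProvider definition; field and parameter names are guesses):

```
interface io.systemd.StorageProvider

type Volume (
        name : string,
        type : (blk, reg, dir)
)

# The acquired volume's file descriptor travels out of band,
# via fd passing over the AF_UNIX socket.
method Acquire(name : string, template : ?string) -> ()

method ListVolumes() -> (volumes : []Volume)

method ListTemplates() -> (templates : []string)
```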
Merge the two blocks adding tests: there seems to be no obvious reason
to keep them separate, as they both contain tests from the same
libraries.
sd-json: stop printing debug messages about extension fields
The intent was good, but we now print two or three of those messages
for each metrics report received on the wire. If the json object is
extensible, then it's all good and we don't need to inundate the user
with this trivial information. (And the message also sounds like
something is wrong or unexpected, when it totally isn't.)
...
(string):1:73: Unrecognized object field 'object', assuming extension.
(string):1:89: Unrecognized object field 'value', assuming extension.
json-stream: Received message: {"parameters":{"name":"io.systemd.Network.CarrierState","object":"virbr0","value":"degraded-carrier"},"continues":true}
(string):1:66: Unrecognized object field 'object', assuming extension.
(string):1:83: Unrecognized object field 'value', assuming extension.
json-stream: Received message: {"parameters":{"name":"io.systemd.Network.CarrierState","object":"lo","value":"carrier"},"continues":true}
(string):1:66: Unrecognized object field 'object', assuming extension.
(string):1:79: Unrecognized object field 'value', assuming extension.
json-stream: Received message: {"parameters":{"name":"io.systemd.Network.CarrierState","object":"wlp0s20f3","value":"carrier"},"continues":true}
(string):1:66: Unrecognized object field 'object', assuming extension.
(string):1:86: Unrecognized object field 'value', assuming extension.
...
As is often the case (in this instance because of alignment), we are
actually not saving any space: with the bitfield we are using one bit of
the 8 bytes allocated, and without the bitfield we are using 8 bits of
them.
But we're paying a price in generated code, at every access site to the
field:
Michael Vogt [Wed, 29 Apr 2026 06:20:56 +0000 (08:20 +0200)]
core: add io.systemd.Unit.StartTransient() to the varlink API (#41583)
This commit adds a simple version of io.systemd.Unit.StartTransient
for varlink. It is similar to the dbus version, but there is a key
difference:
1. Instead of building the unit from key/value properties it
takes a structured json object "UnitContext" with a "Service" field
inside. It is also only implementing a minimal set of what can be
done with a service.
2. No aux units (for now)
3. When called with --more the varlink socket can notify about
state changes depending on the notify{Job,Unit}Changes parameter
This aligns with the json objects/format from
https://github.com/systemd/systemd/pull/39391
and to show how the format can be shared it adds a new
(minimal) `ServiceContext` that is now part of
`io.systemd.Unit.List()`.
run: use a "named namespace" also for the main option parser
It seems that clang reorders the entries in the options array that
originate from different functions, but not within a function. Using
"named namespaces" exclusively should sidestep the issue.
(A bigger hammer would be to sort the array. We *can* do this, since the
options have the increasing .id field. But that'd require duplicating
the memory or making it writable. Let's avoid this until we know for
sure that it's needed.)
run: reorder switch cases to match help() output order
Both parse_argv() and parse_argv_sudo_mode() handled options in an
order that no longer matched the help text. Reorder the case statements
so the source order mirrors what the user sees in --help.
In parse_argv_sudo_mode(), drop the case 'i' → ARG_VIA_SHELL fall-through
so the cases can be sequenced independently; 'i' now sets arg_via_shell
directly.
Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>
shared/options: introduce "namespaces" for options
This allows multiple option parsers to be defined in a single
compilation unit. We put the OPTION_NAMESPACE("name") to split
up the options. The basic implementation is similar to groups,
except that groups only matter for help display, while namespaces
matter for both help display and actual option parsing. When parsing,
we locate the appropriate range between the beginning of options
and the next namespace marker or between two namespace markers and
only look at that range.
loop-util: don't reuse partition fd when partscan needed
Some devices (e.g. android phones running pmOS) cannot have their OEM
partition table altered without breaking the firmware, so the distro's
partitions live inside a nested GPT carved into one of the OEM
partitions. Exposing these subpartitions requires wrapping the outer
partition in a loop device with partscan enabled, since the kernel does
not descend into nested partition tables.
systemd already detects this case in udev-builtin-blkid
(ID_PART_GPT_AUTO_ROOT_DISK_NEEDS_LOOP) and acts on it with
systemd-loop@.service, but this fails towards the end.
loop_device_make_internal has an optimization where if the input is
already a block device with a matching sector size, it skips creating
a loop and just hands back the original fd. That's fine for whole disks
but wrong for partitions, which don't support partscan, so this causes
dissect_image to fail with EPROTONOSUPPORT.
This patch changes the behavior to only take the shortcut when the input
is a whole disk, or when partscan was not requested.
shared/options: fix --help indentation for long options
In 4339197f5d4f712bc900d8e09c892015d48b19bb the helper to format -o/--opt=
was split out, but the indentation for --long-options was messed up.
We'd print:

  Options:
    -h --help                Show this help
    --version                Show package version
    --no-ask-password        Do not prompt for password
    ...

But we want

    -h --help                Show this help
       --version             Show package version
       --no-ask-password     Do not prompt for password
    ...
The prefix argument was arguably ugly, even if it allowed one alloc to be
avoided. Let's get rid of it and let the handler prefix the string as
appropriate. This makes other callers nicer too.
core: ensure all types from execute.h start with Exec
Until very recently all types defined by execute.h started with "Exec"
in the name. I think that was useful, since it made clear that the types
are associated with the ExecContext infrastructure. Let's hence restore
this.
(If we ever move these types out of execute.h we should drop the "Exec"
prefix again. But today is not that day.)
The sanitizer job was FUBAR on Fedora Rawhide (and RHEL 10 to some
extent) due to several changes:
- latest LLVM (v22) introduced a change that occasionally generates a
false-positive warning when running with sanitizers
- several tools had to be "ASan-wrapped" because:
- util-linux started linking against libsystemd which propagated to
other tools depending on its shared libraries (like libmount)
- libssh started depending on libfido2, which depends on libudev; this
then translated to an interesting dependency chain where tpm2 utils got a
dependency on libudev through libcurl -> libssh -> libfido2
- polkit added MemoryMax= to its service file, which is incompatible
with ASan-runs (at least with the current limits)
See the commits for more detailed descriptions.
Also, one note: the sanitizer job is currently still FUBAR on Fedora
Rawhide (or, more specifically, the TEST-50-DISSECT and TEST-58-REPART),
because mkfs.erofs also gained a dependency on libudev (through libcurl,
see above), but the wrapping currently doesn't work as it also depends
on libqpl which is linked with libtsan (which is incompatible with other
sanitizers). This is currently tracked in
https://bugzilla.redhat.com/show_bug.cgi?id=2461146
Michael Vogt [Mon, 27 Apr 2026 11:12:30 +0000 (13:12 +0200)]
core: transition job to JOB_FINISHED on uninstall
When a job reaches the job_uninstall() stage we used to set the
state to JOB_WAITING. However now that we have a JOB_FINISHED state [1]
we should use that instead. This is more accurate so when
varlink_job_send_removed_signal() is called the job is in the expected
state and that is what the user will see.
Note that this does not change the D-Bus API because there
bus_job_send_removed_signal() doesn't send the state, it only sends
the result.
udev: don't assert on worker cap after killing a broken idle worker
manager_can_process_event() considers an event processable if either
there is room below children_max to spawn, or an idle worker exists.
When only the latter holds, event_run() picks the idle worker and
tries device_monitor_send(). If that send fails, event_run() SIGKILLs
the worker, marks it WORKER_KILLED and continues the loop. With no
other idle worker available, it falls through to worker_spawn(),
guarded by:
The just-killed worker is still in manager->workers until its SIGCHLD
is reaped by on_worker_exit(), so at the cap this assertion trips and
udevd aborts:
Assertion 'hashmap_size(manager->workers) < manager->config.children_max'
failed at src/udev/udev-manager.c:635, function event_run(). Aborting.
Instead of asserting, bail out when we are already at the worker
limit. The event remains in EVENT_QUEUED; once the killed worker's
SIGCHLD arrives and frees it from the hashmap, on_post() re-runs
event_queue_start() and the event is retried.
Let's track the state of the option parser in an explicit 'state' field.
Benefits:
1. We can make sure that a parser that got into a failure state will
be invalidated for good (i.e. further operations are guaranteed to
fail too).
2. As a side effect this cleans up the option_parse() return parameter
handling: we'll now always initialize ret_option/ret_arg when
returning >= 0, as per coding style.
ASAN showed a use-after-free error for systemd-vmspawn's
machine_register call because the reply got accessed and freed again
through _cleanup. The same problem exists in two more places,
verb_machine_control_one() and unregister_machine().
Fix these call sites to not set up _cleanup.
Kai Lüke [Fri, 24 Apr 2026 16:42:36 +0000 (01:42 +0900)]
nsresource: fix buffer overrun reported by ASAN
This came up when running systemd-vmspawn with ASAN to fix another bug
and thus I had to fix this overrun here first: The dispatch tables were
missing the terminator, add it.
Kai Lüke [Fri, 24 Apr 2026 15:18:56 +0000 (00:18 +0900)]
fork-notify: Use callback instead of argv NULL code path with return
In 012d87c1fc/cc8f398202 it was made possible for fork_notify() to
return in the child but at that point all FDs were closed and the
_cleanup path from the return causes assertion failures due to invalid
FDs in notify_event_source/event, leading to a vmspawn failing to start
with a SIGABRT logged in coredump.
Instead of TAKE_PTR on a bunch of things, which is fragile, rather avoid
the return and instead add an explicit callback handler and guarantee
to exit directly after it. A userdata argument is also added but not
used yet; I think it's quite normal to have one for a callback.
vmspawn-varlink: treat QMP disconnect as success for Terminate
QMP "quit" tells QEMU to exit, which races the reply with the socket
EOF: sometimes the disconnect lands in qmp_client_fail_pending() with
-ECONNRESET before the reply has been parsed. The shared completion
callback then translates that into io.systemd.MachineInstance.NotConnected,
turning the desired outcome into a varlink error.
This is exactly what TEST-87-AUX-UTILS-VM exposes during its repeated
start/pause/resume/terminate stress loop: a successful Pause/Describe
followed milliseconds later by a Terminate that fails with NotConnected
when the disconnect path wins the race.
Give Terminate its own completion callback that treats disconnect-class
errors as success, since QEMU shutting down is the whole point of "quit".
The other simple commands (Pause, Resume, PowerOff, Reboot) keep the
existing semantics: they expect QMP to remain alive, so NotConnected is
the correct reply for them.
Michael Vogt [Tue, 14 Apr 2026 15:04:15 +0000 (17:04 +0200)]
core: add io.systemd.Unit.StartTransient() to the varlink API
This commit adds a simple version of io.systemd.Unit.StartTransient
for varlink. It is similar to the dbus version, but there is a key
difference:
1. Instead of building the unit from key/value properties it
takes a json object in the "service" parameter. It is also
only implementing a minimal set of what can be done with a
service (for now)
2. No aux units (for now)
3. When called with --more the varlink socket can notify about
unit job and state changes controlled via a bool on the
varlink call inputs: notify{Job,Unit}Changes
We use the new io.systemd.Job interface when outputting the
io.systemd.Unit.StartTransient result as it makes the output
nice and mirrors the input.
Note that the property names follow the D-Bus naming to make a
future "systemctl show" transition from D-Bus -> varlink easier.
Because UnitContext is now also used for the inputs we need
to make a bunch of fields `SD_VARLINK_NULLABLE` so that the
input is even accepted. This does not affect the output, it
is still fully populated, just the schema. The ID of UnitContext
is still required.
Thanks to ikruglov and Lennart for their excellent feedback on
this.