shared/options: add new helper option_parser_get_arg
option_parser_next_arg() is renamed to option_parser_peek_next_arg()
to match option_parser_consume_next_arg().
A new helper option_parser_get_arg(…, n) is added. It is a common pattern
to need only a single arg, and getting an array and extracting a single
item from it is too verbose.
Merge the two blocks adding tests, since there seems to be
no obvious reason to have two separate blocks, as they both
contain tests from the same libraries.
Michael Vogt [Wed, 29 Apr 2026 06:20:56 +0000 (08:20 +0200)]
core: add io.systemd.Unit.StartTransient() to the varlink API (#41583)
This commit adds a simple version of io.systemd.Unit.StartTransient
for varlink. It is similar to the D-Bus version, but there are a few
key differences:
1. Instead of building the unit from key/value properties it
takes a structured json object "UnitContext" with a "Service" field
inside.
It also implements only a minimal set of what can be done with a
service.
2. No aux units (for now)
3. When called with --more the varlink socket can notify about
state changes depending on the notify{Job,Unit}Changes parameter
This aligns to the json objects/format from
https://github.com/systemd/systemd/pull/39391
and to show how the format can be shared it adds a new
(minimal) `ServiceContext` that is now part of
`io.systemd.Unit.List()`.
run: use a "named namespace" also for the main option parser
It seems that clang reorders the entries in the options array that
originate from different functions, but not within a function. Using
"named namespaces" exclusively should sidestep the issue.
(A bigger hammer would be to sort the array. We *can* do this, since the
options have the increasing .id field. But that'd require duplicating
the memory or making it writable. Let's avoid this until we know for
sure that it's needed.)
run: reorder switch cases to match help() output order
Both parse_argv() and parse_argv_sudo_mode() handled options in an
order that no longer matched the help text. Reorder the case statements
so the source order mirrors what the user sees in --help.
In parse_argv_sudo_mode(), drop the case 'i' → ARG_VIA_SHELL fall-through
so the cases can be sequenced independently; 'i' now sets arg_via_shell
directly.
Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>
shared/options: introduce "namespaces" for options
This allows multiple option parsers to be defined in a single
compilation unit. We put the OPTION_NAMESPACE("name") to split
up the options. The basic implementation is similar to groups,
except that groups only matter for help display, while namespaces
matter for both help display and actual option parsing. When parsing,
we locate the appropriate range between the beginning of options
and the next namespace marker or between two namespace markers and
only look at that range.
loop-util: don't reuse partition fd when partscan needed
Some devices (e.g. Android phones running pmOS) cannot have their OEM
partition table altered without breaking the firmware, so the distro's
partitions live inside a nested GPT carved into one of the OEM
partitions. Exposing these subpartitions requires wrapping the outer
partition in a loop device with partscan enabled, since the kernel does
not recurse into nested partition tables.
systemd already detects this case in udev-builtin-blkid
(ID_PART_GPT_AUTO_ROOT_DISK_NEEDS_LOOP) and acts on it with
systemd-loop@.service, but this fails towards the end.
loop_device_make_internal has an optimization where if the input is
already a block device with a matching sector size, it skips creating
a loop and just hands back the original fd. That's fine for whole disks
but wrong for partitions, which don't support partscan, so this causes
dissect_image to fail with EPROTONOSUPPORT.
This patch changes the behavior to only take the shortcut when the input
is a whole disk, or when partscan was not requested.
shared/options: fix --help indentation for long options
In 4339197f5d4f712bc900d8e09c892015d48b19bb the helper to format -o/--opt=
was split out, but the indentation for --long-options was messed up.
We'd print:
Options:
  -h --help              Show this help
--version                Show package version
--no-ask-password        Do not prompt for password
...
But we want
  -h --help              Show this help
     --version           Show package version
     --no-ask-password   Do not prompt for password
...
The prefix argument was arguably ugly, even if it allowed one alloc to be
avoided. Let's get rid of it and let the handler prefix the string as
appropriate. This makes other callers nicer too.
core: ensure all types from execute.h start with Exec
Until very recently all types defined by execute.h started with "Exec"
in the name. I think that was useful, since it made clear that the types
are associated with the ExecContext infrastructure. Let's hence restore
this.
(If we ever move these types out of execute.h we should drop the "Exec"
prefix again. But today is not that day.)
The sanitizer job was FUBAR on Fedora Rawhide (and RHEL 10 to some
extent) due to several changes:
- latest LLVM (v22) introduced a change that occasionally generates a
false-positive warning when running with sanitizers
- several tools had to be "ASan-wrapped" because:
- util-linux started linking against libsystemd which propagated to
other tools depending on its shared libraries (like libmount)
- libssh started depending on libfido2, which depends on libudev; this
then translated to an interesting dependency chain where tpm2 utils got a
dependency on libudev through libcurl -> libssh -> libfido2
- polkit added MemoryMax= to its service file, which is incompatible
with ASan-runs (at least with the current limits)
See the commits for more detailed descriptions.
Also, one note: the sanitizer job is currently still FUBAR on Fedora
Rawhide (or, more specifically, the TEST-50-DISSECT and TEST-58-REPART),
because mkfs.erofs also gained a dependency on libudev (through libcurl,
see above), but the wrapping currently doesn't work as it also depends
on libqpl which is linked with libtsan (which is incompatible with other
sanitizers). This is currently tracked in
https://bugzilla.redhat.com/show_bug.cgi?id=2461146
Michael Vogt [Mon, 27 Apr 2026 11:12:30 +0000 (13:12 +0200)]
core: transition job to JOB_FINISHED on uninstall
When a job reaches the job_uninstall() stage we used to set the
state to JOB_WAITING. However now that we have a JOB_FINISHED state [1]
we should use that instead. This is more accurate so when
varlink_job_send_removed_signal() is called the job is in the expected
state and that is what the user will see.
Note that this does not change the D-Bus API because there
bus_job_send_removed_signal() doesn't send the state, it only sends
the result.
udev: don't assert on worker cap after killing a broken idle worker
manager_can_process_event() considers an event processable if either
there is room below children_max to spawn, or an idle worker exists.
When only the latter holds, event_run() picks the idle worker and
tries device_monitor_send(). If that send fails, event_run() SIGKILLs
the worker, marks it WORKER_KILLED and continues the loop. With no
other idle worker available, it falls through to worker_spawn(), which
is guarded by an assertion that the worker count is still below
children_max.
The just-killed worker is still in manager->workers until its SIGCHLD
is reaped by on_worker_exit(), so at the cap this assertion trips and
udevd aborts:
Assertion 'hashmap_size(manager->workers) < manager->config.children_max'
failed at src/udev/udev-manager.c:635, function event_run(). Aborting.
Instead of asserting, bail out when we are already at the worker
limit. The event remains in EVENT_QUEUED; once the killed worker's
SIGCHLD arrives and frees it from the hashmap, on_post() re-runs
event_queue_start() and the event is retried.
Let's track the state of the option parser in an explicit 'state' field.
Benefits:
1. We can make sure that a parser that got into a failure state will
be invalidated for good (i.e. further operations are guaranteed to
fail too).
2. As a side effect this cleans up the option_parse() return parameter
handling: we'll now always initialize ret_option/ret_arg when
returning >= 0, as per coding style.
ASan showed a use-after-free error in systemd-vmspawn's
machine_register call, because the reply got accessed and then freed
again through _cleanup. The same problem exists in two other call
sites, verb_machine_control_one and unregister_machine.
Fix these call sites to not set up _cleanup.
Kai Lüke [Fri, 24 Apr 2026 16:42:36 +0000 (01:42 +0900)]
nsresource: fix buffer overrun reported by ASAN
This came up when running systemd-vmspawn with ASAN to fix another bug,
and thus I had to fix this overrun first: the dispatch tables were
missing the terminator, so add it.
Kai Lüke [Fri, 24 Apr 2026 15:18:56 +0000 (00:18 +0900)]
fork-notify: Use callback instead of argv NULL code path with return
In 012d87c1fc/cc8f398202 it was made possible for fork_notify() to
return in the child but at that point all FDs were closed and the
_cleanup path from the return causes assertion failures due to invalid
FDs in notify_event_source/event, leading to a vmspawn failing to start
with a SIGABRT logged in coredump.
Instead of TAKE_PTR on a bunch of things, which is fragile, avoid the
return altogether: add an explicit callback handler and guarantee that
we exit directly after it. A userdata argument is also added but not
used yet; I think it's quite normal to have one for a callback.
vmspawn-varlink: treat QMP disconnect as success for Terminate
QMP "quit" tells QEMU to exit, which races the reply with the socket
EOF: sometimes the disconnect lands in qmp_client_fail_pending() with
-ECONNRESET before the reply has been parsed. The shared completion
callback then translates that into io.systemd.MachineInstance.NotConnected,
turning the desired outcome into a varlink error.
This is exactly what TEST-87-AUX-UTILS-VM exposes during its repeated
start/pause/resume/terminate stress loop: a successful Pause/Describe
followed milliseconds later by a Terminate that fails with NotConnected
when the disconnect path wins the race.
Give Terminate its own completion callback that treats disconnect-class
errors as success, since QEMU shutting down is the whole point of "quit".
The other simple commands (Pause, Resume, PowerOff, Reboot) keep the
existing semantics: they expect QMP to remain alive, so NotConnected is
the correct reply for them.
Michael Vogt [Tue, 14 Apr 2026 15:04:15 +0000 (17:04 +0200)]
core: add io.systemd.Unit.StartTransient() to the varlink API
This commit adds a simple version of io.systemd.Unit.StartTransient
for varlink. It is similar to the D-Bus version, but there are a few
key differences:
1. Instead of building the unit from key/value properties it
takes a json object in the "service" parameter. It also
implements only a minimal set of what can be done with a
service (for now)
2. No aux units (for now)
3. When called with --more the varlink socket can notify about
unit job and state changes controlled via a bool on the
varlink call inputs: notify{Job,Unit}Changes
We use the new io.systemd.Job interface when outputting the
io.systemd.Unit.StartTransient result as it makes the output
nice and mirrors the input.
Note that the property names follow the D-Bus naming to make a
future "systemctl show" transition from D-Bus -> varlink easier.
Because UnitContext is now also used for the inputs we need
to make a bunch of fields `SD_VARLINK_NULLABLE` so that the
input is even accepted. This does not affect the output, it
is still fully populated, just the schema. The ID of UnitContext
is still required.
Thanks to ikruglov and Lennart for their excellent feedback on
this.
Michael Vogt [Tue, 21 Apr 2026 07:10:39 +0000 (09:10 +0200)]
varlink: add new io.systemd.Job interface
This commit creates a new varlink-io.systemd.Job.c file and puts
the job related varlink types into this file. Those will be
used by the upcoming io.systemd.Unit.StartTransient.
Note that the property names follow the D-Bus naming to make a
future "systemctl show" transition from D-Bus -> varlink easier.
Thanks to @ikruglov for suggesting this preparation commit and
to Lennart for suggesting the D-Bus compatibility considerations.
tree-wide: change option_parse() to return option and arg via internal state
It was requested to make the 'c', 'opt', and 'arg' params the same, i.e.
defined through the FOREACH_OPTION macro. But we can't do that easily,
because 'c' was defined in the for loop definition, and we can only
define variables of the same type in that way. Also, in some cases we
need only 'c', in other cases we need 'c' and 'arg', in some cases 'c'
and 'opt', and in other cases all three. We'd need to either
conditionalize or mark those variables with _unused_ to deal with
compiler warnings. But a different approach works quite nicely: add
state.opt and state.arg to show the current option and its argument.
(The short names are picked on purpose to reduce verbosity since those
are used a lot.)
So far when a job completed we'd never transition into any new state,
we'd just do some final processing work (such as notifying clients) and
destroy it.
Let's change that, and briefly enter a final state: "finished". This is
useful so that code that notifies clients can generically send the
quadruplet of id, type, state, result for any change notification and
naturally can communicate job completion that way: by setting the state
field to "finished".
Common page-code parsing extracted into a parse_page_code() helper.
While at it, return real error values (-EINVAL, etc.) rather than -1,
and rename retval to r throughout for consistency.
The body of main() is moved to a new run() function. The closing of
logging file descriptors is dropped; they will be closed automatically
anyway. Not sure what the original purpose of that code was. The code
is also modernized in various places… though more changes could be
made. The return convention of help() and similar functions is changed
to the usual negative/0/1, where 0 means that the caller should quit.
set_inq_values() would also return positive error values, which were
previously ignored. It's not entirely clear, but that doesn't seem
to have been on purpose.
In format-table.h, TABLE_IN_ADDR is commented as "Takes a union in_addr_union
(or a struct in_addr)". However, if we pass struct in_addr to table_add_many(),
the function reads more than the size of the struct.
blockdev-list: make BLOCKDEV_LIST_IGNORE_ROOT suppress all definitions of the root disk
There are various definitions of the root disk, let's suppress them all if
the flag is set. So far only the outermost is suppressed, which is a bit
weird, given it's "further away" from the rootfs.
The original find was matching even our test units, which caused issues
when the check was extended with Memory*= directives, as we stripped
them off from test units for TEST-55-OOMD where we certainly need them.
Since the stripping was meant primarily for "production-grade" units,
let's limit it to units under /etc/systemd/system/ and
/usr/lib/systemd/system/.
test: slightly reduce the performance/memory overhead for wrapped binaries
Let's drop the quarantine that ASan uses for use-after-free detection,
as it's pointless in wrapped binaries and can consume up to 256 MiB of
memory (with the default configuration). Also, don't keep any stack
traces for allocations & deallocations, which should (slightly) help
with both memory & performance overhead.
test: temporarily ignore sanitizer warning about blocked ptrace()
LLVM 22 introduced an additional check [0] for the ptrace() syscall
when invoking sanitizers, which currently produces a false-positive
warning when running some of our units under sanitizers:
[ 47.524680] systemd-timedated[740]: ==740==WARNING: ptrace appears to be blocked (is seccomp enabled?). LeakSanitizer may hang.
[ 47.524680] systemd-timedated[740]: ==740==Child exited with signal 15.
...
[ 1555.734223] systemd-oomd[93]: ==93==WARNING: ptrace appears to be blocked (is seccomp enabled?). LeakSanitizer may hang.
[ 1555.734223] systemd-oomd[93]: ==93==Child exited with signal 15.
...
It is a false positive because we disable the seccomp filters
system-wide for our units in the sanitizer jobs.
Now, from what I've seen so far this happens only in
Type=notify(-reload) units that also utilize bus_event_loop_with_idle().
This, combined with the fact that the ptrace()-check child process from
[0] checks only whether the child process was killed by _any_ signal,
means the following: if the systemd unit exits on its own after
becoming idle, and then something sends it SIGTERM (either via explicit
`systemctl stop` or during system shutdown), this SIGTERM might hit the
ptrace()-check child process from the sanitizer handler (as we also
send the signal to all processes in the target cgroup). The parent
process then mistakenly evaluates this as a blocked ptrace() syscall,
even though the check process wasn't killed by SIGSYS.
I filed this as [1] to the LLVM project, but let's also temporarily
ignore the warning in the sanitizer report processing, as it currently
causes annoying test fails.
test: drop any memory limits from units when running with sanitizers
As the memory usage under sanitizers is quite unpredictable.
This is currently relevant mainly for Polkit, as it introduced memory
limits for its polkitd.service unit in the latest version [0] which are
very easy to trigger when running under sanitizers (as polkitd depends
on libsystemd which brings ASan into polkitd's address space).
hwdb: sensor: add accel mount matrix for GPD WIN 5
The WIN 5 (DMI product G1618-05) ships the same BMI0160
accelerometer with the same physical mounting as the Win Max 2
(G1619-04), so reuse its mount matrix. Verified on hardware:
without the matrix iio-sensor-proxy reports
AccelerometerOrientation=normal regardless of physical pose,
and applying the G1619-04 matrix makes orientation transitions
(normal / left-up / right-up / bottom-up) track the device
correctly.
add helpers for unified --help formatting (#41805)
I see --help output formatting as a continuation of @keszybz's work on
options.[ch], hence let's start to add some really basic infrastructure
to unify the --help output more.
Yu Watanabe [Mon, 16 Mar 2026 15:11:25 +0000 (00:11 +0900)]
ip-util: introduce udp_packet_verify()
This is mostly equivalent to dhcp_packet_verify_headers(), but
- it optionally returns the UDP payload as an iovec,
- supports IP headers with options, and
- checks the packet length more strictly.
Yu Watanabe [Sun, 15 Mar 2026 04:57:33 +0000 (13:57 +0900)]
ip-util: introduce udp_packet_build()
Then make dhcp_packet_append_ip_headers() just a wrapper of the new
function. Currently, the wrapper is inefficient, but will be removed in
a later commit.
Nick Rosbrook [Fri, 24 Apr 2026 13:38:42 +0000 (09:38 -0400)]
units: order networkd resolve hook After=network-pre.target
Without this, the socket is available well before systemd-networkd.service
is able to start, because of its own After=network-pre.target ordering.
Then, if resolved handles queries before network-pre.target, it will
hang waiting for networkd to reply to hook queries.
This is currently happening in the wild with cloud-init.
vmspawn: prepare QMP infrastructure for runtime block-device hotplug (#41763)
The block-device hotplug work (#NNNNN) needs a number of cross-cutting
changes to the QMP plumbing that aren't hotplug-specific in
themselves: a refcounted DriveInfo so async stage callbacks can keep
slot refs, a counter-based naming scheme so multiple drives can share
the same backing path, pipelined remove-fd so QEMU's fdsets get
released at blockdev-del time, and a generic-completion callback that
doesn't tear the VM down on a runtime QMP error. None of these change
behaviour from a user's point of view, none of them depend on the
varlink hotplug methods landing, and several are wins on their own
(the fdset leak in particular is observable today with --extra-drive
under long-running VMs). Pulling them out of the hotplug PR keeps
that PR focused on the IDL + server-side method handlers, and lets
this preparatory work land on its own merits without waiting for the
larger feature review.
Cleanup pieces that fall out for free:
qmp-client: widen next_fdset_id to uint64_t
vmspawn: move VMSPAWN_PCIE_HOTPLUG_SPARES to vmspawn-qmp.h
vmspawn-varlink: use error < 0 in async QMP completion callbacks
vmspawn-varlink: simplify on_qmp_describe_complete result extraction
vmspawn-varlink: extract notify_event_subscribers from on_qmp_event
vmspawn-varlink: treat empty event subscription filter as catch-all
vmspawn-qmp: pass bridge to on_cont_complete via invoke userdata
Infrastructure the hotplug add path will sit on top of:
vmspawn-qmp: convert DriveInfo to a refcounted object
vmspawn-qmp: derive QMP node and device ids from a bridge counter
vmspawn-qmp: pipeline remove-fd after each blockdev-add
vmspawn-qmp: keep the event loop running on post-setup QMP failures
vmspawn-qmp: add the hotplug-capable block-device add machinery
vmspawn-qmp: add vmspawn_qmp_remove_block_device
The two final commits introduce vmspawn_qmp_add_block_device() and
vmspawn_qmp_remove_block_device() but leave them without varlink
callers — the io.systemd.VirtualMachineInstance method handlers that
forward into them land with the rest of the hotplug PR. Boot-time
drive setup is rewritten on top of vmspawn_qmp_add_block_device() so
the hotplug and boot paths share a single staged-add pipeline from
day one.
This changes the .result field to invalid initially, which arguably
makes more sense than "done", which was previously the default.
This is a correctness fix, and AFAICS has no effect on the API, since we
do not expose this 1:1 as a D-Bus property: it's only seen on D-Bus as
part of the job completion signal, at which point it is correctly
initialized.
Noticed while reviewing: https://github.com/systemd/systemd/pull/41583
systemd-cat does not connect the standard *input* of a process to the journal
The first paragraph of the description of the systemd-cat utility incorrectly referred to stdin when it obviously meant stderr: the other fd that it connects to the journal via a unix(7) domain socket, as clarified in the following paragraphs.
I've also replaced "process" with "command", as in that mode systemd-cat executes a file and does not spawn a process.