The interface of this program was rather strange. It took an option that
specified what to do, but that option behaved exactly like a verb. Let's
change the interface to the more modern style with verbs. Since the
interface was documented in the man page, provide a compat shim to handle
the old options.
(In practice, I doubt anybody will notice the change. But since it was
documented, it's easier to provide the compat shim than to think too much
about whether it is actually needed. I think we can drop it in a year or so.)
Kai Lüke [Thu, 16 Apr 2026 06:24:27 +0000 (15:24 +0900)]
sysupdated: don't crash when an mstack machine image is found
As soon as machinectl list-images has an mstack entry, updatectl fails:
systemd-sysupdated crashes on a failed assertion because the mstack case
was not handled.
For now mstack is not supported as an image for sysupdate to operate on,
so we can skip it.
sd-varlink: Don't log successful sentinel error dispatch as a failure
sd_varlink_error() deliberately returns a negative errno mapped from
the error id on success so callbacks can `return sd_varlink_error(...);`
to enqueue the reply and propagate a matching errno at once. When
varlink_dispatch_method() dispatches a configured error sentinel itself,
it doesn't need that mapping — but it was treating any negative return
as a dispatch failure and logging "Failed to process sentinel" even
though the error reply had been successfully enqueued.
Detect success via the state transition to VARLINK_PROCESSED_METHOD
instead, so only genuine enqueue failures are logged.
systemd-vmspawn: QMP-varlink bridge for VM runtime control (#41449)
systemd-vmspawn currently has zero runtime control over the VMs it
launches. It can kill QEMU (SIGTERM) or SSH in, but it cannot pause,
resume, request a graceful power-off, query status, or
react to VM events. QEMU exposes all of this via its QMP protocol;
systemd's native IPC is varlink. This series bridges the two.
machined stores the controlAddress but never connects to vmspawn.
machinectl discovers the address from Machine.List and connects
directly. Socket mode 0600 is the access-control boundary —
the socket is rooted in vmspawn's $RUNTIME_DIRECTORY, so only the UID
that launched the VM can talk to it.
QMP client library (src/shared/qmp-client.{c,h})
A small non-blocking QMP client modeled on sd-varlink's pump contract:
- Reference-counted QmpClient with an explicit five-state machine:
HANDSHAKE_INITIAL → HANDSHAKE_GREETING_RECEIVED →
HANDSHAKE_CAPABILITIES_SENT → RUNNING → DISCONNECTED.
- qmp_client_connect_fd() is non-blocking: it wraps the fd in a
JsonStream and returns immediately. The greeting + qmp_capabilities
handshake is driven lazily on the first
qmp_client_invoke() or by the event loop — whichever comes first — so
callers never block during connect.
- qmp_client_attach_event() attaches to sd_event for async operation;
qmp_client_process() performs one pump step (write → dispatch → parse →
read → disconnect) with the same contract as
sd_varlink_process(); qmp_client_wait() blocks until the next I/O event.
- qmp_client_invoke() sends an async command and fires the registered
qmp_command_callback_t with (result, error_desc, error, userdata) on
completion. Synchronous callers drive
process()/wait() in a loop until qmp_client_is_idle() is true.
- QmpClientArgs bundles the JSON arguments and an FD list for a single
command; the QMP_CLIENT_ARGS_FD() macro hands one fd to the callee for
SCM_RIGHTS passing. On partial-stage failure the
args list is narrowed so the caller's cleanup closes only the
untransferred tail.
- Event broadcast to a registered callback via qmp_client_bind_event();
transport loss surfaces through qmp_client_bind_disconnect().
- qmp_schema_has_member() walks the query-qmp-schema result for optional
runtime capability probes.
vmspawn device setup via QMP
vmspawn starts QEMU paused (-S), sets up devices via QMP, then resumes
with cont. The entire device plane moves off the legacy INI config path
and onto the bridge.
A new MachineConfig aggregate in vmspawn-qmp.h groups the per-device
info (DriveInfos, NetworkInfo, VirtiofsInfos, VsockInfo) with a single
machine_config_done() cleanup that chains the
sub-structure destructors; each conversion patch populates exactly the
field it owns.
What the conversion enables:
- FD-based device passing via add-fd / getfd + SCM_RIGHTS — vmspawn
opens every image file, TAP, VSOCK, and virtiofs socket itself and hands
the fd to QEMU. QEMU never needs filesystem
access.
- Ephemeral overlays via blockdev-create + async job-concluded
continuations on anonymous O_TMPFILE / memfd backings — no named overlay
files on disk.
- PCIe root-port pre-allocation for q35/virt machine types so
hotplug-capable slots exist at boot (NVMe, virtio-scsi, etc.).
- io_uring availability probing with automatic fallback to the default
AIO backend if QEMU's build doesn't support it.
Per-command callbacks call sd_event_exit() on setup failure so vmspawn
shuts down cleanly if any device can't be attached.
machinectl integration
- machinectl pause / resume / poweroff / reboot / terminate go through
the varlink control socket for VMs.
- D-Bus fallback for containers: poweroff sends SIGRTMIN+4, terminate
calls the existing TerminateMachine method — unchanged container
behavior.
- Multi-machine parallel dispatch via sd_event for bulk operations
(machinectl pause vm1 vm2 ...) so one slow VM doesn't serialize the
rest.
- SubscribeEvents streaming with per-subscriber event-name filters
(importd Pull-style pattern: initial {ready:true} notify, fan out via
varlink_many_notifybo(), lazy init — QMP event pump
runs only while subscribers exist).
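The SubscribeEvents behavior described above (initial ready notification, per-subscriber event-name filters, pump active only while subscribers exist) can be sketched roughly as follows. This is a toy model, not vmspawn's code: `EventHub`, `Subscriber`, and the rule that an empty filter means "all events" are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Subscriber:
    # Assumption of this sketch: an empty filter matches every event.
    event_filter: frozenset = frozenset()
    received: list = field(default_factory=list)

    def notify(self, message):
        self.received.append(message)


class EventHub:
    """Hypothetical stand-in for the subscriber bookkeeping."""

    def __init__(self):
        self.subscribers = []
        self.pump_running = False  # models the lazily started QMP event pump

    def subscribe(self, names=()):
        sub = Subscriber(frozenset(names))
        self.subscribers.append(sub)
        self.pump_running = True     # lazy init: start on first subscriber
        sub.notify({"ready": True})  # initial notification
        return sub

    def unsubscribe(self, sub):
        self.subscribers.remove(sub)
        if not self.subscribers:
            self.pump_running = False  # nobody is listening any more

    def dispatch(self, event, data=None):
        # Fan out to every subscriber whose filter accepts the event.
        for sub in self.subscribers:
            if not sub.event_filter or event in sub.event_filter:
                sub.notify({"event": event, "data": data or {}})
```

A subscriber created with `hub.subscribe(["STOP"])` sees only STOP events after its ready notification, while one created with `hub.subscribe()` sees everything.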
Tests
- Unit test with a mock QMP server covering handshake, command/response,
events, and EOF.
- Integration test against real QEMU (-machine none) exercising
handshake + query-qmp-schema (~200 KB reply, validates the buffered
reader across multiple read()s) and query-status.
- Integration test for the machinectl verbs end-to-end: pause / resume /
describe / subscribe / terminate.
- Integration test for the multi-drive pipeline and ephemeral overlays
(blockdev-create async job continuations).
- Stress test: 5 cycles of start → 3× (pause/describe/resume/describe) →
terminate.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
vmspawn: add integration test for multi-drive and ephemeral QMP setup
Test the async QMP drive pipeline with real QEMU:
Test 1 (multi-drive): launches vmspawn with --image plus two
--extra-drive flags. This exercises multiple fdset allocations,
pipelined blockdev-add commands relying on FIFO ordering, io_uring
retry callbacks, and multiple device_add commands — all fired
without waiting for responses.
Test 2 (ephemeral): launches vmspawn with --image --ephemeral. This
exercises the most complex async path: blockdev-create fires a
background job, JOB_STATUS_CHANGE events are watched via the event
callback, and when the job concludes the deferred continuation fires
the overlay format node + device_add. If the continuation fails, the
root drive is never attached, the kernel panics, and vmspawn exits
without registering — so successful registration proves the pipeline
works.
Both tests use a raw ext4 image with a minimal init (sleep infinity)
and direct kernel boot. No virtiofsd needed.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
vmspawn: add integration test for machinectl VM control verbs
Add TEST-87-AUX-UTILS-VM.vmspawn.sh that validates the QMP-varlink
bridge end-to-end using a real QEMU instance:
- Launches vmspawn with --directory and --linux for direct kernel boot
(no UEFI firmware or bootable image needed)
- Waits for machine registration with machined
- Verifies varlinkAddress is exposed in Machine.List
- Tests machinectl pause, resume, poweroff
- Exercises MachineInstance varlink interface directly via varlinkctl:
QueryStatus state verification across pause/resume, Pause, Resume
Skipped automatically if vmspawn, QEMU, or a bootable kernel is not
available. Runs as part of TEST-87-AUX-UTILS-VM in the mkosi
integration test suite.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
vmspawn: add integration test for QMP client library against real QEMU
Add a test that launches QEMU with -machine none (no bootable image
needed) and exercises the QMP client library against the real QMP
implementation:
- test_qmp_client_qemu_handshake_and_schema: sends query-qmp-schema
(~200KB response that exercises the buffered multi-read() path)
via qmp_client_invoke(), then cleanly shuts down QEMU via quit.
The QMP handshake completes transparently inside invoke().
- test_qmp_client_qemu_query_status: validates query-status response
parsing, stop/cont command sequencing with id correlation, and state
verification between commands
The test is automatically skipped when QEMU is not installed.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
vmspawn: add integration test for QMP client library
Test the QMP client library using a mock QMP server over a socketpair:
- test_qmp_client_basic: Verifies full handshake, query-status with
response parsing, stop/cont commands, and asynchronous STOP event
delivery via the sd-event I/O callback
- test_qmp_client_eof: Verifies that the client properly detects
server disconnection (EOF) and returns a disconnect error
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
man: document machinectl pause/resume and update poweroff for VMs
Add manpage entries for the new pause and resume verbs. Update the
poweroff description to cover VMs (ACPI powerdown via QMP) in addition
to containers (SIGRTMIN+4).
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Each verb discovers the machine's varlinkAddress via machined's
Machine.List, connects directly to vmspawn's varlink socket, and
calls the corresponding io.systemd.MachineInstance method.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Remove the static blockdev/device/snapshot INI config sections and
the SCSI controller setup for both the root image drive and extra
drives. Replace with DriveInfos that are constructed in the parent
after fork: vmspawn opens all image files and passes fds to QEMU via
the add-fd path. For ephemeral mode, anonymous overlay files are
created via O_TMPFILE or memfd.
The resolve_disk_driver() helper maps DiskType to the appropriate
QEMU driver name and serial format.
The post-fork device-info preparation is split into helpers:
prepare_primary_drive() and prepare_extra_drives() for per-drive
construction, assign_pcie_ports() for naming the pre-allocated
pcie-root-port bridges once every device type is known, and
prepare_device_info() that stitches them together against the
MachineConfig aggregate.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Remove the static chardev/device INI config sections for both the
root filesystem and runtime mount virtiofs instances. Replace with
VirtiofsInfos that capture socket paths and tags for each virtiofs
mount, passed to vmspawn_varlink_setup_virtiofs() for runtime
configuration via QMP.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Remove the static vsock0 INI config section and the related pass_fds
plumbing. Replace with a VsockInfo struct that captures the vhost fd
and guest CID, passed to vmspawn_qmp_setup_vsock() for runtime
configuration via QMP. The VSOCK fd is now sent to QEMU via QMP getfd
+ SCM_RIGHTS instead of being inherited.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Remove the static netdev/nic INI config sections for privileged
TAP, nsresourced TAP, and user-mode networking. Replace them with a
NetworkInfo struct that captures the network type, TAP fd or interface
name, and MAC address, passed to vmspawn_varlink_setup_network() for
runtime configuration via QMP.
For the nsresourced TAP path the fd is now passed to QEMU via QMP
getfd + SCM_RIGHTS instead of being inherited through pass_fds.
Declare the MachineConfig aggregate that this and the following
conversion patches populate, zero-initialized with explicit -EBADF
for the fd fields so every sub-structure cleans up safely regardless
of which device types the invocation ends up using.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
vmspawn: QMP-varlink bridge for VM runtime control
Create a QMP socketpair for QEMU machine monitor control, configure
the QMP chardev+mon via the QEMU config file, and wire up the bridge
infrastructure.
After fork, vmspawn initializes the QMP bridge, probes QEMU feature
support synchronously (driving the QMP handshake to RUNNING
transparently), resumes vCPUs, then sets up the varlink server for
runtime VM control. The control socket path is passed to machined via
the controlAddress field in machine registration.
Device configuration still uses the legacy INI config path and will
be converted to bridge calls in subsequent commits.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
vmspawn: pre-allocate PCIe root ports for device hotplug
On PCIe machine types (q35, virt), QMP device_add is always hotplug —
even with vCPUs stopped. The root PCIe bus (pcie.0) does not support
hotplugging; only pcie-root-port bridges do. Pre-allocate enough root
ports in the QEMU config file for all devices that will be set up via
QMP, plus 10 spare ports for future runtime hotplug.
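As a rough illustration of the pre-allocation rule (one root port per QMP-configured device, plus 10 spares), the sketch below generates QEMU config-file sections. The section ids (`vmspawn-rp0`, ...) and the chassis numbering are hypothetical and not taken from the patch; only the "devices + 10" count comes from the text above.

```python
SPARE_PORTS = 10  # number of spare ports named in the commit message


def root_port_config(n_devices, spare=SPARE_PORTS):
    """Return QEMU config text with one pcie-root-port per device plus spares.

    The ids and chassis values are illustrative assumptions.
    """
    sections = []
    for i in range(n_devices + spare):
        sections.append(
            f'[device "vmspawn-rp{i}"]\n'
            f'  driver = "pcie-root-port"\n'
            f'  bus = "pcie.0"\n'
            f'  chassis = "{i + 1}"\n'
        )
    return "\n".join(sections)
```

For three devices this yields 13 sections, so later device_add calls can target a root port instead of pcie.0.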
Add ARCHITECTURE_NEEDS_PCIE_ROOT_PORTS macro to guard PCIe-specific
setup on x86, ARM, RISC-V, and LoongArch (the architectures whose
QEMU machine type is q35 or virt).
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Create the QMP-to-varlink bridge layer (vmspawn-qmp.{c,h}) and the
varlink server layer (vmspawn-varlink.{c,h}).
The QMP bridge (VmspawnQmpBridge) owns the QmpClient connection and
manages pending background jobs (e.g. blockdev-create continuations).
vmspawn_qmp_init() creates the client and attaches it to the event
loop. vmspawn_qmp_probe_features() drives io_uring and qcow2
discard-no-unref probes synchronously via a qmp_client_process() +
qmp_client_wait() loop — the QMP handshake completes transparently on
the first invoke. vmspawn_qmp_start() resumes vCPUs via an async
"cont" command.
The varlink server (VmspawnVarlinkContext) exposes three interfaces:
- io.systemd.MachineInstance: generic machine control (Terminate,
PowerOff, Reboot, Pause, Resume, Describe, SubscribeEvents).
Method handlers forward to QMP commands asynchronously — the
varlink reply is deferred until the QMP response arrives.
- io.systemd.VirtualMachineInstance: VM-specific (placeholder for
future snapshot/migration methods).
- io.systemd.QemuMachineInstance: QEMU-specific (AcquireQMP stub).
The server listens on <runtime_dir>/control with mode 0600.
Event streaming follows the importd Pull pattern: SubscribeEvents
sends an initial {ready:true} notification, then fans out QMP events
to all subscribers. The disconnect handler only unrefs subscriber
links (matching resolved's vl_on_notification_disconnect pattern).
Introduce the MachineConfig aggregate in vmspawn-qmp.h grouping the
per-device info structures (DriveInfos, NetworkInfo, VirtiofsInfos,
VsockInfo) together with machine_config_done() that chains the
individual done helpers. Callers populate it field-by-field and rely
on the _cleanup_ attribute for orderly teardown regardless of which
device types the invocation ends up using.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Async QMP client for talking to QEMU's machine monitor from
libsystemd-shared. The I/O core (buffered read/write, output queue,
event source management) follows sd-varlink's patterns.
State machine:
INITIAL -> GREETING_RECEIVED -> CAPABILITIES_SENT -> RUNNING
|
v
DISCONNECTED
The QMP handshake (greeting + qmp_capabilities) is driven transparently
by qmp_client_invoke() through an internal qmp_client_ensure_running()
helper, matching sd-bus's bus_ensure_running() pattern. Callers never
wait for it explicitly.
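The state machine above can be modeled in a few lines. The sketch below is not the C implementation; the class and method names are made up, and it only assumes the standard QMP exchange (server greeting with a "QMP" key, client sends qmp_capabilities, server answers with "return").

```python
import json
from enum import Enum, auto


class State(Enum):
    INITIAL = auto()
    GREETING_RECEIVED = auto()
    CAPABILITIES_SENT = auto()
    RUNNING = auto()
    DISCONNECTED = auto()


class Handshake:
    """Hypothetical model of the handshake part of the client."""

    def __init__(self):
        self.state = State.INITIAL
        self.outgoing = []  # serialized messages waiting to be written

    def on_message(self, msg):
        if self.state is State.INITIAL and "QMP" in msg:
            # Greeting seen: queue the capabilities negotiation right away.
            self.state = State.GREETING_RECEIVED
            self.outgoing.append(json.dumps({"execute": "qmp_capabilities"}))
            self.state = State.CAPABILITIES_SENT
        elif self.state is State.CAPABILITIES_SENT and "return" in msg:
            self.state = State.RUNNING
        elif self.state is State.CAPABILITIES_SENT and "error" in msg:
            self.state = State.DISCONNECTED

    def on_eof(self):
        # Transport loss is terminal from any state.
        self.state = State.DISCONNECTED
```

In this model GREETING_RECEIVED is only held for an instant; a non-blocking client would stay there until the output queue can be flushed.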
qmp_client_invoke() is the only command interface: asynchronous, with
per-command callback. Slots are tracked in a Set keyed by id; replies
are dispatched by id match. SCM_RIGHTS fd passing is bundled through
QmpClientArgs and the QMP_CLIENT_ARGS_FD macro.
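Reply dispatch by id can be pictured like this. It is a simplified sketch with invented names (`CommandSlots`, `dispatch_reply`), using a dict where the patch describes a Set keyed by id, and it leaves out fd passing entirely.

```python
import itertools


class CommandSlots:
    """Hypothetical bookkeeping for in-flight QMP commands."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._pending = {}  # command id -> callback

    def invoke(self, command, callback, arguments=None):
        # Build the message to send and remember who waits for the reply.
        cmd_id = next(self._ids)
        self._pending[cmd_id] = callback
        msg = {"execute": command, "id": cmd_id}
        if arguments:
            msg["arguments"] = arguments
        return msg

    def dispatch_reply(self, reply):
        # Replies echo the id of the command they answer.
        callback = self._pending.pop(reply.get("id"), None)
        if callback is None:
            return False  # unknown or already completed id
        callback(reply.get("return"), reply.get("error"))
        return True

    def is_idle(self):
        return not self._pending
```

Because each slot is looked up by id, several commands can be queued back to back and their replies still reach the right callback.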
qmp_client_process() and qmp_client_wait() are exposed publicly,
mirroring sd_varlink_process() and sd_varlink_wait(). Callers that
need to drive the client synchronously — e.g. feature probing before
entering the sd_event main loop — can loop on them exactly like
varlink_call_internal() does on its varlink equivalents.
Other features:
- Buffered stream reader for QMP's \r\n-delimited JSON, handling
multi-read responses (query-qmp-schema is ~200 KiB).
- Fdset id allocation via qmp_client_next_fdset_id().
- Synthetic SHUTDOWN event on unexpected disconnect.
- Disconnect detection with callback notification and pending-command
cleanup.
- -ENOBUFS from the 16 MiB input-buffer cap is treated as recoverable
(not a transport error).
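The buffered reader idea from the list above, sketched in Python: bytes arrive in arbitrary chunks, and a message is only parsed once its terminating \r\n has been seen. `LineBuffer` is an invented name; the 16 MiB cap mirrors the figure in the text, and raising `BufferError` stands in for the -ENOBUFS case.

```python
import json


class LineBuffer:
    """Hypothetical accumulator for \\r\\n-delimited JSON messages."""

    def __init__(self, limit=16 * 1024 * 1024):
        self._buf = bytearray()
        self._limit = limit

    def feed(self, chunk):
        """Append one read() worth of bytes, return all complete messages."""
        if len(self._buf) + len(chunk) > self._limit:
            raise BufferError("input buffer cap exceeded")
        self._buf += chunk
        messages = []
        while True:
            end = self._buf.find(b"\r\n")
            if end < 0:
                break  # incomplete message, wait for more data
            line = bytes(self._buf[:end])
            del self._buf[:end + 2]
            if line:
                messages.append(json.loads(line))
        return messages
```

A large reply such as query-qmp-schema simply takes many `feed()` calls before one message comes back.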
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Move json_stream_description(), json_stream_log(), and
json_stream_log_errno() from json-stream.c into json-stream.h so that
consumers like the QMP client can use the same description-prefixed
logging that json-stream itself uses internally.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
machined: add controlAddress field to Machine.Register and Machine.List
Follow the existing sshAddress pattern to add a controlAddress field
that allows machine registrants (like vmspawn) to advertise a varlink
socket address for direct VM control. machined stores and exposes
the address but never connects to it itself.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
shared: add varlink interface definitions for machine instance control
Add three varlink interface definitions for the machine instance control
hierarchy:
- io.systemd.MachineInstance: generic operations applicable to both
containers and VMs (PowerOff, Reboot, Pause, Resume, QueryStatus,
SubscribeEvents). nspawn could implement this same interface later.
- io.systemd.VirtualMachineInstance: VM-specific but VMM-agnostic
operations. Empty for now, future home for AddBlockDevice and similar.
- io.systemd.QemuMachineInstance: QEMU-specific operations. Defines
AcquireQMP() for protocol upgrade to a direct QMP connection.
The "Instance" suffix avoids collision with machined's existing
io.systemd.Machine interface.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
* 207e2d0044 Stop building support for openssl engines
* 36a234147f Upload sources
* 3681163f81 Version 260.1
* 8f4f0f58e3 Version 260
* e3fab23aa0 Version 260~rc4
* e4c1c2100b Version 260~rc3
* 453696813e Fix typo in unit name in %post scriptlet
* 154edb7cdb Silence false positive "HWID match failed, no DT blob" error (rhbz#2444759)
* 03b6637c35 riscv64 port has LTO disabled
* ce1dec6a40 Version 260~rc2
* 809049777c Add patch for symlink creation error
* 6ff27708f7 Enable getty@.service through presets
* ba7807fbce Drop scriptlet for upgrades from versions <253
* 455f277188 Move support for tpm2 to systemd-udev subpackage
* 0183bc784e Version 260~rc1
Nick Rosbrook [Mon, 13 Apr 2026 20:06:23 +0000 (16:06 -0400)]
test: do not use nanoseconds width specifier in date command
GNU date honors the format specifier +%s%6N and prints only the first 6
digits of the nanoseconds portion of the seconds since epoch. The uutils
implementation of date does not honor the width and always prints all 9
digits. This is a known bug[1], but can be worked around by adapting this
test to use nanoseconds instead of microseconds.
tree-wide: convert remaining varlink string fields to enum types (#41615)
Follow-up to #40972. Convert remaining plain string fields to proper
varlink enum types across all interfaces, per the policy that
user-controlled/API fields should be declared as proper enums in the
IDL.
Shared types moved to varlink-idl-common: ExecOutputType,
CGroupPressureWatch, EmergencyAction, ManagedOOMMode — these are reused
across multiple interfaces.
Each interface change includes a corresponding enum sync test to catch
future drift between C string tables and varlink IDL definitions.
Ivan Kruglov [Tue, 14 Apr 2026 09:25:43 +0000 (02:25 -0700)]
docs: clarify when to use varlink enum types vs plain strings
Add guidance on when a field should use a proper varlink enum type
versus remaining a plain string: user-controlled/API fields should be
enums, engine-internal state fields may stay as strings.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
Ivan Kruglov [Mon, 13 Apr 2026 10:56:48 +0000 (03:56 -0700)]
varlink: add ManagedOOMMode enum type to io.systemd.oom
Convert the mode field in ControlGroup from plain string to the
ManagedOOMMode enum type from varlink-idl-common. Register
ManagedOOMMode in both io.systemd.oom and io.systemd.ManagedOOM
interfaces since both use the ControlGroup struct.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
Ivan Kruglov [Mon, 13 Apr 2026 10:32:16 +0000 (03:32 -0700)]
varlink: add enum types for class and whom fields in io.systemd.Machine
Convert the class field (Register input, List output) from plain string
to MachineClass enum type, and the whom field (Kill input) to KillWhom
enum type.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
Ivan Kruglov [Mon, 13 Apr 2026 10:25:21 +0000 (03:25 -0700)]
varlink: add enum types for scheduling and mount settings in io.systemd.Unit
Convert CPUSchedulingPolicy, IOSchedulingClass, NUMAPolicy and MountFlags
fields from plain strings to proper varlink enum types in the io.systemd.Unit
interface. Update the corresponding serialization code to use
json_underscorify() for correct enum value formatting.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
Ivan Kruglov [Mon, 13 Apr 2026 09:53:38 +0000 (02:53 -0700)]
varlink: add enum types for configuration settings in io.systemd.Manager
Convert 8 string fields in the io.systemd.Manager varlink interface to
proper enum types:
- LogTarget: new enum (console, console_prefixed, kmsg, journal, ...)
- DefaultStandardOutput/Error: reuse ExecOutputType from common
- DefaultMemory/CPU/IOPressureWatch: reuse CGroupPressureWatch from common
- DefaultOOMPolicy: new enum (continue, stop, kill)
- CtrlAltDelBurstAction: reuse EmergencyAction from common
Output serialization updated to use JSON_BUILD_PAIR_ENUM for automatic
underscorification of dash-containing values.
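To make the underscorification concrete: a dash-containing value such as console-prefixed is exposed as console_prefixed. The sketch below also shows the kind of drift check the enum sync tests are described as doing. Both helpers are illustrative, and the assumption that only dashes need replacing is mine.

```python
def underscorify(value):
    # Assumption: only '-' has to be mapped for these enum values.
    return value.replace("-", "_")


def enum_drift(string_table, idl_values):
    """Return names present on only one side (empty set means in sync)."""
    expected = {underscorify(v) for v in string_table}
    return expected ^ set(idl_values)
```

An empty result from `enum_drift()` means the string table and the IDL enum agree; anything else names the values that went out of sync.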
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
executor: move reopening of the console after option parsing
It seems to be interfering with systemd:check-help-systemd-executor
test in CI. In practice, any messages from parse_argv() are going
to be from manual invocations, since if called from PID1 the option
syntax is going to be correct. So I hope this fixes the redirection
of --help but otherwise is of little consequence.
executor: do not abort on invalid serialization fd
E.g. --deserialize=15 would cause the program to abort in safe_close.
But in fact, we shouldn't try to do the close in any case: if the
fd is not valid, we should return an error without modifying state.
And if it _is_ valid, we set O_CLOEXEC on it, so it'll be closed
automatically later.
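The rule described here (validate the fd, never close it, mark it close-on-exec if it is good) might look like this in Python. The function name and the negative-errno return convention are assumptions chosen to resemble the C style, not the executor's actual code.

```python
import errno
import fcntl
import os


def adopt_serialization_fd(arg):
    """Return the fd on success, or a negative errno without touching state."""
    try:
        fd = int(arg)
    except ValueError:
        return -errno.EINVAL
    if fd < 0 or fd > 2**31 - 1:
        return -errno.EBADF
    try:
        # F_GETFD fails with EBADF if the fd is not open; nothing is closed.
        flags = fcntl.fcntl(fd, fcntl.F_GETFD)
        fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
    except OSError as e:
        return -e.errno
    return fd
```

An invalid number such as a closed descriptor yields `-errno.EBADF`, and the caller can report the error without any side effect on the fd table.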
NEWS: pre-announce removal of /run/boot-loader-entries/ support in logind (#41622)
logind could read UAPI.1 Boot Loader Spec entries from
/run/boot-loader-entries/ in addition to ESP/XBOOTLDR. This was pretty
half-assed, and to my knowledge was never actually used much.
Let's remove support for it and simplify our codebase.
Let's schedule it for removal via NEWS in a future version, to give
people a chance to speak up.
journal-upload: also disable VERIFYHOST when --trust=all is used
When --trust=all disables CURLOPT_SSL_VERIFYPEER, the residual
CURLOPT_SSL_VERIFYHOST check is ineffective since an attacker can
present a self-signed certificate with the expected hostname. Disable
both for consistency and log that server certificate verification is
disabled.
machined: pass user as positional argument in machine_default_shell_args()
Instead of interpolating the user name directly into the sh -c script
body via asprintf %s, pass it as a positional parameter ($1) in a
separate argv entry. This avoids the user string being parsed as part
of the shell script syntax.
Also validate the user name in bus_machine_method_open_shell() with
valid_user_group_name(), matching the validation already done on the
Varlink path via json_dispatch_const_user_group_name().
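The difference between interpolating a string into the script and passing it as $1 can be demonstrated with a tiny example. The script body here (`printf`) is a placeholder, not the script machined runs; only the argv layout (script, then $0, then the user as $1) reflects the technique.

```python
import subprocess


def run_with_positional_user(user):
    """Run a fixed sh -c script and hand the user name over as $1."""
    script = 'printf "%s" "$1"'  # placeholder script body
    argv = ["sh", "-c", script, "sh", user]  # "sh" becomes $0, user becomes $1
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout
```

Even a hostile value like `root; echo pwned` comes back verbatim, because the shell never parses it as script text.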
logind: reject wall messages containing control characters
method_set_wall_message() and the property setter only checked the
message length but not its content. Since wall messages are broadcast
to all TTYs, control characters in the message could interfere with
terminal state. Reject messages containing control characters other
than newline and tab.
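A content check of this kind could be sketched as below. The length limit of 4096 and the exact definition of "control character" (bytes below 0x20 plus DEL) are assumptions for illustration, not values taken from logind.

```python
def wall_message_is_valid(message, max_len=4096):
    """Accept printable text plus newline and tab; reject other control chars."""
    if len(message) > max_len:
        return False
    for ch in message:
        code = ord(ch)
        if (code < 0x20 and ch not in "\n\t") or code == 0x7F:
            return False
    return True
```

An escape sequence such as `"\x1b[2J"` is rejected, while a multi-line message with tabs still passes.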
core: add missing SELinux access checks when listing units
Add mac_selinux_unit_access_check_varlink() to the unit enumeration
loop in vl_method_list_units(), silently skipping units the caller
is not permitted to see, matching the D-Bus ListUnits behavior.
Add mac_selinux_access_check_varlink() to vl_method_describe_manager().
In ccecae0efd ("vmspawn: use machine name in runtime directory path")
support for RUNTIME_DIRECTORY was dropped, which makes it difficult to
run systemd-vmspawn in a service unit that doesn't have write access to
the regular /run but should use its own managed RUNTIME_DIRECTORY. What
worked before was --keep-unit --system. We can't use XDG_RUNTIME_DIR
with --user instead, because that breaks --keep-unit, which we need
since vmspawn can't create a scope when there is no session. Switch back
to runtime_directory, which handles RUNTIME_DIRECTORY and tells us whether
we should use it as is without later cleanup or whether we need to use the
regular path where we create and delete the directory ourselves.
NEWS: pre-announce removal of /run/boot-loader-entries/ support in logind
logind could read UAPI.1 Boot Loader Spec entries from
/run/boot-loader-entries/ in addition to ESP/XBOOTLDR. This was pretty
half-assed, and to my knowledge was never actually used much.
Let's remove support for it and simplify our codebase.
Let's schedule it for removal via NEWS in a future version, to give
people a chance to speak up.
- Use persist-credentials: false for actions/checkout, so we don't
leak the github token credentials to subsequent jobs.
- Remove one / from the Edit/Write permissions. Currently, with the
absolute path from github.workspace, we expand to three slashes while
we only need two.
Ivan Kruglov [Mon, 13 Apr 2026 09:53:23 +0000 (02:53 -0700)]
varlink: move shared enum types to varlink-idl-common
Move ExecOutputType, CGroupPressureWatch, EmergencyAction and
ManagedOOMMode enum type definitions from varlink-io.systemd.Unit to
varlink-idl-common, as these types are shared across multiple varlink
interfaces.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
Kai Lüke [Mon, 13 Apr 2026 12:21:39 +0000 (21:21 +0900)]
vmspawn: Support RUNTIME_DIRECTORY again
In ccecae0efd ("vmspawn: use machine name in runtime directory path")
support for RUNTIME_DIRECTORY was dropped, which makes it difficult to
run systemd-vmspawn in a service unit that doesn't have write access
to the regular /run but should use its own managed RUNTIME_DIRECTORY.
What worked before was --keep-unit --system. We can't use
XDG_RUNTIME_DIR with --user instead, because that breaks --keep-unit,
which we need since vmspawn can't create a scope when there is no session.
Switch back to runtime_directory, which handles RUNTIME_DIRECTORY and
tells us whether we should use it as is without later cleanup or whether
we need to use the regular path where we create and delete the directory
ourselves.
many: final final set of coccinelle check-pointer-deref tweaks (#41595)
I promised in https://github.com/systemd/systemd/pull/41426 that it would
be the final update for coccinelle pointer deref checks. However, it turned
out there is this coccinelle/parsing_hacks.h that I wasn't aware of. The
file missed some important things like _cleanup_(x), which prevented
coccinelle from checking a bunch of functions.
This PR adds some missing defines to the parsing_hacks.h and fixes the
missing asserts(). I apologize that it's a bit long (and frankly boring)
and that I missed this earlier.
The last commit contains one small behavior change (ret in
sd_varlink_idl_parse() is now really optional) but the big one is very
mechanical.
This is useful when moving from `--pty` or `--pipe` to using
`--verbose`: you can use `--verbose-output=cat` to get similar output on
stdout while still having all of the advantages of `--verbose` over the
other options.
stat-util: always check S_ISDIR() before S_ISLNK()
Check S_ISDIR() before S_ISLNK() in all stat_verify_xyz() helpers that
check both, just to ensure we systematically return the same errors.
Milan Kyselica [Sat, 11 Apr 2026 08:26:13 +0000 (10:26 +0200)]
boot: fix loop bound and OOB in devicetree_get_compatible()
The loop used the byte offset end (struct_off + struct_size) as the
iteration limit, but cursor[i] indexes uint32_t words. This reads
past the struct block when end > size_words.
Use size_words (struct_size / sizeof(uint32_t)) which is the correct
number of words to iterate over.
Also fix a pre-existing OOB in the FDT_BEGIN_NODE handler: the guard
i >= size_words is always false inside the loop (since the loop
condition already ensures i < size_words), so cursor[++i] at the
boundary reads one word past the struct block. Use i + 1 >= size_words
to check before incrementing.
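Both fixes can be illustrated with a deliberately simplified walker: iterate up to the word count, not the byte offset, and test `i + 1` before stepping to the word after a begin-node token. The function below is a toy, not the boot code; only the token value 0x1 for FDT_BEGIN_NODE comes from the devicetree format.

```python
FDT_BEGIN_NODE = 0x1  # structure-block token from the devicetree format


def word_after_first_begin_node(words, size_words):
    """Return the word following the first BEGIN_NODE, or None.

    `words` may be longer than the struct block; only the first
    `size_words` entries belong to it.
    """
    i = 0
    while i < size_words:  # bound is a word count, not a byte offset
        if words[i] == FDT_BEGIN_NODE:
            if i + 1 >= size_words:  # check before incrementing
                return None
            i += 1
            return words[i]
        i += 1
    return None
```

With `words = [4, 1, 0xDEAD]` and `size_words = 2`, the trailing word lies outside the block and is never read; a guard of `i >= size_words` at that point would not have caught it.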
Milan Kyselica [Sat, 11 Apr 2026 08:25:19 +0000 (10:25 +0200)]
boot: fix integer overflow and division by zero in BMP splash parser
Bound image dimensions before computing row_size to prevent overflow
in the depth * x multiplication on 32-bit. Without this, crafted
dimensions like depth=32 x=0x10000001 wrap to a small row_size that
passes all subsequent checks.
Reject channel masks where all bits are set (popcount == 32), since
1U << 32 is undefined behavior and causes division by zero on
architectures where it evaluates to zero. Move the validation before
computing derived values for clarity. Use unsigned 1U in shifts to
avoid signed integer overflow UB for popcount == 31.
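The wraparound in the example dimensions is easy to reproduce by emulating 32-bit arithmetic. The row-size formula used here (round the bit count up to 32-bit words, times 4 bytes) is the usual BMP row padding rule and is an assumption about the parser; the dimension bound `max_dim` is hypothetical.

```python
U32 = 0xFFFFFFFF


def row_size_wrapped(depth, x):
    """Row size as a 32-bit multiplication would compute it (unchecked)."""
    bits = (depth * x) & U32  # emulate unsigned 32-bit overflow
    return ((bits + 31) & U32) // 32 * 4


def row_size_checked(depth, x, max_dim=1 << 16):
    """Bound the dimensions first, then compute without risk of wrapping."""
    if not 0 < x <= max_dim or not 0 < depth <= 32:
        raise ValueError("image dimensions out of range")
    return (depth * x + 31) // 32 * 4


def channel_bits(mask):
    """Return the popcount of a channel mask, rejecting the all-ones mask."""
    bits = bin(mask & U32).count("1")
    if bits == 32:
        raise ValueError("channel mask with all 32 bits set")
    return bits
```

For depth=32 and x=0x10000001 the unchecked version yields a row size of just 4 bytes, which is why the bound has to come before the multiplication.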
journal: limit decompress_blob() output to DATA_SIZE_MAX (#41604)
We already have checks in place during compression that limit the data
we compress, so they shouldn't decompress to anything larger than
DATA_SIZE_MAX unless they've been tampered with. Let's make this
explicit and limit all our decompress_blob() calls in journal-handling
code to that limit.
One possible scenario this should prevent is when one tries to open and
verify a journal file that contains a compression bomb in its payload:
```
$ systemd-run --user --wait --pipe -- build-local/journalctl --verify --file=$PWD/test.journal
Running as unit: run-p682422-i4875779.service
000110: Invalid hash (00000000 vs. 11e4948d73bdafdd)
000110: Invalid object contents: Bad message
File corruption detected at /home/fsumsal/repos/@systemd/systemd/test.journal:272 (of 1249896 bytes, 0%).
FAIL: /home/fsumsal/repos/@systemd/systemd/test.journal (Bad message)
Finished with result: exit-code
Main processes terminated with: code=exited, status=1/FAILURE
Service runtime: 48.051s
CPU time consumed: 47.941s
Memory peak: 8G (swap: 0B)
```
Same could be, in theory, possible with just `journalctl --file=`, but
the reproducer would be a bit more complicated (haven't tried it, yet).
Lastly, the change in journal-remote is mostly hardening, as the maximum
input size to decompress_blob() there is mandated by MHD's connection
memory limit (set to JOURNAL_SERVER_MEMORY_MAX which is 128 KiB at the
time of writing), so the possible output size there is already quite
limited (e.g. ~800 - 900 MiB for xz-compressed data).
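The general idea of capping decompressed output can be sketched with Python's zlib, which accepts a maximum output length. This is not decompress_blob(); the function name and the choice of zlib are assumptions for illustration, and `limit` plays the role of DATA_SIZE_MAX.

```python
import zlib


def decompress_blob_bounded(blob, limit):
    """Decompress `blob`, refusing to produce more than `limit` bytes."""
    d = zlib.decompressobj()
    # Ask for at most one byte more than allowed, so that an oversized
    # payload is detected without inflating all of it.
    out = d.decompress(blob, limit + 1)
    if len(out) > limit:
        raise ValueError("decompressed data exceeds limit")
    if not d.eof:
        raise ValueError("truncated or corrupt compressed data")
    return out
```

A compression bomb then costs at most `limit + 1` bytes of output memory before it is rejected, instead of gigabytes.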
Daan De Meyer [Mon, 22 Dec 2025 10:22:34 +0000 (11:22 +0100)]
nspawn: Add --restrict-address-families= option
Add a new --restrict-address-families= command line option and
corresponding RestrictAddressFamilies= setting for .nspawn files to
restrict which socket address families may be used inside a container.
Many address families such as AF_VSOCK and AF_NETLINK are not
network-namespaced, so restricting access to them in containers
improves isolation. The option supports allowlist and denylist modes
(via ~ prefix), as well as "none" to block all families, matching the
semantics of RestrictAddressFamilies= in unit files.
The address family parsing logic is extracted into a shared
parse_address_families() helper in parse-helpers.c, which is now also
used by config_parse_address_families() in load-fragment.c.
This is currently opt-in. In a future version, the default will be
changed to restrict address families to AF_INET, AF_INET6 and AF_UNIX.
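The allowlist/denylist semantics can be sketched as below. This is only an illustration: the struct, the function names, and the small name table are hypothetical and do not reflect the signature of parse_address_families() or how nspawn applies the filter:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical result type for this sketch. */
typedef struct FamilyFilter {
        bool is_allowlist;      /* false: the listed families are denied */
        int families[16];
        size_t n_families;
} FamilyFilter;

static int family_from_name(const char *s, size_t len) {
        static const struct { const char *name; int af; } table[] = {
                { "AF_UNIX",    AF_UNIX    },
                { "AF_INET",    AF_INET    },
                { "AF_INET6",   AF_INET6   },
                { "AF_NETLINK", AF_NETLINK },
                { "AF_VSOCK",   AF_VSOCK   },
        };

        for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++)
                if (strlen(table[i].name) == len && memcmp(table[i].name, s, len) == 0)
                        return table[i].af;

        return -1;
}

static int parse_families(const char *s, FamilyFilter *ret) {
        assert(s);
        assert(ret);

        FamilyFilter f = { .is_allowlist = true };

        /* "none" is an empty allowlist, i.e. every family is blocked. */
        if (strcmp(s, "none") == 0) {
                *ret = f;
                return 0;
        }

        /* A leading "~" turns the list into a denylist. */
        if (*s == '~') {
                f.is_allowlist = false;
                s++;
        }

        while (*s) {
                size_t len = strcspn(s, " ");

                if (len > 0) {
                        int af = family_from_name(s, len);
                        if (af < 0)
                                return -EINVAL;
                        if (f.n_families >= sizeof(f.families) / sizeof(f.families[0]))
                                return -E2BIG;
                        f.families[f.n_families++] = af;
                }

                s += len;
                s += strspn(s, " ");
        }

        *ret = f;
        return 0;
}

static bool family_permitted(const FamilyFilter *f, int af) {
        assert(f);

        bool listed = false;
        for (size_t i = 0; i < f->n_families; i++)
                if (f->families[i] == af)
                        listed = true;

        return f->is_allowlist ? listed : !listed;
}
```

For instance, parsing "~AF_VSOCK AF_NETLINK" yields a denylist under which AF_UNIX is still permitted, while "none" yields an empty allowlist that permits nothing.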
Daan De Meyer [Fri, 27 Mar 2026 22:03:14 +0000 (22:03 +0000)]
systemctl: replace kexec-tools dependency with direct kexec_file_load() syscall
Replace the fork+exec of /usr/bin/kexec in load_kexec_kernel() with a
direct kexec_file_load() syscall, removing the runtime dependency on
kexec-tools for systemctl kexec.
The kexec_file_load() syscall (available since Linux 3.17) accepts
kernel and initrd file descriptors directly, letting the kernel handle
image parsing, segment setup, and purgatory internally. This is much
simpler than the older kexec_load() syscall which requires complex
userspace setup of memory segments and boot protocol structures — that
complexity is the raison d'être of kexec-tools.
The implementation follows the established libc wrapper pattern: a
missing_kexec_file_load() fallback in src/libc/kexec.c calls the
syscall directly when glibc doesn't provide a wrapper (which is
currently always the case). The syscall is not available on all
architectures — alpha, i386, ia64, m68k, mips, sh, and sparc lack
__NR_kexec_file_load — so the wrapper and caller are guarded with
HAVE_KEXEC_FILE_LOAD_SYSCALL to compile cleanly everywhere.
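A guarded wrapper of this kind might look roughly like the sketch below (Linux only). The function name is hypothetical and the guard here keys directly on `__NR_kexec_file_load` rather than on the build-system macro mentioned above; the argument order follows kexec_file_load(2):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper name for this sketch. */
static long sketch_kexec_file_load(int kernel_fd, int initrd_fd,
                                   unsigned long cmdline_len, const char *cmdline,
                                   unsigned long flags) {
        assert(cmdline || cmdline_len == 0);

#ifdef __NR_kexec_file_load
        /* No libc wrapper is assumed, so issue the raw syscall. */
        return syscall(__NR_kexec_file_load, kernel_fd, initrd_fd,
                       cmdline_len, cmdline, flags);
#else
        /* Architectures without the syscall number: report "not implemented"
         * so that the caller can fall back to another path. */
        (void) kernel_fd;
        (void) initrd_fd;
        (void) cmdline_len;
        (void) cmdline;
        (void) flags;
        errno = ENOSYS;
        return -1;
#endif
}
```

Either branch gives the caller the usual -1/errno convention, so the fallback decision can be made in one place.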
When kexec_file_load() rejects the kernel image with ENOEXEC (e.g. the
image is compressed or wrapped in a PE container that the kernel's kexec
handler doesn't understand natively), we attempt to unwrap/decompress
and retry. This is effectively the same decompression and extraction
logic that already lives in src/ukify/ukify.py (maybe_decompress() and
get_zboot_kernel()), now implemented in C so that systemctl can handle
it natively without shelling out to external tools:
- Compressed kernels (Image.gz, Image.zst, Image.xz, Image.lz4): the
format is detected by magic bytes (per RFC 1952, RFC 8878,
tukaani.org xz spec, and lz4 frame format spec) and decompressed to
a memfd using the existing decompress_stream_*() infrastructure plus
the new gzip support from the previous commit. This is primarily
needed on arm64 where kexec_file_load() only accepts raw Image files.
On x86_64, bzImage is already the native format and works directly.
- EFI ZBOOT PE images (vmlinuz.efi): detected by "MZ" + "zimg" magic
at the start of the file. The compressed payload offset, size, and
compression type are read from the ZBOOT header defined in
linux/drivers/firmware/efi/libstub/zboot-header.S.
- Unified Kernel Images (UKI): detected as PE files with a .linux
section via the existing pe_is_uki() infrastructure. The .linux
section (kernel) and optionally .initrd section are extracted to
memfds. When a UKI provides an embedded initrd and the boot entry
doesn't specify one separately, the embedded initrd is used.
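The magic-byte detection for the first two cases can be sketched as follows. The enum and function names are invented for the example and are not the helpers added by the commit; the byte sequences are the published frame magics for gzip, zstd, xz and lz4, plus the "MZ"/"zimg" layout described above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names for this sketch. */
typedef enum ImageFormat {
        IMAGE_UNKNOWN,
        IMAGE_GZIP,
        IMAGE_ZSTD,
        IMAGE_XZ,
        IMAGE_LZ4,
        IMAGE_ZBOOT,
} ImageFormat;

static int has_prefix(const uint8_t *buf, size_t size, const void *magic, size_t n) {
        return size >= n && memcmp(buf, magic, n) == 0;
}

static ImageFormat detect_image_format(const uint8_t *buf, size_t size) {
        static const uint8_t gzip_magic[] = { 0x1f, 0x8b };
        static const uint8_t zstd_magic[] = { 0x28, 0xb5, 0x2f, 0xfd };
        static const uint8_t xz_magic[]   = { 0xfd, '7', 'z', 'X', 'Z', 0x00 };
        static const uint8_t lz4_magic[]  = { 0x04, 0x22, 0x4d, 0x18 };

        assert(buf || size == 0);

        /* EFI ZBOOT: "MZ" at offset 0 and "zimg" at offset 4. */
        if (size >= 8 && memcmp(buf, "MZ", 2) == 0 && memcmp(buf + 4, "zimg", 4) == 0)
                return IMAGE_ZBOOT;

        if (has_prefix(buf, size, gzip_magic, sizeof(gzip_magic)))
                return IMAGE_GZIP;
        if (has_prefix(buf, size, zstd_magic, sizeof(zstd_magic)))
                return IMAGE_ZSTD;
        if (has_prefix(buf, size, xz_magic, sizeof(xz_magic)))
                return IMAGE_XZ;
        if (has_prefix(buf, size, lz4_magic, sizeof(lz4_magic)))
                return IMAGE_LZ4;

        return IMAGE_UNKNOWN;
}
```

UKI detection is left out here since it needs real PE section parsing rather than a fixed prefix.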
The try-first-then-decompress approach means we never decompress
unnecessarily: on x86_64 the first kexec_file_load() call succeeds
immediately with the raw bzImage, and on architectures where the
kernel's kexec handler natively understands PE (like LoongArch with
kexec_efi_ops), ZBOOT/UKI images work without decompression too.
If kexec_file_load() is unavailable (architectures without the syscall)
or all attempts fail, we fall back to forking+execing the kexec binary.
This preserves compatibility on architectures like i386 and mips where
only the older kexec_load() syscall exists and kexec-tools is needed to
handle the complex userspace setup.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
compress: rework decompressor_detect() on top of compression_detect_from_magic()
Replace the duplicated magic byte signatures in decompressor_detect()
with a call to the new compression_detect_from_magic() helper and use a
switch statement to initialize the appropriate decompression context.
time-util: encode our assumption that clock_gettime() never can return 0 or USEC_INFINITY
We generally assume that valid times returned by clock_gettime() are > 0
and < USEC_INFINITY. If this didn't hold, all kinds of things would
break, because we could no longer distinguish our niche values from
regular values.
Let's hence encode our assumptions in C, if only to help static
analyzers and LLMs.
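A minimal sketch of what encoding that assumption could look like is shown below. The typedef and constants are redefined locally so the example stands alone, and the function name is made up; it is not the time-util code itself:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Local stand-ins so that the sketch is self-contained. */
typedef uint64_t usec_t;
#define USEC_INFINITY ((usec_t) UINT64_MAX)
#define USEC_PER_SEC  ((usec_t) 1000000ULL)
#define NSEC_PER_USEC ((usec_t) 1000ULL)

/* Hypothetical helper name. */
static usec_t now_usec(clockid_t clock_id) {
        struct timespec ts;

        int r = clock_gettime(clock_id, &ts);
        assert(r == 0);
        (void) r;

        usec_t t = (usec_t) ts.tv_sec * USEC_PER_SEC + (usec_t) ts.tv_nsec / NSEC_PER_USEC;

        /* 0 and USEC_INFINITY are niche values meaning "unset" and "never",
         * so a real timestamp must not collide with either. */
        assert(t > 0);
        assert(t < USEC_INFINITY);

        return t;
}
```

Callers can then treat 0 and USEC_INFINITY as sentinels without worrying that a genuine clock reading produced them.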
One more round, this time with the help of the claudebot, especially for
spelunking in git blame to find the original commit and writing commit
messages from the list of warnings exported from Coverity.
Co-developed-by: Claude <claude@anthropic.com>
core: varlink enum for io.systemd.Unit interface (#40972)
Convert string fields to varlink enums in io.systemd.Unit
Following
https://github.com/systemd/systemd/pull/39391#discussion_r2489599449,
convert all configuration setting fields in the io.systemd.Unit varlink
interface from bare SD_VARLINK_STRING to proper enum types, adding type
safety to the IDL.
This converts ~30 fields across ExecContext, CGroupContext, and
UnitContext, adding 25 new varlink enum types.
Weak compatibility breakage (per
https://github.com/systemd/systemd/pull/40972#issuecomment-4222294318):
Varlink enum identifiers cannot contain - or +, so affected values are
underscorified on the wire. For example, "tty-force" becomes tty_force,
"kmsg+console" becomes kmsg_console.
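The mapping from configuration values to wire identifiers can be sketched like this. The function name is hypothetical and the real conversion may live elsewhere or be table-driven:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated copy of s with every '-'
 * and '+' replaced by '_', or NULL on allocation failure. */
static char *underscorify(const char *s) {
        assert(s);

        size_t n = strlen(s);
        char *r = malloc(n + 1);
        if (!r)
                return NULL;

        for (size_t i = 0; i < n; i++)
                r[i] = (s[i] == '-' || s[i] == '+') ? '_' : s[i];
        r[n] = '\0';

        return r;
}
```

Note that the mapping is not reversible on its own: both "a-b" and "a+b" become "a_b", so a receiver would need the enum definition to map values back.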
journal: limit decompress_blob() output to DATA_SIZE_MAX
We already have checks in place during compression that limit the data
we compress, so they shouldn't decompress to anything larger than
DATA_SIZE_MAX unless they've been tampered with. Let's make this
explicit and limit all our decompress_blob() calls in journal-handling
code to that limit.
One possible scenario this should prevent is when one tries to open and
verify a journal file that contains a compression bomb in its payload:
$ systemd-run --user --wait --pipe -- build-local/journalctl --verify --file=$PWD/test.journal
Running as unit: run-p682422-i4875779.service
000110: Invalid hash (00000000 vs. 11e4948d73bdafdd)
000110: Invalid object contents: Bad message
File corruption detected at /home/fsumsal/repos/@systemd/systemd/test.journal:272 (of 1249896 bytes, 0%).
FAIL: /home/fsumsal/repos/@systemd/systemd/test.journal (Bad message)
Finished with result: exit-code
Main processes terminated with: code=exited, status=1/FAILURE
Service runtime: 48.051s
CPU time consumed: 47.941s
Memory peak: 8G (swap: 0B)
The same could, in theory, be possible with just `journalctl --file=`, but
the reproducer would be a bit more complicated (haven't tried it yet).
Lastly, the change in journal-remote is mostly hardening, as the maximum
input size to decompress_blob() there is mandated by MHD's connection
memory limit (set to JOURNAL_SERVER_MEMORY_MAX which is 128 KiB at the
time of writing), so the possible output size there is already quite
limited (e.g. ~800 - 900 MiB for xz-compressed data).
Michael Vogt [Sun, 12 Apr 2026 13:47:48 +0000 (15:47 +0200)]
coccinelle: add SIZEOF() macro to work-around sizeof(*private)
We have code like `size_t max_size = sizeof(*private)` in three
places. This is evaluated at compile time, so it's safe to use. However,
the new pointer-deref checker in coccinelle is not smart enough to know
this and will flag those as errors. To avoid these false positives
we have some options:
1. Reorder so that we do:
```C
size_t max_size = 0;
assert(private);
max_size = sizeof(*private);
```
2. Use something like `size_t max_size = sizeof(*ASSERT_PTR(private));`
3. Place the assert before the declaration
4. Work around coccinelle via SIZEOF(*private), which we can then hide
via parsing_hacks.h
5. Fix coccinelle (OCaml, hard)
6. ... something I missed?
None of these is very appealing. I went for (4), but I'm happy to hear
suggestions.
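Option (4) could look roughly like the sketch below. The macro body is an assumption (a plain alias for sizeof), and the struct and function are made up for the example:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed definition: a thin alias, so the compiler sees plain sizeof while
 * the checker can be taught to treat SIZEOF() specially. */
#define SIZEOF(x) sizeof(x)

/* Hypothetical type and function for illustration. */
struct private_data {
        int a;
        char b[12];
};

static size_t private_size(const struct private_data *private) {
        /* The operand of sizeof is not evaluated here, so nothing is
         * dereferenced even though this precedes the assert(). */
        size_t max_size = SIZEOF(*private);

        assert(private);

        return max_size;
}
```

The generated code is identical to the sizeof(*private) form; only the spelling seen by the static checker changes.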