git.ipfire.org Git - thirdparty/systemd.git/log

ssh-issue: replace verb options by normal verbs

The interface of this program was rather strange. It took an option that
specified what to do, but that option behaved exactly like a verb. Let's
change the interface to the more modern style with verbs. Since the
inteface was documented in the man page, provide a compat shim to handle
the old options.

(In practice, I doubt anybody will notice the change. But since it was
documented, it's easier to provide the compat then to think too much
whether it is actually needed. I think we can drop it an year or so.)

ssh-issue: convert to the new option parser

--make-vsock and --rm-vsock are now described in --help, not only
shown in synopsis.

Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>

po: Update Serbian and add Serbian Latin translation (#41597)

Updates the existing Serbian (`sr`) translation and adds a new Serbian
Latin (`sr@latin`) translation, with both locales registered in
`po/LINGUAS`.

machined: gate metadata querying behind inspect-machines/images action

Ensure only privileged users can call the system scope machined's
APIs that get data out of a machine

Follow-up for 1bd979dddbb6ed3ffe410d78a7ff80cbb1c42a64
Follow-up for 9153b02bb5030e29d6008992fb74b9028d7c392c

sysupdated: don't crash when an mstack machine image is found

As soon as machinectl list-images has an mstack entry updatectl fails
because systemd-sysupdated crashes with an assertion failing because
the mstack case was not handled.
For now mstack is not supported as image for sysupdate to operate on
and we can skip it.

Fixes https://github.com/systemd/systemd/issues/41649

sd-varlink: Don't log successful sentinel error dispatch as a failure

sd_varlink_error() deliberately returns a negative errno mapped from
the error id on success so callbacks can `return sd_varlink_error(...);`
to enqueue the reply and propagate a matching errno at once. When
varlink_dispatch_method() dispatches a configured error sentinel itself,
it doesn't need that mapping — but it was treating any negative return
as a dispatch failure and logging "Failed to process sentinel" even
though the error reply had been successfully enqueued.

Detect success via the state transition to VARLINK_PROCESSED_METHOD
instead, so only genuine enqueue failures are logged.

systemd-vmspawn: QMP-varlink bridge for VM runtime control (#41449)

systemd-vmspawn currently has zero runtime control over the VMs it
launches. It can kill QEMU (SIGTERM) or SSH in, but it cannot pause,
resume, request a graceful power-off, query status, or
react to VM events. QEMU exposes all of this via its QMP protocol;
systemd's native IPC is varlink. This series bridges the two.

  Architecture

```
  machinectl → machined (Machine.List → discovers controlAddress)
  machinectl → vmspawn varlink socket (direct connection)
                 ├── io.systemd.MachineInstance       (generic VM control)
                 ├── io.systemd.VirtualMachineInstance (placeholder)
                 └── io.systemd.QemuMachineInstance   (QEMU-specific; AcquireQMP stub)
  vmspawn internally → socketpair → QEMU QMP
```

machined stores the controlAddress but never connects to vmspawn.
machinectl discovers the address from Machine.List and connects
directly. Socket mode 0600 is the access-control boundary —
the socket is rooted in vmspawn's $RUNTIME_DIRECTORY, so only the UID
that launched the VM can talk to it.

  QMP client library (src/shared/qmp-client.{c,h})

  A small non-blocking QMP client modeled on sd-varlink's pump contract:

- Reference-counted QmpClient with an explicit five-state machine:
HANDSHAKE_INITIAL → HANDSHAKE_GREETING_RECEIVED →
HANDSHAKE_CAPABILITIES_SENT → RUNNING → DISCONNECTED.
- qmp_client_connect_fd() is non-blocking: it wraps the fd in a
JsonStream and returns immediately. The greeting + qmp_capabilities
handshake is driven lazily on the first
qmp_client_invoke() or by the event loop — whichever comes first — so
callers never block during connect.
- qmp_client_attach_event() attaches to sd_event for async operation;
qmp_client_process() performs one pump step (write → dispatch → parse →
read → disconnect) with the same contract as
sd_varlink_process(); qmp_client_wait() blocks until the next I/O event.
- qmp_client_invoke() sends an async command and fires the registered
qmp_command_callback_t with (result, error_desc, error, userdata) on
completion. Synchronous callers drive
  process()/wait() in a loop until qmp_client_is_idle() is true.
- QmpClientArgs bundles the JSON arguments and an FD list for a single
command; the QMP_CLIENT_ARGS_FD() macro hands one fd to the callee for
SCM_RIGHTS passing. On partial-stage failure the
args list is narrowed so the caller's cleanup closes only the
untransferred tail.
- Event broadcast to a registered callback via qmp_client_bind_event();
transport loss surfaces through qmp_client_bind_disconnect().
- qmp_schema_has_member() walks the query-qmp-schema result for optional
runtime capability probes.

  vmspawn device setup via QMP

vmspawn starts QEMU paused (-S), sets up devices via QMP, then resumes
with cont. The entire device plane moves off the legacy INI config path
and onto the bridge.

A new MachineConfig aggregate in vmspawn-qmp.h groups the per-device
info (DriveInfos, NetworkInfo, VirtiofsInfos, VsockInfo) with a single
machine_config_done() cleanup that chains the
sub-structure destructors; each conversion patch populates exactly the
field it owns.

  What the conversion enables:

- FD-based device passing via add-fd / getfd + SCM_RIGHTS — vmspawn
opens every image file, TAP, VSOCK, and virtiofs socket itself and hands
the fd to QEMU. QEMU never needs filesystem
  access.
- Ephemeral overlays via blockdev-create + async job-concluded
continuations on anonymous O_TMPFILE / memfd backings — no named overlay
files on disk.
- PCIe root-port pre-allocation for q35/virt machine types so
hotplug-capable slots exist at boot (NVMe, virtio-scsi, etc.).
- io_uring availability probing with automatic fallback to the default
AIO backend if QEMU's build doesn't support it.

Per-command callbacks call sd_event_exit() on setup failure so vmspawn
shuts down cleanly if any device can't be attached.

  machinectl integration

- machinectl pause / resume / poweroff / reboot / terminate go through
the varlink control socket for VMs.
- D-Bus fallback for containers: poweroff sends SIGRTMIN+4, terminate
calls the existing TerminateMachine method — unchanged container
behavior.
- Multi-machine parallel dispatch via sd_event for bulk operations
(machinectl pause vm1 vm2 ...) so one slow VM doesn't serialize the
rest.
- SubscribeEvents streaming with per-subscriber event-name filters
(importd Pull-style pattern: initial {ready:true} notify, fan out via
varlink_many_notifybo(), lazy init — QMP event pump
  runs only while subscribers exist).

  Tests

- Unit test with a mock QMP server covering handshake, command/response,
events, and EOF.
- Integration test against real QEMU (-machine none) exercising
handshake + query-qmp-schema (~200 KB reply, validates the buffered
reader across multiple read()s) and query-status.
- Integration test for the machinectl verbs end-to-end: pause / resume /
describe / subscribe / terminate.
- Integration test for the multi-drive pipeline and ephemeral overlays
(blockdev-create async job continuations).
- Stress test: 5 cycles of start → 3× (pause/describe/resume/describe) →
terminate.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

TODO: add some vmspawn todos

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

vmspawn: add integration test for multi-drive and ephemeral QMP setup

Test the async QMP drive pipeline with real QEMU:

Test 1 (multi-drive): launches vmspawn with --image plus two
--extra-drive flags. This exercises multiple fdset allocations,
pipelined blockdev-add commands relying on FIFO ordering, io_uring
retry callbacks, and multiple device_add commands — all fired
without waiting for responses.

Test 2 (ephemeral): launches vmspawn with --image --ephemeral. This
exercises the most complex async path: blockdev-create fires a
background job, JOB_STATUS_CHANGE events are watched via the event
callback, and when the job concludes the deferred continuation fires
the overlay format node + device_add. If the continuation fails, the
root drive is never attached, the kernel panics, and vmspawn exits
without registering — so successful registration proves the pipeline
works.

Both tests use a raw ext4 image with a minimal init (sleep infinity)
and direct kernel boot. No virtiofsd needed.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

vmspawn: add integration test for machinectl VM control verbs

Add TEST-87-AUX-UTILS-VM.vmspawn.sh that validates the QMP-varlink
bridge end-to-end using a real QEMU instance:

- Launches vmspawn with --directory and --linux for direct kernel boot
(no UEFI firmware or bootable image needed)
- Waits for machine registration with machined
- Verifies varlinkAddress is exposed in Machine.List
- Tests machinectl pause, resume, poweroff
- Exercises MachineInstance varlink interface directly via varlinkctl:
QueryStatus state verification across pause/resume, Pause, Resume

Skipped automatically if vmspawn, QEMU, or a bootable kernel is not
available. Runs as part of TEST-87-AUX-UTILS-VM in the mkosi
integration test suite.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

vmspawn: add integration test for QMP client library against real QEMU

Add a test that launches QEMU with -machine none (no bootable image
needed) and exercises the QMP client library against the real QMP
implementation:

- test_qmp_client_qemu_handshake_and_schema: sends query-qmp-schema
  (~200KB response that exercises the buffered multi-read() path)
  via qmp_client_invoke(), then cleanly shuts down QEMU via quit.
  The QMP handshake completes transparently inside invoke().

- test_qmp_client_qemu_query_status: validates query-status response
  parsing, stop/cont command sequencing with id correlation, and state
  verification between commands

The test is automatically skipped when QEMU is not installed.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

vmspawn: add integration test for QMP client library

Test the QMP client library using a mock QMP server over a socketpair:

- test_qmp_client_basic: Verifies full handshake, query-status with
  response parsing, stop/cont commands, and asynchronous STOP event
  delivery via the sd-event I/O callback
- test_qmp_client_eof: Verifies that the client properly detects
  server disconnection (EOF) and returns a disconnect error

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

shell-completion: add bash/zsh completions for machinectl pause/resume

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

man: document machinectl pause/resume and update poweroff for VMs

Add manpage entries for the new pause and resume verbs. Update the
poweroff description to cover VMs (ACPI powerdown via QMP) in addition
to containers (SIGRTMIN+4).

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

machinectl: add VM control commands

Add new verbs for controlling vmspawn VMs:

  machinectl poweroff <machine>
  machinectl pause <machine>
  machinectl resume <machine>

Each verb discovers the machine's varlinkAddress via machined's
Machine.List, connects directly to vmspawn's varlink socket, and
calls the corresponding io.systemd.MachineInstance method.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

vmspawn: convert drive device setup to bridge

Remove the static blockdev/device/snapshot INI config sections and
the SCSI controller setup for both the root image drive and extra
drives. Replace with DriveInfos that are constructed in the parent
after fork: vmspawn opens all image files and passes fds to QEMU via
the add-fd path. For ephemeral mode, anonymous overlay files are
created via O_TMPFILE or memfd.

The resolve_disk_driver() helper maps DiskType to the appropriate
QEMU driver name and serial format.

The post-fork device-info preparation is split into helpers:
prepare_primary_drive() and prepare_extra_drives() for per-drive
construction, assign_pcie_ports() for naming the pre-allocated
pcie-root-port bridges once every device type is known, and
prepare_device_info() that stitches them together against the
MachineConfig aggregate.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

vmspawn: convert virtiofs device setup to bridge

Remove the static chardev/device INI config sections for both the
root filesystem and runtime mount virtiofs instances. Replace with
VirtiofsInfos that capture socket paths and tags for each virtiofs
mount, passed to vmspawn_varlink_setup_virtiofs() for runtime
configuration via QMP.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

vmspawn: convert VSOCK device setup to bridge

Remove the static vsock0 INI config section and the related pass_fds
plumbing. Replace with a VsockInfo struct that captures the vhost fd
and guest CID, passed to vmspawn_qmp_setup_vsock() for runtime
configuration via QMP. The VSOCK fd is now sent to QEMU via QMP getfd
+ SCM_RIGHTS instead of being inherited.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

vmspawn: convert network device setup to bridge

Remove the static netdev/nic INI config sections for both privileged
TAP, nsresourced TAP, and user-mode networking. Replace them with a
NetworkInfo struct that captures the network type, TAP fd or interface
name, and MAC address, passed to vmspawn_varlink_setup_network() for
runtime configuration via QMP.

For the nsresourced TAP path the fd is now passed to QEMU via QMP
getfd + SCM_RIGHTS instead of being inherited through pass_fds.

Declare the MachineConfig aggregate that this and the following
conversion patches populate, zero-initialized with explicit -EBADF
for the fd fields so every sub-structure cleans up safely regardless
of which device types the invocation ends up using.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

vmspawn: QMP-varlink bridge for VM runtime control

Create a QMP socketpair for QEMU machine monitor control, configure
the QMP chardev+mon via the QEMU config file, and wire up the bridge
infrastructure.

After fork, vmspawn initializes the QMP bridge, probes QEMU feature
support synchronously (driving the QMP handshake to RUNNING
transparently), resumes vCPUs, then sets up the varlink server for
runtime VM control. The control socket path is passed to machined via
the controlAddress field in machine registration.

Device configuration still uses the legacy INI config path and will
be converted to bridge calls in subsequent commits.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

vmspawn: pre-allocate PCIe root ports for device hotplug

On PCIe machine types (q35, virt), QMP device_add is always hotplug —
even with vCPUs stopped. The root PCIe bus (pcie.0) does not support
hotplugging; only pcie-root-port bridges do. Pre-allocate enough root
ports in the QEMU config file for all devices that will be set up via
QMP, plus 10 spare ports for future runtime hotplug.

Add ARCHITECTURE_NEEDS_PCIE_ROOT_PORTS macro to guard PCIe-specific
setup on x86, ARM, RISC-V, and LoongArch (the architectures whose
QEMU machine type is q35 or virt).

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

vmspawn: set up varlink bridge infrastructure

Create the QMP-to-varlink bridge layer (vmspawn-qmp.{c,h}) and the
varlink server layer (vmspawn-varlink.{c,h}).

The QMP bridge (VmspawnQmpBridge) owns the QmpClient connection and
manages pending background jobs (e.g. blockdev-create continuations).
vmspawn_qmp_init() creates the client and attaches it to the event
loop. vmspawn_qmp_probe_features() drives io_uring and qcow2
discard-no-unref probes synchronously via a qmp_client_process() +
qmp_client_wait() loop — the QMP handshake completes transparently on
the first invoke. vmspawn_qmp_start() resumes vCPUs via an async
"cont" command.

The varlink server (VmspawnVarlinkContext) exposes three interfaces:

- io.systemd.MachineInstance: generic machine control (Terminate,
  PowerOff, Reboot, Pause, Resume, Describe, SubscribeEvents).
  Method handlers forward to QMP commands asynchronously — the
  varlink reply is deferred until the QMP response arrives.
- io.systemd.VirtualMachineInstance: VM-specific (placeholder for
  future snapshot/migration methods).
- io.systemd.QemuMachineInstance: QEMU-specific (AcquireQMP stub).

The server listens on <runtime_dir>/control with mode 0600.

Event streaming follows the importd Pull pattern: SubscribeEvents
sends an initial {ready:true} notification, then fans out QMP events
to all subscribers. The disconnect handler only unrefs subscriber
links (matching resolved's vl_on_notification_disconnect pattern).

Introduce the MachineConfig aggregate in vmspawn-qmp.h grouping the
per-device info structures (DriveInfos, NetworkInfo, VirtiofsInfos,
VsockInfo) together with machine_config_done() that chains the
individual done helpers. Callers populate it field-by-field and rely
on the _cleanup_ attribute for orderly teardown regardless of which
device types the invocation ends up using.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

shared: add QMP client library

Async QMP client for talking to QEMU's machine monitor from
libsystemd-shared. The I/O core (buffered read/write, output queue,
event source management) follows sd-varlink's patterns.

State machine:

  INITIAL -> GREETING_RECEIVED -> CAPABILITIES_SENT -> RUNNING
                                                          |
                                                          v
                                                     DISCONNECTED

The QMP handshake (greeting + qmp_capabilities) is driven transparently
by qmp_client_invoke() through an internal qmp_client_ensure_running()
helper, matching sd-bus's bus_ensure_running() pattern. Callers never
wait for it explicitly.

qmp_client_invoke() is the only command interface: asynchronous, with
per-command callback. Slots are tracked in a Set keyed by id; replies
are dispatched by id match. SCM_RIGHTS fd passing is bundled through
QmpClientArgs and the QMP_CLIENT_ARGS_FD macro.

qmp_client_process() and qmp_client_wait() are exposed publicly,
mirroring sd_varlink_process() and sd_varlink_wait(). Callers that
need to drive the client synchronously — e.g. feature probing before
entering the sd_event main loop — can loop on them exactly like
varlink_call_internal() does on its varlink equivalents.

Other features:

- Buffered stream reader for QMP's \r\n-delimited JSON, handling
  multi-read responses (query-qmp-schema is ~200 KiB).
- Fdset id allocation via qmp_client_next_fdset_id().
- Synthetic SHUTDOWN event on unexpected disconnect.
- Disconnect detection with callback notification and pending-command
  cleanup.
- -ENOBUFS from the 16 MiB input-buffer cap is treated as recoverable
  (not a transport error).

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

json-stream: expose log helpers for consumers

Move json_stream_description(), json_stream_log(), and
json_stream_log_errno() from json-stream.c into json-stream.h so that
consumers like the QMP client can use the same description-prefixed
logging that json-stream itself uses internally.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

machined: add controlAddress field to Machine.Register and Machine.List

Follow the existing sshAddress pattern to add a controlAddress field
that allows machine registrants (like vmspawn) to advertise a varlink
socket address for direct VM control. machined stores and exposes
the address but never connects to it itself.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

shared: add varlink interface definitions for machine instance control

Add three varlink interface definitions for the machine instance control
hierarchy:

- io.systemd.MachineInstance: generic operations applicable to both
  containers and VMs (PowerOff, Reboot, Pause, Resume, QueryStatus,
  SubscribeEvents). nspawn could implement this same interface later.

- io.systemd.VirtualMachineInstance: VM-specific but VMM-agnostic
  operations. Empty for now, future home for AddBlockDevice and similar.

- io.systemd.QemuMachineInstance: QEMU-specific operations. Defines
  AcquireQMP() for protocol upgrade to a direct QMP connection.

The "Instance" suffix avoids collision with machined's existing
io.systemd.Machine interface.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Merge pull request #41375 from brauner/work.todo.cleanup

[zjs: I did the merge by hand to incorporate the changes that
were made on 'main' in the meantime. That is easier then rebasing
the series.]

various: fix compilation with openssl-4.0.0-beta1

Various types have been made opaque, so we need to use some accessor
functions.

mkosi: update fedora commit reference to 207e2d004468bf79a8bd78182d9b10956edf45c7

* 207e2d0044 Stop building support for openssl engines
* 36a234147f Upload sources
* 3681163f81 Version 260.1
* 8f4f0f58e3 Version 260
* e3fab23aa0 Version 260~rc4
* e4c1c2100b Version 260~rc3
* 453696813e Fix typo in unit name in %post scriptlet
* 154edb7cdb Silence false positive "HWID match failed, no DT blob" error (rhbz#2444759)
* 03b6637c35 riscv64 port has LTO disabled
* ce1dec6a40 Version 260~rc2
* 809049777c Add patch for symlink creation error
* 6ff27708f7 Enable getty@.service through presets
* ba7807fbce Drop scriptlet for upgrades from versions <253
* 455f277188 Move support for tpm2 to systemd-udev subpackage
* 0183bc784e Version 260~rc1

Convert executor to OPTION macros (#41618)

This was causing failures in CI previously for reasons I couldn't
fathom. So I'm submitting this conversion separately as a draft.

resolved: check for reset-statistics polkit action via D-Bus too

The varlink method checks for polkit authorization, so also
update the D-Bus method to match it.

Follow-up for cf01bbb7a45fb1eec28cd0a813bd68fde413410f

TEST-75-RESOLVED: Make sure --suppress-sync is not used

test-seccomp: Handle environment where sync() is already suppressed

We might be running in an nspawn container booted with --suppress-sync,
so make sure we handle that scenario gracefully.

TEST-06-SELINUX: Relabel in the initrd rather than at image build time

This gets rid of the requirement to run the image build as root.

TEST-64-UDEV-STORAGE: Add missing scsi controllers

TEST-24-CRYPTSETUP: Use virtio-blk-pci

Doesn't require a controller.

TEST-13-NSPAWN: Use timeout --foreground in two more places

TEST-07-PID1: Use --foreground with timeout

Otherwise the test fails if a TTY is attached to stdio.

TEST-07-PID1: Don't fail in vm without ESP or XBOOTLDR mount

docs: update footer to 2026

test: do not use nanoseconds width specifier in date command

Using the format specifier +%s%6N with GNU date is honored, and only
prints 6 digits of the nanoseconds portion of the seconds since epoch.
The uutils implementation of date does not honor this, and always prints
all 9 digits. This is a known bug[1], but can be worked around by
adapting this test to use nanoseconds instead of microseconds.

[1] https://github.com/uutils/coreutils/issues/11658

parse-helpers: Silence coverity warning

tree-wide: convert remaining varlink string fields to enum types (#41615)

Follow-up to #40972. Convert remaining plain string fields to proper
varlink enum types across all interfaces, per the policy that
user-controlled/API fields should be declared as proper enums in the
IDL.

Interfaces converted:
- io.systemd.Manager — LogTarget, DefaultStandardOutput/Error,
DefaultMemory/CPU/IOPressureWatch, DefaultOOMPolicy,
CtrlAltDelBurstAction (8 fields)
- io.systemd.Unit — CPUSchedulingPolicy, IOSchedulingClass, NUMAPolicy,
MountFlags (4 fields)
- io.systemd.Machine — class, whom (2 fields, input dispatch converted
to JSON_DISPATCH_ENUM_DEFINE for automatic dashify fallback)
- io.systemd.oom / io.systemd.ManagedOOM — mode in ControlGroup (1
field)

Shared types moved to varlink-idl-common: ExecOutputType,
CGroupPressureWatch, EmergencyAction, ManagedOOMMode — these are reused
across multiple interfaces.

Each interface change includes a corresponding enum sync test to catch
future drift between C string tables and varlink IDL definitions.

tree-wide: use JSON_BUILD_PAIR_ENUM() more often

docs: clarify when to use varlink enum types vs plain strings

Add guidance on when a field should use a proper varlink enum type
versus remaining a plain string: user-controlled/API fields should be
enums, engine-internal state fields may stay as strings.

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

varlink: add ManagedOOMMode enum type to io.systemd.oom

Convert the mode field in ControlGroup from plain string to the
ManagedOOMMode enum type from varlink-idl-common. Register
ManagedOOMMode in both io.systemd.oom and io.systemd.ManagedOOM
interfaces since both use the ControlGroup struct.

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

varlink: add enum types for class and whom fields in io.systemd.Machine

Convert the class field (Register input, List output) from plain string
to MachineClass enum type, and the whom field (Kill input) to KillWhom
enum type.

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

varlink: add enum types for scheduling and mount settings in io.systemd.Unit

Convert CPUSchedulingPolicy, IOSchedulingClass, NUMAPolicy and MountFlags
fields from plain strings to proper varlink enum types in the io.systemd.Unit
interface. Update the corresponding serialization code to use
json_underscorify() for correct enum value formatting.

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

varlink: add enum types for configuration settings in io.systemd.Manager

Convert 8 string fields in the io.systemd.Manager varlink interface to
proper enum types:

- LogTarget: new enum (console, console_prefixed, kmsg, journal, ...)
- DefaultStandardOutput/Error: reuse ExecOutputType from common
- DefaultMemory/CPU/IOPressureWatch: reuse CGroupPressureWatch from common
- DefaultOOMPolicy: new enum (continue, stop, kill)
- CtrlAltDelBurstAction: reuse EmergencyAction from common

Output serialization updated to use JSON_BUILD_PAIR_ENUM for automatic
underscorification of dash-containing values.

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

executor: move reopening of the console after option parsing

It seems to be interfering with systemd:check-help-systemd-executor
test in CI. In practice, any messages from parse_argv() are going
to be from manual invocations, since if called from PID1 the option
syntax is going to be correct. So I hope this fixes the redirection
of --help but otherwise is of little consequence.

executor: do not abort on invalid serialization fd

E.g. --deserialize=15 would cause the program to abrt in safe_close.
But in fact, we shouldn't try to do the close in any case: if the
fd is not valid, we should return an error without modifying state.
And if it _is_ valid, we set O_CLOEXEC on it, so it'll be closed
automatically later.

importd: harden curl file protocol handling

With old libcurl versions file:// can get redirects which can be messy, while
the new version rejects them. Set an option to explicit block them.

Assorted cleanups and fixes (#41626)

NEWS: pre-announce removal of /run/boot-loader-entries/ support in lo… (#41622)

…gind

logind could read UAPI.1 Boot Loader Spec entries from
/run/boot-loader-entries/ in addition to ESP/XBOOTLDR. This was pretty
half-assed, and to my knowledge was never actually used much.

Let's remove support for it and simplify our codebase.

Let's schedule it for removal via NEWS in a future version, to give
people a chance to speak up.

hwdb: Add extended SteelSeries Arctis headset device support (#41628)

Add USB device IDs for additional SteelSeries Arctis headset models to
the sound card hardware database.

Newly added device IDs:

- Arctis Nova 7x v2 (22AD)
- Arctis Nova 7 Diablo IV (22A9)
- Arctis Nova 7X (22A4)
- Arctis Nova 7X (22A5)
- Arctis Nova 7P V2 (22A7)

journal-upload: also disable VERIFYHOST when --trust=all is used

When --trust=all disables CURLOPT_SSL_VERIFYPEER, the residual
CURLOPT_SSL_VERIFYHOST check is ineffective since an attacker can
present a self-signed certificate with the expected hostname. Disable
both for consistency and log that server certificate verification is
disabled.

Follow-up for 8847551bcbfa8265bae04f567bb1aadc7b480325

machined: pass user as positional argument in machine_default_shell_args()

Instead of interpolating the user name directly into the sh -c script
body via asprintf %s, pass it as a positional parameter ($1) in a
separate argv entry. This avoids the user string being parsed as part
of the shell script syntax.

Also validate the user name in bus_machine_method_open_shell() with
valid_user_group_name(), matching the validation already done on the
Varlink path via json_dispatch_const_user_group_name().

Follow-up for 49af9e1368571f4e423cde0fd45ee284451434d1

logind: reject wall messages containing control characters

method_set_wall_message() and the property setter only checked the
message length but not its content. Since wall messages are broadcast
to all TTYs, control characters in the message could interfere with
terminal state. Reject messages containing control characters other
than newline and tab.

Follow-up for 9ef15026c0e7e6600372056c43442c99ec53746e
Follow-up for e2fa5721c3ee5ea400b99a6463e8c1c257e20415

core: check selinux access on each unit when listing

Units might have different access rules, so check the access on each
unit when querying the full list.

core: add missing SELinux access checks when listing units

Add mac_selinux_unit_access_check_varlink() to the unit enumeration
loop in vl_method_list_units(), silently skipping units the caller
is not permitted to see, matching the D-Bus ListUnits behavior.
Add mac_selinux_access_check_varlink() to vl_method_describe_manager().

Follow-up for 472abf7bec89caeb1cc413c1de17984ab8ccb5d6
Follow-up for 736349958efe34089131ca88950e2e5bb391d36a

vmspawn: Support RUNTIME_DIRECTORY again (#41619)

In ccecae0efd ("vmspawn: use machine name in runtime directory path")
support for RUNTIME_DIRECTORY was dropped which makes it difficult to
run systemd-vmspawn in a service unit which doesn't have write access to
the regular /run but should use its own managed RUNTIME_DIRECTORY. What
worked before was --keep-unit --system but we can't use XDG_RUNTIME_DIR
and --user because then --keep-unit breaks which we need because it
can't create a scope as there is no session. Switch back to
runtime_directory which handles RUNTIME_DIRECTORY and tells us whether
we should use it as is without later cleanup or if we need to use the
regular path where we create and delete the directory ourselves.

NEWS: pre-announce removal of /run/boot-loader-entries/ support in logind

logind could read UAPI.1 Boot Loader Spec entries from
/run/boot-loader-entries/ in addition to ESP/XBOOTLDR. This was pretty
half-assed, and to my knowledge was never actually used much.

Let's remove support for it and simplify our codebase.

Let's schedule it for removal via NEWS in a future version, to give
people a chance to speak up.

docs: fix capability name, it's CAP_MKNOD not CAP_SYS_MKNOD (#41621)

ci: Two claude-review fixes

- Use persist-credentials: false for actions/checkout, so we don't
  leak the github token credentials to subsequent jobs.
- Remove one / from the Edit/Write permissions. Currently, with the
  absolute path from github.workspace, we expand to three slashes while
  we only need two.

varlink: move shared enum types to varlink-idl-common

Move ExecOutputType, CGroupPressureWatch, EmergencyAction and
ManagedOOMMode enum type definitions from varlink-io.systemd.Unit to
varlink-idl-common, as these types are shared across multiple varlink
interfaces.

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

vmspawn: Support RUNTIME_DIRECTORY again

In ccecae0efd ("vmspawn: use machine name in runtime directory path")
support for RUNTIME_DIRECTORY was dropped which makes it difficult to
run systemd-vmspawn in a service unit which doesn't have write access
to the regular /run but should use its own managed RUNTIME_DIRECTORY.
What worked before was --keep-unit --system but we can't use
XDG_RUNTIME_DIR and --user because then --keep-unit breaks which
we need because it can't create a scope as there is no session.
Switch back to runtime_directory which handles RUNTIME_DIRECTORY and
tells us whether we should use it as is without later cleanup or if we
need to use the regular path where we create and delete the directory
ourselves.

many: final final set of coccinelle check-pointer-deref tweaks (#41595)

I promised in https://github.com/systemd/systemd/pull/41426 that its the
final update for coccinelle pointer deref checks. However it turned out
there is this coccinelle/parsing_hacks.h that I wasn't aware of. The
file missed some important things like _cleanup_(x) that prevented
coccinelle to check a bunch of functions.

This PR adds some missing defines to the parsing_hacks.h and fixes the
missing asserts(). I apologize that its a bit long (and frankly boring)
and that I missed this earlier.

The last commit contains one small behavior change (ret in
sd_varlink_idl_parse() is now really optional) but the big one is very
mechanical.

executor: convert to the new option parser, define OPTION_COMMON_LOG_*

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

stat-util: use stat_verify_xyz() and friends more (#41613)

run: allow setting output format

This is useful when moving from `--pty` or `--pipe` to using
`--verbose`: you can use `--verbose-output=cat` to get similar output on
stdout while still having all of the advantages of `--verbose` over the
other options.

stat-util: always check S_ISDIR() before S_ISLNK()

Check S_ISDIR() before S_ISLINK() for all stat_verify_xyz() helpers
first, where we check them. Just to ensure we systematically return the
same errors.

stat-util: also add stat_verify_char() and use it everywhere

socket-netlink: use stat_verify_socket() more

tree-wide: use stat_verify_directory() more

tree-wide: use stat_verify_regular() more

stat-util: also introduce stat_verify_block() and use it everywhere

stat-util: add shortcut for fd_verify_symlink()

We have a similar shortcut in the other fd_verify_xyz() calls, let's add
it here too, just for completion's sake.

tree-wide: make more use of stat_verify_device_node()

stat-util: introduce a common (stat|fd)_verify_regular_or_block() helper

We already had one in repart.c, let's generalize it, and use it
everywhere.

boot: fix loop bound and OOB in devicetree_get_compatible()

The loop used the byte offset end (struct_off + struct_size) as the
iteration limit, but cursor[i] indexes uint32_t words. This reads
past the struct block when end > size_words.

Use size_words (struct_size / sizeof(uint32_t)) which is the correct
number of words to iterate over.

Also fix a pre-existing OOB in the FDT_BEGIN_NODE handler: the guard
i >= size_words is always false inside the loop (since the loop
condition already ensures i < size_words), so cursor[++i] at the
boundary reads one word past the struct block. Use i + 1 >= size_words
to check before incrementing.

Fixes: https://github.com/systemd/systemd/issues/41590

boot: fix integer overflow and division by zero in BMP splash parser

Bound image dimensions before computing row_size to prevent overflow
in the depth * x multiplication on 32-bit. Without this, crafted
dimensions like depth=32 x=0x10000001 wrap to a small row_size that
passes all subsequent checks.

Reject channel masks where all bits are set (popcount == 32), since
1U << 32 is undefined behavior and causes division by zero on
architectures where it evaluates to zero. Move the validation before
computing derived values for clarity. Use unsigned 1U in shifts to
avoid signed integer overflow UB for popcount == 31.

Also reject zero-width and zero-height images.

Fixes: https://github.com/systemd/systemd/issues/41589

core: use JSON_BUILD_CONST_STRING() where appropriate

journal: limit decompress_blob() output to DATA_SIZE_MAX (#41604)

We already have checks in place during compression that limit the data
we compress, so they shouldn't decompress to anything larger than
DATA_SIZE_MAX unless they've been tampered with. Let's make this
explicit and limit all our decompress_blob() calls in journal-handling
code to that limit.

One possible scenario this should prevent is when one tries to open and
verify a journal file that contains a compression bomb in its payload:

```
$ ls -lh test.journal
-rw-rw-r--+ 1 fsumsal fsumsal 1.2M Apr 12 15:07 test.journal

$ systemd-run --user --wait --pipe -- build-local/journalctl --verify --file=$PWD/test.journal
Running as unit: run-p682422-i4875779.service
000110: Invalid hash (00000000 vs. 11e4948d73bdafdd)
000110: Invalid object contents: Bad message
File corruption detected at /home/fsumsal/repos/@systemd/systemd/test.journal:272 (of 1249896 bytes, 0%).
FAIL: /home/fsumsal/repos/@systemd/systemd/test.journal (Bad message)
          Finished with result: exit-code
Main processes terminated with: code=exited, status=1/FAILURE
               Service runtime: 48.051s
             CPU time consumed: 47.941s
                   Memory peak: 8G (swap: 0B)
```
Same could be, in theory, possible with just `journalctl --file=`, but
the reproducer would be a bit more complicated (haven't tried it, yet).

Lastly, the change in journal-remote is mostly hardening, as the maximum
input size to decompress_blob() there is mandated by MHD's connection
memory limit (set to JOURNAL_SERVER_MEMORY_MAX which is 128 KiB at the
time of writing), so the possible output size there is already quite
limited (e.g. ~800 - 900 MiB for xz-compressed data).

udev/scsi-id: hardening against malformed kernel data (#41585)

nspawn: Add --restrict-address-families= option

Add a new --restrict-address-families= command line option and
corresponding RestrictAddressFamilies= setting for .nspawn files to
restrict which socket address families may be used inside a container.

Many address families such as AF_VSOCK and AF_NETLINK are not
network-namespaced, so restricting access to them in containers
improves isolation. The option supports allowlist and denylist modes
(via ~ prefix), as well as "none" to block all families, matching the
semantics of RestrictAddressFamilies= in unit files.

The address family parsing logic is extracted into a shared
parse_address_families() helper in parse-helpers.c, which is now also
used by config_parse_address_families() in load-fragment.c.

This is currently opt-in. In a future version, the default will be
changed to restrict address families to AF_INET, AF_INET6 and AF_UNIX.

mkosi: Drop kexec-tools

Not needed anymore now that we use kexec_file_load().

systemctl: replace kexec-tools dependency with direct kexec_file_load() syscall

Replace the fork+exec of /usr/bin/kexec in load_kexec_kernel() with a
direct kexec_file_load() syscall, removing the runtime dependency on
kexec-tools for systemctl kexec.

The kexec_file_load() syscall (available since Linux 3.17) accepts
kernel and initrd file descriptors directly, letting the kernel handle
image parsing, segment setup, and purgatory internally. This is much
simpler than the older kexec_load() syscall which requires complex
userspace setup of memory segments and boot protocol structures — that
complexity is the raison d'être of kexec-tools.

The implementation follows the established libc wrapper pattern: a
missing_kexec_file_load() fallback in src/libc/kexec.c calls the
syscall directly when glibc doesn't provide a wrapper (which is
currently always the case). The syscall is not available on all
architectures — alpha, i386, ia64, m68k, mips, sh, and sparc lack
__NR_kexec_file_load — so the wrapper and caller are guarded with
HAVE_KEXEC_FILE_LOAD_SYSCALL to compile cleanly everywhere.

When kexec_file_load() rejects the kernel image with ENOEXEC (e.g. the
image is compressed or wrapped in a PE container that the kernel's kexec
handler doesn't understand natively), we attempt to unwrap/decompress
and retry. This is effectively the same decompression and extraction
logic that already lives in src/ukify/ukify.py (maybe_decompress() and
get_zboot_kernel()), now implemented in C so that systemctl can handle
it natively without shelling out to external tools:

- Compressed kernels (Image.gz, Image.zst, Image.xz, Image.lz4): the
   format is detected by magic bytes (per RFC 1952, RFC 8878,
   tukaani.org xz spec, and lz4 frame format spec) and decompressed to
   a memfd using the existing decompress_stream_*() infrastructure plus
   the new gzip support from the previous commit. This is primarily
   needed on arm64 where kexec_file_load() only accepts raw Image files.
   On x86_64, bzImage is already the native format and works directly.

- EFI ZBOOT PE images (vmlinuz.efi): detected by "MZ" + "zimg" magic
   at the start of the file. The compressed payload offset, size, and
   compression type are read from the ZBOOT header defined in
   linux/drivers/firmware/efi/libstub/zboot-header.S.

- Unified Kernel Images (UKI): detected as PE files with a .linux
   section via the existing pe_is_uki() infrastructure. The .linux
   section (kernel) and optionally .initrd section are extracted to
   memfds. When a UKI provides an embedded initrd and the boot entry
   doesn't specify one separately, the embedded initrd is used.

The try-first-then-decompress approach means we never decompress
unnecessarily: on x86_64 the first kexec_file_load() call succeeds
immediately with the raw bzImage, and on architectures where the
kernel's kexec handler natively understands PE (like LoongArch with
kexec_efi_ops), ZBOOT/UKI images work without decompression too.

If kexec_file_load() is unavailable (architectures without the syscall)
or all attempts fail, we fall back to forking+execing the kexec binary.
This preserves compatibility on architectures like i386 and mips where
only the older kexec_load() syscall exists and kexec-tools is needed to
handle the complex userspace setup.

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

libc: Add kexec_file_load() syscall wrapper

Allow tabs in UAPI headers in .gitattributes since they are copied
verbatim from the kernel.

compress: rework decompressor_detect() on top of compression_detect_from_magic()

Replace the duplicated magic byte signatures in decompressor_detect()
with a call to the new compression_detect_from_magic() helper and use a
switch statement to initialize the appropriate decompression context.

time-util: encode our assumption that clock_gettime() never can return 0 or USEC_INFINITY

We generally assume that valid times returned by clock_gettime() are > 0
and < USEC_INFINITY. If this wouldn't hold all kinds of things would
break, because we couldn't distuingish our niche values from regular
values anymore.

Let's hence encode our assumptions in C, already to help static
analyzers and LLMs.

Inspired by: https://github.com/systemd/systemd/pull/41601#pullrequestreview-4094645891

udev/scsi-id: various typing refactorings

udev/scsi-id: check for invalid header from kernel buffer

udev/scsi-id: check for invalid chars in various fields received from the kernel

Follow-up for 16325b35fa6ecb25f66534a562583ce3b96d52f3

nss-systemd: fix off-by-one in nss_pack_group_record_shadow()

nss_count_strv() counts trailing NULL pointers in n. The pointer area
then used (n + 1), reserving one slot more than the size check
accounted for.

Drop the + 1 since n already includes the trailing NULLs, unlike the
non-shadow nss_pack_group_record() where n does not.

Fixes: https://github.com/systemd/systemd/issues/41591

More assorted coverity fixes (#41601)

One more round, this time with the help of the claudebot, especially for
spelunking in git blame to find the original commit and writing commit
messages from the list of warnings exported from coverity

Co-developed-by: Claude
[claude@anthropic.com](mailto:claude@anthropic.com)

core: varlink enum for io.systemd.Unit interface (#40972)

Convert string fields to varlink enums in io.systemd.Unit

Following
https://github.com/systemd/systemd/pull/39391#discussion_r2489599449,
convert all configuration setting fields in the io.systemd.Unit varlink
interface from bare SD_VARLINK_STRING to proper enum types, adding type
safety to the IDL.

This converts ~30 fields across ExecContext, CGroupContext, and
UnitContext, adding 25 new varlink enum types.

Weak compatibility breakage (per
https://github.com/systemd/systemd/pull/40972#issuecomment-4222294318):
Varlink enum identifiers cannot contain - or +, so affected values are
underscorified on the wire. For example, "tty-force" becomes tty_force,
"kmsg+console" becomes kmsg_console.

The full list of affected values:
```
  - ExecInputType: tty-force, tty-fail
  - ExecOutputType: kmsg+console, journal+console
  - ProtectHome: read-only
  - CGroupController: bpf-firewall, bpf-devices, bpf-foreign, bpf-socket-bind, bpf-restrict-network-interfaces, bpf-bind-network-interface
  - CollectMode: inactive-or-failed
  - EmergencyAction: exit-force, reboot-force, reboot-immediate, poweroff-force, poweroff-immediate, soft-reboot, soft-reboot-force, kexec-force, halt-force, halt-immediate
  - JobMode: replace-irreversibly, ignore-dependencies, ignore-requirements, restart-dependencies
```

journal: limit decompress_blob() output to DATA_SIZE_MAX

We already have checks in place during compression that limit the data
we compress, so they shouldn't decompress to anything larger than
DATA_SIZE_MAX unless they've been tampered with. Let's make this
explicit and limit all our decompress_blob() calls in journal-handling
code to that limit.

One possible scenario this should prevent is when one tries to open and
verify a journal file that contains a compression bomb in its payload:

$ ls -lh test.journal
-rw-rw-r--+ 1 fsumsal fsumsal 1.2M Apr 12 15:07 test.journal

$ systemd-run --user --wait --pipe -- build-local/journalctl --verify --file=$PWD/test.journal
Running as unit: run-p682422-i4875779.service
000110: Invalid hash (00000000 vs. 11e4948d73bdafdd)
000110: Invalid object contents: Bad message
File corruption detected at /home/fsumsal/repos/@systemd/systemd/test.journal:272 (of 1249896 bytes, 0%).
FAIL: /home/fsumsal/repos/@systemd/systemd/test.journal (Bad message)
          Finished with result: exit-code
Main processes terminated with: code=exited, status=1/FAILURE
               Service runtime: 48.051s
             CPU time consumed: 47.941s
                   Memory peak: 8G (swap: 0B)

Same could be, in theory, possible with just `journalctl --file=`, but
the reproducer would be a bit more complicated (haven't tried it, yet).

Lastly, the change in journal-remote is mostly hardening, as the maximum
input size to decompress_blob() there is mandated by MHD's connection
memory limit (set to JOURNAL_SERVER_MEMORY_MAX which is 128 KiB at the
time of writing), so the possible output size there is already quite
limited (e.g. ~800 - 900 MiB for xz-compressed data).

compress: limit the output to dst_max bytes with LZ4 if set

We already do that with other algorithms, so let's make
decompress_blob_lz4() consistent with the rest.

journal: move the {DATA,ENTRY}_SIZE constants to sd-journal

So we can access them from the code there as well.

coccinelle: add SIZEOF() macro to work-around sizeof(*private)

We have code like `size_t max_size = sizeof(*private)` in three
places. This is evaluated at compile time so its safe to use. However
the new pointer-deref checker in coccinelle is not smart enough to know
this and will flag those as errors. To avoid these false positives
we have some options:
1. Reorder so that we do:
```C
size_t max_size = 0;
assert(private);
max_size = sizeof(*private);
```
2. Use something like `size_t max_size = sizeof(*ASSERT_PTR(private));`
3. Place the assert before the declaration
4. Workaround coccinelle via SIZEOF(*private) that we can then hide
via parsing_hacks.h
5. Fix coccinelle (OCaml, hard)
6. ... somehting I missed?

None of these is very appealing. I went for (4) but happy about
suggestions.