systemd-vmspawn: QMP-varlink bridge for VM runtime control (#41449)
systemd-vmspawn currently has zero runtime control over the VMs it
launches. It can kill QEMU (SIGTERM) or SSH in, but it cannot pause,
resume, request a graceful power-off, query status, or
react to VM events. QEMU exposes all of this via its QMP protocol;
systemd's native IPC is varlink. This series bridges the two.
machined stores the controlAddress but never connects to vmspawn.
machinectl discovers the address from Machine.List and connects
directly. Socket mode 0600 is the access-control boundary —
the socket is rooted in vmspawn's $RUNTIME_DIRECTORY, so only the UID
that launched the VM can talk to it.
QMP client library (src/shared/qmp-client.{c,h})
A small non-blocking QMP client modeled on sd-varlink's pump contract:
- Reference-counted QmpClient with an explicit five-state machine:
HANDSHAKE_INITIAL → HANDSHAKE_GREETING_RECEIVED →
HANDSHAKE_CAPABILITIES_SENT → RUNNING → DISCONNECTED.
- qmp_client_connect_fd() is non-blocking: it wraps the fd in a
JsonStream and returns immediately. The greeting + qmp_capabilities
handshake is driven lazily on the first
qmp_client_invoke() or by the event loop — whichever comes first — so
callers never block during connect.
- qmp_client_attach_event() attaches to sd_event for async operation;
qmp_client_process() performs one pump step (write → dispatch → parse →
read → disconnect) with the same contract as
sd_varlink_process(); qmp_client_wait() blocks until the next I/O event.
- qmp_client_invoke() sends an async command and fires the registered
qmp_command_callback_t with (result, error_desc, error, userdata) on
completion. Synchronous callers drive
process()/wait() in a loop until qmp_client_is_idle() is true.
- QmpClientArgs bundles the JSON arguments and an FD list for a single
command; the QMP_CLIENT_ARGS_FD() macro hands one fd to the callee for
SCM_RIGHTS passing. On partial-stage failure the
args list is narrowed so the caller's cleanup closes only the
untransferred tail.
- Event broadcast to a registered callback via qmp_client_bind_event();
transport loss surfaces through qmp_client_bind_disconnect().
- qmp_schema_has_member() walks the query-qmp-schema result for optional
runtime capability probes.
vmspawn device setup via QMP
vmspawn starts QEMU paused (-S), sets up devices via QMP, then resumes
with cont. The entire device plane moves off the legacy INI config path
and onto the bridge.
A new MachineConfig aggregate in vmspawn-qmp.h groups the per-device
info (DriveInfos, NetworkInfo, VirtiofsInfos, VsockInfo) with a single
machine_config_done() cleanup that chains the
sub-structure destructors; each conversion patch populates exactly the
field it owns.
What the conversion enables:
- FD-based device passing via add-fd / getfd + SCM_RIGHTS — vmspawn
opens every image file, TAP, VSOCK, and virtiofs socket itself and hands
the fd to QEMU. QEMU never needs filesystem
access.
- Ephemeral overlays via blockdev-create + async job-concluded
continuations on anonymous O_TMPFILE / memfd backings — no named overlay
files on disk.
- PCIe root-port pre-allocation for q35/virt machine types so
hotplug-capable slots exist at boot (NVMe, virtio-scsi, etc.).
- io_uring availability probing with automatic fallback to the default
AIO backend if QEMU's build doesn't support it.
Per-command callbacks call sd_event_exit() on setup failure so vmspawn
shuts down cleanly if any device can't be attached.
machinectl integration
- machinectl pause / resume / poweroff / reboot / terminate go through
the varlink control socket for VMs.
- D-Bus fallback for containers: poweroff sends SIGRTMIN+4, terminate
calls the existing TerminateMachine method — unchanged container
behavior.
- Multi-machine parallel dispatch via sd_event for bulk operations
(machinectl pause vm1 vm2 ...) so one slow VM doesn't serialize the
rest.
- SubscribeEvents streaming with per-subscriber event-name filters
(importd Pull-style pattern: initial {ready:true} notify, fan out via
varlink_many_notifybo(), lazy init — QMP event pump
runs only while subscribers exist).
Tests
- Unit test with a mock QMP server covering handshake, command/response,
events, and EOF.
- Integration test against real QEMU (-machine none) exercising
handshake + query-qmp-schema (~200 KB reply, validates the buffered
reader across multiple read()s) and query-status.
- Integration test for the machinectl verbs end-to-end: pause / resume /
describe / subscribe / terminate.
- Integration test for the multi-drive pipeline and ephemeral overlays
(blockdev-create async job continuations).
- Stress test: 5 cycles of start → 3× (pause/describe/resume/describe) →
terminate.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>