systemd-vmspawn currently has zero runtime control over the VMs it
launches. It can kill QEMU (SIGTERM) or SSH in, but it cannot pause,
resume, request a graceful power-off, query status, or
react to VM events. QEMU already exposes all of this via QMP, its
JSON-based machine protocol;
systemd's native IPC is varlink. This series bridges the two.
Architecture
```
machinectl → machined (Machine.List → discovers controlAddress)
machinectl → vmspawn varlink socket (direct connection)
├── io.systemd.MachineInstance (generic VM control)
├── io.systemd.VirtualMachineInstance (placeholder)
└── io.systemd.QemuMachineInstance (QEMU-specific; AcquireQMP stub)
vmspawn internally → socketpair → QEMU QMP
```
machined stores the controlAddress but never connects to vmspawn.
machinectl discovers the address from Machine.List and connects
directly. Socket mode 0600 is the access-control boundary —
the socket is rooted in vmspawn's $RUNTIME_DIRECTORY, so only the UID
that launched the VM can talk to it.
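For context, a QMP session as it appears on the wire (newline-delimited
JSON; QEMU greets first and rejects everything except qmp_capabilities
until the negotiation completes; version fields and the follow-up
command below are illustrative):

```json
{"QMP": {"version": {"qemu": {"major": 9, "minor": 0, "micro": 0}}, "capabilities": ["oob"]}}
{"execute": "qmp_capabilities"}
{"return": {}}
{"execute": "query-status"}
{"return": {"status": "prelaunch", "running": false}}
```

Since vmspawn launches QEMU with -S, query-status reports "prelaunch"
until cont is issued.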
QMP client library (src/shared/qmp-client.{c,h})
A small non-blocking QMP client modeled on sd-varlink's pump contract:
- Reference-counted QmpClient with an explicit five-state machine:
HANDSHAKE_INITIAL → HANDSHAKE_GREETING_RECEIVED →
HANDSHAKE_CAPABILITIES_SENT → RUNNING → DISCONNECTED.
- qmp_client_connect_fd() is non-blocking: it wraps the fd in a
JsonStream and returns immediately. The greeting + qmp_capabilities
handshake is driven lazily on the first
qmp_client_invoke() or by the event loop — whichever comes first — so
callers never block during connect.
- qmp_client_attach_event() attaches to sd_event for async operation;
qmp_client_process() performs one pump step (write → dispatch → parse →
read → disconnect) with the same contract as
sd_varlink_process(); qmp_client_wait() blocks until the next I/O event.
- qmp_client_invoke() sends an async command and fires the registered
qmp_command_callback_t with (result, error_desc, error, userdata) on
completion. Synchronous callers drive
process()/wait() in a loop until qmp_client_is_idle() is true.
- QmpClientArgs bundles the JSON arguments and an FD list for a single
command; the QMP_CLIENT_ARGS_FD() macro hands one fd to the callee for
SCM_RIGHTS passing. On partial-stage failure the
args list is narrowed so the caller's cleanup closes only the
untransferred tail.
- Event broadcast to a registered callback via qmp_client_bind_event();
transport loss surfaces through qmp_client_bind_disconnect().
- qmp_schema_has_member() walks the query-qmp-schema result for optional
runtime capability probes.
vmspawn device setup via QMP
vmspawn starts QEMU paused (-S), sets up devices via QMP, then resumes
the guest with the "cont" command. The entire device plane moves off
the legacy INI config path and onto the QMP bridge.
A new MachineConfig aggregate in vmspawn-qmp.h groups the per-device
info (DriveInfos, NetworkInfo, VirtiofsInfos, VsockInfo) with a single
machine_config_done() cleanup that chains the
sub-structure destructors; each conversion patch populates exactly the
field it owns.
What the conversion enables:
- FD-based device passing via add-fd / getfd + SCM_RIGHTS — vmspawn
opens every image file, TAP, VSOCK, and virtiofs socket itself and hands
the fd to QEMU. QEMU never needs filesystem
access.
- Ephemeral overlays via blockdev-create + async job-concluded
continuations on anonymous O_TMPFILE / memfd backings — no named overlay
files on disk.
- PCIe root-port pre-allocation for q35/virt machine types so
hotplug-capable slots exist at boot (NVMe, virtio-scsi, etc.).
- io_uring availability probing with automatic fallback to the default
AIO backend if QEMU's build doesn't support it.
Per-command callbacks call sd_event_exit() on setup failure so vmspawn
shuts down cleanly if any device can't be attached.
machinectl integration
- machinectl pause / resume / poweroff / reboot / terminate go through
the varlink control socket for VMs.
- D-Bus fallback for containers: poweroff sends SIGRTMIN+4, terminate
calls the existing TerminateMachine method — unchanged container
behavior.
- Multi-machine parallel dispatch via sd_event for bulk operations
(machinectl pause vm1 vm2 ...) so one slow VM doesn't serialize the
rest.
- SubscribeEvents streaming with per-subscriber event-name filters
(importd Pull-style pattern: initial {ready:true} notify, fan out via
varlink_many_notifybo(), lazy init — QMP event pump
runs only while subscribers exist).
Tests
- Unit test with a mock QMP server covering handshake, command/response,
events, and EOF.
- Integration test against real QEMU (-machine none) exercising
handshake + query-qmp-schema (~200 KB reply, validates the buffered
reader across multiple read()s) and query-status.
- Integration test for the machinectl verbs end-to-end: pause / resume /
describe / subscribe / terminate.
- Integration test for the multi-drive pipeline and ephemeral overlays
(blockdev-create async job continuations).
- Stress test: 5 cycles of start → 3× (pause/describe/resume/describe) →
terminate.
Signed-off-by: Christian Brauner <brauner@kernel.org>