From: Christian Brauner
Date: Wed, 15 Apr 2026 10:33:35 +0000 (+0200)
Subject: systemd-vmspawn: QMP-varlink bridge for VM runtime control (#41449)
X-Git-Url: http://git.ipfire.org/gitweb.cgi?a=commitdiff_plain;h=09bc9ecc08d21c2ca17cd41d12d1dba579afc9dc;p=thirdparty%2Fsystemd.git

systemd-vmspawn: QMP-varlink bridge for VM runtime control (#41449)

systemd-vmspawn currently has zero runtime control over the VMs it
launches. It can kill QEMU (SIGTERM) or SSH in, but it cannot pause,
resume, request a graceful power-off, query status, or react to VM
events. QEMU exposes all of this via its QMP protocol; systemd's native
IPC is varlink. This series bridges the two.

Architecture

```
machinectl → machined (Machine.List → discovers controlAddress)
machinectl → vmspawn varlink socket (direct connection)
               ├── io.systemd.MachineInstance (generic VM control)
               ├── io.systemd.VirtualMachineInstance (placeholder)
               └── io.systemd.QemuMachineInstance (QEMU-specific; AcquireQMP stub)
vmspawn internally → socketpair → QEMU QMP
```

machined stores the controlAddress but never connects to vmspawn.
machinectl discovers the address from Machine.List and connects
directly. Socket mode 0600 is the access-control boundary: the socket
is rooted in vmspawn's $RUNTIME_DIRECTORY, so only the UID that launched
the VM can talk to it.

QMP client library (src/shared/qmp-client.{c,h})

A small non-blocking QMP client modeled on sd-varlink's pump contract:

- Reference-counted QmpClient with an explicit five-state machine:
  HANDSHAKE_INITIAL → HANDSHAKE_GREETING_RECEIVED →
  HANDSHAKE_CAPABILITIES_SENT → RUNNING → DISCONNECTED.
- qmp_client_connect_fd() is non-blocking: it wraps the fd in a
  JsonStream and returns immediately. The greeting + qmp_capabilities
  handshake is driven lazily on the first qmp_client_invoke() or by the
  event loop, whichever comes first, so callers never block during
  connect.
- qmp_client_attach_event() attaches to sd_event for async operation;
  qmp_client_process() performs one pump step (write → dispatch → parse
  → read → disconnect) with the same contract as sd_varlink_process();
  qmp_client_wait() blocks until the next I/O event.
- qmp_client_invoke() sends an async command and fires the registered
  qmp_command_callback_t with (result, error_desc, error, userdata) on
  completion. Synchronous callers drive process()/wait() in a loop until
  qmp_client_is_idle() is true.
- QmpClientArgs bundles the JSON arguments and an FD list for a single
  command; the QMP_CLIENT_ARGS_FD() macro hands one fd to the callee for
  SCM_RIGHTS passing. On partial-stage failure the args list is narrowed
  so the caller's cleanup closes only the untransferred tail.
- Event broadcast to a registered callback via qmp_client_bind_event();
  transport loss surfaces through qmp_client_bind_disconnect().
- qmp_schema_has_member() walks the query-qmp-schema result for optional
  runtime capability probes.

vmspawn device setup via QMP

vmspawn starts QEMU paused (-S), sets up devices via QMP, then resumes
with cont. The entire device plane moves off the legacy INI config path
and onto the bridge.

A new MachineConfig aggregate in vmspawn-qmp.h groups the per-device
info (DriveInfos, NetworkInfo, VirtiofsInfos, VsockInfo) with a single
machine_config_done() cleanup that chains the sub-structure destructors;
each conversion patch populates exactly the field it owns.

What the conversion enables:

- FD-based device passing via add-fd / getfd + SCM_RIGHTS: vmspawn opens
  every image file, TAP, VSOCK, and virtiofs socket itself and hands the
  fd to QEMU. QEMU never needs filesystem access.
- Ephemeral overlays via blockdev-create + async job-concluded
  continuations on anonymous O_TMPFILE / memfd backings, with no named
  overlay files on disk.
- PCIe root-port pre-allocation for q35/virt machine types so
  hotplug-capable slots exist at boot (NVMe, virtio-scsi, etc.).
- io_uring availability probing with automatic fallback to the default
  AIO backend if QEMU's build doesn't support it.

Per-command callbacks call sd_event_exit() on setup failure so vmspawn
shuts down cleanly if any device can't be attached.

machinectl integration

- machinectl pause / resume / poweroff / reboot / terminate go through
  the varlink control socket for VMs.
- D-Bus fallback for containers: poweroff sends SIGRTMIN+4, terminate
  calls the existing TerminateMachine method, so container behavior is
  unchanged.
- Multi-machine parallel dispatch via sd_event for bulk operations
  (machinectl pause vm1 vm2 ...) so one slow VM doesn't serialize the
  rest.
- SubscribeEvents streaming with per-subscriber event-name filters
  (importd Pull-style pattern: initial {ready:true} notify, fan out via
  varlink_many_notifybo(), lazy init: the QMP event pump runs only while
  subscribers exist).

Tests

- Unit test with a mock QMP server covering handshake,
  command/response, events, and EOF.
- Integration test against real QEMU (-machine none) exercising
  handshake + query-qmp-schema (~200 KB reply, validates the buffered
  reader across multiple read()s) and query-status.
- Integration test for the machinectl verbs end-to-end: pause / resume /
  describe / subscribe / terminate.
- Integration test for the multi-drive pipeline and ephemeral overlays
  (blockdev-create async job continuations).
- Stress test: 5 cycles of start → 3× (pause/describe/resume/describe)
  → terminate.

Signed-off-by: Christian Brauner (Amutable)
--- 09bc9ecc08d21c2ca17cd41d12d1dba579afc9dc
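
Illustration of the handshake state machine listed under "QMP client
library". This is a minimal Python sketch under stated assumptions, not
the code in src/shared/qmp-client.c: SketchQmpClient, feed(), invoke(),
outbox and the "Disconnected" error class are hypothetical names
invented here. It assumes one JSON message per line, a greeting object
carrying a "QMP" key, and a "return" reply to qmp_capabilities. Commands
invoked before RUNNING are queued and flushed once the handshake
completes, which only loosely mirrors the lazy handshake described
above; events are ignored.

```python
import json
from enum import Enum, auto


class State(Enum):
    HANDSHAKE_INITIAL = auto()
    HANDSHAKE_GREETING_RECEIVED = auto()
    HANDSHAKE_CAPABILITIES_SENT = auto()
    RUNNING = auto()
    DISCONNECTED = auto()


class SketchQmpClient:
    """Hypothetical, transport-free model of the five-state handshake."""

    def __init__(self):
        self.state = State.HANDSHAKE_INITIAL
        self.outbox = []    # serialized messages a real client would write to the fd
        self.queued = []    # commands invoked before the handshake finished
        self.pending = {}   # command id -> completion callback
        self.next_id = 1

    def _send(self, message):
        self.outbox.append(json.dumps(message))

    def invoke(self, command, callback, arguments=None):
        """Register a command; it is sent immediately only when RUNNING."""
        if self.state is State.DISCONNECTED:
            raise ConnectionError("QMP connection is closed")
        message = {"execute": command, "id": self.next_id}
        if arguments is not None:
            message["arguments"] = arguments
        self.pending[self.next_id] = callback
        self.next_id += 1
        if self.state is State.RUNNING:
            self._send(message)
        else:
            self.queued.append(message)

    def disconnect(self):
        """Enter the terminal state and fail every outstanding command."""
        self.state = State.DISCONNECTED
        self.queued.clear()
        callbacks, self.pending = self.pending, {}
        for callback in callbacks.values():
            # Error object shape follows QMP; the class name is made up here.
            callback(None, {"class": "Disconnected", "desc": "connection lost"})

    def feed(self, line):
        """Advance the state machine with one JSON line from the server."""
        if self.state is State.DISCONNECTED:
            return
        message = json.loads(line)
        if self.state is State.HANDSHAKE_INITIAL:
            if "QMP" not in message:
                self.disconnect()
                return
            self.state = State.HANDSHAKE_GREETING_RECEIVED
            self._send({"execute": "qmp_capabilities"})
            self.state = State.HANDSHAKE_CAPABILITIES_SENT
        elif self.state is State.HANDSHAKE_CAPABILITIES_SENT:
            if "return" not in message:
                self.disconnect()
                return
            self.state = State.RUNNING
            for queued in self.queued:
                self._send(queued)
            self.queued.clear()
        elif self.state is State.RUNNING and "id" in message:
            # Match the reply to its command by the echoed id.
            callback = self.pending.pop(message["id"], None)
            if callback is not None:
                callback(message.get("return"), message.get("error"))
```

Feeding the greeting line and then `{"return": {}}` moves the sketch to
RUNNING and flushes any queued command into outbox; a reply carrying the
matching id then fires that command's callback with (result, error).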