vmspawn: multifunction-pack pcie-root-ports on pcie.0
The pre-allocated pcie-root-port block in run_virtual_machine() places
every port directly on pcie.0 with an auto-assigned PCI address. A
minimal VM already costs 4 builtin + 10 hotplug spares = 14 pcie.0
slots, on top of 3 implicit virtio devices (virtio-rng-pci,
virtio-balloon, virtio-serial-pci) for another 3.
pcie.0 has 32 device-numbers; q35 reserves 0x00 (host bridge) and 0x1f
(ICH9 LPC), leaving ~30 auto-assignable slots. TEST-64-UDEV-STORAGE-
nvme_basic pushes 20 '-device nvme' lines through
$SYSTEMD_VMSPAWN_QEMU_EXTRA, which vmspawn does not see — total demand
14 + 3 + 20 = 37 > 30. Bus realization fails after QEMU's chardev has
already emitted the QMP greeting, and the monitor socket POLLHUPs
while we are mid-feature-probe, reported as 'QMP connection dropped
during feature probing'.
Pack the root ports as multifunction devices, 8 per pcie.0 device-
number (QEMU docs/pcie.txt:84, 117-120, 255-258). Function 0 of each
group carries multifunction=on; functions 1-7 ride the same slot via
addr=N.F. Each function remains independently hot-pluggable so
vmspawn's QMP device_add machinery is unaffected. 14 ports collapse to
2 pcie.0 slots; the nvme_basic budget becomes 2 + 3 + 20 = 25.
The chassis/slot properties (used for ACPI hotplug identity) stay as
i+1 — they live in a uint8_t namespace independent of the PCI BDF and
are still unique. Base PCI slot 0x10 sits above the auto-assigned
virtio devices (which land at 0x01-0x03 in config order) and below
the q35 LPC reservation at 0x1f.
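
The packing described above can be sketched roughly as follows. This is
an illustration only, not vmspawn's code: the helper names and the id=
naming are hypothetical. The base slot 0x10, the 8 functions per slot,
multifunction=on on function 0 and chassis/slot = i+1 are taken from
the description above.

```python
BASE_SLOT = 0x10          # first pcie.0 device-number used for root ports
FUNCS_PER_SLOT = 8        # PCI functions 0-7 share one device-number
LAST_SLOT = 0x1E          # 0x1f is occupied by the ICH9 LPC bridge
MAX_PORTS = (LAST_SLOT - BASE_SLOT + 1) * FUNCS_PER_SLOT  # 15 * 8 = 120


def root_port_args(n_ports):
    """Return one '-device' argument string per root port (hypothetical helper)."""
    if n_ports > MAX_PORTS:
        raise ValueError(f"at most {MAX_PORTS} root ports fit below 0x1f")
    args = []
    for i in range(n_ports):
        slot = BASE_SLOT + i // FUNCS_PER_SLOT
        func = i % FUNCS_PER_SLOT
        arg = (f"pcie-root-port,id=port{i},bus=pcie.0,"
               f"chassis={i + 1},slot={i + 1},addr={slot:#x}.{func}")
        if func == 0:
            # Only function 0 of each group carries the multifunction bit.
            arg += ",multifunction=on"
        args.append(arg)
    return args


def pcie0_slots_used(n_ports):
    """Number of pcie.0 device-numbers consumed by n_ports packed ports."""
    return -(-n_ports // FUNCS_PER_SLOT)  # ceiling division
```

With 14 ports this uses device-numbers 0x10 and 0x11 only, which is
where the 2 + 3 + 20 = 25 budget for nvme_basic comes from.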
While here, rebuild the slot-count formula to match what
assign_pcie_ports() actually allocates. The +1 'SCSI controller' term
was bogus — virtio-scsi-pci comes from the hotplug-spares pool via
hotplug_port_owner[] in vmspawn-qmp.c, never from a builtin port (see
the comment in assign_pcie_ports()). The +1 'network' and +1 'vsock'
terms are now conditional on arg_network_stack and use_vsock. Bind
volumes were missing entirely. And the per-drive accounting now
mirrors assign_pcie_ports()'s skip-SCSI behaviour: non-SCSI drives
(root + extras + bind volumes) take one builtin port each, SCSI
drives take none — they share a controller drawn from the hotplug
pool at device-add time. Cap at 120 ports (15 device-numbers × 8) so
we cannot run off the end of the 5-bit PCI device-number space — the
usable range starting at 0x10 ends at 0x1e because ICH9 LPC sits at
0x1f.0 single-function, blocking the rest of that slot for
multifunction packing.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
When verb groups were added, I assumed that the first group would always
be the unnamed group, or in other words, that a VERB_GROUP() line cannot
appear first. This provided an additional check that the verbs haven't
been reordered by the compiler or linker. But that check is weak and we
can do a better check anyway. And this limitation is unexpected, since
we allow that for OPTIONs. The code should all work without an unnamed
group once this assertion is removed.
A follow-up to the AddStorage / RemoveStorage series. ReplaceStorage
swaps the *backing file* of an already-attached storage device on a
running vmspawn-managed VM, leaving the guest-visible device frontend
(virtio-blk, virtio-scsi, nvme, scsi-cd) and every other property of
the device untouched. The intended use is to point an existing disk
at a new image without the guest seeing a hot-unplug/hot-plug cycle.
The signature mirrors AddStorage minus the 'config' field: the
device frontend doesn't change, only the backing behind it. Read-
only / read-write is derived from the new fd's O_ACCMODE; scsi-cd is
forced read-only to match the boot-time policy. S_ISBLK on the new
fd selects host_device vs file driver, matching AddStorage.
The QMP primitive is blockdev-reopen. It cannot change a file /
host_device node's 'filename' so we can't just point the existing
file node at a new fd, but it can swap a format node's 'file' child
to a different existing monitor-owned node by node-name reference
(case 3 in qemu/qapi/block-core.json:5034-5040). The chain is:
add-fd (host fd → new fdset)
blockdev-add (new file node, filename=/dev/fdset/N — fd-only)
remove-fd (release monitor's ref; new file holds the dup)
blockdev-reopen (format node, file = new file node-name)
blockdev-del (old file node; its dup release frees old fdset)
The reopen options must restate every option the original blockdev-
add emitted on the format node — blockdev-reopen resets any
unspecified option to its driver default. The 'file' field is a
node-name string reference, never a path.
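
As a rough sketch of the chain above, the QMP commands could be built
like this. The function name and its parameters are hypothetical, and
the format options in the usage note are only an example, not the exact
set vmspawn emits. The host fd itself travels as ancillary data
alongside add-fd and is not shown.

```python
def replace_chain(fdset_id, new_file_node, old_file_node, format_node,
                  format_options):
    """Build the QMP commands for one backing-file swap, in issue order."""
    # blockdev-reopen must restate every option of the original
    # blockdev-add, so start from the full original option set.
    reopen = dict(format_options)
    reopen["node-name"] = format_node
    reopen["file"] = new_file_node  # node-name reference, never a path
    return [
        {"execute": "add-fd", "arguments": {"fdset-id": fdset_id}},
        {"execute": "blockdev-add",
         # "file" here; a block device would use "host_device" instead.
         "arguments": {"driver": "file", "node-name": new_file_node,
                       "filename": f"/dev/fdset/{fdset_id}"}},
        {"execute": "remove-fd", "arguments": {"fdset-id": fdset_id}},
        {"execute": "blockdev-reopen", "arguments": {"options": [reopen]}},
        {"execute": "blockdev-del",
         "arguments": {"node-name": old_file_node}},
    ]
```

For example, replace_chain(7, "vmspawn-0-file-1", "vmspawn-0-file-0",
"vmspawn-0-storage", {"driver": "raw", "read-only": False, "discard":
"unmap"}) yields five commands whose reopen step references the new
file node by name.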
No new errors and no new IDL types beyond the method itself;
everything is built on the existing NoSuchStorage / StorageImmutable
/ NotConnected / EBUSY vocabulary.
The series is:
vmspawn: split blockdev-add into separate file and format calls
Preparatory refactor. qemu/blockdev.c:3440 only marks the
top-level BDS returned by blockdev-add as monitor-owned;
inline children are NOT, so blockdev-del later rejects them
with "Node X is not owned by the monitor". Split into two
blockdev-add calls so the file node is independently
deletable. DriveInfo gains qmp_file_node_name and a
file_generation counter; the teardown helper deletes format
then file (file-first is rejected as "node used as 'file'
of Y"). The ephemeral path was already structured this way;
only the regular add path changes. Drops the now-unused
qmp_build_blockdev_add_inline().
shared/varlink-io.systemd.MachineInstance: add ReplaceStorage method
IDL only: ReplaceStorage(fileDescriptorIndex, name). No new
errors.
vmspawn: implement io.systemd.MachineInstance.ReplaceStorage
vmspawn_qmp_replace_block_device() entry point, ReplaceCtx
(refcounted, ReplaceCtxStateFlags for partial-state tracking)
and four async callbacks plus an idempotent replace_fail.
file_generation is bumped before issuing blockdev-add so
retries don't collide on node-name.
BLOCK_DEVICE_STATE_REPLACE_PENDING gates concurrent
Replace / Remove on the same drive. On reopen success the
trailing blockdev-del of the old file node fires from the
reopen callback; its failure logs a warning and still replies
success (the swap already committed; the orphan resolves at VM
exit). QMP disconnect mid-replace routes via
qmp_client_fail_pending → replace_fail → NotConnected.
test: integration test for io.systemd.MachineInstance.ReplaceStorage
TEST-87-AUX-UTILS-VM.replace-storage covers happy-path replace,
successive replaces (file_generation rotation), StorageImmutable
rejection on the boot-time drive, NoSuchStorage on unknown
names, InvalidParameter on malformed names, and clean
RemoveStorage after a replace (proves the new file node is
monitor-owned and the teardown order works). Backing files are
passed via 'varlinkctl --push-fd'; no machinectl front-end is
added in this round.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
DHCP option 120 (SIP server) takes a list of addresses or domain
names, and the first byte of the data indicates which type is stored.
Let's extend _addresses() and _domains() to make them support the SIP
server option.
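
For illustration, a decoder for the option payload might look like the
sketch below. It follows the RFC 3361 layout (encoding byte 0 for
RFC 1035 style domain names, 1 for IPv4 addresses). The function name is
hypothetical and this is not the sd-dhcp implementation.

```python
import ipaddress


def parse_sip_servers(data: bytes):
    """Decode a DHCP option 120 payload into (kind, list of strings)."""
    if not data:
        raise ValueError("empty option")
    enc, body = data[0], data[1:]
    if enc == 1:
        # IPv4 addresses, four bytes each.
        if not body or len(body) % 4:
            raise ValueError("address list must be a non-empty multiple of 4")
        return "addresses", [str(ipaddress.IPv4Address(body[i:i + 4]))
                             for i in range(0, len(body), 4)]
    if enc == 0:
        # Length-prefixed labels; a zero-length label ends one name.
        names, labels, i = [], [], 0
        while i < len(body):
            n = body[i]
            i += 1
            if n == 0:
                names.append(".".join(labels))
                labels = []
                continue
            if n > 63 or i + n > len(body):
                raise ValueError("malformed label")
            labels.append(body[i:i + n].decode("ascii"))
            i += n
        if labels:
            raise ValueError("unterminated name")
        return "domains", names
    raise ValueError(f"unknown encoding {enc}")
```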
Frantisek Sumsal [Tue, 12 May 2026 15:09:41 +0000 (17:09 +0200)]
sd-bus: handle non-string keys in dictionaries in JSON dump
JSON only supports string keys in objects, but D-Bus specification is a
bit more lenient and allows dict entries to have any basic type as key.
Let's stringify allowed non-string keys so that we can represent them as
JSON objects.
Relevant snippet from the D-Bus specification:
A DICT_ENTRY works exactly like a struct, but rather than parentheses
it uses curly braces, and it has more restrictions. The restrictions
are: it occurs only as an array element type; it has exactly two
single complete types inside the curly braces; the first single
complete type (the "key") must be a basic type rather than a container
type. Implementations must not accept dict entries outside of arrays,
must not accept dict entries with zero, one, or more than two fields,
and must not accept dict entries with non-basic-typed keys. A dict
entry is always a key-value pair.
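
The idea can be sketched as below. The exact string form sd-bus picks
for each key type is not stated here, so the formatting choices
(decimal numbers, lowercase booleans) are assumptions, as is the
function name.

```python
import json


def dict_entries_to_json(entries):
    """Turn (key, value) dict entries into a JSON object with string keys."""
    obj = {}
    for key, value in entries:
        if isinstance(key, bool):          # check bool before int
            name = "true" if key else "false"
        elif isinstance(key, (int, float, str)):
            name = str(key)
        else:
            raise TypeError("dict entry keys must be basic types")
        obj[name] = value
    return json.dumps(obj)
```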
Yaping Li [Sun, 10 May 2026 14:50:13 +0000 (14:50 +0000)]
logind: zero-initialize dispatch struct in vl_method_release_session()
The local struct passed to sd_varlink_dispatch() was not
zero-initialized. Since sd_json_dispatch_full() does not call handlers
for absent optional fields, p.id could be left indeterminate when
the client omits the Id parameter, leading to use of uninitialized
memory.
report-cgroup: use errno_or_else in one more place
Old gcc is confused about initialization:
In function ‘io_read_send’,
inlined from ‘walk_cgroups’ at ../src/report/report-cgroup.c:288:24:
../src/report/report-cgroup.c:167:21: error: ‘values[0]’ may be used uninitialized [-Werror=maybe-uninitialized]
167 | r = metric_build_send_unsigned(mf + i, link, unit, values[i], /* fields= */ NULL);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I picked the fields that contain useful information about the specific
version/image/variant/experiment/flavour of the system. Also, either
NAME or PRETTY_NAME is included. This one is intended for human readers
to be able to identify the OS version easily.
report: drop MetricsFamilyContext, CGroupContext, CGroupInfo
Previously, we passed around information about the MetricFamily'ies
and the varlink connection in a helper structure. Having a hybrid of
const static and runtime stuff is iffy. Let's simplify things by passing
two separate parameters.
Also, in report-cgroup.c we built a cache of parsed values. This
requires additional storage and introduces complexity, because the
cache has to be populated at the appropriate time.
This cache is not useful: for each cgroup, we generate a list of
metrics, and we have all the information at hand. The only reason
why we'd create the cache and not generate all the relevant replies
at once was that the helper functions called the .generate function
for each MetricFamily separately.
The MetricFamily interface is changed, so that metrics can be
defined without a .generate function. This is understood to mean
that the preceding metric family's .generate function will also
generate this family. This allows us to define related metrics
nicely in a table:
{ METRIC_IO_SYSTEMD_CGROUP_PREFIX "CpuUsage", generate_func },
{ METRIC_IO_SYSTEMD_CGROUP_PREFIX "IOReadBytes", NULL },
{ METRIC_IO_SYSTEMD_CGROUP_PREFIX "IOReadOperations", NULL },
{ METRIC_IO_SYSTEMD_CGROUP_PREFIX "SomethingElse", generate_func2 },
...
When implementing .Describe, we list all the families. When implementing
.List, we only call those with .generate, and we get the same results
as before.
This allows the .generate functions to be simplified: instead of
keeping state, they just spit out all the metrics for a given
object in a tight loop.
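
A toy model of this table convention, with made-up generator functions
and values, might look like this. It only illustrates the Describe/List
split and is not the code in src/report.

```python
def cpu_and_io(obj):
    # One generator emits every family of its group in a single pass.
    yield ("CpuUsage", obj, 1)
    yield ("IOReadBytes", obj, 2)
    yield ("IOReadOperations", obj, 3)


def something_else(obj):
    yield ("SomethingElse", obj, 4)


# None means: covered by the preceding family's generator.
FAMILIES = [
    ("CpuUsage", cpu_and_io),
    ("IOReadBytes", None),
    ("IOReadOperations", None),
    ("SomethingElse", something_else),
]


def describe(families):
    """Describe lists every family, with or without a generator."""
    return [name for name, _ in families]


def list_metrics(families, obj):
    """List only calls the entries that carry a generator."""
    out = []
    for _, generate in families:
        if generate is not None:
            out.extend(generate(obj))
    return out
```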
varlink-io.systemd.MachineInstance,vmspawn: treat AddStorage/RemoveStorage name as opaque
The 'name' field on AddStorage and RemoveStorage was documented as
'<provider>:<volume>' and enforced via machine_storage_name_split() at
the varlink boundary. That form is only the convention machinectl
inherits from the StorageProvider routing path; the API itself only
needs a unique identifier the caller can re-use to detach the binding.
Drop the strict format check, require only a non-empty string, and
update the IDL docs to describe the field as a caller-supplied
identifier with machinectl's convention as a non-normative example.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
vmspawn: reject O_PATH and O_WRONLY fds in AddStorage
An fd opened O_PATH cannot be read, and an O_WRONLY fd cannot serve as
a backing file for a virtual disk image. Reject both at the bind-volume
entry point with -EBADF instead of letting the request proceed to QMP
where QEMU's file backend would fail to read from the fd. The
ReplaceStorage entry point grew the same checks in parallel.
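
The check amounts to something like the following sketch. The helper
name is hypothetical; the real code is C and may inspect the flags
differently.

```python
import errno
import fcntl
import os


def check_backing_fd(fd):
    """Return 0 if fd can back a disk image, -EBADF otherwise."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    if flags & os.O_PATH:
        return -errno.EBADF          # O_PATH fds cannot be read
    if (flags & os.O_ACCMODE) == os.O_WRONLY:
        return -errno.EBADF          # write-only cannot serve reads
    return 0
```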
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
test: integration test for io.systemd.MachineInstance.ReplaceStorage
Modelled on TEST-87-AUX-UTILS-VM.bind-volume.sh. Boots vmspawn with
one boot-time bind-volume, hot-adds a runtime volume via machinectl
bind-volume, then exercises ReplaceStorage:
1. happy-path replace of a runtime drive
2. successive replace (verify file_generation rotation — no
node-name collisions on the second swap)
3. replace of the boot-time drive must fail with StorageImmutable
4. replace of an unknown name must fail with NoSuchStorage
5. invalid name (no provider:volume separator) must fail with
InvalidParameter
6. unbind-volume after replace must succeed — proves the new file
node is monitor-owned and the format-then-file teardown order
in vmspawn_qmp_block_device_teardown() correctly cleans up both
blockdev nodes
Pushes the new backing file via varlinkctl --push-fd; the file is a
plain truncate'd image. Auto-discovered by run_subtests in
TEST-87-AUX-UTILS-VM.sh.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Wire up the runtime hot-swap Varlink method. The signature mirrors
AddStorage minus 'config': the device frontend (virtio-blk,
virtio-scsi, nvme, scsi-cd) doesn't change, only the backing file
behind it. Read-only/read-write may flip based on the new fd's
O_ACCMODE; scsi-cd is forced read-only to match the boot-time policy.
add-fd → on_replace_observe_stage
blockdev-add (new file) → on_replace_blockdev_add_complete
remove-fd (new fdset) → on_replace_observe_stage
blockdev-reopen (format) → on_replace_blockdev_reopen_complete
[commit + fire trailing del]
blockdev-del (old file) → on_replace_old_blockdev_del_complete
The reopen options must be a superset of every option that
qmp_build_blockdev_add_format() may emit, otherwise reopen rejects
'Cannot reset option X to default'. The 'file' field is a string
reference to the new file node — case 3 of the schema in
qemu/qapi/block-core.json:5034-5040 ("the current child is replaced
with that other node"). The format node's qmp_node_name is preserved
so the device frontend's drive=<X> binding does not move.
ReplaceCtx tracks the per-call state with a refcount mirroring the
add-stage drive-info pattern. On any pre-commit failure replace_fail
tears down whatever new-side state we created on the wire and replies
on drive->link via reply_qmp_error (disconnect → NotConnected). On
post-commit del failure we log a warning, leak the orphan, and reply
success — the swap itself succeeded and the leak resolves at VM exit.
file_generation is bumped before issuing blockdev-add so failed
attempts cannot collide on node-name when the user retries.
Errors:
NoSuchStorage - drive not in the registry
StorageImmutable - drive lacks QMP_DRIVE_REMOVABLE (boot-time)
EBUSY - add still pending or another replace/remove in flight
NotConnected - QMP transport disconnect during the chain
EIO - QEMU rejected blockdev-reopen
Also gates RemoveStorage on REPLACE_PENDING so a device_del cannot
race a mid-flight blockdev-reopen on the same drive.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Define the IDL for io.systemd.MachineInstance.ReplaceStorage, a
runtime hot-swap of an already-attached storage volume's backing
file. The signature mirrors AddStorage minus the 'config' field
because the device frontend (virtio-blk, virtio-scsi, nvme, scsi-cd)
does not change — only the backing file behind it.
The implementation lives in vmspawn (next commit) and uses QMP
blockdev-reopen to swap the file child of the existing format node.
The reused error vocabulary (NoSuchStorage, StorageImmutable,
NotConnected, plus the generic errno path) covers every failure
mode; no new errors are added.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
vmspawn: split blockdev-add into separate file and format calls
The current vmspawn_qmp_add_block_device() emits a single blockdev-add
that combines the format-level node ("vmspawn-N-storage") with an
inline file child. QEMU's qmp_blockdev_add() only marks the top-level
returned BDS as monitor-owned (qemu/blockdev.c:3440); inline children
are NOT, so qmp_blockdev_del() rejects them with "Node X is not owned
by the monitor" (qemu/blockdev.c:3513-3517).
To prepare for ReplaceStorage — which needs to swap the file child of
an existing format node via blockdev-reopen, and then blockdev-del the
old file node — make the file node monitor-owned by issuing it as its
own blockdev-add call. The 4-stage add chain becomes 5 stages:
DriveInfo gains qmp_file_node_name ("vmspawn-N-file-G", G a generation
counter bumped on every replace), file_generation, and a stashed
fdset_id so future ReplaceStorage can target both for cleanup.
vmspawn_qmp_block_device_teardown() now deletes both nodes in order —
format first, then file — because the format holds a strong reference
to its file child and a file-first del is rejected with "Node X is
busy: node is used as 'file' of Y".
Folds bridge->features VMSPAWN_QMP_FEATURE_IO_URING into the file
node's flags so the new path inherits io_uring just like the old
inline form did. The format-level options (read-only, discard,
discard-no-unref) are unchanged.
The ephemeral path is structurally already separate file+format with
monitor-owned children; no behavioural change there beyond the
on_add_blockdev_stage → on_add_format_node_stage rename.
Drops the now-unused qmp_build_blockdev_add_inline() helper.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
loginctl: move options and verbs to match order in --help
First, "output modifier" options --no-pager/--no-legend/--no-ask-password are
moved to the end next to --output and --json. I think it makes sense to group
them. Then the implementing code is reordered to match the order in --help.
Daan De Meyer [Tue, 12 May 2026 20:24:51 +0000 (22:24 +0200)]
repart: Add BtrfsReplace= (#41109)
This is a series of commits which adds a feature needed by GNOME OS'
installer. This was shown during an All Systems Go 2025 talk:
https://cfp.all-systems-go.io/all-systems-go-2025/talk/QRJVL3/
To sum up this PR: it first changes systemd-repart to use BLKPG
partitions instead of loop devices when possible. We then need to always
rescan the partitions on failure, to try to remove the partitions we
created. We allow encrypted partitions to stay activated, with a chosen
name. And we add a new partition configuration `BtrfsReplace=`.
Note that "replace" comes from the command `btrfs replace`. But in the
case of systemd-repart, maybe "inplace" or "move" would make more sense.
I am open to suggestions.
If it is better I can split this into several PRs.
The commits:
## repart: Reuse the backing fd for fdisk
Because fdisk_assign_device tries to open block devices with O_EXCL,
when it does it blocks cryptsetup from using partition block devices for
the same disk.
Since we already have a file descriptor for the device, we can just
share it and use fdisk_assign_device_by_fd instead.
## repart: Use blkpg partitions instead of loop devices when possible
We will want to allow future features to keep some devices mounted or
active. So in order to avoid leaving a mess of many loop devices, we can
just use the partition block device directly.
## repart: Rescan disk on failure if we create blkpg partitions on the fly
Since we did not write the partition table, then the created partitions
should get removed on error.
## repart: Allow keeping luks2 volumes opened
## repart: Add BtrfsReplace=
BtrfsReplace=/mntpnt will move the btrfs filesystem from mount point to
the partition created. After moving, it will resize to take the whole
partition.
This is useful for OS installers that move a live system into a disk and
do not require a reboot.
## repart: Add VolumeName=
When a luks2 device mapper is to be kept alive after execution of
systemd-cryptsetup, the name of the volume will be taken from this
value.
Daan De Meyer [Tue, 12 May 2026 13:03:49 +0000 (13:03 +0000)]
vmspawn: Prefer systemd-journal-remote from $PATH
$PATH might point to a systemd checkout containing
a newer version of systemd-journal-remote which we
should use, hence prefer an executable from $PATH
over the one from /usr/lib/systemd.
Luca Boccassi [Tue, 12 May 2026 11:40:54 +0000 (12:40 +0100)]
test: make TEST-75-RESOLVED robust against journald metadata race
Even after switching the wait loop to a polling `journalctl --grep`, the
test still fails intermittently because the very first messages emitted by
the freshly-spawned systemd-networkd-wait-online process can carry stale
journald metadata. journald associates `_SYSTEMD_UNIT=` (and friends) with
each entry by reading `/proc/$pid/cgroup` of the originating PID; if those
messages are produced before journald notices the cgroup migration into the
new service, they get tagged with `_SYSTEMD_UNIT=init.scope`. The
`-u $unit` filter then fails to match them.
Capture a journal cursor before launching the unit, and grep using
`--after-cursor=` plus `SYSLOG_IDENTIFIER=systemd-networkd-wait-online`
instead of `-u $unit`. SYSLOG_IDENTIFIER is set by the program itself, so
it's not subject to the cgroup-discovery race. The cursor bounds the search
to entries produced by this invocation, so prior wait-online runs in
earlier testcases don't interfere.
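
As a sketch, the resulting query could be assembled like this. The
helper name and its parameters are hypothetical; the journalctl options
and the match are the ones named above.

```python
def wait_online_query(cursor, pattern):
    """Build a journalctl argv bounded by a cursor instead of '-u $unit'."""
    return [
        "journalctl", "--quiet",
        f"--after-cursor={cursor}",
        # Set by the program itself, so not affected by the cgroup race.
        "SYSLOG_IDENTIFIER=systemd-networkd-wait-online",
        f"--grep={pattern}",
    ]
```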
Logs from the failing run showing the messages exist but are tagged with
the wrong unit:
_SYSTEMD_CGROUP=/init.scope
_SYSTEMD_UNIT=init.scope
_EXE=/usr/lib/systemd/systemd-executor
_CMDLINE=/usr/lib/systemd/systemd-executor --deserialize 68 ...
SYSLOG_IDENTIFIER=systemd-networkd-wait-online
MESSAGE=dns0: No DNS server is accessible.
Daan De Meyer [Tue, 12 May 2026 10:26:37 +0000 (12:26 +0200)]
json-stream: tolerate truncated SCM_RIGHTS on inbound messages
When an LSM (e.g. SELinux) denies an fd transfer or the receiver hits
RLIMIT_NOFILE, the kernel drops the fd(s) from the SCM_RIGHTS cmsg and
sets MSG_CTRUNC on the recvmsg(). recvmsg_safe() turns that into
-ECHRNG, which causes json_stream_read() to discard the data bytes
that were nevertheless received and the varlink server to silently
tear down the connection — leaving the caller waiting for a reply
that never comes.
Inline the recvmsg() call instead and, on MSG_CTRUNC, drop the partial
fds but keep the message data. The method handler will surface a clean
-ENXIO when it tries to peek the missing fd, which sd-varlink wraps as
io.systemd.System for the peer, instead of a hang. This matches the
recent sd-bus fix in 6c8de404c9 ('sd-bus: allow receiving messages with
MSG_CTRUNC set').
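
The new behaviour is roughly the following, shown on values as they
would come back from a recvmsg() call. The helper is a hypothetical
illustration, not the json-stream code.

```python
import os
import socket


def accept_message(data, fds, msg_flags):
    """Keep the payload; on MSG_CTRUNC close and drop any partial fds."""
    if msg_flags & socket.MSG_CTRUNC:
        for fd in fds:
            os.close(fd)
        fds = []
    return data, fds
```

The handler that later looks for the missing fd is then the one that
reports the error, instead of the connection being torn down silently.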
Valentin David [Thu, 12 Mar 2026 22:14:52 +0000 (23:14 +0100)]
repart: Add BlockDeviceReplace=
BlockDeviceReplace=/mntpnt will move the btrfs filesystem from mount point to
the partition created. After moving, it will resize to take the whole
partition.
This is useful for OS installers that move a live system into a disk and
do not require a reboot.
Valentin David [Thu, 12 Mar 2026 22:14:34 +0000 (23:14 +0100)]
repart: Use blkpg partitions instead of loop devices when possible
We will want to allow future features to keep some devices mounted or
active. So in order to avoid leaving a mess of many loop devices, we can
just use the partition block device directly.
Valentin David [Thu, 12 Mar 2026 22:14:23 +0000 (23:14 +0100)]
repart: Reuse the backing fd for fdisk
Because fdisk_assign_device tries to open block devices with O_EXCL, when it
does it blocks cryptsetup from using partition block devices for the same
disk.
Since we already have a file descriptor for the device, we can just share it
and use fdisk_assign_device_by_fd instead.
This requires at least libfdisk 2.35 (part of util-linux) which was
released in 2020.
core: when skipping state deserializing units, also skip job subsections (#41957)
If a unit has active jobs, when it gets serialized there are job
subsections, each with their own empty line marker. The skipping
function ignores this and skips until the marker, but then leaves
the job in place, breaking deserialization.
Consume jobs subsections too.
This shows up now that there's TEST-07-PID1.alias-corruption,
which occasionally fails when the aliased unit happens to
still have a job when the reexec happens.
Implement Path/Scope/Swap/Timer Context/Runtime for `io.systemd.Unit.List` (#41980)
The PR implements the following objects + tests for
io.systemd.Unit.List:
* PathContext
* PathRuntime
* ScopeContext
* ScopeRuntime
* SwapContext
* SwapRuntime
* TimerContext
* TimerRuntime
It's a continuation of the following PRs:
* https://github.com/systemd/systemd/pull/37432
* https://github.com/systemd/systemd/pull/37646
* https://github.com/systemd/systemd/pull/38032
* https://github.com/systemd/systemd/pull/38212
* https://github.com/systemd/systemd/pull/39391
Daan De Meyer [Tue, 12 May 2026 11:19:18 +0000 (13:19 +0200)]
btrfs-util: clear RDONLY flag on subvolume before destroy ioctl
Without CAP_SYS_ADMIN, btrfs_ioctl_snap_destroy() runs an
inode_permission(MAY_WRITE) check against the target subvolume root, which
btrfs_permission() rejects with EROFS for a read-only subvolume. As a
result, unprivileged removal of a read-only subvolume fails — both via
btrfs_subvol_remove_at() directly and via the recursive cleanup path used
by rm_rf_subvolume_and_freep(), which propagates the EROFS up.
Detect EROFS after the destroy ioctl, clear the RDONLY flag (only inode
ownership is required for BTRFS_IOC_SUBVOL_SETFLAGS), and retry once.
While at it, fix the surrounding comments: BTRFS_IOC_SNAP_DESTROY drops the
entire subvolume tree, so regular files inside are irrelevant; ENOTEMPTY
from the ioctl indicates nested subvolumes (BTRFS_ROOT_REF_KEY entries) via
may_destroy_subvol(), not non-empty contents.
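
The retry logic reads roughly like this sketch, with the two ioctls
passed in as callables. The function and parameter names are
hypothetical stand-ins for the destroy and set-flags ioctl wrappers.

```python
import errno


def destroy_subvolume(destroy, clear_rdonly):
    """Try to destroy; on EROFS clear the read-only flag and retry once."""
    try:
        destroy()
    except OSError as e:
        if e.errno != errno.EROFS:
            raise
        clear_rdonly()
        destroy()  # a second failure propagates to the caller
```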
journalctl: move handling of --smart-relinquish-var to action logic
The help strings for --smart-relinquish-var and --relinquish-var
were in reversed order because of the _fallthrough_.
We would resolve the conditions for "smart relinquish" immediately
in parse_argv() and call 'return 0' if the conditions were wrong,
terminating option parsing and the program. It seems nicer to delay
action until later. This makes the logic flow more standard. This
also allows the option parsing cases to be exchanged, fixing the
issue with --help.
Two namespaces are used: "journalctl" and "journalctl-varlink". Help for
--user/--system in the latter is added, even though it is not used yet.
I think it'll be good to have this for introspection.
The four FSS-related options (--interval, --verify-key, --force,
--setup-keys) unfortunately each gain an inline #if HAVE_GCRYPT / #else;
the EOPNOTSUPP fallback is duplicated four times.
The metavar for --identifier/--exclude-identifier is changed to "ID"
to make the layout nicer. (And because that seems to make more sense.)
Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>
We have different help strings for --user/--system in different places, so this
only covers a subset of --system/--user instances. But this particular help
seems to be the most widely used.
(In a few cases, the help string is fixed: it should be "system mode", not
"per-system mode".)
journalctl: reorder parse_argv() cases to match --help
Pure reordering. ARG_SMART_RELINQUISH_VAR is kept immediately before
ARG_RELINQUISH_VAR because of the existing _fallthrough_; that's the
only deviation from strict --help order.
Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>
Yu Watanabe [Sun, 22 Mar 2026 08:00:33 +0000 (17:00 +0900)]
dhcp: use TLV object to manage extra and vendor options
Note, previously we replaced an existing option with a new one that has
the same option code. But a DHCP message can have multiple options with
the same option code. Hence, this makes the conf parser append the new
option instead of replacing the old one.
sd-dhcp-protocol: rename DHCP option 43, 124, and 125
There are four DHCP options with confusing names:
Option 43: Vendor-Specific Information
Option 60: Vendor Class Identifier
Option 124: Vendor-Identifying Vendor Class
Option 125: Vendor-Identifying Vendor-Specific Information
Let's use their full names for their corresponding enums.
Daan De Meyer [Mon, 11 May 2026 19:58:24 +0000 (21:58 +0200)]
btrfs-util: Make nested subvolume operations work unpriv
BTRFS_IOC_SEARCH is only available to root in the
initial userns. This means we fail to recursively
snapshot even if a subvolume has no nested subvolumes
at the moment.
Let's fix this by using the newer btrfs ioctls which
do work even if we don't have CAP_SYS_ADMIN in the initial
userns.
Artem Proskurnev [Tue, 12 May 2026 08:07:39 +0000 (11:07 +0300)]
hwdb/keyboard: Map f21 key on Wareus B15
Addition to PR https://github.com/systemd/systemd/pull/41181
Plasma-workspace OSD notifications about turning the touchpad on
and off are triggered by f21. With this mapping in place, KDE shows a
notification on this laptop when the touchpad on/off key is pressed.
Fix dmesg:
atkbd serio0: Unknown key pressed (translated set 2, code 0xc1 on isa0060/serio0).
firstboot,sysinstall,hostnamed: always show FANCY_NAME=
This makes sure that whenever we want to show the OS name we can show
the fancy name. Thus this moves the escaping/validation of the fancy
name out of hostnamed into generic code, and then makes use of it in
sysinstall,firstboot,prompt-util.
Daan De Meyer [Mon, 11 May 2026 19:58:24 +0000 (21:58 +0200)]
mkosi: Drop CPUs= limit
Limiting VMs to 2 cpus was cargo culting without any
actual data that this benefits performance. The host OS
has a scheduler, let's make use of it and give the VM access
to all the CPUs. This doesn't mean they become inaccessible to
the host, it just means the VM gets as many virtual CPUs as the
host has CPU cores (threads). How they get scheduled is still up
to the host OS.
units: pull in basic.target rather than sysinit.target from system-install.target
Many of our services are nowadays implemented via socket activation, and
hence require sockets.target to be active to be accessible. One of them
is mute-console.socket, which we typically want to use from
systemd-firstboot.service, systemd-sysinstall.service and other related
services. Hence let's pull in basic.target rather than sysinit.target
from system-install.target since it pulls sockets.target in too.
Effectively, this doesn't change much except for pulling in a bunch more
sockets, and frankly going for sysinit.target was really a bug to begin
with.
Daan De Meyer [Mon, 11 May 2026 13:03:49 +0000 (13:03 +0000)]
boot,vconsole: Propagate UEFI HII keyboard layout to the OS
UEFI firmware can report the currently-active keyboard layout via
EFI_HII_DATABASE_PROTOCOL.GetKeyboardLayout(). The layout descriptor
includes an RFC 4646 / BCP 47 language tag (e.g. "en-US"). Query this
from sd-boot/sd-stub and write it to a new LoaderKeyboardLayout EFI
variable, advertised through a new EFI_LOADER_FEATURE_KEYBOARD_LAYOUT
feature bit.
On the OS side, systemd-vconsole-setup reads the variable as a
lowest-priority fallback for the console keymap. To map the BCP 47
tag to a vconsole keymap we extend /usr/share/systemd/kbd-model-map
with an optional sixth column listing the comma-separated BCP 47 tags
each row covers; a new find_vconsole_keymap_for_bcp47() helper walks
the file, preferring an exact tag match and otherwise falling back to
the row whose tag matches the input's primary subtag. Credentials,
/etc/vconsole.conf, and vconsole.keymap= on the kernel command line
continue to take precedence.
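
The lookup order can be sketched as follows. The function name, the
case-insensitive comparison and the "first row wins" fallback are
assumptions made for illustration; rows stand for (keymap, tags) pairs
already parsed from kbd-model-map.

```python
def find_keymap_for_bcp47(rows, tag):
    """Prefer an exact tag match, else the first row sharing the primary subtag."""
    tag = tag.lower()
    primary = tag.split("-", 1)[0]
    fallback = None
    for keymap, tags in rows:
        lowered = [t.lower() for t in tags]
        if tag in lowered:
            return keymap
        if fallback is None and any(
                t.split("-", 1)[0] == primary for t in lowered):
            fallback = keymap
    return fallback
```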
bootctl status surfaces the new variable, printing the language tag
or "n/a (not reported by firmware)" when sd-boot advertises the
feature but the firmware HII database didn't expose a layout (common
on QEMU without a USB keyboard, since EDK2's PS/2 driver does not
register an HII keyboard layout).