claude-review: improve review quality for large PRs
Several issues were identified from analyzing logs of a large (52-commit) PR
review:
- Claude was batching multiple commits into a single review agent instead of
one per worktree. Strengthen the prompt to explicitly prohibit grouping.
- Claude was reading pr-context.json and commit messages before spawning
agents despite instructions not to, wasting time. Tighten the pre-spawn
rules to only allow listing worktrees/ and reading review-schema.json.
- Subagents were spawned with model "sonnet" instead of "opus". Add explicit
instruction to use opus.
- After agents returned, Claude spent 9 minutes re-verifying findings with
bash/grep/sed commands, duplicating the agents' work. Add instruction to
trust subagent findings and only read pr-context.json in phase 2.
- Subagents returned markdown-wrapped JSON instead of raw JSON arrays. Add
instruction requiring raw JSON output only.
- Each subagent was independently reading review-schema.json. Instead have
the main agent read it once and paste it into each subagent prompt.
- The "drop low-confidence findings" instruction was being used to justify
dropping findings that Claude itself acknowledged as valid ("solid cleanup
suggestions", "reasonable consistency improvement"). Remove the instruction.
- Simplify the deduplication instructions
- Stop adding the severity to the body in the post-processing job, as Claude
is also adding it, so the two end up duplicated.
azureuser [Tue, 3 Mar 2026 08:41:45 +0000 (08:41 +0000)]
resolved: skip cache flush on server switch/re-probe when StaleRetentionSec is set
manager_set_dns_server() and dns_server_flush_cache() call dns_cache_flush()
unconditionally, wiping the entire cache even when StaleRetentionSec is
configured. This defeats serve-stale by discarding cached records that should
remain available during server switches and feature-level re-probes.
The original serve-stale commit (5ed91481ab) added a stale_retention_usec
guard to link_set_dns_server(), and a later commit (7928c0e0a1) added the
same guard to dns_delegate_set_dns_server(), but these two call sites in
resolved-dns-server.c were missed.
This is particularly visible with DNSOverTLS, where TLS handshake failures
trigger frequent feature-level downgrades and re-probes via
dns_server_flush_cache(), flushing the cache each time.
Add the same stale_retention_usec guard to both call sites so that cache
entries are allowed to expire naturally via dns_cache_prune() when
serve-stale is enabled.
Fixes: #40781
This commit was prepared with assistance from an AI coding agent (GitHub
Copilot). All changes have been reviewed for correctness and adherence to the
systemd coding style.
With the old version there was a potential connection count leak if
either of the two hashmap operations in count_connection() failed. In
that case we'd return from sd_varlink_server_add_connection_pair()
_before_ attaching the sd_varlink_server object to an sd_varlink object,
and since varlink_detach_server() is the only place where the connection
counter is decremented (called through sd_varlink_close() in various
error paths later _if_ the "server" object is not null, i.e. attached to
the sd_varlink object) we'd "leak" a connection every time this
happened. However, the potential of abusing this is very theoretical,
as one would need to hit OOM every time either of the hashmap operations
was executed for a while before exhausting the connection limit.
Let's just increment the connection counter after any potential error
path, so we don't have to deal with potential rollbacks.
Milan Kyselica [Thu, 9 Apr 2026 17:45:19 +0000 (19:45 +0200)]
udev: fix bounds check in dev_if_packed_info()
The check compared bLength against (size - sizeof(descriptor)), which
is an absolute limit unrelated to the current buffer position. Since
bLength is uint8_t (max 255), this can never exceed size - 9 for any
realistic input, making the check dead code.
Use (size - pos) instead so the check actually catches descriptors
that extend past the end of the read data.
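For illustration, a minimal sketch of the corrected check, assuming a buffer
of length-prefixed descriptors. The helper name and the loop around it are
hypothetical and not the actual udev code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count the descriptors in buf[0..size) that fit
 * entirely inside the buffer. Each descriptor starts with its own length
 * byte (bLength). */
static size_t count_complete_descriptors(const uint8_t *buf, size_t size) {
        size_t pos = 0, n = 0;

        while (pos < size) {
                uint8_t len = buf[pos];

                if (len == 0)           /* a zero length would loop forever */
                        break;
                if (len > size - pos)   /* compare against the bytes remaining */
                        break;

                pos += len;
                n++;
        }

        return n;
}
```

Comparing against (size - pos) rather than a fixed offset from the start is
what lets the check fire for a descriptor near the end of the buffer.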
Daan De Meyer [Sat, 28 Mar 2026 23:21:18 +0000 (23:21 +0000)]
compress: consolidate all compression into compress.c with dlopen
Move the push-based streaming compression API from import-compress.c
into compress.c and delete import-compress.c/h. This consolidates all
compression code in one place and makes all compression libraries
(liblzma, liblz4, libzstd, libz, libbz2) runtime-loaded via dlopen
instead of directly linked.
Introduce opaque Compressor/Decompressor types backed by a heap-
allocated struct defined only in compress.c, keeping all third-party
library headers out of compress.h.
Rewrite the per-codec fd-to-fd stream functions as thin wrappers around
the push API via generic compress_stream()/decompress_stream() taking a
Compression type parameter. Integrate LZ4 into this framework using the
LZ4 Frame API, eliminating all LZ4 special-casing.
Extend the Compression enum with COMPRESSION_GZIP and COMPRESSION_BZIP2
and add the corresponding blob, startswith, and stream functions for
both.
Rename the ImportCompress types and functions: ImportCompressType becomes
the existing Compression enum, ImportCompress becomes Compressor (with
Decompressor typedef), and all import_compress_*/import_uncompress_*
become compressor_*/decompressor_*. Rename dlopen_lzma() to dlopen_xz()
for consistency. Make compression_to_string() return lowercase by
default.
Add INT_MAX/UINT_MAX overflow checks for LZ4, zlib, and bzip2 blob
functions where the codec API uses narrower integer types than our
uint64_t parameters.
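A minimal sketch of such a narrowing guard, assuming a codec that takes an
int length. The helper name and the -E2BIG error code are illustrative
choices, not necessarily what compress.c uses:

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical guard: narrow a uint64_t size to the int a codec API
 * expects, refusing values that would not fit. */
static int size_to_int_checked(uint64_t size, int *ret) {
        if (size > (uint64_t) INT_MAX)
                return -E2BIG;  /* error code chosen for the sketch only */

        *ret = (int) size;
        return 0;
}
```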
Migrate test-compress.c and test-compress-benchmark.c to the TEST()
macro framework, new assertion macros, and codec-generic loops instead
of per-codec duplication.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
Revert "mkosi: Mark minimal images as Incremental=relaxed"
The setting has fundamental flaws that can't be easily fixed
(see https://github.com/systemd/mkosi/pull/4273) so revert its
use as we're dropping it in systemd. Image builds will take a bit
longer again until I figure out a proper fix for this.
vconsole-setup: skip setfont(8) when the console driver lacks font support
Don't run setfont(8) on consoles that don't support
fonts. systemd-vconsole-setup neither fails nor reports errors on such consoles
unlike setfont(8) which emits the following error [1]:
systemd-vconsole-setup[169]: setfont: ERROR kdfontop.c:183 put_font_kdfontop: Unable to load such font with such kernel version
The check already existed in setup_remaining_vcs() but it was performed too
late.
Michael Vogt [Tue, 7 Apr 2026 15:54:28 +0000 (17:54 +0200)]
sd-varlink: use MSG_PEEK for protocol_upgrade connections
When there is a potential protocol upgrade we need to be careful that
we do not read beyond our json message as the custom protocol may be
anything. This was achieved via a byte-by-byte read. This is of course
very inefficient. So this commit moves to use MSG_PEEK to find the
boundary of the json message instead. This makes the performance hit
a lot smaller.
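A rough sketch of the MSG_PEEK idea, assuming each JSON message is
terminated by a NUL byte as in the varlink wire format. The function name
is hypothetical and the handling of incomplete messages is simplified
compared to what sd-varlink has to do:

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read one NUL-terminated message from a socket
 * without consuming any byte that follows the terminator. Returns the
 * number of bytes consumed (including the NUL), 0 on EOF or if no
 * complete message is queued yet, and -1 on error. */
static ssize_t read_one_message(int fd, char *buf, size_t size) {
        ssize_t n;
        char *end;

        /* Look at the queued data without removing it from the socket. */
        n = recv(fd, buf, size, MSG_PEEK);
        if (n <= 0)
                return n;

        end = memchr(buf, 0, (size_t) n);
        if (!end)
                return 0;

        /* Consume only the bytes up to and including the terminator. */
        return recv(fd, buf, (size_t) (end - buf) + 1, 0);
}
```

Whatever follows the terminator stays in the socket buffer for the custom
protocol to read, at the cost of one extra recv() per message instead of
one read() per byte.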
Michael Vogt [Tue, 7 Apr 2026 15:47:50 +0000 (17:47 +0200)]
varlink: use single byte reads on SD_VARLINK_SERVER_UPGRADABLE
When the server side of a varlink connection supports connection
upgrades we need to go into single-byte read mode. Otherwise a client
that sends the JSON upgrade request immediately followed by the custom
protocol payload risks having part of that payload consumed by our
JSON read. This commit implements this.
The next step is using MSG_PEEK to avoid the single-byte overhead.
Michael Vogt [Sun, 5 Apr 2026 08:05:30 +0000 (10:05 +0200)]
libsystemd,varlink: always return two fds in varlink upgrade API
This commit tweaks the API of sd_varlink_call_and_upgrade and
sd_varlink_reply_and_upgrade to return two independent fds even
if the internal {input,output}_fd are the same (e.g. a socket).
This makes the external API easier as there is no longer the risk
of double close. The sd_varlink_call_and_upgrade() is not in a
released version of systemd yet so I presume it is okay to update
it still.
This also allowed some simplifications in varlinkctl.c now that
the handling is easier.
Michael Vogt [Thu, 2 Apr 2026 07:38:41 +0000 (09:38 +0200)]
varlinkctl: add new `serve` verb to allow wrapping command in varlink
With the new protocol upgrade support in varlinkctl client we can
now do the equivalent for the server side. This commit adds a new
`serve` verb that will serve any command that speaks stdin/stdout
via varlink and its protocol upgrade feature. This is the
"inetd for varlink".
This is useful for various reasons:
1. Allows providing e.g. a heavily sandboxed io.myorg.xz.Decompress
varlink endpoint (cf. xz CVE-2024-3094)
2. Allow sftp over varlink which is quite useful with the
varlink-http-bridge (that has more flexible auth mechanism than
plain sftp).
3. Makes testing the varlinkctl client protocol upgrade simpler.
4. Because we can.
Extract the fd-handling logic from sd_varlink_call_and_upgrade() into a
shared static helper so that it can be reused by the upcoming server-side
sd_varlink_reply_and_upgrade().
compress: write sparse files when decompressing to regular files
Core dumps are often very sparse, containing large zero-filled regions
whose actual disk usage can be significantly reduced by preserving
holes. Previously, decompress_stream() always wrote dense output,
expanding all zero regions into allocated disk blocks.
Each decompression backend (xz, lz4, zstd) now auto-detects whether the
output fd is suitable for sparse writes via a shared should_sparse()
helper. The check requires both S_ISREG (regular file) and !O_APPEND,
since O_APPEND causes write() to ignore the file position set by
lseek(), which would collapse the holes and corrupt the output. For
pipes, sockets, and append-mode files, dense writes are preserved via
loop_write_full() with USEC_INFINITY timeout, matching the original
behavior. After sparse decompression, finalize_sparse() sets the final
file size to account for any trailing holes.
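The suitability check described above could look roughly like the
following. This is a sketch under the stated conditions (S_ISREG and
!O_APPEND) with a hypothetical function name, not the should_sparse()
helper itself:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical re-creation of the check: sparse writes are only safe on
 * regular files that were not opened in append mode. */
static bool fd_allows_sparse_writes(int fd) {
        struct stat st;
        int flags;

        if (fstat(fd, &st) < 0 || !S_ISREG(st.st_mode))
                return false;   /* pipes, sockets, devices: write densely */

        flags = fcntl(fd, F_GETFL);
        if (flags < 0 || (flags & O_APPEND))
                return false;   /* O_APPEND ignores the lseek() position */

        return true;
}
```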
This is transparent to callers — all public signatures are unchanged.
coredumpctl benefits automatically:
- coredumpctl debug: temp file in /var/tmp is now sparse
- coredumpctl dump -o file: output file is now sparse
- coredumpctl dump > file: redirected stdout is now sparse
- coredumpctl dump | ...: pipe output unchanged (dense)
- coredumpctl dump >> file: append mode, falls back to dense
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-developed-by: Codex (GPT-5) <noreply@openai.com>
Daan De Meyer [Fri, 27 Mar 2026 13:26:16 +0000 (14:26 +0100)]
vmspawn: Support direct kernel boot without UEFI firmware
When --linux= specifies a non-PE kernel image, automatically disable
UEFI firmware loading (as if --firmware= was passed). If --firmware=
is explicitly set to a path in this case, fail with an error. Booting
a UKI with --firmware= is also rejected since UKIs require UEFI.
--firmware= (empty string) can also be used explicitly to disable
firmware loading for PE kernels.
Other changes:
- Extract OVMF pflash drive setup into cmdline_add_ovmf()
- Extract kernel image type detection into determine_kernel()
- Add smbios_supported() helper to centralize the SMBIOS availability
check (always available on x86, elsewhere requires firmware)
- Gate SMM, OVMF drives, SMBIOS11 and credential SMBIOS paths
on firmware/SMBIOS being available
- Beef up the credential logic to fall back to fw_cfg and kernel
command line in case SMBIOS is not available
coredumpctl: avoid unnecessary heap copy and decompression for field existence checks (#41520)
`print_list()` and `print_info()` used `RETRIEVE()` to `strndup()` the
entire `COREDUMP` field into a heap-allocated string, only to check
whether it exists. With `sd_journal_set_data_threshold(j, 0)` in
`print_info()`, this copies the full coredump binary (potentially
hundreds of MB) to heap just to print "Storage: journal".
This PR:
1. Makes `sd_journal_get_data()` output parameters optional
(`NULL`-safe), so callers can do pure existence checks without
receiving the data.
2. Short-circuits `maybe_decompress_payload()` after
`decompress_startswith()` succeeds when neither output pointer is
requested, skipping full blob decompression for compressed journal
entries.
3. Switches coredumpctl to pass `NULL, NULL` for the existence checks
instead of heap-copying via `RETRIEVE()`.
clangd: Strip GCC-only flags and silence unknown-attributes
Several GCC-only options in our compile_commands.json
(-fwide-exec-charset=UCS2, used by EFI boot code for UTF-16 string
literals, and -maccumulate-outgoing-args) cause clangd to emit
driver-level "unknown argument" errors. These can't be silenced through
Diagnostics.Suppress, so remove them via CompileFlags.Remove before
clang ever sees them.
Also suppress the -Wunknown-attributes warning that fires on every use
of _no_reorder_, since meson unconditionally expands it to the GCC-only
__no_reorder__ attribute when configured with GCC.
networkd-wwan: drop unreachable unknown-bearer fallback path
bearer_get_by_path() only succeeds when both modem and bearer are found.
On failure, trying bearer_new_and_initialize(modem, path) was
unreachable and relied on a modem value that is not returned on that
path.
Treat unknown bearers as no-op and rely on modem_map_bearers() for
association during initialization.
coredumpctl: use NULL outputs for COREDUMP existence checks
print_list() and print_info() used RETRIEVE() to strndup() the entire
COREDUMP field into a heap-allocated string, only to check whether it
exists. With sd_journal_set_data_threshold(j, 0) in print_info(),
this copies the full coredump binary (potentially hundreds of MB) to
heap just to print "Storage: journal".
Now that sd_journal_get_data() accepts NULL output pointers, use a
direct NULL/NULL existence check instead.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
sd-journal: skip full decompression when caller only checks field existence
When both ret_data and ret_size are NULL after decompress_startswith()
has confirmed the field matches, skip the decompress_blob() call.
This avoids decompressing potentially large payloads (e.g. inline
coredumps) just to discard the result.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
sd-journal: make sd_journal_get_data() output params optional
Allow callers to pass NULL for ret_data and/or ret_size when they only
need to check whether a field exists. Initialize provided output
pointers to safe defaults and update the manual page accordingly.
Propagate the NULL-ness through to journal_file_data_payload() so that
downstream helpers can optimize for the existence-check case.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
tmpfiles: skip redundant label writes to avoid unnecessary timestamp changes
When systemd-tmpfiles processes a 'z' (relabel) entry, fd_set_perms()
unconditionally calls label_fix_full() even when mode, owner, and group
already match. This causes setfilecon_raw() (SELinux) or xsetxattr() (SMACK)
to write the security label even if it is already correct, which on some
kernels updates the file's timestamps unnecessarily.
Fix this by comparing the current label with the desired label before
writing, and skipping the write when they already match. This is consistent
with how fd_set_perms() already skips chmod/chown when the values are
unchanged.
networkd-wwan: handle link_get_by_name() errors in modem_simple_connect()
modem_simple_connect() ignored the return value of link_get_by_name()
and then checked link for NULL. Since the helper only sets the output
pointer on success, that could read an indeterminate value.
Check and log the return code directly with log_debug_errno().
Timestamps are not guaranteed to be set by `statx()`, and their presence
should not be asserted as a proxy to judge the kernel version. In
particular, `STATX_ATIME` is omitted from the return when querying a
file on a `noatime` superblock, causing spurious errors from tmpfiles:
```console
# SYSTEMD_LOG_LEVEL=debug systemd-tmpfiles --clean
<...>
Running clean action for entry X /var/tmp/systemd-private-94cc8a77688e497f96d5b9019e66ed6f-*/tmp
statx() does not support 'STATX_ATIME' mask (running on an old kernel?)
statx(/var/tmp/systemd-private-94cc8a77688e497f96d5b9019e66ed6f-prometheus-smartctl-exporter.service-GKguQK/tmp) failed: Protocol driver not attached
statx() does not support 'STATX_ATIME' mask (running on an old kernel?)
statx(/var/tmp/systemd-private-94cc8a77688e497f96d5b9019e66ed6f-systemd-logind.service-k8j52T/tmp) failed: Protocol driver not attached
statx() does not support 'STATX_ATIME' mask (running on an old kernel?)
statx(/var/tmp/systemd-private-94cc8a77688e497f96d5b9019e66ed6f-irqbalance.service-7RJkev/tmp) failed: Protocol driver not attached
statx() does not support 'STATX_ATIME' mask (running on an old kernel?)
statx(/var/tmp/systemd-private-94cc8a77688e497f96d5b9019e66ed6f-chronyd.service-8hkO5G/tmp) failed: Protocol driver not attached
statx() does not support 'STATX_ATIME' mask (running on an old kernel?)
statx(/var/tmp/systemd-private-94cc8a77688e497f96d5b9019e66ed6f-dbus-broker.service-6P6LVl/tmp) failed: Protocol driver not attached
statx() does not support 'STATX_ATIME' mask (running on an old kernel?)
statx(/var/tmp/systemd-private-94cc8a77688e497f96d5b9019e66ed6f-nginx.service-B5HX8B/tmp) failed: Protocol driver not attached
Running clean action for entry x /var/tmp/systemd-private-94cc8a77688e497f96d5b9019e66ed6f-*
Running clean action for entry q /var/tmp
statx() does not support 'STATX_ATIME' mask (running on an old kernel?)
statx(/var/tmp) failed: Protocol driver not attached
<...>
```
Additionally, refactor `dir_cleanup()` slightly for self-consistency to
make it evident that the `NSEC_INFINITY` transformation is correct.
fstab-generator: support swap on network block devices
Teach swap units to support the _netdev option as well, which should
make swaps on iSCSI possible. This mirrors the logic we already have for
regular mounts in both the fstab-generator and the core
(mount.c/swap.c).
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
One more round, this time with the help of the claudebot, especially for
spelunking in git blame to find the original commit and writing commit
messages from the list of warnings exported from Coverity.
Co-developed-by: Claude <claude@anthropic.com>
sysext: provide systemd-{sysext,confext}-sysroot.service services (#41161)
This should pretty much close #38985
The new services are used to activate system and configuration
extensions for the main system from the initrd. This overcomes
the limitation that sysext/confext cannot be used to update the
resources which are required in the earliest boot of the system (before
systemd-sysext/systemd-confext start).
To make it possible to disable sysext/confext merging logic,
`systemd.sysext=0`, `systemd.confext=0`, `rd.systemd.sysext=0`,
`rd.systemd.confext=0` kernel cmdline options are introduced.
limits-util: use MUL_SAFE for physical memory calculation
Coverity flags (uint64_t)sc * (uint64_t)ps as a potential overflow.
Use MUL_SAFE which Coverity understands via __builtin_mul_overflow.
Physical page count times page size cannot realistically overflow
uint64_t, but this makes it provable to static analyzers.
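A minimal sketch of an overflow-checked multiply built on
__builtin_mul_overflow (a GCC/Clang builtin). The function names and the
"return 0 on overflow" convention are assumptions made for the example and
do not reproduce the MUL_SAFE macro:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a checked multiply: stores a * b in *ret and
 * returns true, or returns false if the product overflows. */
static bool mul_u64_checked(uint64_t *ret, uint64_t a, uint64_t b) {
        return !__builtin_mul_overflow(a, b, ret);
}

/* Physical memory in bytes from a page count and a page size. Returning 0
 * on overflow is an arbitrary choice for this sketch. */
static uint64_t physical_memory_bytes(uint64_t pages, uint64_t page_size) {
        uint64_t mem;

        if (!mul_u64_checked(&mem, pages, page_size))
                return 0;

        return mem;
}
```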
Coverity flags si.ssi_signo as tainted data from read(), and warns
that casting it to signed could produce a negative value. Add an
explicit range check against INT_MAX before the SIGNAL_VALID check
to prove the cast is safe.
Coverity flags ALIGN(sizeof(sd_bus_message)) as potentially
returning SIZE_MAX, making the subsequent + sizeof(BusMessageHeader)
overflow. Store the ALIGN result in a local and assert it is not
SIZE_MAX.
sd-bus: use INC_SAFE and assert for message_from_header allocation
Coverity flags ALIGN() as potentially returning SIZE_MAX and the
subsequent a += label_sz + 1 as overflowing. Assert ALIGN result
is not SIZE_MAX and use INC_SAFE for the addition.
Coverity flags now() + 30 * USEC_PER_SEC as overflowing because
now() can return USEC_INFINITY. Use usec_add() which saturates
on overflow instead of wrapping.
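The saturating behaviour can be sketched as follows. The typedef and
constants are defined locally so the example is self-contained, and the
function is an illustration rather than a copy of usec_add():

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t usec_t;
#define USEC_INFINITY ((usec_t) UINT64_MAX)
#define USEC_PER_SEC  ((usec_t) 1000000ULL)

/* Sketch of a saturating add: the result sticks at USEC_INFINITY instead
 * of wrapping around when the sum does not fit. */
static usec_t usec_add_saturating(usec_t t, usec_t delta) {
        if (t == USEC_INFINITY || delta == USEC_INFINITY)
                return USEC_INFINITY;
        if (t > USEC_INFINITY - delta)
                return USEC_INFINITY;

        return t + delta;
}
```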
Coverity flags sizeof(BusMessageHeader) + ALIGN8(m->fields_size)
as overflowing because ALIGN_TO can return SIZE_MAX as an overflow
sentinel. Assert that the aligned value is not SIZE_MAX to prove
the addition is safe.
recurse-dir: add assert for MALLOC_SIZEOF_SAFE lower bound
Coverity flags MALLOC_SIZEOF_SAFE(de) - offsetof(DirectoryEntries,
buffer) as a potential underflow when MALLOC_SIZEOF_SAFE returns 0.
After a successful malloc the return value is at least as large as
the requested size, but Coverity cannot trace this. Add an assert
to establish the lower bound.
Coverity flags range->n_entries - j - 1 and j-- as potential
underflows. Add an assert that j > 0 before decrementing, since
j starts at i + 1 >= 1 and is never decremented below its
initial value.
scsi_id: null-terminate serial after append_vendor_model
append_vendor_model() uses memcpy() to write VENDOR_LENGTH +
MODEL_LENGTH bytes without null-terminating. While the caller
zeroes the buffer beforehand, Coverity cannot trace this. Add
explicit null termination so the subsequent strlen() is provably
safe.
Uses stop_at_first_nonoption for POSIX-style option parsing.
Includes a fixup for b4df0a9ee62d553e21f3b70c28841cfd1b8736f1, where
global optarg was used instead of the function param. This made no
difference previously because they were always equal.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
coredumpctl: use loop_write() for dumping inline journal coredumps
Replace the bare write() call with loop_write(), which handles short
writes and EINTR retries. This also drops the now-unnecessary ssize_t
variable and the redundant r = log_error_errno(r, ...) self-assignment,
since loop_write() already stores its result in r.
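For readers unfamiliar with the pattern, a write loop that copes with short
writes and EINTR looks roughly like this. It is a simplified sketch with a
hypothetical name, not the loop_write() implementation:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Sketch: write all len bytes, retrying on EINTR and continuing after
 * short writes. Returns 0 on success or a negative errno value. */
static int write_all(int fd, const void *buf, size_t len) {
        const char *p = buf;

        while (len > 0) {
                ssize_t k = write(fd, p, len);

                if (k < 0) {
                        if (errno == EINTR)
                                continue;       /* interrupted, try again */
                        return -errno;
                }
                if (k == 0)
                        return -EIO;            /* no progress, give up */

                p += k;
                len -= (size_t) k;
        }

        return 0;
}
```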
vmspawn: Always enable CXL on supported architectures
Drop the --cxl= option and unconditionally enable cxl=on on the QEMU
machine type whenever the host architecture supports it (x86_64 and
aarch64). The flag was only added for testing parity with mkosi's CXL=
setting and there is no reason to leave it as an opt-in toggle: with no
pxb-cxl device or cxl-fmw window attached, enabling it on the machine
only reserves a small MMIO region and emits an empty CEDT, so the cost
is negligible while removing one knob users would otherwise have to
flip explicitly to exercise the CXL code paths in QEMU.
Reject entries once the configured maximum field count is reached.
The previous check used n > ENTRY_FIELD_COUNT_MAX before appending a new field,
which let one extra field through in boundary cases. Switch the check to
n >= ENTRY_FIELD_COUNT_MAX so an entry at the limit is rejected before adding
another property.
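The off-by-one can be seen in a small sketch. The limit of 4 and the helper
names are made up for the illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FIELD_COUNT_MAX 4u      /* illustrative limit, not the real value */

/* Old check: with n fields already stored, n == FIELD_COUNT_MAX still
 * passes, so one field too many gets appended. */
static bool may_append_old(size_t n) {
        return !(n > FIELD_COUNT_MAX);
}

/* New check: an entry that already holds FIELD_COUNT_MAX fields is
 * rejected before another one is added. */
static bool may_append_new(size_t n) {
        return !(n >= FIELD_COUNT_MAX);
}
```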
Jonas Rebmann [Tue, 7 Apr 2026 09:03:48 +0000 (11:03 +0200)]
test-specifier: update comment to moved file
src/partition/repart.c was renamed to src/repart/repart.c in commit
211d2f972dd1 ("Rename src/partition to src/repart"). Update the comment
accordingly.
The -E short option previously used fallthrough into the --more case;
since macro-generated case labels don't support fallthrough (with some
older compilers), the --more logic is now duplicated inline in the -E
handler.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
shared/options: quote the metavar in --help output
imdsd uses --extra-header='NAME: VALUE'. We could include the quotes
in the metavar string, but I think it's nicer to only do that in the
printed output, so that later, when we add introspection, the value
there will not include the quotes.
Vitaly Kuznetsov [Thu, 19 Mar 2026 15:04:34 +0000 (16:04 +0100)]
sysext: provide a cmdline kill switch for the sysext/confext merging logic
While it is possible to disable sysext/confext merging in the main system
with 'systemctl disable', sysext/confext are always merged in the initrd,
both by systemd-{sys,conf}ext-initrd.service and by
systemd-{sys,conf}ext-sysroot.service and especially the latter can be
unexpected. Provide kernel cmdline options systemd.{sys,conf}ext=0 and
rd.systemd.{sys,conf}ext=0 covering all options.
Vitaly Kuznetsov [Wed, 18 Mar 2026 16:09:24 +0000 (17:09 +0100)]
sysext: provide systemd-{sysext,confext}-sysroot.service services
The new services are used to activate system and configuration extensions
for the main system from the initrd. This overcomes the limitation that
sysext/confext cannot be used to update the resources which are required
in the earliest boot of the system (before systemd-sysext/systemd-confext
start).
- Fix sd_json_variant_unsigned() dispatching to the wrong accessor
for json variant references.
- Fix a use-after-free of a borrowed varlink reply reference in
ssh-proxy.
vmspawn: use machine name in runtime directory path (#41530)
Replace the random hex suffix in the runtime directory with the machine
name, changing the layout from /run/systemd/vmspawn.<random> to
/run/systemd/vmspawn/<machine-name>/.
This makes runtime directories machine-discoverable from the filesystem
and groups all vmspawn instances under a shared parent directory,
similar to how nspawn uses /run/systemd/nspawn/.
Use runtime_directory_generic() instead of runtime_directory() since
vmspawn is not a service with RuntimeDirectory= set and the
$RUNTIME_DIRECTORY check in the latter never succeeds. The directory is
always created by vmspawn itself and cleaned up via
rm_rf_physical_and_freep on exit. The parent vmspawn/ directory is
intentionally left behind as a shared namespace.
Ivan Shapovalov [Fri, 20 Mar 2026 15:45:07 +0000 (16:45 +0100)]
tmpfiles: do not mandate `STATX_ATIME` and `STATX_MTIME`
Timestamps are not guaranteed to be set by `statx()`, and their presence
should not be asserted as a proxy to judge the kernel version. In
particular, `STATX_ATIME` is omitted from the return when querying a
file on a `noatime` superblock, causing spurious errors from tmpfiles.
Correctness analysis
====================
The timestamps produced by the `statx()` call in `opendir_and_stat()`
are only ever used once, in `clean_item_instance()` (lines 3148-3149)
as inputs to `dir_cleanup()`. Convert absent timestamps into
`NSEC_INFINITY` as per the previous commit.
Ivan Shapovalov [Fri, 20 Mar 2026 15:36:44 +0000 (16:36 +0100)]
tmpfiles: use `NSEC_INFINITY` consistently in dir_cleanup()
Correctness analysis
====================
The *time_nsec variables are used for a total of 2 or 3 times:
- twice in needs_cleanup() (lines 788, 839)
- once in a recursive dir_cleanup() (line 764) as self_*time_nsec
In needs_cleanup(), all passed timestamps are guarded against
NSEC_INFINITY (this does not fix any real bugs as a 0 value is also
older than any cutoff point and thus would not cause any deletions).
Recursively in dir_cleanup(), the self_* variables are used to reset
the toplevel directory utimes, where they are superficially compared
against NSEC_INFINITY as a guard, but subsequently mishandled in the
case when only one of the times is NSEC_INFINITY: in this case, it will
be a) logged as a bogus value and b) passed through directly to
timespec_store_nsec(), which does special-case it, but in a way that
is invalid for futimens(). This is further fixed up by explicitly
mapping NSEC_INFINITY to TIMESPEC_OMIT.
This constitutes a bugfix in theory, as a ~STATX_ATIME return from
statx() would have previously caused the corresponding utime to be
reset to 0 (epoch) rather than being omitted from being set. However,
in a directory with ~STATX_ATIME, attempts to set atime would likely
be ignored as well.
Mostly this is a self-consistency fix that establishes that
dir_cleanup() should be called with NSEC_INFINITY in place of
absent timestamps.
vmspawn: use machine name in runtime directory path
Replace the random hex suffix in the runtime directory with the machine
name, changing the layout from /run/systemd/vmspawn.<random> to
/run/systemd/vmspawn/<machine-name>/.
This makes runtime directories machine-discoverable from the filesystem
and groups all vmspawn instances under a shared parent directory, similar
to how nspawn uses /run/systemd/nspawn/.
Use runtime_directory_generic() instead of runtime_directory() since
vmspawn is not a service with RuntimeDirectory= set and the
$RUNTIME_DIRECTORY check in the latter never succeeds. The directory is
always created by vmspawn itself and cleaned up via
rm_rf_physical_and_freep on exit. The parent vmspawn/ directory is
intentionally left behind as a shared namespace.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
sd-json: fix sd_json_variant_unsigned() dispatching to wrong accessor for references
sd_json_variant_unsigned() incorrectly calls sd_json_variant_integer()
for reference-type variants instead of recursing to itself. This silently
returns 0 for unsigned values in the range INT64_MAX+1 through
UINT64_MAX, since sd_json_variant_integer() cannot represent them.
The sibling functions sd_json_variant_integer() and
sd_json_variant_real() correctly recurse to themselves.
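The effect can be reproduced with a toy variant type. Everything below
(types, names, the "return 0 when out of range" behaviour of the signed
accessor) is a simplified model built from the description above, not the
sd-json code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { TOY_UNSIGNED, TOY_REFERENCE } ToyType;

typedef struct ToyVariant {
        ToyType type;
        uint64_t value;                         /* used by TOY_UNSIGNED */
        const struct ToyVariant *target;        /* used by TOY_REFERENCE */
} ToyVariant;

/* Signed accessor: cannot represent values above INT64_MAX and returns 0
 * for them. */
static int64_t toy_integer(const ToyVariant *v) {
        if (v->type == TOY_REFERENCE)
                return toy_integer(v->target);
        return v->value <= (uint64_t) INT64_MAX ? (int64_t) v->value : 0;
}

/* Buggy variant: follows a reference through the signed accessor. */
static uint64_t toy_unsigned_buggy(const ToyVariant *v) {
        if (v->type == TOY_REFERENCE)
                return (uint64_t) toy_integer(v->target);
        return v->value;
}

/* Fixed variant: recurses to itself, like its sibling accessors. */
static uint64_t toy_unsigned_fixed(const ToyVariant *v) {
        if (v->type == TOY_REFERENCE)
                return toy_unsigned_fixed(v->target);
        return v->value;
}
```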
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
ssh-proxy: fix use-after-free of borrowed varlink reply reference
sd_varlink_call_full() returns borrowed references into the varlink
connection's receive buffer (v->current). fetch_machine() stored this
borrowed reference with _cleanup_(sd_json_variant_unrefp), which would
unref it on error paths -- potentially freeing the parent object while
the varlink connection still owns it. On success, TAKE_PTR passed the
raw borrowed pointer to the caller, but the varlink connection (and its
receive buffer) is freed when fetch_machine returns, leaving the caller
with a dangling pointer.
Fix by removing the cleanup attribute (the reference is borrowed, not
owned) and taking a real ref via sd_json_variant_ref() before returning
to the caller, so the data survives the varlink connection's cleanup.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
shared: introduce MachineRegistrationContext to track bus and registration state
Bundle scope, buses, and registration success booleans into a
MachineRegistrationContext struct. This eliminates the reterr_registered_system and
reterr_registered_user output parameters from
register_machine_with_fallback_and_log() and the corresponding input
parameters from unregister_machine_with_fallback_and_log().
The struct carries state from registration to unregistration so the
caller no longer needs to manually thread individual booleans between
the two calls.
register_machine_with_fallback_and_log() goes from 7 to 3 parameters,
unregister_machine_with_fallback_and_log() goes from 5 to 2.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
shared: introduce MachineRegistration struct for machine registration
Replace the long positional parameter lists in register_machine() and
register_machine_with_fallback_and_log() with a MachineRegistration
struct that bundles all machine-describing fields.
This reduces register_machine() from 13 parameters to 3 and
register_machine_with_fallback_and_log() from 17 parameters to 7.
Callers now use designated initializers, which makes omitted fields
(zero/NULL/false) implicit and the code much more readable.
Field names are aligned with the existing Machine struct in machine.h
(id, root_directory, vsock_cid, ssh_address, ssh_private_key_path).
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
shared: document allocateUnit limitation on D-Bus fallback path
The D-Bus registration methods (RegisterMachineEx, RegisterMachineWithNetwork)
do not support the allocateUnit feature that the varlink path provides.
When varlink is unavailable and registration falls back to D-Bus, machined
discovers the caller's existing cgroup unit instead of creating a dedicated
scope. Callers that skip client-side scope allocation (relying on the
server to do it via allocateUnit) will end up without a dedicated scope
on the D-Bus fallback path.
Document this limitation at the fallback site so callers are aware.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
vmspawn: only open runtime bus when needed for registration or scope allocation
The runtime bus (user bus in user scope, system bus in system scope) is
only needed for scope allocation (!arg_keep_unit) or machine registration
(arg_register != 0). When both are disabled, the bus was still opened
unconditionally, which caused unnecessary failures if the user bus was
unavailable.
Gate the runtime bus opening on the same condition nspawn already uses.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
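The gating condition described above can be sketched as a small predicate. DemoArgs and its field names are hypothetical stand-ins for the arg_keep_unit and arg_register globals; the type of the register setting is assumed to be an integer where 0 means "no".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bundle of the two relevant command line settings. */
typedef struct DemoArgs {
        bool keep_unit;    /* true: stay in the caller's unit, no scope allocation */
        int register_mode; /* 0: registration disabled, non-zero: enabled */
} DemoArgs;

/* The runtime bus is needed for scope allocation or machine registration,
 * and can be skipped only when both are disabled. */
static bool demo_need_runtime_bus(const DemoArgs *args) {
        assert(args);
        return !args->keep_unit || args->register_mode != 0;
}
```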
Daan De Meyer [Sun, 29 Mar 2026 11:10:42 +0000 (11:10 +0000)]
nspawn: rename --user= to --uid= and repurpose --user/--system for runtime scope
Rename nspawn's --user=NAME option to --uid=NAME for selecting the
container user. The -u short option is preserved. --user=NAME and
--user NAME are still accepted but emit a deprecation warning. A
pre-parsing step stitches the space-separated --user NAME form into
--user=NAME before getopt sees it, preserving backwards compatibility
despite --user now being an optional_argument.
Repurpose --user (without argument) and --system as standalone
switches for selecting the runtime scope (user vs system service
manager).
Replace all uses of the arg_privileged boolean with
arg_runtime_scope comparisons throughout nspawn. The default scope
is auto-detected from the effective UID.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
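The pre-parsing step for the space-separated --user NAME form could look roughly like the sketch below. This is not the nspawn implementation: the function name is hypothetical, and the heuristic used to decide whether the following word is a NAME (it must exist and must not start with '-') is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Rewrites "--user NAME" into a single "--user=NAME" argument in place and
 * returns the new argument count, or -1 on allocation failure. Merged
 * strings are heap-allocated and owned by the caller.
 *
 * Assumed heuristic: the word after "--user" is treated as NAME only if it
 * exists and does not start with '-'. */
static int demo_stitch_user(int argc, char **argv) {
        assert(argv);

        int out = argc > 0 ? 1 : 0; /* keep the program name untouched */

        for (int i = out; i < argc; i++) {
                if (strcmp(argv[i], "--") == 0) {
                        /* Everything from "--" onwards is positional: copy verbatim. */
                        while (i < argc)
                                argv[out++] = argv[i++];
                        break;
                }

                if (strcmp(argv[i], "--user") == 0 && i + 1 < argc && argv[i + 1][0] != '-') {
                        size_t n = strlen("--user=") + strlen(argv[i + 1]) + 1;
                        char *joined = malloc(n);
                        if (!joined)
                                return -1;

                        snprintf(joined, n, "--user=%s", argv[i + 1]);
                        argv[out++] = joined;
                        i++; /* consume NAME as well */
                        continue;
                }

                argv[out++] = argv[i];
        }

        return out;
}
```

After this pass, an option parser that declares --user with an optional argument sees either a bare --user (scope switch) or --user=NAME (deprecated user selection), never the ambiguous two-word form.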
Daan De Meyer [Sun, 29 Mar 2026 18:22:40 +0000 (18:22 +0000)]
shared: move machine registration to shared machine-register.{c,h}
Move register_machine() and unregister_machine() from
vmspawn-register.{c,h} into shared machine-register.{c,h} so both
nspawn and vmspawn can use the same implementation.
The unified register_machine() uses varlink first (for richer
features like SSH support and unit allocation) with a D-Bus
RegisterMachineWithNetwork fallback for older machined. The
interface adds a class parameter ("vm" or "container") and
local_ifindex for nspawn's network interface support.
The unified unregister_machine() similarly tries varlink first
(io.systemd.Machine.Unregister) before falling back to D-Bus.
Both register_machine() and unregister_machine() only log at debug
level internally, leaving error/notice logging to callers.
Add register_machine_with_fallback() which tries system and/or user
scope registration based on a RuntimeScope parameter
(_RUNTIME_SCOPE_INVALID for both), and
unregister_machine_with_fallback() as its counterpart. Both use
RET_GATHER() to collect errors from each scope.
Make --register= a tristate (yes/no/auto) defaulting to auto. When
set to auto, registration failures are logged at notice level and
ignored. When set to yes, failures are fatal.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
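The tristate semantics of --register= can be sketched as follows. The enum, the parser and the result handler are hypothetical names chosen for illustration; the real code logs at notice level where the comment says so, which this sketch leaves out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

typedef enum DemoRegister {
        DEMO_REGISTER_NO,
        DEMO_REGISTER_YES,
        DEMO_REGISTER_AUTO, /* the default */
} DemoRegister;

/* Parses "yes", "no" or "auto"; returns 0 on success, -EINVAL otherwise. */
static int demo_parse_register(const char *s, DemoRegister *ret) {
        assert(s);
        assert(ret);

        if (strcmp(s, "yes") == 0)
                *ret = DEMO_REGISTER_YES;
        else if (strcmp(s, "no") == 0)
                *ret = DEMO_REGISTER_NO;
        else if (strcmp(s, "auto") == 0)
                *ret = DEMO_REGISTER_AUTO;
        else
                return -EINVAL;

        return 0;
}

/* Turns a registration result r (negative errno on failure) into the
 * caller's decision: 0 to continue, or the error to abort. */
static int demo_handle_result(DemoRegister mode, int r) {
        if (r >= 0)
                return 0;

        if (mode != DEMO_REGISTER_YES)
                return 0; /* auto: the failure would be logged at notice level and ignored */

        return r; /* yes: failure is fatal */
}
```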
machined: skip leader ownership check for user scope
When registering a machine, machined verifies that the leader process
is owned by the calling user via process_is_owned_by_uid(). This
check fails for user scope machined when the leader is inside a user
namespace: after the leader calls setns(CLONE_NEWUSER), it becomes
non-dumpable, and the subsequent ptrace_may_access() check in the
kernel denies access to the process's user namespace, since the
calling user lacks CAP_SYS_PTRACE in the mm's user namespace (the
host namespace), even though the user owns the child user namespace.
Skip this check when running in user scope. For system scope, the
check is important because multiple users share the same machined
instance, so one user must not be able to claim another user's process
as a machine leader. For user scope this is unnecessary: the varlink
socket lives under $XDG_RUNTIME_DIR (mode 0700), so only the owning
user can connect, and the user machined instance can only perform
operations bounded by that user's own privileges. Registering a
foreign PID does not escalate capabilities.
vmspawn: Redirect QEMU's stdin/stdout/stderr to the PTY
When a PTY is allocated for the console, QEMU's own stdio file
descriptors were still inherited directly from vmspawn, meaning any
output QEMU writes to stdout/stderr (e.g. warnings) would bypass the
PTY forwarder and go straight to the terminal. Similarly, QEMU could
read directly from the terminal's stdin.
Fix this by opening the PTY slave side and passing it as stdio_fds to
the fork call with FORK_REARRANGE_STDIO, so that all of QEMU's I/O
goes through the PTY and is properly forwarded.
vmspawn: Use ~ instead of ! as negation prefix for --firmware-features=
Switch the negation character for firmware feature exclusion from
"!" to "~" to be consistent with other systemd options that support
negation such as SystemCallFilter=.
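As an illustration of the "~" prefix convention, a single feature token could be classified like this. The function name is hypothetical and the sketch does not reflect how vmspawn actually tokenizes or stores the option value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the token starts with "~", i.e. the feature is to be
 * excluded. *ret_name always points at the feature name without the prefix
 * (it aliases the input string, nothing is copied). */
static bool demo_feature_is_negated(const char *token, const char **ret_name) {
        assert(token);
        assert(ret_name);

        bool negated = token[0] == '~';
        *ret_name = negated ? token + 1 : token;
        return negated;
}
```

With this convention a value such as "~secure-boot" (used here purely as an example) excludes the feature, and since "~" has no special meaning to the shell in this position, it avoids the quoting that "!" can require in interactive shells with history expansion.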
vmspawn: Add comment explaining substring match in firmware_data_matches_machine()
The machine types in QEMU firmware descriptions are glob patterns
like "pc-q35-*", so we use strstr() substring matching to check if
our machine type is covered by a given firmware entry.
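A minimal sketch of the substring check, assuming the machine type (for example "q35") is the needle and the firmware entry's pattern is the haystack. The function name and signature are hypothetical and differ from firmware_data_matches_machine().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if the machine type occurs anywhere inside the firmware
 * entry's machine pattern. This is a plain substring test, not glob
 * matching: "pc-q35-*" covers "q35" because "q35" appears in the pattern. */
static bool demo_pattern_covers_machine(const char *pattern, const char *machine) {
        assert(pattern);
        assert(machine);

        return strstr(pattern, machine) != NULL;
}
```

The test works because the short machine type is embedded in the versioned pattern names; applying the glob in the usual direction would not, since "q35" alone does not match "pc-q35-*".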
There's no way to configure the log level for swtpm_setup, so redirect
its log file (which defaults to stderr) to /dev/null unless debug
logging is enabled.