NEWS: pre-announce removal of /run/boot-loader-entries/ support in logind (#41622)
logind could read UAPI.1 Boot Loader Spec entries from
/run/boot-loader-entries/ in addition to ESP/XBOOTLDR. This was pretty
half-assed, and to my knowledge was never actually used much.
Let's remove support for it and simplify our codebase.
Let's schedule it for removal via NEWS in a future version, to give
people a chance to speak up.
journal-upload: also disable VERIFYHOST when --trust=all is used
When --trust=all disables CURLOPT_SSL_VERIFYPEER, the residual
CURLOPT_SSL_VERIFYHOST check is ineffective since an attacker can
present a self-signed certificate with the expected hostname. Disable
both for consistency and log that server certificate verification is
disabled.
machined: pass user as positional argument in machine_default_shell_args()
Instead of interpolating the user name directly into the sh -c script
body via asprintf %s, pass it as a positional parameter ($1) in a
separate argv entry. This avoids the user string being parsed as part
of the shell script syntax.
Also validate the user name in bus_machine_method_open_shell() with
valid_user_group_name(), matching the validation already done on the
Varlink path via json_dispatch_const_user_group_name().
logind: reject wall messages containing control characters
method_set_wall_message() and the property setter only checked the
message length but not its content. Since wall messages are broadcast
to all TTYs, control characters in the message could interfere with
terminal state. Reject messages containing control characters other
than newline and tab.
core: add missing SELinux access checks when listing units
Add mac_selinux_unit_access_check_varlink() to the unit enumeration
loop in vl_method_list_units(), silently skipping units the caller
is not permitted to see, matching the D-Bus ListUnits behavior.
Add mac_selinux_access_check_varlink() to vl_method_describe_manager().
- Use persist-credentials: false for actions/checkout, so we don't
leak the github token credentials to subsequent jobs.
- Remove one / from the Edit/Write permissions. Currently, with the
absolute path from github.workspace, we expand to three slashes while
we only need two.
Kai Lüke [Mon, 13 Apr 2026 12:21:39 +0000 (21:21 +0900)]
vmspawn: Support RUNTIME_DIRECTORY again
In ccecae0efd ("vmspawn: use machine name in runtime directory path")
support for RUNTIME_DIRECTORY was dropped which makes it difficult to
run systemd-vmspawn in a service unit which doesn't have write access
to the regular /run but should use its own managed RUNTIME_DIRECTORY.
What worked before was --keep-unit --system but we can't use
XDG_RUNTIME_DIR and --user because then --keep-unit breaks which
we need because it can't create a scope as there is no session.
Switch back to runtime_directory which handles RUNTIME_DIRECTORY and
tells us whether we should use it as is without later cleanup or if we
need to use the regular path where we create and delete the directory
ourselves.
many: final final set of coccinelle check-pointer-deref tweaks (#41595)
I promised in https://github.com/systemd/systemd/pull/41426 that it was the
final update for the coccinelle pointer-deref checks. However, it turned out
there is this coccinelle/parsing_hacks.h that I wasn't aware of. The
file was missing some important things like _cleanup_(x), which prevented
coccinelle from checking a bunch of functions.
This PR adds some missing defines to parsing_hacks.h and adds the
missing asserts(). I apologize that it's a bit long (and frankly boring)
and that I missed this earlier.
The last commit contains one small behavior change (ret in
sd_varlink_idl_parse() is now really optional) but the big one is very
mechanical.
This is useful when moving from `--pty` or `--pipe` to using
`--verbose`: you can use `--verbose-output=cat` to get similar output on
stdout while still having all of the advantages of `--verbose` over the
other options.
stat-util: always check S_ISDIR() before S_ISLNK()
Check S_ISDIR() before S_ISLNK() in all stat_verify_xyz() helpers that
check both, to ensure we systematically return the same errors.
Milan Kyselica [Sat, 11 Apr 2026 08:26:13 +0000 (10:26 +0200)]
boot: fix loop bound and OOB in devicetree_get_compatible()
The loop used the byte offset end (struct_off + struct_size) as the
iteration limit, but cursor[i] indexes uint32_t words. This reads
past the struct block when end > size_words.
Use size_words (struct_size / sizeof(uint32_t)) which is the correct
number of words to iterate over.
Also fix a pre-existing OOB in the FDT_BEGIN_NODE handler: the guard
i >= size_words is always false inside the loop (since the loop
condition already ensures i < size_words), so cursor[++i] at the
boundary reads one word past the struct block. Use i + 1 >= size_words
to check before incrementing.
Milan Kyselica [Sat, 11 Apr 2026 08:25:19 +0000 (10:25 +0200)]
boot: fix integer overflow and division by zero in BMP splash parser
Bound image dimensions before computing row_size to prevent overflow
in the depth * x multiplication on 32-bit. Without this, crafted
dimensions like depth=32 x=0x10000001 wrap to a small row_size that
passes all subsequent checks.
Reject channel masks where all bits are set (popcount == 32), since
1U << 32 is undefined behavior and causes division by zero on
architectures where it evaluates to zero. Move the validation before
computing derived values for clarity. Use unsigned 1U in shifts to
avoid signed integer overflow UB for popcount == 31.
journal: limit decompress_blob() output to DATA_SIZE_MAX (#41604)
We already have checks in place during compression that limit the data
we compress, so it shouldn't decompress to anything larger than
DATA_SIZE_MAX unless it's been tampered with. Let's make this
explicit and limit all our decompress_blob() calls in journal-handling
code to that limit.
One possible scenario this should prevent is when one tries to open and
verify a journal file that contains a compression bomb in its payload:
$ systemd-run --user --wait --pipe -- build-local/journalctl --verify --file=$PWD/test.journal
Running as unit: run-p682422-i4875779.service
000110: Invalid hash (00000000 vs. 11e4948d73bdafdd)
000110: Invalid object contents: Bad message
File corruption detected at /home/fsumsal/repos/@systemd/systemd/test.journal:272 (of 1249896 bytes, 0%).
FAIL: /home/fsumsal/repos/@systemd/systemd/test.journal (Bad message)
Finished with result: exit-code
Main processes terminated with: code=exited, status=1/FAILURE
Service runtime: 48.051s
CPU time consumed: 47.941s
Memory peak: 8G (swap: 0B)
Same could be, in theory, possible with just `journalctl --file=`, but
the reproducer would be a bit more complicated (haven't tried it, yet).
Lastly, the change in journal-remote is mostly hardening, as the maximum
input size to decompress_blob() there is mandated by MHD's connection
memory limit (set to JOURNAL_SERVER_MEMORY_MAX which is 128 KiB at the
time of writing), so the possible output size there is already quite
limited (e.g. ~800 - 900 MiB for xz-compressed data).
Daan De Meyer [Mon, 22 Dec 2025 10:22:34 +0000 (11:22 +0100)]
nspawn: Add --restrict-address-families= option
Add a new --restrict-address-families= command line option and
corresponding RestrictAddressFamilies= setting for .nspawn files to
restrict which socket address families may be used inside a container.
Many address families such as AF_VSOCK and AF_NETLINK are not
network-namespaced, so restricting access to them in containers
improves isolation. The option supports allowlist and denylist modes
(via ~ prefix), as well as "none" to block all families, matching the
semantics of RestrictAddressFamilies= in unit files.
The address family parsing logic is extracted into a shared
parse_address_families() helper in parse-helpers.c, which is now also
used by config_parse_address_families() in load-fragment.c.
This is currently opt-in. In a future version, the default will be
changed to restrict address families to AF_INET, AF_INET6 and AF_UNIX.
Daan De Meyer [Fri, 27 Mar 2026 22:03:14 +0000 (22:03 +0000)]
systemctl: replace kexec-tools dependency with direct kexec_file_load() syscall
Replace the fork+exec of /usr/bin/kexec in load_kexec_kernel() with a
direct kexec_file_load() syscall, removing the runtime dependency on
kexec-tools for systemctl kexec.
The kexec_file_load() syscall (available since Linux 3.17) accepts
kernel and initrd file descriptors directly, letting the kernel handle
image parsing, segment setup, and purgatory internally. This is much
simpler than the older kexec_load() syscall which requires complex
userspace setup of memory segments and boot protocol structures — that
complexity is the raison d'être of kexec-tools.
The implementation follows the established libc wrapper pattern: a
missing_kexec_file_load() fallback in src/libc/kexec.c calls the
syscall directly when glibc doesn't provide a wrapper (which is
currently always the case). The syscall is not available on all
architectures — alpha, i386, ia64, m68k, mips, sh, and sparc lack
__NR_kexec_file_load — so the wrapper and caller are guarded with
HAVE_KEXEC_FILE_LOAD_SYSCALL to compile cleanly everywhere.
When kexec_file_load() rejects the kernel image with ENOEXEC (e.g. the
image is compressed or wrapped in a PE container that the kernel's kexec
handler doesn't understand natively), we attempt to unwrap/decompress
and retry. This is effectively the same decompression and extraction
logic that already lives in src/ukify/ukify.py (maybe_decompress() and
get_zboot_kernel()), now implemented in C so that systemctl can handle
it natively without shelling out to external tools:
- Compressed kernels (Image.gz, Image.zst, Image.xz, Image.lz4): the
format is detected by magic bytes (per RFC 1952, RFC 8878,
tukaani.org xz spec, and lz4 frame format spec) and decompressed to
a memfd using the existing decompress_stream_*() infrastructure plus
the new gzip support from the previous commit. This is primarily
needed on arm64 where kexec_file_load() only accepts raw Image files.
On x86_64, bzImage is already the native format and works directly.
- EFI ZBOOT PE images (vmlinuz.efi): detected by "MZ" + "zimg" magic
at the start of the file. The compressed payload offset, size, and
compression type are read from the ZBOOT header defined in
linux/drivers/firmware/efi/libstub/zboot-header.S.
- Unified Kernel Images (UKI): detected as PE files with a .linux
section via the existing pe_is_uki() infrastructure. The .linux
section (kernel) and optionally .initrd section are extracted to
memfds. When a UKI provides an embedded initrd and the boot entry
doesn't specify one separately, the embedded initrd is used.
The try-first-then-decompress approach means we never decompress
unnecessarily: on x86_64 the first kexec_file_load() call succeeds
immediately with the raw bzImage, and on architectures where the
kernel's kexec handler natively understands PE (like LoongArch with
kexec_efi_ops), ZBOOT/UKI images work without decompression too.
If kexec_file_load() is unavailable (architectures without the syscall)
or all attempts fail, we fall back to forking+execing the kexec binary.
This preserves compatibility on architectures like i386 and mips where
only the older kexec_load() syscall exists and kexec-tools is needed to
handle the complex userspace setup.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
compress: rework decompressor_detect() on top of compression_detect_from_magic()
Replace the duplicated magic byte signatures in decompressor_detect()
with a call to the new compression_detect_from_magic() helper and use a
switch statement to initialize the appropriate decompression context.
time-util: encode our assumption that clock_gettime() never can return 0 or USEC_INFINITY
We generally assume that valid times returned by clock_gettime() are > 0
and < USEC_INFINITY. If this didn't hold, all kinds of things would
break, because we couldn't distinguish our niche values from regular
values anymore.
Let's hence encode our assumptions in C already, to help static
analyzers and LLMs.
One more round, this time with the help of the claudebot, especially for
spelunking in git blame to find the original commit and writing commit
messages from the list of warnings exported from Coverity.
Co-developed-by: Claude <claude@anthropic.com>
core: varlink enum for io.systemd.Unit interface (#40972)
Convert string fields to varlink enums in io.systemd.Unit
Following
https://github.com/systemd/systemd/pull/39391#discussion_r2489599449,
convert all configuration setting fields in the io.systemd.Unit varlink
interface from bare SD_VARLINK_STRING to proper enum types, adding type
safety to the IDL.
This converts ~30 fields across ExecContext, CGroupContext, and
UnitContext, adding 25 new varlink enum types.
Weak compatibility breakage (per
https://github.com/systemd/systemd/pull/40972#issuecomment-4222294318):
Varlink enum identifiers cannot contain - or +, so affected values are
underscorified on the wire. For example, "tty-force" becomes tty_force,
"kmsg+console" becomes kmsg_console.
Michael Vogt [Sun, 12 Apr 2026 13:47:48 +0000 (15:47 +0200)]
coccinelle: add SIZEOF() macro to work-around sizeof(*private)
We have code like `size_t max_size = sizeof(*private)` in three
places. This is evaluated at compile time, so it's safe to use. However,
the new pointer-deref checker in coccinelle is not smart enough to know
this and will flag those as errors. To avoid these false positives
we have some options:
1. Reorder so that we do:
```C
size_t max_size = 0;
assert(private);
max_size = sizeof(*private);
```
2. Use something like `size_t max_size = sizeof(*ASSERT_PTR(private));`
3. Place the assert before the declaration
4. Workaround coccinelle via SIZEOF(*private) that we can then hide
via parsing_hacks.h
5. Fix coccinelle (OCaml, hard)
6. ... something I missed?
None of these is very appealing. I went for (4), but I'm happy to hear
suggestions.
Michael Vogt [Sat, 11 Apr 2026 17:52:33 +0000 (19:52 +0200)]
sd-varlink: make ret optional in sd_varlink_idl_parse()
We have a test failure where the testsuite is calling
sd_varlink_idl_parse() with ret being NULL. This is now an
assert error, so we could either fix the test or fix the code.
Given that it seems genuinely useful to run sd_varlink_idl_parse()
without ret, e.g. to just check whether the IDL is valid, I opted to
fix the code.
test-json: add iszero_safe guards for float division at index 0 and 1
The existing iszero_safe guards at index 9 and 10 were added to
silence Coverity, but the same division-by-float-zero warning also
applies to the divisions at index 0 (DBL_MIN) and 1 (DBL_MAX).
debug-generator: assert breakpoint type is valid before bit shift
The BreakpointType enum includes _BREAKPOINT_TYPE_INVALID (-EINVAL),
so Coverity flags the bit shift as potentially using a negative shift
amount. Add an assert to verify the type is in valid range, since the
static table only contains valid entries.
uid-range: add assert to prevent underflow in coalesce loop
Coverity flags range->n_entries - j as a potential underflow
in the memmove size calculation. Add assert(range->n_entries > 0)
before decrementing n_entries, which holds since the loop condition
guarantees j < n_entries.
sd-varlink: scale down the limit of connections per UID to 128
1024 connections per UID is unnecessarily generous, so let's scale this
down a bit. D-Bus defaults to 256 connections per UID, but let's be even
more conservative and go with 128.
Michael Vogt [Tue, 31 Mar 2026 17:01:28 +0000 (19:01 +0200)]
tools: run check-coccinelle.sh with (updated) parsing_hacks.h
This commit runs the check-coccinelle checker scripts with
parsing_hacks.h. Because this was missing before, some
issues did not get flagged.
While at it, it also adds some missing cleanup attributes and
iterators to get better results. It's a bit sad that there is no
(easy/obvious) way to detect when new things are needed in
parsing_hacks.h.
Coverity was complaining that we were doing an integer division and then
casting the result to double. This was OK, but it was also a bit pointless.
An operation on a double and an unsigned promotes the unsigned to a double,
so it's enough to have a double somewhere as an argument early enough.
Drop the no-op casts and parens to make the formulas easier to read.
Sometimes we need to diff two unsigned numbers, which is awkward
because we need to cast them to something with a sign first if we want
to use abs(). Let's add a helper that avoids the function call
altogether.
Also drop unnecessary parens around args which are delimited by commas.
Coverity complains that r is overwritten. In fact it isn't, but
we shouldn't set it like this anyway. exec_with_listen_fds() already
logs, so we only need to call _exit() if it fails.
importctl: fix -N to actually clear keep-download flag
-N was clearing and re-setting the same bit in arg_import_flags_mask,
which is a no-op. It should clear the bit in arg_import_flags instead,
matching what --keep-download=no does via SET_FLAG().
shared/verbs: add _SCOPE variants of the verb macros
In some of the large programs, verbs are defined as non-static
functions. To support these cases, add variants of the VERB macros that
take an explicit scope parameter. The existing macros then call those
new macros with scope=static. The variant without static is the
exception, so the macros are "optimized" toward the static helpers.
I also considered allowing VERB macros to be used in different files,
i.e. in different compilation units. This would actually work without
too many changes, except for one caveat: the order in the array would be
unspecified, so we'd need to somehow order the verbs appropriately. This
most likely means that the verbs would need to be annotated with a
number. But that doesn't seem attractive at all: we'd need to coordinate
changes in different files. So just listing the verbs in one file seems
like the least bad option.
shared/options: add option to generate a help line for custom option format
Sometimes we want to document what -RR or -vv does or some other
special thing. Let's allow this by (ab-)using long_code pointer
to store a preformatted string.
json-stream: fix NULL pointer passed to memcpy on first read with INPUT_SENSITIVE
When JSON_STREAM_INPUT_SENSITIVE is set before the first read,
input_buffer is NULL, input_buffer_size is 0, and input_buffer_index
is 0. The old condition '!INPUT_SENSITIVE && index == 0' would route
this case into the else branch which calls memcpy() with a NULL source
pointer, which is undefined behavior even when the length is zero, and
is caught by UBSan.
Fix by checking input_buffer_index == 0 first, then allowing the
GREEDY_REALLOC fast path also when input_buffer_size == 0, since
there is no sensitive data to protect from realloc() copying in that
case. The else branch is now only entered when there is actual data
to copy (input_buffer_size > 0), guaranteeing input_buffer is
non-NULL.
core: fix EBUSY on restart and clean of delegated services
When a service is configured with Delegate=yes and DelegateSubgroup=sub,
the delegated container may write domain controllers (e.g. "pids") into the
service cgroup's cgroup.subtree_control via its cgroupns root. On container
exit the stale controllers remain, and on service restart clone3() with
CLONE_INTO_CGROUP fails with EBUSY because placing a process into a cgroup
that has domain controllers in subtree_control violates the no-internal-
processes rule. The same issue affects systemctl clean, where cg_attach()
fails with EBUSY for the same reason.
Add unit_cgroup_disable_all_controllers() helper in cgroup.c that clears
stale controllers via cg_enable(mask=0) and updates cgroup_enabled_mask to
keep internal tracking in sync. Call it from service_start() and
service_clean() right before spawning, so that resource control is preserved
for any lingering processes from the previous invocation as long as possible.
sd-json: add JsonStream transport-layer module and migrate sd-varlink
Introduces JsonStream, a generic transport layer for JSON-line message
exchange over a pair of file descriptors. It owns the input/output
buffers, SCM_RIGHTS fd passing, the deferred output queue, the
read/write/parse step functions, sd-event integration (input/output/time
event sources), the idle timeout machinery, and peer credential caching,
but knows nothing about the specific JSON protocol on top — the consumer
drives its state machine via phase/dispatch callbacks supplied at
construction.
sd-varlink is reworked to delegate the entire transport layer to a
JsonStream owned by sd_varlink. The varlink struct drops every
transport-related field (input/output buffers and fds, output queue,
fd-passing state, ucred/pidfd cache, prefer_read/write fallback, idle
timeout, description, event sources) — all of that lives in JsonStream
now. What remains in sd_varlink is the varlink-protocol state machine
(state, n_pending, current/previous/sentinel, server linkage, peer
credentials accounting, exec_pidref, the varlink-specific quit and defer
sources) and a thin wrapper layer over the JsonStream API. The
should_disconnect / get_timeout / get_events / wait helpers all live in
JsonStream now and are driven by a JsonStreamPhase the consumer reports
via its phase callback.
Ivan Kruglov [Thu, 5 Mar 2026 11:05:00 +0000 (03:05 -0800)]
test: add core-specific varlink enum sync test
Add test-varlink-idl-unit that validates all varlink enum types in
io.systemd.Unit match their corresponding C string tables. This
catches drift between varlink IDL enum definitions and internal
enum values.
Uses core_test_template since it links against libcore for access
to the string table lookup functions.
ExecOutput uses TEST_IDL_ENUM_TO_STRING only because the '+' in
'kmsg+console' doesn't survive the underscorify/dashify round-trip.
With yeswehack.com suspended while funding issues for triagers are being
worked out, reports on GH are starting to pile up. Explicitly define
some ground rules to avoid noise and time wasting.
Ivan Kruglov [Thu, 5 Mar 2026 10:31:24 +0000 (02:31 -0800)]
varlink: add enum types for configuration settings in io.systemd.Unit
Define proper varlink enum types for unit configuration settings that
are part of the user-facing API (values users/clients can select).
This replaces SD_VARLINK_STRING with SD_VARLINK_DEFINE_FIELD_BY_TYPE
for these fields, giving them strong type semantics in the IDL.
Engine-reported runtime state fields (Type, LoadState, ActiveState,
FreezerState, SubState, UnitFileState) remain as strings since only
the engine selects those values.
Add DEFINE_ARRAY_FREE_FUNC and mount_image_free_array
This is similar to DEFINE_POINTER_ARRAY_FREE_FUNC, but one
pointer chase less. The names of the outer and inner functions are
specified separately. The inner function does not free, so it'll
be generally something like 'foo_done', but the outer function
does free, so it can be called 'foo_array_free'.
Add DEFINE_POINTER_ARRAY_FREE_FUNC and conf_file_free_array
As mentioned in the grandfather commit, I want to use the _many
suffix for freeing the contents of an array, so the functions that
free the array itself get the suffix _array.
This is a helper macro that defines a function to drop elements of an
array but not the array itself. I used the "_many" suffix because it
most closely matches what happens here: we are calling the cleanup
function a bunch of times.
Ivan Kruglov [Thu, 5 Mar 2026 10:30:39 +0000 (02:30 -0800)]
test: extract varlink IDL test helpers into shared header
Move the TEST_IDL_ENUM_TO_STRING, TEST_IDL_ENUM_FROM_STRING, and
TEST_IDL_ENUM macros along with test_enum_to_string_name() from
test-varlink-idl.c into test-varlink-idl-util.h so they can be
reused by other test files.
newa(t, n) already allocates sizeof(t) * n bytes, so previously we'd
actually allocate sizeof(t) * sizeof(t) * n bytes, which is ~16x more
(on x86_64) than we actually needed.