git.ipfire.org Git - thirdparty/systemd.git/log

boot: fix integer overflow and division by zero in BMP splash parser

Bound image dimensions before computing row_size to prevent overflow
in the depth * x multiplication on 32-bit. Without this, crafted
dimensions like depth=32 x=0x10000001 wrap to a small row_size that
passes all subsequent checks.

Reject channel masks where all bits are set (popcount == 32), since
1U << 32 is undefined behavior and causes division by zero on
architectures where it evaluates to zero. Move the validation before
computing derived values for clarity. Use unsigned 1U in shifts to
avoid signed integer overflow UB for popcount == 31.

Also reject zero-width and zero-height images.
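
A minimal sketch of the checks described above, not the actual parser code. The function names and the BMP_DIMENSION_MAX cap are assumptions made for illustration (the commit message does not state the real limit), and __builtin_popcount() is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cap on width/height; the real bound is not given in the commit message. */
#define BMP_DIMENSION_MAX 4096U

/* Computes the 4-byte-aligned row size, refusing dimensions that could overflow depth * x. */
static bool bmp_row_size(uint32_t depth, uint32_t x, uint32_t y, uint32_t *ret) {
        assert(ret);

        if (x == 0 || y == 0)                   /* zero-width or zero-height image */
                return false;
        if (x > BMP_DIMENSION_MAX || y > BMP_DIMENSION_MAX)
                return false;
        if (depth == 0 || depth > 32)
                return false;

        /* depth <= 32 and x <= 4096, so depth * x + 31 cannot wrap in 32 bits. */
        *ret = ((depth * x + 31) / 32) * 4;
        return true;
}

/* Computes the number of distinct values of a channel mask, rejecting masks with all bits set. */
static bool bmp_channel_range(uint32_t mask, uint32_t *ret) {
        unsigned bits = (unsigned) __builtin_popcount(mask);

        assert(ret);

        if (bits >= 32)                         /* 1U << 32 would be undefined behavior */
                return false;

        *ret = 1U << bits;                      /* unsigned shift, so bits == 31 is fine */
        return true;
}
```

With these checks, the crafted dimensions from the message (depth=32, x=0x10000001) are refused before any multiplication happens.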

Fixes: https://github.com/systemd/systemd/issues/41589

core: use JSON_BUILD_CONST_STRING() where appropriate

journal: limit decompress_blob() output to DATA_SIZE_MAX (#41604)

We already have checks in place during compression that limit the data
we compress, so they shouldn't decompress to anything larger than
DATA_SIZE_MAX unless they've been tampered with. Let's make this
explicit and limit all our decompress_blob() calls in journal-handling
code to that limit.

One possible scenario this should prevent is when one tries to open and
verify a journal file that contains a compression bomb in its payload:

```
$ ls -lh test.journal
-rw-rw-r--+ 1 fsumsal fsumsal 1.2M Apr 12 15:07 test.journal

$ systemd-run --user --wait --pipe -- build-local/journalctl --verify --file=$PWD/test.journal
Running as unit: run-p682422-i4875779.service
000110: Invalid hash (00000000 vs. 11e4948d73bdafdd)
000110: Invalid object contents: Bad message
File corruption detected at /home/fsumsal/repos/@systemd/systemd/test.journal:272 (of 1249896 bytes, 0%).
FAIL: /home/fsumsal/repos/@systemd/systemd/test.journal (Bad message)
          Finished with result: exit-code
Main processes terminated with: code=exited, status=1/FAILURE
               Service runtime: 48.051s
             CPU time consumed: 47.941s
                   Memory peak: 8G (swap: 0B)
```
Same could be, in theory, possible with just `journalctl --file=`, but
the reproducer would be a bit more complicated (haven't tried it, yet).

Lastly, the change in journal-remote is mostly hardening, as the maximum
input size to decompress_blob() there is mandated by MHD's connection
memory limit (set to JOURNAL_SERVER_MEMORY_MAX which is 128 KiB at the
time of writing), so the possible output size there is already quite
limited (e.g. ~800 - 900 MiB for xz-compressed data).
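
To illustrate the idea of capping decompressed output, here is a self-contained sketch. It deliberately uses a toy run-length decoder as a stand-in for the real decompressors, and rle_decompress_bounded() is a hypothetical name; only the dst_max check mirrors what the commit describes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Toy run-length decoder: the input is a sequence of (count, byte) pairs.
 * Returns the number of bytes written, -EBADMSG on malformed input, or
 * -ENOBUFS if the output would exceed dst_max. */
static long rle_decompress_bounded(const uint8_t *src, size_t src_size,
                                   uint8_t *dst, size_t dst_max) {
        size_t written = 0;

        assert(src || src_size == 0);
        assert(dst || dst_max == 0);

        if (src_size % 2 != 0)
                return -EBADMSG;

        for (size_t i = 0; i < src_size; i += 2) {
                size_t count = src[i];

                /* Check the limit before writing, so a small input cannot
                 * expand past the cap. */
                if (count > dst_max - written)
                        return -ENOBUFS;

                for (size_t j = 0; j < count; j++)
                        dst[written++] = src[i + 1];
        }

        return (long) written;
}
```

A caller would pass its own maximum (DATA_SIZE_MAX in the journal code, per the message) as dst_max and treat -ENOBUFS as a sign of a corrupted or tampered payload.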

udev/scsi-id: hardening against malformed kernel data (#41585)

nspawn: Add --restrict-address-families= option

Add a new --restrict-address-families= command line option and
corresponding RestrictAddressFamilies= setting for .nspawn files to
restrict which socket address families may be used inside a container.

Many address families such as AF_VSOCK and AF_NETLINK are not
network-namespaced, so restricting access to them in containers
improves isolation. The option supports allowlist and denylist modes
(via ~ prefix), as well as "none" to block all families, matching the
semantics of RestrictAddressFamilies= in unit files.
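
The allowlist/denylist semantics can be sketched as follows. This is not the parse_address_families() helper itself: the type, the function names, the fixed-size table, and the handling of an empty string are all assumptions made to keep the example self-contained.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

#define FAMILIES_MAX 8

typedef struct AddressFamilyFilter {
        bool is_allowlist;              /* true: only listed families pass; false: listed families are denied */
        int families[FAMILIES_MAX];
        size_t n_families;
} AddressFamilyFilter;

static int family_from_name(const char *s, size_t len) {
        static const struct { const char *name; int family; } table[] = {
                { "AF_UNIX", AF_UNIX }, { "AF_INET", AF_INET }, { "AF_INET6", AF_INET6 },
                { "AF_NETLINK", AF_NETLINK }, { "AF_VSOCK", AF_VSOCK },
        };

        for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++)
                if (strlen(table[i].name) == len && memcmp(table[i].name, s, len) == 0)
                        return table[i].family;
        return -EINVAL;
}

/* Parses "none", "AF_A AF_B" (allowlist) or "~AF_A AF_B" (denylist). */
static int parse_families_sketch(const char *s, AddressFamilyFilter *ret) {
        AddressFamilyFilter f = { .is_allowlist = true };

        assert(s);
        assert(ret);

        if (strcmp(s, "none") != 0) {
                if (*s == '~') {
                        f.is_allowlist = false;
                        s++;
                }

                for (;;) {
                        s += strspn(s, " ");
                        if (*s == '\0')
                                break;

                        size_t len = strcspn(s, " ");
                        int family = family_from_name(s, len);
                        if (family < 0 || f.n_families >= FAMILIES_MAX)
                                return -EINVAL;

                        f.families[f.n_families++] = family;
                        s += len;
                }
        }

        *ret = f;
        return 0;
}

static bool family_permitted(const AddressFamilyFilter *f, int family) {
        bool listed = false;

        assert(f);

        for (size_t i = 0; i < f->n_families; i++)
                if (f->families[i] == family)
                        listed = true;

        return f->is_allowlist ? listed : !listed;
}
```

Here "none" is simply an allowlist with no entries, which blocks every family.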

The address family parsing logic is extracted into a shared
parse_address_families() helper in parse-helpers.c, which is now also
used by config_parse_address_families() in load-fragment.c.

This is currently opt-in. In a future version, the default will be
changed to restrict address families to AF_INET, AF_INET6 and AF_UNIX.

mkosi: Drop kexec-tools

Not needed anymore now that we use kexec_file_load().

systemctl: replace kexec-tools dependency with direct kexec_file_load() syscall

Replace the fork+exec of /usr/bin/kexec in load_kexec_kernel() with a
direct kexec_file_load() syscall, removing the runtime dependency on
kexec-tools for systemctl kexec.

The kexec_file_load() syscall (available since Linux 3.17) accepts
kernel and initrd file descriptors directly, letting the kernel handle
image parsing, segment setup, and purgatory internally. This is much
simpler than the older kexec_load() syscall which requires complex
userspace setup of memory segments and boot protocol structures — that
complexity is the raison d'être of kexec-tools.

The implementation follows the established libc wrapper pattern: a
missing_kexec_file_load() fallback in src/libc/kexec.c calls the
syscall directly when glibc doesn't provide a wrapper (which is
currently always the case). The syscall is not available on all
architectures — alpha, i386, ia64, m68k, mips, sh, and sparc lack
__NR_kexec_file_load — so the wrapper and caller are guarded with
HAVE_KEXEC_FILE_LOAD_SYSCALL to compile cleanly everywhere.
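
A rough sketch of such a guarded wrapper, assuming a Linux system with glibc's syscall(). The name kexec_file_load_sketch() is hypothetical and this is not the code from src/libc/kexec.c; the argument order follows kexec_file_load(2).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Calls the raw syscall where the architecture defines it; otherwise fails with ENOSYS. */
static int kexec_file_load_sketch(int kernel_fd, int initrd_fd,
                                  unsigned long cmdline_len, const char *cmdline,
                                  unsigned long flags) {
#ifdef __NR_kexec_file_load
        return (int) syscall(__NR_kexec_file_load, kernel_fd, initrd_fd,
                             cmdline_len, cmdline, flags);
#else
        (void) kernel_fd; (void) initrd_fd; (void) cmdline_len; (void) cmdline; (void) flags;
        errno = ENOSYS;
        return -1;
#endif
}
```

On architectures without __NR_kexec_file_load the caller sees ENOSYS and can take the fallback path.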

When kexec_file_load() rejects the kernel image with ENOEXEC (e.g. the
image is compressed or wrapped in a PE container that the kernel's kexec
handler doesn't understand natively), we attempt to unwrap/decompress
and retry. This is effectively the same decompression and extraction
logic that already lives in src/ukify/ukify.py (maybe_decompress() and
get_zboot_kernel()), now implemented in C so that systemctl can handle
it natively without shelling out to external tools:

- Compressed kernels (Image.gz, Image.zst, Image.xz, Image.lz4): the
   format is detected by magic bytes (per RFC 1952, RFC 8878,
   tukaani.org xz spec, and lz4 frame format spec) and decompressed to
   a memfd using the existing decompress_stream_*() infrastructure plus
   the new gzip support from the previous commit. This is primarily
   needed on arm64 where kexec_file_load() only accepts raw Image files.
   On x86_64, bzImage is already the native format and works directly.

- EFI ZBOOT PE images (vmlinuz.efi): detected by "MZ" + "zimg" magic
   at the start of the file. The compressed payload offset, size, and
   compression type are read from the ZBOOT header defined in
   linux/drivers/firmware/efi/libstub/zboot-header.S.

- Unified Kernel Images (UKI): detected as PE files with a .linux
   section via the existing pe_is_uki() infrastructure. The .linux
   section (kernel) and optionally .initrd section are extracted to
   memfds. When a UKI provides an embedded initrd and the boot entry
   doesn't specify one separately, the embedded initrd is used.
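
The magic-byte detection for the compressed formats listed above can be sketched like this. The enum and function names are hypothetical; the byte sequences are the ones given in the referenced format specifications.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum CompressionSketch {
        COMPRESSION_SKETCH_NONE,
        COMPRESSION_SKETCH_GZIP,
        COMPRESSION_SKETCH_ZSTD,
        COMPRESSION_SKETCH_XZ,
        COMPRESSION_SKETCH_LZ4,
} CompressionSketch;

static CompressionSketch detect_from_magic_sketch(const uint8_t *buf, size_t size) {
        static const uint8_t gzip_magic[] = { 0x1f, 0x8b };                     /* RFC 1952 */
        static const uint8_t zstd_magic[] = { 0x28, 0xb5, 0x2f, 0xfd };         /* RFC 8878 */
        static const uint8_t xz_magic[]   = { 0xfd, '7', 'z', 'X', 'Z', 0x00 }; /* xz file format */
        static const uint8_t lz4_magic[]  = { 0x04, 0x22, 0x4d, 0x18 };         /* lz4 frame format */

        assert(buf || size == 0);

        if (size >= sizeof(xz_magic) && memcmp(buf, xz_magic, sizeof(xz_magic)) == 0)
                return COMPRESSION_SKETCH_XZ;
        if (size >= sizeof(zstd_magic) && memcmp(buf, zstd_magic, sizeof(zstd_magic)) == 0)
                return COMPRESSION_SKETCH_ZSTD;
        if (size >= sizeof(lz4_magic) && memcmp(buf, lz4_magic, sizeof(lz4_magic)) == 0)
                return COMPRESSION_SKETCH_LZ4;
        if (size >= sizeof(gzip_magic) && memcmp(buf, gzip_magic, sizeof(gzip_magic)) == 0)
                return COMPRESSION_SKETCH_GZIP;

        return COMPRESSION_SKETCH_NONE;
}
```

Anything that does not match is treated as uncompressed and left to the kernel or to the PE handling paths.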

The try-first-then-decompress approach means we never decompress
unnecessarily: on x86_64 the first kexec_file_load() call succeeds
immediately with the raw bzImage, and on architectures where the
kernel's kexec handler natively understands PE (like LoongArch with
kexec_efi_ops), ZBOOT/UKI images work without decompression too.

If kexec_file_load() is unavailable (architectures without the syscall)
or all attempts fail, we fall back to forking+execing the kexec binary.
This preserves compatibility on architectures like i386 and mips where
only the older kexec_load() syscall exists and kexec-tools is needed to
handle the complex userspace setup.

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

libc: Add kexec_file_load() syscall wrapper

Allow tabs in UAPI headers in .gitattributes since they are copied
verbatim from the kernel.

compress: rework decompressor_detect() on top of compression_detect_from_magic()

Replace the duplicated magic byte signatures in decompressor_detect()
with a call to the new compression_detect_from_magic() helper and use a
switch statement to initialize the appropriate decompression context.

time-util: encode our assumption that clock_gettime() never can return 0 or USEC_INFINITY

We generally assume that valid times returned by clock_gettime() are > 0
and < USEC_INFINITY. If this didn't hold, all kinds of things would
break, because we couldn't distinguish our niche values from regular
values anymore.

Let's hence encode our assumptions in C, already to help static
analyzers and LLMs.

Inspired by: https://github.com/systemd/systemd/pull/41601#pullrequestreview-4094645891

udev/scsi-id: various typing refactorings

udev/scsi-id: check for invalid header from kernel buffer

udev/scsi-id: check for invalid chars in various fields received from the kernel

Follow-up for 16325b35fa6ecb25f66534a562583ce3b96d52f3

nss-systemd: fix off-by-one in nss_pack_group_record_shadow()

nss_count_strv() counts trailing NULL pointers in n. The pointer area
then used (n + 1), reserving one slot more than the size check
accounted for.

Drop the + 1 since n already includes the trailing NULLs, unlike the
non-shadow nss_pack_group_record() where n does not.

Fixes: https://github.com/systemd/systemd/issues/41591

More assorted coverity fixes (#41601)

One more round, this time with the help of the claudebot, especially for
spelunking in git blame to find the original commit and writing commit
messages from the list of warnings exported from coverity

Co-developed-by: Claude <claude@anthropic.com>

core: varlink enum for io.systemd.Unit interface (#40972)

Convert string fields to varlink enums in io.systemd.Unit

Following
https://github.com/systemd/systemd/pull/39391#discussion_r2489599449,
convert all configuration setting fields in the io.systemd.Unit varlink
interface from bare SD_VARLINK_STRING to proper enum types, adding type
safety to the IDL.

This converts ~30 fields across ExecContext, CGroupContext, and
UnitContext, adding 25 new varlink enum types.

Weak compatibility breakage (per
https://github.com/systemd/systemd/pull/40972#issuecomment-4222294318):
Varlink enum identifiers cannot contain - or +, so affected values are
underscorified on the wire. For example, "tty-force" becomes tty_force,
"kmsg+console" becomes kmsg_console.

The full list of affected values:
```
  - ExecInputType: tty-force, tty-fail
  - ExecOutputType: kmsg+console, journal+console
  - ProtectHome: read-only
  - CGroupController: bpf-firewall, bpf-devices, bpf-foreign, bpf-socket-bind, bpf-restrict-network-interfaces, bpf-bind-network-interface
  - CollectMode: inactive-or-failed
  - EmergencyAction: exit-force, reboot-force, reboot-immediate, poweroff-force, poweroff-immediate, soft-reboot, soft-reboot-force, kexec-force, halt-force, halt-immediate
  - JobMode: replace-irreversibly, ignore-dependencies, ignore-requirements, restart-dependencies
```
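
The mapping described above amounts to something like this sketch (underscorify_sketch() is a hypothetical name, not the helper used in the tree):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies s into buf, replacing '-' and '+' with '_'.
 * Returns buf, or NULL if buf is too small. */
static char *underscorify_sketch(const char *s, char *buf, size_t buf_size) {
        size_t n;

        assert(s);
        assert(buf);

        n = strlen(s);
        if (n >= buf_size)
                return NULL;

        for (size_t i = 0; i <= n; i++)         /* <= n also copies the trailing NUL */
                buf[i] = (s[i] == '-' || s[i] == '+') ? '_' : s[i];

        return buf;
}
```

Since both '-' and '+' collapse to '_', the mapping cannot be inverted unambiguously for values that contain '+'.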

journal: limit decompress_blob() output to DATA_SIZE_MAX

We already have checks in place during compression that limit the data
we compress, so they shouldn't decompress to anything larger than
DATA_SIZE_MAX unless they've been tampered with. Let's make this
explicit and limit all our decompress_blob() calls in journal-handling
code to that limit.

One possible scenario this should prevent is when one tries to open and
verify a journal file that contains a compression bomb in its payload:

$ ls -lh test.journal
-rw-rw-r--+ 1 fsumsal fsumsal 1.2M Apr 12 15:07 test.journal

$ systemd-run --user --wait --pipe -- build-local/journalctl --verify --file=$PWD/test.journal
Running as unit: run-p682422-i4875779.service
000110: Invalid hash (00000000 vs. 11e4948d73bdafdd)
000110: Invalid object contents: Bad message
File corruption detected at /home/fsumsal/repos/@systemd/systemd/test.journal:272 (of 1249896 bytes, 0%).
FAIL: /home/fsumsal/repos/@systemd/systemd/test.journal (Bad message)
          Finished with result: exit-code
Main processes terminated with: code=exited, status=1/FAILURE
               Service runtime: 48.051s
             CPU time consumed: 47.941s
                   Memory peak: 8G (swap: 0B)

Same could be, in theory, possible with just `journalctl --file=`, but
the reproducer would be a bit more complicated (haven't tried it, yet).

Lastly, the change in journal-remote is mostly hardening, as the maximum
input size to decompress_blob() there is mandated by MHD's connection
memory limit (set to JOURNAL_SERVER_MEMORY_MAX which is 128 KiB at the
time of writing), so the possible output size there is already quite
limited (e.g. ~800 - 900 MiB for xz-compressed data).

compress: limit the output to dst_max bytes with LZ4 if set

We already do that with other algorithms, so let's make
decompress_blob_lz4() consistent with the rest.

journal: move the {DATA,ENTRY}_SIZE constants to sd-journal

So we can access them from the code there as well.

coccinelle: add SIZEOF() macro to work-around sizeof(*private)

We have code like `size_t max_size = sizeof(*private)` in three
places. This is evaluated at compile time, so it's safe to use. However,
the new pointer-deref checker in coccinelle is not smart enough to know
this and will flag those as errors. To avoid these false positives
we have some options:
1. Reorder so that we do:
```C
size_t max_size = 0;
assert(private);
max_size = sizeof(*private);
```
2. Use something like `size_t max_size = sizeof(*ASSERT_PTR(private));`
3. Place the assert before the declaration
4. Work around coccinelle via SIZEOF(*private), which we can then hide
via parsing_hacks.h
5. Fix coccinelle (OCaml, hard)
6. ... something I missed?

None of these is very appealing. I went for (4) but happy about
suggestions.
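
For illustration, option (4) could be as small as the following sketch. The struct and function are made up, and how the macro is hidden from the checker is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Thin alias for sizeof: the operand is never evaluated, so no pointer is dereferenced. */
#define SIZEOF(x) sizeof(x)

struct private_data {           /* hypothetical type */
        int a;
        char b[12];
};

static size_t private_size(const struct private_data *private) {
        size_t max_size = SIZEOF(*private);     /* computed at compile time */

        assert(private);
        return max_size;
}
```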

sd-varlink: make ret optional in sd_varlink_idl_parse()

We have a test failure where the testsuite is calling
sd_varlink_idl_parse() with ret being NULL. This is now an
assert error. So we could either fix the test or fix the code.

Given that it seems genuinely useful to run sd_varlink_idl_parse()
without ret, e.g. to just check if the IDL is valid, I opted to
fix the code.

many: fix remaining check-pointer-deref issues

The updated parsing_hacks.h file uncovered a bunch of extra
things that the check-pointer-deref coccinelle script flags.

This commit fixes them to make the tree check-pointer-deref clean.

test-json: add iszero_safe guards for float division at index 0 and 1

The existing iszero_safe guards at index 9 and 10 were added to
silence Coverity, but the same division-by-float-zero warning also
applies to the divisions at index 0 (DBL_MIN) and 1 (DBL_MAX).

CID#1587762

Follow-up for 7f133c996c8b1ea9219540ec8f966b64b58d30a6

debug-generator: assert breakpoint type is valid before bit shift

The BreakpointType enum includes _BREAKPOINT_TYPE_INVALID (-EINVAL),
so Coverity flags the bit shift as potentially using a negative shift
amount. Add an assert to verify the type is in valid range, since the
static table only contains valid entries.
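
A sketch of the pattern, with hypothetical enum members and function name (only BreakpointType and _BREAKPOINT_TYPE_INVALID are taken from the message):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

typedef enum BreakpointType {
        BREAKPOINT_FIRST,               /* hypothetical member names */
        BREAKPOINT_SECOND,
        _BREAKPOINT_TYPE_MAX,
        _BREAKPOINT_TYPE_INVALID = -EINVAL,
} BreakpointType;

static uint32_t breakpoint_type_to_flag(BreakpointType t) {
        /* Rules out the negative "invalid" value before it is used as a shift amount. */
        assert(t >= 0 && t < _BREAKPOINT_TYPE_MAX);

        return UINT32_C(1) << t;
}
```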

CID#1568482

Follow-up for 1929226e7e649b72f3f9acd464eaac771c00945c

nss-myhostname: add more INC_SAFE for buffer index accumulation

Use overflow-safe INC_SAFE() instead of raw addition for idx
accumulation, so that Coverity can see the addition is checked.
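
The idea behind an overflow-checked increment can be sketched with the GCC/Clang builtin __builtin_add_overflow(). The function below is a hypothetical stand-in, not the INC_SAFE() macro itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Adds b to *a. Returns false and leaves *a untouched if the sum would overflow. */
static bool inc_safe_sketch(size_t *a, size_t b) {
        size_t sum;

        assert(a);

        if (__builtin_add_overflow(*a, b, &sum))
                return false;

        *a = sum;
        return true;
}
```

A caller would typically bail out with an error when the function returns false instead of continuing with a wrapped index.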

CID#1548028

Follow-up for a05483a921a518fd283e7cb32dc8c8e816b2ab2c

uid-range: add assert to prevent underflow in coalesce loop

Coverity flags range->n_entries - j as a potential underflow
in the memmove size calculation. Add assert(range->n_entries > 0)
before decrementing n_entries, which holds since the loop condition
guarantees j < n_entries.

CID#1548015

Follow-up for 8dcc66cefc8ab489568c737adcba960756d76a3c

Some coverity cleanups (#41596)

Another batch of option+verb conversions (#41586)

sd-varlink: scale down the limit of connections per UID to 128

1024 connections per UID is unnecessarily generous, so let's scale this
down a bit. D-Bus defaults to 256 connections per UID, but let's be even
more conservative and go with 128.

po: Translated using Weblate (Arabic)

Currently translated at 100.0% (266 of 266 strings)

Co-authored-by: joo es <jonnyse@users.noreply.translate.fedoraproject.org>
Translate-URL: https://translate.fedoraproject.org/projects/systemd/main/ar/
Translation: systemd/main

tools: run check-coccinelle.sh with (updated) parsing_hacks.h

This commit runs the check-coccinelle checker scripts with
parsing_hacks.h. Because this was missing before, there were some
issues that did not get flagged.

While at it, it also adds some missing cleanup attributes and
iterators to get better results. It's a bit sad that there is no
(easy/obvious) way to detect when new things are needed for
parsing_hacks.h.

homed: drop unnecessary cast to double

Coverity was complaining that we were doing an integer division and then
casting that to double. This was OK, but it was also a bit pointless.
An operation on a double and an unsigned promotes the unsigned to a double,
so it's enough if we have a double somewhere as an argument early enough.
Drop noop casts and parens to make the formulas easier to read.

CID#1466459

fundamental: add ABS_DIFF macro

Sometimes we need to diff two unsigned numbers, which is awkward
because we need to cast them to something with a sign first, if we want
to use abs(). Let's add a helper that avoids the function call
altogether.
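
A simplified sketch of such a helper follows. Note that this version evaluates its arguments twice, so it is only an illustration of the idea and not necessarily how the macro in fundamental/ is written.

```c
#include <assert.h>
#include <stdint.h>

/* Absolute difference without any signed cast: subtract the smaller from the larger. */
#define ABS_DIFF_SKETCH(a, b) ((a) > (b) ? (a) - (b) : (b) - (a))

/* Example use with unsigned timestamps, where a - b could wrap around. */
static uint64_t timestamp_distance(uint64_t a, uint64_t b) {
        return ABS_DIFF_SKETCH(a, b);
}
```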

Also drop unnecessary parens around args which are delimited by commas.

varlinkctl: drop bogus variable assignment

Coverity complains that r is overridden. In fact it isn't, but
we shouldn't set it like this anyway. exec_with_listen_fds() already
logs, so we only need to call _exit() if it fails.

CID#1646716

sd-event: replace dead code path with an assert

Coverity complains that the -EOPNOTSUPP can never be returned, because
we always have !watch_fallback==locked.

CID#1654169

cryptsetup: convert to the new option and verb parsers

The synopsis is moved from the header to a new section:

  -systemd-cryptsetup attach VOLUME SOURCE-DEVICE [KEY-FILE] [CONFIG]
  -systemd-cryptsetup detach VOLUME
  +systemd-cryptsetup [OPTIONS...] {COMMAND} ...

   Attach or detach an encrypted block device.

  +Commands:
  +  attach VOLUME SOURCE-DEVICE [KEY-FILE] [CONFIG] Attach an encrypted block
  +                                                  device
  +  detach VOLUME                                   Detach an encrypted block
  +                                                  device
  +
  +Options:

I think that's OK… With the autogenerated table that's the natural
thing to do.

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

cryptenroll: convert to the new option parser

--help is the same, apart from linewrapping.

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

cryptenroll: reorder option cases to match --help output

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

cat: convert to the new option parser

--help is identical except for whitespace.

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

bsod: convert to the new option parser

Option indentation in --help is fixed.
Description for --continuous is shortened.

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

delta: convert to the new option parser

--help for --diff= is changed from old-style "1|0" to "yes|no".

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

escape: convert to the new option parser

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

pull: convert to the new option and verb parsers

Duplicated word in description of --keep-download= is fixed.

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

importctl: convert to the new option and verb parsers

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

importctl: fix -N to actually clear keep-download flag

-N was clearing and re-setting the same bit in arg_import_flags_mask,
which is a no-op. It should clear the bit in arg_import_flags instead,
matching what --keep-download=no does via SET_FLAG().
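
The difference between the two behaviors can be shown with a small sketch. The struct, the bit value, and the function names are hypothetical; the assumption is that the mask records which flags were set explicitly while the flags field holds their values.

```c
#include <assert.h>
#include <stdint.h>

#define KEEP_DOWNLOAD_BIT (UINT64_C(1) << 0)    /* hypothetical bit position */

typedef struct ImportArgsSketch {
        uint64_t flags;         /* current flag values */
        uint64_t mask;          /* which flags were explicitly configured */
} ImportArgsSketch;

/* Old behavior: clears and re-sets the same bit in the mask, so flags never change. */
static void apply_no_keep_buggy(ImportArgsSketch *a) {
        assert(a);
        a->mask &= ~KEEP_DOWNLOAD_BIT;
        a->mask |= KEEP_DOWNLOAD_BIT;
}

/* Fixed behavior: clears the bit in flags and records in the mask that it was set explicitly. */
static void apply_no_keep_fixed(ImportArgsSketch *a) {
        assert(a);
        a->flags &= ~KEEP_DOWNLOAD_BIT;
        a->mask |= KEEP_DOWNLOAD_BIT;
}
```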

import: convert to the new option and verb parsers

--help output is the same.

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

bootctl: convert options and verbs to the new macros

-RR is formatted using the new OPTION_HELP_ENTRY_VERBATIM so that
we get the same --help as before.

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

shared/verbs: add _SCOPE variants of the verb macros

In some of the large programs, verbs are defined as non-static
functions. To support these cases, add variants of the VERB macros that
take an explicit scope parameter. The existing macros then call those
new macros with scope=static. The variant without static is the
exception, so the macros are "optimized" toward the static helpers.

I also considered allowing VERB macros to be used in different files,
i.e. in different compilation units. This would actually work without
too many changes, except for one caveat: the order in the array would be
unspecified, so we'd need to somehow order the verbs appropriately. This
most likely means that the verbs would need to be annotated with a
number. But that doesn't seem attractive at all: we'd need to coordinate
changes in different files. So just listing the verbs in one file seems
like the least bad option.

shared/options: add option to generate a help line for custom option format

Sometimes we want to document what -RR or -vv does or some other
special thing. Let's allow this by (ab-)using long_code pointer
to store a preformatted string.

json-stream: fix NULL pointer passed to memcpy on first read with INPUT_SENSITIVE

When JSON_STREAM_INPUT_SENSITIVE is set before the first read,
input_buffer is NULL, input_buffer_size is 0, and input_buffer_index
is 0. The old condition '!INPUT_SENSITIVE && index == 0' would route
this case into the else branch which calls memcpy() with a NULL source
pointer, which is undefined behavior even when the length is zero, and
is caught by UBSan.

Fix by checking input_buffer_index == 0 first, then allowing the
GREEDY_REALLOC fast path also when input_buffer_size == 0, since
there is no sensitive data to protect from realloc() copying in that
case. The else branch is now only entered when there is actual data
to copy (input_buffer_size > 0), guaranteeing input_buffer is
non-NULL.

Follow-up for 6b1a59d59426cdda56648b00394addde2d454418

core: fix EBUSY on restart and clean of delegated services

When a service is configured with Delegate=yes and DelegateSubgroup=sub,
the delegated container may write domain controllers (e.g. "pids") into the
service cgroup's cgroup.subtree_control via its cgroupns root. On container
exit the stale controllers remain, and on service restart clone3() with
CLONE_INTO_CGROUP fails with EBUSY because placing a process into a cgroup
that has domain controllers in subtree_control violates the no-internal-
processes rule. The same issue affects systemctl clean, where cg_attach()
fails with EBUSY for the same reason.

Add unit_cgroup_disable_all_controllers() helper in cgroup.c that clears
stale controllers via cg_enable(mask=0) and updates cgroup_enabled_mask to
keep internal tracking in sync. Call it from service_start() and
service_clean() right before spawning, so that resource control is preserved
for any lingering processes from the previous invocation as long as possible.

sd-json: limit the stack depth during parsing as well

Define array cleanup funcs through macros (#41559)

sd-json: add JsonStream transport-layer module and migrate sd-varlink

Introduces JsonStream, a generic transport layer for JSON-line message
exchange over a pair of file descriptors. It owns the input/output
buffers, SCM_RIGHTS fd passing, the deferred output queue, the
read/write/parse step functions, sd-event integration (input/output/time
event sources), the idle timeout machinery, and peer credential caching,
but knows nothing about the specific JSON protocol on top — the consumer
drives its state machine via phase/dispatch callbacks supplied at
construction.

sd-varlink is reworked to delegate the entire transport layer to a
JsonStream owned by sd_varlink. The varlink struct drops every
transport-related field (input/output buffers and fds, output queue,
fd-passing state, ucred/pidfd cache, prefer_read/write fallback, idle
timeout, description, event sources) — all of that lives in JsonStream
now. What remains in sd_varlink is the varlink-protocol state machine
(state, n_pending, current/previous/sentinel, server linkage, peer
credentials accounting, exec_pidref, the varlink-specific quit and defer
sources) and a thin wrapper layer over the JsonStream API. The
should_disconnect / get_timeout / get_events / wait helpers all live in
JsonStream now and are driven by a JsonStreamPhase the consumer reports
via its phase callback.

news: new record about strings vs enums in varlink

test: add core-specific varlink enum sync test

Add test-varlink-idl-unit that validates all varlink enum types in
io.systemd.Unit match their corresponding C string tables. This
catches drift between varlink IDL enum definitions and internal
enum values.

Uses core_test_template since it links against libcore for access
to the string table lookup functions.

ExecOutput uses TEST_IDL_ENUM_TO_STRING only because the '+' in
'kmsg+console' doesn't survive the underscorify/dashify round-trip.

docs: beef up SECURITY.md rules for reporting

With yeswehack.com suspended due to funding issues for triagers being
worked out, reports on GH are starting to pile up. Explicitly define
some ground rules to avoid noise and time wasting.

varlink: add enum types for configuration settings in io.systemd.Unit

Define proper varlink enum types for unit configuration settings that
are part of the user-facing API (values users/clients can select).
This replaces SD_VARLINK_STRING with SD_VARLINK_DEFINE_FIELD_BY_TYPE
for these fields, giving them strong type semantics in the IDL.

Enum types added for ExecContext (ExecInputType, ExecOutputType,
ExecUtmpMode, ExecPreserveMode, ExecKeyringMode, MemoryTHP,
ProtectProc, ProcSubset, ProtectSystem, ProtectHome, PrivateTmp,
PrivateUsers, ProtectHostname, ProtectControlGroups, PrivatePIDs,
PrivateBPF), CGroupContext (CGroupDevicePolicy, ManagedOOMMode,
ManagedOOMPreference, CGroupPressureWatch, NFTSetSource, NFProto,
BPFCGroupAttachType, CGroupController), and UnitContext (CollectMode,
EmergencyAction, JobMode).

Engine-reported runtime state fields (Type, LoadState, ActiveState,
FreezerState, SubState, UnitFileState) remain as strings since only
the engine selects those values.

various: use DEFINE_ARRAY_FREE_FUNC

sysupdate: use DEFINE_POINTER_ARRAY_FREE_FUNC, rename func

shared/tar-util: use DEFINE_ARRAY_FREE_FUNC, rename funcs

sd-journal: use NormalCasing for struct

nsresourced: use DEFINE_ARRAY_FREE_FUNC, make func static and rename

libsystemd-network: use DEFINE_POINTER_ARRAY_FREE_FUNC, rename cleanup function

libsystemd-network: use DEFINE_ARRAY_FREE_FUNC, rename cleanup func

stub: use DEFINE_ARRAY_FREE_FUNC

Add DEFINE_ARRAY_FREE_FUNC and mount_image_free_array

This is similar to DEFINE_POINTER_ARRAY_FREE_FUNC, but one
pointer chase less. The names of the outer and inner functions are
specified separately. The inner function does not free, so it'll
be generally something like 'foo_done', but the outer function
does free, so it can be called 'foo_array_free'.
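
As a sketch of the shape such a macro might take (the macro name, its argument order, and the Item type are assumptions, not the definition added by this commit):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Defines name(array, n): runs done_func on each element, then frees the array itself. */
#define DEFINE_ARRAY_FREE_FUNC_SKETCH(name, type, done_func)    \
        static void name(type *array, size_t n) {               \
                assert(array || n == 0);                        \
                for (size_t i = 0; i < n; i++)                  \
                        done_func(&array[i]);                   \
                free(array);                                    \
        }

typedef struct Item {
        char *label;
} Item;

static size_t n_items_done = 0;         /* only here so the sketch can be observed */

/* Inner function: releases the contents of one element, but not the element itself. */
static void item_done(Item *i) {
        assert(i);
        free(i->label);
        i->label = NULL;
        n_items_done++;
}

DEFINE_ARRAY_FREE_FUNC_SKETCH(item_array_free, Item, item_done)
```

The elements are stored inline in the array, so the inner function must not call free() on the element pointer.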

Add DEFINE_POINTER_ARRAY_FREE_FUNC and conf_file_free_array

As mentioned in the grandfather commit, I want to use the _many
suffix for freeing of the contents of an array, so the functions
to free the array get the suffix _array.

firewall-util: use DEFINE_ARRAY_DONE_FUNC for netlink message cleanup

Replace the open-coded netlink_message_unref_many() function and its
DEFINE_TRIVIAL_CLEANUP_FUNC wrapper with DEFINE_ARRAY_DONE_FUNC.

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

Add DEFINE_ARRAY_DONE_FUNC

This is a helper macro that defines a function to drop elements of an
array but not the array itself. I used the "_many" suffix because it
most closely matches what happens here: we are calling the cleanup
function a bunch of times.

test: extract varlink IDL test helpers into shared header

Move the TEST_IDL_ENUM_TO_STRING, TEST_IDL_ENUM_FROM_STRING, and
TEST_IDL_ENUM macros along with test_enum_to_string_name() from
test-varlink-idl.c into test-varlink-idl-util.h so they can be
reused by other test files.

Update TODO

meson: Check if files returned by git ls-files actually exist

Otherwise you run into errors such as:

"""
../meson.build:2899:28: ERROR: File src/test/test-loop-util.c does not exist.
"""

when deleting a file in git without staging the deletion.

sd-bus: don't overallocate the message buffer

newa(t, n) already allocates sizeof(t) * n bytes, so previously we'd
actually allocate sizeof(t) * sizeof(t) * n bytes, which is ~16x more
(on x86_64) than we actually needed.

This is probably an oversight from a tree-wide change in
6e9417f5b4f29938fab1eee2b5edf596cc580452 that replaced alloca() with
newa().

Follow-up for 6e9417f5b4f29938fab1eee2b5edf596cc580452.

logind: add io.systemd.Shutdown varlink interface (#41229)

The shutdown interface is currently only exposed via dbus. This PR
adds a comparable varlink implementation. It is inspired by the existing
dbus methods and implements PowerOff, Reboot, Halt, Kexec, SoftReboot
as varlink methods on a new io.systemd.Shutdown interface.

It is (intentionally) simpler than the dbus interface for now, i.e. no
Can* yet, mostly because I want to get feedback first (happy to do that
in a followup).

The only real difficulty here is what to do about verify_shutdown_creds()
as this is needed by both dbus and varlink and it's dbus-only. I went for
an ugly but (hopefully) pragmatic choice (see the commit message for
details). But I can totally understand if a refactor instead is
preferred.

Help users with incorrect / permission bits (#41431)

This error causes the computer to pass the emergency.target and go to
graphical.target.
Then, your window manager will have problems because it can't access any
directories, and your network manager won't start up the network. In my case,
the screen just goes black. Ideally, you'd get an error message
explaining this edge scenario that's occurring, and an emergency
shell that makes it easy to run the necessary chmod 0755 / to proceed
with booting. I'm not sure if this is the correct way to implement this,
sorry, it's my first contribution.

I ran
`meson test -C build`
and got

Ok:                1806
Fail:              26
Skipped:           25
on my cloned systemd repo before any changes, and got the same result
after my commit ¯\_(ツ)_/¯
So I hope I did that right.
Thanks

report: add cgroup metrics in a separate varlink service (#41489)

Add CpuUsage, MemoryUsage, IOReadBytes, IOReadOperations, and
TasksCurrent in a standalone socket-activated varlink service. These
metrics are gathered from the kernel via cgroup files and PID1's only
role is mapping unit names to cgroup paths — a separate process can
query PID1 once for that mapping and then read the cgroup files
directly, minimizing PID1 involvement.

The new systemd-report-cgroup-metrics service listens at
/run/systemd/report/io.systemd.CGroup and exposes:
  - io.systemd.CGroup.CpuUsage
  - io.systemd.CGroup.IOReadBytes
  - io.systemd.CGroup.IOReadOperations
  - io.systemd.CGroup.MemoryUsage (with type=current/available/peak)
  - io.systemd.CGroup.TasksCurrent

This is spun out of #41078 and based on top of it. Will rebase once
that's merged.

resolved: replace assert() with error return in DNSSEC verify functions

dnssec_rsa_verify_raw() asserts that RSA_size(key) matches the RRSIG
signature size, and dnssec_ecdsa_verify_raw() asserts that
EC_KEY_check_key() succeeds. Both conditions depend on parsed DNS
record content. Replace with proper error returns.

The actual crypto verify calls (EVP_PKEY_verify / ECDSA_do_verify)
handle mismatches fine on their own, so the asserts were also redundant.

While at it, fix the misleading "EC_POINT_bn2point failed" log message
that actually refers to an EC_KEY_set_public_key() failure.

Fixes: https://github.com/systemd/systemd/issues/41569

claude-review: improve review quality for large PRs

Several issues were identified from analyzing logs of a large (52-commit) PR
review:

- Claude was batching multiple commits into a single review agent instead of
  one per worktree. Strengthen the prompt to explicitly prohibit grouping.
- Claude was reading pr-context.json and commit messages before spawning
  agents despite instructions not to, wasting time. Tighten the pre-spawn
  rules to only allow listing worktrees/ and reading review-schema.json.
- Subagents were spawned with model "sonnet" instead of "opus". Add explicit
  instruction to use opus.
- After agents returned, Claude spent 9 minutes re-verifying findings with
  bash/grep/sed commands, duplicating the agents' work. Add instruction to
  trust subagent findings and only read pr-context.json in phase 2.
- Subagents returned markdown-wrapped JSON instead of raw JSON arrays. Add
  instruction requiring raw JSON output only.
- Each subagent was independently reading review-schema.json. Instead have
  the main agent read it once and paste it into each subagent prompt.
- The "drop low-confidence findings" instruction was being used to justify
  dropping findings that Claude itself acknowledged as valid ("solid cleanup
  suggestions", "reasonable consistency improvement"). Remove the instruction.
- Simplify the deduplication instructions.
- Stop adding the severity to the body in the post-processing job, as Claude
  is also adding it, so they end up duplicated.

resolved: skip cache flush on server switch/re-probe when StaleRetentionSec is set

manager_set_dns_server() and dns_server_flush_cache() call dns_cache_flush()
unconditionally, wiping the entire cache even when StaleRetentionSec is
configured. This defeats serve-stale by discarding cached records that should
remain available during server switches and feature-level re-probes.

The original serve-stale commit (5ed91481ab) added a stale_retention_usec
guard to link_set_dns_server(), and a later commit (7928c0e0a1) added the
same guard to dns_delegate_set_dns_server(), but these two call sites in
resolved-dns-server.c were missed.

This is particularly visible with DNSOverTLS, where TLS handshake failures
trigger frequent feature-level downgrades and re-probes via
dns_server_flush_cache(), flushing the cache each time.

Add the same stale_retention_usec guard to both call sites so that cache
entries are allowed to expire naturally via dns_cache_prune() when
serve-stale is enabled.
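
A minimal sketch of such a guard, using simplified stand-in types rather than the real resolved structures. The names `Manager`, `Cache` and `flush_cache_on_server_switch()` are hypothetical, and the sketch assumes that a stale_retention_usec of 0 means serve-stale is disabled:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the real structures (hypothetical). */
typedef struct Cache {
        unsigned n_entries;
} Cache;

typedef struct Manager {
        uint64_t stale_retention_usec; /* assumption: 0 means serve-stale is off */
        Cache cache;
} Manager;

void cache_flush(Cache *c) {
        assert(c);
        c->n_entries = 0;
}

/* Flush only when serve-stale is disabled. Returns true if the cache was
 * flushed, false if the entries were kept so they can expire naturally. */
bool flush_cache_on_server_switch(Manager *m) {
        assert(m);

        if (m->stale_retention_usec != 0)
                return false;

        cache_flush(&m->cache);
        return true;
}
```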

Fixes: #40781
This commit was prepared with assistance from an AI coding agent (GitHub
Copilot). All changes have been reviewed for correctness and adherence to the
systemd coding style.

report: add cgroup metrics in a separate varlink service

Add CpuUsage, MemoryUsage, IOReadBytes, IOReadOperations, and
TasksCurrent in a standalone socket-activated varlink service.

The new systemd-report-cgroup service listens at
/run/systemd/report/io.systemd.CGroup and exposes:
  - io.systemd.CGroup.CpuUsage
  - io.systemd.CGroup.IOReadBytes
  - io.systemd.CGroup.IOReadOperations
  - io.systemd.CGroup.MemoryUsage (with type=current/available/peak)
  - io.systemd.CGroup.TasksCurrent

cgroup-util: add cg_get_keyed_attribute_uint64() helper

Multiple callers of cg_get_keyed_attribute() follow the same pattern of
reading a single keyed attribute and then parsing it as uint64_t with
safe_atou64(). Add a helper that combines both steps.

Convert all existing single-key + uint64 call sites in cgtop, cgroup.c,
and oomd-util.c to use the new helper.
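
To illustrate what combining both steps means, here is a self-contained sketch that works on the contents of a keyed cgroup file already held in memory. It is not the real helper: the name `keyed_attribute_uint64()`, its signature and the chosen error codes are assumptions, and it uses strtoull() in place of safe_atou64():

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: find the line "key value" in the contents of a keyed
 * cgroup file (e.g. cpu.stat) and parse the value as uint64_t. */
int keyed_attribute_uint64(const char *contents, const char *key, uint64_t *ret) {
        assert(contents);
        assert(key);
        assert(ret);

        size_t key_len = strlen(key);

        for (const char *line = contents; *line; ) {
                const char *eol = strchr(line, '\n');
                size_t line_len = eol ? (size_t) (eol - line) : strlen(line);

                if (line_len > key_len + 1 &&
                    strncmp(line, key, key_len) == 0 &&
                    line[key_len] == ' ') {
                        const char *value = line + key_len + 1;
                        char *end;

                        /* Reject signs and whitespace that strtoull() would accept. */
                        if (*value < '0' || *value > '9')
                                return -EINVAL;

                        errno = 0;
                        unsigned long long v = strtoull(value, &end, 10);
                        if (errno != 0)
                                return -errno;
                        if (end != line + line_len) /* trailing garbage */
                                return -EINVAL;

                        *ret = v;
                        return 0;
                }

                line = eol ? eol + 1 : line + line_len;
        }

        return -ENXIO; /* key not found */
}
```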

sd-varlink: fix a potential connection count leak

With the old version there was a potential connection count leak if
either of the two hashmap operations in count_connection() failed. In
that case we'd return from sd_varlink_server_add_connection_pair()
_before_ attaching the sd_varlink_server object to an sd_varlink object,
and since varlink_detach_server() is the only place where the connection
counter is decremented (called through sd_varlink_close() in various
error paths later _if_ the "server" object is not null, i.e. attached to
the sd_varlink object) we'd "leak" a connection every time this
happened. However, the potential of abusing this is very theoretical,
as one would need to hit OOM every time either of the hashmap operations
was executed for a while before exhausting the connection limit.

Let's just increment the connection counter after any potential error
path, so we don't have to deal with potential rollbacks.
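
The ordering can be sketched as follows, with simplified stand-in types. `Server`, `Connection`, `add_connection()` and the `simulate_failure` flag are hypothetical and only model the idea that the counter is touched once nothing can fail anymore:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the server and connection objects (hypothetical). */
typedef struct Server {
        unsigned n_connections;
} Server;

typedef struct Connection {
        Server *server;
} Connection;

int add_connection(Server *s, Connection *c, bool simulate_failure) {
        assert(s);
        assert(c);

        /* Fallible setup comes first (stands in for the hashmap operations). */
        if (simulate_failure)
                return -ENOMEM;

        /* Nothing can fail past this point, so attaching and counting happen
         * together and no rollback is ever needed. */
        c->server = s;
        s->n_connections++;
        return 0;
}

void connection_close(Connection *c) {
        assert(c);

        /* The counter is only decremented for attached connections. */
        if (!c->server)
                return;

        c->server->n_connections--;
        c->server = NULL;
}
```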

udev: fix bounds check in dev_if_packed_info()

The check compared bLength against (size - sizeof(descriptor)), which
is an absolute limit unrelated to the current buffer position. Since
bLength is uint8_t (max 255), this can never exceed size - 9 for any
realistic input, making the check dead code.

Use (size - pos) instead so the check actually catches descriptors
that extend past the end of the read data.
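
A small sketch of a position-relative check while walking length-prefixed descriptors. This is not the udev code: `count_descriptors()` is a hypothetical function, and the only assumption carried over is that the first byte of each descriptor holds its total length, as bLength does:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Walk a buffer of length-prefixed descriptors. Returns the number of
 * complete descriptors, or -EINVAL if one is malformed or truncated. */
int count_descriptors(const uint8_t *buf, size_t size) {
        assert(buf || size == 0);

        int n = 0;

        for (size_t pos = 0; pos < size; ) {
                uint8_t length = buf[pos];

                /* A descriptor needs at least its length and type bytes. */
                if (length < 2)
                        return -EINVAL;

                /* Compare against the bytes remaining from the current
                 * position, not against an absolute limit. */
                if (length > size - pos)
                        return -EINVAL;

                pos += length;
                n++;
        }

        return n;
}
```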

Fixes: https://github.com/systemd/systemd/issues/41570

docs: Fix window in PRESSURE.md

docs: Update MEMORY_PRESSURE.md => PRESSURE.md

Make the doc more generic and mention all pressure types, not just
memory.

core: Add I/O pressure support

core: Add support for CPU pressure notifications

Works the same way as memory pressure notifications. Code is refactored
to work on enum arrays to reduce duplication.

test-mempress: Support unprivileged operation

test-mempress: Migrate to new assertion macros

compress: consolidate all compression into compress.c with dlopen

Move the push-based streaming compression API from import-compress.c
into compress.c and delete import-compress.c/h. This consolidates all
compression code in one place and makes all compression libraries
(liblzma, liblz4, libzstd, libz, libbz2) runtime-loaded via dlopen
instead of directly linked.

Introduce opaque Compressor/Decompressor types backed by a heap-
allocated struct defined only in compress.c, keeping all third-party
library headers out of compress.h.

Rewrite the per-codec fd-to-fd stream functions as thin wrappers around
the push API via generic compress_stream()/decompress_stream() taking a
Compression type parameter. Integrate LZ4 into this framework using the
LZ4 Frame API, eliminating all LZ4 special-casing.

Extend the Compression enum with COMPRESSION_GZIP and COMPRESSION_BZIP2
and add the corresponding blob, startswith, and stream functions for
both.

Rename the ImportCompress types and functions: ImportCompressType becomes
the existing Compression enum, ImportCompress becomes Compressor (with
Decompressor typedef), and all import_compress_*/import_uncompress_*
become compressor_*/decompressor_*. Rename dlopen_lzma() to dlopen_xz()
for consistency. Make compression_to_string() return lowercase by
default.

Add INT_MAX/UINT_MAX overflow checks for LZ4, zlib, and bzip2 blob
functions where the codec API uses narrower integer types than our
uint64_t parameters.
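
The kind of check meant here can be sketched like this. The helper `size_to_int()` and the -E2BIG error code are illustrative assumptions, not taken from compress.c:

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper: narrow a 64-bit size to the int a codec API takes,
 * refusing values that would be truncated by the cast. */
int size_to_int(uint64_t size, int *ret) {
        assert(ret);

        if (size > (uint64_t) INT_MAX)
                return -E2BIG;

        *ret = (int) size;
        return 0;
}
```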

Migrate test-compress.c and test-compress-benchmark.c to the TEST()
macro framework, new assertion macros, and codec-generic loops instead
of per-codec duplication.

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

portablectl: fix swapped arguments for setns()

Follow-up for 824fcb95c9e66abe6b350ebab6e0593498ff7aa1.

added root.conf to meson.build

Revert "mkosi: Mark minimal images as Incremental=relaxed"

The setting has fundamental flaws that can't be easily fixed
(see https://github.com/systemd/mkosi/pull/4273) so revert its
use as we're dropping it in systemd. Image builds will take a bit
longer again until I figure out a proper fix for this.

This reverts commit 7a70c323681b091328fcf6c9ca3104c7958a1331.

vconsole-setup: skip setfont(8) when the console driver lacks font support

Don't run setfont(8) on consoles that don't support fonts.
systemd-vconsole-setup neither fails nor reports errors on such consoles,
unlike setfont(8), which emits the following error [1]:

systemd-vconsole-setup[169]: setfont: ERROR kdfontop.c:183 put_font_kdfontop: Unable to load such font with such kernel version

The check already existed in setup_remaining_vcs() but it was performed too
late.

[1] this was simply ignored by setfont(8) until
https://github.com/legionus/kbd/commit/1e15af4d8b272ca50e9ee1d0c584c5859102c848

varlink: add sd_varlink_reply_and_upgrade and varlinkctl serve (#41474)

sd-varlink: use MSG_PEEK for protocol_upgrade connections

When there is a potential protocol upgrade we need to be careful that
we do not read beyond our json message as the custom protocol may be
anything. This was achieved via a byte-by-byte read, which is of course
very inefficient. So this commit switches to MSG_PEEK to find the
boundary of the json message instead. This makes the performance hit
a lot smaller.
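
The general technique can be sketched as follows, relying on varlink messages being terminated by a NUL byte. This is not the sd-varlink implementation: `read_one_message()`, its return convention and the error codes are assumptions made for the example:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Read one NUL-terminated message from a stream socket without consuming
 * any bytes that follow the terminator. Returns the message length
 * (terminator included), 0 if no complete message is buffered yet, or a
 * negative errno-style value on failure. */
ssize_t read_one_message(int fd, char *buf, size_t size) {
        assert(fd >= 0);
        assert(buf);
        assert(size > 0);

        /* Look at the buffered bytes without removing them from the socket. */
        ssize_t n = recv(fd, buf, size, MSG_PEEK);
        if (n < 0)
                return -errno;
        if (n == 0)
                return -ECONNRESET; /* peer closed the connection */

        const char *nul = memchr(buf, 0, (size_t) n);
        if (!nul)
                /* Either the message is still incomplete or it does not fit. */
                return (size_t) n == size ? -EMSGSIZE : 0;

        /* Now consume exactly the message, terminator included. */
        size_t len = (size_t) (nul - buf) + 1;
        n = recv(fd, buf, len, 0);
        if (n < 0)
                return -errno;
        if ((size_t) n != len)
                return -EIO;

        return n;
}
```

Whatever the peer sent after the terminator stays queued on the socket, so it can be handed over untouched to the code that speaks the upgraded protocol.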

Thanks to Lennart for suggesting this.

varlink: use single byte reads on SD_VARLINK_SERVER_UPGRADABLE

When the server side of a varlink connection supports connection
upgrades we need to go into single-byte read mode. Otherwise, if a
client sends the json protocol upgrade request immediately followed by
the custom protocol payload, we might read part of that payload along
with the json message. This commit implements this.

The next step is using MSG_PEEK to avoid the single-byte overhead.

libsystemd,varlink: always return two fds in varlink upgrade API

This commit tweaks the API of sd_varlink_call_and_upgrade and
sd_varlink_reply_and_upgrade to return two independent fds even
if the internal {input,output}_fd are the same (e.g. a socket).

This makes the external API easier as there is no longer the risk
of double close. The sd_varlink_call_and_upgrade() is not in a
released version of systemd yet so I presume it is okay to update
it still.

This also allowed some simplifications in varlinkctl.c now that
the handling is easier.

varlinkctl: add new `serve` verb to allow wrapping command in varlink

With the new protocol upgrade support in varlinkctl client we can
now do the equivalent for the server side. This commit adds a new
`serve` verb that will serve any command that speaks stdin/stdout
via varlink and its protocol upgrade feature. This is the
"inetd for varlink".

This is useful for various reasons:
1. Allows us to e.g. provide a heavily sandboxed io.myorg.xz.Decompress
   varlink endpoint (cf. xz CVE-2024-3094).
2. Allows sftp over varlink, which is quite useful with the
   varlink-http-bridge (that has a more flexible auth mechanism than
   plain sftp).
3. Makes testing the varlinkctl client protocol upgrade simpler.
4. Because we can.