test-qmp-client: run mock QMP servers as fibers on the shared event loop

The mock servers used to be driven out-of-band: each test created a
socketpair, forked a child, ran a hand-coded request/response script
against the raw fd, and sent SIGTERM to tear it down. That worked but
required pidref/process-util/signal plumbing in every test, two
distinct execution contexts that couldn't share state, and a JsonStream
attached to the mock side that pretended to be event-loop-driven while
actually being driven manually via blocking reads.

Now that JsonStream suspends when on a fiber, the mocks can live
inside the same process and event loop as the client. Each mock is
rewritten as an sd-fiber that runs alongside the client fiber: the
mock fiber yields on I/O and the event loop schedules the client in the
meantime. Both sides progress cooperatively, with no fork/SIGTERM/PID
tracking and no manual phase tracking.

Two cleanups fall out of the rewrite:

- A QMP_TEST(name, mock_fn) { ... } macro encapsulates the per-test
  scaffolding (event loop, socketpair, mock fiber spawn, exit-on-idle
  shim) and injects an already-connected QmpClient *client into the
  test body. Each test now reads as a flat sequence of
  qmp_client_call() invocations against that client.

- Repeated mock command/reply scripting is factored into
  mock_qmp_expect(), mock_qmp_reply(), mock_qmp_expect_and_reply(),
  mock_qmp_handshake(), and mock_qmp_query_status_running(). The
  greeting JSON is built with sd_json_buildo() instead of being parsed
  from a literal.

The file shrinks from 756 to 494 lines, mostly through deletions.

qmp-client: add fiber-aware call paths

The synchronous qmp_client_call() pumps the event loop until its reply
arrives, pinning the parsed reply on c->current so it can hand out
borrowed pointers to the caller. That model only fits one in-flight
sync call: a second qmp_client_call() on the same client clears
c->current before issuing its own send, invalidating the first
caller's borrowed pointers. On a single-threaded event loop that was
fine, but with fibers two concurrent calls on the same client can
interleave through the pump (json_stream_wait() suspends the running
fiber) and trample each other.

To fix this, make qmp_client_call() detect when it's running on a fiber
whose event loop matches the client and transparently delegate to
qmp_client_call_suspend(), which makes use of a new QmpFuture to allow
multiple concurrent calls to qmp_client_call().

To make this work concurrently, we also change qmp_client_call() to
hand out references and copies of errors so that we don't have to store
the borrowed pointers we hand out in the QmpClient struct.

sd-varlink: make sd-varlink fiber-aware

Add varlink_server_bind_fiber() and varlink_server_bind_fiber_many()
in varlink-util.{c,h} for registering a method handler that should
run on a dedicated fiber per dispatch. The fiber-bound methods live
in a separate s->fiber_methods map alongside the regular s->methods;
bind_internal()/bind_many_internal() are factored out so the regular
and fiber bind variants share their parsing/insertion code.
Registering the same method in both maps is rejected because the
dispatcher consults the regular map first and would otherwise
silently shadow the fiber binding.

varlink_dispatch_fiber() builds a VarlinkFiberData (refs to the
connection, parameters, and method name), spawns a fiber via
sd_fiber_new(), and makes the future floating so the fiber
self-manages its lifetime — neither the dispatcher nor the
connection has to track it. The fiber's priority is set to one
below the connection's quit event source so that on graceful
shutdown the fiber's exit handler fires (and runs its cleanup)
before varlink's quit_callback() closes the connection underneath
it; this is what lets a fiber-bound handler reply or flush its
sentinel on a still-open connection during shutdown.

The connection state transitions are reordered so they happen before
the fiber spawn rather than after the synchronous callback returns:
the fiber runs after dispatch has already moved past PROCESSING, which
matches the behaviour expected for a deferred reply (the fiber may
either reply immediately, or stash the connection and reply later, in
which case the post-callback logic treats it as a PENDING_METHOD).

Note that all the synchronous varlink APIs (sd_varlink_call() and friends)
already behave properly when on a fiber because they call json_stream_wait()
which calls ppoll_usec() which we already fixed to suspend when called from
a fiber.

The client/server varlink tests are migrated to fibers (threads → mock
server fibers on the same event loop) to exercise the new paths.

sd-bus: make sd-bus fiber-aware

Two changes to teach sd-bus how to behave when called from a fiber, in
order of increasing depth:

1. sd_bus_call() now redirects to a new bus_call_suspend() helper when
   the caller is a fiber whose event loop is the same one the bus is
   attached to. The plain bus_poll() path serializes all bus traffic on
   the slot's reply (only one method call can be in flight per
   sd_bus*), which would defeat the point of running multiple fibers
   against one bus. bus_call_suspend() builds on the async sd-bus API:
   it wraps the call in a new BusFuture (sd-bus/bus-future.{c,h}) that
   resolves when the reply or method-error arrives, lets the fiber
   await that future, and surfaces the reply to the caller via
   future_get_bus_reply(). Because the futures live on the event loop
   rather than a per-bus slot, multiple fibers can drive concurrent
   method calls against the same bus.

2. A new private SD_BUS_VTABLE_METHOD_FIBER flag dispatches a vtable
   method handler on its own fiber, so handlers are free to use
   sd_bus_call() against the same bus, sd_fiber_sleep(), loop_read(),
   etc. without stalling the event loop for other connections or
   handlers. The flag stays out of sd-bus-vtable.h (its bit value is
   reserved there to prevent collisions) — the fiber runtime is a
   systemd-internal implementation detail.

Lifecycle of fiber-dispatched handlers is tracked on the bus itself: a
new bus->fiber_futures set holds a ref to each in-flight handler.
bus_enter_closing() cancels every entry and process_closing() returns
with the bus still in CLOSING state until the set drains, so we can be
sure no fiber handler outlives the bus. bus_fiber_resolved() removes
the entry on completion. bus_free()'s assert(set_isempty()) makes the
invariant load-bearing.

Note that plain sd_bus_call() already works correctly on a fiber as it
calls ppoll_usec() which has already been modified to suspend when
running on a fiber.

To exercise these changes the existing thread-based client/server
sd-bus tests (test-bus-chat, test-bus-objects, test-bus-peersockaddr,
test-bus-server, test-bus-watch-bind) are migrated to fibers, and a
new test-bus-fiber is added that covers SD_BUS_VTABLE_METHOD_FIBER —
including handlers that issue nested sd_bus_call() on the same bus, the
cancel-on-close path, and concurrent dispatches across multiple fibers.

sd-event: suspend instead of blocking when sd_event_run() runs on a fiber

sd_event_run() blocks the calling thread on the event loop's epoll fd
until something happens. When the caller is a fiber, that's the wrong
behaviour: blocking the thread also stalls every other fiber and the
outer event loop driving them. The most common way to hit this is a
fiber that creates its own inner event loop (e.g. a server-style fiber
that wants to dispatch its own sources independently of whatever loop
the test or supervising fiber is running on) — with the existing
implementation the inner sd_event_run() would hold the thread while the
outer scheduler should be free to advance other fibers.

Add an event_run_suspend() variant in sd-event/event-future.c that
performs the same prepare/wait/dispatch dance, but when the fast path
finds nothing ready it (a) creates an IO future watching the inner
event loop's epoll fd on the *outer* event loop, (b) optionally creates
a time future for the timeout, and (c) suspends the fiber. When either
future fires the fiber is resumed and the prepare/wait/dispatch sequence
runs once more to actually dispatch what's pending. sd_event_run()
checks sd_fiber_is_running() and delegates to this variant when on a
fiber; profile_delays accounting is intentionally skipped on that path
since the underlying prepare/wait/dispatch primitives already account
for themselves.
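
Step (a) relies on an epoll fd itself being pollable: it becomes readable
as soon as one of the sources registered on it is ready. A minimal sketch
of just that kernel behaviour, using plain epoll rather than sd-event. The
toy_* names are hypothetical and not part of this commit:

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register the inner loop's epoll fd as a readable source on the outer
 * loop's epoll fd. Returns 0 on success, -1 on error (errno set). */
static int toy_watch_inner(int outer_epfd, int inner_epfd) {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = inner_epfd };

        assert(outer_epfd >= 0 && inner_epfd >= 0);
        return epoll_ctl(outer_epfd, EPOLL_CTL_ADD, inner_epfd, &ev);
}

/* Non-blocking check from the outer loop's point of view: returns 1 if
 * the inner loop has something to dispatch, 0 if not, -1 on error. */
static int toy_inner_ready(int outer_epfd, int inner_epfd) {
        struct epoll_event ev;
        int n;

        n = epoll_wait(outer_epfd, &ev, 1, 0); /* timeout 0: never blocks */
        if (n < 0)
                return -1;

        return n == 1 && ev.data.fd == inner_epfd;
}
```

In the real code the outer side is an IO future on the outer sd-event loop
rather than a direct epoll_wait() call, and the fiber is suspended while
it waits.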

sd-future: make src/basic blocking helpers fiber-aware

Some helpers in src/basic — ppoll_usec_full() (used by fd_wait_for_event()),
loop_read(), loop_read_exact(), loop_write_full() and
pidref_wait_for_terminate_full() — block the calling thread. That's the
right behaviour outside a fiber but not inside one, where blocking the
thread also stalls every other fiber running on the same event loop.
Rewriting every caller to pick a fiber or non-fiber variant explicitly
would be a lot of churn and would split otherwise-shared code paths in
two.

Instead, the helpers detect at runtime whether they're running on a fiber
and dispatch to a suspending variant when they are. FiberOps in
fiber-ops.h holds five function pointers (ppoll, read, write, timeout,
cancel_wait_unref); a fiber_ops global constant is populated whenever we
enter a fiber with functions that delegate to suspending variants of common
syscalls. With this approach, the variants themselves stay in libsystemd
which is required because they make use of sd-event.

- loop_read()/loop_read_exact() take the fiber read hook on a fiber
  unless the caller asked for a non-blocking attempt (do_poll=false) and
  the fd is already non-blocking — in that case we fall through to read()
  to preserve the existing return-EAGAIN-immediately semantic. The hook
  itself suspends on EAGAIN until data is available, so neither the
  do_poll knob nor the explicit fd_wait_for_event() retry loop is
  needed on the fiber path.

- loop_write_full() likewise takes the fiber write hook on a fiber,
  except when timeout=0 with an already-non-blocking fd (preserving the
  fast-return-EAGAIN semantic). The fiber path runs inside a
  FIBER_OPS_TIMEOUT() scope so the caller's timeout is honoured via a
  deadline future, mirroring SD_FIBER_TIMEOUT() but reachable from
  src/basic without pulling in sd-future.h.

- pidref_wait_for_terminate_full() polls the pidfd via fd_wait_for_event()
  before each waitid() when either a finite timeout is set or we're on a
  fiber, and requires pidref->fd >= 0 in those cases (returning
  -ENOMEDIUM otherwise — extending the rule that already applied to
  finite timeouts). The poll suspends the fiber via the ppoll hook above;
  the subsequent waitid() doesn't block because the pidfd is already
  signalled.
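
As a rough illustration of the runtime dispatch described above, here is a
cut-down sketch. ToyFiberOps, toy_fiber_ops and the other toy_* names are
made up for this example (the real FiberOps has five hooks), and the
"suspending" hook merely counts invocations instead of suspending anything:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical, cut-down hook table with a single entry. */
typedef struct ToyFiberOps {
        ssize_t (*read)(int fd, void *buf, size_t count);
} ToyFiberOps;

/* NULL outside a fiber; points at a hook table while a fiber runs. */
static const ToyFiberOps *toy_fiber_ops = NULL;

/* Shared helper: callers never pick a fiber or non-fiber variant. */
static ssize_t toy_loop_read(int fd, void *buf, size_t count) {
        assert(buf);

        if (toy_fiber_ops && toy_fiber_ops->read)
                return toy_fiber_ops->read(fd, buf, count); /* fiber path */

        return read(fd, buf, count); /* plain blocking path */
}

/* Stand-in for a suspending read: it only records that it was taken. */
static int toy_hook_calls = 0;

static ssize_t toy_suspending_read(int fd, void *buf, size_t count) {
        toy_hook_calls++;
        return read(fd, buf, count);
}

static const ToyFiberOps toy_ops = { .read = toy_suspending_read };

/* Called by the (imaginary) fiber runtime around each fiber slice. */
static void toy_enter_fiber(void) { toy_fiber_ops = &toy_ops; }
static void toy_leave_fiber(void) { toy_fiber_ops = NULL; }
```

The point of the indirection is that the helper in the lower layer only
needs the table of pointers, while the hook implementations can live in a
higher layer that is allowed to use the event loop.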

sd-future: add fiber-aware non-blocking I/O wrappers

Add a family of sd_fiber_*() I/O wrappers that, when called from a
fiber, behave like blocking I/O from the caller's perspective but
yield to the event loop instead of blocking the thread:

  sd_fiber_read / sd_fiber_write
  sd_fiber_readv / sd_fiber_writev
  sd_fiber_recv / sd_fiber_send
  sd_fiber_connect
  sd_fiber_recvmsg / sd_fiber_sendmsg
  sd_fiber_recvfrom / sd_fiber_sendto
  sd_fiber_accept
  sd_fiber_ppoll

Most of them share a single helper, fiber_io_operation(), which when
invoked outside a fiber falls through to the underlying syscall
directly, preserving the regular blocking behaviour. Inside a fiber
the helper flips the fd to non-blocking (restoring its original mode
on return), tries the syscall once on the fast path, and on EAGAIN/
EWOULDBLOCK creates an sd-event-backed IO future via future_new_io(),
suspends the fiber, and retries the syscall once the event source
fires.
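
The fast-path/retry shape can be sketched roughly as follows. This is not
fiber_io_operation() itself: toy_read_retry() is a hypothetical helper, and
a blocking poll() stands in for the point where a fiber would create an IO
future and suspend:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Returns the number of bytes read, or a negative errno-style value. */
static ssize_t toy_read_retry(int fd, void *buf, size_t count) {
        int flags, saved_errno;
        ssize_t n;

        assert(buf);

        /* Flip the fd to non-blocking, remembering its original flags. */
        flags = fcntl(fd, F_GETFL);
        if (flags < 0)
                return -errno;
        if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
                return -errno;

        for (;;) {
                /* Fast path: try the syscall once. */
                n = read(fd, buf, count);
                if (n >= 0 || (errno != EAGAIN && errno != EWOULDBLOCK))
                        break;

                /* Nothing ready. A fiber would suspend here; this sketch
                 * simply waits in poll() and then retries. */
                struct pollfd p = { .fd = fd, .events = POLLIN };
                if (poll(&p, 1, -1) < 0 && errno != EINTR)
                        break;
        }

        saved_errno = errno;
        (void) fcntl(fd, F_SETFL, flags); /* restore the original mode */

        return n < 0 ? -saved_errno : n;
}
```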

future_new_io() itself is added to sd-event/event-future.{c,h} as a
new IoFuture kind. It wraps sd_event_add_io() into an sd_future:
oneshot enable, EPOLLERR translated via SO_ERROR (suppressed for
non-sockets), and the fd duplicated with F_DUPFD_CLOEXEC to avoid
EEXIST when multiple sources watch the same descriptor.

Together these let fiber-using code write straight-line socket and
pipe I/O without bundling state into callbacks.

Introduce support for running code in fibers

Traditionally, asynchronous programming in systemd has been achieved using
sd-event along with the asynchronous interfaces of sd-bus and sd-varlink.
This works well when the system is reacting to events and all code triggered
by those events can run without blocking. In these scenarios, the global
Manager object is passed as userdata to the callback, and the callback can
use the stack as usual, declaring local state and ensuring proper cleanup via
_cleanup_. Control flow structures, such as loops, work as expected, and
everything runs smoothly.

However, challenges arise when the code needs to perform long-running
operations within these callbacks. Since the system cannot block execution
within the callback, we can't directly invoke a long-running operation and
wait for its result without introducing complexities. Instead, we need to
initiate the long-running task, register for completion with sd-event,
sd-bus, or sd-varlink, and provide a callback to be invoked when the
operation completes.

This callback, however, only receives a single userdata pointer, which
forces us to bundle all local variables into a struct and pass it along as
part of the callback. On top of that, after queuing the asynchronous
operation, the caller continues executing. As the caller's stack unwinds
when the function exits, the resources and state within the local scope may
be prematurely cleaned up. Therefore, the struct must store copies of the
local variables or ensure proper reference counting to prevent premature
resource cleanup.

When multiple long-running operations need to be initiated within a loop,
the complexity grows further. We must introduce additional shared state to
track the completion of all operations before we can run any code that
depends on their results.

Furthermore, since the daemon may be shut down at any time, we must track
the lifecycle of each long-running operation in the global Manager struct,
ensuring proper cleanup even when stack unwinding can no longer manage the
resources for us.

Fibers, or green threads, provide a more natural way of handling
asynchronous operations. By enabling cooperative multitasking within a
single thread, fibers allow us to write code that looks like it’s running
synchronously, but with the ability to yield control at predefined points,
such as when waiting for long-running tasks to complete.

With fibers, we can simplify the control flow by running asynchronous
operations within a fiber, allowing us to "pause" execution while waiting
for the long-running operation to finish and then "resume" the operation once
it's complete. This eliminates the need for multiple callback chains,
extensive state tracking, and the potential pitfalls of stack unwinding.

This commit introduces the ability to execute long-running operations in a
non-blocking manner while maintaining the simplicity and readability of
synchronous code. The fiber-based approach will significantly improve the
handling of complex workflows, making the code easier to write and maintain.

The implementation is based on ucontext.h's makecontext() (with a fallback
to the venerable sigaltstack() approach on musl), sigsetjmp()/siglongjmp()
and sd-event. ucontext.h provides us with alternate stacks that we can switch
between. We use sigsetjmp()/siglongjmp() instead of swapcontext() because the
latter forcibly saves/restores a per context signal mask every time it is called.
Using sigsetjmp()/siglongjmp(), we can avoid the unnecessary syscall and maintain
a per thread signal mask, which makes much more sense than having a per fiber
signal mask.
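
For readers unfamiliar with ucontext.h, the following sketch shows the basic
stack-switching idea. It is deliberately simpler than what is described
above: it uses swapcontext() for every switch (so it pays the signal mask
save/restore that the real implementation avoids with
sigsetjmp()/siglongjmp()), allocates a small stack with malloc() instead of
mmap(), and involves no event loop. All toy_* names are hypothetical:

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t toy_main_ctx, toy_fiber_ctx;
static int toy_trace[4];
static int toy_trace_len = 0;

/* Runs on the fiber's own stack. */
static void toy_fiber_entry(void) {
        toy_trace[toy_trace_len++] = 1;
        swapcontext(&toy_fiber_ctx, &toy_main_ctx); /* "yield" to main */
        toy_trace[toy_trace_len++] = 3;
        /* Returning switches to uc_link, i.e. back to toy_main_ctx. */
}

/* Returns 0 on success, -1 on failure. */
static int toy_run(void) {
        size_t stack_size = 64 * 1024;
        void *stack;

        stack = malloc(stack_size);
        if (!stack)
                return -1;

        if (getcontext(&toy_fiber_ctx) < 0) {
                free(stack);
                return -1;
        }

        toy_fiber_ctx.uc_stack.ss_sp = stack;
        toy_fiber_ctx.uc_stack.ss_size = stack_size;
        toy_fiber_ctx.uc_link = &toy_main_ctx;
        makecontext(&toy_fiber_ctx, toy_fiber_entry, 0);

        swapcontext(&toy_main_ctx, &toy_fiber_ctx); /* run until the yield */
        toy_trace[toy_trace_len++] = 2;
        swapcontext(&toy_main_ctx, &toy_fiber_ctx); /* resume until it returns */
        toy_trace[toy_trace_len++] = 4;

        free(stack);
        return 0;
}
```

After toy_run() the trace reads 1, 2, 3, 4: control alternates between the
two stacks, and the fiber's local state survives across the yield.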

The default stack size is the same as that of a regular thread (8MB).
Because we use mmap() to allocate the stack, the memory isn't actually
used until it is paged in by the kernel, so a fiber doesn't really consume
8MB.

To integrate fibers with the event loop, each fiber is assigned a deferred
event source which resumes the fiber when enabled. The deferred event source
is oneshot by default so the fiber will run immediately until it yields or
suspends. If it yields, the deferred event source is enabled again (oneshot)
immediately. If it suspends, before it suspends, one or more event sources
are registered with sd-event that will enable the deferred event source
(oneshot) to resume the fiber once the operation it is waiting for completes.

Yielding or suspending the fiber is done by calling sd_fiber_yield() or
sd_fiber_suspend() respectively. Both of these return zero on success or any
error value from the async operation that caused the fiber to resume.

This is also how fiber cancellation is implemented. When a fiber is cancelled,
sd_fiber_yield() and sd_fiber_suspend() will return ECANCELED when the fiber
is resumed, allowing the fiber to unwind its stack (which allows cleanup to
happen automatically) and finish.

Instead of having applications work directly with fibers, we hide them behind
a generic futures interface to represent long-running operations, regardless of
whether those operations are running on a fiber or not. Aside from fibers, the
futures library (sd-future) will for example allow waiting for sd-event sources
and doing sd-bus calls in the background as well. Fibers can suspend until a
future is ready with sd_fiber_await() or by having the future wake up the fiber
explicitly in its callback. A future always defaults to waking up the current
fiber.

Each future kind plugs into the library by providing an sd_future_ops vtable
(alloc, free, cancel, set_priority). The library treats the impl pointer
returned by alloc() as a black box. Future implementations retrieve it via
sd_future_get_private().

A future starts in SD_FUTURE_PENDING and transitions exactly once to
SD_FUTURE_RESOLVED, carrying an integer result. Consumers can react to that
transition either by installing a one-shot callback with
sd_future_set_callback() (callback-style code) or by waiting on it from a
fiber via sd_fiber_await() (synchronous-looking fiber code). sd_fiber_await()
is itself built on a "wait future" that resolves when its target resolves;
sd_future_new_wait() exposes the same primitive directly so non-fiber callers
can chain futures without involving a fiber.

Cancellation is cooperative: sd_future_cancel() invokes the future impl's
cancel callback, which is responsible for tearing down its work and ultimately
resolving the promise with -ECANCELED. For fiber futures this is what
surfaces as the ECANCELED return from sd_fiber_yield()/sd_fiber_suspend()
mentioned above.
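
The resolve-exactly-once rule and the one-shot callback can be illustrated
with a toy model. None of this is the sd-future API: ToyFuture,
toy_future_resolve() and the choice of -EALREADY for a second resolution
are assumptions made purely for the sketch:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef enum ToyFutureState {
        TOY_FUTURE_PENDING,
        TOY_FUTURE_RESOLVED,
} ToyFutureState;

typedef struct ToyFuture ToyFuture;
typedef void (*toy_future_callback_t)(ToyFuture *f, void *userdata);

struct ToyFuture {
        ToyFutureState state;
        int result;                     /* valid once resolved */
        toy_future_callback_t callback; /* one-shot, may be NULL */
        void *userdata;
};

/* Resolves the future with the given result. Returns 0 on the first call
 * and -EALREADY (hypothetical choice) on any later one. */
static int toy_future_resolve(ToyFuture *f, int result) {
        assert(f);

        if (f->state != TOY_FUTURE_PENDING)
                return -EALREADY;

        f->state = TOY_FUTURE_RESOLVED;
        f->result = result;

        if (f->callback) {
                toy_future_callback_t cb = f->callback;

                f->callback = NULL; /* one-shot: clear before invoking */
                cb(f, f->userdata);
        }

        return 0;
}

/* Example callback: counts how often it fired. */
static void toy_count(ToyFuture *f, void *userdata) {
        (void) f;
        (*(int *) userdata)++;
}
```

In this model a cancel callback would end by calling
toy_future_resolve(f, -ECANCELED), after which a later attempt to resolve
the same future with another result is refused.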

Fire-and-forget fibers — created by passing a NULL ret to sd_fiber_new() —
take a self-reference on their future so they outlive the caller's scope.
The self-ref is dropped when the fiber resolves. This floating mechanism
(sd_fiber_set_floating()) is restricted to fiber futures because they
uniquely guarantee resolution; allowing it for arbitrary future kinds would
risk silent leaks for kinds that may never resolve.

Note that fiber cleanup depends on the runtime operating normally. Each
fiber's _cleanup_-style cleanups live on the fiber's own stack and run
only when the fiber is resumed and allowed to unwind, which requires a
working event loop to drive it to completion. The exit event source
registered for top-level fibers ensures unwind on a normal sd_event_exit(),
but if the event loop itself terminates abnormally (e.g. an unrecoverable
allocation failure mid-dispatch) before all fibers have resolved, their
stacks never unwind and any resources they own leak.

The code lives in libsystemd as sd-future (not exported) for the following reasons:
- We may want to make this a public libsystemd API in the future
- The code can't live in src/basic as it makes heavy use of sd-event
- The code can't live in src/shared as sd-bus and sd-event make use of it

The log and log-context headers are updated with functions to allow
fibers to have their own log prefix and log context.

ci: run the musl build & test under mkosi with a postmarketOS tools tree (#42171)

include: move several missing definitions to musl

The moved definitions have been available in glibc <= 2.34, and are only
necessary when building with musl.

Follow-up for c8c1bcf1941047d1fe43d9827ad4826b4620297a.

firstboot: auto-fill keymap/locale questions with data from firmware, if available (#42177)

Let's make the installer experience a tiny bit nicer, and prefill
everything with the data the firmware gives us.

dns-packet: bail out early if the packet is too short (#42189)

This should address the nit from Claude in
https://github.com/systemd/systemd/pull/42178#pullrequestreview-4320763076.

ci: run the musl build & test under mkosi with a postmarketOS tools tree

Drop the standalone Unit-tests (musl) workflow that ran on an Alpine sandbox
spun up by jirutka/setup-alpine, and merge it into unit-tests.yml as a new
build-musl job that provisions a postmarketOS tools tree via mkosi and runs
the meson build + test suite through 'mkosi box'. postmarketOS is musl-native,
so the musl-gcc / -idirafter /usr/include wrappers the Fedora tools tree
needed are gone; the linter.yml's own musl build step also goes away since
the unit-tests workflow now covers it (and tests it).

postmarketOS doesn't ship a downstream systemd packaging spec, so the new
tools tree config in mkosi.tools.conf/mkosi.conf.d/postmarketos.conf does not
set PrepareScripts and lists build deps manually. mkosi.sync now early-exits
when PKG_SUBDIR is unset so the missing pkgenv entry doesn't trip set -u.

Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>

test-path: Skip test when we can't create a cgroup

Instead of having CI runner specific checks, let's just
skip the test if we get EXIT_CGROUP which is what we get
when we can't create a cgroup. This makes the check work
independently of CI runner, and specifically also on github
actions.

test-path: Migrate to test framework and macros

- Also clean up the logging in check_states() to
only log on state changes so it's less noisy.

test-path: Fail earlier on start-limit-hit

meson: Skip coccinelle suite by default

It's not fully passing yet so disable it by default until it is.
clang-tidy follows the same model.

test-ukify: Skip kernel images we can't access

mkosi: update mkosi ref to be746d51bc90568b196951a60095ba87bf51ca8b

* be746d51bc Make full $PATH available when building tools tree

ptyfwd: Imply PTY_FORWARD_READ_ONLY if stdin isn't readable

If stdin is connected to a closed pipe or similar, imply
PTY_FORWARD_READ_ONLY so we don't even try to read from it
in the first place. Otherwise we'll immediately get a hangup
which will cause the forwarder to call sd_event_exit() and
shut down the event loop.

Debugged-by: Christian Brauner <brauner@kernel.org>

vmspawn: initial support for SEV-SNP guests (#42193)

systemd-udevd: configure NIC IRQs CPU affinity (#40304)

# Context

#40195 defines the initial proposal and the motivation behind this PR.

This PR introduces 3 new options for `.link` files `[Link]` section:
- `IRQAffinityPolicy=`
- `IRQAffinity=`
- `IRQAffinityNUMA=`

The purpose is to allow `systemd-udevd` to configure a NIC's IRQs
affinity to specific CPU(s).

`IRQAffinityPolicy=` supports two policies:
- `single`: assign all the NIC IRQs to CPU 0, or the first CPU in the
CPU set resulting from the union of `IRQAffinity=` and
`IRQAffinityNUMA=`.
- `spread`: assign all the NIC IRQs to all the CPUs (or the union of
`IRQAffinity=` and `IRQAffinityNUMA=` if defined) in a round-robin
fashion, optimizing for cache locality while spreading queues apart
across CPUs as much as possible.

Both `IRQAffinity=` and `IRQAffinityNUMA=` behave as filters to reduce
the CPU set to assign IRQs to, and are only valid if
`IRQAffinityPolicy=` is defined.

# Spreading IRQs

This section describes the algorithm responsible for spreading IRQs over
different CPUs to maximize performance.

## 1. Discover CPU topology

Read from `/sys/devices/system/cpu/cpu*/topology` to identify:
- L3 cache domains (dies)
- Physical cores vs. hyperthreads
- NUMA nodes
- Core ordering within each die

```
Example: Dual-socket server with 2 dies per socket, 4 cores per die

      ┌─────────────────────────────────────────────────────────────────────────┐
      │                               NUMA Node 0                               │
      │   ┌──────────────────────────────┐   ┌──────────────────────────────┐   │
      │   │       Die 0 (L3 Cache)       │   │       Die 1 (L3 Cache)       │   │
      │   │ ┌─────┐┌─────┐┌─────┐┌─────┐ │   │ ┌─────┐┌─────┐┌─────┐┌─────┐ │   │
      │   │ │ C0  ││ C1  ││ C2  ││ C3  │ │   │ │ C4  ││ C5  ││ C6  ││ C7  │ │   │
      │   │ │ 0,16││ 1,17││ 2,18││ 3,19│ │   │ │ 4,20││ 5,21││ 6,22││ 7,23│ │   │
      │   │ └─────┘└─────┘└─────┘└─────┘ │   │ └─────┘└─────┘└─────┘└─────┘ │   │
      │   └──────────────────────────────┘   └──────────────────────────────┘   │
      └─────────────────────────────────────────────────────────────────────────┘
      ┌─────────────────────────────────────────────────────────────────────────┐
      │                               NUMA Node 1                               │
      │   ┌──────────────────────────────┐   ┌──────────────────────────────┐   │
      │   │       Die 2 (L3 Cache)       │   │       Die 3 (L3 Cache)       │   │
      │   │ ┌─────┐┌─────┐┌─────┐┌─────┐ │   │ ┌─────┐┌─────┐┌─────┐┌─────┐ │   │
      │   │ │ C8  ││ C9  ││ C10 ││ C11 │ │   │ │ C12 ││ C13 ││ C14 ││ C15 │ │   │
      │   │ │ 8,24││ 9,25││10,26││11,27│ │   │ │12,28││13,29││14,30││15,31│ │   │
      │   │ └─────┘└─────┘└─────┘└─────┘ │   │ └─────┘└─────┘└─────┘└─────┘ │   │
      │   └──────────────────────────────┘   └──────────────────────────────┘   │
      └─────────────────────────────────────────────────────────────────────────┘
                      (Numbers show CPU IDs: first HT, second HT)
```

## 2. Filter the first hyperthread only

Use only the first hyperthread of each physical core to avoid SMT
contention. Two IRQs on sibling HTs contend for ALU/cache without cache
benefit.

```
      Before: CPUs 0-31 (16 cores × 2 HTs)
      After:  CPUs 0-15 (first HT of each core)
```
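
A small sketch of this filter, assuming the topology has already been
parsed into a per-CPU core identifier. The `core_id` array and the
`toy_first_threads()` helper are hypothetical and only illustrate the
selection rule, not how udev actually reads sysfs:

```c
#include <assert.h>
#include <stddef.h>

/* core_id[cpu] identifies the physical core that `cpu` belongs to.
 * Writes the lowest-numbered CPU of every core to `out` (which must have
 * room for n_cpus entries) and returns how many were written. */
static size_t toy_first_threads(const int *core_id, size_t n_cpus, int *out) {
        size_t n_out = 0;

        assert(core_id && out);

        for (size_t cpu = 0; cpu < n_cpus; cpu++) {
                int seen = 0;

                /* Did a lower-numbered CPU already claim this core? */
                for (size_t prev = 0; prev < cpu; prev++)
                        if (core_id[prev] == core_id[cpu]) {
                                seen = 1;
                                break;
                        }

                if (!seen)
                        out[n_out++] = (int) cpu;
        }

        return n_out;
}
```

With 32 CPUs where CPU `n` and CPU `n + 16` share a core, this keeps CPUs
0 through 15, matching the before/after example above.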

## 3. Equidistant permutations

Reorder dies and CPUs so consecutive selections are maximally spread
apart.

```
        Original order:  [0, 1, 2, 3]          (adjacent dies/CPUs)
                            |
                            v
        Equidistant:     [0, 2, 1, 3]          (spread apart)
```

This ensures that even if only 2 IRQs are assigned, they land on dies 0
and 2 (not 0 and 1), maximizing physical distance. The permutation is
also applied within each die:

```
        Die permutation:   Die0 -> Die2 -> Die1 -> Die3

        Within each die:
        ┌───────────────────────────────────────────────┐
        │ Die 0: [C0,C1,C2,C3]     -> [C0,C2,C1,C3]     │
        │ Die 1: [C4,C5,C6,C7]     -> [C4,C6,C5,C7]     │
        │ Die 2: [C8,C9,C10,C11]   -> [C8,C10,C9,C11]   │
        │ Die 3: [C12,C13,C14,C15] -> [C12,C14,C13,C15] │
        └───────────────────────────────────────────────┘
```
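
The text does not spell out how the equidistant order is computed. One
ordering that reproduces `[0, 2, 1, 3]` is the bit-reversal permutation,
sketched below under the assumption that the number of elements is a power
of two. `toy_spread_order()` is a hypothetical name, and the PR's actual
routine may differ, in particular for other sizes:

```c
#include <assert.h>
#include <stddef.h>

/* Fills out[0..n) with the bit-reversal permutation of 0..n-1.
 * Returns 0 on success, -1 if n is not a power of two. */
static int toy_spread_order(size_t n, size_t *out) {
        size_t bits = 0;

        assert(out);

        if (n == 0 || (n & (n - 1)) != 0)
                return -1;

        /* Number of bits needed to index n elements. */
        while (((size_t) 1 << bits) < n)
                bits++;

        for (size_t i = 0; i < n; i++) {
                size_t r = 0;

                /* Mirror the low `bits` bits of i. */
                for (size_t b = 0; b < bits; b++)
                        if (i & ((size_t) 1 << b))
                                r |= (size_t) 1 << (bits - 1 - b);

                out[i] = r;
        }

        return 0;
}
```

For four elements this yields 0, 2, 1, 3, so the first two picks are as far
apart as the original ordering allows.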

## 4. Round-robin selection across dies

Pick one CPU from each die in rotation, following permuted order.

```

      Round 1:     Die0->C0   Die2->C8   Die1->C4   Die3->C12
                      |          |          |          |
                      v          v          v          v
      IRQs:         [IRQ0]     [IRQ1]     [IRQ2]     [IRQ3]

      Round 2:     Die0->C2   Die2->C10  Die1->C6   Die3->C14
                      |          |          |          |
                      v          v          v          v
      IRQs:         [IRQ4]     [IRQ5]     [IRQ6]     [IRQ7]
```

If there are more IRQs than physical cores, this logic wraps around and
reuses CPUs. Only the first hyperthread of each core is used to avoid
cache line contention between queues.
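
The selection and wrap-around can be sketched as below for the 4-die,
4-cores-per-die example. The fixed sizes, the `dies` table layout and
`toy_assign()` are assumptions made for illustration; the real code has to
cope with dies of differing sizes and with the `IRQAffinity=` filters:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_DIES 4
#define TOY_CORES_PER_DIE 4

/* dies[d][k] is the k-th CPU of the d-th die, where both the die order
 * and the per-die CPU order have already been permuted. Stores the CPU
 * chosen for each IRQ in cpu_for_irq[0..n_irqs). */
static void toy_assign(const int dies[TOY_DIES][TOY_CORES_PER_DIE],
                       size_t n_irqs,
                       int *cpu_for_irq) {
        assert(dies && cpu_for_irq);

        for (size_t irq = 0; irq < n_irqs; irq++) {
                /* Rotate through the dies first ... */
                size_t die = irq % TOY_DIES;
                /* ... and advance within each die once per full rotation,
                 * wrapping around when every core has been used. */
                size_t round = (irq / TOY_DIES) % TOY_CORES_PER_DIE;

                cpu_for_irq[irq] = dies[die][round];
        }
}
```

Fed with the permuted table from step 3, the first eight IRQs land on C0,
C8, C4, C12, C2, C10, C6 and C14, and the 17th IRQ wraps back to C0.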

Closes #40195.

Revert "meson: shrink developer-mode build artifacts"

This reverts commit 68910161491cdd161bff29a32032e52301831164.

update TODO

firstboot: port help() to help-util.[ch] APIs

firstboot: prefill language prompt with firmware language if it makes sense

bootctl: show platform lang in bootctl status output

locale-setup: pick up platform lang from firmware

sysinstall: prefill firmware variable question with 'yes'

firstboot: prefill keymap question with EFI provided info

ask-string: support prefilling a string query with an explicit string

vconsole-util: move code to read EFI keyboard layout into generic code

basic/math-util: drop libm where possible

- test-random-util is reworked to not use sqrt()
- pretty-print.c inlines ceil() so libm doesn't have
  to be linked into libshared
- We add -fno-math-errno to allow inlining of more math
  functions by not requiring standard math functions to
  set errno on invalid input.

meson: shrink developer-mode build artifacts

Two complementary changes in the developer-mode branch of meson.build:

  1. -ffunction-sections -fdata-sections: pair with the existing
     -Wl,--gc-sections so the linker can drop unused individual functions
     and data instead of being forced to pull whole .o files into each
     binary. Biggest impact on statically-linked NSS/PAM modules (a single
     call into creds-util.c used to drag in the entire creds-util
     translation unit, which transitively pulled TPM2, OpenSSL, PKCS11 and
     KDF helpers via tpm2-util.c / openssl-util.c) and on tests that embed
     daemon objects via meson's objects: extraction.

  2. -gz=zstd + -fdebug-types-section + -Wl,--compress-debug-sections=zstd:
     compress every .debug_* section with zstd, and move type DIEs into a
     COMDAT-mergeable section so identical types described across many TUs
     land once. Both are transparent to GDB / readelf / addr2line.

Gated to mode == 'developer' for now: no major distro (Fedora, Debian/
Ubuntu, Arch, Alpine, Gentoo, openSUSE, Yocto) enables -ffunction-sections
in their system-wide default CFLAGS, and the interaction with -flto=auto +
-ffat-lto-objects (which Fedora et al. ship by default) deserves a broader
evaluation before turning it on for release builds. Developer mode benefits
straightforwardly: smaller plugins, smaller tests, smaller libraries, no
interference with the hardening/LTO flag combinations distros pin.

Size impact on a clean developer-mode build, 626 ELF objects:

| Category                   |   n |      Before |       After |               Δ |     % |
| -------------------------- | --: | ----------: | ----------: | --------------: | ----: |
| NSS plugins                |   4 |  22,163,272 |   6,681,120 |    −15,482,152  | −69.9 |
| PAM plugins                |   3 |  18,132,160 |   5,764,880 |    −12,367,280  | −68.2 |
| libsystemd.so (public ABI) |   1 |   6,009,360 |   3,350,168 |     −2,659,192  | −44.3 |
| libudev.so (public ABI)    |   1 |   4,379,336 |   1,441,256 |     −2,938,080  | −67.1 |
| libsystemd-shared-261.so   |   1 |  15,130,264 |  11,293,952 |     −3,836,312  | −25.4 |
| libsystemd-core-261.so     |   1 |   7,356,408 |   4,992,600 |     −2,363,808  | −32.1 |
| cryptsetup-token plugins   |   3 |     139,800 |     113,344 |        −26,456  | −18.9 |
| daemons / CLI tools        | 178 |  43,581,288 |  31,911,368 |    −11,669,920  | −26.8 |
| test binaries              | 386 | 124,717,072 |  60,185,904 |    −64,531,168  | −51.7 |
| fuzzers                    |  48 |  33,138,056 |  13,864,952 |    −19,273,104  | −58.2 |
| **TOTAL**                  | 626 | 274,747,016 | 139,599,544 | **−135,147,472** | **−49.2** |

Biggest individual wins:

| Binary                   |   Before |    After |      Δ |
| ------------------------ | -------: | -------: | -----: |
| test-networkd-address    |  6.21 MB |  0.82 MB | −86.8% |
| test-network-tables      |  6.23 MB |  0.85 MB | −86.4% |
| test-networkd-conf       |  6.27 MB |  1.32 MB | −78.9% |
| libnss_myhostname.so.2   |  6.40 MB |  1.53 MB | −76.1% |
| fuzz-netdev-parser       |  6.22 MB |  1.51 MB | −75.7% |
| fuzz-network-parser      |  6.22 MB |  1.70 MB | −72.6% |
| libnss_systemd.so.2      |  6.90 MB |  2.04 MB | −70.4% |
| libnss_resolve.so.2      |  4.42 MB |  1.38 MB | −68.8% |
| pam_systemd_home.so      |  6.91 MB |  2.22 MB | −67.9% |
| libudev.so.1.7.13        |  4.18 MB |  1.37 MB | −67.1% |
| pam_systemd.so           |  7.02 MB |  2.55 MB | −63.6% |
| libsystemd-shared-261.so | 14.43 MB | 10.77 MB | −25.4% |

The big test wins come from the ~30 daemons (systemd-networkd,
systemd-resolved, systemd-journald, systemd-logind, systemd-homed,
systemd-importd, systemd-machined, …) whose compiled .o files are embedded
directly into their unit tests via meson's objects: extraction mechanism.
With per-function sections on the daemon sources, the test binary can GC
the bulk of code it never exercises; the remaining DWARF is then shared
zstd-compressed across every .o.

Build-speed cost is below noise on a 24-core build: across four clean
builds (with-flags / sections-only / baseline / with-flags rerun) the
range was 23.6–26.0 s real time and 7m39s–7m48s user time, with the
two with-flags runs faster than the baseline by a couple of seconds —
overhead from per-function-section bookkeeping and zstd compression
disappears into parallel-build noise.

repart: canonicalize node in varlink Run method

Run acquire_root_devno() on the varlink-provided node so symlinks (e.g.
/dev/disk/by-id/...) resolve to their canonical /dev/ path before being
used. Without this, sym_fdisk_partname() produces a "-partN" symlink
that udev hasn't created yet when repart calls open() on it right after
BLKPG_ADD_PARTITION, failing with ENOENT.

This also brings the varlink path in line with the CLI path's
partition-to-whole-disk and dm-crypt-to-backing resolution.

update TODO

po: skip automated fuzzy translations when generating new po files

The fuzzy translations are always wrong, but meson's integration does
not allow skipping them. Add a tiny wrapper for 'msgmerge' to work
around the issue and skip them when running ninja systemd-update-po.

vmspawn: use EPYC-v4 cpu for SNP

SNP requires a named, stable CPU model so the launch measurement is
reproducible across hosts. EPYC-v4 is the baseline that covers all
SNP-capable processors (Milan and later).

Signed-off-by: Paul Meyer <katexochen0@gmail.com>

vmspawn: initial support for SEV-SNP guests

Add --confidential-computing=sev-snp to run the guest as an AMD SEV-SNP
confidential VM. Loads a raw OVMF firmware blob via -bios (SNP doesn't
support the pflash + NVRAM split), attaches a sev-snp-guest object,
and hashes the kernel, initrd and cmdline into the launch measurement
when direct kernel boot is used. Incompatible features (Secure Boot,
CXL, virtio-balloon, SMBIOS credentials) are rejected or disabled; an
attached vTPM must be treated as untrusted by the guest.

The feature is marked experimental in the man page.

Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Paul Meyer <katexochen0@gmail.com>

update TODO

udev/net: document IRQAffinityPolicy=, IRQAffinity=, and
IRQAffinityNUMA=

udev/net: add IRQAffinityNUMA= option for NUMA-aware filtering

Add support for filtering IRQ affinity to CPUs on a specific NUMA node
via the new IRQAffinityNUMA= option in .link files. The option accepts:
- "local": use the NUMA node local to the NIC's PCIe slot
- Explicit node number (0, 1, 2, ...): use CPUs on the specified node

When both IRQAffinity= and IRQAffinityNUMA= are specified, their
intersection is used. If the intersection is empty, an error is logged
and IRQ affinity configuration is skipped.

When "local" is specified but the device's NUMA node cannot be
determined (numa_node shows -1), a warning is logged and IRQ affinity
configuration is skipped.
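
The combination rule above can be sketched in Python as follows. This is
an illustration only, not the udev code: the function name and the use of
plain sets of CPU numbers are assumptions, and None stands for "option not
set".

```python
def eligible_cpus(all_cpus, irq_affinity=None, numa_cpus=None):
    """Return the CPUs eligible for IRQ placement, or None to skip.

    irq_affinity: CPUs listed in IRQAffinity=, or None if unset.
    numa_cpus:    CPUs of the node selected by IRQAffinityNUMA=, or None.
    """
    cpus = set(all_cpus)
    if irq_affinity is not None:
        cpus &= set(irq_affinity)
    if numa_cpus is not None:
        cpus &= set(numa_cpus)
    # An empty intersection means IRQ affinity configuration is skipped.
    return cpus or None


# IRQAffinity=2-4 combined with a NUMA node that owns CPUs 4-7.
both = eligible_cpus(range(8), irq_affinity={2, 3, 4}, numa_cpus={4, 5, 6, 7})
# IRQAffinity=0-1 combined with the same node: nothing left.
none_left = eligible_cpus(range(8), irq_affinity={0, 1}, numa_cpus={4, 5, 6, 7})
```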

udev/net: add IRQAffinity= option to filter eligible CPUs

Add IRQAffinity= option to .link files that filters the set of CPUs
eligible for IRQ placement. This works in conjunction with
IRQAffinityPolicy= to constrain which CPUs receive network IRQs.

When specified with spread policy, only the listed CPUs are considered
for IRQ distribution. When specified with single policy, IRQs are
pinned to the first CPU in the allowed set instead of CPU 0.

udev/net: implement IRQAffinityPolicy=spread with topology awareness

Implement the spread policy for IRQ affinity distribution using a
topology-aware algorithm. The algorithm:

1. Discovers CPU topology from sysfs (NUMA node, package, die/L3, core)
2. Groups CPUs by L3 cache domain (die) with equidistant ordering
3. Round-robins across dies, spreading IRQs across the system
4. Uses first hyperthread of each core before second hyperthreads
5. Applies IRQ affinity via /proc/irq/<n>/smp_affinity

When there are more IRQs than CPUs, queues wrap around using round-robin.
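
Steps 3 and 4 and the wrap-around can be illustrated with a minimal Python
sketch. It assumes the topology has already been read from sysfs into a
nested list (dies, then cores, then hyperthread siblings); the function
names and the sample topology are hypothetical, and the equidistant
ordering of step 2 is not modelled.

```python
from itertools import zip_longest


def spread_order(dies):
    """Return CPUs in the order IRQs would be handed out.

    dies: list of dies (L3 domains); each die is a list of cores; each core
    is a list of sibling CPU numbers, first hyperthread first.
    """
    order = []
    max_threads = max((len(core) for die in dies for core in die), default=0)
    for thread in range(max_threads):
        # CPUs at this hyperthread level, grouped per die.
        per_die = [[core[thread] for core in die if len(core) > thread]
                   for die in dies]
        # Round-robin across dies: one CPU from each die in turn.
        for group in zip_longest(*per_die):
            order.extend(cpu for cpu in group if cpu is not None)
    return order


def assign_irqs(irqs, order):
    """Map each IRQ to a CPU, wrapping around when IRQs outnumber CPUs."""
    return {irq: order[i % len(order)] for i, irq in enumerate(irqs)}


# Hypothetical topology: two dies, two cores each, two hyperthreads per core.
dies = [[[0, 8], [1, 9]], [[2, 10], [3, 11]]]
order = spread_order(dies)
placement = assign_irqs(range(100, 110), order)
```

With this topology all first hyperthreads are used, alternating between
the two dies, before any second hyperthread is picked.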

udev/net: add IRQAffinityPolicy= option for .link files

Add support for configuring IRQ affinity for network interfaces via
systemd .link files. For now, the new IRQAffinityPolicy= option in the [Link]
section only accepts "single", which pins all MSI IRQs to CPU 0.

This allows declarative IRQ affinity configuration for network devices
during udev processing, which is useful for optimizing network
performance on multi-core systems.

Further commits will expand the options supported by IRQAffinityPolicy=.

dns-packet: bail out early if the packet is too short

Let's bail out early if the packet claims to contain some
questions or answer RRs, but the remaining packet data size is not
enough to hold a single such entry.

Follow-up for e7cd836dcffb5f85d66a156904fc68f8b654a290.
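
The idea can be sketched in Python as below. This is a simplified model,
not the resolved code: the minimum entry sizes are derived from the DNS
wire format (an entry with the one-byte root name), the function name is
made up, and the check is applied to the whole body after the header
rather than at the parser's current read position.

```python
import struct

DNS_HEADER_SIZE = 12
# Smallest possible question: root name (1) + QTYPE (2) + QCLASS (2).
MIN_QUESTION_SIZE = 1 + 2 + 2
# Smallest possible RR: root name (1) + TYPE (2) + CLASS (2) + TTL (4)
# + RDLENGTH (2), with empty RDATA.
MIN_RR_SIZE = 1 + 2 + 2 + 4 + 2


def counts_are_plausible(packet):
    """Reject packets whose header counts cannot fit a single entry."""
    if len(packet) < DNS_HEADER_SIZE:
        return False
    _id, _flags, qdcount, ancount, nscount, arcount = struct.unpack(
        "!6H", packet[:DNS_HEADER_SIZE])
    body = len(packet) - DNS_HEADER_SIZE
    if qdcount > 0 and body < MIN_QUESTION_SIZE:
        return False
    if ancount + nscount + arcount > 0 and body < MIN_RR_SIZE:
        return False
    return True


# Header claiming one question, followed by no data at all.
truncated = struct.pack("!6H", 0x1234, 0, 1, 0, 0, 0)
# Same header followed by a minimal question for the root name.
minimal = truncated + b"\x00\x00\x01\x00\x01"
```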

dns-packet: drop unnecessary indentation

Bail out early if QDCOUNT == 0, similarly to what we already do in
dns_packet_extract_answer() when RRCOUNT == 0.

test: add test-link-abi to enforce link-time ABI invariants

For every built executable, internal shared library, and plugin module,
verify two link-time properties via readelf:

1. No imported GLIBC symbol's version is newer than 2.34.
2. The dynamic section's NEEDED entries reference only glibc, the
runtime linker, and our own libraries.
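
The first check boils down to a version comparison, sketched here in
Python. The helper names are hypothetical and the input is assumed to be
a list of version strings already extracted from readelf output; how the
actual test parses that output is not shown.

```python
import re


def glibc_version(symbol_version):
    """Parse 'GLIBC_2.34' into (2, 34); return None for anything else."""
    match = re.fullmatch(r"GLIBC_(\d+(?:\.\d+)*)", symbol_version)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def too_new(symbol_versions, limit=(2, 34)):
    """Return the symbol versions that are newer than the allowed baseline."""
    offenders = []
    for name in symbol_versions:
        version = glibc_version(name)
        if version is not None and version > limit:
            offenders.append(name)
    return sorted(offenders)


found = too_new(["GLIBC_2.2.5", "GLIBC_2.34", "GLIBC_2.38", "GLIBC_PRIVATE"])
```

Comparing tuples of integers rather than strings matters here: as a
string, "2.2.5" would sort after "2.34".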

tree-wide: Replace exp10() with our own impl

exp10() has a symbol version > 2.34 on latest glibc. To allow
dropping our baseline required glibc runtime version to <= 2.34,
let's add our own version to prevent pulling in the newer symbol
from glibc.

core: introduce varlink `io.systemd.Job` interface (#42104)

Methods:

- `io.systemd.Job.List` — list all queued jobs or look up by `id`/`unit`
name, with streaming support. Uses context/runtime split: `JobContext`
(Unit, JobType) and `JobRuntime` (Id, State, Result,
ActivationDetails). Follows the same SELinux and parameter-conflict
patterns as `io.systemd.Unit.List`.
- `io.systemd.Job.Cancel` — cancel a specific job by ID, with SELinux
and polkit authorization.
- `io.systemd.Job.ClearAll` — cancel all pending jobs, with SELinux and
polkit authorization.

networkd: fix race condition in per-interface ICMPv6 processing

There exists a small window of time in icmp6_bind() between creating the
ICMPv6 socket and binding it to an ifindex, where the link-scoped socket
can process an ICMPv6 packet received on any interface. This applies to
both sd-radv and sd-ndisc codepaths.

This change adds an explicit check for ifindex on the receive path and
ignores packets received on other interfaces.

Co-developed-by: OpenAI Codex <noreply@openai.com>

sd-bus: add depth limit to message_skip_fields() to prevent stack overflow (#42164)

`message_skip_fields()` recursively processes D-Bus variant types in
message header fields with no depth limit. A crafted message with deeply
nested variants can cause unbounded recursion and overflow the stack.

Add a `depth` parameter checked against `BUS_CONTAINER_DEPTH` (128),
matching the limit already enforced by the public
`sd_bus_message_skip()` API. All recursive call sites pass `depth + 1`,
and the top-level caller in `message_parse_fields()` passes `0`.

Couple of test fixes (#42185)

core: improve errors from varlink io.systemd.Unit.StartTransient

The existing error reporting for the varlink `StartTransient` code
was converting all errors into `VARLINK_ERROR_UNIT_BAD_SETTING`.

This is not correct in some cases. We need a more targeted
pattern here, i.e. only convert EINVAL to VARLINK_ERROR_UNIT_BAD_SETTING
and otherwise return the varlink error matching the errno.

This commit fixes this issue. Thanks to Ivan Kruglov for raising
this.

tree-wide: move static dl handles into their dlopen_*() functions (#42168)

Each dlopen_*() wrapper kept its dl handle as a file-scope
'static void *xxx_dl = NULL;' even though only the wrapper itself
ever referenced it. Move each one inside the corresponding function
so its scope matches its actual use, leaving the rest of each
translation unit free of the unused file-scope name.

In pcre2-util.c the assert(pcre2_dl) in pattern_matches_and_log()
becomes assert(sym_pcre2_match), which carries the same invariant
(pattern_compile_and_log() ran dlopen_pcre2()).

test: switch TEST-55-OOMD stress-ng --vm-method to lfsr32

Commit 881e4717c7 ("test: pin stress-ng --vm-method to a portable
scalar method in TEST-55-OOMD") pinned --vm-method=zero-one with the
rationale that it is "a long-standing scalar method". That rationale is
wrong: stress_vm_zero_one() in stress-ng's stress-vm.c is declared

static size_t TARGET_CLONES stress_vm_zero_one(...)

i.e. it carries the exact same TARGET_CLONES attribute as 33 of the 35
other vm methods. On x86_64 with GCC >=5, TARGET_CLONES expands (see
core-target-clones.h in stress-ng) to a target_clones attribute
including "arch=skylake-avx512", "arch=cooperlake", "arch=tigerlake",
"arch=sapphirerapids", and several other AVX-512-bearing arch variants,
plus "default". GCC generates AVX-512 clones of stress_vm_zero_one() and
the IFUNC resolver picks them on any CPU that advertises AVX-512.

The only vm methods in stress-ng's registry whose function definitions
omit TARGET_CLONES entirely (and are therefore guaranteed not to
dispatch to an AVX-512 clone) are lfsr32 (portable, always registered)
and write64ds (x86_64-only, gated on HAVE_ASM_X86_MOVDIRI, i.e. Intel
Tremont / Tiger Lake+ MOVDIRI instruction).

Switch the four stress-ng --vm invocations in TEST-55-OOMD to
--vm-method=lfsr32 so the AVX-512 SIGILL on CPUs without AVX-512 (e.g.
AMD Zen 1-3) can no longer occur regardless of compiler version,
optimization level, or stress-ng package build.

Follow-up for 881e4717c7981b274853309e68b39153e3b292f4

Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>

test: fix race in TEST-07-PID1.socket-on-failure.sh

The test waited for the OnFailure= service's filesystem side effect
(`rmdir` of the directory) and then immediately invoked
`systemctl is-active`. Between `rmdir(2)` returning (which causes the
shell loop to exit) and PID1 reaping the child and transitioning the
oneshot service from `activating` to `active`, there is a small window
where `is-active` can observe `activating` and fail the test.

Wait directly on the unit state instead, matching the pattern used a
few lines above for the `is-failed` case.

  [ 1880.326704] TEST-07-PID1.sh[21489]: + timeout --foreground 60 bash -c 'while [[ -d '\''/tmp/TEST-07-PID1-socket-8467/test'\'' ]]; do sleep .5; done'
  [ 1880.330482] TEST-07-PID1.sh[21489]: + [[ ! -e /tmp/TEST-07-PID1-socket-8467/test ]]
  [ 1880.330482] TEST-07-PID1.sh[21489]: + systemctl is-active TEST-07-PID1-socket-OnFailure.service
  [ 1880.347470] TEST-07-PID1.sh[21520]: activating
  [ 1880.349508] TEST-07-PID1.sh[21489]: + at_exit
  [ 1880.349508] TEST-07-PID1.sh[21489]: + systemctl stop TEST-07-PID1-socket-8467.socket
  [ 1880.367331] TEST-07-PID1.sh[107]: Subtest /usr/lib/systemd/tests/testdata/units/TEST-07-PID1.socket-on-failure.sh failed
  [ 1880.367331] TEST-07-PID1.sh[107]: + return 1

Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>

test: add test for variant recursion depth limit in message_skip_fields()

Craft a raw D-Bus message with an unknown header field containing
BUS_CONTAINER_DEPTH+1 nested variants and verify that message parsing
rejects it with -EBADMSG rather than recursing until stack overflow.

sd-bus: add depth limit to message_skip_fields() to prevent stack overflow

message_skip_fields() recurses for each nested variant ('v') type in
D-Bus message header fields. A crafted message with deeply nested
variants (e.g., a variant containing a variant containing a variant...)
causes unbounded stack growth, leading to stack overflow and crash.

Add a depth parameter that increments on each recursive call and
rejects messages exceeding BUS_CONTAINER_DEPTH with -EBADMSG. This
matches the existing depth limits enforced elsewhere in the sd-bus
message processing (e.g., bus_message_enter_container).
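
The pattern can be sketched in Python as follows. This is a model of the
technique, not the sd-bus code: nested variants are represented as
("v", inner) tuples, the exception class stands in for -EBADMSG, and the
exact comparison used at the boundary is an assumption.

```python
BUS_CONTAINER_DEPTH = 128


class BadMessage(Exception):
    """Stands in for the -EBADMSG return value."""


def skip_value(value, depth=0):
    """Skip a value, recursing into variants; return the nesting depth seen."""
    if isinstance(value, tuple) and value[0] == "v":
        # Refuse to descend further once the limit is reached.
        if depth >= BUS_CONTAINER_DEPTH:
            raise BadMessage("variant nesting too deep")
        return skip_value(value[1], depth + 1)
    return depth


def nest(levels):
    """Build a value wrapped in the given number of variants."""
    value = 0
    for _ in range(levels):
        value = ("v", value)
    return value


def is_rejected(value):
    try:
        skip_value(value)
    except BadMessage:
        return True
    return False
```

In this sketch a value wrapped in BUS_CONTAINER_DEPTH variants is still
accepted, while BUS_CONTAINER_DEPTH+1 variants are rejected.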

tools: add a test wrapper that replays crashing tests under gdb

Add a meson test --wrapper hook that prints a gdb backtrace inline in the
test log when a test exits with an actual crash signal (SIGSEGV, SIGABRT,
SIGBUS, SIGFPE, SIGILL). It is wired into the default add_test_setup() so it
runs automatically on every `meson test`.

Environmental terminations (SIGTERM/SIGKILL/SIGPIPE/SIGALRM) are passed
through without replay, and the original signal is re-raised so the
parent's wait() observes WIFSIGNALED rather than a plain exit code.
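
The signal handling described above can be sketched in Python as below.
This is not the wrapper from the commit: the function names are
hypothetical, a POSIX system is assumed, and the gdb replay itself is
left out.

```python
import os
import signal

# Signals that indicate an actual crash and warrant a replay under gdb.
CRASH_SIGNALS = {signal.SIGSEGV, signal.SIGABRT, signal.SIGBUS,
                 signal.SIGFPE, signal.SIGILL}


def terminating_signal(returncode):
    """subprocess reports death by signal N as returncode -N."""
    return -returncode if returncode < 0 else None


def should_replay(returncode):
    sig = terminating_signal(returncode)
    return sig is not None and sig in CRASH_SIGNALS


def reraise(signum):
    """Die from the same signal so the parent's wait() sees WIFSIGNALED."""
    try:
        signal.signal(signum, signal.SIG_DFL)
    except OSError:
        pass  # the disposition of SIGKILL cannot be changed
    os.kill(os.getpid(), signum)
```

A wrapper built on this would run the test with subprocess, call
should_replay() on the return code, and finish with reraise() whenever
terminating_signal() is not None.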

socket-proxy: implement PROXY protocol v1

Implement PROXY protocol v1 as specified by the haproxy documentation:
https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt

Only protocol v1 is implemented for now; protocol v2 is binary
and will be implemented in the future.

The PROXY protocol allows the destination/target to know the
address/port of the connecting client.

In nginx it is supported by enabling the `proxy_protocol` parameter of
the `listen` directive.
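
For reference, a v1 header is a single ASCII line sent before any payload.
The following Python sketch builds one according to the linked
specification; the function name is made up and this is not the
socket-proxy code.

```python
import ipaddress


def proxy_v1_header(src, src_port, dst, dst_port):
    """Build a PROXY protocol v1 header line for a TCP connection."""
    source = ipaddress.ip_address(src)
    dest = ipaddress.ip_address(dst)
    if source.version != dest.version:
        raise ValueError("source and destination address families differ")
    proto = "TCP4" if source.version == 4 else "TCP6"
    line = f"PROXY {proto} {source} {dest} {src_port} {dst_port}\r\n"
    return line.encode("ascii")


header = proxy_v1_header("192.0.2.1", 56324, "198.51.100.7", 443)
```

The proxy writes this line to the backend connection first and then
relays the client's bytes unchanged.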

tree-wide: move static dl handles into their dlopen_*() functions

Each dlopen_*() wrapper kept its dl handle as a file-scope
'static void *xxx_dl = NULL;' even though only the wrapper itself
ever referenced it. Move each one inside the corresponding function
so its scope matches its actual use, leaving the rest of each
translation unit free of the unused file-scope name.

In pcre2-util.c the assert(pcre2_dl) in pattern_matches_and_log()
becomes assert(sym_pcre2_match), which carries the same invariant
(pattern_compile_and_log() ran dlopen_pcre2()).

libcrypt-util: Clean up dlopen_libcrypt()

bpf-util: Add back caching on error/success

Follow up for 7d822ca8

udev: predictable names for auxiliary sub-function (SF) devices (#42154)

Some drivers (currently mlx5_core) expose sub-functions (SFs) of a PCI
Physical or Virtual Function as auxiliary devices that carry a stable
`sfnum` sysfs attribute (the user-defined SF number passed to `devlink
port add ... sfnum N`). Their leaf devices (net, infiniband, ...)
currently inherit the parent PF/VF's `ID_PATH` and the SF netdev falls
through to the kernel-assigned `eth<N>` name.

Two patches:

1. `udev-builtin-path_id`: prepend an `sf-<N>` token when the walk
crosses an aux device with `sfnum`, so SF leaf devices get an `ID_PATH`
distinct from the parent PF/VF (e.g. `pci-0000:c1:00.0-sf-88`). Aux
devices without `sfnum` keep the pre-patch behaviour, so existing
`ID_PATH` values are unchanged and the patch needs no naming-scheme gate
(path_id is unversioned).

2. `udev-builtin-net_id`: name SF host netdevs analogously to SR-IOV VF
host netdevs — walk to the parent PCI function and append a
single-character `S<sfnum>` suffix. SFs hosted on SR-IOV VFs (VF-SF) get
both suffixes chained on the PF's base name (`enp193s0f0v0S88`). Gated
behind `NAMING_SUBFUNC` / `NAMING_V261`. Man page
(`systemd.net-naming-scheme(7)`) updated.

The patches apply to any driver that exposes its SF leaf devices below
an aux device exposing `sfnum`.

Validated on mlx5 hardware: PF, VF, SF-on-PF (`enp8s0f0S88`,
`pci-0000:08:00.0-sf-88`). VF-SF case verified with a fake-sysfs setup
since this is not supported by any real device and is more of a theoretical
use case.

mkosi: update mkosi ref to 77fce77807a9a92bc37edc8f1c967102e6236d94

* 77fce77807 apk: Implement repository_key_fetch for the postmarketOS distribution
* 7068ed49ab postmarketos: Add ruff to tools tree
* dea4b6bfc8 Add newline when writing machine id into /etc/machine-id
* 944b775d40 tools: add libtss2-tcti-device0 to opensuse tools tree
* d856d65d3b mkosi-initrd: Also add cryptsetup-libs explicitly to the initrd
* 1cc967c5b3 mkosi-initrd: Trim orphaned GPU/audio modules, add ACPI platform attrs
* a3e95a7c29 mkosi-tools: Add fish to misc profile
* 76b02d1f84 mkosi-tools: Add jujutsu to misc profile
* 0afe4cd254 mkosi-tools: Move gh to misc profile
* 9077634bad mkosi-tools: Add cryptsetup-libs to centos/fedora/opensuse
* 82846347af box: Drop background tinting
* 3e50b97101 mkosi-tools: Add libfido2
* 78c2784827 vmspawn: Use --ephemeral rather than copy_ephemeral()
* dc801b00a3 Added second call to update kerneltype after kernel is defined
* 0c5cc04a8b vmspawn: Forward journal-remote settings to vmspawn
* 2518468c65 nspawn: Use --forward-journal instead of running journal-remote ourselves
* d2b798d00c apk: skip removal of packages that aren't installed

resolve: cap pre-allocation for questions/RRs

Since [0] and [1] questions & answer RRs from the incoming packets are
parsed into a hashmap to speed things up. The hashmaps are even
pre-allocated to speed things up even more, but there's one caveat - the
size for the pre-allocation comes from one or more fields from the
incoming packets that are under sender's control.

This can be abused by a malicious DNS server which can send a packet
with a spoofed QDCOUNT (for question packets) or ANCOUNT/NSCOUNT/ARCOUNT
(for answer packets). The limit of the final value in both cases is 64K.
This value is then used to pre-allocate the hashmap (via
set_reserve()/ordered_set_reserve(), where the caller also multiplies
the input value by 2 in both cases), which in turn calls
resize_buckets() that memzero()s the pre-allocated area, so all the
pages are faulted in, showing in process' RSS. Each such spoofed packet
then can translate into a ~4 MiB allocation in the systemd-resolved
process, which doesn't sound that bad.

However, this can be further amplified if the spoofed packet ends up in
resolved's cache. So, if the spoofed packet contains one valid A record
and then an OPT record with a spoofed ARCOUNT, the whole packet ends up
in the cache that can hold 4K of entries, which can eventually cause
resolved to keep up to 16 GiB of memory just for the cache (and thanks
to the memzero() above it's all RSS). Note that all this requires
someone with enough privileges to configure resolved to actually point
to such malicious DNS server or it could come from a malicious DHCP
server on the network. This could also get exploited via LLMNR, but in
that case an attacker would have to match an ID of a valid transaction
for the packet to end up in resolved's cache.

For example, with a malicious DNS server already in resolved's configuration:

$ resolvectl dns eth0
Link 2 (eth0): 192.168.99.1:5354

Filling resolved's cache:

$ for i in {0..4200}; do resolvectl query test-$i.example.com; done
...
test-4200.example.com: 192.0.2.1               -- link: dummy0

-- Information acquired via protocol DNS in 1.6ms.
-- Data is authenticated: no; Data was acquired via local or encrypted transport: no
-- Data from: network

Yields the following memory increase:

$ while :; do grep VmRSS /proc/$(pidof systemd-resolved)/status; sleep 1; done
VmRSS:     14280 kB
VmRSS:     14280 kB
...
VmRSS:    403352 kB
VmRSS:   1017976 kB
VmRSS:   1603876 kB
VmRSS:   2202028 kB
...
VmRSS:  16795724 kB
VmRSS:  16795724 kB

In my testing I also noticed one annoyance - after a certain threshold the
RSS increase persisted even after the malicious entries were evicted
from the cache (or flushed via `resolvectl flush-caches`). This was most
likely due to mmap_threshold getting bumped to > 4 MiB and neither cache
eviction nor flush-caches call malloc_trim(0) (via
sd_event_trim_memory() or similar).

To mitigate this, let's cap the pre-allocation to a maximum number of
records the given packet body can realistically contain. If the minimum
size would be, for whatever unlikely reason, not enough, nothing serious
would happen - the hashmap would still get resized automatically by
resize_buckets(), it'd be just slightly slower.

[0] ae45e1a3832fbb6c96707687e42f0b4aaab52c9b
[1] 2d34cf0c16dd8fa71fb593e65ce4734cb61d9170
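
The cap and the worst-case arithmetic from the description can be sketched
in Python. The function name and the 11-byte minimum record size (an RR
with the root name and empty RDATA) are assumptions; the exact bound used
by the patch may differ.

```python
# Smallest RR on the wire: root name (1) + TYPE (2) + CLASS (2) + TTL (4)
# + RDLENGTH (2).
MIN_RR_SIZE = 11


def capped_reservation(claimed, body_size, min_record_size=MIN_RR_SIZE):
    """Reserve no more entries than the packet body could actually hold."""
    return min(claimed, body_size // min_record_size)


# A 40-byte body claiming 65535 records can hold at most 3 of them.
spoofed = capped_reservation(65535, 40)
# An honest packet is not affected by the cap.
honest = capped_reservation(2, 4000)

# Worst case without the cap, using the figures quoted above:
# ~4 MiB per spoofed packet times a cache of 4K entries.
worst_case_bytes = 4096 * 4 * 2**20
worst_case_gib = worst_case_bytes / 2**30
```

The 16 GiB worst case is in line with the final VmRSS reading of
16795724 kB in the transcript above.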

systemd-coredump: add COREDUMP_CODE (#42019)

Add COREDUMP_CODE to the fields captured by systemd-coredump. This makes
it possible for system administrators to filter coredumps based on si_code,
which describes the reason why a given signal was sent.

For example, to find processes killed due to invalid permissions
(SEGV_ACCERR):

$ journalctl COREDUMP_SIGNAL=11 COREDUMP_CODE=2

I've decided to add the value of si_code to the 'Signal: ' line of
coredumpctl info:
Signal: 11 (SEGV) si_code: SEGV_ACCERR

networkd: add RouteTable= to [DHCPv6] section

Allow users to place DHCPv6 unreachable/blackhole routes (installed
for delegated prefixes) into a specific routing table, analogous to
the existing RouteTable= in [DHCPv4] and [IPv6AcceptRA].

The config parser config_parse_dhcp_or_ra_route_table() is extended
with an AF_UNSPEC ltype discriminator for DHCPv6 (AF_INET6 is already
taken by NDISC/RA). link_get_dhcp6_route_table() follows the same
pattern as link_get_dhcp4_route_table() and link_get_ndisc_route_table(),
falling back to the VRF table when not explicitly set.

In dhcp_request_unreachable_route(), the table is applied only for
NETWORK_CONFIG_SOURCE_DHCP6 routes (the uplink unreachable aggregates),
not DHCP_PD routes (per-subnet routes on downstream interfaces), matching
the intent of the feature. The !route->table_set guard avoids overriding
a table already set by the route code.

build(deps): bump the actions group with 2 updates

Bumps the actions group with 2 updates: [github/codeql-action](https://github.com/github/codeql-action) and [aws-actions/configure-aws-credentials](https://github.com/aws-actions/configure-aws-credentials).

Updates `github/codeql-action` from 4.35.2 to 4.35.4
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](https://github.com/github/codeql-action/compare/95e58e9a2cdfd71adc6e0353d5c52f41a045d225...68bde559dea0fdcac2102bfdf6230c5f70eb485e)

Updates `aws-actions/configure-aws-credentials` from 6.1.0 to 6.1.1
- [Release notes](https://github.com/aws-actions/configure-aws-credentials/releases)
- [Changelog](https://github.com/aws-actions/configure-aws-credentials/blob/main/CHANGELOG.md)
- [Commits](https://github.com/aws-actions/configure-aws-credentials/compare/ec61189d14ec14c8efccab744f656cffd0e33f37...d979d5b3a71173a29b74b5b88418bfda9437d885)

---
updated-dependencies:
- dependency-name: github/codeql-action
  dependency-version: 4.35.4
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: actions
- dependency-name: aws-actions/configure-aws-credentials
  dependency-version: 6.1.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: actions
...

Signed-off-by: dependabot[bot] <support@github.com>

Translations update from Fedora Weblate (#42181)

Translations update from [Fedora
Weblate](https://translate.fedoraproject.org) for
[systemd/main](https://translate.fedoraproject.org/projects/systemd/main/).

Current translation status:

![Weblate translation
status](https://translate.fedoraproject.org/widget/systemd/main/horizontal-auto.svg)

po: Translated using Weblate (Georgian)

Currently translated at 100.0% (270 of 270 strings)

Co-authored-by: Temuri Doghonadze <temuri.doghonadze@gmail.com>
Translate-URL: https://translate.fedoraproject.org/projects/systemd/main/ka/
Translation: systemd/main

po: Translated using Weblate (Polish)

Currently translated at 100.0% (270 of 270 strings)

Co-authored-by: Marek Adamski <maradam@users.noreply.translate.fedoraproject.org>
Translate-URL: https://translate.fedoraproject.org/projects/systemd/main/pl/
Translation: systemd/main

po: Translated using Weblate (Ukrainian)

Currently translated at 100.0% (270 of 270 strings)

Co-authored-by: Yuri Chornoivan <yurchor@ukr.net>
Translate-URL: https://translate.fedoraproject.org/projects/systemd/main/uk/
Translation: systemd/main

po: Translated using Weblate (Swedish)

Currently translated at 100.0% (270 of 270 strings)

Co-authored-by: Luna Jernberg <droidbittin@gmail.com>
Translate-URL: https://translate.fedoraproject.org/projects/systemd/main/sv/
Translation: systemd/main

po: Translated using Weblate (Portuguese)

Currently translated at 100.0% (270 of 270 strings)

Co-authored-by: Américo Monteiro <a_monteiro@gmx.com>
Translate-URL: https://translate.fedoraproject.org/projects/systemd/main/pt/
Translation: systemd/main

po: Translated using Weblate (Turkish)

Currently translated at 100.0% (270 of 270 strings)

Co-authored-by: Oğuz Ersen <oguz@ersen.moe>
Translate-URL: https://translate.fedoraproject.org/projects/systemd/main/tr/
Translation: systemd/main

dependabot: Ignore mkosi

It doesn't update MinimumVersion= in mkosi.conf which breaks
tools/fetch-mkosi.py.

test: pin stress-ng --vm-method to a portable scalar method in TEST-55-OOMD

The stress-ng "vm" stressor's default --vm-method=all cycles through every
VM stress method, including newer ones that use AVX-512 instructions. On
CPUs without AVX-512 support (e.g. AMD Zen 1 to 3) those methods crash with
SIGILL. In testcase_oom_rulesets_lasting_sec all 10 stress-ng workers die
within ~2.34 seconds, so by the time the 6 second sleep elapses the unit
is already in failed/exit-code state and the assert_eq for
ActiveState=active trips.

Pin --vm-method=zero-one, a long-standing scalar method, on all four
stress-ng --vm invocations in this test (the two transient services in
testcase_oom_rulesets and testcase_oom_rulesets_lasting_sec, plus
TEST-55-OOMD-testbloat.service and TEST-55-OOMD-testmunch.service) so the
workers do not crash on AVX-512-less CPUs. testbloat, testmunch and
testcase_oom_rulesets have not been observed failing because they get
OOM-killed by systemd-oomd within ~1 to 2 seconds, before stress-ng cycles
into an AVX-512 method, but they share the same latent flake.

Journal excerpts from the failing run, TEST-55-OOMD-slowrule.service in
testcase_oom_rulesets_lasting_sec (journalctl -o short-monotonic):

[   58.018676] stress-ng[1015]: invoked with '/usr/bin/stress-ng --timeout 15s --vm 10 --vm-bytes 50M --vm-keep' by user 0 'root'
[   59.866072] stress-ng[1030]: stress-ng: debug: [1030] caught SIGILL, address 0x000055bd8d609140 (ILL_ILLOPN)
[   59.921050] stress-ng[1030]: stress-ng: debug: [1030] stress-ng: info: 0x000055bd8d609140:<62>71 fd 48 6f 2d 36 14 1c 00 c5 d1 ef ed 49 29
[   59.929310] stress-ng[1015]: stress-ng: error: [1015] vm: [1021] terminated with an error, exit status=2 (stressor failed)
[   60.364111] stress-ng[1015]: stress-ng: info:  [1015] failed: 10: vm (10)
[   60.364493] stress-ng[1015]: stress-ng: info:  [1015] unsuccessful run completed in 2.34 secs
[   60.371290] systemd[1]: TEST-55-OOMD-slowrule.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
[   60.371396] systemd[1]: TEST-55-OOMD-slowrule.service: Failed with result 'exit-code'.
[   64.017061] TEST-55-OOMD.sh[1010]: + assert_eq failed active
[   64.018167] TEST-55-OOMD.sh[1039]: FAIL: expected: 'active' actual: 'failed'

The faulting bytes marked by stress-ng with <62> (the byte at the
instruction pointer) decode unambiguously to an AVX-512 VMOVDQA64 using
the 512-bit zmm13 register, confirmed independently by two disassemblers:

  $ printf '\x62\x71\xfd\x48\x6f\x2d\x36\x14\x1c\x00' | ndisasm -b 64 -
  00000000  6271FD486F2D3614  vmovdqa64 zmm13,zword [rel 0x1c1440]
           -1C00

  $ echo '0x62, 0x71, 0xfd, 0x48, 0x6f, 0x2d, 0x36, 0x14, 0x1c, 0x00' | \
        llvm-mc -disassemble -triple=x86_64 -mattr=+avx512f
          .text
          vmovdqa64       1840182(%rip), %zmm13

The leading 0x62 is the EVEX prefix (exclusive to AVX-512 on this target),
zmm13 is a 512-bit register that only exists when AVX-512 is implemented,
and VMOVDQA64 requires the AVX512F (Foundation) CPUID feature (Intel SDM
Vol 2C). Executing this on a CPU without AVX-512 raises #UD, delivered by
the kernel as SIGILL/ILL_ILLOPN, matching the journal entry above. The
same journal shows the kernel reporting "kvm_amd: TSC scaling supported",
i.e. the guest is on AMD KVM, and AMD did not ship AVX-512 before Zen 4.

Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>

vmspawn: port help() to help-util.h APIs

Update NEWS

homed/fscrypt: add new xattr format hardening key sealing (#41816)

The current key sealing format has some less-than-ideal weaknesses:

- PBKDF2 with only 65k iterations, where recommendations are ~200k
- AES with null IV, relying on salt for uniqueness
- lack of AES MAC/AEAD

However improbable, it is at least theoretically possible that, with
a lot of resources, an offline brute-force attack could be attempted.

Add a v2 sealing format, keeping unsealing compatibility with
the current format:

`v2:<iterations>:<salt>:<IV>:<ciphertext>:<aes tag>`

and use 600k iterations for the PBKDF2 sha512
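
A minimal Python sketch of reading such a value and deriving the key
follows. It is not the homed implementation: the field encoding (base64
is assumed here), the 32-byte key length, and all names are assumptions,
and the authenticated decryption step is omitted.

```python
import base64
import hashlib
from typing import NamedTuple


class SealedV2(NamedTuple):
    iterations: int
    salt: bytes
    iv: bytes
    ciphertext: bytes
    tag: bytes


def parse_v2(value):
    """Split 'v2:<iterations>:<salt>:<IV>:<ciphertext>:<aes tag>'."""
    parts = value.split(":")
    if len(parts) != 6 or parts[0] != "v2":
        raise ValueError("not a v2 sealed key")
    iterations = int(parts[1])
    # Assumption: the binary fields are base64-encoded.
    salt, iv, ciphertext, tag = (base64.b64decode(p) for p in parts[2:])
    return SealedV2(iterations, salt, iv, ciphertext, tag)


def derive_key(password, sealed, dklen=32):
    """PBKDF2 with SHA-512, using the salt and iteration count from the value."""
    return hashlib.pbkdf2_hmac("sha512", password, sealed.salt,
                               sealed.iterations, dklen)


def b64(data):
    return base64.b64encode(data).decode("ascii")


# Made-up sample value with a low iteration count to keep the example fast.
sample = "v2:1000:" + ":".join(b64(x) for x in (b"salt", b"iv", b"ct", b"tag"))
sealed = parse_v2(sample)
key = derive_key(b"password", sealed)
```

Storing the iteration count in the value itself means the count can be
raised later without another format change.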

NEWS: mention auxiliary sub-function (SF) network device naming

Document user-visible effect of the new NAMING_SUBFUNC / v261 scheme.

Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>

update TODO

udev-builtin-net_id: name auxiliary sub-function (SF) host network devices

Some drivers (currently mlx5_core) expose sub-functions (SFs) of a PCI
Physical Function as auxiliary devices. Each SF carries a host network
interface that sits below the aux device in sysfs:

  /sys/devices/.../<PF BDF>/mlx5_core.sf.<idx>/net/eth<N>

Because the network device's immediate parent is the aux device and not
a PCI device, names_pci() bails out and these interfaces fall through
to the kernel-assigned eth<N> name, which is not stable across reboots,
module reloads or topology changes.

The naming applies when the SF network device's direct sysfs parent is
the aux device that exposes sfnum, i.e. the kernel driver passes the
aux device to SET_NETDEV_DEV(). mlx5_core does so. ice's
ice_sf_cfg_netdev() currently passes the parent PF's PCI device, so ice
SF network devices sit as siblings of the PF rather than below the aux
device and fall outside this precondition; pending a kernel change in
ice to mirror mlx5's SET_NETDEV_DEV(netdev, &adev->dev), they continue
to receive the kernel-assigned name as they do today.

The aux device exposes 'sfnum', the user-defined sub-function number
(the value passed to "devlink port add ... sfnum N"), which is stable
and unique within its parent PF. The aux device's direct sysfs parent
is the PF's PCI device.

Treat an SF host network device analogously to an SR-IOV VF host
network device: walk to the parent PCI function, derive the base name
from there, then append a single-character "S<sfnum>" suffix. Lowercase
's' is already taken (slot) and the existing grammar uses one character
per token, so 'S' is the best option.

E.g. for an SF whose parent PF is at PCI 0000:c1:00.0 and which was
added with "sfnum 88":

  ID_NET_NAME_PATH=enp193s0f0S88
  ID_NET_NAME_SLOT=enp193s0f0S88

This is parallel to how SR-IOV VFs get a "v<N>" suffix on top of the
parent PF's name.
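
How the name is composed can be sketched in Python. This is a simplified
model rather than the net_id code: it assumes PCI domain 0, always emits
the f<function> token, and uses hypothetical function names.

```python
def pci_base_name(bdf):
    """Derive 'enp<bus>s<slot>f<function>' from a PCI address such as
    '0000:c1:00.0' (domain 0 assumed)."""
    _domain, bus, rest = bdf.split(":")
    slot, function = rest.split(".")
    return f"enp{int(bus, 16)}s{int(slot, 16)}f{int(function, 16)}"


def sf_name(pf_bdf, sfnum, vf=None):
    """Append the optional 'v<N>' VF suffix and the 'S<sfnum>' SF suffix."""
    name = pci_base_name(pf_bdf)
    if vf is not None:
        name += f"v{vf}"
    return f"{name}S{sfnum}"


name = sf_name("0000:c1:00.0", 88)
```

Bus 0xc1 is 193 in decimal, which gives the enp193s0f0S88 name from the
example above.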

Gate the new behaviour behind NAMING_SUBFUNC and NAMING_V261. Document
the new suffix in both the ID_NET_NAME_SLOT and ID_NET_NAME_PATH
grammars in systemd.net-naming-scheme(7) and add a v261 history entry.

Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>

udev-builtin-path_id: emit 'sf-<N>' token for auxiliary sub-functions

Some drivers expose sub-functions (SFs) of a PCI Physical or Virtual
Function as auxiliary devices that carry a stable 'sfnum' sysfs
attribute — the user-defined sub-function number (e.g. the value
passed to "devlink port add ... sfnum N"). The SF's leaf devices
(uverbs, infiniband, net, ...) sit below this aux device in sysfs:

  /sys/devices/.../<PF or VF BDF>/<sf aux>/.../<leaf>

Currently path_id walks straight past the aux device, so all leaf
devices below an SF end up sharing ID_PATH=pci-<BDF> with their
parent PF or VF. For uverbs this causes a /dev/infiniband/by-path/
symlink collision, and for any other consumer of ID_PATH/ID_PATH_TAG
it makes PF and SF (or VF and VF-SF) indistinguishable.

Recognise the 'auxiliary' subsystem in path_id's walk: when the aux
device exposes 'sfnum', prepend an 'sf-<N>' token; otherwise leave
it untokenised. The result for an SF whose parent PF is at PCI
0000:c1:00.0 and which was added with "sfnum 88" is:

  ID_PATH=pci-0000:c1:00.0-sf-88
  ID_PATH_TAG=pci-0000_c1_00_0-sf-88

This is parallel to how net_id's NAMING_SUBFUNC scheme appends 'S<N>'
on top of the PF base name. Aux devices without 'sfnum' keep the
pre-patch behaviour: the walk skips over them with no token. Existing
ID_PATH values for PF and VF leaf devices are therefore unchanged,
the change is purely additive, and there is no need to gate it behind
a naming scheme (path_id itself is unversioned).
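The two values shown above can be reproduced with a short Python sketch. It is an approximation rather than path_id's actual logic: id_path() and id_path_tag() are hypothetical helpers, and the tag rule used here (keep alphanumerics and '-', map everything else to '_') is a simplification assumed for this example.

```python
from typing import Optional


def id_path(pci_bdf: str, sfnum: Optional[int] = None) -> str:
    """Build an ID_PATH-like string for a leaf device below pci_bdf.

    When the aux device on the way exposes sfnum, an 'sf-<N>' token is
    added; without sfnum the aux device contributes no token.
    """
    path = f"pci-{pci_bdf}"
    if sfnum is not None:
        path += f"-sf-{sfnum}"
    return path


def id_path_tag(path: str) -> str:
    """Derive an ID_PATH_TAG-like string (simplified rule, see lead-in)."""
    return "".join(c if c.isalnum() or c == "-" else "_" for c in path)
```

Leaf devices of the PF itself keep "pci-0000:c1:00.0", while those below the SF get the extra token, which is what removes the by-path collision described earlier.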

Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>

some NEWS fixed by Claude

homed/fscrypt: add new xattr format hardening key sealing

The current key sealing format has several weaknesses:

- PBKDF2 with only 65k iterations, where recommendations are ~200k
- AES with null IV, relying on salt for uniqueness
- lack of AES MAC/AEAD

However improbable, it is at least theoretically possible that an
offline brute-force attack could be attempted with enough resources.

Add a v2 sealing format, keeping unsealing compatibility with
the current format:

v2:<iterations>:<salt>:<IV>:<ciphertext>:<aes tag>

and use 600k iterations for PBKDF2 with SHA-512.
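A rough Python sketch of handling such a value is given below. Several details are assumptions not stated in the commit message: that the binary fields are base64-encoded, that the derived key is 32 bytes long, and the names SealedKeyV2, parse_v2() and derive_key(), which are hypothetical. The AEAD decryption step is left out because it needs a third-party library.

```python
import base64
import hashlib
from typing import NamedTuple


class SealedKeyV2(NamedTuple):
    iterations: int
    salt: bytes
    iv: bytes
    ciphertext: bytes
    tag: bytes


def parse_v2(value: str) -> SealedKeyV2:
    """Split 'v2:<iterations>:<salt>:<IV>:<ciphertext>:<aes tag>'."""
    fields = value.split(":")
    if len(fields) != 6 or fields[0] != "v2":
        raise ValueError("not a v2 sealed key")
    iterations = int(fields[1])
    if iterations <= 0:
        raise ValueError("invalid iteration count")
    # Assumption: binary fields are base64; the real encoding may differ.
    salt, iv, ciphertext, tag = (
        base64.b64decode(f, validate=True) for f in fields[2:]
    )
    return SealedKeyV2(iterations, salt, iv, ciphertext, tag)


def derive_key(password: bytes, sealed: SealedKeyV2, key_len: int = 32) -> bytes:
    """Derive the unsealing key with PBKDF2-HMAC-SHA512."""
    return hashlib.pbkdf2_hmac(
        "sha512", password, sealed.salt, sealed.iterations, dklen=key_len
    )
```

Because the iteration count travels with the sealed value, a reader of this layout would not need to hard-code 600k when unsealing.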

fstab-generator: fix spurious quota warning for xfs

Filesystems like xfs, btrfs, gfs2 and ocfs2 handle quotas internally
and do not need external quotacheck/quotaon services. When usrquota or
grpquota mount options are used in fstab for these filesystems,
generator_hook_up_quotacheck() falls through to the !fstype_needs_quota()
branch and emits a misleading warning that quotas are "not supported"
when they actually work fine — the kernel handles them internally.

Add fstype_has_internal_quota() to return early with a debug message,
and adopt a tri-state return convention so the caller skips quotaon
when quotacheck was not needed.
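The control flow can be sketched in Python as follows. This is a reading of the description above rather than the generator's code: hook_up_quotacheck() and plan_units() are hypothetical, the list of filesystems needing external quota tools is an illustrative subset, only the two mount options named above are checked, and the third state (a negative value for errors) is not modelled.

```python
from typing import List, Optional, Tuple

# Filesystems named in the commit message as handling quotas internally.
INTERNAL_QUOTA_FSTYPES = {"xfs", "btrfs", "gfs2", "ocfs2"}
# Assumption: illustrative subset of filesystems needing external tools.
EXTERNAL_QUOTA_FSTYPES = {"ext2", "ext3", "ext4"}


def hook_up_quotacheck(fstype: str, options: str) -> Tuple[int, Optional[str]]:
    """Return (1, None) if quotacheck was hooked up, (0, log) if not needed."""
    opts = set(options.split(","))
    if not opts & {"usrquota", "grpquota"}:
        return 0, None
    if fstype in INTERNAL_QUOTA_FSTYPES:
        # Early return with a debug message instead of a warning.
        return 0, f"debug: {fstype} handles quotas internally"
    if fstype not in EXTERNAL_QUOTA_FSTYPES:
        return 0, f"warning: quota not supported on {fstype}"
    return 1, None


def plan_units(fstype: str, options: str) -> List[str]:
    """Caller side: pull in quotaon only when quotacheck was hooked up."""
    r, _log = hook_up_quotacheck(fstype, options)
    return ["quotacheck", "quotaon"] if r > 0 else []
```

Under these assumptions, plan_units("xfs", "defaults,usrquota") returns an empty list and produces no warning.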

The buggy code path was introduced in #24824 and #24880.

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

test: add test coverage for homed+fscrypt

Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>

update TODO

test: add integration tests for io.systemd.Job varlink methods

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

test: wait for systemd to finish reexec in TEST-74-AUX-UTILS.varlinkctl.sh

test: split TEST-74-AUX-UTILS.varlinkctl.sh into per-interface subtests

Split the monolithic varlinkctl test script into separate files per varlink interface for better organization and easier maintenance:
- varlinkctl.sh: core varlinkctl tool tests (CLI, transports, socket discovery, upgrade/serve) and io.systemd.Manager
- varlinkctl-network.sh: io.systemd.Network
- varlinkctl-unit.sh: io.systemd.Unit (system + user manager)
- varlinkctl-metrics.sh: io.systemd.Metrics

No functional changes — the test content is moved as-is.

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

core: introduce io.systemd.Job interface with List, Cancel, and ClearAll methods

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

shared: extend Job varlink type with Unit and ActivationDetails fields

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

coredump: add COREDUMP_CODE field for signal reason

Introduce COREDUMP_CODE as a new captured field alongside the existing
COREDUMP_SIGNAL. While COREDUMP_SIGNAL identifies the signal number that
terminated the process, COREDUMP_CODE provides the reason the signal was sent.

For example, a process terminated by SIGSEGV due to invalid permissions would
produce COREDUMP_SIGNAL=11 and COREDUMP_CODE=2 (SEGV_ACCERR).

The kernel exposes coredump_code via pidfd starting with v7.1:
https://git.kernel.org/torvalds/c/701f7f4fbabbf4989ba6fbf033b160dd943221d5

System administrators can find both the signal and code in coredumpctl info:

$ coredumpctl info | grep Signal:
Signal: 11 (SEGV) si_code: SEGV_MAPERR

Signed-off-by: Emanuele Rocca <emanuele.rocca@arm.com>

signal-util: add signal_code_to_string

Add signal_code_to_string() in signal-util.c and cover the si_code values
defined in libc's siginfo-consts.h. Fall back to the numeric value when no
symbolic name is known.
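The lookup-with-fallback behaviour can be illustrated in Python. The real function is C and its exact signature is not shown here, so the two-argument form below is an assumption. The table covers only a handful of Linux si_code values.

```python
import signal

# Small subset of Linux si_code values, keyed by signal number.
_SI_CODE_NAMES = {
    signal.SIGSEGV: {1: "SEGV_MAPERR", 2: "SEGV_ACCERR"},
    signal.SIGBUS: {1: "BUS_ADRALN", 2: "BUS_ADRERR", 3: "BUS_OBJERR"},
    signal.SIGFPE: {1: "FPE_INTDIV", 2: "FPE_INTOVF", 3: "FPE_FLTDIV"},
    signal.SIGILL: {1: "ILL_ILLOPC", 2: "ILL_ILLOPN"},
}


def signal_code_to_string(signo: int, code: int) -> str:
    """Return the symbolic si_code name, or the number as a string."""
    name = _SI_CODE_NAMES.get(signo, {}).get(code)
    return name if name is not None else str(code)
```

The signal number is part of the key because the same numeric code means different things for different signals: code 1 is SEGV_MAPERR for SIGSEGV but BUS_ADRALN for SIGBUS.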

Co-developed-by: Codex (GPT-5) <noreply@openai.com>
Signed-off-by: Emanuele Rocca <emanuele.rocca@arm.com>

include: add coredump_code to struct pidfd_info

Linux v7.1 adds coredump_code to struct pidfd_info and defines a few new
constants. Reflect the changes in include/override/sys/pidfd.h too.

Stop including the libc version of sys/pidfd.h to be able to override the
definition of pidfd_info.

Signed-off-by: Emanuele Rocca <emanuele.rocca@arm.com>

Various dlopen/linking cleanups from #42100 (#42166)

- **bpf-util: rename from bpf-dlopen, unify version-specific symbol
handling**
- **cryptsetup: dlopen libcryptsetup in tokens**
- **tree-wide: dlopen libpam in pam plugins**
- **test-bus-marshal: dlopen() glib and libdbus instead of linking
directly**
- **lock-util: Simplify timeout for lock_generic_with_timeout()**
- **color-util: simplify hsv_to_rgb, fix rgb_to_hsv negative-hue wrap**
- **tree-wide: Use our own macros instead of fabs()/fmax()/fmin()**
- **locale-util: dlopen() libintl instead of linking against it**
- **home: Use log2u64() over log2()**
- **meson: drop libdl, threads, and librt dependencies**
- **libc: Use dlsym() from a constructor instead of weak symbols**
- **libc: Make sure C23 versions of strtol(), sscanf() are not used**