git.ipfire.org Git - thirdparty/systemd.git/log

repart: Fix growing the partition preceding a FreeArea from leftover space

The "Donate to preceding partition" logic is dead code since commit
19903a4 ("repart: split out context_grow_partition_one()").
context_grow_partition_one() gets passed a free area and a partition, and it has
an early-return check to ensure the partition it got passed belongs to the free
area it got passed. That means we compare the FreeArea a to the FreeArea
a->after->allocated_to_area, which always yields FALSE.

Fix the behavior of donating any left over space to the preceding partition
by adding that partition to the loop below (and relying on the partitions list
being ordered according to physical partition offsets).

Since this behavior is not that easy to trigger, mention how to trigger it in a
comment, and add a test for it as well.

process-util: introduce prctl_safe() and port everything over to it (#43006)

prctl() is an API full of pitfalls: it is variadic, and some interfaces
don't expect zero-initialization of excess arguments, and others do.
Moreover, the parameters are "long", and nonetheless we usually pass
"int" to them. If we are too dumb to call it properly, let's just not
call it directly anymore, but let's add a wrapper around it that makes
the function non-variadic and declares the right types. Then, let's port
over everything to it.

This is inspired by #42996, but we have had issues with this many times
before, and looking at this PR one can see that we still have the issue
in numerous other places.
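
As a rough illustration of the idea, a fixed-arity wrapper could look like
the sketch below. The name prctl_fixed(), its signature and its negative-errno
return convention are assumptions made for this example; the log does not
show the actual prctl_safe() prototype.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Hypothetical wrapper, not the real prctl_safe(): fixed arity, every
 * argument is an unsigned long, and unused arguments have to be passed
 * as 0 explicitly. Returns the prctl() result, or -errno on failure. */
static int prctl_fixed(int option, unsigned long arg2, unsigned long arg3,
                       unsigned long arg4, unsigned long arg5) {
        int r = prctl(option, arg2, arg3, arg4, arg5);
        return r < 0 ? -errno : r;
}

/* PR_SET_NO_NEW_PRIVS insists that arg3..arg5 are zero. */
static int set_no_new_privs(void) {
        return prctl_fixed(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
}
```

Because the wrapper is not variadic, the compiler converts each argument to
unsigned long and complains when one is missing.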

cryptsetup: add Argon2id-based PIN mode for TPM2 enrollment (#41859)

The current TPM2 PIN mode is flawed as a compromised TPM directly exposes
the sealed secret which is the LUKS volume key itself
(https://github.com/systemd/systemd/pull/27502 and
https://github.com/systemd/systemd/issues/37386).

Goal: add Argon2id-based PIN hardening to TPM2 enrollment, making
the TPM a second factor rather than a single point of failure:

1. Password + salt → Argon2id → 512-bit key split into Key1 + Key2
2. Key2 (base64-encoded) is used as the PIN to seal a random secret
   in the TPM
3. Key1 + unsealed secret → HKDF-SHA256 → final LUKS volume key

This implementation ensures that if the TPM is compromised, an attacker
still needs the password to derive Key1 and combine it with the unsealed
secret.

The --tpm2-with-pin= option now accepts three values:
- false (no PIN used)
- true (PIN hardened with Argon2id - default)
- "direct" (legacy PIN without Argon2id for backward compatibility)
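
A minimal sketch of how such a three-valued option could be mapped onto an
enum. The type and function names are hypothetical, and only the three
spellings listed above are handled; the real parser may accept more.

```c
#include <string.h>

/* Hypothetical names, for illustration only. */
typedef enum {
        PIN_MODE_NONE,      /* "false": no PIN used */
        PIN_MODE_ARGON2ID,  /* "true": PIN hardened with Argon2id */
        PIN_MODE_DIRECT,    /* "direct": legacy PIN without Argon2id */
        PIN_MODE_INVALID = -1,
} PinMode;

static PinMode pin_mode_from_string(const char *s) {
        if (!s)
                return PIN_MODE_INVALID;
        if (strcmp(s, "false") == 0)
                return PIN_MODE_NONE;
        if (strcmp(s, "true") == 0)
                return PIN_MODE_ARGON2ID;
        if (strcmp(s, "direct") == 0)
                return PIN_MODE_DIRECT;
        return PIN_MODE_INVALID;
}
```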

Argon2id parameters are customizable via:

--tpm2-argon2id-memory=
--tpm2-argon2id-iterations=
--tpm2-argon2id-parallelism=
--tpm2-argon2id-iter-time=

These default to a function of available CPUs and physical memory, with
a benchmark that scales iterations to the target time (default: 2s) and
falls back to ARGON2ID_PARAMETERS_DEFAULT (64 MiB, 8 iter, 4 lanes) when
auto detection fails.
Also if the runtime OpenSSL lacks Argon2id support (< 3.2), the feature
silently falls back to direct PIN mode with a warning.

Added includes:
- src/cryptenroll/cryptenroll.c: cpu-set-util.h, limits-util.h, time-util.h
  for Argon2id benchmark auto-tuning (cpus_online, physical_memory_scale,
  now/usec_t)
- src/cryptenroll/cryptenroll-tpm2.c: crypto-util.h for Argon2IdParameters
  struct in load_volume_key_tpm2()
- src/shared/tpm2-util.h: crypto-util.h for Argon2IdParameters in
  tpm2_make_luks2_json() API
- src/cryptsetup/cryptsetup-tokens/luks2-tpm2.c: crypto-util.h for
  kdf_argon2id_derive()/kdf_hkdf_sha256() on the token unlock path

sysupdate: add support for --component-all + --feature-all + --feature-suggested to enable-feature verb

This adds concepts similar to those we already have for enable-component:
let's add a way to enable all features or the suggested ones, and
possibly on all components.

Or in other words, with this:

systemd-sysupdate enable-component -S
systemd-sysupdate enable-feature -A -s

We'll automatically enable all suggested components, and all features of
them.

sysupdate: validate feature name syntax before using it

Let's tighten the rules on features, and enforce a sensible naming
policy.

sysupdate: add feature select

Preparation for the next commit: also allow selecting all or only the
suggested features.

sysupdate: implement --component-all and --component-suggested for enable-component and disable-component

Let's now introduce "systemd-sysupdate enable-component
--component-suggested" and "systemd-sysupdate enable-component
--component-all" for enabling all suggested or all components at once.
Similarly, when used with disable-component, they disable all components
that are not suggested, or all components, respectively.

sysupdate: introduce --component-suggested similar to --component-all

While --component-all really picks all components, --component-suggested
picks only the suggested ones.

Note that none of the verbs currently implement the concept; they will
all refuse the option. Hooking this up is going to be added next.

sysupdate: add a "suggests" concept to features and components

Let's make it possible to "suggest" that certain features or components
are enabled under some conditions.

For this, both features and components gain two things:

1. A Suggested= field which takes a boolean. If true the
   feature/component will be suggested for installation, if false it
   will not.

2. A set of SuggestedOnXYZ= settings, modelled after ConditionXYZ= in
   unit files (and implementing a subset of them), which suggest some
   component/feature under specific conditions.

The result of the condition is shown in the various output tools.

timer: avoid re-arming WakeSystem=yes timer after suspend

There's an edge case where a single-shot timer using WakeSystem=yes
could get re-armed after elapsing. For example, when a timer is created
using:

$ systemd-run --user --on-active="1m" --timer-property=WakeSystem=yes flatpak run io.bassi.Amberol

and the system is then suspended, the following sequence of events may
happen:

Jul 08 06:57:25 systemd[2640]: run-p192456-i199160.timer: Installed new job run-p192456-i199160.timer/start as 7675
Jul 08 06:57:25 systemd[2640]: run-p192456-i199160.timer: Enqueued job run-p192456-i199160.timer/start as 7675
Jul 08 06:57:25 systemd[2640]: run-p192456-i199160.timer: Monotonic timer elapses in 59.999999s.
Jul 08 06:57:25 systemd[2640]: run-p192456-i199160.timer: Changed dead -> waiting
Jul 08 06:57:25 systemd[2640]: run-p192456-i199160.timer: Job 7675 run-p192456-i199160.timer/start finished, result=done
Jul 08 06:57:25 systemd[2640]: Started [systemd-run] /usr/bin/flatpak run io.bassi.Amberol.
Jul 08 06:58:13 systemd[2640]: run-p192456-i199160.timer: Time change, recalculating next elapse.
Jul 08 06:58:13 systemd[2640]: run-p192456-i199160.timer: Monotonic timer elapses in 12.785674s.
Jul 08 06:58:26 systemd[2640]: run-p192456-i199160.timer: Timer elapsed.
Jul 08 06:58:26 systemd[2640]: run-p192456-i199160.timer: Changed waiting -> running
Jul 08 06:58:29 systemd[2640]: run-p192456-i199160.timer: Got notified about unit deactivation.
Jul 08 06:58:29 systemd[2640]: run-p192456-i199160.timer: Monotonic timer elapses in 33.544681s.
Jul 08 06:58:29 systemd[2640]: run-p192456-i199160.timer: Changed running -> waiting
Jul 08 06:59:48 systemd[2640]: run-p192456-i199160.timer: Timer elapsed.

1) The timer is armed at 06:57:25. timer_enter_waiting()
   calculates the following values:

   base = M0 (current CLOCK_MONOTONIC)
   usec_shift_clock(base, CLOCK_MONOTONIC, CLOCK_BOOTTIME_ALARM) = M0 (no delta yet)
   v->next_elapse = M0 + 60s

   The timer is on CLOCK_BOOTTIME_ALARM at M0 + 60s

2) System suspends. If the system ran for ~12s and was suspended for
   ~36s, at resume we'd get:

   CLOCK_MONOTONIC = M0 + 12s
   CLOCK_BOOTTIME = M0 + 12s + 36s = M0 + 48s

3) timer_time_change() fires and calls timer_enter_waiting(t, true)

   Because time_change=true, v->next_elapse is not recalculated and
   keeps the original M0 + 60s value. The timer then correctly computes
   the remaining time as ~12.8 seconds:

        Jul 08 06:58:13 ...: Monotonic timer elapses in 12.785674s.

4) Timer elapses at 06:58:26

        Jul 08 06:58:26 ...: Timer elapsed.

   timer_enter_running() sets last_trigger.monotonic to CLOCK_MONOTONIC
   (which equals M0 + 12s before suspend + 13s after suspend, thus
   M0 + 25s)

5) Unit deactivates at 06:58:29

   timer_trigger_notify() calls timer_enter_waiting(t, false) -
   time_change=false; that means that this time v->next_elapse gets
   recalculated:

   base = inactive_exit_timestamp.monotonic = M0 (i.e. when the timer was originally armed)
   usec_shift_clock(M0, CLOCK_MONOTONIC, CLOCK_BOOTTIME_ALARM)
        a = now(CLOCK_MONOTONIC) = M0 + 28s
        b = now(CLOCK_BOOTTIME_ALARM) = M0 + 64s
        result = b - (a - M0) = (M0 + 64s) - 28s = M0 + 36s => the time spent in suspend

   Hence v->next_elapse = (M0 + 36s) + 60s = M0 + 96s

   This then skips disabling of one-shot timers in the following check,
   because the expression

        v->next_elapse < triple_timestamp_by_clock(&ts, TIMER_MONOTONIC_CLOCK(t))

   is false, because the v->next_elapse (M0 + 96s) is not less than the
   current time (M0 + 64s), so the timer is re-armed again:

        Jul 08 06:58:29 ...: Monotonic timer elapses in 33.544681s

Let's mitigate this by skipping the next elapse timestamp recalculations
for one-shot timers for which we've already calculated the value in this
activation cycle.
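
The arithmetic of step 5 can be replayed with a small toy model. The helper
names below are hypothetical and all values are in seconds rather than
microseconds; shift_clock() only mirrors the "b - (a - x)" computation quoted
above and is not the real usec_shift_clock().

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy clock shift: translate timestamp x from one clock to another, given
 * the current reading of both clocks (hypothetical helper). */
static uint64_t shift_clock(uint64_t x, uint64_t now_from, uint64_t now_to) {
        if (x > now_from)
                return now_to + (x - now_from);
        uint64_t age = now_from - x;
        return age > now_to ? 0 : now_to - age;
}

/* Step 5: base = M0, shifted to the boottime clock, plus the 60s interval. */
static uint64_t recalculated_next_elapse(uint64_t m0) {
        uint64_t now_monotonic = m0 + 28, now_boottime = m0 + 64;
        return shift_clock(m0, now_monotonic, now_boottime) + 60;
}

/* The check that is supposed to disable an already elapsed one-shot timer. */
static bool treated_as_elapsed(uint64_t next_elapse, uint64_t now) {
        return next_elapse < now;
}
```

With M0 = 1000 the recalculated value is 1096 (M0 + 96s), which is not below
the current boottime reading of 1064, so the timer is armed again. The
original value of 1060 (M0 + 60s) would have passed the check.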

Resolves: #42929

resolved: publish browsed service after initialization

dns_add_new_service() links a _cleanup_ service into the browser
before copying its record and registering its maintenance timer has fully
succeeded. A timer setup failure then frees the service while leaving the
list head pointing at it.

Follow-up for 8458b7fb91ea5e5109b6f3c94f8a781a120c798b

tree-wide: port to prctl_safe()

nsresource: port from PR_GET_NAME to pid_get_comm()

Similar reasons as for the memfd change: let's normalize behaviour and
always use pid_get_comm().

This adds escaping here for the first time, which is a good thing.

memfd-util: port to pid_get_comm()

Let's use the common function for querying the local thread name.

This changes the escaping rules when the thread name is not quite
kosher: previously we'd just escape invalid UTF-8 characters, now we do
what we usually do: also escape control characters and such, and limit
ourselves to ASCII.

The description generated here is mostly for debug purposes, and process
names should normally not require this escaping anyway (it's mostly
paranoia), hence I think this change in behaviour should be fine, it's
not part of the API in any form.

process-util: introduce proc_get_timerslack()

process-util: introduce trivial proc_set_nnp() helper

process-util: introduce proc_set_comm() helper

This operation is done at a bunch of places, let's add a type-safe
helper for it.

process-util: introduce prctl_safe()

udev: fix several option parsing edge cases (#42997)

Boundary tests were conducted on the udev subsystem, and some issues
were identified and resolved.

repart: Some fixes for --copy-from= (#42976)

A bunch of things I noticed that aren't correct about `--copy-from` and
grain sizes + paddings.

sd-varlink: async upgrade support (#42974)

With 2c6f9af8e5425c2086fbc8ca496843f162e4af9b sd-varlink gained protocol
upgrade support, however only in a synchronous fashion. This adds
asynchronous protocol upgrade for the server side, thus enabling
multiplexing daemons (i.e. those that handle multiple connections from
the same event loop) to support protocol upgrades too.

I plan to use this in the upcoming "ptybroker" component that allows
acquiring a pty through varlink.

/cc @mvo5

libudev: replace unique list entries in place

udev_list_entry_add() freed the existing entry for a name before inserting
its replacement. Today that cannot lose the old entry, but only because
re-adding the key that was just removed never makes the hashmap grow, which
is an internal detail of the hashmap. Use hashmap_ensure_replace() instead:
it updates the existing bucket in place, making it explicit that replacing
an entry neither allocates nor can drop the previous value under OOM.

Follow-up for c01130824f22ec3835a35c8ab5f9aea65195a40f

udev/net: reset the config list head in link_configs_free()

link_configs_free() freed every LinkConfig via LIST_FOREACH but left
ctx->configs pointing at the (freed) former list head. That is harmless for
link_config_ctx_free(), which frees the context immediately afterwards, but
link_config_load() calls link_configs_free() up front to clear the context
before (re)loading — and the dangling head then makes the subsequent
LIST_PREPEND() and the eventual teardown dereference freed memory.

Use LIST_CLEAR(), which pops and frees each entry and resets the head to NULL.
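
The difference can be sketched with a plain singly linked list. Node,
list_prepend() and list_clear() are hypothetical stand-ins, not the systemd
LIST_* macros; the point is only that the clearing helper leaves the head at
NULL so the list can be reused safely.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a LinkConfig list entry. */
typedef struct Node {
        struct Node *next;
} Node;

static int list_prepend(Node **head) {
        Node *n = calloc(1, sizeof(Node));
        if (!n)
                return -1;
        n->next = *head;
        *head = n;
        return 0;
}

/* Pops and frees every entry; afterwards *head is NULL, never dangling. */
static void list_clear(Node **head) {
        while (*head) {
                Node *n = *head;
                *head = n->next;
                free(n);
        }
}
```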

Follow-up for af6f0d422c521374ee6a2dd92df5935a5a476ae5

repart: Don't copy trailing padding when using --copy-from=

Currently, --copy-from= copies the paddings in between the source partitions,
as well as the trailing padding that is at the end of the source partition table.
It doesn't copy the leading padding at the start of the source partition table
though.

This seems inconsistent, and likely it was an oversight that the trailing padding
is copied.

Fix that, and add a test to ensure we don't regress.

repart: Clarify and test that --copy-from= argument respects grain size

The --copy-from= argument currently is documented as "copied partitions will have
the same size". This doesn't hold true in the case where a different grain-size is
passed to repart. Because `partition_min/max_size()` currently do rounding, the
size is implicitly rounded to grain size, and therefore partitions are enlarged
to align to grain size whenever possible.

Clarify this behavior and change the manpage, and also add a test for it.
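
The effect of the implicit rounding can be illustrated with a small helper.
round_up_to_grain() is a hypothetical name, and the sketch assumes a non-zero
grain and ignores overflow.

```c
#include <stdint.h>

/* Round size up to the next multiple of grain (grain must be non-zero). */
static uint64_t round_up_to_grain(uint64_t size, uint64_t grain) {
        uint64_t rest = size % grain;
        return rest == 0 ? size : size + (grain - rest);
}
```

For example, a source partition of 5 MiB + 512 bytes copied with a 4 KiB grain
size ends up as 5 MiB + 4 KiB, while a grain size of 1 byte leaves every size
untouched.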

repart: Don't get old grain size from fdisk for --copy-from=

This is a little bit confusing, but grain size is not actually stored in the gpt
metadata. Rather, fdisk's `get_grain_size()` returns an autodiscovered "optimal io
size" value as grain size. This might not actually be the grain size that the
disk we're copying is using.

Since we're setting the padding of the copied partitions using that value from
fdisk, we're rounding the new paddings by fdisk's optimal grain size, which is
usually 1MiB (a lot more than the default 4KiB that we're using otherwise).

Set the grain size here to 1 byte instead, ensuring that the min/max padding set
is exactly the padding that was present before.

Also add a test to confirm the behavior is fixed: The test calls --copy-from= on
an existing disk with 4MiB grain size, and because we pass --grain-size=512, now
no rounding should happen and the paddings should be transferred to exactly the
same size.

nsresourced: add user.userdb.* xattrs to nsresourced userdb socket

udevadm-trigger: reject invalid wait-daemon timeout

Return from argument parsing when --wait-daemon= cannot be parsed.

This matches the other timeout options.

Reproducer:
  udevadm trigger --dry-run --wait-daemon=bad

Before:
  Failed to parse timeout value 'bad', ignoring: Invalid argument

Follow-up:
  2001622c58c1989f386086d11bd2a00d5fe00a30

udev-config: merge configured children max

Merge children_max from the per-source UdevConfig values before
defaulting it from system resources.
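
A minimal sketch of "merge first, default last". The function name, the use of
0 for "not configured" and the priority order of the sources are assumptions
made for this example.

```c
#include <stddef.h>

/* by_source[] lists the per-source values from highest to lowest priority
 * (the order is an assumption); 0 means "not configured". The fallback
 * derived from system resources is used only if no source set a value. */
static unsigned merge_children_max(const unsigned *by_source, size_t n,
                                   unsigned fallback) {
        for (size_t i = 0; i < n; i++)
                if (by_source[i] != 0)
                        return by_source[i];
        return fallback;
}
```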

Reproducer:
  systemd-udevd --debug --children-max=1

Before:
  Set children_max to 32

Follow-up:
  04fa5c1580ad388af3477ecfd7d4aa7d7f5cee30

man/udevadm: update device-id-of-file arguments

udevadm info rejects positional devices together with
--device-id-of-file=.

Document that behavior instead of saying positional arguments are
ignored.

Reproducer:
  udevadm info --device-id-of-file=/etc/passwd /sys

Before:
  Devices are not allowed with -d/--device-id-of-file and -c/--cleanup-db.

Follow-up:
  31767b92a0b3980e0ae8a0f44715f86e72c35f77

udevadm-settle: reject positional arguments

The settle command does not define positional arguments.

Reject them during argument parsing instead of silently ignoring
them.

Reproducer:
  udevadm settle /no/such/argument

Before:
  The command exits successfully and ignores the argument.

Follow-up:
  c71509028fc2741b25dd537f9e1b7e8896df745e

udevadm-wait: let --removed override initialization

The documentation says --initialized= is ignored when --removed is
specified.

Track --removed separately so it wins regardless of option order.
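
One way to make the result independent of option order is to record both
options and resolve them only after parsing. The enum and function below are
hypothetical and recognize just the spellings used in this sketch.

```c
#include <stdbool.h>
#include <string.h>

typedef enum {
        WAIT_UNTIL_INITIALIZED,
        WAIT_UNTIL_ADDED,
        WAIT_UNTIL_REMOVED,
} WaitUntil;

/* Hypothetical parser: --removed is tracked in its own flag, so a later
 * --initialized= cannot overwrite it. */
static WaitUntil parse_wait_until(int argc, const char *const *argv) {
        bool removed = false, initialized = true;

        for (int i = 0; i < argc; i++) {
                if (strcmp(argv[i], "--removed") == 0)
                        removed = true;
                else if (strcmp(argv[i], "--initialized=no") == 0)
                        initialized = false;
                else if (strcmp(argv[i], "--initialized=yes") == 0)
                        initialized = true;
        }

        if (removed)
                return WAIT_UNTIL_REMOVED;
        return initialized ? WAIT_UNTIL_INITIALIZED : WAIT_UNTIL_ADDED;
}
```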

Reproducer:
  udevadm wait --timeout=0 --removed --initialized=no /dev/no-such-test-device

Before:
  Timed out for waiting devices being added.

Follow-up:
  aa2b0d8d291a1f1dc2b50016c076ff8196989f84

udevadm-info: allow valueless attr filters

The --attr-match= and --attr-nomatch= filters are documented as
FILE[=VALUE], like the trigger filters.

Accept a bare attribute name for sysfs attribute presence matches.
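
A sketch of parsing the FILE[=VALUE] syntax with an optional value. The
function name and its error codes are assumptions; a NULL value stands for a
pure presence match.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Splits "FILE[=VALUE]" at the first '='. On success the caller owns
 * *ret_attr and *ret_value; *ret_value is NULL for a bare attribute name. */
static int parse_attr_filter(const char *arg, char **ret_attr, char **ret_value) {
        const char *eq = strchr(arg, '=');
        size_t attr_len = eq ? (size_t) (eq - arg) : strlen(arg);
        char *attr, *value = NULL;

        if (attr_len == 0)
                return -EINVAL;

        attr = malloc(attr_len + 1);
        if (!attr)
                return -ENOMEM;
        memcpy(attr, arg, attr_len);
        attr[attr_len] = '\0';

        if (eq) {
                size_t value_len = strlen(eq + 1);

                value = malloc(value_len + 1);
                if (!value) {
                        free(attr);
                        return -ENOMEM;
                }
                memcpy(value, eq + 1, value_len + 1);
        }

        *ret_attr = attr;
        *ret_value = value;
        return 0;
}
```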

Reproducer:
  udevadm info --export-db --attr-match=ifindex

Before:
  Expected <ATTR>=<value> instead of 'ifindex'

Follow-up:
  a6b4b2fa010f6dc5e18f1a14d93204c6c1416278

udev: avoid reading before empty capability masks

When input_id prints a decoded capability bitmap, an all-zero
bitmap can decrement the index to zero.

Check the index before reading bitmask[val - 1].
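
The pattern can be sketched as follows. The function is hypothetical and not
the actual input_id code; it only shows the index being tested before the
array is read.

```c
#include <stddef.h>

/* Returns the number of words up to and including the highest non-zero
 * one, or 0 for an all-zero bitmask. "val > 0" is evaluated first, so
 * bitmask[val - 1] is never read with val == 0. */
static size_t used_words(const unsigned long *bitmask, size_t n_words) {
        size_t val = n_words;

        while (val > 0 && bitmask[val - 1] == 0)
                val--;

        return val;
}
```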

Reproducer:
  udevadm test-builtin input_id with an all-zero capability mask

Follow-up:
  912541b0246ef315d4d851237483b98c9dd3f992

shell-completion: add new Argon2id TPM2 parameters for systemd-cryptenroll

tree-wide: add missing libopenssl_cflags to targets using tpm2-util.h

tpm2-util.h includes crypto-util.h, which requires OpenSSL compiler
flags when HAVE_OPENSSL is defined. Add libopenssl_cflags to every
build target that uses tpm2_cflags but was missing the transitive
OpenSSL dependency.

cryptsetup: add Argon2id-based PIN mode for TPM2 enrollment

The current TPM2 PIN mode is flawed as a compromised TPM directly exposes
the sealed secret which is the LUKS volume key itself (#27502 and #37386).

Goal: add Argon2id-based PIN hardening to TPM2 enrollment, making
the TPM a second factor rather than a single point of failure:

1. Password + salt -> Argon2id -> 512-bit key split into Key1 + Key2
2. Key2 (base64-encoded) is used as the PIN to seal a random secret
   in the TPM
3. Key1 + unsealed secret -> HKDF-SHA256 -> final LUKS volume key

This implementation ensures that if the TPM is compromised, an attacker
still needs the password to derive Key1 and combine it with the unsealed
secret.

The --tpm2-with-pin= option now accepts three values:
- false (no PIN used)
- true (PIN hardened with Argon2id - default)
- "direct" (legacy PIN without Argon2id for backward compatibility)

Argon2id parameters are customizable via:

--tpm2-argon2id-memory=
--tpm2-argon2id-iterations=
--tpm2-argon2id-parallelism=
--tpm2-argon2id-iter-time=

These default to a function of available CPUs and physical memory, with
a benchmark that scales iterations to the target time (default: 2s) and
falls back to ARGON2ID_PARAMETERS_DEFAULT (64 MiB, 8 iter, 4 lanes) when
auto detection fails.
Also if the runtime OpenSSL lacks Argon2id support (< 3.2), the feature
silently falls back to direct PIN mode with a warning.

Added includes:
- src/cryptenroll/cryptenroll.c: cpu-set-util.h, limits-util.h, time-util.h
  for Argon2id benchmark auto-tuning (cpus_online, physical_memory_scale,
  now/usec_t)
- src/cryptenroll/cryptenroll-tpm2.c: crypto-util.h for Argon2IdParameters
  struct in load_volume_key_tpm2()
- src/shared/tpm2-util.h: crypto-util.h for Argon2IdParameters in
  tpm2_make_luks2_json() API
- src/shared/tpm2-util.c: limits-util.h, tpm2-util.h for physical_memory() validation
  of Argon2id memory cost and function prototypes
- src/cryptsetup/cryptsetup-tokens/luks2-tpm2.c: crypto-util.h for
  kdf_argon2id_derive()/kdf_hkdf_sha256() on the token unlock path

machined: set user.userdb.* xattrs on machined socket

homed: set user.userdb xattrs on homed's userdb socket

We can't make a lot of restrictions here, since users are allowed to
basically freely pick their user names and UIDs too. But let's at least
exclude system UID ranges and those beyond the 16bit range.

core: tag dynamic user userdb provider with dynamic uid/gid ranges

Let's optimize lookups to PID 1's dynamic UID/GID range feature.

userdb: suppress lookups outside of indicated ranges

When doing a userdb lookup we might end up issuing a lot of IPC calls in
parallel to backends and cause them all to do work, with us waiting for
it. Let's optimize this a bit, and indicate on the socket inodes via
xattrs hints which kind of records are provided by a backend. That way
we can suppress lookups to them and optimize runtime behaviour.

This only works on Linux 7.1 and newer where socket inodes gained
support for extended attributes.

userdb: drop redundant check

sd-varlink: add sd_varlink_call_and_upgradeb() + sd_varlink_call_and_upgradebo()

These are to the existing sd_varlink_call_and_upgrade() what
sd_varlink_callb() and sd_varlink_callbo() are to sd_varlink_call():
they put together an object on the fly, via the usual JSON builder
logic.

sd-varlink: add async server-side upgrade API

sd-varlink: move varlink_handle_upgrade_fds() before sd_varlink_process()

This is a simple move, in preparation for using this as part of the
sd_varlink_process() callchain.

sd-varlink: fold separate 'protocol_upgrade' flag into state machine

json-stream: add helper json_stream_has_buffered_output()

Assorted basic/shared/ndisc/import hardening fixes flagged by kres (#42934)

claude-review: downgrade effort from xhigh to high

Claude reviews keep failing due to hitting 60m timeouts, as they take
too long. Lower effort from xhigh to high to try and speed them up.

hwdb: sensor: Add Odys Windesk X10 accel mount matrix

Add a mount matrix for the accelerometer on the Odys Windesk X10 tablet to
fix auto-rotation.

Revert "claude-review: bump timeout from 60m to 120m"

This reverts commit 6f04b8e5717264c2547f1c3e764ba8b000c23d0c.

test: TEST-89: assert ifindex=0 browse does not flap

Add a testcase that browses all interfaces (ifindex=0) and asserts zero
'removed' events arrive while every publisher stays up. Before the
combined-answer reconciliation this flapped added/removed continuously for
services present on another interface.

claude-review: bump timeout from 60m to 120m

Claude got slower, but in many cases it's on the brink of finishing the
review when the timeout kicks in. Double it.

ci: check 'update-man-rules' to ensure it is not forgotten

We often forget to run this command when updating manpages, and notice
only much later. Add a step in the CI to flag it, as we already do for
the D-Bus docs.

resolved: fix spurious BrowseServices add/remove flapping with ifindex=0

When a BrowseServices subscription uses ifindex=0 ("all interfaces"),
mdns_browser_revisit_cache() looked up each mDNS scope's cache separately
and called mdns_manage_services_answer() once per scope. That function
derives "removed" events by diffing the browser's *global* service list
(all interfaces, filtered only by owner family) against the single answer
it is handed. So with two or more mDNS-relevant interfaces of the same
family, a service present on interface A is not contained in interface B's
answer and is therefore spuriously removed (emitting "removed") while B is
reconciled, then re-added (emitting "added") on the next revisit tick.

The result is a continuous added/removed/added/removed flap, within a
second, for a service that is still present and without any goodbye packet
involved -- as reported for the mDNS/DNS-SD browsing API.

Fix it by accumulating the pruned cache answers from every matching mDNS
scope into a single combined answer and reconciling exactly once, so the
removed pass diffs the global service list against the union across all
interfaces. Each scope's cache is still pruned before lookup, so genuinely
expired records still drop out of the union and real removals still fire.

The items are merged with dns_answer_add_full() rather than
dns_answer_extend(), so each item's ifindex, flags, rrsig and cache-expiry
"until" are preserved. dns_answer_extend()/dns_answer_merge() are not
expiry-aware and do not carry the source items' "until" through: it
defaults to USEC_INFINITY (the "no expiry" sentinel), which would skew the
RFC 6762 section 5.2 TTL maintenance schedule that
mdns_manage_services_answer() derives from item->until.

The single-interface (ifindex>0) path is unchanged.
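
The difference between per-scope and combined reconciliation can be shown with
a toy model in which every service is one bit. Both functions are
hypothetical illustrations and do not reflect the resolved data structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Old behaviour: diff the global service set against each scope's answer
 * separately; anything missing from any single answer counts as removed. */
static uint32_t removed_per_scope(uint32_t known, const uint32_t *answers, size_t n) {
        uint32_t removed = 0;

        for (size_t i = 0; i < n; i++)
                removed |= known & ~answers[i];

        return removed;
}

/* New behaviour: build the union of all answers first, then diff once. */
static uint32_t removed_combined(uint32_t known, const uint32_t *answers, size_t n) {
        uint32_t combined = 0;

        for (size_t i = 0; i < n; i++)
                combined |= answers[i];

        return known & ~combined;
}
```

With service 0 on interface A and service 1 on interface B, the per-scope
variant reports both as removed, while the combined variant reports none. A
service that is absent from every answer is still reported by both.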

nss-myhostname: keep IPv6 probe result stable

Cache the IPv6 enabled state while sizing and filling NSS result
buffers, so a transient sysctl read result cannot change the tuple or
address layout after the scratch buffer size has been computed. Also
zero-initialize gaih_addrtuple records before filling IPv4 addresses.

Follow-up for e8a7a315391a6a07897122725cd707f4e9ce63d7

creds: tolerate TPM2 seal failure in auto mode

Automatic key modes tolerate a failed TPM2 sealing attempt and
fall back to a host or null key. Do not consume TPM2 blob output in
that case, tpm2_seal() leaves it empty on failure, so the fallback
path should continue without TPM2 metadata.

Follow-up for 9e4379945b74ee7920fe375be0bcb04d8ef53873

ndisc: reject non-zero ICMPv6 codes in parsers

NDisc packets received through the socket path are filtered before
parser dispatch, but parser entry points should still reject malformed
packet bytes instead of asserting on them. Return EBADMSG for non-zero
ICMPv6 code values in RA, NA, and Redirect messages.
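
A minimal sketch of such a check. The function name is hypothetical, and
treating a truncated header as EBADMSG as well is an assumption of this
example. In the ICMPv6 header the first byte is the type and the second byte
is the code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Reject packets whose ICMPv6 code field is not zero, returning an error
 * instead of asserting on untrusted packet bytes. */
static int icmp6_check_code(const uint8_t *packet, size_t size) {
        if (size < 2)
                return -EBADMSG;  /* too short to even contain the code field */
        if (packet[1] != 0)
                return -EBADMSG;
        return 0;
}
```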

Follow-up for c34cb1d6451dd9fcd36e1c08c553ca7f25e9d83b
Follow-up for 696eb2b8de980a2b79c1de7fbf12195936b00441
Follow-up for 44e8cf303b1e54752637725d55d01811e05ed482

pull-oci: verify redirected manifest digest

An OCI index redirect already carries the digest of the selected
manifest. Store it in the expected checksum field so pull-job
verifies the downloaded manifest instead of overwriting the digest
with the computed checksum before comparison.

Follow-up for a9f6ba04969d6eb2e629e30299fab7538ef42a57

efivars: fix concurrent growth read accounting

efi_get_variable() allocates one byte for probing whether efivarfs
has grown since fstat(), then three more bytes for NUL termination.
Account for both sizes separately so a full readv() result is treated
as concurrent growth and retried before the terminators are written.

Follow-up for ab69a04600fd34c152c44be6864eb3bc64568e17

network: do not regenerate MAC address if already set by userspace

When MACAddress= is unspecified, a stable MAC is generated to seed a newly
created netdev. Since existing netdevs are reconfigured on reload/restart,
this seed got re-applied to an already existing interface, conflicting with
an explicit MACAddress= from the matching .network file and flipping MAC.

Now, when MACAddress= is unspecified, only generate a new address if
addr_assign_type is not NET_ADDR_SET (i.e. not already set by userspace). On
first creation the interface does not exist yet, so this is a no-op and the
address is generated as before. Mirrors src/udev/net/link-config.c.
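
The decision described above can be condensed into a small predicate. The
function and parameter names are hypothetical; ADDR_ASSIGN_SET mirrors the
kernel's NET_ADDR_SET value (3) and is redefined here only to keep the sketch
self-contained.

```c
#include <stdbool.h>

#define ADDR_ASSIGN_SET 3  /* same value as the kernel's NET_ADDR_SET */

/* Should a stable MAC address be generated for this netdev? */
static bool should_generate_mac(bool mac_configured, bool interface_exists,
                                unsigned addr_assign_type) {
        if (mac_configured)
                return false;  /* MACAddress= is set, nothing to generate */
        if (interface_exists && addr_assign_type == ADDR_ASSIGN_SET)
                return false;  /* userspace already set an address, keep it */
        return true;           /* first creation, or address not set by userspace */
}
```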

Fixes: #42457

creds-util: implement TPM2 SRK pinning

Stores the TPM2 SRK within the credential header, allowing for parameter decryption to be utilized when decrypting the credential.
A new dimension is added to the credential ID matrix to encode this capability.

This also allows for usage of TPM2-bound credentials when a TPM owner password is set since `Esys_CreatePrimary` is no longer used for sealing credentials.

creds: allow normal users to encrypt with `--with-key=null`

When encrypting with the `--with-key=null` option, systemd-creds
currently does the encryption via IPC. This is not needed
for the null key: no privileges are required, so we can just do the
operation in-process. So instead simply check for the null key and,
if it is requested, use the in-process path.

man: run forgotten 'update-man-rules'

mkosi: update debian commit reference to b322b8d98e0192763aa711dc2aa2f98d8276aae3

* b322b8d98e Install new files for upstream build
* 3fd1b81c94 Update changelog for 261.1-3 release
* 8f95f1370c Move modules-load from systemd package to udev package
* bcdf90c670 debian/libpam-systemd.postinst: pam-auth-update does not use getopt
* 743e3399ac d/README.source: document policy for adding new binary packages
* cbea74783c Install new files for upstream build
* ef267b3ad6 debian/libpam-systemd.postinst: run pam-auth-update with --root=$DPKG_ROOT
* 54df2859b4 d/t/upstream: use mkosi from archive for Ubuntu autopkgtest
* 21959e8a59 Fix zsh installation path, again
* f01dddb047 Fix zsh installation path
* e31737edc4 Install new files for upstream build
* ca0630a51e Update changelog for 261.1-2 release
* 1ebb987599 d/t/control: pull in cpio for upstream suite
* fcf5a24f47 Two more fixups for d/copyright
* 6ad198086d d/t/control: pull new packages in upstream test suite
* bb9fd757fd Make systemd actually temporarily depend on systemd-tpm

sysupdate: enable/disable verb for features and components (#42947)

These are the next 3 commits from #42651. Split out again, in the hope
to get a review from claude.

ptal.

full test suite and docs are in #42651. test suite passed there.

obs: explicitly disable Ubuntu/i586 builds

i586 is a partial architecture on Ubuntu, so package builds cannot
work, explicitly disable it to avoid visual clutter in the build
report page.

man: drop '\r' from systemd-clonesetup.xml

Follow-up for 104970a8bd7a3b53067f6e50283183406a579f0b.

credentials: add policy that can allow key=null creds from the ESP (#42555)

This PR only sets the default to "relaxed" - I can change the default
to "tofu" if desired. But for that we will also need to update the NEWS
file to ensure everyone is aware of this new default.

---

This PR adds a new `systemd.credentials-boot=` kernel
command line option that controls whether credentials with
a `null` key are accepted.

The possible options are:
* strict: always insist on tpm encryption
* tofu: allow null encryption in firstboot mode and when no tpm is
available
* relaxed: allow null encryption when sb is off, or no tpm is available
* off: allow null encryption always

The default is `relaxed` which is exactly the behavior we had before.

This replaces the initial idea of using plaintext credentials
at firstboot (thanks to Lennart for this nicer and simpler design).

---

With that we can drop `- firstboot: optionally accept credentials at
firstboot without authentication` from TODO.md

Add support for aarch64 CPUFeatures (#42902)

Extend real_has_cpu_with_flag to support aarch64 CPU Features using
hwcaps.

With this PR, users can find out if their Arm system supports
architecture-specific features such as BTI as follows:

```
$ systemd-analyze condition 'ConditionCPUFeature=bti'
```

sysupdate: move cleanup verb up close to "vacuum"

These are philosophically similar concepts: one deletes old versions
based on whether they are old, and the other deletes old files based on
whether they are orphaned, let's list them together.

sysupdate: tighten component_name_valid() mildly

Let's make sure that we can still generate a component enablement
drop-in for any component that might be defined.

sysupdate: add new enable-component/disable-component verb

This mimics the enable-feature/disable-feature verbs, but operates on
whole components, not features.

sysupdate: add enable-feature/disable-feature verbs

These do what updatectl's verbs of the same name do, but are
implemented as low-level concepts in sysupdate itself.

(The idea is to eventually wrap this in Varlink IPC, and make updatectl
use them that way.)

sysupdate: generalize feature name validity check

Let's switch to string_is_safe(), and make this available to the rest of
the sysupdate sources too.

This both relaxes and tightens the rules slightly, i.e. control characters
and such are no longer allowed, but valid UTF-8 (as opposed to ASCII)
now is.

sysupdate: explicitly refuse --root=/--image= in conjunction with --definitions=

This is currently effectively not supported (as we sometimes prefix the
definitions path with root and sometimes not), so let's make this
official for now and refuse it. We could in theory support the
combination, but I don't see a big benefit in it for now.

sysupdate: Evaluate all patterns before descending into a subdir

When we have two match patterns, e.g., because the repository layout
changed and we look for old and new style files, we can have the case
that one pattern suggests a descent into a subdir but the other pattern
would be a direct match. This was missed and only the descent was done.
Yet the other way round also has to be covered: We can't just use the
direct match because it might be rejected if the type doesn't fit
(directory vs. regular file). In this case we should still descend.

Remember that we want to descend but continue checking the other
pattern first to maybe get a direct match. Introduce a new combined
return code YES_AND_RETRY to give the full information to the caller,
which can now decide to fall back to a descent. While at it, use the
stat info for the directory check. Also add tests to ensure that the
order of patterns doesn't matter and we handle the above corner cases.

sysupdate: Support matching for filenames in subdirectories

While it's fine for sysupdate to consume a large set of all possible
update payloads from a single directory, this is not so handy for
managing and serving the update payloads. Since this large update folder
is not where the build output gets written directly, one has to create
copies and later possibly delete this added set of files.

Support matching for filenames in subdirectories by adding a new **/
match pattern prefix which matches any number of nested subdirectories
or no subdirectory at all. For simplicity it's only allowed at the start
of a pattern and is not a regular wildcard like the rest, because the
main use case is to descend into subdirectories and only do the pattern
matching for the basenames. This way one can create a SHA256SUMS file
in the top folder and have it include all update payloads from the
release-specific (or arch-specific) subdirectories.

Something similar was already supported for directory sources, where
the match pattern can start with a subdirectory path. Also support this
for SHA256SUMS for parity while we are at it. Having the new wildcard
also makes mirroring easier because one does not have to follow the
exact subdirectory layout and one can filter by folder instead of by
filename. It also makes it possible to point the same transfer files
with the new wildcard to either a SHA256SUMS file that uses
release-specific (or arch-specific) subdirectories and includes all
versions, or to a SHA256SUMS file as generated by mkosi that does not
use subdirectories because it only has files for a single version.

With the upcoming UAPI.16 JSON format we will also be able to encode
subdirectories, and it makes sense to add this to SHA256SUMS to be
able to convert them. That format also supports custom URLs for each
entry, which is more powerful than the (arbitrary) subdirectory feature
used here, but subdirectories have the advantage that they don't break
mirroring.

This change also fixes the existing subdirectory handling bugs where
everything deeper than two subdirectory levels failed to work because
rel_joined instead of de->d_name got used, symlinks were followed, and
we would continue silently on non-ENOENT errors.
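
The intended matching behavior of the **/ prefix can be sketched in
Python. This is only an illustration, not the sysupdate implementation:
the function name is hypothetical, and fnmatch-style globs stand in for
sysupdate's own pattern wildcards.

```python
import fnmatch
import posixpath


def pattern_matches(pattern, path):
    """Hypothetical helper: match a relative path against a pattern.

    A leading "**/" accepts any number of nested subdirectories (or none)
    and matches only the basename. Otherwise the directory part of the
    pattern must equal the directory part of the path.
    """
    path_dir, path_base = posixpath.split(path)
    if pattern.startswith("**/"):
        return fnmatch.fnmatchcase(path_base, pattern[len("**/"):])
    pat_dir, pat_base = posixpath.split(pattern)
    return pat_dir == path_dir and fnmatch.fnmatchcase(path_base, pat_base)
```

With this sketch, "**/foo_*.raw" accepts both "foo_1.raw" and
"v1/x86-64/foo_1.raw", while "v1/foo_*.raw" accepts only files directly
below "v1".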

clonesetup: add support to clone devices via /etc/clonetab

Adds dm-clone device setup at boot via a new /etc/clonetab config file,
following the crypttab/veritytab pattern.

- Add systemd-clonesetup-generator to parse /etc/clonetab and generate units.
- Add systemd-clonesetup binary to create/remove dm-clone devices via ioctl.
- Add clonesetup.target for ordering dm-clone activation at boot.
- Add region_size= option in clonetab to configure dm-clone hydration granularity.
- Add clonetab(5) and systemd-clonesetup-generator(8) man pages.

Fixes: https://github.com/systemd/systemd/issues/39500

udev: require exact builtin command matches

Commit 9b917abe02505d4ad09c4963ec5a4b2744eb2fee (udev-builtin:
simplify code a bit) changed builtin lookup from copying the first
command word and comparing it with streq() to comparing only the first
command word length with strneq().

That made prefixes such as "k" or "key" resolve to the "keyboard"
builtin, depending on the builtins table order. This was not the
original behavior.

Require the matched builtin name to end at the first command word
boundary so only complete builtin names are accepted. Arguments after
the builtin name continue to work as before.
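
The difference between the two lookups can be illustrated with a small
Python sketch. It is not the udev C code: the table contents, their
order, and the function names are hypothetical, and only the comparison
logic is modeled.

```python
# Hypothetical builtin table; the real table and its order live in udev.
BUILTINS = ["blkid", "hwdb", "input_id", "keyboard", "kmod", "path_id"]


def first_word(command):
    """Return the first whitespace-separated word, or None if empty."""
    parts = command.split(None, 1)
    return parts[0] if parts else None


def lookup_prefix(command):
    """Regressed behavior: compare only the first len(word) characters."""
    word = first_word(command)
    if word is None:
        return None
    for name in BUILTINS:
        if name[:len(word)] == word:
            return name
    return None


def lookup_exact(command):
    """Fixed behavior: the builtin name must end at the word boundary."""
    word = first_word(command)
    for name in BUILTINS:
        if name == word:
            return name
    return None
```

In this sketch lookup_prefix("key") resolves to "keyboard", whereas
lookup_exact("key") finds nothing and lookup_exact("keyboard --flag")
still resolves to "keyboard".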

udev: fix several reproducible command handling issues (#42958)

This PR fixes several independent udev command and rule handling issues.
Each commit contains the corresponding reproducer and fix details.

man: document arm64 CPUFeatures

Signed-off-by: Emanuele Rocca <emanuele.rocca@arm.com>

virt: support architecture prefixes in CPUFeature

Extend ConditionCPUFeature to support architecture prefixes in feature names.
The goal is avoiding ambiguities on mixed-arch fleets, where CPU feature names
may potentially conflict.

With this patch, ConditionCPUFeature=arm64.bti is now equivalent to
ConditionCPUFeature=bti on arm64 systems.
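
A rough Python sketch of the prefix handling follows. The function names
are hypothetical, and treating a prefix for a different architecture as
"condition not met" is an assumption made for illustration, not
something this commit message states.

```python
def split_cpu_feature(value):
    """Split "arch.feature" into (arch, feature); arch is None if absent."""
    arch, sep, feature = value.partition(".")
    if not sep:
        return None, value
    return arch, feature


def cpu_feature_condition(value, local_arch, local_features):
    """Evaluate a feature name with an optional architecture prefix."""
    arch, feature = split_cpu_feature(value)
    if arch is not None and arch != local_arch:
        # Assumption: a foreign architecture prefix never matches.
        return False
    return feature in local_features
```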

Signed-off-by: Emanuele Rocca <emanuele.rocca@arm.com>

virt: add support for aarch64 CPU Features

Extend real_has_cpu_with_flag to support aarch64 CPU Features using hwcaps.

With this patch, users can find out if their Arm system supports
architecture-specific features such as BTI as follows:

$ systemd-analyze condition 'ConditionCPUFeature=bti'

Signed-off-by: Emanuele Rocca <emanuele.rocca@arm.com>

include: add hwcaps missing from glibc and musl

Add an override file for all capabilities missing in glibc v2.34 and musl
1.2.6. The constant AT_HWCAP3 is also not in glibc v2.34, so ship a
glibc-specific elf.h in order to provide it.

Signed-off-by: Emanuele Rocca <emanuele.rocca@arm.com>

Optimize dlopen ELF notes via anchoring and explicitly embed them into executables (#42908)

This PR optimizes the handling of `dlopen` ELF notes and reduces binary
footprints across the tree.

By enabling compiler section-splitting
(`-ffunction-sections`/`-fdata-sections`), the linker can now accurately
garbage-collect unused code and data. To align with this, the
`SD_ELF_NOTE_DLOPEN_ANCHORED()` macro is introduced, which ties `dlopen`
notes to their calling functions so that unused notes are automatically
removed by `--gc-sections`.

Additionally, this explicitly embeds required `dlopen` notes into
individual executables to fix a visibility issue where package managers
missed runtime dependencies invoked indirectly through
`libsystemd-shared.so`.

run: recognize clock change timer properties (#42949)

--on-clock-change and --on-timezone-change are documented as shortcuts for
setting the corresponding timer properties with --timer-property=.

However, --timer-property= only treated the monotonic and calendar timer
settings as actual timer triggers. OnClockChange= and OnTimezoneChange=
were stored as timer properties, but arg_with_timer remained false and the
command was rejected as having no timer options.

Treat both properties as timer triggers too, matching the shortcut options
and the documented equivalent command line.
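
The effect on the "has a timer" decision can be sketched in Python. This
is a simplified model rather than the systemd-run code: the function
name is hypothetical, the "before" set is an assumption based on the
description above, and the sketch ignores the assigned value.

```python
# Assumed set of properties treated as triggers before the change.
TRIGGERS_BEFORE = {
    "OnActiveSec", "OnBootSec", "OnStartupSec",
    "OnUnitActiveSec", "OnUnitInactiveSec", "OnCalendar",
}
# After the change the two clock-related properties count as well.
TRIGGERS_AFTER = TRIGGERS_BEFORE | {"OnClockChange", "OnTimezoneChange"}


def with_timer(timer_properties, triggers):
    """Return True if any NAME=VALUE assignment names a trigger property."""
    for assignment in timer_properties:
        name, _, _ = assignment.partition("=")
        if name in triggers:
            return True
    return False
```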

nspawn-oci: match the spec-correct "swappiness" memory field key

The OCI runtime specification names the memory swappiness knob
"swappiness" (memory.swappiness), but the dispatch table
in oci_cgroup_memory() registered it as "swapiness".

Since sd_json_dispatch_full() matches object keys by exact string
comparison, a spec-correct config carrying "swappiness" never hit the
intended oci_unsupported() handler and was instead routed to the
bad-field callback oci_unexpected(), which returns -EINVAL and thus
fails the whole memory cgroup section.

Register the correct key so that real OCI bundles parse, keep the
misspelt one around for compatibility, mark both as such, and drop the
now-resolved reminder from the file header TODO list.
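
The failure mode is easy to reproduce with any exact-match dispatch
table. The Python sketch below only models that behavior: the helper
names mirror the ones mentioned above but are stand-ins, not the
sd-json API.

```python
def oci_unsupported(key):
    """Stand-in for a handler that accepts but ignores a field."""
    return "unsupported, ignored: " + key


def dispatch(table, obj):
    """Match object keys by exact string comparison."""
    handled = []
    for key in obj:
        if key not in table:
            # Stand-in for the bad-field callback returning -EINVAL.
            raise ValueError("unexpected field: " + key)
        handled.append(table[key](key))
    return handled


# Before: only the misspelt key is registered.
OLD_TABLE = {"swapiness": oci_unsupported}
# After: the spec-correct key is registered, the misspelt one is kept.
NEW_TABLE = {"swappiness": oci_unsupported, "swapiness": oci_unsupported}
```

Dispatching {"swappiness": 60} against OLD_TABLE raises, while NEW_TABLE
routes both spellings to the same handler.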

udevadm-info: handle missing data db cleanup

udevadm info --cleanup-db cleaned links and tags by comparing them
against /run/udev/data, but asserted when that data directory was
missing.

Treat a missing data directory as an empty database, so stale link
and tag entries can be removed without aborting.

Add a database test that runs cleanup-db with links and tags present
but no data directory.

Reproducer:

$ sudo unshare --mount --propagation private sh -c '
mount -t tmpfs tmpfs /run/udev
mkdir -p /run/udev/links/foo
udevadm info --cleanup-db
'
Assertion 'datadir' failed at src/udev/udevadm-info.c:676, function cleanup_dirs_after_db_cleanup(). Aborting.
Aborted (core dumped)

The command should handle a missing /run/udev/data directory instead
of aborting.
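
The "missing directory means empty database" rule can be sketched in
Python as follows. The directory layout is heavily simplified and the
function names are hypothetical; only the error handling is the point.

```python
import os
import tempfile


def list_entries(directory):
    """List directory entries, treating a missing directory as empty."""
    try:
        return set(os.listdir(directory))
    except FileNotFoundError:
        return set()


def stale_entries(entries_dir, data_dir):
    """Return entries that have no counterpart in the data directory."""
    known = list_entries(data_dir)
    return sorted(e for e in list_entries(entries_dir) if e not in known)


# Mirror the reproducer: a links entry exists but there is no data directory.
with tempfile.TemporaryDirectory() as run_udev:
    os.makedirs(os.path.join(run_udev, "links", "foo"))
    demo_stale = stale_entries(os.path.join(run_udev, "links"),
                               os.path.join(run_udev, "data"))
```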

udevadm-control: reject oversized children-max

The --children-max= option parsed into an unsigned value and then
stored it in an int sentinel field.

Values above INT_MAX could wrap negative and make the command look as
if no control option had been specified.

Reject oversized values before assigning the parsed value.

Reproducer:

$ udevadm control --children-max=2147483648
No control command option is specified.
$ echo $?
1

The value should be rejected as out of range instead of being wrapped
into the unset sentinel value.
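
The wrap-around can be modeled in Python with ctypes. This is a sketch
under assumptions: it assumes a 32-bit C int, the function names are
hypothetical, and it does not reproduce the actual udevadm parsing code.

```python
import ctypes

INT_MAX = 2**31 - 1  # assumes a 32-bit int, as on common Linux ABIs


def store_unchecked(value):
    """Model assigning a parsed unsigned value to a C int field."""
    return ctypes.c_int(value).value


def store_checked(value):
    """Reject values that do not fit before assigning them."""
    if value > INT_MAX:
        raise ValueError("children-max value out of range")
    return value
```

Here store_unchecked(2147483648) yields a negative number, which is how
the value ends up looking like an unset field, while store_checked()
refuses it up front.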

udev-rules: drop truncated import output line

When IMPORT{program} output was truncated, the code intended to drop
the last incomplete result line before importing properties.

It edited the command buffer instead of the output buffer, so the
incomplete line could still be parsed and imported.

Trim the result buffer and cover the behavior with the existing IMPORT
test.

Follow-up for 6b6e471a325bf149839c5c822b4ae3e66cb1d9a3.

Reproducer:

Configure an IMPORT{program} rule whose output exceeds UDEV_LINE_SIZE
and ends with an incomplete property line, for example:

TRUNCATED_OK=yes
TRUNCATED_BAD=<very long value without terminating newline>

Only TRUNCATED_OK should be imported. Before this fix, the truncation
handler edited the command buffer instead of the output buffer, so
TRUNCATED_BAD could still be parsed from the truncated result.
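
A minimal Python sketch of the intended trimming is shown below. The
function names are hypothetical and the property parsing is simplified;
the point is that the cut has to be applied to the result buffer.

```python
def drop_incomplete_line(result, truncated):
    """If the output was truncated, cut it after the last full line."""
    if not truncated:
        return result
    cut = result.rfind("\n")
    return result[:cut + 1] if cut >= 0 else ""


def parse_properties(result):
    """Parse KEY=VALUE lines into a dict, skipping malformed lines."""
    props = {}
    for line in result.splitlines():
        key, sep, value = line.partition("=")
        if sep and key:
            props[key] = value
    return props


output = "TRUNCATED_OK=yes\nTRUNCATED_BAD=" + "x" * 32  # no final newline
imported = parse_properties(drop_incomplete_line(output, truncated=True))
```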

udevadm-hwdb: honor root when querying

The deprecated udevadm hwdb command parsed --root for both update and
test modes, but the test path always queried the host database.

Pass the configured root to hwdb_query() so --root affects --test in
the same way as systemd-hwdb query.

Reproducer:

$ tmp=$(mktemp -d)
$ mkdir -p "$tmp/etc/udev/hwdb.d"
$ printf 'test:codex-root-test\n ID_TEST_CODEX=from-root\n' > "$tmp/etc/udev/hwdb.d/99-codex.hwdb"
$ systemd-hwdb --root="$tmp" update
$ systemd-hwdb --root="$tmp" query test:codex-root-test
ID_TEST_CODEX=from-root
$ udevadm hwdb --root="$tmp" --test=test:codex-root-test
udevadm hwdb is deprecated. Use systemd-hwdb instead.

The last command was expected to print ID_TEST_CODEX=from-root, but
queried the host database instead.

udev-rules: accept cvm CONST matches

CONST{cvm} is documented and already handled when udev rules are
executed, but the parser rejected it before the rule could be used.

Include cvm in the supported CONST key list.

Follow-up for 6e2e83b48734e86992cbdbb329c48cc066cf7c96.

Reproducer:

$ printf 'CONST{cvm}=="none", NAME="x"\n' > /tmp/codex-cvm.rules
$ udevadm verify /tmp/codex-cvm.rules
/tmp/codex-cvm.rules:1 Invalid attribute for CONST.
/tmp/codex-cvm.rules: udev rules check failed.

1 udev rules files have been checked.
Success: 0
Fail: 1

condition: split condition_test_list() to minimize dlopen dependencies

Even though systemd-networkd and systemd-udevd do not support
ConditionSecurity= in .network, .netdev, and .link files, the unified
condition_test() function maintained a function pointer table that
unconditionally referenced condition_test_security(). Consequently,
linker garbage collection (--gc-sections) could not drop the security
test code, inadvertently pulling in dlopen dependencies and ELF notes
for apparmor, audit, and tpm2 libraries into these binaries.

To resolve this, introduce condition_test_net() and condition_test_list_net(),
which utilize a trimmed-down function pointer table that excludes
security-related condition evaluators.

By migrating networkd and udevd to these new network-specific variants,
the reference to condition_test_security() is completely severed in
their dependency chains. This allows the compiler and linker to
garbage-collect the unused security logic and safely drop the
unnecessary dlopen notes from those binaries.

compress: split compress_blob() and friends to minimize dlopen dependencies

Previously, even though sd-journal does not support bzip2 and gzip
compression, the dlopen ELF notes for those libraries were still being
attached to libsystemd.so and various journal-handling executables.
This occurred because the unified compression/decompression interfaces
handled all formats unconditionally, causing the compiler and linker to
pull in all associated dlopen notes across the board.

To resolve this, split these functions into generic and journal-specific
variants (e.g., introducing compress_blob_journal() and decompress_blob_journal()).
The journal-specific variants only handle formats actually supported by
the journal (LZ4, XZ, and ZSTD).

By updating sd-journal, journald, journalctl, and related utilities to
use these new `_journal` interfaces and switching them to the narrower
COMPRESS_JOURNAL_NOTE macro, we ensure that unnecessary dlopen notes
(for bzip2 and gzip) are no longer embedded into these binaries.

ci/unit-tests: enable dlopen note verification tests

Enable the `dlopen` unit test suite in GitHub Actions, except for the
following configurations where linker garbage collection (`--gc-sections`)
or dead-code elimination fails to drop unused dlopen symbols:

- ppc64le (both GCC and Clang)
- s390x (GCC only)
- Sanitizer-enabled setups

test: add dlopen note checker

This introduces a test utility to validate the presence and correctness
of dlopen ELF notes.

The test extracts and compares dlopen ELF notes between the dynamic and
static versions of target binaries to catch missing entries. Furthermore,
it scans the code for dlopen_foo() invocations to ensure that no stale or
unused notes remain in the final binary, thereby verifying the linker's
garbage-collection integration.

sd-dlopen: introduce _dlopen_loader_ macro to prevent inlining and cloning

When compiler optimization or LTO (Link Time Optimization) is enabled,
dlopen helper functions (such as dlopen_libfoo()) can be aggressively
inlined into their callers or cloned for specific call sites.

While this is beneficial for production builds, it alters or completely
removes the original function symbols from the final executable.
Consequently, this breaks the test (to be added in this series) that
checks whether the dlopen_libfoo() function corresponding to a dlopen
ELF note is actually called.

To resolve this, introduce the `_dlopen_loader_` macro. Under developer
builds, it applies the `_noclone_` and `_noinline_` attributes to guarantee
that helper symbols remain intact and discoverable by the test suite. In
production builds, the macro evaluates to nothing, allowing full compiler
optimization to proceed unimpeded.

tree-wide: embed dlopen notes into individual executables

Previously, when a library was dynamically loaded via a helper function
inside libsystemd-shared.so, the resulting dlopen ELF note was not
propagated to the invoking executable's ELF metadata.

This commit explicitly annotates each executable with the relevant
dlopen ELF notes for any optional dependencies it might potentially
load. This ensures that package managers and build systems can properly
discover runtime dependencies that are triggered indirectly.

README: mention binutils requirement, bump LLVM requirement for ELF note macros

The new SD_ELF_NOTE_DLOPEN_ANCHORED() macro utilizes the 'o' assembler
flag (SHF_LINK_ORDER), which requires binutils >= 2.35 or LLVM >= 18.
If an LLVM version older than 18 is encountered, the macro automatically
falls back to the non-anchored variant.

This non-anchored fallback relies on the 'R' (SHF_GNU_RETAIN) flag,
which requires binutils >= 2.36 or LLVM >= 13.

Since the codebase now unconditionally adopts SD_ELF_NOTE_DLOPEN_ANCHORED()
tree-wide, the effective minimum toolchain requirements become:
- binutils >= 2.35 (to support the 'o' flag)
- LLVM/Clang >= 13 (to support the 'R' flag fallback for versions < 18)

Update the minimal toolchain versions in the README to reflect these
requirements for building systemd.