git.ipfire.org Git - thirdparty/systemd.git/log

Add ActivatingConcurrencyMax for slice startup pacing

Introduce Slice.ActivatingConcurrencyMax to limit how many units
within a slice hierarchy may be in activating state concurrently.

Expose the setting over D-Bus, support transient/property parsing,
enforce it during unit start dispatch, and re-check queued starts when
units leave activating state.

Document the new slice option and add a PID1 concurrency test covering
queued startup behavior.
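
The pacing behavior described above can be pictured with a small Python model. This is only a sketch under stated assumptions: the class and method names are hypothetical, it models a single flat slice rather than a slice hierarchy, and it says nothing about how PID1 actually implements the queue.

```python
from collections import deque


class SliceStartPacer:
    """Toy model: at most `limit` units may be in activating state at once."""

    def __init__(self, limit):
        self.limit = limit
        self.activating = set()
        self.queued = deque()

    def request_start(self, unit):
        """Dispatch the start now if a slot is free, otherwise queue it."""
        if len(self.activating) < self.limit:
            self.activating.add(unit)
            return True
        self.queued.append(unit)
        return False

    def left_activating(self, unit):
        """A unit left activating state: re-check queued starts in FIFO order."""
        self.activating.discard(unit)
        dispatched = []
        while self.queued and len(self.activating) < self.limit:
            nxt = self.queued.popleft()
            self.activating.add(nxt)
            dispatched.append(nxt)
        return dispatched
```

With a limit of 2, a third start request is queued and only dispatched once one of the first two units leaves activating state.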

udev: derive path ID for PNP devices from ACPI firmware node

PNP devices may represent ACPI-enumerated hardware but do not have a
parent type supported by path_id. Consequently, importing path_id fails
even when the PNP device exposes a stable ACPI firmware_node. This also
prevents later assignments in rules such as the systemd-backlight
activation rule from taking effect.

Resolve the PNP device's firmware_node and use its ACPI sysname for the
path component. This gives PNP-backed devices the same stable identity
as their firmware representation.

Add a regression test using an RTC device below a PNP parent with an
ACPI firmware node.

bus: Reduce sd_bus_message size by 9.9% by dropping offset bookkeeping

sd_bus_message carries an array of header field offsets and a counter,
but their only reader was part of the D-Bus v2/GVariant sealing path
removed in 0dd487681505 ("sd-bus: drop D-Bus version 2 format support").

There are three remaining callers of message_extend_fields(), for
SD_BUS_MESSAGE_HEADER_DESTINATION, _PATH, and _INTERFACE, and all of
them pass false for add_offset. The receive path does not populate the
array either, so nothing reads or writes it any more.

Let's drop the two fields and the now unused add_offset argument and
branch.

This reduces sizeof(sd_bus_message) from 792...

    $ gdb -batch build-baseline/test-bus-benchmark \
          -ex 'ptype /o struct sd_bus_message'
                              ...
                              /* 688 |     8 */ usec_t timeout;
                              /* 696 |    80 */ size_t header_offsets[10];
                              /* 776 |     4 */ unsigned int n_header_offsets;
                              /* XXX  4-byte hole */
                              /* 784 |     8 */ uint64_t read_counter;
                              /* total size (bytes):  792 */

to 704 bytes:

    $ gdb -batch build-patched/test-bus-benchmark \
          -ex 'ptype /o struct sd_bus_message'
                              ...
                              /* 688 |     8 */ usec_t timeout;
                              /* 696 |     8 */ uint64_t read_counter;
                              /* total size (bytes):  704 */

On Fedora 43 aarch64 with glibc 2.42, the usable allocation for a
method-call message falls from 808 to 728 bytes, a reduction of 9.9%.

In my tests, retaining 400,000 method-call messages reduces median
peak RSS from 363M to 332M.

Using `test-bus-benchmark chart direct 500ms` across message sizes from
1 byte to 2 MiB one can also see things are around 1% faster, which is
another nice incidental boost.
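
The arithmetic above can be re-derived with a short Python sketch. It rebuilds only the tail of the struct shown in the gdb output with ctypes, and it assumes an LP64 platform where size_t and usec_t are 8 bytes wide (modelled here as c_uint64). The class names are made up for illustration.

```python
import ctypes


class TailBefore(ctypes.Structure):
    # Tail of the struct as in the baseline gdb output (size_t assumed 8 bytes).
    _fields_ = [
        ("timeout", ctypes.c_uint64),
        ("header_offsets", ctypes.c_uint64 * 10),
        ("n_header_offsets", ctypes.c_uint),
        ("read_counter", ctypes.c_uint64),
    ]


class TailAfter(ctypes.Structure):
    # Same tail with the two bookkeeping fields dropped.
    _fields_ = [
        ("timeout", ctypes.c_uint64),
        ("read_counter", ctypes.c_uint64),
    ]


# 80 bytes of array + 4 bytes of counter + 4 bytes of padding hole.
saved = ctypes.sizeof(TailBefore) - ctypes.sizeof(TailAfter)

# Relative reductions for the figures quoted in the message.
struct_pct = round((792 - 704) / 792 * 100, 1)
alloc_pct = round((808 - 728) / 808 * 100, 1)

print(saved, struct_pct, alloc_pct)
```

Note that the 9.9% figure refers to the usable allocation (808 to 728 bytes); measured on sizeof() alone (792 to 704 bytes) the reduction is somewhat larger.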

hwdb: fix Lenovo B570e touchpad ABS ranges for edge scrolling (#43264)

The Lenovo B570e touchpad (ETPS/2 Elantech) reports ABS ranges that
are too wide by default, which makes edge scrolling trigger across
roughly the right half of the touchpad instead of only at the right
edge. Add an evdev hwdb entry in `60-evdev.hwdb` that overrides the
ABS_X/ABS_Y (and the matching MT position) ranges with the calibrated
values, following the same pattern already used for the Lenovo B590
and L430 entries.

Fixes #29666

hwdb: mark Adesso wireless keyboard trackball as trackball (#43263)

The Adesso wireless keyboard with an integrated trackball (MosArt
062a:4101) is identified as a regular mouse, so trackball-style
scrolling does not work out of the box. Add a hwdb entry in
`70-mouse.hwdb` setting `ID_INPUT_TRACKBALL=1` so libinput and other
clients treat the device as a trackball.

Fixes #29609

hwdb: mark Microsoft Surface Type Cover touchpad as internal (#43262)

The Microsoft Surface Type Cover touchpad (USB 045E:09C0) is attached
through a USB port that firmware reports as removable. Because of that,
`65-integration.rules` sets `ID_INPUT_TOUCHPAD_INTEGRATION=external`,
and libinput skips disable-while-typing (DWT) for the device.

The touchpad is physically integrated into the Type Cover, so add a hwdb
entry in `70-touchpad.hwdb` that overrides
`ID_INPUT_TOUCHPAD_INTEGRATION=internal`, restoring DWT.

Fixes #43256

include: update kernel headers from v7.2-rc5

There seem to be no notable changes for us.

socket: parse message queue size as IEC size

Allow MessageQueueMessageSize= to accept IEC size suffixes in socket unit files.
Support the same syntax for transient property assignments.
Keep MessageQueueMaxMessages= as a plain message count.
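
As a rough illustration of what "IEC size" means here, a Python sketch of a base-1024 suffix parser follows. The function name and the accepted suffix table are assumptions made for this example; the real parser in systemd is written in C and may accept more forms.

```python
# Illustrative suffix table (base 1024); not taken from the systemd sources.
IEC_SUFFIXES = {
    "": 1,
    "B": 1,
    "K": 1024,
    "M": 1024**2,
    "G": 1024**3,
    "T": 1024**4,
}


def parse_iec_size(text):
    """Parse strings such as '8192', '8K' or '1M' into a byte count."""
    s = text.strip()
    digits = s.rstrip("BKMGT")
    suffix = s[len(digits):]
    if not (digits.isascii() and digits.isdigit()) or suffix not in IEC_SUFFIXES:
        raise ValueError(f"invalid size: {text!r}")
    return int(digits) * IEC_SUFFIXES[suffix]
```

Under these assumptions `MessageQueueMessageSize=8K` would correspond to 8192 bytes, while a plain count such as MessageQueueMaxMessages= is not passed through this kind of parser.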

sd-id128: parse UUID URNs

Accept RFC4122 UUID URN strings with the `urn:uuid:` prefix in
sd_id128_from_string(), while keeping plain 128-bit IDs and regular
UUID strings working as before.
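
The three accepted spellings can be sketched in Python as below. This is a model of the described behavior, not the C implementation: the function name is hypothetical, and both the exact-lowercase prefix match and the requirement that the URN body be in hyphenated form are assumptions.

```python
import re

_HEX32 = re.compile(r"[0-9a-fA-F]{32}")
_UUID = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
URN_PREFIX = "urn:uuid:"


def id128_from_string(text):
    """Return the 16 raw bytes for a plain ID, a UUID string, or a UUID URN."""
    if text.startswith(URN_PREFIX):
        body = text[len(URN_PREFIX):]
        # Assumption: after the prefix only the hyphenated UUID form is valid.
        if not _UUID.fullmatch(body):
            raise ValueError(f"invalid UUID URN: {text!r}")
        return bytes.fromhex(body.replace("-", ""))
    if _HEX32.fullmatch(text) or _UUID.fullmatch(text):
        return bytes.fromhex(text.replace("-", ""))
    raise ValueError(f"invalid 128-bit ID: {text!r}")
```

All three spellings of the same ID are expected to decode to identical bytes.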

udevadm: improve symlink query output

Implement the TODO item for `udevadm info -q symlink`: keep the
default space-separated output pager-free, and make `--value` print
one symlink per line with an empty separator line between devices.

po: resynchronize translations on Weblate

Weblate got itself into a conflict and while resolving it it forced a
resynchronization of all translations, which in combination with a new
version of Weblate triggered a lot of rather pointless
multiline-to-singleline (and vice versa) changes. Let's squash all this
noise into a single commit to make both Weblate and us happy.

Cf. https://github.com/systemd/systemd/pull/43248.

report: replace boolean --sign with signing modes

Turn --sign=BOOL into --sign=no|best-effort|require-one|require-all,
making the multi-signer aggregation policy explicit: best-effort never
fails on signing, require-one requires at least one signature, and
require-all requires every signer to succeed (an empty reply, i.e. a
signer opting out, counts as failure). Signed reports are always emitted
as a JSON-SEQ stream. The mode is also exposed as an input to the
io.systemd.Report.GenerateSigned Varlink method.
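
The aggregation policy can be summarised with a short Python sketch. The function name and the way signer replies are represented (None for a failed signer, an empty string for an empty reply) are assumptions made for illustration only.

```python
MODES = ("no", "best-effort", "require-one", "require-all")


def aggregate_signatures(mode, replies):
    """Apply the signing mode to one reply per signer.

    replies: None means the signer failed, "" means an empty reply (opt-out),
    any other string is a signature.
    """
    if mode not in MODES:
        raise ValueError(f"unknown signing mode: {mode!r}")
    if mode == "no":
        return []
    signatures = [r for r in replies if r]
    if mode == "require-one" and not signatures:
        raise RuntimeError("no signer produced a signature")
    if mode == "require-all" and len(signatures) != len(replies):
        raise RuntimeError("at least one signer failed or opted out")
    return signatures
```

In this model best-effort returns whatever signatures were collected, possibly none, and never raises.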

Signed-off-by: Paul Meyer <katexochen0@gmail.com>

sd-json: encode pidref fd_id as unsigned

PidRef.fd_id is uint64_t.

selinux: alternate root support (#42768)

This PR changes some SELinux bits related to working with alternate
roots (specifically when using `--root` or `--image` on a bunch of
executables).

It addresses bug #42643 and is hopefully a more complete approach than
the naive one I submitted in #42644.

Before this PR the 'wrong' labels get applied because the path lookups
in the SELinux label database are prefixed with whatever the location of
the alternate root is (explained in more detail below).

Initially I had taken a very naive approach that did fix the issue by
stripping the alternate root from the path; however this still looks up
that path in the host's label database, which might differ from the one
contained in the alternate root.

So this expanded approach actually reads the label database from the
alternate root, strips the root prefix (*if* an alternate root is used)
directly in `selinux-util.c`, and then uses the result to assign labels
instead.

See under the line for the behavior pre/post.

I've tried builds of this PR on both enforcing/non-enforcing/non-enabled
hosts *and* on enabled/non-enabled disk images and things seem to work
or at least fall back to ignoring MAC when required bits aren't present.

One thing is *if* an `/etc/selinux/config` is present that defines a
`SELINUXTYPE=` we *do* require the policy given to be present in the
image. This is the only new actual error in this code path that doesn't
get ignored.

We *could* verify that the path exists and also ignore it but I
personally don't think that's the right approach since the actual system
itself would likely also be broken anyhow. Let me know thoughts on that.

There is still a tight coupling here with the *host's* SELinux policy:
to set labels that are (potentially) unknown to the policy loaded in the
host kernel, these executables would need to execute in a domain that
allows transitioning to `mac_admin`. I'd say that `install_t` is the
most likely candidate for that. See the first comment on this PR for
more explanation and a request for input.

---

When mounting `a.raw` before running any tooling against it and showing
the `/etc/shadow` file labels we have:

```
€ sudo systemd-dissect --mount test/a.raw test/mnt/a
€ ls -Zlart test/mnt/a/etc/shadow
----------. 1 root root system_u:object_r:shadow_t:s0 520 Jun 27 07:55 test/mnt/a/etc/shadow
€ sudo systemd-dissect --umount test/mnt/a
```

After running `systemd-firstboot` against the image, then remounting,
note the labels that have been changed to incorrect values:

```
€ sudo systemd-firstboot --image test/a.raw --root-password test
/home/user/src/github.com/teamsbc/artifacts/test/a.raw: /etc/passwd written.
/home/user/src/github.com/teamsbc/artifacts/test/a.raw: /etc/shadow written.
€ sudo systemd-dissect --mount test/a.raw test/mnt/a
€ ls -Zlart test/mnt/a/etc/shadow
----------. 1 root root system_u:object_r:init_var_run_t:s0 579 Jun 27 08:01 test/mnt/a/etc/shadow
```

The behavior before this PR looks up the labels in the label database of
the host, but the path that gets looked up is the path where the image
is temporarily mounted, or in the case of `--root` where the root is on
the host. Since that path doesn't define any labels we get the labels of
the location where the file was created on the host. In this case since
`--image` was used, which mounted things in a temporary location we end
up with `var_run_t`.

If this image is booted, things that want to read `/etc/shadow` might not
be allowed to read files labeled this way; thus services fail to start,
and root can't log in when SELinux is in enforcing mode.

After this PR is applied there are two main differences in how things
are handled. The first being that instead of reading the label database
from the host (which might have none, or have a different one from the
one contained inside an image or root) we read the label database from
inside the alternate root. This tries to make sure we get the correct
labels for given paths.

Second, and most importantly, if we did init SELinux with an alternate
root then any paths passed to the relevant label lookup functions strip
that alternate root from the path. While previously we'd look up a path
like `/run/dissect-XXXX/etc/shadow` we now look up a path like
`/etc/shadow` *and* this path gets looked up in the label database in
the alternate root.
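
The prefix stripping described above amounts to something like the following Python sketch. The helper name is hypothetical, and leaving paths outside the alternate root unchanged is an assumption of this example rather than something the PR states.

```python
import posixpath


def strip_alternate_root(root, path):
    """Map a host path below `root` to the path as seen inside the root."""
    if not root:
        return path  # no alternate root in use, look the path up as-is
    root = posixpath.normpath(root)
    path = posixpath.normpath(path)
    if root == "/":
        return path
    if path == root:
        return "/"
    if path.startswith(root + "/"):
        return path[len(root):]
    return path  # assumption: paths outside the root are left untouched


print(strip_alternate_root("/run/dissect-XXXX", "/run/dissect-XXXX/etc/shadow"))
```

The resulting `/etc/shadow` is then the key to look up in the label database read from inside the alternate root, not the host's.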

Together these things give, in my opinion, better handling of SELinux in
alternate roots. To confirm things work, here are the same operations on
the second copy of our image:

```
€ sudo systemd-dissect --mount test/b.raw test/mnt/b
€ ls -Zlart test/mnt/b/etc/shadow
----------. 1 root root system_u:object_r:shadow_t:s0 520 Jun 27 07:55 test/mnt/b/etc/shadow
€ sudo ~/src/github.com/systemd/systemd/build/systemd-firstboot --image test/b.raw --root-password test
/home/user/src/github.com/teamsbc/artifacts/test/b.raw: /etc/passwd written.
/home/user/src/github.com/teamsbc/artifacts/test/b.raw: /etc/shadow written.
€ sudo systemd-dissect --mount test/b.raw test/mnt/b
€ ls -Zlart test/mnt/b/etc/shadow
----------. 1 root root system_u:object_r:shadow_t:s0 579 Jun 27 08:44 test/mnt/b/etc/shadow
```

Showing that we now have the correct labels applied.

properties: Skip unnecessary per property filtering (#43146)

In total this can save ~18% CPU on `systemctl show` nominal queries in
my tests.

po: Translated using Weblate (Russian)

Currently translated at 100.0% (286 of 286 strings)

Co-authored-by: Andrei Stepanov <adem4ik@gmail.com>
Translate-URL: https://translate.fedoraproject.org/projects/systemd/main/ru/
Translation: systemd/main

[zjs: made some small corrections based on Claude comments.]

build(deps): bump the actions group with 6 updates

Bumps the actions group with 6 updates:

| Package | From | To |
| --- | --- | --- |
| [actions/checkout](https://github.com/actions/checkout) | `7.0.0` | `7.0.1` |
| [actions/setup-python](https://github.com/actions/setup-python) | `6.3.0` | `7.0.0` |
| [github/codeql-action/upload-sarif](https://github.com/github/codeql-action) | `4.36.2` | `4.37.3` |
| [aws-actions/configure-aws-credentials](https://github.com/aws-actions/configure-aws-credentials) | `6.2.0` | `6.2.3` |
| [softprops/action-gh-release](https://github.com/softprops/action-gh-release) | `3.0.1` | `3.0.2` |
| [ossf/scorecard-action](https://github.com/ossf/scorecard-action) | `2.4.3` | `2.4.4` |

Updates `actions/checkout` from 7.0.0 to 7.0.1
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/9c091bb21b7c1c1d1991bb908d89e4e9dddfe3e0...3d3c42e5aac5ba805825da76410c181273ba90b1)

Updates `actions/setup-python` from 6.3.0 to 7.0.0
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](https://github.com/actions/setup-python/compare/ece7cb06caefa5fff74198d8649806c4678c61a1...5fda3b95a4ea91299a34e894583c3862153e4b97)

Updates `github/codeql-action/upload-sarif` from 4.36.2 to 4.37.3
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](https://github.com/github/codeql-action/compare/8aad20d150bbac5944a9f9d289da16a4b0d87c1e...e4fba868fa4b1b91e1fdab776edc8cfbe6e9fb81)

Updates `aws-actions/configure-aws-credentials` from 6.2.0 to 6.2.3
- [Release notes](https://github.com/aws-actions/configure-aws-credentials/releases)
- [Changelog](https://github.com/aws-actions/configure-aws-credentials/blob/main/CHANGELOG.md)
- [Commits](https://github.com/aws-actions/configure-aws-credentials/compare/e7f100cf4c008499ea8adda475de1042d6975c7b...e6de054238d6b7531b4efff3b6587d9aade6a06c)

Updates `softprops/action-gh-release` from 3.0.1 to 3.0.2
- [Release notes](https://github.com/softprops/action-gh-release/releases)
- [Changelog](https://github.com/softprops/action-gh-release/blob/master/CHANGELOG.md)
- [Commits](https://github.com/softprops/action-gh-release/compare/718ea10b132b3b2eba29c1007bb80653f286566b...3d0d9888cb7fd7b750713d6e236d1fcb99157228)

Updates `ossf/scorecard-action` from 2.4.3 to 2.4.4
- [Release notes](https://github.com/ossf/scorecard-action/releases)
- [Changelog](https://github.com/ossf/scorecard-action/blob/main/RELEASE.md)
- [Commits](https://github.com/ossf/scorecard-action/compare/4eaacf0543bb3f2c246792bd56e8cdeffafb205a...2d1146689b8cda280b9bc96326124645441f03bc)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: 7.0.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: actions
- dependency-name: actions/setup-python
  dependency-version: 7.0.0
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
- dependency-name: github/codeql-action/upload-sarif
  dependency-version: 4.37.3
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: actions
- dependency-name: aws-actions/configure-aws-credentials
  dependency-version: 6.2.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: actions
- dependency-name: softprops/action-gh-release
  dependency-version: 3.0.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: actions
- dependency-name: ossf/scorecard-action
  dependency-version: 2.4.4
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: actions
...

Signed-off-by: dependabot[bot] <support@github.com>

build(deps): bump meson from 1.11.1 to 1.11.2 in /.github/workflows

Bumps [meson](https://github.com/mesonbuild/meson) from 1.11.1 to 1.11.2.
- [Release notes](https://github.com/mesonbuild/meson/releases)
- [Commits](https://github.com/mesonbuild/meson/compare/1.11.1...1.11.2)

---
updated-dependencies:
- dependency-name: meson
  dependency-version: 1.11.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

properties: Skip value building for nominal case

bus_message_print_all_properties() builds a PROP= string for every
property in the reply so that -p PROP=value filters can be matched
against it, but most queries never need this.

Take the normal `systemctl show` or `systemctl show UNIT` case. In that
case there is no filter. Even with `-p PROP` there is no value filter
since there is no value.

Avoid constructing the string entirely by comparing property names
directly against filter entries.

In my tests with a `systemctl show` over 160 units this brings the
instructions retired from 992.6M down to 960.5M, a reduction of 3.2%.

The same goes for property filters with units. When running:

    systemctl show -p UnitFileState -p ActiveState UNIT

...the instructions retired drop from 9.52M to 9.22M, a reduction of
3.2%. The output in each case is unchanged.

properties: Peek the variant value type once (#43145)

bus_message_print_all_properties() peeks the variant type, but then the
print callback and the default bus_print_property() each peek the value
type again, so there can be up to two redundant calls per property. Peek
it once up front and pass it through.

With this, in my tests `systemctl show` over 160 units decreases in
instructions retired from 1167.6M to 1155.7M, so about 1%.

properties: Peek the variant value type once

bus_message_print_all_properties() peeks the variant type, but then the
print callback and the default bus_print_property() each peek the value
type again, so there can be up to two redundant calls per property. Peek
it once up front and pass it through.

With this, in my tests `systemctl show` over 160 units decreases in
instructions retired from 1167.6M to 1155.7M, so about 1%.

properties: Skip found set building for nominal case

bus_message_print_all_properties() inserts every name it walks into the
found-properties set, but the set is only used to report missing requested
properties at debug level.

Request the set from systemctl only when properties were specified and debug
logging is enabled, avoiding the unnecessary work in normal operation.

In my tests with `systemctl show` over 160 units this brings the
instructions retired from 1167.8M down to 992.6M, a reduction of 15.0%.
The output is unchanged.

test: TEST-89: remove temporary files when browse helpers return

The RETURN traps in the browse helpers only stopped the transient
varlinkctl unit; the mktemp'd output/error/scratch files were never
removed and leaked on every invocation. Remove them from the same trap,
after the unit has been stopped so that nothing is still writing to
them, and make that stop best-effort like in
testcase_browse_ifindex_zero_no_flap: if varlinkctl exited on its own,
the transient unit is already gone, and a failing stop would otherwise
abort the testcase under errexit and skip the removal.

testcase_browse_ifindex_zero_no_flap cleans up its output file from the
trap it already arms for the dummy link, which is an EXIT trap since
run_testcases runs each testcase in its own subshell.

While at it, tidy up the helpers' variable scoping: error_file was
accidentally a global, and i/svc were declared in the wrong functions
(they are used by check_both/check_first, via dynamic scoping).

tmpfiles: fix log message about BSD lock

test: stage volatile deb packages for upgrade test, and retry failed units (#43209)

tmpfiles: reject extra argument fields and make 'r'/'R' support age field (#43195)

This updates systemd-tmpfiles in two small areas:

- Reject non-empty argument fields for tmpfiles.d line types that do not
  consume the argument field, instead of warning and silently ignoring
  them.
- Let `r` and `R` tmpfiles.d entries honor the `Age` field when
  `systemd-tmpfiles --clean` is used. The existing `--remove` behavior
  remains unconditional.

The completed TODO entries are removed, and NEWS/man page documentation
is updated for the visible behavior changes.

core/timer: fix next trigger with RandomizedOffsetSec + Persistent (#42826)

When a calendar timer with RandomizedOffsetSec and Persistent=true fires a
catch-up activation, last_trigger is set to the current wall-clock time.
This inherently already includes any randomized offset, because the
trigger was that the activation time including the offset had passed.

When computing the next elapse, calendar_spec_next_usec() finds the next
calendar boundary after the base time, and then random_offset is added to the
result. If the base time already includes the offset, the next calendar
boundary is one period too far in the future, causing a scheduled activation
to be skipped.

Fix this by always subtracting random_offset from the base time before passing
it to calendar_spec_next_usec(), matching what the fallback branches (using
inactive_exit_timestamp or current time) already do. This puts the base into
"pre-offset calendar space" so that the next calendar match and subsequent
offset addition yield the correct next activation time.
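
A toy Python model may help to see the off-by-one-period effect. It assumes a calendar with one boundary per day and a fixed stand-in for the randomized offset; the function names are made up and this is not the timer code itself.

```python
PERIOD = 86400  # toy calendar: one boundary per day, in seconds
OFFSET = 3600   # stand-in for the randomized offset of this timer


def next_boundary(after):
    """Next calendar boundary strictly after `after`."""
    return (after // PERIOD + 1) * PERIOD


def next_elapse_old(last_trigger):
    # Base time is used as-is although it already lies in post-offset time.
    return next_boundary(last_trigger) + OFFSET


def next_elapse_fixed(last_trigger):
    # Move the base into pre-offset calendar space first.
    return next_boundary(last_trigger - OFFSET) + OFFSET


# Catch-up activation shortly after the day-2 boundary (2 * PERIOD),
# but before that day's activation at 2 * PERIOD + OFFSET.
last_trigger = 2 * PERIOD + 200

print(next_elapse_old(last_trigger))    # day 3 + offset, day 2 is skipped
print(next_elapse_fixed(last_trigger))  # day 2 + offset
```

In this model the old computation jumps a whole period ahead whenever the base time falls between a calendar boundary and that boundary plus the offset.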

Fixes #42337.

string-util: don't miss ANSI sequence at the very end in previous_ansi_sequence()

The backwards scan started at offset length-3, so the last position at which a
sequence can begin, length-2, was never examined. CSI sequences are at least
three bytes long and were thus unaffected, but two byte Fe sequences (ESC
followed by 0x40…0x5F) terminating the examined slice were missed.

ellipsize_mem() calls this to figure out whether a sequence ends exactly at the
current position, so that it can be skipped over, which is precisely the case
that was broken: such a sequence was instead counted as two visible cells and
copied through as text, and the string was ellipsized more aggressively than
requested. For example ellipsize("🐱🐱\x1bM🐱🐱\x1bM", 5, 0) returned a three
cell wide string rather than the five cells asked for.
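
The effect of the starting offset can be reproduced with a simplified Python model of the backwards scan. The sequence recognizer below is an assumption of this sketch: it knows CSI sequences and treats every other Fe byte as a complete two-byte sequence, which is cruder than the real code.

```python
ESC = 0x1B


def sequence_length_at(data, i):
    """Length of the ANSI sequence starting at index i, or 0 if there is none."""
    if i + 1 >= len(data) or data[i] != ESC:
        return 0
    nxt = data[i + 1]
    if nxt == 0x5B:  # '[' introduces a CSI sequence
        j = i + 2
        while j < len(data) and 0x30 <= data[j] <= 0x3F:  # parameter bytes
            j += 1
        while j < len(data) and 0x20 <= data[j] <= 0x2F:  # intermediate bytes
            j += 1
        if j < len(data) and 0x40 <= data[j] <= 0x7E:     # final byte
            return j + 1 - i
        return 0
    if 0x40 <= nxt <= 0x5F:  # two-byte Fe sequence
        return 2
    return 0


def previous_sequence(data, first_offset_from_end):
    """Scan backwards for a sequence that ends exactly at the end of data."""
    for i in range(len(data) - first_offset_from_end, -1, -1):
        n = sequence_length_at(data, i)
        if n and i + n == len(data):
            return i
    return None
```

Starting at length-3 finds a trailing CSI sequence but not a trailing two-byte sequence such as `ESC M`; starting at length-2 finds both.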

core/dbus: do not flush a user manager's bus that is not RUNNING

destroy_bus() flushes unwritten data for unprivileged managers so that
queued messages are not lost when a connection is torn down. However,
sd_bus_flush() first drives the connection to completion via
bus_ensure_running(): for a connection still in OPENING or
AUTHENTICATING this blocks the manager synchronously - a single-fd
ppoll, the event loop is not running - until the peer answers or
BUS_AUTH_TIMEOUT (= DEFAULT_TIMEOUT_USEC, 90 s by default) expires; a
connection in HELLO blocks the same way on the Hello call's own
method timeout. destroy_bus() is reached from four places: the
disconnect handler, manager_recheck_dbus(), the failed-setup path in
api_bus_instance_id_reply(), and bus_done() during normal shutdown or
reexec (via manager_free()).

Such a peer legitimately never answers: during session teardown,
dbus.socket can be (or re-enter) listening while the D-Bus service
behind it is hung or already gone, so connect() succeeds against the
socket backlog and the authentication request is never read. The user
manager then freezes mid-shutdown - or mid-reexec - for 90 s, with
every remaining unit stop (or the reexec itself) gated behind it.

This is the block traced in #16471 (2020, v245): the reporter's
strace shows the manager hanging in a single-fd ppoll with an ~89 s
timeout right after SIGTERM - bus_ensure_running() driving an
AUTHENTICATING reconnection - and their summary attributes it to the
flush. Their tested sd-bus-level patch (breaking the wait via a
SIGTERM-set flag) was met with "a work-around once things are already
bad, but we shouldn't even get in that state"; the same reply stated
the expectation this commit implements - the manager "should normally
protect itself ... and not issue dbus messages when the dbus service
isn't fully up". bus_foreach_bus() already applies that principle in
the other direction, skipping enqueue for connections that "haven't
started yet" via the same sd_bus_is_ready() check. 1166f4472d
("core/dbus: do not block the manager on GetId during bus
(re-)connection") removed the connect-time instance of the same class
of block, in code added in 2025. The flush here is twelve years
older: it dates back to the libsystemd-bus conversion (718db96199,
2013).

Only flush when the connection is currently RUNNING. This does not
change what gets delivered in the failure case this fixes: whatever a
non-RUNNING connection's write queue holds (auth/Hello traffic, and
any subscriber signal a per-unit or per-job bus_track attached without
a readiness check) was never actually sent by the old code either -
sd_bus_flush() calls bus_ensure_running() before it ever looks at the
write queue, so a connection that times out without reaching RUNNING
had its queued data silently discarded on close exactly as before,
just after blocking for up to 90 s first rather than immediately. The
one narrowing is a connection that would have completed authentication
within the timeout window: previously such traffic could still reach
the peer after the block; now it will not. Every destroy_bus() call
site is reached only once the manager has already decided the
connection is being torn down (disconnected, recognized as down, or
the process itself exiting/reexecuting), so this narrowing does not
trade a working delivery for a broken one.

Fixes #16471

selinux: wire up LabelContext in tmpfiles, firstboot, sysusers

These three one-shot tools operate on alternate roots via --root/--image
but until now created files with host SELinux labels, producing images
that fail to boot or run with enforcing mode because every file carries
the wrong security context.

Create a LabelContext from arg_root at startup and thread it through all
labeling call sites so the target image gets labeled according to its own
policy.

Signed-off-by: Simon de Vlieger <cmdr@supakeen.com>

selinux: use LabelContext in label callbacks

With the plumbing and context type in place, make the SELinux pre/post
callbacks use the alternate context when label_context is non-NULL, so
files get labeled according to the target image's policy rather than the
host's.

Errors from the host kernel not recognising image-specific contexts
(EINVAL from setfscreatecon_raw) are logged at debug level and skipped
gracefully, since this is expected when the image carries labels the host
policy doesn't define.

Signed-off-by: Simon de Vlieger <cmdr@supakeen.com>

selinux: add LabelContext for alternate-root labeling

When operating on an alternate root (--root/--image), SELinux labels must
come from that root's policy, not the host's. This requires opening a
separate selabel_handle against the target's policy database and
remembering the root path for prefix stripping.

Introduce LabelContext to carry both, and mac_label_context_new() to
set it up: it reads the target's SELinux config, validates SELINUXTYPE to
prevent path traversal, sets the policy root, and opens a label database
scoped to the target image.

Signed-off-by: Simon de Vlieger <cmdr@supakeen.com>

label: plumb label_context through LabelOps

SELinux labeling with --root/--image needs per-call context to carry an
alternate policy database and root prefix. The current LabelOps interface
has no way to pass this, forcing any solution to rely on global state.

Add a LabelContext *label_context to the label_ops callbacks and all
intermediate layers. Generic filesystem functions that most callers use
(xopenat_full, write_string_file_full, etc.) get a _label variant
carrying the extra parameter, with inline wrappers preserving the
original signatures so the vast majority of call sites remain untouched.

Signed-off-by: Simon de Vlieger <cmdr@supakeen.com>

selinux: relax error handling in permissive mode (#36929)

An error returned from security_compute_create_raw() means that the kernel
couldn't compute the target context, very likely because the file context
is not known to the policy, i.e. the security.selinux xattr contains some
garbage value and we are running in permissive mode; otherwise the returned
context would be "unlabeled_t" instead of an error.

mac_selinux_get_create_label_from_exe() is used to figure out the create
label for socket units, and we fail to start the socket if we can't
figure out that label.

However, it may be necessary to start some sockets in order to get to
the point when we launch the service that relabels (in permissive mode)
the entire filesystem and reboots.

core/service: append the original error cause in the debugging logs

core/service: ignore SELinux label errors in permissive mode

Return -ENODATA instead of the raw error when SELinux is permissive, so
the caller falls back to the default label. This is needed to allow
the relabeling service to start on systems where file contexts may be
invalid.

test: find out tpm device rather than hardcoding it in TEST-92-TPM2-SWTPM

Looks like the device might change depending on the boot sequence, so
discover it in the test rather than hardcoding it to /dev/tpmrm0.

Fixes https://github.com/systemd/systemd/issues/43210

Follow-up for 1b1900a6f3162b3f16b6779bdcca9ec0f9aa11e9

core: quote each exec directory entry when serializing

Quote each serialized exec directory entry, and use extract flags
compatible with config_parse_exec_directories() when deserializing.
This allows paths containing spaces and escaped characters to round-trip
correctly.
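
Why per-entry quoting matters for the round trip can be shown with Python's shlex module. Note that this swaps in shell-style quoting as a stand-in: systemd uses its own quoting and extraction flags, and the helper names here are hypothetical.

```python
import shlex


def serialize_entries(paths):
    """Quote every entry on its own, then join with single spaces."""
    return " ".join(shlex.quote(p) for p in paths)


def deserialize_entries(line):
    """Split with rules that match the quoting used when serializing."""
    return shlex.split(line)


paths = ["/var/lib/my app", "/var/lib/plain", "/var/lib/back\\slash"]
line = serialize_entries(paths)
print(line)
print(deserialize_entries(line) == paths)
```

Joining the same entries without quoting and splitting on whitespace would break the first path into two entries.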

Fixes #41853.
Replaces #42686.

ci: fix /etc and /usr ownership

Recent Ubuntu 24.04 GHA images have /etc and /usr owned by runner
instead of root, which breaks some of our tests. This has been filed to
GH as https://github.com/actions/runner-images/issues/14477, so let's
work around this in our jobs until it's fixed.

test: retry units failed during package replacement in TEST-88-UPGRADE

Retry units that still exist and clear stale state for units removed by
a downgrade, as packages might be old and not have new units that were
added in the latest version.

mkosi: stage volatile deb packages for upgrade test

Prepare scripts run before volatile packages are installed, so parse
the list from the config to ensure they are all included to avoid
failures due to some packages missing from the list.

Follow-up for 28e1f84d6a2721000b0d781220b154f0bed50cc8

tmpfiles: honor age for r/R cleanup

Let r and R lines participate in --clean when they specify an age. The
target itself is removed only after its selected file or directory
timestamps have aged enough; --remove remains unconditional.

Follow-up for: beca6b6e6b64cebfe9fc2c89117f6abd3c1b5701

po: Translated using Weblate (Kazakh)

Currently translated at 100.0% (286 of 286 strings)

Co-authored-by: Baurzhan Muftakhidinov <baurthefirst@gmail.com>
Translate-URL: https://translate.fedoraproject.org/projects/systemd/main/kk/
Translation: systemd/main

libudev: cache the errno for a failed parent lookup

udev_device_get_parent() caches the result of device_new_from_parent()
in udev_device->parent on first call. device_new_from_parent() sets
errno correctly via return_with_errno() when it fails, so a caller
gets the right errno on the first invocation. But on any subsequent
call for the same object, the cached NULL is returned directly without
recomputing anything, so errno reflects whatever happened in between
rather than the original failure reason.

Cache the errno alongside the parent pointer (new parent_errno field,
zero-initialized like the rest of the struct via udev_device_new()'s
compound literal) and restore it whenever the cached parent is NULL.
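
The caching pattern can be modelled in a few lines of Python. This is a sketch only: the class, the `lookup` callable and the `last_errno` attribute (standing in for the C errno variable) are all made up for the example.

```python
import errno


class Device:
    """Toy model of a device that caches its parent lookup and the errno."""

    def __init__(self, lookup):
        self._lookup = lookup    # hypothetical callable returning (parent, err)
        self._parent = None
        self._parent_set = False
        self._parent_errno = 0
        self.last_errno = 0      # stand-in for the C errno variable

    def get_parent(self):
        if not self._parent_set:
            # First call: do the real lookup and remember why it failed.
            self._parent, self._parent_errno = self._lookup()
            self._parent_set = True
        if self._parent is None:
            # Restore the original failure reason on every call.
            self.last_errno = self._parent_errno
        return self._parent
```

A second call after some unrelated operation has changed errno still reports the original failure, without repeating the lookup.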

ci: build the release clang build with _FORTIFY_SOURCE=3

Since this combination is known to cause interesting issues in our
allocation machinery.

alloc-util: make malloc_sizeof_safe() compatible with clang's _FORTIFY_SOURCE=3

Turns out that clang's interprocedural analysis is quite smart and can
look through our expand_to_usable() trick, which then causes crashes
with _FORTIFY_SOURCE=3.

clang's interprocedural analysis can see that expand_to_usable() simply
returns its first argument, so it replaces all uses of the return value
with that argument (the original realloc() result). This effectively
bypasses expand_to_usable()'s alloc_size attribute, causing the fortify
check to use the (smaller) size from realloc() instead, which eventually
leads to a false-positive buffer overflow:

    $ build/test-varlink-idl
    /* test_parse_format */
    ...
    *** buffer overflow detected ***: terminated
    Aborted (core dumped) build/test-varlink-idl

This is not an issue with gcc (at least not yet), since gcc sees
expand_to_usable() as an opaque user-defined allocation-like function
and simply trusts the alloc_size attribute that comes with it.

To fix this, let's add a simple no-op barrier to malloc_sizeof_safe()
that clobbers the input pointer, which causes
__builtin_dynamic_object_size() to return (size_t)-1 - this is
interpreted as an "unknown" size by the following fortify check which is
then skipped instead of triggering the assertion.

Similarly, test-alloc-util now doesn't call malloc_usable_size()
directly but instead goes through malloc_sizeof_safe(), so it's also
guarded by the barrier.
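
One common way to write such a no-op barrier with gcc and clang is an empty inline asm statement that claims to modify the pointer. The sketch below is an assumption about the general technique, not the code from this commit; pointer_barrier() and size_seen_after_barrier() are hypothetical names, and it queries the older __builtin_object_size() rather than __builtin_dynamic_object_size(), since the former is more widely available and reports an unknown size the same way.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the empty asm emits no instructions, but because it
 * declares 'p' as read and written, the compiler can no longer prove which
 * allocation the returned pointer refers to. GCC/clang extension. */
static inline void *pointer_barrier(void *p) {
        __asm__ volatile ("" : "+r" (p));
        return p;
}

/* What an object size query sees once the pointer went through the barrier:
 * (size_t) -1, i.e. "unknown". */
static size_t size_seen_after_barrier(void *p) {
        void *q = pointer_barrier(p);

        assert(q == p);  /* the value itself is unchanged */
        return __builtin_object_size(q, 0);
}
```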

This follows the already established pile of similar workarounds for the
same class of issues we encountered with gcc, namely [0], which prompted
[1], that was later reverted in [2], and then followed by another couple
of fixes in [3] and [4].

Resolves: #43178

[0] https://github.com/systemd/systemd/issues/22801
[1] https://github.com/systemd/systemd/commit/0bd292567a543d124cd303f7dd61169a209cae64
[2] https://github.com/systemd/systemd/commit/2cfb790391958ada34284290af1f9ab863a515c7
[3] https://github.com/systemd/systemd/commit/7929e180aa47a2692ad4f053afac2857d7198758
[4] https://github.com/systemd/systemd/commit/4f79f545b3c46c358666c9f5f2b384fe50aac4b4

shared/switch-root: sync only file systems becoming unreachable, not everything

switch_root() calls a blanket sync() before detaching the old root
file system, in order to make sure it is in a good state before it
becomes unreachable via MNT_DETACH/pivot_root().

A global sync() however flushes out *every* mounted file system on
the system, not just the ones we are actually about to detach. On
real-world systems that commonly have several additional mounted file
systems (separate /home, /var, additional data partitions, network
shares, removable media, ...) this needlessly delays switch_root() with
completely unrelated I/O. This matters in particular for
initrd-switch-root.service, which runs this code on the critical path
of pretty much every single boot with an initrd, and for soft-reboot.

Replace the global sync() with a new sync_departing_file_systems()
helper that walks /proc/self/mountinfo and calls syncfs() on every
file system except:

  - 'new_root' and anything mounted below it: these remain mounted
    and reachable after the transition and keep being synced normally
    as part of their regular life cycle, so they don't need to be
    force-flushed here.

  - API/pseudo file systems (proc, sysfs, cgroupfs, autofs, ...),
    network file systems, and overlayfs (which has no backing store
    of its own), as determined by the new fstype_is_worth_syncing()
    predicate. There is nothing meaningful to flush on any of these,
    and more importantly, opening an untriggered autofs mount point
    would needlessly trigger it, and opening a stale network mount
    could block for a long time - exactly what we are trying to avoid
    on this code path.

  - Any flavour of FUSE (plain 'fuse', 'fuseblk', or a
    'fuse.<subtype>', e.g. sshfs, rclone, gvfs, ntfs-3g, exfat-fuse,
    ...), classified via the new fstype_is_fuse() predicate in
    src/basic/mountpoint-util.c, plus a few other, non-FUSE guest/host
    file sharing file systems with the same "backed by a companion
    daemon/hypervisor that could be wedged" risk profile (virtiofs,
    vboxsf, vmhgfs). All I/O against any of these, including the
    syncfs() we'd otherwise issue, is routed through an arbitrary
    userspace daemon (or, for virtiofs/vboxsf/vmhgfs, the host/
    hypervisor side), which could hang indefinitely if wedged, dead,
    or otherwise unresponsive - there's no timeout on this code path.
    'fuseblk' might sound exempt given the name, and does wrap an
    actual block device, but that doesn't bound its syncfs() latency
    by the kernel block layer alone the way a native block device
    file system's is: the request is still serviced by the same FUSE
    daemon as any other FUSE variant, and can hang exactly the same
    way, so it is excluded here too, trading its comparatively minor
    data-safety benefit for avoiding that unbounded hang risk.

    '9p' (which can be used with a writeback cache and hence carry
    real dirty data, e.g. common in QEMU/KVM guests) and the
    shared-storage cluster file systems 'gfs', 'gfs2' and 'ocfs2'
    (which fstype_is_network() also happens to classify as "network"
    file systems, since they additionally rely on a networked
    distributed lock manager for coordination) are deliberately *not*
    excluded: unlike FUSE/virtiofs/etc., these are serviced by a
    mature, in-kernel client (talking directly to the hypervisor over
    a bounded virtio transport, or to real - if shared - block
    storage), not an arbitrary, potentially wedged userspace daemon,
    so they carry the same bounded, local sync latency any other
    block device backed file system already does here. Skipping them
    would needlessly sacrifice the data-safety guarantee the original
    blanket sync() gave them, without meaningfully improving safety.

  - Mount table entries that we can positively confirm are currently
    shadowed by another mount stacked on top of them at the same
    path: since we can only reach a file system by (re-)opening its
    target path, and that always resolves to whatever is currently on
    top, syncing by path alone could end up flushing the wrong
    superblock. Detect this via the new shared
    libmount_fs_id_matches_path() helper (factored out of, and now
    also used by, the pre-existing get_sub_mounts(), which needed the
    exact same check for the same reason). This same check is also
    applied to a mountinfo entry whose target is 'new_root' itself
    (not just anything strictly below it): comparing its mount ID
    against new_root's own, freshly determined mount ID tells apart
    the file system that is actually still reachable there (which we
    continue to skip) from a stale entry that merely shares the exact
    same path (e.g. if new_root wasn't already its own mount point
    and got bind-mounted onto itself earlier in switch_root()), which
    is departing just the same and must not be skipped just because
    of that coincidence.
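
The FUSE classification from the list above can be sketched as plain string matching. The function names below are hypothetical and the code is only an illustration of the rule as described, not the fstype_is_fuse() implementation itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical re-implementation of the described check: plain "fuse",
 * "fuseblk", or any "fuse.<subtype>". */
static bool fstype_is_fuse_sketch(const char *fstype) {
        assert(fstype);

        return strcmp(fstype, "fuse") == 0 ||
               strcmp(fstype, "fuseblk") == 0 ||
               strncmp(fstype, "fuse.", 5) == 0;
}

/* Hypothetical: the non-FUSE guest/host sharing file systems named above. */
static bool fstype_is_host_shared_sketch(const char *fstype) {
        static const char *const table[] = { "virtiofs", "vboxsf", "vmhgfs" };

        assert(fstype);

        for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++)
                if (strcmp(fstype, table[i]) == 0)
                        return true;

        return false;
}
```

Matching on the "fuse." prefix rather than on "fuse" alone keeps unrelated types such as fusectl out of the set.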

Every failure mode that means we can no longer be sure we've covered
every departing file system correctly - libmount being unavailable,
/proc/self/mountinfo (or a specific entry in it) failing to parse,
being unable to tell whether a specific entry is currently shadowed,
or syncfs_path() itself failing for an otherwise-eligible entry - is
handled the exact same way: propagate the error up and let the sole
caller, switch_root(), fall back to one plain, global sync() to cover
everything, rather than deciding on and performing that fallback (or,
worse, silently skipping the affected file system without any
fallback at all) at each of these different spots individually. This
should be rare in practice, so it doesn't meaningfully undercut the
benefit of the targeted sync in the common case.

Everything else that's actually about to become unreachable (the old
root itself, but also any other, unrelated real file system that
happens to be mounted underneath it and gets detached along with it)
is still synced, so this keeps the same safety guarantee the original
blanket sync() gave for file systems that actually do go away here.
Uses the existing syncfs_path() helper for the actual open+syncfs.

sync_departing_file_systems() itself returns -EOPNOTSUPP if libmount
support isn't compiled in, handled the same way by switch_root() as
any of its other error returns.

Note we intentionally don't use O_PATH file descriptors here: syncfs()
requires a 'real' file descriptor and fails with EBADF on O_PATH ones.
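
A minimal sketch of the open+syncfs step, assuming Linux and glibc. syncfs_path_sketch() is a hypothetical name and the flags are an assumption; the existing syncfs_path() helper may differ in detail.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Opens 'path' with a regular (non-O_PATH) descriptor and flushes the file
 * system it lives on. Returns 0 on success or a negative errno. */
static int syncfs_path_sketch(const char *path) {
        int fd, r = 0;

        assert(path);

        fd = open(path, O_RDONLY | O_CLOEXEC);
        if (fd < 0)
                return -errno;

        if (syncfs(fd) < 0)
                r = -errno;

        close(fd);
        return r;
}
```

A caller following the commit's error handling would treat any negative return as a reason to fall back to a plain sync().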

Also note there remains an inherent, narrow TOCTOU race between the
mount-ID check described above and the open() syncfs_path() performs
right after it: if something else mounts something new on top of a
given 'path' in between, that open() could still end up triggering an
automount, or hanging on a stale mount, since there is no open()/
openat() equivalent of statx()'s AT_NO_AUTOMOUNT to prevent this for a
"real" (non-O_PATH) file descriptor. Unlike the other failure modes
handled here, a hanging open() can't be recovered from by falling back
to sync() afterwards, since control never returns to do so. Closing
this fully would require disproportionate effort (e.g. performing the
open() in a separate, killable/timeout-bounded process) for a window
that is already narrow, since this code only runs with most other
activity on the system already quiesced during the switch_root()
transition itself, so it is accepted as-is (see the comment at the
call site for details).

This mirrors the same reasoning already applied to the shutdown path
in src/shutdown/shutdown.c, which deliberately avoids a 'dumb' sync()
there for identical reasons.

core: postpone dbus queue dispatch while API bus setup is pending (#43200)

Since 1166f4472d7669c8008c158a178dd6f76b601fe1 the API bus setup
and the subscriber coldplug happen only once the asynchronous GetId
reply is processed by the event loop. After a daemon-reexec,
manager_dispatch_dbus_queue() runs before subscribers were registered
and consumed send_reloading_done, so the one-shot Reloading(false)
signal was never sent.

Clients that wait for this signal to detect that a reexec finished time out.

Track the pending setup and hold the flag until the reply handler has
re-added the subscriptions.

Also affects v261.2 via backport 26f3717e27f6f08347adf851c605d5fe1f52ac57.

test: add case for the Reloading bus signal

Check that the manager broadcasts Reloading(true/false) on the API bus
for daemon-reload, and the one-shot Reloading(false) after a
daemon-reexec, which requires the subscribers of the previous instance
to be coldplugged before the D-Bus queue is dispatched.

core: postpone D-Bus queue dispatch until the API bus is set up

Since 1166f4472d ("core/dbus: do not block the manager on GetId during
bus (re-)connection") the API bus setup and the subscriber coldplug
happen only once the asynchronous GetId reply is processed by the
event loop. After a daemon-reexec, manager_dispatch_dbus_queue() runs
before subscribers were re-added, so bus_foreach_bus() skipped the
API bus and queued messages were lost for subscribers. In particular
the one-shot Reloading(false) signal was never sent, and clients that
wait for it to detect that a reexec finished timed out.

Track whether bus_setup_api() has run for the current API bus
connection, and postpone dispatching the queue until then.
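
The postponement logic can be illustrated with a small state sketch. All names here (struct manager_sketch, api_bus_ready, dispatch_queue(), on_setup_reply()) are hypothetical and the queue is reduced to counters; the real manager tracks considerably more state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct manager_sketch {
        bool api_bus_ready;  /* set once the asynchronous setup has completed */
        size_t n_queued;     /* messages waiting to be sent */
        size_t n_sent;       /* messages delivered to subscribers */
};

/* Returns the number of messages sent. While setup is pending, nothing is
 * consumed, so one-shot signals stay queued instead of being lost. */
static size_t dispatch_queue(struct manager_sketch *m) {
        size_t n;

        assert(m);

        if (!m->api_bus_ready)
                return 0;

        n = m->n_queued;
        m->n_sent += n;
        m->n_queued = 0;
        return n;
}

/* Stand-in for the reply handler that re-adds subscribers. */
static void on_setup_reply(struct manager_sketch *m) {
        assert(m);
        m->api_bus_ready = true;
}
```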

tmpfiles: reject unused argument fields

Line types which do not use the argument field used to warn and ignore
a non-empty field. Treat that as invalid configuration instead, so typos
are not silently accepted.

Follow-up for: 614cc34f3a2a7c64a21c3f5256f2e2b2c1de1d51

cryptsetup: measure volume key and keyslot via the pcrextend Varlink service (#43109)

Motivated by
https://github.com/systemd/systemd/pull/43041#discussion_r3595022610.

Switch systemd-cryptsetup's volume-key and keyslot measurements from
driving the TPM directly (tpm2-util) to the io.systemd.PCRExtend Varlink
service, aligning it with how the verity and imds measurements already
work.

Some notes on decisions taken:
- The volume key is sent over the wire and systemd-pcrextend computes
  the HMAC. The socket is root-only. Otherwise we would need to do bank
  negotiation via Varlink and pollute the interface with it.
- `tpm2-measure-bank=` is deprecated and no longer honored, for the
  same reason.
- `tpm2-device=` now only affects unlocking. The device for
  measurements is selected by pcrextend.
- Measuring requires `systemd-pcrextend.socket` in the initrd. It
  should already be present, as systemd-veritysetup relies on it too.
- Logging is done on the pcrextend side.

test-bpf-restrict-fs: skip if manager startup fails due to lack of privileges (#43202)

test-bpf-restrict-fs.c creates a Manager with RUNTIME_SCOPE_SYSTEM, which
tries to set up the real system runtime directory hierarchy (e.g. create
/run/systemd/), and that requires privileges the test process may not
have (e.g. unprivileged sandboxed builders such as OBS).

Previously this was masked because bpf_restrict_fs_supported() did a
trial open/load/attach of the BPF program itself, which also requires
elevated privileges and so failed first, causing the test to skip
before ever reaching manager_new()/manager_startup(). Since
bpf_restrict_fs_supported() no longer does that trial load, the test
now reaches manager_new()/manager_startup() in these unprivileged
environments and hard-fails instead of skipping, e.g.:

    Assertion failed: Expected "manager_startup(m, NULL, NULL, NULL, NULL)"
    to succeed, but got error: -13/EACCES

Use the same manager_errno_skip_test() pattern already used by other
tests (test-engine.c, test-execute.c, test-path.c, ...) to skip
gracefully when manager_new() or manager_startup() fail due to missing
privileges, instead of asserting.
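
The skip-instead-of-assert pattern can be sketched as follows. This is a hypothetical illustration: the set of errno values that manager_errno_skip_test() actually accepts may be larger, and exit code 77 is assumed here because meson treats it as "skipped".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

enum { EXIT_TEST_SKIP_SKETCH = 77 };  /* assumption: meson's skip exit code */

/* Hypothetical predicate: does this error indicate missing privileges? */
static bool errno_means_skip(int r) {
        int e = abs(r);

        return e == EPERM || e == EACCES;
}

/* Maps the result of a setup call to the exit code a test would return. */
static int setup_result_to_exit_code(int r) {
        if (r >= 0)
                return EXIT_SUCCESS;

        return errno_means_skip(r) ? EXIT_TEST_SKIP_SKETCH : EXIT_FAILURE;
}
```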

Follow-up for c99678eedaad88defe440d6102952926a61e72cf.

update-utmp: shorten comm on boot/shutdown

audit_log_user_comm_message from libaudit 4.2 rejects comm arguments
that exceed the kernel's comm limit of 15 characters with EINVAL. The
hard-coded "systemd-update-utmp" exceeds this by 4 characters. Shorten
it to "update-utmp" instead.
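
The length arithmetic can be checked with a few lines of C. comm_fits() and COMM_MAX_CHARS are hypothetical names for illustration; the 15 character limit is the one stated above (the kernel's comm buffer holds 16 bytes including the terminating NUL).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define COMM_MAX_CHARS 15  /* 16 byte buffer minus the terminating NUL */

/* Returns true if 'comm' is short enough to be used as a comm value. */
static bool comm_fits(const char *comm) {
        assert(comm);

        return strlen(comm) <= COMM_MAX_CHARS;
}
```

"systemd-update-utmp" has 19 characters and is rejected, while "update-utmp" has 11 and fits.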

po: Translated using Weblate (Interlingua)

Currently translated at 4.1% (12 of 286 strings)

Co-authored-by: Emilio Sepulveda <emism.translations@gmail.com>
Translate-URL: https://translate.fedoraproject.org/projects/systemd/main/ia/
Translation: systemd/main

tools: add -n shortcut for --dry-run

Accept -n as a short option for --dry-run in bootctl,
systemd-oomd, systemd-sysusers, and systemd-tmpfiles.

For systemd-repart, make -n equivalent to --dry-run=yes,
while keeping --dry-run=BOOL available.
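
For illustration, this is how a short alias for a flag-style long option is typically wired up with plain getopt_long(). It is a generic sketch, not the option parsing code used by these tools; parse_dry_run() is a hypothetical name and the BOOL-taking systemd-repart variant is not covered.

```c
#include <assert.h>
#include <getopt.h>
#include <stdbool.h>
#include <stddef.h>

/* Parses only -n/--dry-run. Returns 0 on success, -1 on an unknown option. */
static int parse_dry_run(int argc, char *argv[], bool *ret_dry_run) {
        static const struct option options[] = {
                { "dry-run", no_argument, NULL, 'n' },  /* long form maps to 'n' */
                { NULL,      0,           NULL, 0   },
        };
        int c;

        assert(argv);
        assert(ret_dry_run);

        *ret_dry_run = false;
        optind = 1;  /* restart scanning so the function can be called again */
        opterr = 0;  /* no diagnostics on stderr from getopt itself */

        while ((c = getopt_long(argc, argv, "n", options, NULL)) >= 0)
                switch (c) {
                case 'n':
                        *ret_dry_run = true;
                        break;
                default:
                        return -1;
                }

        return 0;
}
```

Because the long option's val field is 'n', both spellings reach the same case label.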

Follow-up for: 2479f0bb095d9e9f9c56c8110efc87fe0b0f59c0

rules: do not install 60-tpm2-id.rules without TPM2 support

This change also allows importing `tpm2_id` as a built-in instead of as a
program.

test: cover DeviceAllow parsing

Add unit coverage for config_parse_device_allow().

Verify valid device paths and subsystem patterns, invalid
specifiers and rights, default permissions, and reset handling.

Follow-up for: 20d52ab60e7ba40f7cf23c148bcead8bd05bea3a

cryptsetup: measure via the pcrextend varlink service

Measure the volume key and unlock keyslot through io.systemd.PCRExtend
instead of driving the TPM directly via tpm2-util, matching how the
verity and imds measurements already work.

Bank selection and the TPM context now live entirely in
systemd-pcrextend. As a result the tpm2-measure-bank= crypttab option
can no longer be honored per volume and is now a deprecated no-op.

Signed-off-by: Paul Meyer <katexochen0@gmail.com>

network: document that Domains= may be specified more than once (#43194)

The `Domains=` option in the `[Network]` section did not document its
behaviour when specified repeatedly. In practice the option is additive
(each occurrence accumulates search/routing domains) and assigning an
empty string resets the list, matching the closely related `DNS=`
option. This is implemented by `config_parse_domains()` in
`src/network/networkd-dns.c`, which frees both the search and route
domain sets on an empty `rvalue` and otherwise inserts each
whitespace-separated entry into the corresponding set.
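
The additive-with-reset semantics can be sketched as below. This is a simplified illustration under stated assumptions: it keeps a single fixed-size list instead of the separate search and route sets, and struct domain_list and parse_domains() are hypothetical names rather than the `config_parse_domains()` code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DOMAINS_MAX 16
#define DOMAIN_LEN 64

struct domain_list {
        char items[DOMAINS_MAX][DOMAIN_LEN];
        size_t n;
};

/* Handles one assignment: an empty value clears the list, anything else
 * appends each whitespace-separated entry. Returns 0, or -1 if full. */
static int parse_domains(struct domain_list *l, const char *rvalue) {
        assert(l);
        assert(rvalue);

        if (rvalue[0] == '\0') {
                l->n = 0;
                return 0;
        }

        for (const char *p = rvalue;;) {
                size_t len;

                p += strspn(p, " \t");      /* skip separators */
                if (*p == '\0')
                        break;

                len = strcspn(p, " \t");    /* length of this entry */
                if (l->n >= DOMAINS_MAX || len >= DOMAIN_LEN)
                        return -1;

                memcpy(l->items[l->n], p, len);
                l->items[l->n][len] = '\0';
                l->n++;
                p += len;
        }

        return 0;
}
```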

Document this explicitly, using the same wording already used for
`DNS=` in the same man page, so users know repeated assignments are
combined and that an empty value clears them.

Fixes #38740.

core/bpf-restrict-fs: avoid loading the LSM BPF program twice at boot

bpf_restrict_fs_setup() is always called right after a successful
bpf_restrict_fs_supported(true) probe (see manager_setup() in
manager.c). Previously the probe independently opened, sized and
kernel-verifier-loaded the BPF object via prepare_restrict_fs_bpf(),
then did a trial LSM attach/detach via bpf_can_link_lsm_program() to
confirm the program *could* attach, and threw the whole object away
-- only for bpf_restrict_fs_setup() to build and verifier-load an
identical one from scratch again for the real, permanent attach.

The trial attach in the probe is redundant: if BPF_LSM_MAC attach
isn't actually usable (e.g. no BPF trampoline support on the running
architecture/kernel), bpf_restrict_fs_setup()'s own
sym_bpf_program__attach_lsm() call will simply fail, and that failure
is already logged and handled gracefully by its caller in manager.c
(logged as a warning, systemd continues without RestrictFileSystems=
enforcement). So bpf_restrict_fs_supported() only needs to check
whether the BPF LSM hook is enabled in the kernel at all
(lsm_supported("bpf")); it doesn't need to open/load/attach the BPF
program itself. Drop that from the probe, so the program is opened,
sized and verified by the kernel exactly once per boot instead of
twice.
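
A cheap probe of this kind boils down to a membership test on a list of LSM names. The sketch below assumes the kernel's comma-separated list of active LSMs (as exposed in /sys/kernel/security/lsm) has already been read into a string; whether lsm_supported() works exactly this way is an assumption, and lsm_list_contains() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if 'name' appears as a whole entry in the comma-separated
 * 'list'. A trailing newline, as found in sysfs-style files, is tolerated. */
static bool lsm_list_contains(const char *list, const char *name) {
        size_t n;

        assert(list);
        assert(name);

        n = strlen(name);

        for (const char *p = list; *p != '\0';) {
                size_t len = strcspn(p, ",\n");

                if (len == n && memcmp(p, name, n) == 0)
                        return true;

                p += len;
                if (*p != '\0')
                        p++;  /* skip the separator */
        }

        return false;
}
```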

Since the probe no longer verifies that the LSM BPF program can
actually attach, test-bpf-restrict-fs.c can no longer rely on
bpf_restrict_fs_supported(true) alone to skip on kernels/architectures
where the hook is listed but the real attach fails (e.g. missing BPF
trampoline support). Have the test also check m->restrict_fs after
manager_startup() and skip if the program never got attached, instead
of proceeding to hard-fail the enforcement assertions.

sysupdate: Change feature/component enablement and disablement (#43191)

- sysupdate: In the auto-enable service, don't enable all features

  The auto-enable service should activate suggested components and
  features, but it enabled all features, which includes the default
  components' unsuggested features and any unsuggested features of the
  suggested components. This is unexpected behavior, and we would rather
  have this service limited to suggested features.

  Switch the service flag to suggested and make the wording more explicit
  in the man page. Also fix the wrong statement that it operates on
  enabled components: it operates on all components, including explicitly
  disabled ones.

- sysupdate: Change disabling with
  --component-suggested/--feature-suggested

  Disabling features or components with the
  --component-suggested/--feature-suggested flag didn't disable the
  suggested ones but instead disabled all other ones. This is rather
  unintuitive given how the flags are named, and also not really needed,
  because the intended reconciliation outcome can instead be reached by
  first disabling everything and then enabling the suggested ones again,
  which is easier to reason about. For components the tricky part is
  that they default to enabled, and thus it's better to have the
  disable/enable commands with --component-suggested operate only on
  suggested ones instead of touching others like "legacy" components
  that don't explicitly say whether they are enabled and suggested or
  not.

  Make running disablement of components/features with
  --component-suggested/--feature-suggested undo a previous enablement
  with the same flags. Document how one can align the system to only use
  suggested components/features and not anything else by doing it in two
  steps, first disabling everything and then enabling suggested ones.
  This also makes it clearer now that all components that are neither
  explicitly enabled nor suggested will be disabled then.

meson: add build option for /var/log mode

In Ubuntu, rsyslog is (a) part of the minimal image and (b) does not run
as root. To facilitate this, the rsyslog package configures /var/log
to be writeable by the syslog group.

There are currently conflicts in the Ubuntu packaging due to the way
tmpfiles are handled in package scripts: when systemd-tmpfiles is
invoked with both configurations (or for all configurations), things
work fine. But if invoked with only var.conf, which is the default case
for package upgrades, rsyslog is broken.

One suggestion to approach this was to make rsyslog's tmpfile
configuration use ACLs instead of trying to change the owning group and
directory mode. This can work, but since the ACL mask is stored in the
group permission bits, the effective mask for rsyslog becomes r-x again
when var.conf is invoked, because it sees that 0775 != 0755, and chmods
the directory.

Hence, for /var/log write permissions to be extendable with ACLs,
the mode must be at least 0775. Rather than change the default,
add a build option to configure the /var/log mode.
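
The mode arithmetic behind this can be shown in a few lines. The helper names are hypothetical; the sketch only relies on the fact stated above that, once an ACL is present, the group permission bits of the mode hold the ACL mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Extracts the group permission bits (the ACL mask when ACLs are in use)
 * as a value from 0 to 7. */
static unsigned acl_mask_from_mode(mode_t mode) {
        return (unsigned) ((mode & S_IRWXG) >> 3);
}

/* Returns true if ACL entries could grant write access under this mode. */
static bool mask_allows_write(mode_t mode) {
        return (mode & S_IWGRP) != 0;
}
```

With 0755 the mask is 5 (r-x), so a write ACL entry is ineffective; with 0775 it is 7 (rwx).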

locale: update Context only on success

Do not clear Context.vc, Context.x11_from_vc, and Context.x11_from_xorg
on errors such as ENOMEM.

This does not change behavior on success.

repart: drop legacy FactoryReset EFI variable

Follow-up for 9e050b0458930b96dc9abebd822ab0b8fe2b14aa.

Fix typo in 'website' in README.md

sysupdate: Change disabling with --component-suggested/--feature-suggested

The disabling of features or components with the flag
--component-suggested/--feature-suggested didn't disable the suggested
ones but instead disabled all other ones. This is rather unintuitive due
to how the flags are named and also not really needed because the
intended reconciliation outcome can instead be done by first disabling
everything and then enabling the suggested ones again which is easier to
reason about. For components the tricky part is that they default to
enabled and thus it's better to have the disable/enable commands with
--component-suggested operate only on suggested ones instead of touching
others like "legacy" components that don't explicitly say whether they
are enabled and suggested or not.

Make running disablement of components/features with
--component-suggested/--feature-suggested undo a previous enablement
with the same flags. Document how one can align the system to only use
suggested components/features and not anything else by doing it in two
steps, first disabling everything and then enabling suggested ones. This
also makes it clearer now that all components that are not explicitly
enabled nor suggested will be disabled then.

pcrextend: extract varlink call boilerplate into shared helpers

pcrextend_verity_now() and pcrextend_imds_userdata_now() carried
near-identical copies of the connect + io.systemd.PCRExtend.Extend call.
Factor that into pcrextend_pcr_now() and pcrextend_nvpcr_now() and
reimplement both on top of them. No functional change.

Signed-off-by: Paul Meyer <katexochen0@gmail.com>

pcrextend: add secret parameter to varlink interface

Add an optional 'secret' input to io.systemd.PCRExtend.Extend. When set,
the HMAC of the measured data keyed by the secret is extended instead of
a plain hash, matching the existing tpm2_{pcr,nvpcr}_extend_bytes()
secret parameter. This lets callers measure a secret (e.g. a volume key)
without leaking a hash of it.

Signed-off-by: Paul Meyer <katexochen0@gmail.com>

pcrextend: pass iovecs through the extend helpers

extend_pcr_now(), extend_nvpcr_now() and escape_and_truncate_data() took
a (void *data, size_t) pair, even though both the dispatch layer and
tpm2_{pcr,nvpcr}_extend_bytes() already speak struct iovec. Drop the
pointless deconstruct/reassemble and pass the iovec through directly.

Signed-off-by: Paul Meyer <katexochen0@gmail.com>

tpm2: Improve how NvPCR protection works.

NV indexes created in the storage hierarchy can be undefined and
redefined with TPM owner auth. Because of this, NvPCRs need some way
to prevent them from being redefined in a way that allows spoofed
measurements to be replayed.

The current approach requires knowledge of a secret ("anchor secret")
in order to derive the initial NvPCR measurement and to derive a
measurement to an existing PCR (9). The credential is protected by the
TPM with a PCR policy. Without access to the credential, it's not
possible to replay measurements to a newly defined NvPCR without
breaking the binding with the measurement in PCR 9. However, this
approach has a couple of issues:

- The credential is currently only protected by PCR11. As it's not
  protected by the rest of the boot chain, it's possible to boot other
  operating systems in order to replay the PCR11 measurements and
  recover the secret. Note that as the NvPCR anchoring happens in early
  boot, the credential is stored in the ESP.
- Someone with privileged access to a system can just create a new
  credential containing a known secret and store this in /var/lib and
  the ESP. The NvPCRs are anchored with this known secret on subsequent
  boots, and therefore the measurements can no longer be trusted.
  Imagine the scenario where privileged access is theoretically possible
  as a result of some vulnerability. After upgrading the system to fix
  this vulnerability, the system should be able to attest that it is
  now in a good state. However, if an adversary were able to use their
  privileges to replace the credential, they are able to obtain
  persistence and the NvPCR measurements are no longer trustworthy.

This PR changes things to take a different approach. Instead of
requiring knowledge of a secret, the NvPCRs are now created in a way
that requires a policy to be satisfied for writing. The write policy has
2 branches:
- TPM2_PolicyNvWritten(true), which can be satisfied without any
  further authorization if the NvPCR has already been extended.
- TPM2_PolicyAuthorize(pcrPubKey, SHA256("nvpcr-init")) which can be
  satisfied with a signed PCR policy, and must be used to perform the
  initial extend to a NvPCR.

The intention here is that the signed PCR policy that can be used to
authorize the initial extend to the NvPCR can only be satisfied during
early boot. During later boot phases, this signed PCR policy must not be
valid. This means that if a NvPCR is undefined and redefined, it won't
be possible to satisfy its write policy in order to be able to perform the
initial extend.

In order to anchor the NvPCRs and prevent them from being undefined and
then redefined with a different policy that does allow them to be
extended, the names of the NvPCRs are measured to PCR9. Verifiers must
check that the names of attested NvPCRs match the measurements in PCR9.

This uses the PCR signing key from the currently booted UKI to create
the NvPCRs. If this changes between boots, then tpm2-setup automatically
recreates new NvPCRs with an updated write policy to reflect this. I've
tried to be careful to not undefine arbitrary NV indexes in this case,
so it checks that the existing NV index looks like a NvPCR (ie, it has
the expected attributes) before undefining it.

I did originally try to preserve the old behaviour for existing systems,
but it makes things a lot more complicated. As the new implementation
already creates new NvPCRs when the PCR signing key changes, I ended up
just automatically upgrading the old NvPCRs as well. Again, I check here
that any existing NV index looks like an old style NvPCR (ie, it has the
expected attributes) before undefining it.

I did notice that the initial NvPCR measurement isn't going into the
log. I don't know if that was an intentional choice, but I've preserved
that behaviour in this PR.

This also adds a new option to ukify (--sign-initrd-pcrs) which creates
signed policies (one per PCR bank) that can only be satisfied from the
initrd. These policies are used for initializing the NvPCRs, but can also
be used for protecting TPM2 keyslots enrolled with systemd-cryptenroll
(by using the --tpm2-public-key-policyref=initrd option).

There is one outstanding issue. The NvPCR definitions support different
algorithms, but the use of PolicyAuthorize means that they can only support
SHA-256 for now. This is because the signed policy algorithm must match
the name algorithm, and some additional work is required to support
signed PCR policies for algorithms other than SHA256. I've left a note in
tpm2_nvpcr_initialize that details what's required, and I'll take a look
at that in a subsequent PR.

sysupdate: In the auto-enable service, don't enable all features

The auto-enable service should activate suggested components and
features, but it enabled all features, which includes the default
components' unsuggested features and any unsuggested features of the
suggested components. This is unexpected behavior, and we would rather
have this service limited to suggested features.
Switch the service flag to suggested and make the wording more explicit
in the man page. Also fix the wrong statement that it operates on
enabled components: it operates on all components, including explicitly
disabled ones.

localed: normalize empty X11 option values to NULL after parsing

x11_read_data() parses an 'Option "XkbVariant" ""' line in
00-keyboard.conf with strv_split_full(..., EXTRACT_UNQUOTE), which turns
the empty quoted value into a non-NULL empty string rather than NULL.
Since 812aa57d2c ("string-util: beef up string_is_safe()") an empty
string is rejected by string_is_safe() unless STRING_ALLOW_EMPTY is
passed, so x11_context_is_safe() now refuses such a context and
x11_context_verify() discards the whole thing. As a result "localectl
status" reports "X11 Layout: (unset)" even though the file names a valid
layout, and compositors reading org.freedesktop.locale1 (e.g. the SDDM
greeter) fall back to the us layout.

Introduce x11_context_normalize(), suggested by @lionheartyu, which
converts empty strings to NULL while freeing the heap allocation —
unlike x11_context_empty_to_null() which only NULLs the pointer without
freeing. Call it at the end of x11_read_data(), before
x11_context_verify(), so empty option values are treated as unset. This
also keeps x11_context_equal() comparisons consistent with the setter
path (method_set_x11_keyboard) and vconsole_read_data(), which both
store NULL for empty values.
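
The difference between NULLing a pointer and normalizing it with a free can be sketched like this. The struct and function names are hypothetical stand-ins; the field set is an assumption based on the usual X11 keyboard settings and is not taken from the commit.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the X11 keyboard context. */
struct x11_context_sketch {
        char *layout;
        char *model;
        char *variant;
        char *options;
};

/* Frees '*s' and sets it to NULL if it points to an empty string. */
static void free_if_empty(char **s) {
        assert(s);

        if (*s && (*s)[0] == '\0') {
                free(*s);
                *s = NULL;
        }
}

/* Treats every empty field as unset, without leaking the allocation. */
static void x11_context_normalize_sketch(struct x11_context_sketch *xc) {
        assert(xc);

        free_if_empty(&xc->layout);
        free_if_empty(&xc->model);
        free_if_empty(&xc->variant);
        free_if_empty(&xc->options);
}
```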

Fixes #43007

hwdb: Fix Brazilian ABNT2 KEY_RO scancode for HP ProBook x360 435 G7 (#43151)

The physical key between right Shift and right Ctrl (Brazilian ABNT2 ["/
? deg"]) emits scancode 0x4e (KEY_KPPLUS) at boot.

Remap the observed 0x4e keycode to KEY_RO so it produces the expected
characters with the Brazilian ABNT2 XKB layout.

hwdb: Add accelerometer matrix for OneXPlayer Super X

The BMI260 accelerometer in the OneXPlayer Super X is exposed through
the ACPI BMI0160 ID. Its X and Y axes do not match the built-in display
axes, and no firmware mount matrix is provided.

Add an exact vendor and product DMI match with the matrix verified on
the hardware. The matrix keeps the native landscape position normal and
maps both portrait rotations to the corresponding display orientation.

Tested with iio-sensor-proxy 3.8 and Mutter 49.7 in all display
orientations. The compiled hwdb entry also matches the complete modalias
reported by the device.

Development of this patch used assistance from ChatGPT 5.6 sol.

po: Translated using Weblate (Interlingua)

Currently translated at 0.6% (2 of 286 strings)

Co-authored-by: Emilio Sepulveda <emism.translations@gmail.com>
Translate-URL: https://translate.fedoraproject.org/projects/systemd/main/ia/
Translation: systemd/main

creds: query OpenSSL for GCM tag length

Follow-up for: 21bc0b6fa1de44b520353b935bf14160f9f70591

Follow-up for: 99d0a9fdb08d0291dcd06a279bf6e2597f651244

udev: probe_superblocks: return a real negative errno on failure

Previously, probe_superblocks() forwarded blkid_do_fullprobe()'s and
blkid_do_safeprobe()'s raw return code on error, which is just -1 with
no errno attached. The caller passes this value straight to
log_device_debug_errno() with %m, so a generic probing failure always
printed strerror(-1) regardless of what actually went wrong.

Convert the -1 error case to a proper negative errno via
errno_or_else(), matching the pattern used elsewhere in this file. The
'nothing found' (1) and success (0) return values are unchanged.
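
A rough sketch of the conversion, for illustration only.
errno_or_fallback() is a stand-in modelled on the errno_or_else() idea,
probe_result_to_errno() is a hypothetical wrapper, and the EIO fallback
is an assumption, not necessarily what the patch uses.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Prefer errno when it is set, otherwise use the caller's fallback.
 * The result is always negative. */
int errno_or_fallback(int fallback) {
        assert(fallback != 0);  /* zero would read as success */

        if (errno > 0)
                return -errno;
        return -abs(fallback);
}

/* Hypothetical wrapper around a libblkid-style result:
 * negative = failure, 0 = success, 1 = nothing found. */
int probe_result_to_errno(int rc) {
        if (rc < 0)
                return errno_or_fallback(EIO);  /* assumed fallback */
        return rc;                              /* 0 and 1 pass through */
}
```

With this shape the caller always gets a value that is meaningful when
printed with %m, while the two non-error results keep their meaning.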

mount-util: don't trigger automounts when cloning submounts

get_sub_mounts() clones each submount of the given prefix with
OPEN_TREE_CLONE. The kernel resolves the path of an OPEN_TREE_CLONE
with LOOKUP_AUTOMOUNT, i.e. if the submount is an autofs automount
point that has not been triggered yet, cloning it forces the automount
to trigger, and open_tree() blocks until the automount request has
been served.

This is particularly problematic during boot: setting up a private
/proc for the first sandboxed service (e.g. systemd-userdbd.service,
which uses ProtectProc=invisible) clones the submounts of /proc, which
include PID 1's own /proc/sys/fs/binfmt_misc automount point. The
executor then blocks until PID 1 gets around to dispatching the
resulting proc-sys-fs-binfmt_misc.mount job, which competes with the
ongoing boot transaction. On a Fedora 44 VM this delayed
systemd-userdbd.service by ~0.9s, and with it every early-boot NSS
user/group lookup that ends up in nss-systemd's varlink queries — most
importantly systemd-tmpfiles-setup-dev-early.service, which
systemd-udevd.service is ordered after, stalling the whole boot
critical path:

  [2.131846] proc-sys-fs-binfmt_misc.automount: Got automount request
             for /proc/sys/fs/binfmt_misc, triggered by 323 ((systemd-userd))
  [2.943574] Mounting proc-sys-fs-binfmt_misc.mount...
  [2.968507] Mounted proc-sys-fs-binfmt_misc.mount.

Triggering the automount here also defeats its purpose, since
binfmt_misc ends up mounted on every boot even if nothing ever
accesses it.

Pass AT_NO_AUTOMOUNT so that untriggered automount points are cloned
as they are instead.
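
For illustration, a clone call with the extra flag could look like the
sketch below. The helper names are hypothetical, and the flag set other
than OPEN_TREE_CLONE and AT_NO_AUTOMOUNT is an assumption; the real
get_sub_mounts() may pass different flags.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older libc headers; values match the Linux UAPI. */
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE 1
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif
#ifndef AT_NO_AUTOMOUNT
#define AT_NO_AUTOMOUNT 0x800
#endif

/* Flags for cloning one submount without triggering autofs. */
unsigned submount_clone_flags(void) {
        return OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC | AT_NO_AUTOMOUNT;
}

/* Returns a detached mount fd, or -1 with errno set. Cloning needs
 * CAP_SYS_ADMIN in the owning user namespace. */
int clone_submount(const char *path) {
        assert(path);
#ifdef SYS_open_tree
        return (int) syscall(SYS_open_tree, AT_FDCWD, path,
                             submount_clone_flags());
#else
        errno = ENOSYS;
        return -1;
#endif
}
```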

Before (Fedora 44 VM, 4 vCPUs):
  Startup finished in ... + 2.559s (userspace)
    1.058s systemd-tmpfiles-setup-dev-early.service
     938ms systemd-userdbd.service

After:
  Startup finished in ... + 1.582s (userspace)
     137ms systemd-tmpfiles-setup-dev-early.service
      22ms systemd-userdbd.service

test: add deb coverage and a few more sanity checks to TEST-88-UPGRADE (#43162)

ci/mkosi: bump fedora release version to 44 (#43160)

test: udev might not be running in container, skip check in TEST-88-UPGRADE

test: add a few more quick sanity checks to TEST-88-UPGRADE

test: add deb coverage to TEST-88-UPGRADE

ci/mkosi: bump fedora release version to 44

mkosi/sanitizers: also wrap mkfs.erofs

Fixes the following failure on Fedora 44:
```
TEST-58-REPART.sh[932]: Executing mkfs command: /usr/bin/mkfs.erofs -U 45745a56-aa2f-4619-8ca7-9cb63667c2ae -zlz4hc,level=3 /dev/loop0 /var/tmp/.#reparteb739e8b8cae7b70
TEST-58-REPART.sh[932]: Successfully forked off '(mkfs)' as PID 933.
TEST-58-REPART.sh[933]: ==933==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.
TEST-58-REPART.sh[932]: '(mkfs)' failed with exit status 1.
```

mkosi: drop references to EOL fedora releases

github: update placeholders in template

sysupdate: fix root cleanup and default feature selection (#43154)

iovec-util: enclose macro arg in ()

Follow-up for 56f3ae9292742fff1cac6dbe4883a1715e56e874
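
As a generic reminder of why macro arguments get enclosed in
parentheses, consider the sketch below. The macros are invented for
illustration and are not the ones touched in iovec-util.

```c
#include <assert.h>

/* Without parentheses the argument is pasted in as-is, so operator
 * precedence of the surrounding expression applies to its pieces. */
#define TWICE_UNSAFE(x) (x * 2)
#define TWICE_SAFE(x)   ((x) * 2)

static_assert(TWICE_UNSAFE(1 + 2) == 5, "expands to (1 + 2 * 2)");
static_assert(TWICE_SAFE(1 + 2) == 6, "expands to ((1 + 2) * 2)");

int twice_unsafe_of_sum(int a, int b) {
        return TWICE_UNSAFE(a + b);  /* a + b * 2 */
}

int twice_safe_of_sum(int a, int b) {
        return TWICE_SAFE(a + b);    /* (a + b) * 2 */
}
```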

core/namespace: split out bind mount retry helper

Move the bind mount retry path into a small helper so
apply_one_mount() no longer carries destination creation and
retry state inline.

Report destination creation and retry failures at debug level,
then return the error to apply_mounts(). The caller still reports
the final mount namespace failure with the cleaned-up mount path.

shared: drop stale env-file-label source

env-file-label.[ch] was removed by 3e5320e27d3e
("env-file: port write_env_file() to label_ops_pre()"), which
replaced write_env_file_label() with WRITE_ENV_FILE_LABEL.

0dc39dffbd45 ("Use paths specified from environment variables for
/etc configuration files") later reintroduced only
src/shared/env-file-label.c. The header and meson entry were not
restored, no callers use write_env_file_label() or
write_vconsole_conf_label(), and the current write_env_file()
signature no longer matches the stale wrapper.

Remove the unbuilt source file again.

Removed by: 3e5320e27d3e5f1bbbb7eb1c98dcec970d558017
Reintroduced by: 0dc39dffbd4525e79d7a1537b5c95780ba4f9727
Follow-up for: 0dc39dffbd4525e79d7a1537b5c95780ba4f9727

sysupdate: include default component for feature-all

--component-all is documented to include the default component-less
installation. Do not drop it merely because the context operates on a
root/image, or because all its transfers are currently disabled by
features.

This lets --component-all --feature-all enable-feature write the
default component feature drop-ins instead of succeeding with no
components selected.

TEST-72-SYSUPDATE covers both all transfers disabled by features and
the same default component feature operation under --root=.

Repro: create a default feata.feature plus a transfer gated by feata,
then run:
build/systemd-sysupdate --root="$root" --component-all --feature-all enable-feature

Before: no drop-in was written.

Follow-up for: 4481661a75acc01b5d66aa36443a7b80b557e4ba

sysupdate: keep root-relative installdb paths absolute

When recording installdb entries under --root=, keep the leading slash
after stripping the root. Compare current transfer target paths in the
same root-relative form during cleanup.

This prevents cleanup from treating still-owned resources below --root=
as orphaned.

TEST-72-SYSUPDATE covers --root= cleanup keeping a still-owned file and
its matching installdb entry.
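
The root stripping described above might be sketched like this. The
function name path_below_root() is hypothetical and the code does not
normalize components such as "/./"; it only shows the leading slash
being kept.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the part of 'path' below 'root', including its leading
 * slash, or NULL if 'path' is not located below 'root'. The result
 * points into 'path' (or to a static "/" for the root itself). */
const char *path_below_root(const char *path, const char *root) {
        size_t n;

        assert(path);
        assert(root);

        n = strlen(root);
        while (n > 0 && root[n - 1] == '/')  /* ignore trailing slashes */
                n--;

        if (strncmp(path, root, n) != 0)
                return NULL;
        if (path[n] == '\0')
                return "/";          /* the root directory itself */
        if (path[n] != '/')
                return NULL;         /* "/rootfs2" is not below "/rootfs" */

        return path + n;             /* keep the leading slash */
}
```

Recording and comparing paths in this one form is what lets cleanup
recognize a still-owned resource.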

Repro: create a rooted transfer for /target/foo-@v.bin, add a
matching installdb entry for /target/./foo-@v.bin, then run:
build/systemd-sysupdate --root="$root" --verify=no cleanup

Before: foo-1.bin and the installdb entry were removed.

Follow-up for: d82e256bb9d151b185a8afec1fcacd8fbe80555c

test: add coverage for systemctl preset in test-systemctl-enable.sh

Repeats the enable/disable specifier-expansion check with 'systemctl
preset' instead. preset-all is intentionally not exercised here, since
$root accumulates unit files from earlier sections that are
deliberately invalid, and preset-all would trip on those unrelated
units.

sysupdate: don't double-prefix definitions with --root=

Definitions enumerated under --root= are already rooted. Passing those
paths to the config parsers with the same root prefixes the root again,
so feature and transfer files are parsed from the wrong path.

Repro: create root/etc/sysupdate.d/rootfeat.feature and
01-root.transfer, then run:
build/systemd-sysupdate --root="$root" --verify=no --offline features rootfeat

Before: parsing failed at line 1 with a bogus Source Type= error.

Fixes #42783.
Follow-up for: e1384cfb096ad3561e3d20f193c37f59d5379768

veritysetup: keep parsing after ignored NvPCR options

tpm2-measure-nvpcr=no and invalid NvPCR names only affect the current
comma-separated option. They returned from parse_options(), so later
options were silently skipped.

Repro:
build/systemd-veritysetup attach testvol /dev/null /no/such \
0000000000000000000000000000000000000000000000000000000000000000 \
tpm2-measure-nvpcr=no,root-hash-signature=relative

Before: root-hash-signature=relative was skipped, and execution continued
to the missing block-device error.

After: root-hash-signature=relative is parsed and rejected. Invalid NvPCR
names take the same continue path.
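
The difference between returning and continuing can be shown with a
simplified option loop. This is a sketch, not parse_options() from
veritysetup; both function names are hypothetical and only one ignored
option is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical check for an option that should only be skipped. */
bool option_is_ignored(const char *opt, size_t len) {
        static const char ignored[] = "tpm2-measure-nvpcr=no";

        return len == strlen(ignored) && strncmp(opt, ignored, len) == 0;
}

/* Walks a comma-separated list and counts the options that would be
 * handed on to further parsing. */
size_t count_handled_options(const char *options) {
        size_t handled = 0;

        assert(options);

        for (const char *p = options; *p; ) {
                const char *opt = p;
                size_t len = strcspn(p, ",");

                p += len;
                if (*p == ',')
                        p++;

                if (len == 0 || option_is_ignored(opt, len))
                        continue;  /* a 'return' here would drop every later option */

                handled++;
        }

        return handled;
}
```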

Follow-up for: 85d7fb22470e5a4b17045d9288d4e2c06ff7b2e8

escape: add --stdin input mode

systemd-escape currently only processes strings passed as
command line arguments. This is awkward for callers that already
have a generated list of strings, because they need to loop around
the tool or use xargs and carefully preserve whitespace and other
special characters.

Add --stdin to read one string per line from standard input and
write one escaped result per output line. Keep command line strings
mutually exclusive with --stdin so the input source remains
unambiguous.

Use an explicit option instead of treating '-' specially, since '-'
is itself a valid string to escape. The existing escape, unescape,
mangle, path, suffix, and template rules are reused unchanged.
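
A line-oriented mode of this kind could be structured as in the sketch
below. It is not the systemd-escape code: escape_string() only
approximates the default escaping rules from systemd.unit(5) ('/'
becomes '-', other characters outside [A-Za-z0-9:_.] and a leading '.'
become \xXX), and both function names are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for getline() */
#endif
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

int is_plain(unsigned char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
               (c >= '0' && c <= '9') || c == ':' || c == '_' || c == '.';
}

/* Returns a newly allocated escaped copy of 's', or NULL on OOM. */
char *escape_string(const char *s) {
        static const char hex[] = "0123456789abcdef";
        char *out, *t;

        assert(s);

        out = malloc(strlen(s) * 4 + 1);  /* worst case: every byte -> \xXX */
        if (!out)
                return NULL;

        t = out;
        for (const char *p = s; *p; p++) {
                unsigned char c = (unsigned char) *p;

                if (c == '/')
                        *t++ = '-';
                else if (is_plain(c) && !(c == '.' && p == s))
                        *t++ = (char) c;
                else {
                        *t++ = '\\';
                        *t++ = 'x';
                        *t++ = hex[c >> 4];
                        *t++ = hex[c & 15];
                }
        }
        *t = '\0';
        return out;
}

/* Reads one string per line from 'in', writes one result per line. */
int escape_stream(FILE *in, FILE *out) {
        char *line = NULL;
        size_t cap = 0;
        ssize_t n;

        while ((n = getline(&line, &cap, in)) >= 0) {
                char *e;

                if (n > 0 && line[n - 1] == '\n')
                        line[n - 1] = '\0';  /* drop the line terminator */

                e = escape_string(line);
                if (!e) {
                        free(line);
                        return -1;
                }
                fprintf(out, "%s\n", e);
                free(e);
        }

        free(line);
        return 0;
}
```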

shared/dropin: don't re-derive drop-in name candidates per lookup dir

unit_file_find_dirs() is called once for every (unit name or alias,
lookup directory, drop-in suffix) combination while enumerating units
at boot, to check whether that unit has a ".d", ".wants", ".requires"
or ".upholds" drop-in directory in that particular lookup path. On a
typical system with ~270 loaded units and ~12 directories in the unit
search path, this adds up to tens of thousands of calls.

For every one of those calls, the function used to independently
re-derive the full chain of candidate unit names to check for that one
directory: the name itself, its template if it is a template instance,
and its "-" prefix chain (e.g. for "foo-bar-waldo.service" also
"foo-bar-.service" and "foo-.service"), recursively expanding further
where applicable. That derivation only depends on the unit name itself
and does not involve the lookup directory at all, so it produces the
exact same list of candidate names regardless of which of the 12
lookup directories is currently being checked. Despite this, it was
being fully recomputed for every single directory, doing several small
allocations and unit-name parsing calls (unit_name_template(),
unit_name_to_prefix(), unit_name_build_from_type(), ...) each time.
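
To make the "-" prefix chain concrete, here is a small sketch that
derives it from a plain unit name. It ignores templates, instances and
repeated dashes, uses fixed-size buffers, and the function name
expand_dash_prefixes() is invented for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_CANDIDATES 8
#define NAME_MAX_LEN 256

/* Stores 'name' followed by its "-" prefix candidates (longest first)
 * in 'out' and returns how many entries were written. */
size_t expand_dash_prefixes(const char *name, char out[][NAME_MAX_LEN], size_t max) {
        const char *dot;
        size_t n = 0, prefix_len;

        assert(name);
        assert(out);

        dot = strrchr(name, '.');  /* start of the unit type suffix */
        if (!dot || strlen(name) >= NAME_MAX_LEN || max == 0)
                return 0;

        snprintf(out[n++], NAME_MAX_LEN, "%s", name);

        prefix_len = (size_t) (dot - name);
        while (n < max) {
                size_t i = prefix_len;

                if (i > 0 && name[i - 1] == '-')
                        i--;       /* step over the dash ending the last candidate */
                while (i > 0 && name[i - 1] != '-')
                        i--;       /* move back to the previous dash */
                if (i == 0)
                        break;     /* no further dash, chain is complete */

                prefix_len = i;    /* prefix including its trailing dash */
                snprintf(out[n++], NAME_MAX_LEN, "%.*s%s", (int) prefix_len, name, dot);
        }

        return n;
}
```

Nothing in this derivation looks at a lookup directory, which is why
the result can be computed once and reused.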

Split the name-derivation logic out into its own function,
unit_file_expand_dropin_names(), and compute it once per unit
name/alias, then reuse the resulting candidate list across all lookup
directories instead of re-deriving it for each of them. The order in
which candidate directories end up being added is unchanged, so this
is not expected to alter drop-in resolution behaviour. I confirmed this
by comparing the sorted unit load state, fragment path and drop-in path
output of "systemd --test --system" before and after this change on the
same unit tree: the two outputs are byte-for-byte identical.

I measured the effect by instrumenting manager_enumerate() with
CLOCK_MONOTONIC timestamps and running systemd, built from this exact
tree, as actual PID 1 in a container with ~270 real units loaded, 50
runs each before and after this change:

before: mean 45.35ms (stddev 0.60ms)
after: mean 38.81ms (stddev 0.81ms)

a ~14% reduction with about 8 standard deviations of separation between
the two distributions, i.e. well outside of run-to-run noise.

unit_file_expand_dropin_names()'s out parameter is renamed from
ret_names to names, since it is appended to (including recursively)
rather than only being populated on success; the ret_ prefix is
reserved by convention for output-only parameters. Also, a failure
partway through expanding a name's candidates (e.g. OOM) no longer
discards the candidates already derived before the failure, keeping
unit_file_find_dirs() closer to the original recursive
implementation's error handling.

unit_file_add_dir_if_exists(), which builds the path to check for each
(lookup directory, candidate name) pair, is now the hottest remaining
part of this code: with the per-directory re-derivation gone, it is
called once for every directory/candidate combination instead of once
per candidate. It used to build that path with strjoin(name, suffix)
followed by path_join(unit_path, name_and_suffix), i.e. two heap
allocations plus path_join()'s normalization pass. Lookup paths are
already normalized (path_simplify() + strv_uniq()), so a single
strjoin(unit_path, "/", name, suffix) produces the same string while
halving the allocations and skipping the redundant normalization.
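
The single-allocation variant can be illustrated in plain C as below.
This is a sketch using malloc() and memcpy() rather than strjoin(), and
join_dropin_path() is a hypothetical name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Builds "<unit_path>/<name><suffix>" with one allocation. Assumes
 * 'unit_path' is already normalized, so no cleanup pass is needed. */
char *join_dropin_path(const char *unit_path, const char *name, const char *suffix) {
        size_t a, b, c;
        char *p;

        assert(unit_path);
        assert(name);
        assert(suffix);

        a = strlen(unit_path);
        b = strlen(name);
        c = strlen(suffix);

        p = malloc(a + 1 + b + c + 1);
        if (!p)
                return NULL;

        memcpy(p, unit_path, a);
        p[a] = '/';
        memcpy(p + a + 1, name, b);
        memcpy(p + a + 1 + b, suffix, c + 1);  /* includes the trailing NUL */

        return p;
}
```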

ask-password: refuse agent requests with unsafe characters in prompt fields

The message, icon and id fields are written verbatim into single-line
assignments of the [Ask] section of the agent request file, so a newline in
them lets the caller append arbitrary further assignments. Agents let a later
assignment override an earlier one, so an injected Socket= line redirects the
password to a path of the injector's choosing. Validate the fields and refuse
the request instead.
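
A validation of this kind might look like the following sketch. Which
characters the patch treats as unsafe is not spelled out above, so the
sketch assumes ASCII control characters (including newline and carriage
return); prompt_field_is_safe() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns false if 's' contains a character that could end the
 * single-line assignment it is written into. Assumption: all ASCII
 * control characters count as unsafe. */
bool prompt_field_is_safe(const char *s) {
        assert(s);

        for (const unsigned char *p = (const unsigned char *) s; *p; p++)
                if (*p < 0x20 || *p == 0x7f)
                        return false;

        return true;
}
```

A message such as "x\nSocket=/tmp/evil" fails this check, so the
request would be refused before anything is written to the file.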