git.ipfire.org Git - thirdparty/systemd.git/log

update TODO

tpm2-util: keep the measurement log's torn-write marker intact

The userspace measurement log carries a sticky-bit marker while a writer
is between updating a measurement register and appending the matching
record, so that a writer dying in between leaves the log detectably
incomplete.

However, the next successful writer used to clear the marker again after
appending its own record, erasing the evidence that an earlier writer
had died and the log is still missing a record. Keep the marker set in
that case instead; the new record is appended and synced regardless. The
marker likewise stays set if the measurement itself fails, so that any
non-clean completion remains flagged: a spurious flag on a failed but
harmless measurement is preferable to erasing evidence of a real gap,
and PCR replay stays authoritative either way.
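
The marker policy described above can be modelled as a single predicate. This is a minimal sketch, not the tpm2-util code: the function name and its three inputs are hypothetical, and it assumes a failed append is handled like a failed measurement.

```c
#include <stdbool.h>

/* Hypothetical model: should the torn-write marker remain set once a
 * writer has finished? It may only be cleared after a fully clean run
 * that did not inherit a marker from an earlier, interrupted writer. */
bool marker_still_set(bool was_set_before, bool measurement_ok, bool record_appended) {
        return was_set_before || !measurement_ok || !record_appended;
}
```

Only the combination "no inherited marker, measurement succeeded, record appended" yields a cleared marker; every other outcome stays flagged.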

Finally, warn when systemd-pcrlock loads a marked log, so the resulting
PCR validation failures come with a hint at their cause.

Signed-off-by: Paul Meyer <katexochen0@gmail.com>

journalctl: reject field listing with filters

Reproducer:
journalctl -F _SYSTEMD_UNIT -u test.service --no-pager
journalctl -u test.service --no-pager -n 1

Before, the field listing ignored the unit filter and listed unrelated
units, while the normal journal query applied the filter.

sd_journal_query_unique() cannot represent journalctl filters such as
unit, boot, time, cursor, or grep filters. Walking the journal line by
line would make field listing linear in the number of entries and
duplicate unique-value handling.

Keep the scope-option predicate next to add_filters(), and reject field
listings combined with options that actually limit the journal.
Display-only options such as --pager-end are left alone.

mkosi: update debian commit reference to 88c420c621dbd1b988950cf26fc60789fc989453

* 88c420c621 Install new files for upstream build
* 5b7f67b86a Update changelog for 261.1-3 release
* 3f91c8567c NEWS: update about removal of sd-tpm dep
* e13ba94737 Drop obsolete TODO
* b0c5def366 systemd: downgrade systemd-tpm and mount to recommends
* e9eb859a3a systemd-tpm: depend on tpm-udev for rules and tmpfiles.d
* 42f74bb110 Drop obsolete comment from d/control
* fc40986137 debian/extra/network: use NamePolicy=keep mac on USB wifi devices

oci-util: Don't fall back to default registry for explicit registries

When a reference specifies a registry like ghcr.io or quay.io for which
we don't have a registry config file, the fallback forced the default
registry which is wrong.

Only use the default as fallback when there is no explicit registry.

Cleanup argument and return values for copy_bytes_full helpers (#43034)

Follow-up for #43004.

boot: guard missing Windows auto entry

config_add_entry_loader_auto() returns NULL when the Windows loader
is missing. With reboot-for-bitlocker enabled, the caller still
assigned e->call and could dereference NULL.

This can happen when reboot-for-bitlocker is enabled but the ESP does
not contain EFI/Microsoft/Boot/bootmgfw.efi.

Before:
systemd-boot could dereference NULL while building the menu

Follow-up for: 661615a0afacee3545cde0a48286c0fef983f8fe.

journalctl: use root machine ID for namespaces

journalctl --root=... --list-namespaces scans the target root's
journal tree, but used the host machine ID when matching namespace
directories.

Reproducer:
  root="$(mktemp -d)"
  mid=11111111111111111111111111111111
  mkdir -p "$root/etc" "$root/var/log/journal/$mid.testns"
  printf '%s\n' "$mid" >"$root/etc/machine-id"
  journalctl --root="$root" --list-namespaces --quiet

Before:
  no output

Follow-up for: 68f66a171398e27280a95e58ae7464219cccaaec.

network: add IPv4ProxyARPAddress= and consolidate proxy ARP/NDP handling

This adds an IPv4 counterpart to `IPv6ProxyNDPAddress=` for adding
manual entries to the kernel's IPv4 neighbour proxy table (check via
`ip -4 neighbour show proxy dev <dev>`). systemd-networkd only exposed
`IPv4ProxyARP=` for per-interface `proxy_arp` sysctl (automatic proxy
ARP) with no way to manage manual entries from a .network file.

To avoid duplicating the IPv6 proxy NDP code path, both families are
now combined into a single new `networkd-neighbor-proxy` module. The
IPv6 behaviour is preserved: `IPv6ProxyNDPAddress=` still implies
`IPv6ProxyNDP=yes` unless `IPv6ProxyNDP=` is explicitly disabled and
entries are still dropped if the kernel has no IPv6 support.
The same rule is applied to `IPv4ProxyARPAddress=`. It implies
`IPv4ProxyARP=yes` when the sysctl is not explicitly set and has no
effect if `IPv4ProxyARP=` has been set to false.

This keeps the user model symmetric and predictable across both
families: a single per-address setting that turns on the matching
per-interface sysctl automatically, while still letting system
administrators opt out by setting the boolean explicitly to false.
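
The implication rule can be sketched as follows. This is an illustrative model only: the helper names are hypothetical, and the tristate encoding (negative means "not configured") is an assumption rather than the networkd representation.

```c
#include <stdbool.h>

/* configured: < 0 means the boolean was not set in the .network file,
 * 0 means explicitly false, > 0 means explicitly true. */

/* Effective value of the per-interface proxy sysctl. */
bool proxy_sysctl_enabled(int configured, unsigned n_proxy_addresses) {
        if (configured >= 0)
                return configured > 0;          /* explicit setting wins */
        return n_proxy_addresses > 0;           /* implied by per-address entries */
}

/* Whether the manual proxy entries take effect at all. */
bool proxy_entries_installed(int configured, unsigned n_proxy_addresses) {
        return n_proxy_addresses > 0 && configured != 0;
}
```

With the boolean unset, a single address entry turns the sysctl on; with the boolean explicitly false, the address entries have no effect.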

Note that the IPv4 manual NTF_PROXY entries installed here would
actually function without `proxy_arp` (unlike IPv6, where `proxy_ndp`
gates the manual entries); the implication is kept for symmetry with
`IPv6ProxyNDPAddress=` and is now called out explicitly in the man
page, together with the fact that enabling `proxy_arp` also activates
interface-wide automatic proxy ARP for routed-toward addresses on
connected subnets.

Parser-time validation rejects addresses the kernel would refuse:
the ANY/null address for both families, IPv4 and IPv6 multicast
and the IPv4 limited broadcast 255.255.255.255.
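
A sketch of such a parser-time check, using only the standard socket headers. The function names are hypothetical and the real parser may structure this differently; the rejected classes are the ones listed above.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdint.h>

/* Accept an IPv4 proxy address unless it is the null address,
 * a multicast address, or the limited broadcast address. */
bool ipv4_proxy_address_acceptable(const char *s) {
        struct in_addr a;

        if (inet_pton(AF_INET, s, &a) != 1)
                return false;                   /* not a valid IPv4 literal */

        uint32_t h = ntohl(a.s_addr);           /* the macros expect host byte order */
        return h != INADDR_ANY && h != INADDR_BROADCAST && !IN_MULTICAST(h);
}

/* Accept an IPv6 proxy address unless it is "::" or multicast. */
bool ipv6_proxy_address_acceptable(const char *s) {
        struct in6_addr a;

        if (inet_pton(AF_INET6, s, &a) != 1)
                return false;

        return !IN6_IS_ADDR_UNSPECIFIED(&a) && !IN6_IS_ADDR_MULTICAST(&a);
}
```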

Signed-off-by: Aritra Basu <aritrbas+gh@cisco.com>

shared/copy: simplify argument convention for reflink_range

reflink_range() would treat size==UINT64_MAX as "copy everything", but
only if the offsets were 0. There are only two callers: one always passes
a fixed size, the other translated size==UINT64_MAX to 0. Make things
more uniform by always treating size==UINT64_MAX the same as size==0.
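
As an illustration of the unified convention, here is a sketch that fills the kernel's struct file_clone_range. The helper name is hypothetical and this is not the reflink_range() implementation; it relies on the documented FICLONERANGE behaviour that a src_length of 0 means "up to the end of the source file".

```c
#include <linux/fs.h>
#include <stdint.h>

/* Build the FICLONERANGE argument, mapping size == UINT64_MAX to 0 so
 * that both spellings of "everything" reach the kernel identically. */
struct file_clone_range make_clone_range(int src_fd, uint64_t src_offset,
                                         uint64_t dest_offset, uint64_t size) {
        return (struct file_clone_range) {
                .src_fd = src_fd,
                .src_offset = src_offset,
                .src_length = size == UINT64_MAX ? 0 : size,
                .dest_offset = dest_offset,
        };
}
```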

shared/copy: simplify return convention for try_reflink_copy_bytes

Follow-up for 5a12005c834eab92d7c76d61c7090aa78b61cdfb.

Also a rescue a comment from the discussion in the PR.

tpm2-util: initialize NvPCRs on first extension

Both callers of tpm2_nvpcr_extend_bytes() duplicate the same dance: on
-ENETDOWN, i.e. when the NvPCR isn't anchored yet because
systemd-tpm2-setup hasn't run, they acquire the anchor secret,
initialize the NvPCR, and extend again. Move this into
tpm2_nvpcr_extend_bytes() itself. The only caller-specific bit, whether
the anchor secret shall be synced to /var, is passed in as a parameter.

Signed-off-by: Paul Meyer <katexochen0@gmail.com>

copy: drop COPY_REFLINK

Reflinking is a safe optimization whenever the source and destination are
regular, seekable files on a filesystem that supports cloning. Requiring each
caller to opt in through COPY_REFLINK adds flag plumbing without changing the
necessary fallback behavior.

Drop COPY_REFLINK, renumber the remaining flags, and have copy_bytes_full()
unconditionally try FICLONE or FICLONERANGE before entering its normal copy
loop. Isolate the clone attempt and its file-offset bookkeeping in a helper so
copy_bytes_full() only has to handle its cloned, unavailable, and error results.
A successful clone therefore completes in one operation, while an unsupported
or rejected clone falls back to progress-limited copy_file_range(), sendfile(),
or buffered copying. Keep the public reflink helpers for operations that
specifically require clone semantics.

Existing test-copy coverage exercises both bounded and unbounded regular-file
copies through copy_bytes_full().

Signed-off-by: Daan De Meyer <daan@amutable.com>

repart: make COW behavior configurable

systemd-repart currently forces newly created image files into NOCOW mode.
That prevents files from being reflinked into the image, making image builds
slower and increasing their disk usage on filesystems that support cloning.

Add a tristate --cow= option. By default, leave the filesystem or parent
directory COW policy unchanged. With --cow=yes, explicitly enable COW; with
--cow=no, retain the previous behavior of forcing NOCOW. Add XO_COW as the
counterpart to XO_NOCOW so xopenat_full() applies either policy while retaining
its normal creation-error cleanup.

Document the new option and extend TEST-58-REPART to verify inherited COW and
NOCOW policies as well as explicit COW and NOCOW overrides. Compare the unset
behavior with the filesystem default so the test also works on nodatacow
mounts, and skip it when the inode attribute is unsupported.

Signed-off-by: Daan De Meyer <daan@amutable.com>

rules: install 60-persistent-media-controller.rules

Commit 04f19d673587 ("udev: Add /dev/media/by-path symlinks for media
controllers") added the rule file and the feature was announced in the
v256 NEWS, but the file was never added to the rules.d/meson.build
install list, so no build has ever shipped it and the advertised
/dev/media/by-path/ symlinks are never created.

Add it to the unconditionally installed rules.

sd-dhcp-relay: fix off-by-one when discarding BOOTREQUEST messages by hops count

According to RFC 1542, section 4.1.1:
```
The relay agent MUST silently discard BOOTREQUEST messages whose 'hops'
field exceeds the value 16.
```

"Exceeds the value 16" means hops > 16, i.e. a message that arrives with
hops == 16 is still valid and must be relayed (after which its hops field
becomes 17). The code used ">= 16", which silently dropped a valid message
that had legitimately traversed exactly 16 relay agents, one hop too early.
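
The corrected comparison, as a minimal sketch. The constant and function names are hypothetical, not the sd-dhcp-relay identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

#define BOOTP_MAX_HOPS 16       /* limit quoted from RFC 1542 section 4.1.1 */

/* Discard only when the incoming hops field exceeds the limit;
 * a message arriving with hops == 16 is still relayed. */
bool bootrequest_should_discard(uint8_t hops) {
        return hops > BOOTP_MAX_HOPS;
}
```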

This matches the wording of the adjacent comment, which already says
"exceeds the value 16".

dlopen-note: drop unnecessary header

It unexpectedly snuck in there.

Follow-up for a74b3e1778ec00a21f5d1f10947daef2c02c6ffe.

bcd, id128, sysupdate: tighten edge-case handling (#43018)

Assorted udev hardening fixes flagged by kres (#42903)

resolved: fix spurious BrowseServices add/remove flapping with ifindex=0 (#42982)

## Problem
A `BrowseServices` subscription with `"ifindex":0` (browse all
interfaces) receives a continuous flap of `added`/`removed` events for a
service that is still present — within a second, and with no goodbye
packet involved.

## Root cause
For `ifindex==0`, `mdns_browser_revisit_cache()` looked up each mDNS
scope's cache separately and called `mdns_manage_services_answer()`
**once per scope**. That function derives `removed` events by diffing
the browser's *global* discovered-service list (all interfaces, filtered
only by owner family) against the single answer it's handed. So with ≥2
mDNS-relevant interfaces of the same family, a service present on
interface A isn't in interface B's answer and is spuriously removed
while B is reconciled, then re-added on the next revisit tick.

## Fix
Accumulate the pruned cache answers from every matching mDNS scope into
one combined `DnsAnswer` and reconcile **once**, so removals diff the
global list against the union across interfaces. Items are merged with
`dns_answer_add_full()` (not `dns_answer_extend()`, which defaults each
item's `until` to `USEC_INFINITY` and would skew the RFC 6762 §5.2
TTL-maintenance schedule). The single-interface (`ifindex>0`) path is
unchanged.
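
The difference between the two reconciliation strategies can be shown with a toy model in which every service is one bit and every mDNS scope contributes a bitmask of the services in its cache. This is purely illustrative; none of these names exist in resolved.

```c
#include <stddef.h>
#include <stdint.h>

/* Old behaviour: diff the global list against each scope separately. */
uint32_t removed_per_scope(uint32_t known, const uint32_t *scope_answers, size_t n) {
        uint32_t removed = 0;

        for (size_t i = 0; i < n; i++)
                removed |= known & ~scope_answers[i];   /* anything missing from this scope */
        return removed;
}

/* New behaviour: diff the global list once against the union of all scopes. */
uint32_t removed_union(uint32_t known, const uint32_t *scope_answers, size_t n) {
        uint32_t all = 0;

        for (size_t i = 0; i < n; i++)
                all |= scope_answers[i];
        return known & ~all;
}
```

With a service cached on interface A only and a second, service-less interface B, the per-scope diff reports the service as removed while the union diff reports nothing.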

## Test
`TEST-89-RESOLVED-MDNS.sh` gains `testcase_browse_ifindex_zero_no_flap`:
it adds a service-less dummy mDNS link to guarantee ≥2 same-family
scopes (the flap precondition), browses `ifindex=0`, waits for
discovery, then asserts **zero** `removed` events while every publisher
stays up. The subscription uses `varlinkctl --timeout=infinity`, since
it sits idle after discovery and the default 45s idle timeout would
sever it (and the assertion) mid-observation.

## Testing status
Builds clean; `shellcheck -x` clean. First CI round: `TEST-89`
(including this testcase) passed on all mkosi platforms; the failing
jobs all traced to unrelated flakes/infra.

version_is_valid() tweaks (#43025)

We have two validators, and neither really makes sense. Let's unify and
clean things up.

boot: downgrade EFI_MEMORY_ATTRIBUTE_PROTOCOL warning

U-Boot currently does not implement EFI_MEMORY_ATTRIBUTE_PROTOCOL
even when reporting EFI version >= 2.10. Consequently, systemd-boot
emits a warning on every boot when running on U-Boot firmware.

The absence of EFI_MEMORY_ATTRIBUTE_PROTOCOL is a current
U-Boot limitation and not a condition users can remedy.
Furthermore, the EFI specification does not require all
firmware advertising EFI 2.10 or newer to implement the
protocol. As a result, the warning provides little value on
U-Boot systems while causing log_wait() to impose a 2.5-second
boot delay.

Downgrade the message to LOG_DEBUG; this keeps the diagnostic
available for debugging purposes without penalizing normal
boot time.

Signed-off-by: Aswin Murugan <aswin.murugan@oss.qualcomm.com>

Assorted basic/shared/home/import hardening fixes flagged by kres (#42950)

sysupdate: Add ListFeatures() and ListTargets() varlink methods (#42900)

Following on from adding the basic varlink scaffolding to sysupdate,
let’s varlinkify a couple of the D-Bus methods. Because varlink doesn’t
have a concept of object paths, the D-Bus path structure which allows a
target to be selected has been squashed down to a target argument for
each relevant method.

Varlinkify the way to list targets, and also the way to list features
because that was simple to do at the same time.

More methods need varlinkifying in the future, but let’s do it in small
and manageable chunks.

test: extend version_is_valid() testcase a bit

bootspec: log about invalid version strings

man: update version formatting requirements in os-release

Allow full UAPI.10 version strings, i.e. "+", "_", "~" and "^" too, to
match the recent reworking.

These version strings are generally distro-managed, hence use the more
liberal alphabet.

Fixes: #32785

string-util: replace version_is_valid()/version_is_valid_version_spec() by a common call

Let's take inspiration from string_is_safe() and take a flags field that
allows fine tuning the validation.

Then port over all current users of either function to the new logic.
Note that this *does* change behaviour in various cases:

1. Generally: we'll now always accept the full UAPI.10 alphabet,
   including the "~" and "^" characters. As far as I can see there's no
   downside to this liberalization as none of the current consumers of
   the two functions uses these characters for anything else.

2. systemd-analyze compare-version will now accept version strings with
   "_" and "+" without complaining. I see no downside here, it just
   normalizes these debugging tools, to make them accept what most our
   other tools accept.

3. "bootctl link" will not accept empty version strings anymore,
   which is a bugfix I guess.

4. vpick will now refuse "_" and "+" in version strings. It kinda
   already did, because when parsing versions from filenames it uses "_"
   and "+" as name, architecture and attempt counter separators. We now
   systematically refuse it everywhere else in vpick too. This is hence
   a clean-up.
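
A rough sketch of a flags-based validator along these lines. Everything here is an assumption made for illustration: the function name, the flag name and the exact character set are hypothetical and not taken from string-util.

```c
#include <stdbool.h>
#include <string.h>

#define SKETCH_ALNUM "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

/* Hypothetical flag: refuse "_" and "+", as a vpick-like caller would. */
enum { SKETCH_VERSION_REFUSE_UNDERSCORE_PLUS = 1 << 0 };

bool sketch_version_is_valid(const char *s, unsigned flags) {
        const char *extra = (flags & SKETCH_VERSION_REFUSE_UNDERSCORE_PLUS) ? "-.~^" : "-.~^_+";

        if (!s || s[0] == '\0')
                return false;                   /* empty versions are refused */

        for (const char *p = s; *p; p++)
                if (!strchr(SKETCH_ALNUM, *p) && !strchr(extra, *p))
                        return false;           /* e.g. ',' is not in either set */
        return true;
}
```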

Fixes: #28906
Replaces: #42815 #41937

string-util: reorder characters in version charset

Let's bring the version string character set into a systematic order,
matching the order in which they appear and are defined in the UAPI.10
specification text.

This makes it easier to compare the relevant functions.

string-util: do not accept ',' in version strings

Allowing this apparently has been cargo-culted from my initial sysupdate
PR, but Claude and I could not find a single other software package
that uses "," as a character within version strings. Hence, let's remove
this, even though this is a compat breakage of a kind, in the hope
nobody notices. We can easily restore this if this later shows to be an
issue for people.

sysupdate: use strverscmp_improved() like everywhere else

At one location we accidentally called strverscmp() instead of
strverscmp_improved().

timedatectl: display RTCTimeUSec in UTC format

Fixes #39930

The RTCTimeUSec property was being displayed in local time format
when using 'timedatectl show', while 'timedatectl status' correctly
displayed it in UTC format. This inconsistency was due to the
bus_print_all_properties() function using TIMESTAMP_PRETTY style
for all timestamps, which formats in local time.

This fix adds special handling for RTCTimeUSec to use TIMESTAMP_UTC
style, ensuring consistent UTC display across both commands.

Rename basic-forward.h, sd-forward.h, and shared-forward.h to forward.h

Follow-up for 74d392ed1bab578e901699ee272faa0c8b922128.

network: do not use assert() on a call with side effects

Fixes bf943a9d49941801b45e4631f010359619173d12.

dissect-image: don't assert() on partition geometry from blkid

The per-partition loop in dissect_image() reads the start and size (in
512-byte sectors) of each partition from libblkid and guards the
following byte conversions with assert(). Image sizes are input, not
programming errors, so return an error instead of asserting.
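
A checked conversion in that spirit might look like this. The helper name and the choice of -ERANGE are assumptions for illustration, not what dissect_image() necessarily uses.

```c
#include <errno.h>
#include <stdint.h>

/* Convert a count of 512-byte sectors to bytes, reporting overflow
 * as an error instead of asserting on externally supplied data. */
int sectors_to_bytes(uint64_t sectors, uint64_t *ret) {
        if (sectors > UINT64_MAX / 512)
                return -ERANGE;         /* the product would not fit in 64 bits */

        *ret = sectors * 512;
        return 0;
}
```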

Follow-up for 88b3300fdc64d5320fb50d0f369d3fc0885e15e8

homework-luks: add new key slots before destroying old ones

home_passwd_luks() rotates the LUKS key slots with a single loop that
destroys slot i before adding its replacement at the same index. If
adding the replacement fails (e.g. argon2 OOMs), slot i is left
destroyed with no replacement and no rollback. If the user has only
one password, its only key slot is now gone and the home directory can
no longer be unlocked.

Rotate in two passes instead: first add every new password into a free
slot (CRYPT_ANY_SLOT), and only once all adds succeed, destroy the old
slots. If an add fails, roll back the slots added so far and return
with the pre-existing slots untouched, so at least one valid key slot
always remains for each password the user holds.
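
The two-pass scheme can be sketched against a toy in-memory slot table. The types and functions are hypothetical stand-ins and this is not the libcryptsetup API; fail_after is a test hook that makes the add operation fail after a given number of successes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define N_SLOTS 8

typedef struct {
        bool used[N_SLOTS];
        int fail_after;         /* adds allowed before failing; negative means never fail */
} KeySlots;

/* Stand-in for "add key to any free slot"; returns the slot or -1. */
int slots_add_any(KeySlots *k) {
        if (k->fail_after == 0)
                return -1;
        for (int i = 0; i < N_SLOTS; i++)
                if (!k->used[i]) {
                        k->used[i] = true;
                        if (k->fail_after > 0)
                                k->fail_after--;
                        return i;
                }
        return -1;              /* no free slot left */
}

int rotate_two_pass(KeySlots *k, size_t n_new) {
        bool old[N_SLOTS];
        int added[N_SLOTS];
        size_t n_added = 0;

        memcpy(old, k->used, sizeof(old));      /* remember the pre-existing slots */

        /* Pass 1: add every new key; on failure undo only what we added. */
        for (size_t i = 0; i < n_new; i++) {
                int slot = slots_add_any(k);
                if (slot < 0) {
                        while (n_added > 0)
                                k->used[added[--n_added]] = false;
                        return -1;
                }
                added[n_added++] = slot;
        }

        /* Pass 2: all adds succeeded, now destroy the old slots. */
        for (int i = 0; i < N_SLOTS; i++)
                if (old[i])
                        k->used[i] = false;
        return 0;
}
```

If the second add fails, the table ends up exactly as it started, with the original slot still present.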

Follow-up for 70a5db5822c8056b53d9a4a9273ad12cb5f87a92

pull-oci: switch assert() to assert_se() for set_remove() call

This has obvious side effects so switch it over
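
For context, the hazard is that a plain assert() is compiled out when NDEBUG is defined, taking the call inside it along. A generic illustration in standard C follows (systemd's assert_se() always evaluates its argument; the helper names below are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/* An operation with a side effect: takes one unit from the counter. */
bool counter_take(int *counter) {
        if (*counter <= 0)
                return false;
        (*counter)--;
        return true;
}

void take_checked(int *counter) {
        /* Wrong: assert(counter_take(counter)); vanishes under NDEBUG. */
        bool ok = counter_take(counter);        /* always executed */
        assert(ok);                             /* only the check is optional */
        (void) ok;                              /* silence "unused" under NDEBUG */
}
```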

Follow-up for a9f6ba04969d6eb2e629e30299fab7538ef42a57

copy: avoid following fifo/node chmod target

Use AT_SYMLINK_NOFOLLOW when applying copied FIFO and device
node modes, matching the adjacent ownership and timestamp updates.

Follow-up for e69cc9eb36fd6e76710b4d5f4bb7013980fb5174

log: add upper bound to journal iovec accounting

Use checked arithmetic before clamping the journal iovec length.
Stop copying input iovecs once the fixed journal vector is full.

Follow-up for e69cc9eb36fd6e76710b4d5f4bb7013980fb5174

string-util: add upper bound to ellipsize_mem UTF-8 walks

Validate UTF-8 characters against the caller-provided byte slice.
This stops truncated sequences from borrowing bytes past old_length.
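
A sketch of a length-bounded UTF-8 step, to illustrate the idea. The function is hypothetical, not the ellipsize_mem() code, and it deliberately checks only sequence length and continuation bytes, not overlong encodings or surrogates.

```c
#include <stddef.h>

/* Return the length of the UTF-8 sequence starting at p if it lies
 * entirely within the n available bytes, or 0 if it is malformed or
 * would run past the end of the slice. */
size_t utf8_char_len_bounded(const unsigned char *p, size_t n) {
        size_t len;

        if (n == 0)
                return 0;

        if (p[0] < 0x80)
                len = 1;
        else if ((p[0] & 0xE0) == 0xC0)
                len = 2;
        else if ((p[0] & 0xF0) == 0xE0)
                len = 3;
        else if ((p[0] & 0xF8) == 0xF0)
                len = 4;
        else
                return 0;               /* stray continuation or invalid lead byte */

        if (len > n)
                return 0;               /* truncated: never read past the slice */

        for (size_t i = 1; i < len; i++)
                if ((p[i] & 0xC0) != 0x80)
                        return 0;
        return len;
}
```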

Follow-up for e69cc9eb36fd6e76710b4d5f4bb7013980fb5174

sysupdate: Allow multiple documentation URLs for a feature

Change the varlink API for Feature structs to allow multiple
documentation URLs for them, to match what systemd already does for
units etc.

This is a deviation from what the sysupdated D-Bus API allows and, for
the moment, from what’s supported internally by sysupdate. Internally it
continues to support 0-1 URLs for now.

But by defining the API as a strv, multiple URLs can be supported in
future without API breaks.

sysupdate: Run ListFeatures in offline mode

Historically, the ListFeatures API (in both D-Bus and now varlink) was
run without an `--offline` flag. This looks like an oversight, but
actually `verb_features()` always unconditionally loaded in offline
mode.

In any case, there doesn’t appear to be a reason for the context to be
online (i.e. for it to check sources for available updates). Features are
defined in local config files and are loaded by `read_features()`, which
is called in both offline and online mode.

When the varlink ListFeatures API was added, it was put into online mode
in order to match the lack of `--offline` argument in the existing D-Bus
API implementation. Change both of them to be explicitly in offline mode.

Signed-off-by: Philip Withnall <pwithnall@gnome.org>

test: Remove a redundant exit call

`[[ blah ]] || exit 1` is equivalent to `[[ blah ]]`.

Fixes: b0ca987cd96

sysupdate: Downgrade an info to a debug log message

Since we now enumerate all targets on a varlink call (to validate the
requested target), this message gets printed multiple times in the log.
It’s not really necessary, so downgrade it to a debug message.

sysupdate: Add varlink ListTargets() method

And add integration tests for it using `jq`.

sysupdate: Add varlink ListFeatures() method

And add integration tests for it using `jq`.

sysupdate: Factor out core of `features` verb

This will be used in the following commit to add a varlink interface for
it.

This introduces no functional changes.

man: update gpt-auto-generator ESP mounting behavior

Remove the stipulation that /boot/ must exist for the ESP to be mounted
there to reflect the change in #34550.

meson: speed up building standalone binaries

Previously, files listed in 'sources' were built twice:
once when building the main binary, and again when building the
statically linked one.

This change ensures that all object files from the main binary are
reused when building the static binary. Hence, the only step now
necessary for the static binary is linking the object files.

Follow-up for 39d00e1d20717e56285795335fe3172fc24f3577.

repart: Properly pre-calculate auto size of images

When passing --size=auto to repart, it will pre-calculate the image size and
resize the image to that size before partitioning. Currently, that fails when
passing a large grain size, complaining that the auto-sized image is too small
to fit the data.

The reason for this is that the current code simply assumes the GPT metadata
size taken away from the usable size by fdisk is static (1044KiB), when it
actually is more complicated than that:

There are two ranges of GPT metadata: One at the beginning of the image, and one
at the end of the image. And there's the first usable block that is defined by
fdisk when creating the partition table.

The static value of 1044KiB usually works, because fdisk sets the first usable
block to 1MiB (so 1024KiB), leaving 20KiB of leeway for the secondary GPT at
the end of the image.

Now as soon as the first partition starts at an offset higher than 1024KiB, we
lose the 20KiB leeway for the secondary GPT, and the partitions will no longer
fit.

What we should do is, first, round up to the grain size instead of 4096 (as
that's the minimum offset our first partition will start at) and, second,
properly subtract the secondary GPT at the end.

Also confirm we don't regress on this anymore by adding a test that uses a 2MiB
grain size, breaking the old code.

Assorted resolved hardening fixes flagged by kres (#43011)

sysupdate: Improve empty table when printing features

Rather than just printing out the table header and then exiting, print
“No features” similarly to what `loginctl` or `storagectl` do.

sysupdate: print version details once

list VERSION printed the version status header twice in plain output.

Keep the complete later header and add a regression check.

Follow-up for: 42c0b689a800b2ec7cceb1528d3834bf9b3417f8.

id128: honor json output for single ids

JSON mode was accepted by single-ID verbs but still printed bare IDs.

Print JSON objects for those verbs and reject ambiguous JSON combinations.

Follow-up for: a50666e376057a24b67f80c9a8025096c750fb23.

boot: allow BCD fields to end at buffer limit

Allow offset + len == max; the range ends exactly at the buffer boundary.

Keep the overflow-safe bounds check and cover the edge case in test-bcd.
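
An overflow-safe range check of this shape, as a minimal sketch (the function name is hypothetical, not the one used in the BCD parser):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if [offset, offset + len) lies within a buffer of max bytes.
 * Written without computing offset + len, so it cannot wrap around,
 * and it accepts a range that ends exactly at the buffer boundary. */
bool range_in_buffer(size_t offset, size_t len, size_t max) {
        return offset <= max && len <= max - offset;
}
```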

Follow-up for: aa1d0f25873f737fb9306a12f9283872012f2d9a.

boot: honour "read-only" vfat flag on random seed file (#43012)

Inspired by #42979

systemd-run: reject unsupported option combinations (#43003)

Boundary tests were conducted on the "run" command tool, and the
identified issues were resolved.

repart: Fix growing the partition preceding a FreeArea from leftover space (#42969)

The "Donate to preceding partition" logic is dead code since commit
https://github.com/jonas2515/systemd/commit/19903a433507897449c086b72abb5e133e431336
("repart: split out context_grow_partition_one()").
context_grow_partition_one() gets passed a free area and a partition, and it
has an early-return check to ensure the partition it got passed belongs to the
free area it got passed. That means we compare the FreeArea a to the FreeArea
a->after->allocated_to_area, which always yields FALSE.

Fix the behavior of donating any left over space to the preceding partition by
adding that partition to the loop below (and relying on the partitions list
being ordered according to physical partition offsets).

Since this behavior is not that easy to trigger, mention how to trigger it in a
comment, and add a test for it as well.

portablectl: retry inspect with PORTABLE_PREFIXES

When no prefix is specified, portablectl inspect first tries the
prefix derived from the image name. This keeps inspect aligned with
attach behavior.

If that lookup finds no matching units, retry with validated
PORTABLE_PREFIXES read from the image os-release. This makes inspect
work for images whose filename does not match their portable service
prefix.

Keep metadata error handling explicit so request-construction failures
are not logged twice, while sd_bus_call() failures still include the
inspect context.

Add a TEST-29-PORTABLE regression case for a directory image whose
name does not match the portable service prefix.

Fixes #37296.

Signed-off-by: dongshengyuan <dongshengyuan@uniontech.com>

sysext: honor confext config during notify refresh

systemd-sysext handles the sysupdate notification hook,
but the hook refreshes both sysexts and confexts.

Build the refresh context for each image class so confext settings,
such as Mutable=yes, are applied when confexts are refreshed.

Add regression coverage for the mutable confext overlay.

Fixes #42873
Signed-off-by: dongshengyuan <dongshengyuan@uniontech.com>

dns-answer: preserve shared aliases when removing records

dns_answer_remove_by_rr() mutates the backing OrderedSet even when the answer
has multiple references. Callers such as trust-anchor revocation could thus
change snapshots held elsewhere in the resolver.

Clone a shared answer before removal.

Follow-up for 71aee23dba7faeef68e7232f444626267a6c90d7

resolved: roll back partial DNS zone publication

dns_zone_link_item() inserts a new item into the by-key hashmap before
publishing it by name. If the second hashmap insertion fails, _cleanup_
frees the item while by-key retains a dangling key and value.

Follow-up for 623a4c97b9175f95c4b1c6fc34e36c56f1e4ddbf

man: clarify that --when= is a lower bound, not a condition

`systemctl reboot --when=yesterday` reboots the machine immediately, which
surprised users enough to be reported as a bug. It is not one: the timestamp
passed to --when= (and to ScheduleShutdown(), and to shutdown(8)) declares the
earliest point in time the action may be taken, it is not a condition that is
evaluated and that could fail.

Behaving any differently would be racy and surprising: "--when=now" refers to
the past by the time the request is processed, and "--when=+50ms" may well have
elapsed already due to scheduling latencies. In both cases we must still carry
out the action the user asked for.
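
Expressed as code, the "earliest point in time" semantics amount to clamping the remaining delay at zero. This is a hypothetical helper for illustration, not the logind implementation:

```c
#include <stdint.h>

/* Delay until the action may run: a timestamp that already lies in
 * the past simply means "now", it never makes the request fail. */
uint64_t shutdown_delay_usec(uint64_t when_usec, uint64_t now_usec) {
        return when_usec > now_usec ? when_usec - now_usec : 0;
}
```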

Document the semantics explicitly in systemctl(1), shutdown(8) and the
org.freedesktop.login1(5) D-Bus interface documentation.

Fixes: #42437

boot: skip boot counter logic for entries marked read-only

If the read-only FAT file attribute is set on a boot entry file with a
counter in its name, don't attempt to rename it, but simply skip the
boot counter logic for it, taking the flag as a hint that the entry
shall not be subject to boot assessment.

boot-secret: don't initialize secret mixin file if marked read-only

If the boot secret mixin file exists but is empty we'd normally fill it
with a fresh mixin. Refrain from that if the read-only FAT file
attribute is set on it, taking that as a hint that the file shall not
be initialized. Reading an existing, fully populated mixin file remains
unaffected by the flag.

boot: skip random seed handling if seed file is marked read-only

If the read-only FAT file attribute is set on /loader/random-seed,
don't update the seed file — and hence don't use it either, since a
seed we cannot update would be the same on every boot.

This gives users an explicit way to turn off random seed handling by
marking the file read-only, useful for example in pre-built OS images
that are replicated to many systems, where the baked-in seed is shared
and hence must not be credited.

The check is done upfront in process_random_seed(), before any other
work, mirroring the existing check for read-only volumes. This covers
both systemd-boot and systemd-stub, which share this code.

Inspired-by: #42979

run: reject waiting for remain-after-exit services

Reproducer:
  unit=run-wait-rae-$(date +%s)
  sudo timeout 3s systemd-run --wait --remain-after-exit \
      --unit="$unit" /bin/true
  echo $?
  systemctl is-active "$unit.service"

Before, the command timed out with exit status 124 while the service
stayed active. --wait waits for deactivation, but RemainAfterExit=yes
keeps the service active after the command exits.

Follow-up for 2a453c2ee3090e1942bfd962262f3eff0adbfa97

run: reject JSON output with verbose logs

Reproducer:
sudo systemd-run --wait --verbose --json=short /bin/echo hi

Before, systemd-run printed JSON metadata to stdout while --verbose
also spawned journalctl output on stdout. The resulting stream mixed JSON
with journal lines, so reject the conflicting options.

Follow-up for 744ca8f616b98f579d930176f6262e1ac197840b

run: reject JSON output in scope mode

Reproducer:
sudo systemd-run --scope --json=short /bin/echo hi

Before, systemd-run printed JSON metadata to stdout and then executed
the scope command on the same stdout. The combined stream was not valid
JSON, so reject --json= in scope mode.

Follow-up for fe5a6c47af675bc0020c545d86fb103492e1d77c

run: reject JSON output for trigger units

Reproducer:
  unit=run-json-trigger-$(date +%s)
  sudo systemd-run --json=short --unit="$unit" \
      --on-active=30s \
      /bin/true

Before, trigger mode accepted --json=short but printed only human-readable
"Running timer as unit" and "Will run service" lines. Reject the option
until trigger mode has structured output.

Follow-up for fe5a6c47af675bc0020c545d86fb103492e1d77c

run: reject JSON output with stdio forwarding

Reproducer:
sudo systemd-run --wait --pipe --json=short /bin/echo hi

Before, systemd-run wrote its JSON metadata to stdout and then passed
the command stdout through the same stream. The combined output was not
valid machine-readable JSON, so reject the conflicting modes.

Follow-up for fe5a6c47af675bc0020c545d86fb103492e1d77c

run: reset groups before scope uid switch

Reproducer:
sudo systemd-run --scope --uid=nobody /usr/bin/id

Before, the command ran as nobody but kept the caller supplementary
root group, for example groups=65534(nogroup),0(root). Scope mode
performs the uid/gid switch locally, so initialize the target user groups
before dropping privileges.

Follow-up for 4de33e7f3238a6fe616e61139ab87e221572e5e5

run: accept explicit trigger unit names

Reproducer:
  unit=run-explicit-path-$(date +%s)
  systemd-run --unit="$unit.path" \
      --path-property=PathExists=/tmp \
      /bin/true
  echo $?

Before, an explicit .path unit name was not recognized as the
trigger unit. It was mangled again as a service name, so PID 1 rejected
the transient request with an already-loaded unit conflict.

Follow-up for d59ef3e24362aa7a9e209ed07db5feca1a2cdb8e

run: reject --ignore-failure in scope mode

Reproducer:
sudo systemd-run --scope --ignore-failure /bin/false
echo $?

Before, the option was accepted but had no effect because scope mode
executes the command locally after creating the scope. The flag is only
encoded into service ExecStart properties, so accept it only where it can
be applied.

Follow-up for 1072d9473123ed174e033704fc5222216b655c9e

run: honor --no-block for trigger units

Reproducer:
  unit=run-nb-trigger-$(date +%s)
  systemd-run --no-block --collect --unit="$unit" \
      --socket-property=ListenStream=/proc/systemd-run-repro/socket \
      /usr/bin/true
  echo $?

Before, systemd-run still waited for the trigger unit job and
propagated the socket start failure. With --no-block it should only
verify and enqueue the request, as the service path already does.

Follow-up for 3d161f991e16369aa59f447eb4cdb90af33261c8

userdb: suppress userdb queries for backends indicating uid/gid/name range info via xattrs on entrypoint sockets (#42961)

Let's optimize userdb queries a bit: by encoding the covered UID/GID
ranges and user/group name patterns on the varlink entrypoint sockets
for userdb backends we can make them wake up less and reduce the work
triggered by queries.

sysupdate: add "suggestion" concept to feature and component enablement (#42970)

repart: allow empty EncryptedVolume= volume name (#42889)

Treat an empty volume name alongside other fields as unset instead of
rejecting it as invalid.

Example use case:
```
EncryptedVolume=:none:discard
```

In this case, the volume name is not specified so it can be generated as
luks-UUID.

From the docs:

> EncryptedVolume=
> Specifies how the encrypted partition should be set up. Takes at least
one and at most four fields separated with a colon (":"). The first
field specifies the encrypted volume name under /dev/mapper/. If not
specified, "luks-UUID" will be used where "UUID" is the LUKS UUID.
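
A small Python sketch of the parsing rule, assuming only what the quoted documentation states: one to four colon-separated fields, with the first being the volume name. The remaining fields are passed through uninterpreted, and the handling of a completely empty value is outside the scope of this sketch.

```python
from typing import List, Optional, Tuple


def parse_encrypted_volume(value: str) -> Tuple[Optional[str], List[str]]:
    """Split an EncryptedVolume= value into (volume_name, other_fields).

    An empty first field yields None, meaning "generate the name later".
    """
    fields = value.split(":")
    if not 1 <= len(fields) <= 4:
        raise ValueError("expected one to four colon-separated fields")
    name = fields[0] or None
    return name, fields[1:]


def volume_name(name: Optional[str], luks_uuid: str) -> str:
    """Fall back to luks-UUID when no name was configured."""
    return name if name is not None else "luks-" + luks_uuid


name, rest = parse_encrypted_volume(":none:discard")
print(name, rest)
print(volume_name(name, "1234"))
```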

repart: Make use of partitions list being ordered by offsets while looping

Let's do the same thing we did above with the last commit here too and make use
of the partitions list being ordered by physical offsets. This makes the code a
little simpler.

repart: Fix growing the partition preceding a FreeArea from leftover space

The "Donate to preceding partition" logic is dead code since commit
19903a4 ("repart: split out context_grow_partition_one()").
context_grow_partition_one() gets passed a free area and a partition, and it has
an early-return check to ensure the partition it got passed belongs to the free
area it got passed. That means we compare the FreeArea a to the FreeArea
a->after->allocated_to_area, which always yields FALSE.

Fix the behavior of donating any leftover space to the preceding partition
by adding that partition to the loop below (and relying on the partitions list
being ordered according to physical partition offsets).

Since this behavior is not that easy to trigger, mention how to trigger it in a
comment, and add a test for it as well.

process-util: introduce prctl_safe() and port everything over to it (#43006)

prctl() is an API full of pitfalls: it is variadic, and some interfaces
don't expect zero-initialization of excess arguments, and others do.
Moreover, the parameters are "long", and nonetheless we usually pass
"int" to them. If we are too dumb to call it properly, let's just not
call it directly anymore, but let's add a wrapper around it that makes
the function non-variadic and declares the right types. Then, let's port
over everything to it.

This is inspired by #42996, but we had issues with this many times
before, and looking at this PR one can see that we still have the
issue in numerous other places.
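
The same idea can be sketched from Python via ctypes, where the pitfall is similar: without a declared signature, integers are passed as C int. This is only an analogy to the C wrapper described above; `prctl_checked` is a hypothetical name and not the systemd helper. The sketch is Linux-only.

```python
import ctypes
import os

# Option number from <linux/prctl.h>.
PR_GET_NO_NEW_PRIVS = 39

_libc = ctypes.CDLL(None, use_errno=True)
# Declare a fixed, non-variadic signature: every argument after the
# option is an unsigned long, never a narrower C int.
_libc.prctl.argtypes = [ctypes.c_int] + [ctypes.c_ulong] * 4
_libc.prctl.restype = ctypes.c_int


def prctl_checked(option: int, arg2: int = 0, arg3: int = 0,
                  arg4: int = 0, arg5: int = 0) -> int:
    """Call prctl() with all five arguments always supplied.

    Unused arguments default to zero, which some options require.
    Raises OSError on failure instead of returning -1.
    """
    ret = _libc.prctl(option, arg2, arg3, arg4, arg5)
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return ret


# PR_GET_NO_NEW_PRIVS fails with EINVAL if any excess argument is
# non-zero, so the zero defaults matter here.
print(prctl_checked(PR_GET_NO_NEW_PRIVS))
```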

cryptsetup: add Argon2id-based PIN mode for TPM2 enrollment (#41859)

The current TPM2 PIN mode is flawed as a compromised TPM directly exposes
the sealed secret, which is the LUKS volume key itself
(https://github.com/systemd/systemd/pull/27502 and
https://github.com/systemd/systemd/issues/37386).

Goal: add Argon2id-based PIN hardening to TPM2 enrollment, making
the TPM a second factor rather than a single point of failure:

1. Password + salt → Argon2id → 512-bit key split into Key1 + Key2
2. Key2 (base64-encoded) is used as the PIN to seal a random secret
in the TPM
3. Key1 + unsealed secret → HKDF-SHA256 → final LUKS volume key

This implementation ensures that if the TPM is compromised, an attacker
still needs the password to derive Key1 and combine it with the unsealed
secret.
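
The three steps above can be sketched in Python as follows. Several things here are assumptions for illustration only: scrypt stands in for Argon2id (which the Python standard library lacks), `FakeTpm` is a toy object rather than a real TPM, and the choice of which half is Key1 as well as the concatenation used as HKDF input are guesses, not taken from the implementation.

```python
import base64
import hashlib
import hmac
import os


def stretch_password(password: bytes, salt: bytes) -> bytes:
    """Derive 512 bits from the password (scrypt as Argon2id stand-in)."""
    return hashlib.scrypt(password, salt=salt, n=2**14, r=8, p=1, dklen=64)


def hkdf_sha256(ikm: bytes, length: int = 32, salt: bytes = b"",
                info: bytes = b"") -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract, then expand."""
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]),
                         hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


class FakeTpm:
    """Toy TPM: releases the sealed secret only for the matching PIN."""

    def seal(self, secret: bytes, pin: bytes) -> None:
        self._secret, self._pin = secret, pin

    def unseal(self, pin: bytes) -> bytes:
        if not hmac.compare_digest(pin, self._pin):
            raise PermissionError("wrong PIN")
        return self._secret


def split_keys(password: bytes, salt: bytes):
    """Step 1: stretch the password and split it into Key1 and Key2."""
    stretched = stretch_password(password, salt)
    return stretched[:32], stretched[32:]


def enroll(password: bytes, salt: bytes, tpm: FakeTpm) -> None:
    """Step 2: seal a random secret, using base64(Key2) as the PIN."""
    _, key2 = split_keys(password, salt)
    tpm.seal(os.urandom(32), base64.b64encode(key2))


def derive_volume_key(password: bytes, salt: bytes, tpm: FakeTpm) -> bytes:
    """Step 3: combine Key1 with the unsealed secret via HKDF-SHA256."""
    key1, key2 = split_keys(password, salt)
    secret = tpm.unseal(base64.b64encode(key2))
    return hkdf_sha256(key1 + secret)


salt = os.urandom(16)
tpm = FakeTpm()
enroll(b"correct horse", salt, tpm)
print(len(derive_volume_key(b"correct horse", salt, tpm)))
```

Even if the toy TPM's secret leaks, the volume key still depends on Key1, which never leaves the password-derived side.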

The --tpm2-with-pin= option now accepts three values:
- false (no PIN used)
- true (PIN hardened with Argon2id - default)
- "direct" (legacy PIN without Argon2id for backward compatibility)

Argon2id parameters are customizable via:

--tpm2-argon2id-memory=
--tpm2-argon2id-iterations=
--tpm2-argon2id-parallelism=
--tpm2-argon2id-iter-time=

These default to a function of available CPUs and physical memory, with
a benchmark that scales iterations to the target time (default: 2s) and
falls back to ARGON2ID_PARAMETERS_DEFAULT (64 MiB, 8 iter, 4 lanes) when
auto detection fails.
Also, if the runtime OpenSSL lacks Argon2id support (< 3.2), the feature
automatically falls back to direct PIN mode with a warning.

Added includes:
- src/cryptenroll/cryptenroll.c: cpu-set-util.h, limits-util.h,
  time-util.h for Argon2id benchmark auto-tuning (cpus_online,
  physical_memory_scale, now/usec_t)
- src/cryptenroll/cryptenroll-tpm2.c: crypto-util.h for
  Argon2IdParameters struct in load_volume_key_tpm2()
- src/shared/tpm2-util.h: crypto-util.h for Argon2IdParameters in
  tpm2_make_luks2_json() API
- src/cryptsetup/cryptsetup-tokens/luks2-tpm2.c: crypto-util.h for
  kdf_argon2id_derive()/kdf_hkdf_sha256() on the token unlock path

sysupdate: add support for --component-all + --feature-all + --feature-suggested to enable-feature verb

This adds similar concepts as we already have for enable-component:
let's add a way to enable all features or the suggested ones, and
possibly on all components.

Or in other words, with this:

systemd-sysupdate enable-component -S
systemd-sysupdate enable-feature -A -s

We'll automatically enable all suggested components, and all features of
them.

sysupdate: validate feature name syntax before using it

Let's tighten the rules on features, and enforce a sensible naming
policy.

sysupdate: add feature select

Preparation for the next commit: also allow selecting all or the
suggested features.

sysupdate: implement --component-all and --component-suggested for enable-component and disable-component

Let's now introduce "systemd-sysupdate enable-component
--component-suggested" and "systemd-sysupdate enable-component
--component-all" for enabling all suggested or all components at once.
Similarly, when used with disable-component, these options disable all
components or those not suggested.

sysupdate: introduce --component-suggested similar to --component-all

While --component-all really picks all components, --component-suggested
picks only the suggested ones.

Note that none of the verbs currently implement the concept, they will
all refuse the option. Hooking this up is going to be added next.

sysupdate: add a "suggests" concept to features and components

Let's make it possible to "suggest" that certain features or components
are enabled under some conditions.

For this, both features and components gain two things:

1. A Suggested= field which takes a boolean. If true the
   feature/component will be suggested for installation, if false it
   will not.

2. A set of SuggestedOnXYZ= settings, modelled after ConditionXYZ= in
   unit files (and implementing a subset of them), which suggest some
   component/feature under specific conditions.

The result of the condition is shown in the various output tools.

timer: avoid re-arming WakeSystem=yes timer after suspend

There's an edge case where a single-shot timer using WakeSystem=yes
could get re-armed after elapsing. For example, when a timer is created
using:

$ systemd-run --user --on-active="1m" --timer-property=WakeSystem=yes flatpak run io.bassi.Amberol

and the system is then suspended, the following sequence of events may
happen:

Jul 08 06:57:25 systemd[2640]: run-p192456-i199160.timer: Installed new job run-p192456-i199160.timer/start as 7675
Jul 08 06:57:25 systemd[2640]: run-p192456-i199160.timer: Enqueued job run-p192456-i199160.timer/start as 7675
Jul 08 06:57:25 systemd[2640]: run-p192456-i199160.timer: Monotonic timer elapses in 59.999999s.
Jul 08 06:57:25 systemd[2640]: run-p192456-i199160.timer: Changed dead -> waiting
Jul 08 06:57:25 systemd[2640]: run-p192456-i199160.timer: Job 7675 run-p192456-i199160.timer/start finished, result=done
Jul 08 06:57:25 systemd[2640]: Started [systemd-run] /usr/bin/flatpak run io.bassi.Amberol.
Jul 08 06:58:13 systemd[2640]: run-p192456-i199160.timer: Time change, recalculating next elapse.
Jul 08 06:58:13 systemd[2640]: run-p192456-i199160.timer: Monotonic timer elapses in 12.785674s.
Jul 08 06:58:26 systemd[2640]: run-p192456-i199160.timer: Timer elapsed.
Jul 08 06:58:26 systemd[2640]: run-p192456-i199160.timer: Changed waiting -> running
Jul 08 06:58:29 systemd[2640]: run-p192456-i199160.timer: Got notified about unit deactivation.
Jul 08 06:58:29 systemd[2640]: run-p192456-i199160.timer: Monotonic timer elapses in 33.544681s.
Jul 08 06:58:29 systemd[2640]: run-p192456-i199160.timer: Changed running -> waiting
Jul 08 06:59:48 systemd[2640]: run-p192456-i199160.timer: Timer elapsed.

1) The timer is armed at 06:57:25. timer_enter_waiting()
   calculates the following values:

   base = M0 (current CLOCK_MONOTONIC)
   usec_shift_clock(base, CLOCK_MONOTONIC, CLOCK_BOOTTIME_ALARM) = M0 (no delta yet)
   v->next_elapse = M0 + 60s

   The timer is on CLOCK_BOOTTIME_ALARM at M0 + 60s

2) System suspends. If the system ran for ~12s and was suspended for
   ~36s, at resume we'd get:

   CLOCK_MONOTONIC = M0 + 12s
   CLOCK_BOOTTIME = M0 + 12s + 36s = M0 + 48s

3) timer_time_change() fires and calls timer_enter_waiting(t, true)

   Because time_change=true, v->next_elapse is not recalculated and
   keeps the original M0 + 60s value. The timer then correctly computes
   the remaining time as ~12.8 seconds:

        Jul 08 06:58:13 ...: Monotonic timer elapses in 12.785674s.

4) Timer elapses at 06:58:26

        Jul 08 06:58:26 ...: Timer elapsed.

   timer_enter_running() sets last_trigger.monotonic to CLOCK_MONOTONIC
   (which equals M0 + 12s before suspend + 13s after suspend, thus
   M0 + 25s)

5) Unit deactivates at 06:58:29

   timer_trigger_notify() calls timer_enter_waiting(t, false) -
   time_change=false; that means that this time v->next_elapse gets
   recalculated:

   base = inactive_exit_timestamp.monotonic = M0 (i.e. when the timer was originally armed)
   usec_shift_clock(M0, CLOCK_MONOTONIC, CLOCK_BOOTTIME_ALARM)
        a = now(CLOCK_MONOTONIC) = M0 + 28s
        b = now(CLOCK_BOOTTIME_ALARM) = M0 + 64s
        result = b - (a - M0) = (M0 + 64s) - 28s = M0 + 36s => the time spent in suspend

   Hence v->next_elapse = (M0 + 36s) + 60s = M0 + 96s

   This then skips disabling of one-shot timers in the following check,
   because the expression

        v->next_elapse < triple_timestamp_by_clock(&ts, TIMER_MONOTONIC_CLOCK(t))

   is false, because the v->next_elapse (M0 + 96s) is not less than the
   current time (M0 + 64s), so the timer is re-armed again:

        Jul 08 06:58:29 ...: Monotonic timer elapses in 33.544681s

Let's mitigate this by skipping the next elapse timestamp recalculations
for one-shot timers for which we've already calculated the value in this
activation cycle.

Resolves: #42929
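
The arithmetic of step 5 can be replayed with a small Python model. `shift_clock()` is a simplified stand-in for usec_shift_clock() that takes the two clock readings as parameters; M0 is set to an arbitrary value, and all times are in whole seconds, so the result differs from the logged 33.544681s only by rounding.

```python
def shift_clock(x: int, now_from: int, now_to: int) -> int:
    """Simplified model of usec_shift_clock(): map timestamp x from one
    clock to another, given the current reading of both clocks."""
    if x > now_from:
        return now_to + (x - now_from)
    return now_to - (now_from - x)


M0 = 1000          # arbitrary value for the arming time, in seconds
ON_ACTIVE = 60     # --on-active=1m

# Clock readings at unit deactivation (06:58:29).
now_monotonic = M0 + 28
now_boottime = M0 + 64     # 28s running + 36s suspended

base = M0                  # inactive_exit_timestamp.monotonic
shifted = shift_clock(base, now_monotonic, now_boottime)
next_elapse = shifted + ON_ACTIVE

# The one-shot check: only a next_elapse in the past disables the timer.
already_elapsed = next_elapse < now_boottime
remaining = next_elapse - now_boottime

print(shifted - M0)        # 36
print(next_elapse - M0)    # 96
print(already_elapsed)     # False
print(remaining)           # 32
```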

resolved: publish browsed service after initialization

dns_add_new_service() links a _cleanup_ service into the browser
before copying its record and registering its maintenance timer have
fully succeeded. A timer setup failure then frees the service while
leaving the list head pointing at it.

Follow-up for 8458b7fb91ea5e5109b6f3c94f8a781a120c798b
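
The ordering named in the subject line can be sketched in Python. This is only an analogy: Python has no dangling pointers, so the sketch shows just the rule of finishing all fallible setup before the object becomes reachable from the shared list. All names are illustrative.

```python
class Service:
    def __init__(self, name: str, arm_timer) -> None:
        self.name = name
        # May raise; stands in for a failing timer registration.
        self.timer = arm_timer(name)


def add_service(browser: list, name: str, arm_timer) -> None:
    """Fully initialize the service, then publish it in the browser."""
    service = Service(name, arm_timer)   # all fallible steps happen here
    browser.append(service)              # publish only after success


def failing_timer(name):
    raise OSError("timer setup failed")


browser: list = []
try:
    add_service(browser, "_http._tcp.local", failing_timer)
except OSError:
    pass
print(len(browser))   # 0: the failed service was never published

add_service(browser, "_ipp._tcp.local", lambda name: object())
print(len(browser))   # 1
```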

tree-wide: port to prctl_safe()

nsresource: port from PR_GET_NAME to pid_get_comm()

Similar reasons as for the memfd change: let's normalize behaviour and
always use pid_get_comm().

This adds escaping here for the first time, which is a good thing.

memfd-util: port to pid_get_comm()

Let's use the common function for querying the local thread name.

This changes the escaping rules when the thread name is not quite
kosher: previously we'd just escape invalid UTF-8 characters, now we do
what we usually do: also escape control characters and such, and limit
ourselves to ASCII.

The description generated here is mostly for debug purposes, and process
names should normally not require this escaping anyway (it's mostly
paranoia), hence I think this change in behaviour should be fine, it's
not part of the API in any form.

process-util: introduce proc_get_timerslack()

process-util: introduce trivial proc_set_nnp() helper

process-util: introduce proc_set_comm() helper

This operation is done at a bunch of places, let's add a type-safe
helper for it.

process-util: introduce prctl_safe()

udev: fix several option parsing edge cases (#42997)

Boundary tests were conducted on the udev subsystem, and some issues
were identified and resolved.

repart: Some fixes for --copy-from= (#42976)

A bunch of things I noticed that aren't correct about `--copy-from` and
grain sizes + paddings.

sd-varlink: async upgrade support (#42974)

With 2c6f9af8e5425c2086fbc8ca496843f162e4af9b sd-varlink gained protocol
upgrade support, however only in a synchronous fashion. This adds
asynchronous protocol upgrade for the server side, thus enabling
multiplexing daemons (i.e. those that handle multiple connections from
the same event loop) to support protocol upgrades too.

I plan to use this in the upcoming "ptybroker" component that allows
acquiring a pty through varlink.

/cc @mvo5