Marien Zwart [Sun, 19 Oct 2025 13:41:08 +0000 (00:41 +1100)]
docs: fix conversion / calculation errors
0x1770 is 6000, not 60000. It looks like 60000 is intended (the next
range starts at 60000 in both decimal and hex), so use that.
1000 to 60000 is 59001 users, as the range is inclusive on both sides.
Similar off-by-one for one of the "unused" ranges. After these changes,
the sizes of the ranges up to and including the "-1" ID sum up to 65536,
as expected.
I'm not sure where the size of the unused range after the container UID
range came from, but it is not correct (the "Container UID" and this
reserved range combined would be larger than the "HIC SVNT LEONES" 2^31
to 2^32-2 range...). Fix it.
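The inclusive-range arithmetic above can be checked mechanically. A minimal Python sketch: only the numbers quoted in this message are checked (the full table is not reproduced), and the helper name `inclusive_size` is hypothetical:

```python
def inclusive_size(first: int, last: int) -> int:
    """Number of IDs in [first, last], counting both endpoints."""
    return last - first + 1


# The old hex value was 6000 in decimal; 60000 is 0xEA60.
assert 0x1770 == 6000
assert 0xEA60 == 60000

users = inclusive_size(1000, 60000)          # 1000..60000, both ends included
leones = inclusive_size(2**31, 2**32 - 2)    # "HIC SVNT LEONES" range

print(users, hex(users))    # 59001 0xe679
print(leones, hex(leones))  # 2147483647 0x7fffffff
```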
It is unfortunate that the first half of this table makes more sense in
decimal while the second half makes more sense in hex (which would also
make the size in 65536 chunks easy to obtain): I'm tempted to add a
"sizes in hex" column...
Luca Boccassi [Fri, 17 Oct 2025 10:27:55 +0000 (11:27 +0100)]
log: add underflow assert guard
We often use ssize_t in log_error macros, but typically return int,
which confuses Coverity, as technically there is no guarantee that
int and ssize_t have the same range. Add an assert to enforce it.
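The underlying concern can be illustrated outside of C. A minimal Python sketch, using the platform's `int` and `ssize_t` widths as reported by `ctypes`; `ssize_to_int_checked` is a hypothetical helper, not the assert that was actually added:

```python
import ctypes

# Width of C "int" on this platform. ssize_t is commonly 64-bit on LP64
# systems while int stays 32-bit, so not every ssize_t value fits in an int.
INT_BITS = ctypes.sizeof(ctypes.c_int) * 8
INT_MIN = -(1 << (INT_BITS - 1))
INT_MAX = (1 << (INT_BITS - 1)) - 1


def ssize_to_int_checked(value: int) -> int:
    """Hypothetical helper: narrow an ssize_t-style value to int, refusing
    values outside int's range instead of silently truncating them."""
    if not INT_MIN <= value <= INT_MAX:
        raise OverflowError(f"{value} does not fit in a {INT_BITS}-bit int")
    return value


print("int:", ctypes.sizeof(ctypes.c_int), "bytes;",
      "ssize_t:", ctypes.sizeof(ctypes.c_ssize_t), "bytes")
print(ssize_to_int_checked(-22))  # -22, e.g. a negative errno such as -EINVAL
```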
Luca Boccassi [Fri, 17 Oct 2025 13:00:23 +0000 (14:00 +0100)]
ci: re-enable bpf-framework option for build and unit test jobs
Use the same trickery we do in the package build and search for
the actual bpftool binary. For the CI job any one we find is
good enough.
When we switch all jobs to 26.04 we can drop all of this.
Frantisek Sumsal [Thu, 16 Oct 2025 11:06:51 +0000 (13:06 +0200)]
test: let kernel OOM-kill a child process instead of the main one
This test occasionally fails due to a race where systemd processes the
kernel's SIGKILL before the OOM notification, so the test service dies
with Result=signal instead of the expected Result=oom-kill.
To mitigate this, let's spawn a child process and move it to the
subcgroup to get killed instead of the main process, so systemd has more
time to react to the OOM notification and terminate the service with the
expected oom-kill result.
Daan De Meyer [Fri, 17 Oct 2025 08:49:53 +0000 (10:49 +0200)]
tree-wide: Various forward header cleanups
- Make sure forward headers have the iwyu pragma to always keep them
- Make sure we always include the daemon specific forward header
instead of shared-forward.h
- Remove shared-forward.h include where the daemon specific forward
header is already included
Luca Boccassi [Thu, 16 Oct 2025 18:43:45 +0000 (19:43 +0100)]
dissect: add support for verity-protected bare filesystems via mountfsd (#39325)
Needed to implement support for RootHashSignature=/RootVerity=/RootHash=
and friends when going through mountfsd, for example with user units,
so that system and user units provide the same features at the same
level.
I now get a warning like this with python3-pyparsing-3.1.2-8.fc42:
hwdb.d/parse_hwdb.py:208: UserWarning: warn_multiple_tokens_in_named_alternation:
setting results name 'VALUE' on Or expression will return a list of all parsed
tokens in an And alternative, in prior versions only the first token was returned;
enclose contained argument in Group
('!' ^ (Optional('!') - Word(alphanums + '_')))('VALUE')
kmod-setup: don't load unix.ko as a module anymore
Building unix.ko as a module always has been a really bad idea, from day
1. Debian used to do this, but has long been fixed. Kernel developers
saw the light too, and removed support for it in 6.5
(97154bcf4d1b7cabefec8a72cff5fbb91d5afb7b). Let's hence drop support for
this here too, and delete some old cruft. AF_UNIX is simply our most
basic IPC system and supporting systems without it being around is just
not realistic.
Luca Boccassi [Tue, 14 Oct 2025 22:32:54 +0000 (23:32 +0100)]
dissect: add support for verity-protected bare filesystems via mountfsd
Needed to implement support for RootHashSignature=/RootVerity=/RootHash=
and friends when going through mountfsd, for example with user units,
so that system and user units provide the same features at the same
level.
Govind Venugopal [Thu, 16 Oct 2025 15:06:17 +0000 (08:06 -0700)]
varlink: omit empty parameters field in JSON messages (#38922)
When varlink parameters are empty, omit the "parameters" field entirely
rather than sending "parameters":{}. This reduces message size and
follows the varlink specification, which allows parameters to be omitted.
The implementation supports three equivalent representations for empty
parameters: field omission, JSON null, and empty object {}. All three
are accepted on input for backward compatibility.
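The encode/decode rule can be modeled in a few lines. A minimal Python sketch, assuming a message is a JSON object with a "method" key and an optional "parameters" key; the function names are hypothetical and this is not systemd's C implementation:

```python
import json
from typing import Optional


def encode_call(method: str, parameters: Optional[dict] = None) -> str:
    """Serialize a call, leaving out "parameters" entirely when it is empty."""
    message = {"method": method}
    if parameters:  # None and {} both mean "no parameters"
        message["parameters"] = parameters
    return json.dumps(message)


def decode_parameters(raw: str) -> dict:
    """Accept all three empty forms on input: omitted, null, or {}."""
    parameters = json.loads(raw).get("parameters")
    return {} if parameters is None else parameters


print(encode_call("org.varlink.service.GetInfo"))
# {"method": "org.varlink.service.GetInfo"}
```

Emitting only the omitted form while accepting all three keeps the output minimal without breaking peers that still send null or {}.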
Daan De Meyer [Thu, 16 Oct 2025 13:20:36 +0000 (15:20 +0200)]
tree-wide: Introduce sd-forward.h and shared-forward.h headers
Let's not leak details from src/shared and src/libsystemd into
src/basic, even though you can't actually do anything useful with
just forward declarations from src/shared.
The sd-forward.h header is put in src/libsystemd/sd-common as we
don't have a directory for shared internal headers for libsystemd
yet.
Let's also rename forward.h to basic-forward.h to keep things
self-explanatory.
Luca Boccassi [Wed, 15 Oct 2025 19:05:03 +0000 (20:05 +0100)]
core: also enable PrivateUsers= for user services when using images via mountfsd
RootDirectory= and other options already implicitly enable PrivateUsers=
since 6ef721cbc7dadee4ae878ecf0076d87e57233908 if they are set in user
units, so that they can work out of the box.
Now with mountfsd support we can do the same for the images settings,
so enable them and document them.
Frantisek Sumsal [Wed, 15 Oct 2025 11:26:44 +0000 (13:26 +0200)]
test: wait for signed.test's zone DS records to get pushed to the parent zone
It looks like the 4 second sleep might not be enough on some slower
machines (like the ARM GH Actions nodes), which can cause the DS RR
propagation to clash with the manual test zone edit, and the
signed.test zone might then end up not properly signed:
TEST-75-RESOLVED.sh[749]: + : '--- ZONE: signed.test (static DNSSEC) ---'
TEST-75-RESOLVED.sh[749]: + run_delv @ns1.unsigned.test signed.test
TEST-75-RESOLVED.sh[749]: + run delv -a /etc/bind.keys @ns1.unsigned.test signed.test
TEST-75-RESOLVED.sh[778]: + delv -a /etc/bind.keys @ns1.unsigned.test signed.test
TEST-75-RESOLVED.sh[779]: + tee /tmp/tmp.2KOIiyrgth
TEST-75-RESOLVED.sh[779]: ;; /etc/bind.keys:1: option 'managed-keys' is deprecated
TEST-75-RESOLVED.sh[779]: ;; validating signed.test/DS: no valid signature found
TEST-75-RESOLVED.sh[779]: ;; validating signed.test/A: no valid signature found
TEST-75-RESOLVED.sh[779]: ; unsigned answer
TEST-75-RESOLVED.sh[779]: signed.test. 86400 IN A 10.0.0.10
TEST-75-RESOLVED.sh[779]: signed.test. 86400 IN RRSIG A 13 2 86400 20251028114356 20251014101356 39330 signed.test. oo3ca8WPusbBPRhzsEKw3bsBBqFtI8i4bckoMVNzt7lY+udGW6PlaSYj OjpQGgY9oglowVM9bteNtwJKHUbvtw==
TEST-75-RESOLVED.sh[749]: + grep -qF '; fully validated' /tmp/tmp.2KOIiyrgth
[FAILED] Failed to start TEST-75-RESOLVED.service - TEST-75-RESOLVED.
Let's explicitly wait for the DS records propagation to finish before we
start editing the test zone to avoid this.
I'm still not completely sure if this is the root cause, but it's the
best shot I currently have, so I'll let the CIs decide.
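The general pattern of replacing a fixed sleep with polling can be sketched in Python. This is only a sketch: the real test is a shell script, and the condition it checks (for example, whether the parent zone already serves signed.test's DS records) is represented here by an arbitrary callable; `wait_until` is hypothetical:

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 60.0,
               interval: float = 1.0) -> bool:
    """Poll condition() until it returns True or `timeout` seconds pass.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Unlike a fixed sleep, this returns as soon as the condition holds on fast machines and keeps waiting (up to the timeout) on slow ones.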
Daan De Meyer [Thu, 16 Oct 2025 06:45:46 +0000 (08:45 +0200)]
test: fixes for debian unstable and TEST-50-DISSECT (#39331)
The test failed in a weird way: it turns out we don't use pipefail, and an
intermediate command was moved to a different package, so it wasn't in
the minimal image anymore. Add it, and use pipefail so that in the future
it's easier to spot.
Fix the confusing behavior where when an incorrect configuration item such as
'ManagerEnvironment=SYSTEMD_LOG_LEVEL=' is set, the first daemon-reload uses
old environment variables while the second daemon-reload uses LogLevel=.
Co-authored-by: Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl>
The difference in behaviour is that the operations that were done between the
first log_parse_environment() and the second one might not be logged now, e.g.
if the environment enabled debug logging. That is unfortunate, but parsing the
environment twice and not having the explicit configuration take effect until a
second daemon-reload is confusing. We will always have some window where the
configuration for logging does not apply, in particular this must be true when
parsing the logging configuration. To make that window smaller, move operations
that could log after the call to log_parse_environment() as far as possible.
man/systemd-system.conf: describe DefaultEnvironment= and ManagerEnvironment= better
The description of ManagerEnvironment= said "see above", but it was actually above the other
one. So change the order. But while reading this, I found it very hard to
understand. So reword things, hopefully in a way that is easier to understand.
The current behaviour is rather complex and unintuitive, but this description
just tries to describe it truthfully.
Luca Boccassi [Wed, 15 Oct 2025 10:01:57 +0000 (11:01 +0100)]
Use verity sharing for user services and nspawn too (#39313)
https://github.com/systemd/systemd/pull/39168 made verity sharing
opt-in, and enabled it for system services.
Also enable it for user services for RootImage/etc, and for nspawn, for
the same reasons.
Govind Venugopal [Wed, 15 Oct 2025 09:20:41 +0000 (02:20 -0700)]
network: add DHCP server domain name option support (#39260)
Implements DHCP option 15 (Domain Name) for systemd-networkd's DHCP
server, allowing administrators to configure the DNS default domain that
clients should use.
This addresses the feature request in issue #37077, where users needed
to manually configure domain names using
SendOption=15:string:example.com as a workaround.
This adds two new configuration options to the [DHCPServer] section:
- EmitDomain= (boolean): whether to send domain name to clients
- Domain= (string): the domain name to send (e.g., "example.com")
Example configuration:
[DHCPServer]
EmitDomain=yes
Domain=example.com
This eliminates the need for manual workarounds using
SendOption=15:string:...
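On the wire, option 15 is a plain code/length/value triple. A minimal Python sketch of that encoding per RFC 2132, for illustration only; it is not networkd's code, and the function name is hypothetical. With Domain=example.com, the option would presumably carry the same bytes the SendOption=15:string:example.com workaround produced:

```python
def encode_domain_name_option(domain: str) -> bytes:
    """Encode DHCP option 15 (Domain Name) as code, length, value (RFC 2132).

    The value is the plain ASCII name without a trailing NUL byte.
    """
    value = domain.encode("ascii")
    if not 1 <= len(value) <= 255:
        raise ValueError("option 15 payload must be 1..255 bytes long")
    return bytes([15, len(value)]) + value


print(encode_domain_name_option("example.com").hex(" "))
# 0f 0b 65 78 61 6d 70 6c 65 2e 63 6f 6d
```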
dissect-image: when autoprobing insist on vfat for XBOOTLDR
Let's reduce our attack surface by insisting that XBOOTLDR is vfat when
auto-probing, just like we do for the ESP. Given neither can
realistically be integrity protected (because firmware needs to access
them) let's insist on a vfat which has a much smaller attack surface,
and one we have to accept (for now) anyway, given that the ESP must be
VFAT.
This only applies to auto-probing of course. If people mount things
explicitly via fstab none of this matters. But we really shouldn't
automount a btrfs/xfs/ext4 partition as XBOOTLDR just because it looks
like one, as that would really defeat our otherwise possibly very strict
image policies.
This also introduces a new environment variable $SYSTEMD_DISSECT_FSTYPE_<DESIGNATOR>
that may override this hardcoding. This is particularly useful in our
test cases, since various tests actually do use ext4 for XBOOTLDR. The
tests are updated to make use of the new env var, both as a mechanism
to test this and to keep the tests working.
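The resulting auto-probing policy can be modeled in a few lines. A minimal Python sketch, assuming the variable name is formed from the upper-cased designator (e.g. $SYSTEMD_DISSECT_FSTYPE_XBOOTLDR) and that an empty value means no override; `required_fstype` is hypothetical and not the dissect-image code:

```python
import os
from typing import Mapping, Optional

# Designators for which auto-probing insists on vfat, per the message above.
VFAT_ONLY = {"esp", "xbootldr"}


def required_fstype(designator: str,
                    environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the file system type auto-probing should insist on, or None
    if this sketch places no restriction on the designator."""
    override = environ.get(f"SYSTEMD_DISSECT_FSTYPE_{designator.upper()}")
    if override:
        return override
    return "vfat" if designator in VFAT_ONLY else None
```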
Luca Boccassi [Tue, 14 Oct 2025 17:46:08 +0000 (18:46 +0100)]
nspawn: enable verity sharing
Just like RootImage=, ExtensionImages= etc., nspawn can make use of
this to save a lot of time when starting containers that use an already
open image. Enable it explicitly, since the default was changed to disabled.
Miroslav Lichvar [Tue, 14 Oct 2025 09:03:01 +0000 (11:03 +0200)]
udev: create symlinks for s390 PTP devices
Similarly to the udev rules handling KVM and Hyper-V PTP devices, create
symlinks for the s390-specific STCKE and Physical clocks (supported
since Linux 6.13) to have some stable names that can be specified in
default configurations of PTP/NTP applications.
core: allow split /usr/local/s?sbin with merged /usr/s?bin
Previously, we used either the fully split path or the fully merged path,
treating "split sbin" as a boolean condition. The idea was that conversion
to merged bin would be a single event, so we don't need to care about the
details of the transition. But it turns out that some systems may be converted
in disparate steps. In https://bugzilla.redhat.com/show_bug.cgi?id=2400220,
there was a lengthy discussion about a coreos system where
/usr/local/{bin,sbin} were created as separate directories. Since /usr/local is
not part of the packaged system, it might remain split for a longer time. So
check /usr/local/s?bin separately and stop adding /usr/sbin to $PATH if only
/usr/local/s?bin is split. (I don't think it makes sense to handle the reverse
case, i.e. only /usr/s?bin being split, since that should be much rarer.)
Inspired by https://bugzilla.redhat.com/show_bug.cgi?id=2400220.
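The three resulting cases can be sketched as follows. A minimal Python model: the directory lists are assumptions based on the conventional split and merged layouts, how each split check is performed is left out, and `default_path` is hypothetical:

```python
SPLIT = ["/usr/local/sbin", "/usr/local/bin", "/usr/sbin", "/usr/bin"]
LOCAL_SPLIT_ONLY = ["/usr/local/sbin", "/usr/local/bin", "/usr/bin"]
MERGED = ["/usr/local/bin", "/usr/bin"]


def default_path(usr_split: bool, usr_local_split: bool) -> str:
    """Hypothetical model of picking $PATH from two independent checks."""
    if usr_split:
        # Assumption: the reverse case (only /usr split) is not special-cased
        # and simply gets the fully split path.
        dirs = SPLIT
    elif usr_local_split:
        # /usr is merged but /usr/local is not: keep /usr/local/sbin,
        # but stop adding /usr/sbin.
        dirs = LOCAL_SPLIT_ONLY
    else:
        dirs = MERGED
    return ":".join(dirs)


print(default_path(usr_split=False, usr_local_split=True))
# /usr/local/sbin:/usr/local/bin:/usr/bin
```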
Frantisek Sumsal [Tue, 14 Oct 2025 12:23:55 +0000 (14:23 +0200)]
mkosi: explicitly pull in libz1 on OpenSUSE
Otherwise it pulls in libz-ng-compat1 which isn't 100% compatible with
libz1, and more importantly it requires an ldconfig drop-in in /etc/
(/etc/ld.so.conf.d/zlib-ng-compat-x86_64.conf) which breaks hermetic-usr
and TEST-07-PID1:
systemd[5582]: /usr/lib/systemd/systemd: error while loading shared libraries: libz.so.1: cannot open shared object file: No such file or directory
Frantisek Sumsal [Mon, 13 Oct 2025 15:36:55 +0000 (17:36 +0200)]
timer: rebase the next elapse timestamp only if timer didn't already run
The test added in f4c3c107d9be4e922a080fc292ed3889c4e0f4a5 uncovered a
corner case while recalculating the next elapse timestamp of a timer unit
that uses RandomizedDelaySec= during deserialization.
If the scheduled time (without RandomizedDelaySec=) already elapsed,
systemd "rebases" the next elapse timestamp to the time when systemd
first started, to make the RandomizedDelaySec= feature work even at
boot. However, since it was done unconditionally, it always overrode the
next elapse timestamp, which could then cause the final next elapse
timestamp to fall out of the expected window.
With a couple of additional debug logs, one of the test failures looks like
this:
[ 132.129815] TEST-53-TIMER.sh[384]: + : 'Next elapse timestamp after daemon-reload, try #328'
[ 132.129815] TEST-53-TIMER.sh[384]: + systemctl daemon-reload
[ 132.136352] systemd[1]: Reload requested from client PID 16399 ('systemctl') (unit TEST-53-TIMER.service)...
[ 132.136636] systemd[1]: Reloading...
[ 132.446160] systemd[1]: Rebasing next elapse timestamp
[ 132.446168] systemd[1]: v->next_elapse: Tue 2025-10-14 00:10:00 CEST
[ 132.446170] systemd[1]: rebased: Tue 2025-10-14 00:10:56 CEST
[ 132.446172] systemd[1]: v->next_elapse after rebase: Tue 2025-10-14 00:10:56 CEST
[ 132.447361] systemd[1]: Reloading finished in 310 ms.
[ 132.484041] TEST-53-TIMER.sh[384]: + check_elapse_timestamp
[ 132.484041] TEST-53-TIMER.sh[384]: + systemctl status timer-RandomizedDelaySec-16377.timer
[ 132.533657] TEST-53-TIMER.sh[16440]: ● timer-RandomizedDelaySec-16377.timer
[ 132.533657] TEST-53-TIMER.sh[16440]: Loaded: loaded (/run/systemd/system/timer-RandomizedDelaySec-16377.timer; static)
[ 132.533657] TEST-53-TIMER.sh[16440]: Active: active (waiting) since Mon 2025-10-13 23:00:00 CEST; 1h 13min ago
[ 132.533657] TEST-53-TIMER.sh[16440]: Invocation: 5555d4f060114a5493ff228013830d17
[ 132.533657] TEST-53-TIMER.sh[16440]: Trigger: Tue 2025-10-14 22:10:04 CEST; 21h left
[ 132.533657] TEST-53-TIMER.sh[16440]: Triggers: ● timer-RandomizedDelaySec-16377.service
[ 132.533657] TEST-53-TIMER.sh[16440]: Oct 14 00:13:07 H systemd[1]: timer-RandomizedDelaySec-16377.timer: Changed dead -> waiting
[ 132.533657] TEST-53-TIMER.sh[16440]: Oct 14 00:13:07 H systemd[1]: timer-RandomizedDelaySec-16377.timer: Adding 15h 35min 1.230173s random time.
[ 132.533657] TEST-53-TIMER.sh[16440]: Oct 14 00:13:07 H systemd[1]: timer-RandomizedDelaySec-16377.timer: Realtime timer elapses at Tue 2025-10-14 15:45:58 CEST.
[ 132.533657] TEST-53-TIMER.sh[16440]: Oct 14 00:13:07 H systemd[1]: timer-RandomizedDelaySec-16377.timer: Changed dead -> waiting
[ 132.533657] TEST-53-TIMER.sh[16440]: Oct 14 00:13:08 H systemd[1]: timer-RandomizedDelaySec-16377.timer: Adding 16h 29min 44.084409s random time.
[ 132.533657] TEST-53-TIMER.sh[16440]: Oct 14 00:13:08 H systemd[1]: timer-RandomizedDelaySec-16377.timer: Realtime timer elapses at Tue 2025-10-14 16:40:41 CEST.
[ 132.533657] TEST-53-TIMER.sh[16440]: Oct 14 00:13:08 H systemd[1]: timer-RandomizedDelaySec-16377.timer: Changed dead -> waiting
[ 132.533657] TEST-53-TIMER.sh[16440]: Oct 14 00:13:08 H systemd[1]: timer-RandomizedDelaySec-16377.timer: Adding 21h 59min 7.955828s random time.
[ 132.533657] TEST-53-TIMER.sh[16440]: Oct 14 00:13:08 H systemd[1]: timer-RandomizedDelaySec-16377.timer: Realtime timer elapses at Tue 2025-10-14 22:10:04 CEST.
[ 132.533657] TEST-53-TIMER.sh[16440]: Oct 14 00:13:08 H systemd[1]: timer-RandomizedDelaySec-16377.timer: Changed dead -> waiting
[ 132.535386] TEST-53-TIMER.sh[384]: + systemctl show -p InactiveExitTimestamp timer-RandomizedDelaySec-16377.timer
[ 132.537727] TEST-53-TIMER.sh[16442]: InactiveExitTimestamp=Mon 2025-10-13 23:00:00 CEST
[ 132.540317] TEST-53-TIMER.sh[16444]: ++ systemctl show -P NextElapseUSecRealtime timer-RandomizedDelaySec-16377.timer
[ 132.547745] TEST-53-TIMER.sh[384]: + NEXT_ELAPSE_REALTIME='Tue 2025-10-14 22:10:04 CEST'
[ 132.548020] TEST-53-TIMER.sh[16445]: ++ date '--date=Tue 2025-10-14 22:10:04 CEST' +%s
[ 132.550218] TEST-53-TIMER.sh[384]: + NEXT_ELAPSE_REALTIME_S=1760472604
[ 132.550218] TEST-53-TIMER.sh[384]: + : 'Next elapse timestamp should be Tue 2025-10-14 00:10:00 CEST <= Tue 2025-10-14 22:10:04 CEST <= Tue 2025-10-14 22:10:00 CEST'
[ 132.550218] TEST-53-TIMER.sh[384]: + assert_ge 1760472604 1760393400
[ 132.550555] TEST-53-TIMER.sh[16446]: + set +ex
[ 132.550702] TEST-53-TIMER.sh[384]: + assert_le 1760472604 1760472600
[ 132.550832] TEST-53-TIMER.sh[16447]: + set +ex
[ 132.551091] TEST-53-TIMER.sh[16447]: FAIL: '1760472604' > '1760472600'
Here the original next elapse timestamp was Tue 2025-10-14 00:10:00 CEST
as expected, but it was overridden by the rebased timestamp:
Tue 2025-10-14 00:10:56 CEST. When a new randomized delay was added
to it (21h 59min 7.955828s), the final next elapse timestamp fell out of
the expected window: Tue 2025-10-14 22:10:04 CEST (rebased elapse
timestamp + randomized delay) is later than Tue 2025-10-14 22:10:00 CEST
(scheduled time + maximum from RandomizedDelaySec=, i.e. 22h), so the
check Tue 2025-10-14 00:10:00 CEST (scheduled time) <= next elapse <=
Tue 2025-10-14 22:10:00 CEST failed.
Limiting the timestamp rebase to the case where the unit hasn't
already run should prevent this from happening during daemon-reload.
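The decision and the window check from the failing run can be reproduced numerically. A minimal Python sketch using the epoch values from the log above (the rebase point is the logged 00:10:56, truncated to whole seconds); `next_elapse` and `in_window` are hypothetical helpers, and modeling the rebase as `max()` is an assumption based on the logged before/after values, not systemd's actual code:

```python
def next_elapse(scheduled: float, rebase_point: float, already_ran: bool,
                random_delay: float) -> float:
    """Hypothetical model of the fixed logic: rebase only if not yet run."""
    base = scheduled
    if not already_ran:
        # Rebasing moves an already-elapsed schedule forward to the rebase
        # point (00:10:00 -> 00:10:56 in the log above).
        base = max(scheduled, rebase_point)
    return base + random_delay


def in_window(elapse: float, scheduled: float, max_delay: float) -> bool:
    return scheduled <= elapse <= scheduled + max_delay


SCHEDULED = 1760393400                  # Tue 2025-10-14 00:10:00 CEST
REBASE_POINT = 1760393456               # Tue 2025-10-14 00:10:56 CEST
DELAY = 21 * 3600 + 59 * 60 + 7.955828  # random delay from the log
MAX_DELAY = 22 * 3600                   # 22h maximum, per the message

# already_ran=False reproduces the old, unconditional rebase.
print(in_window(next_elapse(SCHEDULED, REBASE_POINT, False, DELAY),
                SCHEDULED, MAX_DELAY))  # False
print(in_window(next_elapse(SCHEDULED, REBASE_POINT, True, DELAY),
                SCHEDULED, MAX_DELAY))  # True
```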
This change was misguided. The warning is enough during development and will
get fixed, but turning this into a hard failure just makes WIP harder. Also, a
hard error increases the likelihood of a build failure in scenarios where
somebody is disabling components (as seen e.g. in ba8801a07640205778c5a62539597c68d7bdb211). We already are not very good at
keeping our codebase compiling correctly as it ages, because of changes in
compilers and dependencies, and we should not go out of our way to increase the
probability of failure. Such scenarios are painful for downstream builds.
meson: stop probing for paths of programs in /usr/sbin
We dropped support for split-usr a while ago, which means that the programs
will be in /usr/sbin, which actually may be the same as /usr/bin on merged-bin
systems. So the whole checking is mostly pointless in the usual case. OTOH, on
Nix the paths will be totally different and need to be set through the option
anyway. So save time during builds by using the "fallback" path unless the
option is specified.
This avoids some busywork during the slow serial build phase.
libsystemd: drop "const" decorators on public inline functions
The point of the "const" attribute is to give the compiler hints about
behaviour of functions if it only has the function prototype but no body
around. But inline functions are the ones where the compiler *always*
has the body around, hence the "const" decorator is really just noise:
the compiler can determine the constness on its own, just by looking at
the code.
This way we can expose identical behaviour everywhere, can make use
of our atomic replacement calls, and openat() logic, and later apply
additional tricks while unpacking, such as putting limits on UID ranges
and similar.
dissect-image: take policy into consideration when unlocking verity, too
Previously, we'd take the image policy only into consideration when
dissecting the image, but for the unlock/verity step we'd go via best
effort. Change that. This means we can now enforce policies such as
activating by root hash only even if a signature exists and similar.
Also, introduce a separate error code if we try to unlock a Verity
volume but have no root hash. Previously we'd return ENOKEY for that,
exactly like we do for encrypted volumes where we have no passphrase. The
interactive unlock loop dissected_image_decrypt_interactively() is
otherwise very confused and will ask for a root hash, which makes no
sense. Hence use two distinct errors for this.
dissect-image: turn verity device sharing into opt-in
Sharing verity volumes is problematic for a variety of reasons, for
example because it might pin the wrong backing device at the wrong time.
Let's hence turn this around: unless verity sharing is enabled, leave it
off, and turn $SYSTEMD_VERITY_SHARING into a true boolean that can be
set both ways.
The primary usecase for verity sharing is RootImage=, where it probably
makes sense to leave on, hence set the flag there.
This is crucial when putting together installers which install an OS on
a second disk: if verity sharing is always on we might mount the wrong
of the two disks at the wrong time.
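The opt-in logic can be sketched as follows. A minimal Python model, assuming the environment variable, when set, overrides the caller's default in both directions, and using systemd's usual boolean spellings; `verity_sharing_enabled` and `parse_boolean` here are hypothetical, and this sketch simply raises on unparsable values:

```python
import os
from typing import Mapping

TRUE_VALUES = {"1", "yes", "y", "true", "t", "on"}
FALSE_VALUES = {"0", "no", "n", "false", "f", "off"}


def parse_boolean(value: str) -> bool:
    """Case-insensitive boolean parsing in systemd's usual spellings."""
    v = value.lower()
    if v in TRUE_VALUES:
        return True
    if v in FALSE_VALUES:
        return False
    raise ValueError(f"not a boolean: {value!r}")


def verity_sharing_enabled(caller_opts_in: bool,
                           environ: Mapping[str, str] = os.environ) -> bool:
    """Off by default; callers such as RootImage= opt in, and
    $SYSTEMD_VERITY_SHARING (if set) wins either way."""
    raw = environ.get("SYSTEMD_VERITY_SHARING")
    if raw is None:
        return caller_opts_in
    return parse_boolean(raw)
```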
Daan De Meyer [Mon, 13 Oct 2025 08:43:16 +0000 (10:43 +0200)]
sd-id128: Drop _sd_const_ from sd_id128_in_setv()
Both the const and pure attributes disallow modifying input arguments
but sd_id128_in_setv() clearly modifies its ap input argument by iterating
over it with va_arg() so drop the _sd_const_ attribute from
sd_id128_in_setv().
If systemd-pcrphase-initrd.service and friends fail for some reason,
the test VM will reboot infinitely and the test will time out. Let's
propagate the failure to the host and fail the test earlier in that case.