rm-rf: make sure we can safely remove dirs we have no access to via rm_rf_at()
Previously, we'd first empty a dir, and then remove it. This works fine
as long as we have access to a dir. But in some cases (like for example
a foreign owned container tree) we might not have access to the dir, but
are still able to remove it (because it is empty, and in a dir we own).
Hence let's try that first. If it works, we do not need to enter the dir
(and thus fail).
Michal Sekletar [Fri, 24 Oct 2025 10:55:20 +0000 (12:55 +0200)]
coredump: handle ENOBUFS and EMSGSIZE the same way
Depending on the runtime configuration, e.g. sysctls
net.core.wmem_default= and net.core.rmem_default and on the actual
message size, sendmsg() can fail also with ENOBUFS. E.g. alloc_skb()
failure caused by net.core.[rw]mem_default=64MiB and huge fdinfo list
from process that has 90k opened FDs.
We should handle this case in the same way as EMSGSIZE and drop part of
the message.
Do not use "critical assert_return" in libsystemd or libudev
Previously, when compiled in developer mode, a call into libsystemd with
invalid parameters would result in an abort. This means that it's effectively
impossible to install such libsystemd in a normal system, since various
third-party programs may now abort. A shared library should generally never
abort or exit the calling program.
In python-systemd, the test suite calls into libsystemd, to check if the proper
return values are received and propagated through the Python wrappers.
Obviously with libsystemd compiled from git, the test suite now fails
in a nasty way.
So rework the code to set assert_return_is_critical similarly to how we handle
mempool enablement: the function that returns true is declared as a week
symbol, and we "opt in" by linking a file that provides the function in
libsystemd-shared. Effectively, libsystemd and libudev always have
assert_return_is_critical==false, and our binaries and modules enable it
conditionally.
The function internally does caching which means that the result must
always be the same, the definition of a pure function. The compiler might
be able to optimize some repeated calls to the function.
Peter Hutterer [Thu, 9 Oct 2025 00:56:54 +0000 (10:56 +1000)]
hwdb: don't tag a named Mouse device as pointingstick
The generic kernel hid drivers split up devices based on the application
collection, appending a suffix for each collection (e.g. Touchpad,
Mouse, ...). Many i2c touchpads get a "... Mouse" event node which is
mislabelled as pointingstick by the input_id builtin, see commit 3d7ac1c655ec40f3829543072494dcdfb92dbc6b.
Peter Hutterer [Thu, 9 Oct 2025 00:55:16 +0000 (10:55 +1000)]
rules: extend 60-input-id.rules to allow for bus/vid/pid/name matches
Same approach as used in 70-mouse.rules, allow for a name-based match
optionally combined with bus/vid/pid (which the existing modalias rule
would already allow us anyway). Note that ID_BUS isn't assigned until
after this rule has run so we need to use the id/bustype attribute
directly.
Related to https://github.com/systemd/systemd/issues/36677
Mike Yuan [Sun, 9 Feb 2025 22:12:15 +0000 (23:12 +0100)]
core/mount: properly handle REMOUNTING_* states in mount_stop()
Currently, mount_stop() simply turns REMOUNTING_* into corresponding
UNMOUNTING_* states. However the transition is bogus, because
the interruption of remount does not bring down the mount.
Let's instead follow the logic of service_stop(), i.e. terminate
the remount process and spawn umount.
Frantisek Sumsal [Mon, 13 Oct 2025 15:36:55 +0000 (17:36 +0200)]
timer: rebase the next elapse timestamp only if timer didn't already run
The test added in f4c3c107d9be4e922a080fc292ed3889c4e0f4a5 uncovered a
corner case while recalculating the next elapse timestamp of a timer unit
that uses RandomizedDelaySec= during deserialization.
If the scheduled time (without RandomizedDelaySec=) already elapsed,
systemd "rebases" the next elapse timestamp to the time when systemd
first started, to make the RandomizedDelaySec= feature work even at
boot. However, since it was done unconditionally, it always overrode the
next elapse timestamp, which could then cause the final next elapse
timestamp to fall out of the expected window.
With a couple of additional debug logs one of the test fail looks like
this:
[ 132.129815] TEST-53-TIMER.sh[384]: + : 'Next elapse timestamp after daemon-reload, try #328'
[ 132.129815] TEST-53-TIMER.sh[384]: + systemctl daemon-reload
[ 132.136352] systemd[1]: Reload requested from client PID 16399 ('systemctl') (unit TEST-53-TIMER.service)...
[ 132.136636] systemd[1]: Reloading...
[ 132.446160] systemd[1]: Rebasing next elapse timestamp
[ 132.446168] systemd[1]: v->next_elapse: Tue 2025-10-14 00:10:00 CEST
[ 132.446170] systemd[1]: rebased: Tue 2025-10-14 00:10:56 CEST
[ 132.446172] systemd[1]: v->next_elapse after rebase: Tue 2025-10-14 00:10:56 CEST
[ 132.447361] systemd[1]: Reloading finished in 310 ms.
[ 132.484041] TEST-53-TIMER.sh[384]: + check_elapse_timestamp
[ 132.484041] TEST-53-TIMER.sh[384]: + systemctl status timer-RandomizedDelaySec-16377.timer
[ 132.533657] TEST-53-TIMER.sh[16440]: ● timer-RandomizedDelaySec-16377.timer
[ 132.533657] TEST-53-TIMER.sh[16440]: Loaded: loaded (/run/systemd/system/timer-RandomizedDelaySec-16377.timer; static)
[ 132.533657] TEST-53-TIMER.sh[16440]: Active: active (waiting) since Mon 2025-10-13 23:00:00 CEST; 1h 13min ago
[ 132.533657] TEST-53-TIMER.sh[16440]: Invocation: 5555d4f060114a5493ff228013830d17
[ 132.533657] TEST-53-TIMER.sh[16440]: Trigger: Tue 2025-10-14 22:10:04 CEST; 21h left
[ 132.533657] TEST-53-TIMER.sh[16440]: Triggers: ● timer-RandomizedDelaySec-16377.service
[ 132.533657] TEST-53-TIMER.sh[16440]: Oct 14 00:13:07 H systemd[1]: timer-RandomizedDelaySec-16377.timer: Changed dead -> waiting
[ 132.533657] TEST-53-TIMER.sh[16440]: Oct 14 00:13:07 H systemd[1]: timer-RandomizedDelaySec-16377.timer: Adding 15h 35min 1.230173s random time.
[ 132.533657] TEST-53-TIMER.sh[16440]: Oct 14 00:13:07 H systemd[1]: timer-RandomizedDelaySec-16377.timer: Realtime timer elapses at Tue 2025-10-14 15:45:58 CEST.
[ 132.533657] TEST-53-TIMER.sh[16440]: Oct 14 00:13:07 H systemd[1]: timer-RandomizedDelaySec-16377.timer: Changed dead -> waiting
[ 132.533657] TEST-53-TIMER.sh[16440]: Oct 14 00:13:08 H systemd[1]: timer-RandomizedDelaySec-16377.timer: Adding 16h 29min 44.084409s random time.
[ 132.533657] TEST-53-TIMER.sh[16440]: Oct 14 00:13:08 H systemd[1]: timer-RandomizedDelaySec-16377.timer: Realtime timer elapses at Tue 2025-10-14 16:40:41 CEST.
[ 132.533657] TEST-53-TIMER.sh[16440]: Oct 14 00:13:08 H systemd[1]: timer-RandomizedDelaySec-16377.timer: Changed dead -> waiting
[ 132.533657] TEST-53-TIMER.sh[16440]: Oct 14 00:13:08 H systemd[1]: timer-RandomizedDelaySec-16377.timer: Adding 21h 59min 7.955828s random time.
[ 132.533657] TEST-53-TIMER.sh[16440]: Oct 14 00:13:08 H systemd[1]: timer-RandomizedDelaySec-16377.timer: Realtime timer elapses at Tue 2025-10-14 22:10:04 CEST.
[ 132.533657] TEST-53-TIMER.sh[16440]: Oct 14 00:13:08 H systemd[1]: timer-RandomizedDelaySec-16377.timer: Changed dead -> waiting
[ 132.535386] TEST-53-TIMER.sh[384]: + systemctl show -p InactiveExitTimestamp timer-RandomizedDelaySec-16377.timer
[ 132.537727] TEST-53-TIMER.sh[16442]: InactiveExitTimestamp=Mon 2025-10-13 23:00:00 CEST
[ 132.540317] TEST-53-TIMER.sh[16444]: ++ systemctl show -P NextElapseUSecRealtime timer-RandomizedDelaySec-16377.timer
[ 132.547745] TEST-53-TIMER.sh[384]: + NEXT_ELAPSE_REALTIME='Tue 2025-10-14 22:10:04 CEST'
[ 132.548020] TEST-53-TIMER.sh[16445]: ++ date '--date=Tue 2025-10-14 22:10:04 CEST' +%s
[ 132.550218] TEST-53-TIMER.sh[384]: + NEXT_ELAPSE_REALTIME_S=1760472604
[ 132.550218] TEST-53-TIMER.sh[384]: + : 'Next elapse timestamp should be Tue 2025-10-14 00:10:00 CEST <= Tue 2025-10-14 22:10:04 CEST <= Tue 2025-10-14 22:10:00 CEST'
[ 132.550218] TEST-53-TIMER.sh[384]: + assert_ge 17604726041760393400
[ 132.550555] TEST-53-TIMER.sh[16446]: + set +ex
[ 132.550702] TEST-53-TIMER.sh[384]: + assert_le 17604726041760472600
[ 132.550832] TEST-53-TIMER.sh[16447]: + set +ex
[ 132.551091] TEST-53-TIMER.sh[16447]: FAIL: '1760472604' > '1760472600'
Here the original next elapse timestamp was Tue 2025-10-14 00:10:00 CEST
as expected, but it was overridden by the rebased timestamp:
Tue 2025-10-14 00:10:56 CEST. And when a new randomized delay was added
to it (21h 59min 7.955828s) the final next elapse timestamp fell out of
the expected window, i.e. Tue 2025-10-14 00:10:00 (scheduled time) <
Tue 2025-10-14 22:10:04 CEST (rebased elapse timestamp + randomized
delay) < Tue 2025-10-14 22:10:00 CEST (scheduled time + maximum from
RandomizedDelaySec=, i.e. 22h).
By limiting the timestamp rebase only the case where the unit hasn't
already run should prevent this from happening during daemon-reload.
Frantisek Sumsal [Mon, 13 Oct 2025 15:35:02 +0000 (17:35 +0200)]
test: format the min/max timestamps in "systemd" style
Before:
Next elapse timestamp should be Sun Oct 12 00:10:00 UTC 2025 <= Sun 2025-10-12 05:43:04 UTC <= Sun Oct 12 22:10:00 UTC
After:
Next elapse timestamp should be Tue 2025-10-14 00:10:00 CEST <= Tue 2025-10-14 19:39:11 CEST <= Tue 2025-10-14 22:10:00 CEST
(cherry picked from commit 62ca845ac776d5877fe46dab52692053df6c8efa)
core: allow split /usr/local/s?sbin with merged /usr/s?bin
Previously, we used either the fully split path or the fully merged path,
treating "split sbin" as a boolean condition. The idea was that conversion to
to merged bin would be a single event, so we don't need to care about the
details of the transition. But it turns out that some systems may be converted
in disparate steps. In https://bugzilla.redhat.com/show_bug.cgi?id=2400220,
there was a lengthy discussion about a coreos system where
/usr/local/{bin,sbin} were created as separate directories. Since /usr/local is
not part of the packaged system, it might remain split for a longer time. So
check /usr/local/s?bin separately and stop adding /usr/sbin to $PATH if only
/usr/local/s?bin is split. (I don't think it makes sense to handle the reverse
case, i.e. only /usr/s?bin being split, since that should be much rarer.)
Inspired by https://bugzilla.redhat.com/show_bug.cgi?id=2400220.
* 8e2833a5b6 Automatically figure out the name of the top-level tar dir
* dffbf2beba Make sure fallback source is listed first
* 1d3b892105 Enable sysupdate and sysupdated
docs: add comment about requiring the mount hierarchy to be mounted MS_SHARED
This has been tripping up container manager people. let's document this
explicitly.
(Note that the container interface could really use some updates, i.e.
it was written before a time where cgroup namespacing was a thing. But I
am too lazy to fix that now, so let's just add this once facet.)
doc: indicate Type=oneshot also detects invocation failures
Type `simple` explicitly mentions that invocation failures like a missing binary
or `User=` name won’t get detected – whereas type `exec` mentions that it does.
Type `oneshot` refers to being similar to `simple`, which could lead one to
assume it doesn’t detect such invocation failures either – it seems however it
does.
Indicate this my changing its wording to be similar to `exec`.
Frantisek Sumsal [Thu, 23 Oct 2025 13:30:52 +0000 (15:30 +0200)]
man: handle leading/trailing/repeating whitespaces in anchor links
So even if a <term> section contains newlines, we get a reasonable
anchor link to it.
Before:
<dt id="
bind
UNIT
PATH
[PATH]
"><span class="term">
...
<a class="headerlink" title="Permalink to this term" href="#%0A%20%20%20%20%20%20%20%20%20%20%20%20bind%0A%20%20%20%20%20%20%20%20%20%20%20%20UNIT%0A%20%20%20%20%20%20%20%20%20%20%20%20PATH%0A%20%20%20%20%20%20%20%20%20%20%20%20[PATH]%0A%20%20%20%20%20%20%20%20%20%20">¶</a>
After:
<dt id="bind UNIT PATH [PATH]"><span class="term">
...
<a class="headerlink" title="Permalink to this term" href="#bind%20UNIT%20PATH%20[PATH]">¶</a>
Ronan Pigott [Sun, 26 Oct 2025 04:04:03 +0000 (21:04 -0700)]
zsh: add completion for dbus bus address
The DBUS_SESSION_BUS_ADDRESS and DBUS_SYSTEM_BUS_ADDRESS parameters have
an interesting syntax thats useful to complete. Let's include a
completion definition for these parameters.
Ryan Brue [Mon, 18 Aug 2025 17:12:26 +0000 (12:12 -0500)]
man: Clarify usage of /usr/share/factory/ in programs
As discussed in this thread:
https://github.com/redhat-performance/tuned/issues/798#issuecomment-3197697654
/usr/share/factory/ is not intended to be read from by programs,
but the wording in the FHS can be misread to think that programs
should be using /usr/share/factory/ as the vendor supplied configuration
directory rather than something like /usr/lib/foo/ or /usr/share/foo/.
This commit points developers to the UAPI configuration spec for how to
make their programs hermetic /usr/ compatible.
Marien Zwart [Sun, 19 Oct 2025 13:41:08 +0000 (00:41 +1100)]
docs: fix conversion / calculation errors
0x1770 is 6000, not 60000. It looks like 60000 is intended (the next
range starts at 60000 in both decimal and hex), so use that.
1000 to 60000 is 59001 users, as the range is inclusive on both sides.
Similar off-by-one for one of the "unused" ranges. After these changes,
the sizes of the ranges up to and including the "-1" ID sum up to 65536,
as expected.
I'm not sure where the size of the unused range after the container UID
range came from, but it is not correct (the "Container UID" and this
reserved range combined would be larger than the "HIC SVNT LEONES" 2^31
to 2^32-2 range...). Fix it.
It is unfortunate that the first half of this table makes more sense in
decimal while the second half makes more sense in hex (which would also
make the size in 65536 chunks easy to obtain): I'm tempted to add a
"sizes in hex" column...
man/systemd-systemd.conf: describe DefaultEnvironment= and ManagerEnvironment= better
The description of ME= said "see above", but it was actually above the other
one. So change the order. But while reading this, I found it very hard to
understand. So reword things, hopefully in a way that is easier to understand.
The current behaviour is rather complex and unintuitive, but this description
just tries to describe it truthfully.
Luca Boccassi [Fri, 17 Oct 2025 13:00:23 +0000 (14:00 +0100)]
ci: re-enable bpf-framework option for build and unit test jobs
Use the same trickery we do in the package build and search for
the actual bpftool binary. For the CI job any one we find is
good enough.
When we switch all jobs to 26.04 we can drop all of this.
core/unit: fail earlier before spawning executor when we failed to realize cgroup
Before 23ac08115af83e3a0a937fa207fc52511aba2ffa, even if we failed to
create the cgroup for a unit, a cgroup runtime object for the cgroup is
created with the cgroup path. Hence, the creation of cgroup is failed,
execution of the unit will fail in posix_spawn_wrapper() and logged
something like the following:
```
systemd[1]: testservice.service: Failed to create cgroup /testslice.slice/testservice.service: Cannot allocate memory
systemd[1]: testservice.service: Failed to spawn executor: No such file or directory
systemd[1]: testservice.service: Failed to spawn 'start' task: No such file or directory
systemd[1]: testservice.service: Failed with result 'resources'.
systemd[1]: Failed to start testservice.service.
```
However, after the commit, when we failed to create the cgroup, a cgroup
runtime object is not created, hence NULL will be assigned to
ExecParameters.cgroup_path in unit_set_exec_params().
Hence, the unit process will be invoked in the init.scope.
```
systemd[1]: testservice.service: Failed to create cgroup /testslice.slice/testservice.service: Cannot allocate memory
systemd[1]: Starting testservice.service...
cat[1094]: 0::/init.scope
systemd[1]: testservice.service: Deactivated successfully.
systemd[1]: Finished testservice.service.
```
where the test service calls 'cat /proc/self/cgroup'.
To fix the issue, let's fail earlier when we failed to create cgroup.
Daan De Meyer [Mon, 13 Oct 2025 08:43:16 +0000 (10:43 +0200)]
sd-id128: Drop _sd_const_ from sd_id128_in_setv()
Both the const and pure attributes disallow modifying input arguments
but sd_id128_in_setv() clearly modifies its ap input argument by iterating
over it with va_arg() so drop the _sd_const_ attribute from
sd_id128_in_setv().
test: check the next elapse timer timestamp after deserialization
When deserializing a serialized timer unit with RandomizedDelaySec= set,
systemd should use the last inactive exit timestamp instead of current
realtime to calculate the new next elapse, so the timer unit actually
runs in the given calendar window.
Luca Boccassi [Tue, 26 Aug 2025 18:12:53 +0000 (19:12 +0100)]
sysext: do not attempt to unlock images interactively
These images are not using a passphrase, they are using keys
or at most TPM-based sealing (not yet implemented, for contexts).
Do not use the interactive helper, as it will block and ask the
user for a password if it fails to find the signing cert, which
is not useful for this tool.
Since mountfsd was added in 702a52f4b5d49cce11e2adbc740deb3b644e2de0 the
caps bounding set line was commented. That's an accident. Fix that. (We
need to add a bunch of caps to the list).
This fixes two things: first of all it ensures we take the override
status output field properly into account, instead of going directly to
the regular one.
Moreover, it ensures that we bypass auto for both notice + emergency,
since both have the same "impact", and, don't limit this for notice
only.
UID entry in the machine state file is introduced in v258,
hence when a host is upgraded to v258, the field does not exist in the
file, thus the variable 'uid' is NULL.
core/bpf-firewall: replace unnecessary unit_setup_cgroup_runtime() with unit_get_cgroup_runtime()
Except for the test, bpf_firewall_compile() is only called by the following:
cgroup_context_apply() -> cgroup_apply_firewall() -> bpf_firewall_compile()
and in the early stage of cgroup_context_apply(), it checks if the cgroup
runtime exists. Hence, it is not necessary to try to allocate the
runtime in bpf_firewall_compile().
Mike Yuan [Thu, 25 Sep 2025 20:28:33 +0000 (22:28 +0200)]
core/cgroup: make sure deserialized accounting data is not voided
Currently, cgroup_path is (de-)serialized after all the cached
accounting data. This is bogus though, since unit_set_cgroup_path()
destroys the CGroupRuntime object and starts afresh, discarding
all deserialized values. This matters especially for IP accounting,
whose BPF maps get recreated on reload/reexec and the previous values
are exclusively retrievable from deserialization. Let's hence swap things
around and serialize cgroup_path first, accounting data only afterwards.
Felix Pehla [Sat, 27 Sep 2025 13:01:06 +0000 (15:01 +0200)]
shared/bootspec: parse 'uki' boot entry option
Commit e2a3d562189c413de3262ec47cdc1e1b0b13d78b (as part of #36314)
makes sd-boot recognize a 'uki' stanza in a boot loader entry and
uapi-group/specifications@3f2bd8236d7f9ce6dedf8bda9cadffd0d363cb08 adds
it to the BLS, but bootctl and other components parsing said config do
not know about it, leading to the error message
`Unknown line 'uki', ignoring.` when attempting to parse the same entry.
This commit makes it get parsed the same way that that 'efi' is.
In the commit c960ca2be1cfd183675df581f049a0c022c1c802, the logic of
updating ACL on device node was moved from logind to udevd, but at that
time, mistakenly removed the logic for static nodes.
man: fix advice regarding thread safety of libsystemd
The prohibition to move libsystemd objects between threads was added in 64a7ef8bc06b5dcfcd9f99ea10a43bde75c4370f ('man: be more explicit about thread
safety of sd_journal'). At the time, this was valid, because we were using the
mempool for allocation and it apparently didn't handle access from different
threads. Sadlly, the commit links to a bugzilla entry referenced in the commit
is not publicly visible anymore, so the details are murky. But we stopped using
the mempool in a5d8835c78112206bbf0812dd4cb471f803bfe88 ('mempool: only enable
mempool use when linked to libsystemd-shared.so'), with subsequent followup in b01f31954f1c7c4601925173ae2638b572224e9a ('Turn mempool_enabled() into a weak
symbol'). The restriction added in the man page is not necessary since then.
The text in the man page was arguably incorrect in calling the code
"thread-agnostic". If the code does not support being touched from threads at
all and has global state to tied to the main thread, it is not "agnostic", but
just doesn't support threads.
(I'm looking into https://github.com/systemd/python-systemd/issues/143, and
with the current scheme, the python-systemd module and all python code using
libsystemd would be very hard to use. With the change to free-threaded python
in python3.13, i.e. the replacement of single Global Interpreter Lock by
locking on individual objects, this limitation would become even more
constraining.)