Matteo Croce [Mon, 17 Nov 2025 16:30:34 +0000 (17:30 +0100)]
oomd: check if a cgroup can be killed before attempting to kill it
On an OOM event, oomd tries to kill a cgroup until it succeeds.
The kill can fail with EPERM if a PID cannot be killed, which leaves
the cgroup with only some of its processes killed.
This is unlikely but theoretically possible in a user namespace,
where systemd runs as root inside the container and tries to kill a
cgroup containing PIDs from the host namespace.
To address this, send signal 0 to all the processes first to check
that we have the privileges to kill them.
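A rough sketch of the idea in C (hypothetical helper, not the actual oomd
code): kill(pid, 0) delivers no signal, it only performs the existence and
permission checks, so an EPERM here tells us that killing the whole cgroup
would only partially succeed.
    #include <errno.h>
    #include <signal.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <sys/types.h>
    /* Hypothetical helper: returns true only if every PID in the array can
     * be signalled by the calling process. */
    static bool pids_are_killable(const pid_t *pids, size_t n) {
        for (size_t i = 0; i < n; i++)
            if (kill(pids[i], 0) < 0 && errno == EPERM)
                return false;
        return true;
    }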
val4oss [Wed, 19 Nov 2025 09:18:30 +0000 (10:18 +0100)]
pam_systemd: fix OSC write failure message appearing in error logs
Create and use a new function pam_debug_syslog_errno() instead, to ensure the
message only appears when debug mode is enabled. Pass the debug flag to
open_osc_context() and close_osc_context() to support this change.
val4oss [Wed, 19 Nov 2025 09:18:41 +0000 (10:18 +0100)]
pam-util: fix pam_syslog_errno() ignoring the level parameter
The function accepts a level parameter but was always logging at
LOG_ERR. Fix by passing the level parameter to sym_pam_vsyslog()
instead of hardcoding LOG_ERR.
This caused debug and warning messages to incorrectly appear in error
logs.
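The shape of the fix, as a minimal sketch (the signature is an assumption,
not the exact systemd one): the caller's level has to reach pam_vsyslog()
instead of a hardcoded LOG_ERR.
    #include <errno.h>
    #include <security/pam_ext.h>
    #include <stdarg.h>
    #include <syslog.h>
    /* Sketch of a pam_syslog_errno()-style helper: forward 'level' to
     * pam_vsyslog(); previously LOG_ERR was passed here, so even LOG_DEBUG
     * messages ended up in the error logs. */
    static int pam_syslog_errno_sketch(pam_handle_t *handle, int level, int error,
                                       const char *format, ...) {
        va_list ap;
        va_start(ap, format);
        errno = error;                          /* let %m expand to the right text */
        pam_vsyslog(handle, level, format, ap); /* pass 'level', not LOG_ERR */
        va_end(ap);
        return -error;
    }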
Make “effect” plural to indicate that BindsTo= also includes the other effects
of Requires= (like starting the listed units).
The documentation of Requires= already describes that the configuring unit is
stopped/restarted if any of the listed units is explicitly stopped/restarted.
This made the previous wording “in addition to the effect of Requires, it
declares that if the unit bound to is stopped, this unit will be stopped too.”
ambiguous – this is not really in addition, since Requires= already does that,
at least for some (namely the explicit) cases.
Resolve this by making it clear what the actual difference to Requires= is and
further mention that this also includes failed units.
Signed-off-by: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name>
Frantisek Sumsal [Wed, 19 Nov 2025 13:44:13 +0000 (14:44 +0100)]
timer: rebase last_trigger timestamp if needed
After bdb8e584f4509de0daebbe2357d23156160c3a90 we stopped rebasing the
next elapse timestamp unconditionally and the only case where we'd do
that was when both last trigger and last inactive timestamps were empty.
This covered timer units during boot just fine, since they would have
neither of those timestamps set. However, persistent timers
(Persistent=yes) store their last trigger timestamp on persistent
storage and load it back after reboot, so the rebasing was skipped in
this case.
To mitigate this, check whether the last_trigger timestamp is older than
the current machine boot - if so, it came from the stamp file of a
persistent timer unit and we need to rebase it to make
RandomizedDelaySec= work properly.
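A hypothetical standalone helper illustrating the check (not the actual timer
code): the wall-clock time of boot is roughly "now" minus the monotonic
uptime, so a last-trigger timestamp older than that must come from a stamp
file written before this boot and needs rebasing.
    #include <stdbool.h>
    #include <stdint.h>
    #include <time.h>
    /* Hypothetical helper: does a CLOCK_REALTIME timestamp (in µs) predate
     * the current boot? */
    static bool timestamp_predates_boot(uint64_t realtime_usec) {
        struct timespec rt, mono;
        clock_gettime(CLOCK_REALTIME, &rt);
        clock_gettime(CLOCK_MONOTONIC, &mono);
        uint64_t now_usec  = (uint64_t) rt.tv_sec * 1000000 + rt.tv_nsec / 1000;
        uint64_t up_usec   = (uint64_t) mono.tv_sec * 1000000 + mono.tv_nsec / 1000;
        uint64_t boot_usec = now_usec - up_usec;   /* approximate boot time */
        return realtime_usec > 0 && realtime_usec < boot_usec;
    }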
Yu Watanabe [Thu, 20 Nov 2025 04:23:51 +0000 (13:23 +0900)]
core: SMACK label to Unix socket path and FD (#39772)
Currently, when a socket unit specifies SmackLabel=,
the label is not applied to the underlying Unix socket file or its file
descriptor.
This change ensures that the SMACK label is applied both to the
Unix socket path on the filesystem and to all associated socket FDs
when the socket is created.
Testing:
- Tested on Fedora 43 with kernel 6.17.7 and SMACK enabled.
- Created a systemd socket unit:
In all cases, everything that we list in 'extract', we also list in
'sources'. We can simplify things by automatically appending the first
list to the second.
In the listings, move the 'extract' key right below 'sources', since now
they are both "sources", just with slightly different meanings.
socket-label: apply SMACK label to socket and its file descriptor
When a socket unit specifies SmackLabel=, the label was previously
not applied to the underlying Unix socket file or its file descriptor.
This change ensures that the SMACK label is applied both to the
socket path on the filesystem and to the opened socket FD.
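An illustration of the mechanism, not the systemd mac_smack_* helpers: SMACK
labels live in extended attributes. The socket inode on disk carries
"security.SMACK64"; the fd side is shown here with the IPIN/IPOUT attributes
documented for the kernel's SMACK LSM, though the exact attribute systemd
applies to the fd may differ.
    #include <errno.h>
    #include <string.h>
    #include <sys/xattr.h>
    /* Illustrative helper: label both the on-disk socket node and the
     * socket file descriptor with the given SMACK label. */
    static int smack_label_socket(const char *path, int fd, const char *label) {
        if (setxattr(path, "security.SMACK64", label, strlen(label), 0) < 0)
            return -errno;
        if (fsetxattr(fd, "security.SMACK64IPIN", label, strlen(label), 0) < 0 ||
            fsetxattr(fd, "security.SMACK64IPOUT", label, strlen(label), 0) < 0)
            return -errno;
        return 0;
    }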
Yu Watanabe [Thu, 20 Nov 2025 00:39:32 +0000 (09:39 +0900)]
socket-label: move prototype of socket_address_listen() and string table for SocketAddressBindIPv6Only
The function socket_address_listen() is defined in shared/socket-label.c,
but its prototype was in basic/socket-util.h. This moves the
prototype to shared/socket-label.h.
Also, enum SocketAddressBindIPv6Only is not used anymore in basic/*.[ch].
Let's move the definition and its string table to shared/socket-label.[ch].
Yu Watanabe [Wed, 19 Nov 2025 23:19:46 +0000 (08:19 +0900)]
core: Verify inherited FDs are writable for stdout/stderr (#39674)
When inheriting file descriptors for stdout/stderr (either from stdin or
when making stderr inherit from stdout), we previously just assumed they
would be writable and dup'd them. This could lead to broken setups if
the inherited FD was actually opened read-only.
Before dup'ing any inherited FDs to stdout/stderr, verify they are
actually writable using the new fd_is_writable() helper. If not, fall
back to /dev/null (or reopen the terminal in the TTY case) with a
warning, rather than silently creating a broken setup where output
operations would fail.
network: clear existing routes if Gateway= is empty in [Network]
Add support for an empty Gateway= in [Network] to clear the existing
routes. This change will allow users to remove the default route from a
drop-in file.
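For illustration, a drop-in like the following (file names are hypothetical)
would remove the default route configured by the main .network file:
    # /etc/systemd/network/20-wired.network.d/no-default-route.conf
    [Network]
    Gateway=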
man: add 'testing' as one of the suggestions for DEPLOYMENT=
Looking at the list, "test" or "testing" seems to be a fairly generic entry
that is missing from the list of suggestions. I went with "testing" because it
fits better with the other items, e.g. "staging".
In https://github.com/systemd/systemd/issues/38743 "laboratory" was also
suggested. I didn't include this because that is more about the location, not
deployment type. Any of the other deployments could be in a "laboratory".
Chris Down [Wed, 19 Nov 2025 19:52:02 +0000 (03:52 +0800)]
tests: ASSERT_SIGNAL: Prevent hallucinating parent as child and confusing exit codes with signals (#39807)
This series fixes two distinct, pretty bad bugs in `ASSERT_SIGNAL`.
These bugs can allow failing tests to pass, and can also cause the test
runner to silently terminate prematurely in a way that looks like
success.
This is not theoretical, see
https://github.com/systemd/systemd/pull/39674#discussion_r2540552699 for
a real case of this happening.
---
Bug 1: Parent process hallucinates it is the child and re-executes the
expression being tested
Previously, assert_signal_internal() returned 0 in two mutually
exclusive states:
1. We are the child process (immediately after fork()).
2. We are the parent process, and the child exited normally (status 0).
The macro failed to distinguish these cases. If a child failed to crash
as expected, the parent received 0, incorrectly interpreted that as
meaning it was the child, and re-executed the test expression inside the
parent process.
This can cause tests to falsely pass. The parent would successfully run
the expression (which wasn't supposed to crash in the parent), succeed,
and call _exit(EXIT_SUCCESS).
The second consequence is silent truncation. When the parent called
_exit(), it terminated the entire test runner immediately. Any
subsequent tests in the same binary were never executed.
---
Bug 2: Conflation of exit codes and signals
The harness returned the raw si_status without checking si_code. This
meant that an exit code was indistinguishable from a signal number. For
example, if a child process failed and called exit(6), the harness
reported it as having been killed by SIGABRT (signal 6).
---
This PR both fixes the bugs and reworks the ASSERT_SIGNAL infrastructure
to ensure this is very unlikely to regress:
- assert_signal_internal now returns an explicit control flow enum
(FORK_CHILD / FORK_PARENT) separate from the status data. This makes it
structurally impossible for the parent to hallucinate that it is the
child.
- The output parameter is only populated with a signal number if si_code
confirms the process was killed by a signal. Normal exits return 0.
Chris Down [Wed, 19 Nov 2025 14:06:03 +0000 (22:06 +0800)]
tests: ASSERT_SIGNAL: Do not allow parent to hallucinate it is the child
assert_signal_internal() returns 0 in two distinct cases:
1. In the child process (immediately after fork returns 0).
2. In the parent process, if the child exited normally (no signal).
ASSERT_SIGNAL fails to distinguish these cases. When a child exited
normally (case 2), the parent process receives 0, incorrectly interprets
it as meaning it is the child, and re-executes the test expression
inside the parent process. Goodness gracious!
This causes two severe test integrity issues:
1. False positives. The parent can run the expression, succeed, and call
_exit(EXIT_SUCCESS), causing the test to pass even though no signal
was raised.
2. Silent truncation. The _exit() call in the parent terminates the test
runner prematurely, preventing subsequent tests in the same file from
running.
Example of the bug in action, from #39674:
ASSERT_SIGNAL(fd_is_writable(closed_fd), SIGABRT)
This test should fail (fd_is_writable does not SIGABRT here), but with
the bug, the parent hallucinated being the child, re-ran the expression
successfully, and exited with success.
Fix this by refactoring assert_signal_internal() to be much more strict
about separating control flow from data.
The signal status is now returned via a strictly typed output parameter,
guaranteeing that determining whether we are the child is never
conflated with whether the child exited cleanly.
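The rework boils down to something like the following sketch (the enum names
follow the commit message; the real systemd code differs in detail): the
return value answers only "which process am I?", and the child's exit
information travels through a separate output parameter.
    #include <signal.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>
    typedef enum { FORK_CHILD, FORK_PARENT } ForkRole;
    /* Sketch: a clean child exit can never be mistaken for "I am the child",
     * because the role and the exit data are carried separately. */
    static ForkRole assert_signal_fork(siginfo_t *ret_si) {
        pid_t pid = fork();          /* error handling elided in this sketch */
        if (pid == 0)
            return FORK_CHILD;       /* caller evaluates the expression, then _exit()s */
        (void) waitid(P_PID, (id_t) pid, ret_si, WEXITED);
        return FORK_PARENT;
    }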
Chris Down [Wed, 19 Nov 2025 13:45:40 +0000 (21:45 +0800)]
tests: ASSERT_SIGNAL: Ensure sanitisers do not mask expected signals
ASAN installs signal handlers to catch crashes like SIGSEGV or SIGILL.
When these signals are raised, ASAN traps them, prints an error report,
and then typically terminates the process with a different signal (often
SIGABRT) or a non-zero exit code.
This interferes with ASSERT_SIGNAL when checking for specific crash
signals (for example, checking that a function raises SIGSEGV). In such
a case, the test harness sees the ASAN termination signal rather than
the expected signal, causing the test to fail.
Fix this by resetting the signal handler to SIG_DFL in the child process
immediately before executing the test expression. This ensures the
kernel kills the process directly with the expected signal, bypassing
ASAN's interceptors.
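In code, the child side amounts to something like this sketch (a hypothetical
function, not the actual macro body):
    #include <signal.h>
    #include <stdlib.h>
    #include <unistd.h>
    /* Sketch: restore the default disposition for the expected signal so a
     * sanitizer's interceptor cannot turn e.g. SIGSEGV into an abort with a
     * different status, then evaluate the expression under test. */
    static void run_expression_in_child(int expected_sig, void (*expr)(void)) {
        (void) signal(expected_sig, SIG_DFL);
        expr();                  /* should terminate the child with expected_sig */
        _exit(EXIT_SUCCESS);     /* only reached if no signal was raised */
    }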
Chris Down [Wed, 19 Nov 2025 08:50:38 +0000 (16:50 +0800)]
tests: ASSERT_SIGNAL: Stop exit codes from masquerading as signals
When a child process exits normally (si_code == CLD_EXITED),
siginfo.si_status contains the exit code. When it is killed by a signal
(si_code == CLD_KILLED or CLD_DUMPED), si_status contains the signal
number. However, assert_signal_internal() returns si_status blindly.
This causes exit codes to be misinterpreted as signal numbers.
This allows failing tests to silently pass if their exit code
numerically coincides with the expected signal. For example, a test
expecting SIGABRT (6) would incorrectly pass if the child simply exited
with status 6 instead of being killed by a signal.
Fix this by checking si_code. Only return si_status as a signal number
if the child was actually killed by a signal (CLD_KILLED or CLD_DUMPED).
If the child exited normally (CLD_EXITED), return 0 to indicate that no
signal occurred.
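The core of the fix can be sketched as a tiny mapping helper (illustrative,
not the exact code): report a signal number only when si_code confirms the
child was killed by a signal.
    #include <signal.h>
    /* Sketch: for CLD_EXITED, si_status holds an exit code, which must not
     * be confused with a signal number, so return 0 in that case. */
    static int siginfo_to_signal(const siginfo_t *si) {
        if (si->si_code == CLD_KILLED || si->si_code == CLD_DUMPED)
            return si->si_status;
        return 0;
    }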
Chris Down [Mon, 10 Nov 2025 20:26:10 +0000 (04:26 +0800)]
core: Verify inherited FDs are writable for stdout/stderr
When inheriting file descriptors for stdout/stderr (either from stdin
or when making stderr inherit from stdout), we previously just assumed
they would be writable and dup'd them. This could lead to broken setups
if the inherited FD was actually opened read-only.
Before dup'ing any inherited FDs to stdout/stderr, verify they are
actually writable using the new fd_is_writable() helper. If not, fall
back to /dev/null (or reopen the terminal in the TTY case) with a
warning, rather than silently creating a broken setup where output
operations would fail.
Chris Down [Mon, 17 Nov 2025 03:05:09 +0000 (11:05 +0800)]
fd-util: Add fd_is_writable() to check if FD is opened for writing
This checks whether a file descriptor is valid and opened in a mode that
allows writing (O_WRONLY or O_RDWR). This is useful when we want to
verify that inherited FDs can actually be used for output operations
before dup'ing them.
The helper explicitly handles O_PATH file descriptors, which cannot be
used for I/O operations and thus are never writable.
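The check can be sketched roughly like this (an illustration of the idea, not
the exact helper):
    #define _GNU_SOURCE             /* for O_PATH */
    #include <errno.h>
    #include <fcntl.h>
    /* Sketch: a fd counts as writable if it is valid, is not an O_PATH fd
     * (those cannot do any I/O), and was opened O_WRONLY or O_RDWR.
     * Returns a negative errno if the fd is invalid. */
    static int fd_is_writable_sketch(int fd) {
        int flags = fcntl(fd, F_GETFL);
        if (flags < 0)
            return -errno;
        if (flags & O_PATH)
            return 0;
        return (flags & O_ACCMODE) == O_WRONLY || (flags & O_ACCMODE) == O_RDWR;
    }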
Chris Down [Wed, 19 Nov 2025 08:49:22 +0000 (16:49 +0800)]
tests: Avoid variable shadowing in ASSERT_SIGNAL
The ASSERT_SIGNAL macro uses a fixed variable name, `_r`. This prevents
nesting the macro (like ASSERT_SIGNAL(ASSERT_SIGNAL(...))), as the inner
instance would shadow the outer instance's variable.
Switch to using the UNIQ_T helper to generate unique variable names at
each expansion level. This allows the macro to be used recursively,
which is required for upcoming regression tests regarding signal
handling logic.
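The underlying trick, in a simplified form (the real macros live in systemd's
macro.h and differ in detail): __COUNTER__ is captured once per outer
expansion and pasted into every mention of the variable, so nested expansions
get distinct names and never shadow each other.
    #define _CONCAT(a, b) a ## b
    #define CONCAT(a, b) _CONCAT(a, b)
    #define UNIQ_VAR(x, uniq) CONCAT(x, uniq)
    /* The public macro grabs __COUNTER__ once; the helper reuses that value
     * for every mention of the temporary. */
    #define SQUARE(expr) _SQUARE(__COUNTER__, expr)
    #define _SQUARE(uniq, expr)                          \
        ({                                               \
            int UNIQ_VAR(_r, uniq) = (expr);             \
            UNIQ_VAR(_r, uniq) * UNIQ_VAR(_r, uniq);     \
        })
    /* SQUARE(SQUARE(3)) now produces two differently named temporaries
     * instead of an inner _r shadowing the outer one. */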
Daan De Meyer [Wed, 19 Nov 2025 09:30:01 +0000 (10:30 +0100)]
tools: Add script to detect unused symbols in libshared
Symbols exported by libshared can't get pruned by the linker, so
every unused exported symbol is effectively dead code we ship to users
for no good reason. Let's add a script to analyze how many such symbols
we have.
We also add a meson test to run the script on all of our binaries.
Since the detection of unused symbols still has a few false positives,
don't enable the test by default, similar to the clang-tidy tests.
The script was 100% vibe coded by GitHub Copilot with Claude Sonnet 4.5
as the model.
Current results are (without the unused symbols list):
Analysis of libsystemd-shared-259.so
======================================================================
Total exported symbols: 4830
(excluding public API symbols starting with 'sd_')
Used symbols: 4672
Unused symbols: 158
Usage rate: 96.7%
ssh-generator: suppress error message for vsock EADDRNOTAVAIL
In logs in the Fedora OpenQA CI:
Nov 17 22:20:06 fedora systemd-ssh-generator[4117]: Failed to query local AF_VSOCK CID: Cannot assign requested address
Nov 17 22:20:06 fedora (generato[4088]: /usr/lib/systemd/system-generators/systemd-ssh-generator failed with exit status 1.
Nov 17 22:20:06 fedora systemd[1]: sshd-vsock.socket: Unit configuration changed while unit was running, and no socket file descriptors are open. Unit not functional until restarted.
AF_VSOCK is not configured there and systemd-ssh-generator should just exit
quietly. vsock_get_local_cid() already does some logging at debug level, so we
don't need to.
There is also a second bug: we report modifications to the unit we have just
created. I think we have an issue open for this somewhere, but I cannot find it.
man: use prefix number that matches the general suggestion
`systemd.network(5)` recommends “that each filename is prefixed with a number
smaller than "70" (e.g. 10-eth0.network)”.
Reduce the number used by the example accordingly, but stay above the number
(`50`) used in the earlier example for static configuration, so that it would
take precedence over the dynamic one if both match the same network.
Before:
$ build/systemd-creds --uid=asdf
Failed to resolve user 'asdf': No such process
Now:
$ build/systemd-creds --uid=asdf
Failed to resolve user 'asdf': Unknown user
core: improve messages about unknown users and groups
$ sudo build/systemd-run --uid=asdf whoami
$ journalctl -e
(whoami)[1007784]: run-p1007782-i5200512.service: Failed to determine user credentials: No such process
(whoami)[1007784]: run-p1007782-i5200512.service: Failed at step USER spawning /usr/sbin/whoami: No such process
systemd[1]: run-p1007782-i5200512.service: Main process exited, code=exited, status=217/USER
systemd[1]: run-p1007782-i5200512.service: Failed with result 'exit-code'.
Now:
(whoami)[1013204]: run-p1013202-i5205932.service: Failed to determine credentials for user 'asdf': Unknown user
(whoami)[1013204]: run-p1013202-i5205932.service: Failed at step USER spawning /usr/sbin/whoami: Invalid argument
systemd[1]: run-p1013202-i5205932.service: Main process exited, code=exited, status=217/USER
systemd[1]: run-p1013202-i5205932.service: Failed with result 'exit-code'.
Before:
$ sudo build/systemd-run --scope --uid=asdf whoami
Failed to resolve user asdf: No such process
Now:
$ sudo build/systemd-run --scope --uid=asdf whoami
Failed to resolve user 'asdf': Unknown user
tmpfiles: improve error message for missing user/group
From a boot with a dracut initrd:
systemd-tmpfiles[242]: /usr/lib/tmpfiles.d/tpm2-tss-fapi.conf:2: Failed to resolve user 'tss': No such process
systemd-tmpfiles[242]: Failed to parse ACL "default:group:tss:rwx", ignoring: Invalid argument
systemd-tmpfiles[242]: /usr/lib/tmpfiles.d/tpm2-tss-fapi.conf:4: Failed to resolve user 'tss': No such process
systemd-tmpfiles[242]: Failed to parse ACL "default:group:tss:rwx", ignoring: Invalid argument
systemd-tmpfiles[242]: /usr/lib/tmpfiles.d/tpm2-tss-fapi.conf:6: Failed to resolve group 'tss': No such process
systemd-tmpfiles[242]: /usr/lib/tmpfiles.d/tpm2-tss-fapi.conf:7: Failed to resolve group 'tss': No such process
udev: define a generic helper to print messages about unknown users and groups
We cannot just use %m, because strerror returns a confusing error message
for ESRCH or ENOEXEC. udev code was doing a good job, but the error handling
was very verbose. Let's encapsulate the customized error messages in a
helper.
No functional change, except that the error messages have a slightly different
form now. The old messages were a bit better, but we don't have as much
flexibility in the new scheme. "Failed to resolve user 'foo': Unknown user"
should be good enough.
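A minimal sketch of such a helper (the name and shape are hypothetical): map
the errno values our resolvers use for "no such user/group" onto a fixed
string instead of letting strerror() produce "No such process".
    #include <errno.h>
    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>
    /* Hypothetical helper: lookups report "not found" as ESRCH (and sometimes
     * ENOEXEC), for which strerror() gives confusing text, so special-case
     * those values. */
    static const char *errno_to_resolve_message(int error, bool is_group) {
        error = abs(error);
        if (error == ESRCH || error == ENOEXEC)
            return is_group ? "Unknown group" : "Unknown user";
        return strerror(error);
    }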
man/file-hierarchy: refer to LFSH and MOUNT_REQUIREMENTS
The contents of file-hierarchy.7 have been copied over to the new page in
uapi-docs, and are already going stale here, since a bunch of additions and
improvements have been made there. OTOH, a commit was made here, but not there.
https://github.com/uapi-group/specifications/pull/172 updates the other doc.
OTOH, a reader should also read MOUNT_REQUIREMENTS if they care about what
systemd cares about. Thus, replace most of the text in our man page by a
reference to those two pages. In case we later want to list some disagreements
or differences wrt. LFSH, we can always add a paragraph or two here,
but having two documents with almost the same content is not going to work.
docs/MOUNT_REQUIREMENTS: describe nested mounts more carefully
I was looking into a question posed in one of the Fedora discussion threads:
is it OK for a package to assume that files in different directories under /usr
are always on the same mount point? rpmlint emits a warning if a package has
files that are hardlinked between directories, i.e. rpmlint thinks that this
is not the case. But in practice, our systems are like this and our tooling
generally doesn't expect a part of /usr to be separated out. I looked at the
MOUNT_REQUIREMENTS document, but it doesn't answer this question clearly.
It was clearly written with the assumption that e.g. "/usr/" or "/var/" are one
mount point, so when it is "mounted", all of it is available. But the document
also talks about submounts being pulled in through requirements on specific
units, which requires some mounts not to be mounted all at once, so the reader
is left without any direct answer to this question.
This rewrite makes the following changes:
- rename "generally three categories of requirements" to
"three general categories of mount points" because we're categorizing
mount points, not requirements.
- always repeat the category name in further mentions,
e.g. "2/early" instead of just "2" so the reader doesn't have to jump
back to the table when reading.
- mention that it is OK for a mount point not to be split out
- say that a submount which is "conceptually separate" may be mounted
later.
- say "ephemeral system" instead of "stateless system" and split out
the description of those systems into a separate paragraph and clearly
state that they are an exception that skips the requirements listed in
this document.
- be consistent in specifying the boundary before which each category must
have been mounted. Previously, cat. 1 was described as "before transition"
and cat. 2 was described as "during early boot", which created the additional
problem that later we needed to contradict this by saying that "must be mounted
during early boot" doesn't actually mean that and this can be done earlier.
If we say "before end of early boot", we avoid this awkwardness.
network: gracefully disable resolve hook when socket is disabled
systemd-networkd cannot create the directory /run/systemd/resolve.hook/. Even
if the directory exists, it is not owned by the systemd-network user/group, so
systemd-networkd cannot create a socket file in the directory. Hence, if the
systemd-networkd-resolve-hook.socket unit is disabled, networkd fails to open
the varlink socket and fails to start:
systemd-networkd[1304645]: Failed to bind to systemd-resolved hook Varlink socket: Permission denied
systemd-networkd[1304645]: Could not set up manager: Permission denied
systemd[1]: systemd-networkd.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: systemd-networkd.service: Failed with result 'exit-code'.
systemd[1]: Failed to start systemd-networkd.service - Network Management.
If the socket unit is disabled, that should mean the system administrator wants
to disable the feature. Let's not try to set up the varlink socket in that case.
Now that the resolve hook feature can be toggled by enabling/disabling the socket
unit, let's drop the $SYSTEMD_NETWORK_RESOLVE_HOOK environment variable.
Simon Barth [Mon, 10 Nov 2025 20:57:24 +0000 (21:57 +0100)]
man: Fix systemd-analyze exit-status example output
The output of `systemd-analyze exit-status` changed in commit e04ed6db6b44681b7a7876b9c4a1e6adaf877670, so that the exit-status class
for EXIT_SUCCESS and EXIT_FAILURE is "libc" instead of "glibc".
This commit makes the example output in the man-page match the actual
output again.
Mike Yuan [Sat, 15 Nov 2025 20:06:39 +0000 (21:06 +0100)]
core/unit: mark running reload job as canceled if the unit deactivated
The semantics of reload are that the service updates its extrinsic state
and continues execution. If it actually deactivated, we shouldn't
spuriously notify the caller that the reload succeeded.
Mike Yuan [Sun, 16 Nov 2025 14:59:28 +0000 (15:59 +0100)]
core/unit: no need to handle intermediate job types in unit_process_job()
Installed jobs are always collapsed, i.e. can only be of types
accepted by job_run_and_invalidate() modulo JOB_NOP which is
stored in Unit.nop_job (if any). Let's trim the unreachable
branches.
libutmps does not support utmpxname(): the function always fails
with ENOSYS, and the library always uses its own file.
However, our code relies on the function succeeding.
Let's revert the change for now, and revisit later when musl users
request support for libutmps.
Philip Withnall [Sun, 2 Nov 2025 11:34:03 +0000 (11:34 +0000)]
docs: Update MEMORY_PRESSURE to mention recent improvements in GLib
See https://gitlab.gnome.org/GNOME/glib/-/issues/2931 for the changes in
GLib upstream. Using `GMemoryMonitor` is now more compliant with the
systemd recommended approach, but it needs further work to read the
recommended environment variables rather than unconditionally accessing
the per-cgroup PSI kernel file directly.
Signed-off-by: Philip Withnall <pwithnall@gnome.org>