Daan De Meyer [Mon, 16 Feb 2026 12:14:58 +0000 (13:14 +0100)]
sd-bus: Don't fork unnecessarily to connect to container
Let's check if we're already in the right namespaces and call connect()
directly if that's the case. This can easily happen when the machine is
specified as .host or so.
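A minimal sketch of the namespace check, assuming a plain /proc-based inode comparison (the actual sd-bus code uses its own helpers):

    #include <stdbool.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    /* Illustrative only: true if PID 'pid' shares our namespace of type
     * 'ns' (e.g. "mnt", "net"), decided by comparing the namespace inode
     * and device numbers under /proc. */
    static bool in_same_namespace(pid_t pid, const char *ns) {
            char ours[64], theirs[64];
            struct stat a, b;

            snprintf(ours, sizeof(ours), "/proc/self/ns/%s", ns);
            snprintf(theirs, sizeof(theirs), "/proc/%d/ns/%s", (int) pid, ns);

            if (stat(ours, &a) < 0 || stat(theirs, &b) < 0)
                    return false; /* be conservative, assume they differ */

            return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
    }

    /* If the container leader shares our namespaces (e.g. the machine is
     * ".host"), skip the fork()+setns() dance and just connect() directly. */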
Daan De Meyer [Tue, 17 Feb 2026 14:36:00 +0000 (15:36 +0100)]
namespace-util: Do is_our_namespace() checks first in namespace_enter()
These checks may rely on /proc on older kernels, which we could lose access
to by joining namespaces, so let's do all the checks first and only then
join the namespaces.
Chris Down [Tue, 17 Feb 2026 06:58:44 +0000 (14:58 +0800)]
oomd: Fix unnecessary delays during OOM kills with pending kills present
Let's say a user has two services with ManagedOOMMemoryPressure=kill,
perhaps a web server under system.slice and a batch job under
user.slice. Both exceed their pressure limits. On the previous timer
tick, oomd has already queued the web server's candidate for killing,
but the prekill hook has not yet responded, so the kill is still
pending.
In the code, monitor_memory_pressure_contexts_handler() iterates over
all pressure targets that have exceeded their limits. When it reaches
the web server target, it calls oomd_cgroup_kill_mark(), which returns 0
because that cgroup is already queued. The code treats this the same as
a successful new kill: it resets the 15 second delay timer and returns
from the function, exiting the loop.
This loop is handled by SET_FOREACH, so the iteration order is
hash-dependent. As such, if the web server target happens to be visited
first, oomd never evaluates the batch job target at all.
The effect is twofold:
1. oomd stalls for 15 seconds despite not having initiated any new kill.
That can unnecessarily delay further action to stem increases in
memory pressure. The delay exists to let stale pressure counters
settle after a kill, but no kill has happened here.
2. It non-deterministically skips pressure targets that may have
unqueued candidates, dangerously allowing memory pressure to persist
for longer than it should.
Fix this by skipping cgroups that are already queued so the loop
proceeds to try other pressure targets. We should only delay when a new
kill mark is actually created.
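Roughly, the fixed loop looks like this (identifiers simplified, not the literal oomd code):

    SET_FOREACH(ctx, targets_over_pressure_limit) {
            r = oomd_cgroup_kill_mark(ctx, /* dry_run= */ false);
            if (r == 0)
                    /* Already queued (kill still pending) or dry run: don't
                     * reset the post-kill delay, move on to other targets. */
                    continue;
            if (r < 0) {
                    log_warning_errno(r, "Failed to queue kill for %s, ignoring: %m", ctx->path);
                    continue;
            }

            /* r > 0: a new kill was queued, so arm the 15 second delay that
             * lets stale pressure counters settle, and stop for this tick. */
            m->post_kill_delay_start = now(CLOCK_MONOTONIC); /* field name assumed */
            break;
    }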
Chris Down [Tue, 17 Feb 2026 06:30:16 +0000 (14:30 +0800)]
oomd: Fix silent failure to find bad cgroups when another cgroup dies
Consider a workload slice with several sibling cgroups. Imagine that one
of those cgroups is removed between the moment oomd enumerates the
directory and the moment it reads memory.oom.group. This is actually
relatively plausible under the high memory pressure conditions where
oomd is most needed.
In this case, the failed read prompts us to `return 0`, which exits the
entire enumeration loop in recursively_get_cgroup_context(). As a
result, all remaining sibling cgroups are silently dropped from the
candidate list for that monitoring cycle.
The effect is that oomd can fail to identify and kill the actual
offending cgroup, allowing memory pressure to persist until a subsequent
cycle where the race doesn't occur.
Fix this by instead proceeding to evaluate further sibling cgroups.
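The shape of the fix, with simplified names (the real loop lives in recursively_get_cgroup_context()):

    /* Enumerating child cgroups of the candidate path; names simplified. */
    for (size_t i = 0; i < n_children; i++) {
            r = oomd_cgroup_context_acquire(children[i], &child_ctx);
            if (r < 0) {
                    /* The cgroup may have vanished between readdir() and the
                     * memory.oom.group read; skip it instead of aborting the
                     * whole enumeration with "return 0". */
                    log_debug_errno(r, "Failed to get context for %s, ignoring: %m", children[i]);
                    continue;
            }
            /* ... add child_ctx to the candidate set and recurse ... */
    }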
Let's say a user has two services with ManagedOOMMemoryPressure=kill,
one for a web server under system.slice, and one for a batch job under
user.slice. The batch job is causing severe memory pressure, whereas the
web server's cgroup has no processes with significant pgscan activity.
In the code, monitor_memory_pressure_contexts_handler() iterates over
all pressure targets that have exceeded their limits. When
oomd_select_by_pgscan_rate() returns 0 (that is, no candidates) for a
target, we return from the entire SET_FOREACH loop instead of moving to
the next target. Since SET_FOREACH iteration order is hash-dependent, if
the web server target happens to be visited first, oomd will find no
kill candidates for it and exit the loop. The batch job target that is
actually slamming the machine will never even be evaluated, and can
continue to wreak havoc without any intervention.
The effect is that oomd non-deterministically and silently fails to kill
cgroups that it should actually kill, allowing memory pressure to
persist and dangerously build up on the machine.
The fix is simple: keep evaluating the remaining targets when one does not
match.
These were introduced as part of the sd-executor worker pool
effort (#29566), which never landed due to an insignificant
performance improvement. Let's just remove the unused
helpers. If that work ever gets resurrected they can be
restored from this commit pretty easily.
Yu Watanabe [Tue, 17 Feb 2026 05:53:46 +0000 (14:53 +0900)]
oomd: Fix Killed signal reason being lost (#40689)
Emitting "oom" doesn't mesh with the org.freedesktop.oom1.Manager
Killed() API contract, which defines "memory-used" and "memory-pressure"
as possible reasons. Consumers that key off the reason will thus either
lose policy attribution or reject the unknown value completely.
Plumb the reason through so it is visible to consumers.
Chris Down [Sun, 15 Feb 2026 17:42:51 +0000 (01:42 +0800)]
oomd: Fix Killed signal reason being lost
Emitting "oom" doesn't mesh with the org.freedesktop.oom1.Manager
Killed() API contract, which defines "memory-used" and "memory-pressure"
as possible reasons. Consumers that key off the reason will thus either
lose policy attribution or reject the unknown value completely.
Plumb the reason through so it is visible to consumers.
Daan De Meyer [Mon, 16 Feb 2026 18:59:10 +0000 (19:59 +0100)]
nspawn-mount: Use setns() in wipe_fully_visible_api_fs()
namespace_enter() now does an is_our_namespace() check, which requires
/proc on older kernels, but /proc is not available anymore after we call
do_wipe_fully_visible_api_fs() in wipe_fully_visible_api_fs().
Let's just call setns() instead as namespace_enter() is overkill to
enter a single namespace anyway.
Daan De Meyer [Mon, 16 Feb 2026 14:42:35 +0000 (15:42 +0100)]
mkosi: Set CacheOnly=metadata for test images (#40699)
The default behavior is to sync repository metadata for every image
that does not have a cache, and we recently changed behavior to invalidate
all cached images whenever we decide the repository metadata needs to
be resynced.
In systemd we have two images that are not cached because they use
BaseTrees=, hence set CacheOnly=metadata to explicitly indicate that these
two images should never cause a repository metadata resync even though
they are not cached.
Daan De Meyer [Mon, 16 Feb 2026 12:28:22 +0000 (13:28 +0100)]
mkosi: Set CacheOnly=metadata for test images
The default behavior is to sync repository metadata for every image
that does not have a cache, and we recently changed behavior to invalidate
all cached images whenever we decide the repository metadata needs to
be resynced.
In systemd we have two images that are not cached because they use BaseTrees=,
hence set CacheOnly=metadata to explicitly indicate that these two images
should never cause a repository metadata resync even though they are
not cached.
* 66d51024b7 man: Update caching section
* 4eac60f168 Remove all cached images if repository metadata will be synced
* 025c6c0150 Move Incremental= to inherited settings in docs
* 427970d8fd Make MakeScriptsExecutable= a multiversal setting
* 53bd2da6fe Look at all CacheOnly= settings to determine if we need to sync metadata
* 114ae558ef config / qemu: add Console=headless
Daan De Meyer [Mon, 16 Feb 2026 10:26:41 +0000 (11:26 +0100)]
namespace-util: Merge namespace_enter_delegated() into namespace_enter() (#40669)
There's no need to pass in a boolean to decide whether we use
namespace_enter_delegated() or not. Instead, we can just check if we
have CAP_SYS_ADMIN in our own user namespace. If we don't, then we have
to insist on a child user namespace being passed in and we have to enter
it first to get CAP_SYS_ADMIN as without CAP_SYS_ADMIN we wouldn't be able
to call setns() in the first place. If we do have CAP_SYS_ADMIN, we can
always enter the other namespaces first before entering the user namespace.
Additionally, we don't fail anymore if we can't reset the UID/GID since a
root user might not always be available in every user namespace we might
enter.
r-vdp [Thu, 12 Feb 2026 21:52:54 +0000 (23:52 +0200)]
dns-delegates: add support for setting a firewall mark
This makes it possible to have DNS requests for certain domains routed
differently than normal requests, which is for instance useful when
using policy routing to route traffic over a VPN while DNS requests for
the VPN endpoint itself should be routed differently.
It doesn't make much sense to configure a firewall mark at the level of
a network interface, but at the level of a DNS delegate it can be very
useful.
Daan De Meyer [Sun, 15 Feb 2026 13:22:44 +0000 (14:22 +0100)]
namespace-util: Merge namespace_enter_delegated() into namespace_enter()
There's no need to pass in a boolean to decide whether we use
namespace_enter_delegated() or not. Instead, we can just check if we
have CAP_SYS_ADMIN in our own user namespace. If we don't, then we have
to insist on a child user namespace being passed in and we have to enter
it first to get CAP_SYS_ADMIN as without CAP_SYS_ADMIN we wouldn't be able
to call setns() in the first place. If we do have CAP_SYS_ADMIN, we can
always enter the other namespaces first before entering the user namespace.
Additionally, we don't fail anymore if we can't reset the UID/GID since a
root user might not always be available in every user namespace we might
enter.
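A rough sketch of the resulting ordering logic (structure simplified; have_effective_cap() is systemd's capability helper):

    int namespace_enter_sketch(int pidns_fd, int mntns_fd, int netns_fd, int userns_fd) {
            bool privileged = have_effective_cap(CAP_SYS_ADMIN) > 0;

            if (!privileged) {
                    /* Without CAP_SYS_ADMIN we can't setns() at all, so a child
                     * user namespace must be passed in and entered first to
                     * gain the capability. */
                    if (userns_fd < 0)
                            return -EPERM;
                    if (setns(userns_fd, CLONE_NEWUSER) < 0)
                            return -errno;
            }

            if (pidns_fd >= 0 && setns(pidns_fd, CLONE_NEWPID) < 0)
                    return -errno;
            if (mntns_fd >= 0 && setns(mntns_fd, CLONE_NEWNS) < 0)
                    return -errno;
            if (netns_fd >= 0 && setns(netns_fd, CLONE_NEWNET) < 0)
                    return -errno;

            if (privileged && userns_fd >= 0 && setns(userns_fd, CLONE_NEWUSER) < 0)
                    return -errno;

            /* Resetting UID/GID is now best effort: not every user namespace
             * we enter necessarily maps a root user. */
            (void) setresgid(0, 0, 0);
            (void) setresuid(0, 0, 0);
            return 0;
    }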
The commit introduced a new "metrics" varlink server, but for
user scope stuff it is not bound anywhere. The copy-pasted
"fresh" handling for deserialization is also essentially
meaningless as metrics_setup_varlink_server() doesn't even report
whether the varlink server is fresh (let alone that no serialization
is being done at all right now). Moreover, currently the event
priority is hardcoded, while the event loop and associated priority
assignment ought to be up to each daemon.
While fixing the mentioned issues I took the chance to restructure
the existing code a bit for readability. Note that serialization
for the metrics server is still missing - it will be tackled
in subsequent commits.
Mike Yuan [Sun, 8 Feb 2026 20:47:38 +0000 (21:47 +0100)]
tree-wide: drop redundant check for SD_VARLINK_METHOD_MORE flag
If the IDL declares the method requires 'more' yet the call doesn't
have it set, varlink_idl_validate_method_call() should have rejected
it and the callback shouldn't be reached.
Daan De Meyer [Fri, 13 Feb 2026 11:24:49 +0000 (12:24 +0100)]
user-util: Don't setgroups() if /proc/self/gid_map is empty
If /proc/self/gid_map is empty, the kernel will refuse setgroups(),
so don't attempt it in that case, on top of the /proc/self/setgroups
check we already have.
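A minimal sketch of the extra check, assuming plain stdio (the actual code uses systemd's file helpers):

    #include <errno.h>
    #include <grp.h>
    #include <stdio.h>

    /* Sketch: skip setgroups() when /proc/self/gid_map is empty, since the
     * kernel would refuse the call in that case. Error handling trimmed. */
    static int maybe_drop_supplementary_groups(void) {
            char buf[64];
            FILE *f = fopen("/proc/self/gid_map", "re");
            if (f) {
                    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
                    fclose(f);
                    if (n == 0)
                            return 0; /* empty gid_map: setgroups() would fail */
            }

            if (setgroups(0, NULL) < 0)
                    return -errno;
            return 0;
    }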
gvenugo3 [Thu, 20 Nov 2025 03:35:03 +0000 (20:35 -0700)]
network: implement varlink LinkUp and LinkDown methods
The new varlink methods are basically equivalent to 'ip link set INTERFACE up/down',
but they support polkit authentication. Also, on LinkDown, dynamic
engines like the DHCP client/server are gracefully stopped before the
interface is brought down. Hence, e.g. an empty RA should be sent on stop.
Yu Watanabe [Mon, 16 Feb 2026 04:25:35 +0000 (13:25 +0900)]
udev: guess if usb devices are internal or external (#40649)
Currently we maintain databases to determine whether a usb device is an
inherent part of the system or an external device.
Let's instead use the removable attribute of the port the device is
connected to. That way we don't have to rely on a particular vendor only
making external devices, or keep quirking the input subsystem for that
purpose, which will become unreliable as more and more internal devices
are connected over usb instead of ps2 or i2c buses. E.g.
https://gitlab.freedesktop.org/libinput/libinput/-/commit/02b495e79022e64514015e1a3dea32997035dd4f?merge_request_iid=1389
So far this has proven reliable on a small set of devices, from normal
laptops to detachable ones. The check that maxchild is 0 is needed for
detachable devices: pogo pin usb ports are fixed, and while the
keyboard/touchpad dock is attached its input devices tend to be directly
connected to that port, whereas docks with additional usb ports tend to
contain a hub that then exposes removable as unknown. If we don't check
for maxchild 0 we will not only guess that the keyboard and touchpad are
internal but also incorrectly classify other input devices, like mice
connected to the dock's usb ports.
A very generic name like INTEGRATION is used because it is not already
used for anything else, and it is determined not only for the usb bus
but also for acpi, pci and platform devices.
A remap to the existing libinput variables is also done for
compatibility purposes. Whether it is possible to have only the
INTEGRATION variable instead of multiple ones may be addressed in the
future, but that is currently unclear.
This can also be used, for example, to implement a feature we currently
lack in linux: when a device with accelerometers and cameras is rotated,
the video output is not. Tagging the device's own cameras as internal
and external ones as external makes it possible to apply the rotation
only to the internal ones.
Note that this has nothing to do with the removable attribute found on
usb storage devices, whose value can be 0 or 1. There is no conflict at
all because the removable attribute we check is specifically the one
found on usb ports.
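A sketch of the heuristic using sd-device (the parent lookup and attribute names here are illustrative of the idea, not the exact udev rule):

    #include <string.h>
    #include <systemd/sd-device.h>

    /* Returns 1 if the device looks internal, 0 if external, -1 if unknown. */
    static int usb_device_looks_internal(sd_device *dev) {
            sd_device *usbdev;
            const char *removable, *maxchild;

            if (sd_device_get_parent_with_subsystem_devtype(dev, "usb", "usb_device", &usbdev) < 0)
                    return -1;

            /* The port's removable attribute: "fixed", "removable" or "unknown". */
            if (sd_device_get_sysattr_value(usbdev, "removable", &removable) < 0)
                    return -1;
            /* maxchild == 0 excludes hubs (e.g. in detachable keyboard docks),
             * whose downstream ports would otherwise be misclassified. */
            if (sd_device_get_sysattr_value(usbdev, "maxchild", &maxchild) < 0)
                    return -1;

            if (strcmp(removable, "fixed") == 0 && strcmp(maxchild, "0") == 0)
                    return 1;
            if (strcmp(removable, "removable") == 0)
                    return 0;
            return -1;
    }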
GNOME currently uses a clamp of 1% and divides the brightness control
into 20 steps. Using a 5% clamp means that, e.g. on a device with a max
value of 640, we always end up in GNOME's first brightness step and
can't stay at the minimum.
GNOME uses steps of 640/20 = 32, with the zero step at 640 * 1% = 6.
When we restart the device at the lowest brightness, systemd sees 6 but
sets 640 * 5% = 32, so we end up with the brightness in the first step.
Tests on IPS and OLED panels have been done, and 1% still seems a
sensible minimum usable value, so use that to allow all environments to
set lower brightness values that won't be raised by systemd at boot.
If your user environment allows setting excessively low, unusable
values, blame it (or yourself, if you change it directly through sysfs),
not systemd.
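For reference, the arithmetic from the example above (640 max brightness, 20 GNOME steps):

    #include <stdio.h>

    int main(void) {
            unsigned max_brightness = 640;

            unsigned old_clamp  = max_brightness * 5 / 100; /* 32: lands in GNOME's first step */
            unsigned new_clamp  = max_brightness * 1 / 100; /*  6: GNOME's own zero step */
            unsigned gnome_step = max_brightness / 20;      /* 32 */

            printf("old clamp %u, new clamp %u, GNOME step size %u\n",
                   old_clamp, new_clamp, gnome_step);
            return 0;
    }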
Yu Watanabe [Mon, 16 Feb 2026 00:10:01 +0000 (09:10 +0900)]
boot: fix buffer alignment when doing block I/O (#40465)
The UEFI Block I/O Protocol has a `Media->IoAlign` field dictating the
minimum alignment of the I/O buffer. It's quite surprising this has been
lingering here unnoticed for years; it seems most UEFI implementations
have small or no alignment requirements. That's not the case for U-Boot,
which requires at least 512 byte alignment, hence attempts to read the
GPT partition table fail and in effect systemd-boot cannot find the
XBOOTLDR partition.
These patches allow booting from the XBOOTLDR partition on U-Boot -
tested with the latest systemd revision and U-Boot master
(`8de6e8f8a076d2c9b6d38d8563db135c167077ec`) on x64 and ARM32, both of
which fail without the patch.
This also fixes the BitLocker probing logic, which is the only other
place where raw block I/O is used, however this is untested.
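A simplified sketch of honouring IoAlign with a bounce buffer (allocation helper assumed; not the literal systemd-boot patch):

    EFI_STATUS read_blocks_aligned(EFI_BLOCK_IO_PROTOCOL *block_io,
                                   EFI_LBA lba, UINTN size, void *dest) {
            UINTN align = block_io->Media->IoAlign;

            /* IoAlign of 0 or 1 means no alignment requirement. */
            if (align <= 1)
                    return block_io->ReadBlocks(block_io, block_io->Media->MediaId,
                                                lba, size, dest);

            /* Over-allocate and round the pointer up to the required alignment
             * (IoAlign is a power of two). */
            uint8_t *raw = xmalloc(size + align - 1); /* assumed allocator */
            uint8_t *aligned = (uint8_t *) (((uintptr_t) raw + align - 1) & ~((uintptr_t) (align - 1)));

            EFI_STATUS err = block_io->ReadBlocks(block_io, block_io->Media->MediaId,
                                                  lba, size, aligned);
            if (err == EFI_SUCCESS)
                    memcpy(dest, aligned, size);

            free(raw);
            return err;
    }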
Chris Down [Sun, 15 Feb 2026 17:31:12 +0000 (01:31 +0800)]
oomd: Return tristate status from oomd_cgroup_kill_mark()
oomd_cgroup_kill_mark() currently returns 0 on all non-error paths. But
the manager only logs that it marked for killing on `if (r > 0)`, which
is thus unreachable.
Changing it to `r >= 0` would also be wrong, because then we would log
on no-op paths.
So let's fix this by making the return value express what actually
happened:
- < 0: failure to queue the kill state
- 0: no new mark was created (already queued or dry-run)
- > 0: a new kill state was queued
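Sketched, the new contract and its caller look roughly like this (names simplified):

    static int oomd_cgroup_kill_mark_sketch(Manager *m, const char *path, bool dry_run) {
            if (dry_run)
                    return 0;                  /* no-op, nothing queued */
            if (set_contains(m->kill_states, path))
                    return 0;                  /* already queued, no new mark */

            int r = queue_kill_state(m, path); /* assumed helper */
            if (r < 0)
                    return r;                  /* failed to queue */
            return 1;                          /* new kill state queued */
    }

    /* Caller: log the "marked for killing" message, and arm the post-kill
     * delay, only when r > 0. */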
Chris Down [Sun, 15 Feb 2026 17:30:02 +0000 (01:30 +0800)]
oomd: Fix bug where we drop queued kill state on duplicate cgroup
oomd_cgroup_kill_mark() allocates a temporary OomdKillState and inserts
it into kill_states via set_ensure_put(). This is keyed by cgroup path.
When the same cgroup is already queued, set_ensure_put() dutifully
returns 0.
The function then returns with
_cleanup_(oomd_kill_state_removep) still armed, which eventually calls
oomd_kill_state_free().
oomd_kill_state_free() removes from kill_states by cgroup-path key, so
because this path already exists, it will remove the existing queued
kill state instead of just dropping the temporary object.
This is wrong, and results in mistakenly dropping the queued kill state
on duplicates.
This can happen when a cgroup is marked multiple times before the first
queued kill state is consumed. The result is lost kill-state tracking
and incorrect prekill/kill sequencing.
Handle r == 0 explicitly by freeing only the temporary object and
leaving the already queued state intact.
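The fixed insertion path, sketched (the real code disarms a _cleanup_ handler rather than calling free() directly):

    OomdKillState *ks = new_kill_state(cgroup_path); /* assumed helper */
    r = set_ensure_put(&m->kill_states, &kill_state_hash_ops, ks);
    if (r < 0) {
            free(ks);       /* insertion failed: drop the temporary object */
            return r;
    }
    if (r == 0) {
            /* This cgroup is already queued: free only the temporary object.
             * Running the remove-by-key cleanup here would look the entry up
             * by cgroup path and drop the *existing* queued kill state. */
            free(ks);
            return 0;
    }
    /* r > 0: ownership of ks passed to the set. */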
Chris Down [Sat, 14 Feb 2026 16:05:12 +0000 (00:05 +0800)]
oomd: Prevent corruption of cgroup paths in Killed signal
While looking at oomd behaviour in production I noticed that I always
get garbage cgroup paths for the Killed event. Looking more closely, I
noticed that while the signature is (string cgroup, string reason), we
currently erroneously pass the `OomdCGroupContext*` pointer itself as
the first argument to sd_bus_emit_signal(), rather than the ctx->path
string it contains.
The in-memory layout on affected machines in my case is:
...which explains the control characters, since they're garbage from
parsing n_ref, the path pointer, and later fields. At runtime, sd-bus
treats ctx as `const char *` and reads struct bytes as string data,
resulting in garbage being sent.
Pass ctx->path correctly so listeners receive the valid cgroup path.
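The corrected emission, sketched (the object path is assumed here; the reason string is plumbed in from the policy that triggered the kill):

    r = sd_bus_emit_signal(bus,
                           "/org/freedesktop/oom1",            /* oomd's manager object path (assumed) */
                           "org.freedesktop.oom1.Manager",
                           "Killed",
                           "ss",
                           ctx->path,                          /* the cgroup path string, not ctx itself */
                           "memory-pressure");                 /* or "memory-used", per the API contract */
    if (r < 0)
            log_debug_errno(r, "Failed to emit Killed signal, ignoring: %m");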
Chris Down [Sat, 14 Feb 2026 16:40:14 +0000 (00:40 +0800)]
string-util: Prevent infinite loop pegging CPU on malformed ESC input
string_has_ansi_sequence() currently does this to look for ESC input:
t = memchr(s, 0x1B, ...)
So each iteration re-searches from the original start pointer. But if we
find an ESC byte that does *not* start a valid ANSI sequence (like "\x1B
", or an ESC at the end of the string), then ansi_sequence_length()
returns 0, and if that ESC is still in the search window, we will just
spin consuming 100% CPU forever.
Fix this by always advancing past rejected ESC bytes.
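The fixed scan, sketched (ansi_sequence_length() is the existing helper that returns 0 when an ESC starts no valid sequence; its exact signature is assumed here):

    bool string_has_ansi_sequence(const char *s) {
            size_t n = strlen(s);
            const char *t = s;

            for (;;) {
                    const char *esc = memchr(t, 0x1B, n - (size_t) (t - s));
                    if (!esc)
                            return false;

                    size_t l = ansi_sequence_length(esc, n - (size_t) (esc - s));
                    if (l > 0)
                            return true;

                    /* This ESC doesn't start a valid sequence: advance past it so
                     * the next memchr() can't find it again and spin forever. */
                    t = esc + 1;
            }
    }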