nikstur [Sun, 8 Feb 2026 13:22:28 +0000 (14:22 +0100)]
meson: guard symlinks in sysconfdir behind install_sysconfdir
Symlinks to files inside sysconfdir are now only installed if
install_sysconfdir=true (which is the default).
If sshconfdir, sshdconfdir, shellprofiledir are not inside sysconfdir and
install_sysconfdir=false, these symlinks are still installed to the
configured directory.
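The rule described above can be sketched as a small predicate. This is only an illustration in Python, not the actual meson logic; the function name and the example paths are assumptions made for the sketch.

```python
from pathlib import PurePosixPath


def should_install_symlink(target_dir: str, sysconfdir: str,
                           install_sysconfdir: bool) -> bool:
    """Decide whether a symlink placed in target_dir gets installed.

    Hypothetical helper: symlinks inside sysconfdir are only installed when
    install_sysconfdir is true; symlinks outside of it are always installed.
    """
    target = PurePosixPath(target_dir)
    conf = PurePosixPath(sysconfdir)
    inside_sysconfdir = target == conf or conf in target.parents
    return install_sysconfdir or not inside_sysconfdir
```

With sysconfdir=/etc and install_sysconfdir=false, a directory such as /etc/ssh/ssh_config.d would be skipped, while one below /usr/lib would still receive its symlink.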
Philip Withnall [Mon, 9 Feb 2026 12:13:51 +0000 (12:13 +0000)]
test: Add basic tests for path_split_prefix_filename()
These aren’t anything comprehensive, but provide some basic assurances
that it’s working correctly. In particular, they test its behaviour when
*both* the prefix and filename components are requested.
Split out from the original version of this function which was part
of #40236.
Signed-off-by: Philip Withnall <pwithnall@gnome.org>
Luca Boccassi [Thu, 5 Feb 2026 00:39:35 +0000 (00:39 +0000)]
journald: set a lower size limit for FDs from unpriv processes
Unprivileged processes can send 768M in an FD-based message to journald,
which will be malloc'ed in one go, likely causing memory issues.
Set the limit for unprivileged users to 24M.
Allow coredumps as an exception, since we always allowed storing
up to the 768M max core files in the journal.
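A minimal sketch of the resulting policy, in Python for illustration only. It assumes that "M" means MiB and that "privileged" means UID 0; how journald actually identifies privileged senders and coredump messages is not shown here, and all names are hypothetical.

```python
MIB = 1024 * 1024
ENTRY_SIZE_MAX = 768 * MIB         # overall cap mentioned above
ENTRY_SIZE_UNPRIV_MAX = 24 * MIB   # new cap for unprivileged senders


def fd_message_size_limit(sender_uid: int, is_coredump: bool) -> int:
    """Return the largest FD-based message accepted from this sender."""
    # Assumption: UID 0 stands in for "privileged"; coredumps keep the old cap.
    if sender_uid == 0 or is_coredump:
        return ENTRY_SIZE_MAX
    return ENTRY_SIZE_UNPRIV_MAX


def accept_fd_message(size: int, sender_uid: int, is_coredump: bool = False) -> bool:
    """Check the size before allocating a buffer for the message."""
    return size <= fd_message_size_limit(sender_uid, is_coredump)
```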
This contains the first two commits from #38764. While @daandemeyer
convinced me to base systemd-sysinstall on a new "bootctl link" rather
than "kernel-install", I think the refactorings I prepped as part of the
original work still make a lot of sense on their own, and I hope I
didn't do them for /dev/null.
tree-wide: symlink well-known Varlink service entry point sockets into /run/varlink/registry/ (#40590)
This is generally useful, but is particularly useful in context of
https://github.com/mvo5/varlink-proxy-rs which can expose a set of local
Varlink services via an HTTP bridge. The idea is that the sockets linked
into /run/varlink/registry/ are candidates for being exposed like that.
units: symlink well-known Varlink services into /run/varlink/registry/
So far we didn't provide any concept to enumerate local Varlink
services. Let's change that.
Let's define a very light-weight scheme for this: provide a well-known dir
/run/varlink/registry/ where services that implement public interfaces
can link their sockets into. When enumerating services it's thus
sufficient to enumerate inodes in that directory.
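Enumeration then boils down to listing a directory. A minimal sketch in Python, assuming the registry entries are symlinks named after the service they point to (the naming is an assumption, the function is hypothetical):

```python
import os


def enumerate_varlink_registry(registry_dir: str = "/run/varlink/registry") -> dict:
    """Map each entry name in the registry dir to its symlink target.

    Entries that are not symlinks are reported with a target of None.
    """
    with os.scandir(registry_dir) as it:
        entries = sorted(it, key=lambda e: e.name)
    return {
        e.name: os.readlink(e.path) if e.is_symlink() else None
        for e in entries
    }
```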
The use case for this is twofold:
1. It's simply very useful to be able to see which public services are
bound on the local system, for debugging/admin/development purposes.
2. At Amutable we'd like to optionally provide an HTTP-to-Varlink bridge
on individual nodes, that allows remote peers (after authentication)
to access local Varlink services. For that it's essential we know the
list of services and their entrypoints to expose; it would be
security-wise highly problematic for clients to provide AF_UNIX
entrypoint paths when connecting. Hence: let's instead just have a
dir with the public stuff, and let's ensure the HTTP-to-Varlink
bridge simply exposes that stuff, and nothing else.
Non-public interfaces (such as the oomd interfaces between PID 1 and
oomd), and interfaces with multiple implementors (such as the resolved
hook interface, or the metrics collection stuff) should not be linked
in.
This is inspired by the Varlink.org "registry" concept, briefly
explained here:
Note however that the described Varlink interface is not actually
implemented here. The directory is, however, introduced in a fashion that
conceptually matches the registry defined there, and would allow us to
implement the registry interface on top of it. (One of the reasons the
registry Varlink API is not implemented right now is that the URI format
it relies on is entirely unspecified in the Varlink docs right now. Some
research needs to be done to extract what's implemented in the reference
implementation and to determine how it maps to the Varlink entrypoint
address format systemd's own tooling currently uses.)
This primarily installs the symlinks via Symlinks= in unit files and via
a new tmpfiles.d/ drop-in. But since we touch all .socket units relating
to Varlink this also sets the FileDescriptorName= to varlink for each,
just to minimize differences and make things work more alike (the
services in question don't care about the name, so this doesn't change anything).
In one case we replace a pair of separate sockets for two closely
related varlink services by a socket and a symlink, so that we can
safely use Symlinks= to also install the registry symlinks.
mountfsd: do not cross mount boundaries when looking for parent of foreign UID range owned dirs
This is primarily paranoia: it might be possible for unpriv users to set
up mount hierarchies in unexpected ways when using userns. Hence let's
make protections more rigid: when looking for a parent dir of a foreign
UID owned dir tree, refuse to cross mount boundaries.
kernel-install: allocate "Context" object only in verb_xyz() functions, not already in run()
We soon want to add a Varlink interface to this, but that means that the
various parameters for the Context object will be sourced from a Varlink
message, not from the command line. Hence split apart the parsing logic
so that we always parse the command line into arg_xyz first, and then,
inside the verb_abc() calls, copy the data from there into the Context
object.
This reworks things a bit, so that the "Context" object can later be
allocated for each Varlink call separately. For example we define a
more precise CONTEXT_NULL that invalidates truly all fields, so that we
can discern "defaults" from "unspecified" later on.
When a cgroup is selected for termination, send varlink messages to
hooks registered in `/run/systemd/oomd.prekill-hooks/`.
oomd waits up to `PreKillTimeoutSec=` seconds for response before
proceeding with the kill.
Matteo Croce [Mon, 25 Aug 2025 15:13:00 +0000 (17:13 +0200)]
oomd: implement a prekill varlink event
When a cgroup is selected for termination, send varlink messages
to hooks registered in `/run/systemd/oomd.prekill-hooks/`.
oomd waits up to `PreKillHookTimeoutSec=` seconds for response
before proceeding with the kill.
The revert is needed because with the PreKill hook, oomd_cgroup_kill()
is not going to actually kill processes; it just creates the callbacks.
So the check is deferred to the real kill.
udev: Introduce uaccess for remote graphical sessions (#38516)
When systemd is compiled with group-render-mode=0660, only the active
seat gets access to the render devices through uaccess. Remote desktop
sessions like gnome-remote-desktop would be left with no hardware
rendering, because those sessions are not associated with a seat.
We solve the issue by granting uaccess to specifically tagged devices on
session start, if the session is marked with
XDG_SESSION_EXTRA_DEVICE_ACCESS.
udev-builtin-uaccess is refactored to grant multiple users access to a
device, taking into account the device's seat and all the active
EXTRA_DEVICE_ACCESS sessions.
report: keep track of varlink connections inside of Context object
Let's also move the Varlink connection management into the Context
object. Let's also switch to Set* for it, so that we get
auto-expanding behaviour.
It's one of the primary objects that make up the program "context"
conceptually, hence it also should be part of the Context object. This
allows us to just have it available if the Context object is seen.
report: do not treat an empty report dir as an issue
We should permit that the report varlink dir is created on the fly when
the first socket is bound there. Hence, let's treat a non-existent dir
as equivalent to an empty one.
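The pattern, sketched in Python for illustration (the function name is hypothetical, and this is not the code touched by the commit):

```python
import os


def list_report_sockets(directory: str) -> list:
    """List entries of the report dir, treating a missing dir as empty."""
    try:
        return sorted(os.listdir(directory))
    except FileNotFoundError:
        # The dir is created on the fly by the first socket bound there,
        # so its absence simply means "nothing to report yet".
        return []
```

Other errors, such as a permission problem, still propagate; only the missing directory is downgraded.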
We usually do this in our tree like this, do it here too.
Yu Watanabe [Fri, 6 Feb 2026 16:07:33 +0000 (01:07 +0900)]
daemon-util: downgrade log level on ECONNREFUSED and friends
This partially reverts 36c557f7d41441bbd98a8965348dfe8050fc9c98, which
introduced notify_remove_fd() that logs in LOG_DEBUG. However,
notify_remove_fd_warn() is still called by other library functions, e.g.
notify_push_fd(), and produces a warning message about the failure in
removing the fd from the fdstore on shutdown.
During shutdown process, we get the following logs:
```
systemd-udevd[370]: Failed to send notify message to '/run/systemd/notify': Connection refused
systemd-udevd[370]: Failed to remove file descriptor "config-serialization" from the store, ignoring: Connection refused
systemd-udevd[370]: Failed to send notify message to '/run/systemd/notify': Connection refused
systemd-udevd[370]: Failed to push serialization fd to service manager: Connection refused
```
Here, the 1st, 3rd, and 4th messages are in LOG_DEBUG, but the 2nd one
was in LOG_WARNING before this commit, and this makes it also in LOG_DEBUG.
Nick Rosbrook [Fri, 6 Feb 2026 16:38:47 +0000 (11:38 -0500)]
resolvectl: include ifindex when printing link-local DNS server
Historically, resolvectl status has not included the interface
specification for DNS servers with an IPv6 link-local address, since it
is technically somewhat redundant. But, adding this extra bit of
information makes it easier to copy-and-paste to use elsewhere, etc.
For example, the previous output:
Link 2 (enp34s0)
Current Scopes: DNS LLMNR/IPv4 LLMNR/IPv6
Protocols: +DefaultRoute LLMNR=resolve -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: fe80::861e:a3ff:feb1:f8e7
DNS Servers: 192.168.1.12 192.168.1.13 fe80::861e:a3ff:feb1:f8e7
DNS Domain: lan
now becomes:
Link 2 (enp34s0)
Current Scopes: DNS LLMNR/IPv4 LLMNR/IPv6
Protocols: +DefaultRoute LLMNR=resolve -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: fe80::861e:a3ff:feb1:f8e7%2
DNS Servers: 192.168.1.12 192.168.1.13 fe80::861e:a3ff:feb1:f8e7%2
DNS Domain: lan
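The formatting rule can be sketched as follows. This is a Python illustration using the standard ipaddress module, not resolvectl's own code; the function name is hypothetical.

```python
import ipaddress


def format_dns_server(address: str, ifindex: int) -> str:
    """Append %ifindex to IPv6 link-local addresses, leave others untouched."""
    ip = ipaddress.ip_address(address)
    if ip.version == 6 and ip.is_link_local and ifindex > 0:
        return f"{ip}%{ifindex}"
    return str(ip)
```

For the example above, format_dns_server("fe80::861e:a3ff:feb1:f8e7", 2) yields "fe80::861e:a3ff:feb1:f8e7%2", while 192.168.1.12 is printed unchanged.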
bootctl: return recognizable Varlink error when we cannot determine the boot entry token
When running "bootctl install" on an empty --root= dir, we don't know
which token to use, and the operation will fail. Make sure to return an
explicit error about this.
This introduces a recognizable low-level error for this (EUNATCH), and
then turns this into a recognizable Varlink error.
(I made sure that the old low-level error EINVAL wasn't load-bearing,
and it is safe to change this.)
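A rough sketch of the two-step mapping, in Python for illustration. The Varlink error name below is purely hypothetical (the actual name is not given in this message), and the helper is not part of bootctl.

```python
import errno

# Hypothetical error name, for illustration only.
ERROR_NO_ENTRY_TOKEN = "org.example.BootControl.EntryTokenUnavailable"


def varlink_error_for(ret: int):
    """Map a negative errno-style return value to a Varlink error name.

    Returns None when there is no dedicated error, in which case a caller
    would fall back to generic error reporting.
    """
    if ret == -errno.EUNATCH:
        return ERROR_NO_ENTRY_TOKEN
    return None
```

The point of the dedicated errno is that the upper layer can translate it without guessing, which a catch-all value such as EINVAL would not allow.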
bootctl: rework bootctl-install.c in preparation of varlinkification
This primarily introduces a context object for each operation, so that
we later can instantiate one for each varlink op we execute, and can
safely lifecycle all operation parameters for each subsequent call.
This also reworks the root dir handling to be fd based.
This drops explicit CHASE_TRIGGER_AUTOFS from a bunch of chase() calls
that operate within the ESP/XBOOTLDR, while it keeps them in place for the
chase() calls that find the top-level ESP/XBOOTLDR inode. This reflects
the fact that we explicitly support autofs for the ESP/XBOOTLDR itself,
but below it expect no further mounts, just plain VFAT.
This changes behaviour of the interaction of $KERNEL_INSTALL_CONF_ROOT
and --root=: the former will now be taken relative to the host root, and
will no longer be affected by --root=. This follows similar behaviour in
kernel-install, where it is very explicitly documented in the man page
(the bootctl man page does not document this). This is strictly speaking
a compat breakage, but I think a very minor, niche one, and I think the
pain inflicted by this change is probably negligible compared to the
unsystematic behaviour relative to kernel-install.
CODING_STYLE: document how to handle kernel compat
Let's define a way to mark codepaths that are subject to
deletion once the kernel baseline reaches a certain version, to make it
easier to find these cases.
While we are at it, introduce a whole section in CODING_STYLE about
kernel version compat.
I followed the new scheme in #39621, but we can merge the coding style
guidelines on this already.
In my testing I switched building my locally run CI integration tests to
ArchLinux and realized that for that the default sizes don't work
anymore, the images are larger than the space allocated. Let's bump the
size by 50% for the relevant disk images.
When systemd is compiled with group-render-mode=0660, only the active seat
gets access to the render devices through uaccess. Remote desktop sessions
like gnome-remote-desktop would be left with no hardware rendering, because
those sessions are not associated with a seat.
Tag the render nodes with "xaccess" so that access is also granted to remote
sessions created with XDG_SESSION_EXTRA_DEVICE_ACCESS=1
udev: Grant sessions access to devices tagged with xaccess
Grant access to devices tagged with "xaccess" on session start, if the session
was created with XDG_SESSION_EXTRA_DEVICE_ACCESS=1.
udev-builtin-uaccess is refactored to grant multiple users access to a device,
taking into account the device's seat and all the active EXTRA_DEVICE_ACCESS
sessions.
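The access computation described above might look roughly like this. It is a simplified Python model, not the udev-builtin-uaccess code; the Session fields and the exact matching rules are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Session:
    uid: int
    seat: Optional[str]          # None for sessions without a seat (e.g. remote)
    active: bool
    extra_device_access: bool    # created with XDG_SESSION_EXTRA_DEVICE_ACCESS=1


def users_with_access(device_seat: str, device_tags: set, sessions: list) -> set:
    """Collect the UIDs that should be granted access to one device."""
    uids = set()
    for s in sessions:
        if not s.active:
            continue
        # Classic uaccess: the active session on the device's seat.
        if "uaccess" in device_tags and s.seat is not None and s.seat == device_seat:
            uids.add(s.uid)
        # xaccess: any active session that asked for extra device access.
        if "xaccess" in device_tags and s.extra_device_access:
            uids.add(s.uid)
    return uids
```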
login: Add XDG_SESSION_EXTRA_DEVICE_ACCESS variable for additional access
A session created with XDG_SESSION_EXTRA_DEVICE_ACCESS will be granted
additional powers.
Exactly which powers are granted is going to be defined by udevd.
The matrix before was setting accel values to follow normal device
orientation, but the accel values must match the panel orientation that
in these devices is 90 degrees CCW.
Indicate how the panel is mounted in the comment. Could be interesting
to do it also for other devices because when desktop environments do it
right the user could be unaware of the panel mounting and could think
monitor-sensor output is bogus.
nsresourced: Ensure that all user namespaces are cleaned-up
The code here assumes that free_user_ns() is called for every single
user namespace. That however has never been the case and the logic for
free_user_ns() is a bit more involved.
A nested user namespace pins its parent user namespace. IOW, the
lifetime of the parent user namespaces is at least as long as the child
user namespaces.
If a parent user namespace becomes unused (no namespace file descriptors
or task using it anymore) then it will stick around and its lifetime
is still bound to the child user namespace.
free_user_ns() takes advantage of that behavior. If a child user
namespace is freed and its parent user namespace is already unused
then free_user_ns() will free both the child and the parent user
namespace. This means a single free_user_ns() frees two user namespaces.
Hence, the bpf program never sees the parent user namespace being freed.
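The effect can be reproduced with a toy reference-counting model. This is a deliberately simplified Python simulation, not kernel code: the class, the refcount bookkeeping and the hook_calls list (standing in for a probe attached to the entry of free_user_ns()) are all assumptions made for illustration.

```python
class UserNS:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.refs = 1              # reference held by the creator
        if parent is not None:
            parent.refs += 1       # a child pins its parent


def free_user_ns(ns, hook_calls, freed):
    hook_calls.append(ns.name)     # a probe on function entry fires once
    while True:
        freed.append(ns.name)
        parent = ns.parent
        if parent is None:
            return
        parent.refs -= 1           # drop the pin held by the freed child
        if parent.refs > 0:
            return
        ns = parent                # parent became unused: free it in the same call


def put_user_ns(ns, hook_calls, freed):
    ns.refs -= 1
    if ns.refs == 0:
        free_user_ns(ns, hook_calls, freed)
```

Dropping the parent's own reference first and the child's afterwards results in a single free_user_ns() call that frees both namespaces, so hook_calls only ever contains the child.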
We can fix this by piggy-backing on another function that is called for
every single user namespace being freed. This requires CONFIG_SYSCTL but
systemd doesn't work without that anyway.
The return type needs to change to a scalar type as required by libbpf.
Long-term what we need is appropriate LSM infrastructure for this
including hooks that get called on namespace destruction.
Thanks to Daan De Meyer for figuring out that the cast is needed.
Signed-off-by: Christian Brauner <brauner@kernel.org>
Daan De Meyer [Sat, 24 Jan 2026 19:52:14 +0000 (20:52 +0100)]
mountfsd: Always open_tree() in mount namespace of peer
open_tree() will fail with EINVAL when passed a directory file descriptor
that comes from another mount namespace. While this should be fixed in a
future kernel, let's work around the issue for now by entering the mount
namespace of the peer if needed and calling open_tree() there and then
passing the fd back to the mountfsd process.
Mike Yuan [Thu, 5 Feb 2026 00:32:59 +0000 (01:32 +0100)]
mountpoint-util: rework name_to_handle_at() unique mount id handling
name_to_handle_at_try_unique_mntid_fid() in its current form is
ill-designed for various reasons:
* AT_HANDLE_FID requires file system support, while unique mount id
is a VFS concept hence is always available if supported. Hence
the fallback for AT_HANDLE_MNT_ID_UNIQUE should be independent
of fid.
* The request for AT_HANDLE_MNT_ID_UNIQUE can be identified via
specifying ret_unique_mnt_id, no need for opening up the control
to caller (and currently the function simply doesn't handle
mismatch between ret params and flags).
* The caller cannot realistically differentiate whether the returned
mount id is actually unique.
* The path_get_unique_mnt_id() fallback did not handle AT_SYMLINK_FOLLOW.
Let's instead move the statx() fallback into name_to_handle_at_loop()
directly, and revamp interaction of ret_mnt_id/ret_unique_mnt_id:
if both are set, it indicates that the caller can handle both, hence
set what we have and return 0/1 for whether we managed to acquire
the unique one.
The !ret_handle && ret_mnt_id logic is removed. Let's not rely on
undocumented bizarre behavior, and it's unused anyway.
path_get_mnt_id_at() exists for a reason...