Philip Withnall [Wed, 31 Dec 2025 00:48:54 +0000 (00:48 +0000)]
sysupdate: Add acquire and install verbs
These expose the two parts of ‘update’, so that update sets can be
acquired (downloaded) and installed (applied) in separate actions at
different times. For example, this could allow a load of update sets to
be acquired when online, and later applied when offline.
Signed-off-by: Philip Withnall <pwithnall@gnome.org>
Helps: https://github.com/systemd/systemd/issues/34814
Philip Withnall [Wed, 31 Dec 2025 00:05:05 +0000 (00:05 +0000)]
sysupdate: Vacuum partial/pending instances first
Modify the vacuum implementation to preferentially vacuum partial or
pending transfers first (unless protected) as they are meant to be
fairly transitory, and ones which are hanging around have probably been
forgotten about and/or are out of date.
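
The ordering described above could be sketched roughly as follows. The `Instance` type and its fields are hypothetical and not taken from the sysupdate sources. The sketch also assumes that protected instances are skipped entirely and that the input is already ordered oldest first.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instance:
    version: str             # hypothetical stand-in for an instance's version
    state: str               # "partial", "pending" or "installed"
    protected: bool = False  # assumption: protected instances are never vacuumed

def vacuum_order(instances):
    """Return the unprotected instances in the order they would be vacuumed."""
    candidates = [i for i in instances if not i.protected]
    # False sorts before True, so partial/pending instances come first.
    # sorted() is stable, so the input order is kept within each group.
    return sorted(candidates, key=lambda i: i.state not in ("partial", "pending"))
```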
Signed-off-by: Philip Withnall <pwithnall@gnome.org>
Helps: https://github.com/systemd/systemd/issues/34814
Philip Withnall [Wed, 31 Dec 2025 00:02:06 +0000 (00:02 +0000)]
sysupdate: Implement acquire and install steps for transfers
Instead of using a random temporary path for file transfers, use a
predictable one which indicates whether the transfer is partially
complete or pending installation. Similarly for partitions.
This is another step towards being able to split the ‘update’ step into
‘acquire’ and ‘install’.
Signed-off-by: Philip Withnall <pwithnall@gnome.org>
Helps: https://github.com/systemd/systemd/issues/34814
Philip Withnall [Tue, 30 Dec 2025 23:49:47 +0000 (23:49 +0000)]
sysupdate: Allow instances to be partial or pending
If we allow target instances to be partial or pending, we can build on
top of this to allow updates to be split into two phases: ‘acquire’ (which
takes an available source instance and copies it (temporarily partial) to
a pending target instance); and ‘install’ (which takes a pending target
instance and installs it as an installed target instance).
This commit introduces a file/directory and partition prefix naming
scheme to identify partial and pending instances.
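
A prefix-based scheme like this could be modelled as below. This is only a sketch: the two prefixes are placeholders invented for the example (the commit message does not spell out the real ones), and the function names are hypothetical.

```python
# Placeholder prefixes, NOT the ones sysupdate actually uses.
PARTIAL_PREFIX = ".example-partial."
PENDING_PREFIX = ".example-pending."

def classify(name):
    """Split a file name into (state, name without its state prefix)."""
    if name.startswith(PARTIAL_PREFIX):
        return "partial", name[len(PARTIAL_PREFIX):]
    if name.startswith(PENDING_PREFIX):
        return "pending", name[len(PENDING_PREFIX):]
    return "installed", name

def finish_acquire(name):
    """End of 'acquire': a partial instance is renamed to a pending one."""
    state, base = classify(name)
    if state != "partial":
        raise ValueError(f"{name!r} is not a partial instance")
    return PENDING_PREFIX + base

def install(name):
    """'install': a pending instance is renamed to its final name."""
    state, base = classify(name)
    if state != "pending":
        raise ValueError(f"{name!r} is not a pending instance")
    return base
```

Because the state is encoded in the name, a later run can tell from a directory listing alone which instances still need to be installed.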
Signed-off-by: Philip Withnall <pwithnall@gnome.org>
Helps: https://github.com/systemd/systemd/issues/34814
This contains the first two commits from #38764. While @daandemeyer
convinced me to base systemd-sysinstall on a new "bootctl link" rather
than "kernel-install", I think the refactorings I prepped as part of the
original work still make a lot of sense on their own, and I hope I
didn't do them for /dev/null.
tree-wide: symlink well-known Varlink service entry point sockets into /run/varlink/registry/ (#40590)
This is generally useful, but is particularly useful in context of
https://github.com/mvo5/varlink-proxy-rs which can expose a set of local
Varlink services via an HTTP bridge. The idea is that the sockets linked
into /run/varlink/registry/ are candidates for being exposed like that.
units: symlink well-known Varlink services into /run/varlink/registry/
So far we didn't provide any concept to enumerate local Varlink
services. Let's change that.
Let's define a very light-weight scheme for this: provide a well-known dir
/run/varlink/registry/ into which services that implement public
interfaces can link their sockets. When enumerating services it's thus
sufficient to enumerate inodes in that directory.
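
Enumerating the registry then amounts to listing one directory. A minimal sketch, assuming each entry is a symlink to a socket (the function name is made up for the example):

```python
import os

REGISTRY_DIR = "/run/varlink/registry"  # directory named in the commit message

def list_registry(path=REGISTRY_DIR):
    """Map each entry name in the registry dir to the path it points at."""
    services = {}
    with os.scandir(path) as entries:
        for entry in entries:
            # Entries are expected to be symlinks to sockets; anything else
            # is reported under its own path.
            if entry.is_symlink():
                services[entry.name] = os.readlink(entry.path)
            else:
                services[entry.name] = entry.path
    return services
```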
The use case for this is twofold:
1. It's simply very useful to be able to see which public services are
bound on the local system, for debugging/admin/development purposes.
2. At Amutable we'd like to optionally provide an HTTP-to-Varlink bridge
on individual nodes, that allows remote peers (after authentication)
to access local Varlink services. For that it's essential we know the
list of services and their entrypoints to expose; it would be
security-wise highly problematic for clients to provide AF_UNIX
entrypoint paths when connecting. Hence: let's instead just have a
dir with the public stuff, and let's ensure the HTTP-to-Varlink
bridge simply exposes that stuff, and nothing else.
Non-public interfaces (such as the oomd interfaces between PID 1 and
oomd), and interfaces with multiple implementors (such as the resolved
hook interface, or the metrics collection stuff) should not be linked
in.
This is inspired by the Varlink.org "registry" concept, briefly
explained here:
Note however that the described Varlink interface is not actually
implemented here. The directory is introduced in a fashion that
conceptually matches the registry defined there, and would allow us to
implement the registry interface on top of it. (One of the reasons the
registry Varlink API is not implemented right now is that the URI format
it relies on is entirely unspecified in the Varlink docs. Some
research needs to be done to extract what's implemented in the reference
implementation and to determine how it maps to the Varlink entrypoint
address format systemd's own tooling currently uses.)
This primarily installs the symlinks via Symlinks= in unit files and via
a new tmpfiles.d/ drop-in. But since we touch all .socket units relating
to Varlink this also sets the FileDescriptorName= to varlink for each,
just to minimize differences and make things work more alike (the
services in question don't care about the name, so this doesn't change
their behaviour).
In one case we replace a pair of separate sockets for two closely
related varlink services by a socket and a symlink, so that we can
safely use Symlinks= to also install the registry symlinks.
mountfsd: do not cross mount boundaries when looking for parent of foreign UID range owned dirs
This is primarily paranoia: it might be possible for unpriv users to set
up mount hierarchies in unexpected ways when using userns. Hence let's
make protections more rigid: when looking for a parent dir of a foreign
UID owned dir tree, refuse to cross mount boundaries.
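
The idea can be illustrated with a loose model. This is not the mountfsd code: `is_foreign` is a hypothetical caller-supplied predicate, and `os.path.ismount()` stands in for whatever mount boundary check the real implementation performs.

```python
import os

def find_non_foreign_parent(path, is_foreign):
    """Walk up from `path` to the first ancestor that is not foreign-owned.

    `is_foreign(dir_path)` is a hypothetical predicate telling whether a
    directory is owned by the foreign UID range.
    """
    current = os.path.realpath(path)
    while True:
        if os.path.ismount(current):
            # Going further up would leave this mount: refuse.
            raise PermissionError(
                f"refusing to cross mount boundary at {current}")
        parent = os.path.dirname(current)
        if not is_foreign(parent):
            return parent
        current = parent
```

The loop always terminates, since the root directory is itself a mount point.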
kernel-install: allocate "Context" object only in verb_xyz() functions, not already in run()
We soon want to add a Varlink interface to this, but that means that the
various parameters for the Context object will be sourced from a Varlink
message, not from the command line. Hence split apart the parsing logic
so that we always parse the command line into arg_xyz first, and then,
inside the verb_abc() calls, copy the data from there into the Context
object.
This reworks things a bit, so that the "Context" object can later be
allocated for each Varlink call separately. For example we define a
more precise CONTEXT_NULL that invalidates truly all fields, so that we
can discern "defaults" from "unspecified" later on.
When a cgroup is selected for termination, send varlink messages to
hooks registered in `/run/systemd/oomd.prekill-hooks/`.
oomd waits up to `PreKillTimeoutSec=` seconds for response before
proceeding with the kill.
Matteo Croce [Mon, 25 Aug 2025 15:13:00 +0000 (17:13 +0200)]
oomd: implement a prekill varlink event
When a cgroup is selected for termination, send varlink messages
to hooks registered in `/run/systemd/oomd.prekill-hooks/`.
oomd waits up to `PreKillHookTimeoutSec=` seconds for response
before proceeding with the kill.
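
The "notify, wait with a deadline, then proceed" pattern can be sketched as below. This is an illustration only: plain callables stand in for the Varlink calls to registered hooks, the function name is made up, and a single deadline shared by all hooks is an assumption.

```python
import concurrent.futures

def wait_for_hooks(hooks, timeout_sec):
    """Notify all hooks, wait at most `timeout_sec`, return who answered.

    `hooks` maps a hook name to a callable; each callable is a hypothetical
    stand-in for a Varlink call to one registered hook.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=max(1, len(hooks)))
    futures = {pool.submit(call): name for name, call in hooks.items()}
    # Assumption: one deadline shared by all hooks, not one per hook.
    done, _pending = concurrent.futures.wait(futures, timeout=timeout_sec)
    pool.shutdown(wait=False)  # do not block on hooks that never answered
    return sorted(futures[f] for f in done)
```

The caller would go ahead with the kill once this returns, whether or not every hook answered.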
The revert is needed because with the PreKill hook, oomd_cgroup_kill()
is not going to actually kill processes; it just creates the callbacks.
So the check is deferred to the real kill.
udev: Introduce uaccess for remote graphical sessions (#38516)
When systemd is compiled with group-render-mode=0660, only the active
seat gets access to the render devices through uaccess. Remote desktop
sessions like gnome-remote-desktop would be left with no hardware
rendering, because those sessions are not associated with a seat.
We solve the issue by granting uaccess to specifically tagged devices on
session start, if the session is marked with
XDG_SESSION_EXTRA_DEVICE_ACCESS.
udev-builtin-uaccess is refactored to grant multiple users access to a
device, taking into account the device's seat and all the active
EXTRA_DEVICE_ACCESS sessions.
report: keep track of varlink connections inside of Context object
Let's also move the Varlink connection management into the Context
object. Let's also switch to Set* for it, so that we get
auto-expanding behaviour.
It's one of the primary objects that make up the program "context"
conceptually, hence it also should be part of the Context object. This
allows us to just have it available if the Context object is seen.
report: do not treat an empty report dir as an issue
We should permit that the report varlink dir is created on the fly when
the first socket is bound there. Hence, let's treat a non-existent dir
as equivalent to an empty one.
We usually do this in our tree like this, do it here too.
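
In sketch form, the pattern is simply to map "directory not found" to an empty listing. The function name here is hypothetical:

```python
import os

def list_report_sockets(path):
    """List entries of the report dir; a missing dir counts as empty."""
    try:
        return sorted(os.listdir(path))
    except FileNotFoundError:
        # The dir may only be created once the first socket is bound there.
        return []
```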
Yu Watanabe [Fri, 6 Feb 2026 16:07:33 +0000 (01:07 +0900)]
daemon-util: downgrade log level on ECONNREFUSED and friends
This partially reverts 36c557f7d41441bbd98a8965348dfe8050fc9c98, which
introduced notify_remove_fd() that logs in LOG_DEBUG. However,
notify_remove_fd_warn() is still called by other library functions, e.g.
notify_push_fd(), and produces a warning message about the failure to
remove the fd from the fdstore on shutdown.
During the shutdown process, we get the following logs:
```
systemd-udevd[370]: Failed to send notify message to '/run/systemd/notify': Connection refused
systemd-udevd[370]: Failed to remove file descriptor "config-serialization" from the store, ignoring: Connection refused
systemd-udevd[370]: Failed to send notify message to '/run/systemd/notify': Connection refused
systemd-udevd[370]: Failed to push serialization fd to service manager: Connection refused
```
Here, the 1st, 3rd, and 4th messages are in LOG_DEBUG, but the 2nd one
was in LOG_WARNING before this commit; this commit downgrades it to
LOG_DEBUG as well.
Nick Rosbrook [Fri, 6 Feb 2026 16:38:47 +0000 (11:38 -0500)]
resolvectl: include ifindex when printing link-local DNS server
Historically, resolvectl status has not included the interface
specification for DNS servers with an IPv6 link-local address, since it
is technically somewhat redundant. But, adding this extra bit of
information makes it easier to copy-and-paste to use elsewhere, etc.
For example, the previous output:
Link 2 (enp34s0)
Current Scopes: DNS LLMNR/IPv4 LLMNR/IPv6
Protocols: +DefaultRoute LLMNR=resolve -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: fe80::861e:a3ff:feb1:f8e7
DNS Servers: 192.168.1.12 192.168.1.13 fe80::861e:a3ff:feb1:f8e7
DNS Domain: lan
now becomes:
Link 2 (enp34s0)
Current Scopes: DNS LLMNR/IPv4 LLMNR/IPv6
Protocols: +DefaultRoute LLMNR=resolve -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: fe80::861e:a3ff:feb1:f8e7%2
DNS Servers: 192.168.1.12 192.168.1.13 fe80::861e:a3ff:feb1:f8e7%2
DNS Domain: lan
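
The formatting rule shown above can be expressed in a few lines. This is a sketch of the rule only, not the resolvectl code, and the function name is made up:

```python
import ipaddress

def format_dns_server(address, ifindex):
    """Append %ifindex to IPv6 link-local addresses, leave others alone."""
    ip = ipaddress.ip_address(address)
    if ip.version == 6 and ip.is_link_local:
        return f"{ip}%{ifindex}"
    return str(ip)
```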
bootctl: return recognizable Varlink error when we cannot determine the boot entry token
When running "bootctl install" on an empty --root= dir, we don't know
which token to use, and the operation will fail. Make sure to return an
explicit error about this.
This introduces a recognizable low-level error for this (EUNATCH), and
then turns this into a recognizable Varlink error.
(I made sure that the old low-level error EINVAL wasn't load-bearing,
and it is safe to change this.)
bootctl: rework bootctl-install.c in preparation of varlinkification
This primarily introduces a context object for each operation, so that
we later can instantiate one for each varlink op we execute, and can
safely lifecycle all operation parameters for each subsequent call.
This also reworks the root dir handling to be fd based.
This drops explicit CHASE_TRIGGER_AUTOFS from a bunch of chase() calls
that operate within the ESP/XBOOTLDR, while it keeps them in place for the
chase() calls that find the top-level ESP/XBOOTLDR inode. This reflects
the fact that we explicitly support autofs for the ESP/XBOOTLDR itself,
but below it expect no further mounts, just plain VFAT.
This changes behaviour of the interaction of $KERNEL_INSTALL_CONF_ROOT
and --root=: the former will now be taken relative to the host root, and
will no longer be affected by --root=. This follows similar behaviour in
kernel-install, where it is very explicitly documented in the man page
(the bootctl man page does not document this). This is strictly speaking
a compat breakage, but I think a very minor, niche one, and I think the
pain inflicted by this change is probably negligible compared to the
cost of behaving differently from kernel-install.
CODING_STYLE: document how to handle kernel compat
Let's define a way to mark codepaths that are subject to
deletion once the kernel baseline reaches a certain version, to make it
easier to find these cases.
While we are at it, introduce a whole section in CODING_STYLE about
kernel version compat.
I followed the new scheme in #39621, but we can merge the coding style
guidelines on this already.
In my testing I switched building my locally run CI integration tests to
ArchLinux and realized that for that the default sizes don't work
anymore, the images are larger than the space allocated. Let's bump the
size by 50% for the relevant disk images.
When systemd is compiled with group-render-mode=0660, only the active seat
gets access to the render devices through uaccess. Remote desktop sessions
like gnome-remote-desktop would be left with no hardware rendering, because
those sessions are not associated with a seat.
Tag the render nodes with "xaccess" so that access is also granted to remote
sessions created with XDG_SESSION_EXTRA_DEVICE_ACCESS=1.
udev: Grant sessions access to devices tagged with xaccess
Grant access to devices tagged with "xaccess" on session start, if the session
was created with XDG_SESSION_EXTRA_DEVICE_ACCESS=1.
udev-builtin-uaccess is refactored to grant multiple users access to a device,
taking into account the device's seat and all the active EXTRA_DEVICE_ACCESS
sessions.
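
As a simplified model of that decision: the `Session` type and function below are hypothetical and not taken from udev-builtin-uaccess, and the real logic likely considers more state than this.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Session:
    uid: int
    active: bool
    seat: Optional[str] = None         # None models a session without a seat
    extra_device_access: bool = False  # XDG_SESSION_EXTRA_DEVICE_ACCESS=1

def users_with_access(device_seat, device_tags, sessions):
    """Return the UIDs that should be granted access to one device."""
    uids = set()
    for s in sessions:
        if not s.active:
            continue
        # Classic uaccess: the active session on the device's seat.
        if "uaccess" in device_tags and s.seat is not None and s.seat == device_seat:
            uids.add(s.uid)
        # xaccess: any active session that asked for extra device access.
        if "xaccess" in device_tags and s.extra_device_access:
            uids.add(s.uid)
    return uids
```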
login: Add XDG_SESSION_EXTRA_DEVICE_ACCESS variable for additional access
A session created with XDG_SESSION_EXTRA_DEVICE_ACCESS will be granted
additional powers.
Exactly which powers are granted is going to be defined by udevd.
The matrix before was setting accel values to follow normal device
orientation, but the accel values must match the panel orientation that
in these devices is 90 degrees CCW.
Indicate how the panel is mounted in the comment. Could be interesting
to do it also for other devices because when desktop environments do it
right the user could be unaware of the panel mounting and could think
monitor-sensor output is bogus.
nsresourced: Ensure that all user namespaces are cleaned-up
The code here assumes that free_user_ns() is called for every single
user namespace. That however has never been the case and the logic for
free_user_ns() is a bit more involved.
A nested user namespace pins its parent user namespace. IOW, the
lifetime of the parent user namespaces is at least as long as the child
user namespaces.
If a parent user namespace becomes unused (no namespace file descriptors
or tasks using it anymore) then it will stick around, and its lifetime
is still bound to the child user namespace.
free_user_ns() takes advantage of that behavior. If a child user
namespace is freed and its parent user namespace is already unused then
free_user_ns() will free both the child and the parent user
namespace. This means a single free_user_ns() frees two user namespaces.
Hence, the bpf program never sees the parent user namespace being freed.
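
A toy model of the reference counting described above shows why a probe on free_user_ns() undercounts. This is an illustration, not kernel code; the class and the `log` dictionary are invented for the example.

```python
class UserNS:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.refs = 1             # held by the namespace's own users
        if parent is not None:
            parent.refs += 1      # a child pins its parent

def put_user_ns(ns, log):
    """Drop one reference; free the namespace when it was the last one."""
    ns.refs -= 1
    if ns.refs == 0:
        free_user_ns(ns, log)

def free_user_ns(ns, log):
    log["calls"] += 1             # what a probe on free_user_ns() would see
    while True:
        log["freed"].append(ns.name)
        parent = ns.parent
        if parent is None:
            break
        parent.refs -= 1          # the freed child no longer pins its parent
        if parent.refs != 0:
            break
        ns = parent               # parent was already unused: free it as well
```

Dropping the parent's own reference first and the child's afterwards frees two namespaces with a single free_user_ns() call in this model.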
We can fix this by piggy-backing on another function that is called for
every single user namespace being freed. This requires CONFIG_SYSCTL but
systemd doesn't work without that anyway.
The return type needs to change to a scalar type as required by libbpf.
Long-term what we need is appropriate LSM infrastructure for this
including hooks that get called on namespace destruction.
Thanks to Daan De Meyer for figuring out that the cast is needed.
Signed-off-by: Christian Brauner <brauner@kernel.org>
Daan De Meyer [Sat, 24 Jan 2026 19:52:14 +0000 (20:52 +0100)]
mountfsd: Always open_tree() in mount namespace of peer
open_tree() will fail with EINVAL when passed a directory file descriptor
that comes from another mount namespace. While this should be fixed in a
future kernel, let's work around the issue for now by entering the mount
namespace of the peer if needed and calling open_tree() there and then
passing the fd back to the mountfsd process.