DaanDeMeyer [Sat, 27 Dec 2025 19:37:02 +0000 (20:37 +0100)]
dissect: Introduce --copy-ownership= to configure chown behavior
Currently, if we're copying a file, we won't copy the owner UID/GID
from the source. If we're copying a directory, we will copy the owner
UID/GID from the source. Let's give users a bit more control over this
behavior by introducing --copy-ownership= which will default to the
current behavior but allows users to explicitly enable/disable copying
of ownership.
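The default described above behaves like a three-state setting. A minimal sketch in Python, assuming hypothetical value names ("auto", "yes", "no") that this commit message does not spell out:

```python
from enum import Enum


class CopyOwnership(Enum):
    # Hypothetical value names; the commit message only describes the behavior.
    AUTO = "auto"  # current behavior: directories yes, regular files no
    YES = "yes"
    NO = "no"


def should_copy_ownership(mode: CopyOwnership, is_directory: bool) -> bool:
    """Decide whether the source's owner UID/GID is copied to the target."""
    if mode is CopyOwnership.AUTO:
        return is_directory
    return mode is CopyOwnership.YES
```

With the default, only directories keep the source ownership; the explicit values override that for both files and directories.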
DaanDeMeyer [Fri, 26 Dec 2025 21:18:29 +0000 (22:18 +0100)]
dissect: Make --mount/--unmount/--with work unprivileged
Let's check for CAP_SYS_ADMIN instead of root for these, and make
unmounting more graceful if we can't access the backing loop device
because of permission issues. This allows mounting and unmounting images
from an unprivileged mount namespace. The actual files in the image will
end up owned by nobody:nobody because we'll be in an unprivileged user
namespace, but assuming the directory permissions are not too strict, this
still allows interacting with the image in useful ways.
DaanDeMeyer [Fri, 26 Dec 2025 20:51:00 +0000 (21:51 +0100)]
dissect: Don't use private userns for --copy-to/--copy-from
These actions interact with the host. The former needs privileges to
write into the image, the latter needs privileges to write on the host.
Neither will have the privileges required if the image is attached under
a private userns, hence, don't use one.
Daan De Meyer [Mon, 2 Feb 2026 13:23:40 +0000 (14:23 +0100)]
sd-varlink: Introduce varlink_set_sentinel()
Streaming methods which are not used as a continuous subscription but
instead only send a series of objects all end up with the same workaround
to be able to figure out when to send sd_varlink_reply() or sd_varlink_notify().
Let's generalize this in sd-varlink itself.
Let's introduce the concept of a sentinel, which is a reply that will be sent
by sd-varlink if no other reply was queued by a method callback. The sentinel
is configured with varlink_set_sentinel(). If a sentinel is configured,
sd_varlink_reply() can be used more than once in streaming methods to queue
multiple values to stream to the client. The last queued reply is not sent
until the callback finishes. When the callback finishes, the last reply is
sent without "continues: more". If no reply was queued, the sentinel is sent.
This allows always using only sd_varlink_reply() in such streaming methods and
leaves sd_varlink_notify() available solely for continuous subscription
streaming methods, where we never use sd_varlink_reply() and instead disconnect
when the server exits.
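The queuing rule described above can be modelled in a few lines. This is a Python sketch of the described semantics only, not the sd-varlink implementation; run_streaming_method() and the message dictionaries are hypothetical stand-ins:

```python
from typing import Callable, Dict, List, Optional


def run_streaming_method(
    callback: Callable[[Callable[[Dict], None]], None],
    sentinel: Optional[Dict],
) -> List[Dict]:
    """Model of the sentinel logic; returns the messages put on the wire."""
    wire: List[Dict] = []
    pending: Optional[Dict] = None  # last queued reply, held back for now

    def reply(parameters: Dict) -> None:
        nonlocal pending
        if pending is not None:
            # A newer reply arrived, so the previous one was not the last.
            wire.append({"parameters": pending, "continues": True})
        pending = parameters

    callback(reply)

    if pending is not None:
        # Callback finished: flush the final reply without "continues".
        wire.append({"parameters": pending})
    elif sentinel is not None:
        # Nothing was queued at all: send the configured sentinel instead.
        wire.append(sentinel)
    return wire
```

A callback that queues two replies yields one message flagged as continuing followed by a final one; a callback that queues nothing yields just the sentinel.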
Mike Yuan [Tue, 10 Feb 2026 22:59:07 +0000 (23:59 +0100)]
terminal-util: handle the case where no system console is active (#40630)
/dev/console might have no backing driver, in which case
/sys/class/tty/console/active is empty. Unlike get_kernel_consoles(),
resolve_dev_console() currently proceeds with empty devnode, resulting
in setup_input() -> acquire_terminal() emitting -EISDIR as we're trying
to open /dev/. Let's catch this and report -ENXIO.
Mike Yuan [Fri, 6 Feb 2026 01:07:05 +0000 (02:07 +0100)]
terminal-util: handle the case where no system console is active
/dev/console might have no backing driver, in which case
/sys/class/tty/console/active is empty. Unlike get_kernel_consoles(),
resolve_dev_console() currently proceeds with empty devnode,
resulting in setup_input() -> acquire_terminal() emitting -EISDIR
as we're trying to open /dev/. Let's catch this and report -ENXIO.
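A sketch of the check in Python, for illustration only. resolve_console_devnode() is a hypothetical stand-in for the C function, and picking the last listed console is an assumption about how multiple entries are handled:

```python
import errno


def resolve_console_devnode(active: str) -> str:
    """Map the contents of /sys/class/tty/console/active to a device node."""
    names = active.split()
    if not names:
        # No backing driver: fail early instead of ending up with "/dev/".
        raise OSError(errno.ENXIO, "no system console is active")
    # Assumption: if several consoles are listed, the last one backs /dev/console.
    return "/dev/" + names[-1]
```

Without the emptiness check, an empty file would produce the path "/dev/", which is the directory open that led to -EISDIR.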
These operations do quite different things; they just share two common
funcs. Let's split them out into separate files.
This also splits up verb_list() into separate calls for the three
operations. This actually fixes issues, as for status/list we want
"unpriv" ESP discovery logic, but for the other two we really should
have privileged discovery logic.
This is preparation for adding "bootctl link" later, but this makes
sense either way, I am sure.
Luca Boccassi [Tue, 10 Feb 2026 13:11:52 +0000 (13:11 +0000)]
sysupdate: Split update into acquire and install verbs (#40236)
Using roughly the approach described in
https://gitlab.gnome.org/GNOME/gnome-software/-/merge_requests/2004#note_2145880.
Basically, copying in-progress downloads to a file/partition with a
predictable prefix, and then moving to a predictable ‘pending’ prefix
when ready to install.
Kai Lüke [Thu, 5 Feb 2026 17:51:07 +0000 (18:51 +0100)]
repart: Discard only once
The indirect discard in mkfs.btrfs on the loop device mapped to the
region on disk can hang and fail the first-boot creation of the rootfs.
Since a discard has already been done, there is no need to do it twice.
Skipping the second one should avoid the failure in mkfs.btrfs in most
cases.
Keep track of whether the direct discard worked and skip the mkfs.btrfs
discard if it did. This still leaves the case where mkfs.btrfs can hang
when the direct discard did not succeed and mkfs.btrfs tries again, but
since the conditions are much the same, this case may not be easy to
trigger. If the problem still shows up and the kernel isn't fixed soon,
we can still disable the mkfs discard, at least for btrfs.
nikstur [Sun, 8 Feb 2026 13:22:28 +0000 (14:22 +0100)]
meson: guard symlinks in sysconfdir behind install_sysconfdir
Symlinks to files inside sysconfdir are now only installed if
install_sysconfdir=true (which is the default).
If sshconfdir, sshdconfdir, shellprofiledir are not inside sysconfdir and
install_sysconfdir=false, these symlinks are still installed to the
configured directory.
Philip Withnall [Mon, 9 Feb 2026 12:13:51 +0000 (12:13 +0000)]
test: Add basic tests for path_split_prefix_filename()
These aren’t anything comprehensive, but provide some basic assurances
that it’s working correctly. In particular, they test its behaviour when
*both* the prefix and filename components are requested.
Split out from the original version of this function which was part
of #40236.
Signed-off-by: Philip Withnall <pwithnall@gnome.org>
Luca Boccassi [Thu, 5 Feb 2026 00:39:35 +0000 (00:39 +0000)]
journald: set a lower size limit for FDs from unpriv processes
Unprivileged processes can send 768M in a FD-based message to journald,
which will be malloc'ed in one go, likely causing memory issues.
Set the limit for unprivileged users to 24M.
Allow coredumps as an exception, since we always allowed storing
up to the 768M max core files in the journal.
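The resulting policy, sketched in Python. The constant and parameter names are illustrative, "M" is assumed to mean MiB, and how journald decides that a sender is privileged is not covered here:

```python
MIB = 1024 * 1024
ENTRY_SIZE_MAX = 768 * MIB         # previous limit, still used for privileged senders
ENTRY_SIZE_UNPRIV_MAX = 24 * MIB   # new limit for unprivileged senders


def fd_message_size_limit(sender_is_privileged: bool, is_coredump: bool) -> int:
    """Return the largest FD-based message accepted from this sender."""
    if sender_is_privileged or is_coredump:
        # Coredumps keep the old limit so large core files can still be stored.
        return ENTRY_SIZE_MAX
    return ENTRY_SIZE_UNPRIV_MAX
```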
Philip Withnall [Mon, 12 Jan 2026 16:43:46 +0000 (16:43 +0000)]
test: Expand sysupdate test to cover split acquire/install updates
This essentially means the sysupdate tests are now run twice: once with
a monolithic update (`sysupdate update`) and once with a split update
(`sysupdate acquire; sysupdate install`).
Signed-off-by: Philip Withnall <pwithnall@gnome.org>
Philip Withnall [Wed, 31 Dec 2025 00:48:54 +0000 (00:48 +0000)]
sysupdate: Add acquire and install verbs
These expose the two parts of ‘update’, so that update sets can be
acquired (downloaded) and installed (applied) in separate actions at
different times. For example, this could allow a load of update sets to
be acquired when online, and later applied when offline.
Signed-off-by: Philip Withnall <pwithnall@gnome.org>
Helps: https://github.com/systemd/systemd/issues/34814
Philip Withnall [Wed, 31 Dec 2025 00:05:05 +0000 (00:05 +0000)]
sysupdate: Vacuum partial/pending instances first
Modify the vacuum implementation to preferentially vacuum partial or
pending transfers first (unless protected) as they are meant to be
fairly transitory, and ones which are hanging around have probably been
forgotten about and/or are out of date.
Signed-off-by: Philip Withnall <pwithnall@gnome.org>
Helps: https://github.com/systemd/systemd/issues/34814
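The ordering can be pictured as a stable sort. This Python sketch uses a hypothetical Instance type and state labels, and assumes that protected instances are skipped entirely and that the input is already ordered oldest first:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    version: str
    state: str  # "partial", "pending" or "installed" (labels assumed)
    protected: bool = False


def vacuum_order(instances: List[Instance]) -> List[Instance]:
    """Return removal candidates with partial/pending instances first."""
    candidates = [i for i in instances if not i.protected]
    # sorted() is stable, so the incoming order is kept within each group.
    return sorted(candidates, key=lambda i: i.state not in ("partial", "pending"))
```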
Philip Withnall [Wed, 31 Dec 2025 00:02:06 +0000 (00:02 +0000)]
sysupdate: Implement acquire and install steps for transfers
Instead of using a random temporary path for file transfers, use a
predictable one which indicates whether the transfer is partially
complete or pending installation. Similarly for partitions.
This is another step towards being able to split the ‘update’ step into
‘acquire’ and ‘install’.
Signed-off-by: Philip Withnall <pwithnall@gnome.org>
Helps: https://github.com/systemd/systemd/issues/34814
Philip Withnall [Tue, 30 Dec 2025 23:49:47 +0000 (23:49 +0000)]
sysupdate: Allow instances to be partial or pending
If we allow target instances to be partial or pending, we can build on
top of this to allow updates to be split into two phases: ‘acquire’ (which
takes an available source instance and copies it (temporarily partial) to
a pending target instance) and ‘install’ (which takes a pending target
instance and installs it as an installed target instance).
This commit introduces a file/directory and partition prefix naming
scheme to identify partial and pending instances.
Signed-off-by: Philip Withnall <pwithnall@gnome.org>
Helps: https://github.com/systemd/systemd/issues/34814
This contains the first two commits from #38764. While @daandemeyer
convinced me to base systemd-sysinstall on a new "bootctl link" rather
than "kernel-install", I think the refactorings I prepped as part of the
original work still make a lot of sense on their own, and I hope I
didn't do them for /dev/null.
tree-wide: symlink well-known Varlink service entry point sockets into /run/varlink/registry/ (#40590)
This is generally useful, but is particularly useful in context of
https://github.com/mvo5/varlink-proxy-rs which can expose a set of local
Varlink services via a HTTP bridge. The idea is that the sockets linked
into /run/varlink/registry/ are candidates for being exposed like that.
units: symlink well-known Varlink services into /run/varlink/registry/
So far we didn't provide any concept to enumerate local Varlink
services. Let's change that.
Let's define a very light-weight scheme for this: provide a well-known dir
/run/varlink/registry/ where services that implement public interfaces
can link their sockets into. When enumerating services it's thus
sufficient to enumerate inodes in that directory.
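Enumeration then needs nothing beyond a directory listing. A Python sketch; list_varlink_services() is a hypothetical helper, and treating a missing directory as "no services" is an assumption:

```python
import os
from typing import Dict


def list_varlink_services(registry: str = "/run/varlink/registry") -> Dict[str, str]:
    """Return {entry name: resolved socket path} for each registry entry."""
    try:
        entries = sorted(os.scandir(registry), key=lambda e: e.name)
    except FileNotFoundError:
        # Assumption: a missing registry dir means no public services.
        return {}
    # Each entry is expected to be a symlink to the service's socket.
    return {e.name: os.path.realpath(e.path) for e in entries}
```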
The usecase for this is twofold:
1. It's simply very useful to be able to see which public services are
bound on the local system, for debugging/admin/development purposes.
2. At Amutable we'd like to optionally provide a HTTP-to-Varlink bridge
on individual nodes, that allows remote peers (after authentication)
to access local Varlink services. For that it's essential we know the
list of services and their entrypoints to expose; it would be
security-wise highly problematic for clients to provide AF_UNIX
entrypoint paths when connecting. Hence, let's instead just have a
dir with the public stuff, and let's ensure the HTTP-to-Varlink
bridge simply exposes that stuff, and nothing else.
Non-public interfaces (such as the oomd interfaces between PID 1 and
oomd), and interfaces with multiple implementors (such as the resolved
hook interface, or the metrics collection stuff) should not be linked
in.
This is inspired by the Varlink.org "registry" concept, briefly
explained here:
Note, however, that the described Varlink interface is not actually
implemented here. The directory is, however, introduced in a fashion that
conceptually matches the registry defined there, and would allow us to
implement the registry interface on top of it. (One of the reasons the
registry Varlink API is not implemented right now is that the URI format
it relies on is entirely unspecified in the Varlink docs right now. Some
research needs to be done to extract what's implemented in the reference
implementation and to determine how it maps to the Varlink entrypoint
address format systemd's own tooling currently uses.)
This primarily installs the symlinks via Symlinks= in unit files and via
a new tmpfiles.d/ drop-in. But since we touch all .socket units relating
to Varlink this also sets the FileDescriptorName= to varlink for each,
just to minimize differences and make things work more alike (the
services in question don't care about the name, so this doesn't change
anything for them).
In one case we replace a pair of separate sockets for two closely
related varlink services by a socket and a symlink, so that we can
safely use Symlinks= to also install the registry symlinks.
mountfsd: do not cross mount boundaries when looking for parent of foreign UID range owned dirs
This is primarily paranoia: it might be possible for unpriv users to set
up mount hierarchies in unexpected ways when using userns. Hence let's
make protections more rigid: when looking for a parent dir of a foreign
UID owned dir tree, refuse to cross mount boundaries.
kernel-install: allocate "Context" object only in verb_xyz() functions, not already in run()
We soon want to add a Varlink interface to this, but that means that the
various parameters for the Context object will be sourced from a Varlink
message, not from the command line. Hence split apart the parsing logic
so that we always parse the command line into arg_xyz first, and then,
inside the verb_abc() calls copy the data from there into the Context
object.
This reworks things a bit, so that the "Context" object can later be
allocated for each Varlink call separately. For example we define a
more precise CONTEXT_NULL that invalidates truly all fields, so that we
can discern "defaults" from "unspecified" later on.
When a cgroup is selected for termination, send varlink messages to
hooks registered in `/run/systemd/oomd.prekill-hooks/`.
oomd waits up to `PreKillTimeoutSec=` seconds for response before
proceeding with the kill.
Matteo Croce [Mon, 25 Aug 2025 15:13:00 +0000 (17:13 +0200)]
oomd: implement a prekill varlink event
When a cgroup is selected for termination, send varlink messages
to hooks registered in `/run/systemd/oomd.prekill-hooks/`.
oomd waits up to `PreKillHookTimeoutSec=` seconds for response
before proceeding with the kill.
The revert is needed because with the PreKill hook, oomd_cgroup_kill()
is not going to really kill processes; it just creates the callbacks.
So the check is deferred to the real kill.
udev: Introduce uaccess for remote graphical sessions (#38516)
When systemd is compiled with group-render-mode=0660, only the active
seat gets access to the render devices through uaccess. Remote desktop
sessions like gnome-remote-desktop would be left with no hardware
rendering, because those sessions are not associated with a seat.
We solve the issue by granting uaccess to specifically tagged devices on
session start, if the session is marked with
XDG_SESSION_EXTRA_DEVICE_ACCESS.
udev-builtin-uaccess is refactored to grant multiple users access to a
device, taking into account the device's seat and all the active
EXTRA_DEVICE_ACCESS sessions.
report: keep track of varlink connections inside of Context object
Let's also move the Varlink connection management into the Context
object. Let's also switch to Set* for it, so that we get
auto-expanding behaviour.
It's one of the primary objects that make up the program "context"
conceptually, hence it also should be part of the Context object. This
allows us to just have it available if the Context object is seen.
report: do not treat an empty report dir as an issue
We should permit that the report varlink dir is created on the fly when
the first socket is bound there. Hence, let's treat a non-existent dir
as equivalent to an empty one.
We usually do this in our tree like this, do it here too.