test: add testcase for unpriv machined nspawns reg + killing
Let's add a superficial test for the code we just added: spawn a
container unpriv, make sure registration fully worked, then kill it via
machinectl, to ensure it all works properly.
vmspawn systems might take quite a while to boot, in particular if they
go through UEFI and wait for a network lease. Hence let's increase the
start timeout to 2min (from 45s). We'll do that for both nspawn and
vmspawn, even though the UEFI delay certainly doesn't apply to nspawn
(but the DHCP delay still does).
This mimics the switch of the same name from nspawn: it controls whether
we expect a READY=1 message from the payload or not. Previously we'd
always expect that. This makes it configurable, just like it is in
nspawn.
There's one fundamental difference in behaviour though: in nspawn it
defaults to off, in vmspawn it defaults to on. (This is for historical
reasons. Ideally we'd default to on in both cases, but changing it would
be quite a compat break, both directly and indirectly, since timeouts
might get triggered.)
vmspawn: substantially beef up cgroup logic, to match more closely what nspawn does
This beefs up the cgroup logic, adding --slice= and --property= to
vmspawn the same way they already exist in nspawn.
There are a bunch of differences though: we don't delegate the cgroup
access in the allocated unit (since qemu wouldn't need that), and we do
registration via varlink not dbus. Hence, while this follows a similar
logic now, it differs in a lot of details.
One change in particular: when invoked on the command line
we'll only add the qemu instance to the allocated scope, not the vmspawn
process itself (this follows more closely how nspawn does this, where
only the container payload has its scope, not nspawn itself). This is
quite tricky to implement: unlike in nspawn we have auxiliary services
to start, with dependencies on the scope. This means we need to start the
scope early, so that we know the scope's name. But the command line to
invoke is only assembled from the data we learn about the auxiliary
services, hence much later. To address this we'll now fork off the child
that eventually will become qemu early, then move it to a scope, prepare
the cmdline and then very late send the cmdline (and the fds we want to
pass) to the prepared child, which then execs it.
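
The fork-early, exec-late pattern described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `fork_deferred_child()` and `send_cmdline()` are hypothetical, the command line travels over a plain pipe as NUL-separated strings, and the scope allocation and fd passing that the real code performs are left out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that blocks until it receives its
 * command line over a pipe, then execs it. Returns the child's PID and
 * stores the write end of the pipe in *ret_fd. */
static pid_t fork_deferred_child(int *ret_fd) {
        int p[2];
        pid_t pid;

        assert(ret_fd);

        if (pipe(p) < 0)
                return -1;

        pid = fork();
        if (pid < 0) {
                close(p[0]);
                close(p[1]);
                return -1;
        }

        if (pid == 0) {
                char buf[4096], *argv[64];
                size_t n = 0, argc = 0;

                close(p[1]);

                /* Read NUL-separated arguments until the parent closes its end. */
                while (n < sizeof(buf) - 1) {
                        ssize_t k = read(p[0], buf + n, sizeof(buf) - 1 - n);
                        if (k < 0)
                                _exit(127);
                        if (k == 0)
                                break;
                        n += (size_t) k;
                }
                buf[n] = 0;
                close(p[0]);

                for (size_t i = 0; i < n && argc < 63; i += strlen(buf + i) + 1)
                        argv[argc++] = buf + i;
                argv[argc] = NULL;

                if (argc > 0)
                        execvp(argv[0], argv);
                _exit(127);
        }

        /* Parent: this is the point where the child could be moved into
         * its scope, long before the command line is known. */
        close(p[0]);
        *ret_fd = p[1];
        return pid;
}

/* Send the finished command line to the waiting child and close the pipe,
 * which is the child's signal to exec. */
static int send_cmdline(int fd, const char *const argv[]) {
        int r = 0;

        for (size_t i = 0; argv[i] && r == 0; i++) {
                size_t len = strlen(argv[i]) + 1; /* include the trailing NUL */
                if (write(fd, argv[i], len) != (ssize_t) len)
                        r = -1;
        }

        if (close(fd) < 0)
                r = -1;
        return r;
}
```

A caller would invoke `fork_deferred_child()` first, do all the late preparation, then pass the final argument vector to `send_cmdline()` and reap the child with `waitpid()`.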
Just like in nspawn, there's a chance we need to PK authenticate the
registration, hence let's spawn off the agent for that during that
phase, and terminate it once we don't need it anymore.
This cleans up allocation of a scope unit for the container: when
invoked in user context we'll now allocate a scope through the per-user
service manager instead of the per-system manager. This makes a ton more
sense, since it's the user that invokes things after all. And given that
machined now can register containers in the user manager there's nothing
stopping us from cleaning this up.
Note that this means we'll connect to two busses if run unpriv: once to
the per-user bus to allocate the scope unit, and once to the per-system
bus to register it with machined.
machined: also track 'supervisor' process of a machine
So far, machined strictly tracked the "leader" process of a machine,
i.e. the topmost process that is actually the payload of the machine.
Its runtime also defines the runtime of the machine, and we can directly
interact with it if we need to, for example for containers to join the
namespaces, or kill it.
Let's optionally also track the "supervisor" process of a machine, i.e.
the host process that manages the payload if there is one. This is
generally useful info, but it is particularly useful because we might need
to communicate with it to shut down a machine without cooperation of the
payload. Traditionally we did this by simply stopping the unit of the
machine, but this is not doable now that the host machined can be used
to track per-user machines.
In the long run we probably want a more bespoke protocol between
machined and supervisors (so that we can execute other commands too,
such as requesting cooperative reboots/shutdowns), but that's for later.
Some environments call the concept "monitor" rather than "supervisor" or
use some other term. I stuck to "supervisor" because nspawn uses this,
and ultimately one name is as good as another.
And of course, in other implementations of VM managers or container managers
there might not be a single process tracking each VM/container. Because
of this, the concept of a supervisor is optional.
machined: use different polkit actions for registering and creating a machine
The difference between these two operations is large. "Registration" is
relatively superficial: all resources remain associated with the
invoking user, only the cgroup is reported to machined, which then keeps
track of the machine, too. OTOH for "creation" a scope is allocated in
system context, hence the invoked code will be owned by the system, and
its resource usage charged against the system.
Hence, use two distinct polkit actions for this, so that we can relax
access to registration, but keep access to creation tough.
This new helper takes both a PID and a pidfd ID, and initializes a
PidRef from it. It ensures they actually belong together and returns an
error if not.
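
A rough sketch of such a consistency check follows. It is not the actual helper: the struct and function names are hypothetical, and it assumes that the "pidfd ID" is the inode number of the pidfd, which only identifies a process uniquely on kernels where each pidfd has its own inode number.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a PID reference: PID, pidfd and the pidfd's ID. */
typedef struct ExamplePidRef {
        pid_t pid;
        int fd;          /* pidfd, or -1 */
        uint64_t fd_id;  /* assumed to be the inode number of the pidfd */
} ExamplePidRef;

/* Open a pidfd for 'pid' and verify that its ID matches 'expected_id'.
 * An expected_id of 0 means "unknown, accept whatever we find". */
static int example_pidref_set(ExamplePidRef *ret, pid_t pid, uint64_t expected_id) {
        struct stat st;
        int fd;

        assert(ret);

        if (pid <= 0)
                return -EINVAL;

#ifdef SYS_pidfd_open
        fd = (int) syscall(SYS_pidfd_open, pid, 0);
#else
        errno = ENOSYS;
        fd = -1;
#endif
        if (fd < 0)
                return -errno;

        if (fstat(fd, &st) < 0) {
                int e = errno;
                close(fd);
                return -e;
        }

        /* PID and pidfd ID do not belong together: refuse. */
        if (expected_id != 0 && (uint64_t) st.st_ino != expected_id) {
                close(fd);
                return -ESRCH;
        }

        *ret = (ExamplePidRef) { .pid = pid, .fd = fd, .fd_id = (uint64_t) st.st_ino };
        return 0;
}
```

The caller owns the returned `fd` and has to close it once the reference is no longer needed.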
The same concern as explained in #37960 also exists in
missing_syscall.h. If we build against a new enough glibc, a function we
want to use may already be provided by it, even though our baseline
glibc does not provide it. This is hard to detect in daily development.
This moves all prototypes of syscalls to the relevant headers, and missing
syscall functions are defined in the relevant .c files of the libc wrapper.
This way, we can use the usual headers as is, e.g. when we want to write
code with `move_mount()`, we can simply use sys/mount.h without checking
if it is supported by our baseline glibc.
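
To illustrate the idea, a fallback definition for a syscall that the baseline libc may lack could look roughly like this. The name `example_move_mount()` is hypothetical and chosen to avoid clashing with a libc that already provides `move_mount()`; the sketch only assumes that the kernel headers define `SYS_move_mount`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical fallback wrapper: invoke the raw syscall when libc has no
 * wrapper, and report ENOSYS if even the syscall number is unknown. */
static int example_move_mount(int from_dfd, const char *from_path,
                              int to_dfd, const char *to_path,
                              unsigned flags) {
#ifdef SYS_move_mount
        return (int) syscall(SYS_move_mount, from_dfd, from_path, to_dfd, to_path, flags);
#else
        (void) from_dfd; (void) from_path; (void) to_dfd; (void) to_path; (void) flags;
        errno = ENOSYS;
        return -1;
#endif
}
```

With the prototype exposed through the usual header and the definition living in a single .c file, callers do not need any per-call availability checks.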
conf-files: make conf-file enumerators provide more detailed information of enumerated files (#38006)
This introduces `struct ConfFile` that stores detailed information of an
enumerated file, and introduces `conf_files_list_full()` and friends
that provide results in `ConfFile`.
Then make udev, hwdb, catalog, and cat-files use the new function and
struct so that they do not read files outside of the specified root directory.
tree-wide: several cleanups for generating symbol lists and gperf files
- pass our system include directories to make generators use our libc
wrappers and latest kernel headers,
- include relevant headers in generated gperf file,
- use files() rather than find_program(), as the result of
find_program() cannot be passed to 'input' of custom_target(),
- move generate-bpf-delegate-configs.py to src/core/, as it is only used
by libcore.
selinux-util: downgrade log level to LOG_DEBUG when error code is zero
Previously, the logger was only used in error paths, but since
fe3f2ac0734e64dcd729b00992a6261cbf4cc846 the logger is also used in a
success path. Let's not log loudly on success.
Yu Watanabe [Sun, 29 Jun 2025 20:18:32 +0000 (05:18 +0900)]
pretty-print: several cleanups for cat_files()
- drop redundant error messages in cat_files(), as cat_file() internally
logs errors,
- show an empty line and filename before opening file, so that error
messages are not mixed up with the output of the previous file,
- drop unnecessary fflush(),
- use RET_GATHER() and continue to show files even if some files cannot
be shown.
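
The "gather and continue" behaviour from the last item can be sketched as below. This is an illustration only: `gather_error()` is a hypothetical stand-in that is assumed to remember the first failure, and `show_file()`/`show_files()` are simplified placeholders, not the functions touched by the commit.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in: remember the first failure in *ret, pass r through. */
static int gather_error(int *ret, int r) {
        assert(ret);

        if (*ret >= 0 && r < 0)
                *ret = r;
        return r;
}

/* Print a separator and the filename first, then the file contents. */
static int show_file(const char *path) {
        FILE *f;
        int c;

        printf("\n# %s\n", path);

        f = fopen(path, "r");
        if (!f) {
                int e = errno;
                fprintf(stderr, "Failed to open %s: %s\n", path, strerror(e));
                return -e;
        }

        while ((c = fgetc(f)) != EOF)
                putchar(c);

        fclose(f);
        return 0;
}

/* Show every file, even if some fail; return the first error seen. */
static int show_files(const char *const *paths, size_t n) {
        int ret = 0;

        for (size_t i = 0; i < n; i++)
                (void) gather_error(&ret, show_file(paths[i]));

        return ret;
}
```

The loop never breaks out early, so one unreadable file does not hide the remaining ones, while the caller still learns that something went wrong.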
```
         r = cg_create(SYSTEMD_CGROUP_CONTROLLER, test_a);
-        if (IN_SET(r, -EPERM, -EACCES, -EROFS)) {
+        if (IN_SET(r, -EPERM, -EACCES, -EROFS, -ENOENT)) {
                 log_info_errno(r, "Skipping %s: %m", __func__);
                 return;
         }
```
I confirmed that the `ERRNO_IS_NEG_FS_WRITE_REFUSED` macro is equivalent
to checking the first 3 error codes above, so the addition of the check
for `ENOENT` is still just as relevant as it was in 252, but adding it
into the macro would be inconsistent with its name, description, and
possible other uses. Hence, in this PR I'm adding the extra check into
the `if`.
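
The resulting condition can be sketched as two small predicates. Both function names are hypothetical; the first mirrors the three error codes the macro is said to cover, the second adds the extra `ENOENT` case that is kept outside of it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Mirrors the three "write refused" codes; takes a negative errno value. */
static bool example_fs_write_refused(int r) {
        return r == -EPERM || r == -EACCES || r == -EROFS;
}

/* The test is skipped on a write refusal, and additionally on -ENOENT. */
static bool example_should_skip_test(int r) {
        return example_fs_write_refused(r) || r == -ENOENT;
}
```

Keeping `ENOENT` in the caller leaves the meaning of the "write refused" check narrow, so other users of it are unaffected.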
Plumbing to perform SELinux checks in varlink API (#38146)
This PR does minimal changes to introduce varlink support. Ideally, the
code should switch to using `mac_selinux_get_our_label()` and new
`mac_selinux_get_peer_label()`. But I leave it for now to minimize
breakage. `mac_selinux_get_peer_label()` remains unused.
This is a prep step to merge
https://github.com/systemd/systemd/pull/38032
Add a small paragraph explaining how a BPF token works, how it is
created, and how it relates to the BPF filesystem.
Move all the relevant documentation into the PrivateBPF= section and
point all the BPFDelegate* options to it.
Introduce ERRNO_IS_FS_WRITE_REFUSED(), and use it in binfmt_mounted() (#38117)
- Introduces ERRNO_IS_FS_WRITE_REFUSED(), and applies it where
usable.
- Propagates unexpected errors from access_fd(), called by
binfmt_mounted(), to the caller.
- Renames binfmt_mounted() to binfmt_mounted_and_writable(), as it also
checks that the fs is writable.
- Voidifies one disable_binfmt() call in shutdown.c.
To implement --bind-user in systemd-vmspawn, we need a transient
version of these credentials. These are useful when the home directory
of the user is mounted into the container/vm and every trace of the user
will be (mostly) gone again when the container/vm is shut down.
Li Tian [Tue, 8 Jul 2025 06:44:35 +0000 (14:44 +0800)]
Add --entry-type=type1|type2 option to kernel-install.
Both kernel-core and kernel-uki-virt call kernel-install upon removal. An additional argument is needed to avoid complete removal of both the traditional kernel and the UKI.
Both EPEL 9 and 10 now have the packages we need except for dhcp-server
so let's get rid of the EPEL conditionals and simply skip the tests that
require dhcp-server on CentOS.
While we're at it, make sure we use the new Architecture=uefi match in
mkosi to simplify the uefi checks.
* 184472f0f1 mkosi-tools: make sure p11-kit dir exists when configuring module
* 9fb807884e mkosi-tools: Explicitly install p11-kit
* 9131877d60 Support matching against architectures with uefi support
* f1eab5a783 Rename sandbox verb to box
* d609f55d98 Fix /var/tmp directory cleanup
* 4997b9495c build(deps): bump github/codeql-action from 3.28.18 to 3.29.2
Even if we're not using --accept=, it's very useful to be able to
synchronize on systemd-socket-activate having bound to its listen
socket, so let's always send READY=1. This means the payload can't
send READY=1 anymore but it's doubtful whether that's useful in this
case in the first place.
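
For reference, sending READY=1 boils down to one datagram on the AF_UNIX socket named in `$NOTIFY_SOCKET`. The sketch below implements that by hand rather than through libsystemd; the function name `notify_ready()` is hypothetical, and the intent is that it would be called right after the listening socket has been bound.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Returns 1 if the message was sent, 0 if nobody asked for notifications,
 * or a negative errno-style value on failure. */
static int notify_ready(void) {
        static const char msg[] = "READY=1";
        struct sockaddr_un sa = { .sun_family = AF_UNIX };
        const char *path = getenv("NOTIFY_SOCKET");
        size_t len;
        ssize_t n;
        int fd, r;

        if (!path || !*path)
                return 0;

        len = strlen(path);
        if (len >= sizeof(sa.sun_path))
                return -E2BIG;
        memcpy(sa.sun_path, path, len);

        /* A leading '@' denotes a socket in the abstract namespace. */
        if (sa.sun_path[0] == '@')
                sa.sun_path[0] = 0;

        fd = socket(AF_UNIX, SOCK_DGRAM, 0);
        if (fd < 0)
                return -errno;

        n = sendto(fd, msg, sizeof(msg) - 1, 0, (struct sockaddr *) &sa,
                   (socklen_t) (offsetof(struct sockaddr_un, sun_path) + len));
        r = n < 0 ? -errno : 1;

        close(fd);
        return r;
}
```

Since the notification is sent by the process that owns the listening socket, whoever started it can wait for READY=1 before connecting.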
vmspawn: Use virtio-blk-pci for image instead of virtio-scsi-pci
We don't need a full blown SCSI controller just to present the main
root drive device to the VM. Let's simplify the storage stack by using
virtio-blk-pci instead.
Additionally, virtio-blk-pci is a builtin module in Arch and Fedora
which means we can do qemu direct kernel boot without needing an initrd.
vmspawn: Disable hpet for vmspawn x86 virtual machines
hpet is an emulated clocksource that is generally discouraged in favor
of kvm-clock or tsc for virtual machines. While vmspawn's virtual machines
already use kvm-clock, leaving hpet enabled causes qemu on the host to
consume a non-trivial amount of cpu, so let's disable the hpet feature since
we're not making use of it anyway.
ci: also set TEST_RUNNER environment variable in coverage test
Otherwise, integration-test-wrapper.py will fail.
```
Traceback (most recent call last):
File "/home/runner/work/systemd/systemd/test/integration-tests/integration-test-wrapper.py", line 693, in <module>
main()
~~~~^^
File "/home/runner/work/systemd/systemd/test/integration-tests/integration-test-wrapper.py", line 677, in main
runner = os.environ['TEST_RUNNER']
~~~~~~~~~~^^^^^^^^^^^^^^^
File "<frozen os>", line 717, in __getitem__
KeyError: 'TEST_RUNNER'
```