The offending commit fails to account for the case where
we have fewer lines before --until= than what's specified
in --lines=. Aside from that, if --grep= + --lines=+N are used,
we might also seek forward in the middle of the loop,
breaking the --until= boundary.
Let's turn the logic around then. Context.until_safe will
be set iff we're certain that there's enough to output,
and it gets reset whenever we seek forward.
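For illustration, a minimal sketch of that flag logic (hypothetical field and helper names, not the actual journalctl code):
```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch only: until_safe is set once we know enough entries exist before
 * the --until= boundary, and cleared again on every forward seek (e.g. for
 * --grep= combined with --lines=+N). */
typedef struct Context {
        bool until_safe;              /* safe to honor the --until= cut-off */
        uint64_t until_usec;
        uint64_t lines_wanted, lines_seen;
} Context;

static void context_on_seek_forward(Context *c) {
        /* A forward seek invalidates what we knew about available lines. */
        c->until_safe = false;
        c->lines_seen = 0;
}

static void context_on_entry(Context *c) {
        if (++c->lines_seen >= c->lines_wanted)
                c->until_safe = true;   /* enough lines before --until= exist */
}

static bool context_may_stop(const Context *c, uint64_t entry_usec) {
        /* Only honor --until= once we are certain there's enough to output. */
        return c->until_safe && entry_usec > c->until_usec;
}
```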
test-network: stop varlink.socket before stopping networkd.service
To avoid the following warnings:
```
systemd-networkd-tests.py[3139]: Stopping 'systemd-networkd.service', but its triggering units are still active:
systemd-networkd-tests.py[3139]: systemd-networkd-varlink.socket
```
fsck,quotacheck: drop support for traditional /forcefsck, /fastboot, and /forcequotacheck files
Instead, please use the kernel command line options with the same name.
I am not sure whether these files are actually System V compliant, but they
are at least a very traditional way to control fsck or quotacheck.
However, the concept behind the files is fundamentally broken, especially
for fsck: when we want to fsck the root filesystem, we first need to read
the flag file from that very filesystem, which may itself be broken...
Let's drop these traditional ways to control fsck and quotacheck.
We already support kernel command line options that control the same
behavior. It may also be worth providing a way to control them via
credentials in the future.
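For reference, checking for such an option boils down to scanning /proc/cmdline; a simplified sketch (real code would reuse systemd's proper kernel command line parser and also honor fsck.mode=/fsck.repair=):
```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Simplified sketch: check whether a word such as "forcefsck" appears on the
 * kernel command line. */
static bool kernel_cmdline_has_word(const char *word) {
        char line[4096] = "";
        bool found = false;

        FILE *f = fopen("/proc/cmdline", "re");
        if (!f)
                return false;

        if (fgets(line, sizeof(line), f))
                for (char *p = strtok(line, " \t\n"); p; p = strtok(NULL, " \t\n"))
                        if (strcmp(p, word) == 0) {
                                found = true;
                                break;
                        }

        fclose(f);
        return found;
}

int main(void) {
        printf("force fsck: %s\n", kernel_cmdline_has_word("forcefsck") ? "yes" : "no");
        return 0;
}
```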
Since kernel 4.18, BTRFS_IOC_GET_SUBVOL_INFO exists to query subvolume
metadata without privileges. This is much better than the manual approach
of looking up objects in the fs tree (which requires privileges). Let's use
it, and drop the old code (since 4.18 is older than our baseline).
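For reference, a minimal unprivileged query using that ioctl might look like this (error handling trimmed; not the dropped systemd code):
```c
/* Query subvolume metadata without privileges via BTRFS_IOC_GET_SUBVOL_INFO
 * (available since kernel 4.18). The fd must refer to the subvolume itself. */
#include <fcntl.h>
#include <linux/btrfs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
        if (argc != 2) {
                fprintf(stderr, "Usage: %s /path/to/subvolume\n", argv[0]);
                return 1;
        }

        int fd = open(argv[1], O_RDONLY|O_CLOEXEC|O_DIRECTORY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        struct btrfs_ioctl_get_subvol_info_args info = {0};
        if (ioctl(fd, BTRFS_IOC_GET_SUBVOL_INFO, &info) < 0) {
                perror("BTRFS_IOC_GET_SUBVOL_INFO");
                close(fd);
                return 1;
        }

        printf("subvol id: %llu generation: %llu name: %s\n",
               (unsigned long long) info.treeid,
               (unsigned long long) info.generation,
               info.name);
        close(fd);
        return 0;
}
```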
It turns out checking sysfs is not 100% reliable to figure out whether
the firmware had TPM2 support enabled or not. For example with EDK2 arm64, the
default upstream build config bundles TPM2 support with SecureBoot support,
so if the latter is disabled, TPM2 is also unavailable. But still, the ACPI
TPM2 table is created just as if it was enabled. So /sys/firmware/acpi/tables/TPM2
exists and looks correct, but there are no measurements, neither the firmware
nor the loader/stub can do them, and /sys/kernel/security/tpm0/binary_bios_measurements
does not exist.
The loader can use the appropriate UEFI protocol to check, which gives a more
definitive answer. Given that userspace can also make use of this information,
export the bitmask of active PCR banks as-is. If it's not 0, then we can be
sure a working TPM2 was available in EFI mode.
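Userspace can then pick the bitmask up from the loader interface; a rough sketch of reading it via efivarfs is below. Note the variable name and payload layout here are assumptions for illustration only, check the Boot Loader Interface documentation of your systemd version for the actual name.
```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
        /* Hypothetical variable name; the vendor GUID is the systemd Boot
         * Loader Interface GUID. efivarfs prefixes the payload with a 32-bit
         * attributes field; the bank bitmask is assumed to be a raw
         * little-endian 32-bit value here. */
        const char *path =
                "/sys/firmware/efi/efivars/"
                "LoaderTpm2ActivePcrBanks-4a67b082-0a4c-41cf-b6c7-440b29bb8c4f";

        FILE *f = fopen(path, "re");
        if (!f) {
                perror("fopen");
                return 1;
        }

        uint32_t attrs = 0, banks = 0;
        if (fread(&attrs, sizeof(attrs), 1, f) != 1 ||
            fread(&banks, sizeof(banks), 1, f) != 1) {
                fprintf(stderr, "short read\n");
                fclose(f);
                return 1;
        }
        fclose(f);

        printf("active TPM2 PCR bank bitmask: 0x%" PRIx32 " (%s)\n",
               banks, banks != 0 ? "working TPM2 was available in EFI mode"
                                 : "no working TPM2");
        return 0;
}
```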
vmspawn: Run auxiliary daemons inside scope instead of separate service (#38047)
nspawn: Prepare --bind-user= logic for reuse in systemd-vmspawn
Aside from the usual boilerplate of moving the shared logic to shared/,
we also rework the implementation of --bind-user= to be similar to what
we'll do in systemd-vmspawn. Instead of messing with the nspawn container
user namespace, we use idmapped mounts to map the user's home directory on
the host to the mapped uid in the container.
Ideally we'd also use the "userdb.transient" credentials to provision the
user records, but this would only work for booted containers, whereas the
current logic works for non-booted containers as well.
Aside from being similar to how we'll implement --bind-user= in vmspawn,
using idmapped mounts also allows supporting --bind-user= without having to
use --private-users=.
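For illustration, the core of the idmapped-mount approach is the open_tree()/mount_setattr()/move_mount() sequence; a standalone sketch with placeholder paths (requiring the appropriate privileges, and not the nspawn code) follows.
```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/mount.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
        /* User namespace whose UID/GID mapping the mount should apply; in
         * practice this would be the container's namespace, e.g.
         * /proc/$CONTAINER_PID/ns/user. The PID here is a placeholder. */
        int userns_fd = open("/proc/12345/ns/user", O_RDONLY|O_CLOEXEC);
        if (userns_fd < 0) {
                perror("open userns");
                return 1;
        }

        /* Detach a clone of the host home directory (placeholder path). */
        int tree_fd = syscall(SYS_open_tree, AT_FDCWD, "/home/testuser",
                              OPEN_TREE_CLONE|OPEN_TREE_CLOEXEC);
        if (tree_fd < 0) {
                perror("open_tree");
                return 1;
        }

        /* Attach the ID mapping to the detached mount. */
        struct mount_attr attr = {
                .attr_set = MOUNT_ATTR_IDMAP,
                .userns_fd = userns_fd,
        };
        if (syscall(SYS_mount_setattr, tree_fd, "", AT_EMPTY_PATH,
                    &attr, sizeof(attr)) < 0) {
                perror("mount_setattr");
                return 1;
        }

        /* Move the idmapped mount to its destination inside the container
         * tree (placeholder path). */
        if (syscall(SYS_move_mount, tree_fd, "", AT_FDCWD,
                    "/var/lib/machines/mymachine/home/testuser",
                    MOVE_MOUNT_F_EMPTY_PATH) < 0) {
                perror("move_mount");
                return 1;
        }

        return 0;
}
```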
vmspawn: Run auxiliary daemons inside scope instead of separate service
Currently, vmspawn is in this really weird state where vmspawn itself
and qemu will inherit the caller's execution environment but the auxiliary
daemons it spawns will run in a fully pristine environment in the service
manager. In practice, this causes issues as checks for whether auxiliary
daemons are installed happen in the caller's execution environment but they
might not exist in the spawned service's execution environment.
A good example of where this causes issues is trying to use systemd-vmspawn
in our CI. We use mkosi in CI to run systemd-vmspawn in a custom userspace
with all the necessary tools available, but systemd-vmspawn then tries to
spawn services that run these tools using the host userspace, where the
tools are not available or too old and hence systemd-vmspawn fails to start.
Let's make things more consistent and allow using systemd-vmspawn in CI at
the same time by having systemd-vmspawn spawn auxiliary daemons itself
instead of having the service manager spawn them. We use
systemd-socket-activate to still have socket activation for these services,
even though we now spawn them ourselves. To make sure we wait for
systemd-socket-activate to bind to its socket before continuing, we use the
new general fork_notify() helper.
Why not support both "online" and "offline" operation? systemd-vmspawn is not
well tested as is and supporting two completely separate modes for spawning
auxiliary daemons will drastically increase the surface area for bugs. Given
there doesn't seem to be a major benefit to running daemons in services, it
seems better to only support offline operation and not both. Should we want
separate resource control for the auxiliary daemons in the future, we can
move them into separate scopes if needed.
As a bonus, this approach allows us to get rid of the extra complexity of
having to fork off the qemu process first so we can allocate a scope for it
that the other services bind to. This means large parts of 0fc45c8d20ad46ab9be0d8f29b16e606e0dd44ca are reverted by this commit.
Credential data can potentially get very large. Passing it all via
the command line is rather messy. Let's pass all the credential data
via files instead to both make the final command line less verbose
and reduce the chance of us running into command line size limits if
many or large credentials are used.
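One plausible shape of this, for illustration (paths are placeholders, and whether the data ends up referenced via fw_cfg or another qemu mechanism is not specified here): write the credential to a file and point qemu at the file instead of inlining the data.
```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Illustrative only: store credential data in a file and build a qemu
 * argument that references the file rather than embedding the data. The
 * fw_cfg name prefix follows systemd's credential import convention; the
 * exact transport vmspawn uses is not shown by this sketch. */
static char *write_credential_file(const char *dir, const char *name,
                                   const void *data, size_t size) {
        char *path = NULL;
        if (asprintf(&path, "%s/%s", dir, name) < 0)
                return NULL;

        int fd = open(path, O_WRONLY|O_CREAT|O_EXCL|O_CLOEXEC, 0600);
        if (fd < 0 || write(fd, data, size) != (ssize_t) size) {
                if (fd >= 0)
                        close(fd);
                free(path);
                return NULL;
        }

        close(fd);
        return path;
}

int main(void) {
        const char *secret = "hunter2";

        /* Placeholder directory; the real thing would use a private runtime
         * directory with tight permissions. */
        char *path = write_credential_file("/tmp", "mycred", secret, strlen(secret));
        if (!path)
                return 1;

        char *arg = NULL;
        if (asprintf(&arg, "name=opt/io.systemd.credentials/mycred,file=%s", path) < 0)
                return 1;

        printf("qemu argument: -fw_cfg %s\n", arg);

        free(arg);
        free(path);
        return 0;
}
```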
nspawn: Don't clear idmapping if we're not doing an idmapped mount
We only need to clear the existing idmapping if we're going to be
replacing it with another idmapping. Otherwise we should keep the
existing idmapping in place.
systemctl/halt: drop support for calling in SysV init script
Traditionally, halt is called at the end of the init script on
reboot/shutdown. To support that use case, we previously read the current
runlevel from utmp and set the force flag on reboot/shutdown.
This drops support for that use case.
Note, while this is neither supported nor tested anymore, the command can
hopefully still be used at the end of a SysV init script by specifying -ff.
systemctl: move functions in systemctl-sysv-compat.[ch]
- parse_shutdown_time_spec() is used only by systemctl-compat-shutdown.c,
- talk_initctl() and action_to_runlevel() are used only by systemctl-compat-telinit.c,
- the exit code enum is widely used in systemctl, hence moved to systemctl-util.h.
No functional change, preparation for later changes.
core/cgroup: remove deserialization for "cpuacct-usage-base"
This has been superseded by "cpu-usage-base" ever since
the introduction of cgroup v2. With upgrading from, and thus
deserializing the state of, cgroup v1 systems having become
impossible, it is eligible for removal.
Since 90fa161b5ba29d58953e9f08ddca49121b51efe6, --bind= or Bind=
settings for the coverage directory do not work in managed mode:
```
[ 158.105361] systemd-nspawn[3718]: Failed to open tree and set mount attributes: Operation not permitted
[ 158.105364] systemd-nspawn[3718]: Failed to clone /coverage: Operation not permitted
[ 158.118655] systemd-nspawn[3707]: (sd-namespace) failed with exit status 1.
```
Let's tentatively skip the test case when running with coverage enabled.
This moves coverage.h to src/coverage/, and specifies the path to coverage.h
with the files() directive, so that it can be included even when located
outside of the include directories. Otherwise, libc-wrapper cannot be
built when the -Db_coverage=true option is enabled.
The mentioned commit already switched the scope unit's "pids" deserialization
to call unit_watch_pid(), meaning all later invocations
in scope_coldplug() are no-ops. Remove the cruft altogether.
Support global sysext/confext in systemd-stub/systemd-sysext (#38113)
systemd-stub supports loading addons, credentials, and system and
configuration extensions from the ESP. While addons and credentials can be
both global and per-UKI, sysext/confext are only per-UKI.
Add support for global sysext/confext to systemd-stub/systemd-sysext.
machined: make registration of unpriv user's VMs/containers work (#37855)
This adds the missing glue to reasonably allow unpriv users' VMs/containers
to register with the system machined.
This primarily adds two things:
1. machined can now properly track VMs/containers residing in subcgroups
of units, because that's effectively what happens for per-user
VMs/containers: they are placed below the system unit `user@….service`
in some user unit.
2. machines registered with machined now have an owning UID: users can
operate on their own machines without re-authentication, but not on
others'.
Note that this is only a first step regarding machined's hookup of
nspawn/vmspawn in the long run for unpriv operation.
I think eventually we should make it so that there's both a per-user and
a per-system machined instance (so far, and even with this PR, there's
still only the one per-system instance), and per-user containers/VMs would
register with *both*. Having two instances makes sense I think,
because it would mean we can make machined reasonably manage the
per-user image discovery, and also do the per-system network/hostname
handling.
test: add testcase for unpriv machined nspawns reg + killing
Let's add a superficial test for the code we just added: spawn a
container unpriv, make sure registration fully worked, then kill it via
machinectl, to ensure it all works properly.
vmspawn systems might take quite a while to boot, in particular if they
go through UEFI and wait for a network lease. Hence let's increase the
start timeout to 2min (from 45s). We'll do that for both nspawn and
vmspawn, even though the UEFI thing certainly doesn't apply there (but
the DHCP thing still does).
This mimics the switch of the same name from nspawn: it controls whether
we expect a READY=1 message from the payload or not. Previously we'd
always expect that. This makes it configurable, just like it is in
nspawn.
There's one fundamental difference in behaviour though: in nspawn it
defaults to off, in vmspawn it defaults to on. (For historical reasons;
ideally we'd default to on in both cases, but changing that would be quite
a compat break, both directly and indirectly, since timeouts might get
triggered.)
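For reference, the READY=1 message in question is the standard sd_notify()-style notification; inside a VM the guest's service manager normally sends it on the payload's behalf over the notify transport vmspawn sets up. A minimal sender looks roughly like this:
```c
/* Build with: cc ready.c $(pkg-config --cflags --libs libsystemd) */
#include <systemd/sd-daemon.h>

int main(void) {
        /* ... finish starting up ... */

        /* Report readiness; with the new switch enabled, vmspawn waits for
         * this before considering the machine started. */
        sd_notify(/* unset_environment= */ 0, "READY=1");
        return 0;
}
```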
vmspawn: substantially beef up cgroup logic, to match more closely what nspawn does
This beefs up the cgroup logic, adding --slice= and --property= to vmspawn
the same way they already exist in nspawn.
There are a bunch of differences though: we don't delegate the cgroup
access in the allocated unit (since qemu wouldn't need that), and we do
registration via varlink not dbus. Hence, while this follows a similar
logic now, it differs in a lot of details.
In particular, this makes one change: when invoked on the command line
we'll only add the qemu instance to the allocated scope, not the vmspawn
process itself (this follows more closely how nspawn does it, where
only the container payload has its own scope, not nspawn itself). This is
quite tricky to implement: unlike in nspawn we have auxiliary services
to start, with dependencies on the scope. This means we need to start the
scope early, so that we know the scope's name. But the command line to
invoke can only be assembled from the data we learn about the auxiliary
services, hence much later. To address this, we'll now fork off the child
that will eventually become qemu early, then move it into the scope,
prepare the cmdline, and only then, very late, send the cmdline (and the
fds we want to pass) to the prepared child, which then execs it.
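A compact sketch of that scheme (illustrative only, not the vmspawn code; passing extra fds via SCM_RIGHTS is omitted for brevity): fork the future payload before the command line is known, and only later ship it a NUL-separated argv over a socketpair to exec.
```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
        int sv[2];
        if (socketpair(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0, sv) < 0)
                return 1;

        pid_t pid = fork();
        if (pid < 0)
                return 1;

        if (pid == 0) {
                /* Child: block until the parent sends a NUL-separated argv,
                 * then exec it. Extra fds would be received here as well. */
                close(sv[0]);

                char buf[4096];
                ssize_t n = read(sv[1], buf, sizeof(buf) - 1);
                if (n <= 0)
                        _exit(1);
                buf[n] = 0;

                char *argv[32] = { NULL };
                size_t argc = 0;
                for (char *p = buf; p < buf + n && argc < 31; p += strlen(p) + 1)
                        argv[argc++] = p;

                execvp(argv[0], argv);
                _exit(127);
        }

        /* Parent: the child's PID is known now, so it could be moved into a
         * freshly allocated scope unit; only afterwards assemble the final
         * command line (a stand-in here) and hand it over. */
        close(sv[1]);

        const char cmdline[] = "echo\0hello from the late-exec'd child\0";
        if (write(sv[0], cmdline, sizeof(cmdline) - 1) < 0)
                return 1;
        close(sv[0]);

        int status;
        waitpid(pid, &status, 0);
        return 0;
}
```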
Just like in nspawn, there's a chance we need to authenticate the
registration via polkit, hence let's spawn off the agent for that during
that phase, and terminate it once we don't need it anymore.