test: skip TEST-06-SELINUX if not on fedora/centos
The test skips at runtime on the same condition, but that's already too late
as it often gets stuck on boot in Debian/Ubuntu. Check in the meson
condition directly so that it's not even started.
mkosi: Stop passing package environment variables to tools image
The tools image is not guaranteed to be the same distribution as the
target distribution and so might have different package environment
variables than the main image yet we currently unconditionally use the
same package environment variables for both of them.
Let's fix this by not passing the package environment variables to the
tools image and subimages anymore, and instead having the main, tools and
build images separately include a config file with the required environment
variables.
mkosi: Use mkosi.tools.conf for tools tree configuration
This allows us to use the regular settings instead of having to bother
with ToolsTreeXXX variants. It'll also allow us to share configuration
between the regular images and the tools tree image, which we'll make
use of in the next commit.
mkosi: Drop number prefixes from configuration files
We already removed these in some places, let's migrate the others as
well. There's no ordering required at all between these configuration
files so let's not bother with any numbered prefixes.
unit: return a better error state for unit_get_unit_file_preset() if we have no fragment path
We'd previously return what was already set. Let's instead return a
clear ENOEXEC in this case, to make clear what is going on: preset logic
doesn't apply to units which lack a fragment path.
unit: initialize unit_file_preset field to valid value
"-1" is not a valid enum value. Use a better one. All code using this
considers negative values error codes anyway, hence the old code was
just a weird way to write -EPERM. Let's clean this up.
unit: don't bother determining unit install state for transient or perpetual units
I noticed that we keep querying the preset database for transient units,
which makes little sense, since transient units are, well, transient, and
hence not subject to enablement/disablement. Hence, let's shortcut things
and simply not check the preset database for them.
While we are at it, shortcut unit file state checks for transient units,
too. We know they are transient already, we can return that directly,
no need to go to disk.
Finally, treat perpetual units like transient units for the preset
case: also bypass the preset database. (But keep checking for the unit
file state for them, since it *is* relevant to know whether they were
generated or not.)
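
The preset shortcuts described in the three unit commits above can be sketched roughly as follows. This is a simplified illustration, not the actual systemd code: struct unit_sketch, lookup_preset_fn, counting_lookup() and unit_get_preset_sketch() are hypothetical names, and returning -ENOEXEC for transient and perpetual units is an assumption (the commit messages only say that the preset database is bypassed for them).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for a unit object. */
struct unit_sketch {
    const char *fragment_path; /* NULL if the unit has no unit file */
    bool transient;
    bool perpetual;
};

/* Hypothetical callback standing in for the preset database query.
 * Returns >= 0 for a preset result, or a negative errno on failure. */
typedef int (*lookup_preset_fn)(const char *fragment_path);

/* Example callback that counts how often the "database" is consulted. */
int lookup_calls = 0;
int counting_lookup(const char *fragment_path) {
    (void) fragment_path;
    lookup_calls++;
    return 1; /* arbitrary non-negative value meaning "some preset applies" */
}

int unit_get_preset_sketch(const struct unit_sketch *u, lookup_preset_fn lookup) {
    assert(u);
    assert(lookup);

    /* Preset logic doesn't apply to units without a fragment path. */
    if (!u->fragment_path)
        return -ENOEXEC;

    /* Transient and perpetual units are not subject to enablement, so the
     * database is never queried. (The error code here is an assumption.) */
    if (u->transient || u->perpetual)
        return -ENOEXEC;

    return lookup(u->fragment_path);
}
```

With this shape, callers that already treat negative values as errors need no changes, and the lookup callback only runs for regular units that have a unit file.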
The tests were failing, because the quota was not enforced.
It seems that we simply don't have privileges to set or display the quota.
The test is running privileged, so this is probably something SELinux-related:
TEST-46-HOMED.sh[117]: + /usr/lib/systemd/tests/unit-tests/manual/test-display-quota tmpfsquota /dev/shm /tmp
TEST-46-HOMED.sh[1103]: Lacking privileges to query UID quota on /dev/shm: Operation not permitted
TEST-46-HOMED.sh[1103]: Lacking privileges to query UID quota on /tmp: Operation not permitted
If we cannot display the quota, ignore the test results.
In a local run under mkosi, quota is shown and the tests pass. So this is something
about how the testing-farm:fedora-rawhide-x86_64 is configured.
TEST-46-HOMED: check for support on /dev/shm and /tmp separately
The test fails in CI. My guess was this is because the enablement of quota on
/tmp and /dev/shm is independent. The former fs is mounted by systemd in the
host, while the latter is mounted in the initrd, so we can end up with quota
support on one but not the other, which is the situation I had on my laptop.
This wasn't actually the source of the problems in CI, but it's a reasonable
change to make anyway.
test-display-quota: add a little helper binary to show quota on tmpfs
The quota tool from the quota project fails:
$ quota
quota: Cannot stat() mounted device tmpfs: No such file or directory
quota: Cannot stat() mounted device tmpfs: No such file or directory
Having this helper helped me understand what is going on with the quotas when
the tests failed. I think it'd be useful to keep it around for now, even though
it is not actually connected in the tests.
test: use 'exit 0' instead of 'return' in test scripts
14385s [ 66.896852] TEST-87-AUX-UTILS-VM.sh[3744]: + test -x /usr/lib/systemd/systemd-validatefs
14385s [ 66.898544] TEST-87-AUX-UTILS-VM.sh[3744]: + echo 'no systemd-validatefs'
14385s [ 66.899115] TEST-87-AUX-UTILS-VM.sh[3744]: no systemd-validatefs
14385s [ 66.899699] TEST-87-AUX-UTILS-VM.sh[3744]: + return
14385s [ 66.900189] TEST-87-AUX-UTILS-VM.sh[3744]: .//usr/lib/systemd/tests/testdata/units/TEST-87-AUX-UTILS-VM.validatefs.sh: line 13: return: can only `return' from a function or sourced script
hibernate-resume: restore full message if resume fails
We had an INFO message before 760e99bb52dd132aeab14802c9ed2889471e9cdf. Logging
at INFO level made sense back when we didn't have the EFI variable and people
would set resume= on the kernel command line. Nowadays, if we have the
hibernation info, then we expect it to be accurate. Log at WARN level if we
have the EFI variable and the resume fails for any reason, and at INFO
otherwise.
OTOH, we already print errors immediately when that happens, and if the resume
failed in the kernel, the kernel should log on its own. So just use WARN, not
ERR.
introduce notify_socket_prepare() and use it where applicable (#36911)
This introduces notify_socket_prepare(), which creates an autobind
notify socket and IO event source for the socket. Then, use it where we
send notification messages from worker processes to their manager
process.
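
For illustration, the "autobind" part of such a helper could look roughly like the sketch below. This is not the actual notify_socket_prepare() implementation: the function name notify_socket_autobind_sketch() is hypothetical, and the IO event source that the real helper also sets up is left out. It relies only on the Linux behaviour that binding an AF_UNIX socket with just the address family makes the kernel pick an abstract address.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Creates an AF_UNIX datagram socket bound to a kernel-assigned abstract
 * address. On success returns the fd and stores the address in *ret_sa and
 * *ret_len; on failure returns a negative errno. */
int notify_socket_autobind_sketch(struct sockaddr_un *ret_sa, socklen_t *ret_len) {
    struct sockaddr_un sa = { .sun_family = AF_UNIX };
    socklen_t len = sizeof(sa);
    int fd, r;

    assert(ret_sa);
    assert(ret_len);

    fd = socket(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC, 0);
    if (fd < 0)
        return -errno;

    /* Passing only the family (addrlen == sizeof(sa_family_t)) requests autobind. */
    if (bind(fd, (const struct sockaddr *) &sa, sizeof(sa_family_t)) < 0)
        goto fail;

    /* Read back the address the kernel picked, so it can be handed to workers. */
    if (getsockname(fd, (struct sockaddr *) &sa, &len) < 0)
        goto fail;

    *ret_sa = sa;
    *ret_len = len;
    return fd;

fail:
    r = -errno;
    close(fd);
    return r;
}
```

The returned address lives in the abstract namespace (sun_path starts with a NUL byte), so there is no file system entry to clean up once the manager closes the socket.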
- drop 'Options' sections,
- drop underlining for link,
- fix indentation.
Prompted by https://github.com/systemd/systemd/pull/36850#discussion_r2020594171
> the underline stuff we only use for long --help texts that have sections,
> for the section headers. systemctl --help does that for example. This one
> here is not that long, hence doesn't really need section headers, and
> hence no underlining. The clickable links don't need to be explicitly
> underlined, the terminal emulators that support hyperlinks will underline
> them on their own (for example gnome-terminal uses a dotted line).
Also, let's not get too tangled up in the style of defining variables
in between. The functions are short enough, and vars involved are still
effectively at the beginning... Put differently, the separation from
'int r' is too deliberate and brings no actual value in my eyes.
Yu Watanabe [Mon, 31 Mar 2025 16:14:33 +0000 (01:14 +0900)]
introduce systemd-validatefs@.service that ensures file systems can only be used in the way they were intended (#36714)
If we have multiple trusted fs (i.e. luks or dm-verity) we generate via
repart at boot, we must make sure they cannot be "misappropriated", i.e.
used for a different mount than they were intended for.
Hence, let's introduce "mount constraint" data (encoded in xattrs on the
root inode of the fs) that tells us where a file system has to be
mounted, and what the gpt partition metadata has to be for the fs to be
valid.
Inspired by this thread:
https://lists.freedesktop.org/archives/systemd-devel/2025-March/051244.html
If the target dir is tmpfs and we run on old kernels we cannot extract
xattrs and the extraction will fail if there are any. Hence add
-no-xattrs to the two remaining unsquashfs invocations that don't have
it.
(Also all other invocations across our test tree spell "-dest" instead
of "-d", hence do so here too.)
Let's automatically generate validatefs xattrs by default, that encode
the intended use of partitions.
This defaults to on, since the structure of repart definition files
tells us enough about the intended use for this to be safe. There's an option however,
to turn this off.
validatefs: add new tool that enforces mount constraints
This new tool looks for three xattrs on the root inode of a file system
that encode mount constraints of the file system. The tool is supposed
to be hooked into the mount logic and is supposed to protect against
misappropriating trusted file systems in unintended ways.
Consider the following scenario: we boot up on first boot and create a
tpm-locked pair of /var/ and /srv/ partitions via systemd-repart. An
attacker then offline modifies the partition table, exchanging the
metadata of the /var/ and /srv/ partition. So far we'd happily accept
that, honour the modified metadata and boot up. This could be used to
revert changes to /var/ or similar. And all that even though both
partitions are encrypted and locked to TPM!
With this new mechanism we can encode in the protected contents of the
file system the ways it can be used: the partition type UUID, the
partition label and the intended mount point can be stored in xattrs,
and we can check them automatically on mount, and take action on
mismatch. (The action would typically be an immediate reboot.)
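
The comparison at the heart of this check can be sketched as follows. This is only a model of the idea, not the tool's code: struct mount_constraints_sketch and mount_constraints_match_sketch() are hypothetical names, the values are assumed to have already been read from the xattrs and from the partition table, strings are assumed to be normalized, and treating an unset stored field as "no constraint" is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory form of the three constraints listed above. */
struct mount_constraints_sketch {
    const char *gpt_type_uuid; /* partition type UUID, as a string */
    const char *gpt_label;     /* partition label */
    const char *mount_point;   /* intended mount point */
};

/* A NULL stored value means no constraint was recorded (assumption). */
static bool constraint_ok(const char *stored, const char *actual) {
    if (!stored)
        return true;
    return actual && strcmp(stored, actual) == 0;
}

/* Returns true if what the file system says about itself ("stored") agrees
 * with how it is actually being used ("actual"). */
bool mount_constraints_match_sketch(
        const struct mount_constraints_sketch *stored,
        const struct mount_constraints_sketch *actual) {

    assert(stored);
    assert(actual);

    return constraint_ok(stored->gpt_type_uuid, actual->gpt_type_uuid) &&
           constraint_ok(stored->gpt_label, actual->gpt_label) &&
           constraint_ok(stored->mount_point, actual->mount_point);
}
```

In the /var/ and /srv/ scenario above, the stored values travel inside the encrypted file system, so swapping the partition table entries makes "actual" disagree with "stored" and the check fails.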
pcrextend: whenever we fail to extend PCRs, reboot immediately
PCR extensions are supposed to be useful for "destroying" the ability to
access TPM bound secrets. Hence, if for some reason we fail to extend a
PCR, it's safer to just reboot, instead of going on without the
extension, leaving secrets potentially accessible which should not be
accessible.
Note that the services exit gracefully if no TPM is found, hence this
should not be triggered on TPM-less systems. However, this enforces that
if there is a TPM that is accessible to Linux and that works properly,
the PCR measurement must complete too.
Let's always prefer quotactl_fd() when it's available and use quotactl()
only as a fallback on old kernels.
This way we can operate on the fds we typically already have open, or if
needed we can open a new one, and use it for multiple fs operations.
In the long run we should really focus on operating exclusively by fd
instead of by path, by device node or otherwise. This gets us a step
closer to that.
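
A minimal sketch of the "prefer quotactl_fd(), fall back to quotactl()" pattern, assuming a Linux system with glibc. This is not the systemd helper itself: get_user_quota_sketch() and its fallback_device parameter are hypothetical, and the raw syscall is used because a libc wrapper for quotactl_fd() is not assumed to exist.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/quota.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Queries the user quota for `uid`, preferring quotactl_fd() on an open fd
 * and falling back to the device-path based quotactl() on kernels without it.
 * `fallback_device` is the block device for the legacy call and may be NULL.
 * Returns 0 on success or a negative errno. */
int get_user_quota_sketch(int fd, const char *fallback_device, uid_t uid, struct dqblk *ret) {
    /* Do the shift in unsigned arithmetic to avoid signed overflow in QCMD(). */
    int cmd = (int) QCMD((unsigned) Q_GETQUOTA, USRQUOTA);

    assert(ret);

#ifdef SYS_quotactl_fd
    if (syscall(SYS_quotactl_fd, fd, cmd, uid, ret) >= 0)
        return 0;
    if (errno != ENOSYS)
        return -errno; /* the kernel knows the call, so report the real error */
#else
    (void) fd;
#endif

    /* Old kernel: only the path-based interface is available. */
    if (!fallback_device)
        return -ENOSYS;

    if (quotactl(cmd, fallback_device, (int) uid, (caddr_t) ret) < 0)
        return -errno;

    return 0;
}
```

Only ENOSYS triggers the fallback here; any other error from quotactl_fd() is returned as-is, since retrying by device path would not fix a permission or argument problem.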
Mike Yuan [Sun, 16 Mar 2025 21:05:41 +0000 (22:05 +0100)]
core/namespace: remove wonky fallback in mount_private_apivfs()
Let's avoid dropping opts willy-nilly, especially since we already
carry the logic of determining availability prior to mount (but
make sure we respect the result though, and don't assume things
are available if the check fails).
Mike Yuan [Sun, 16 Mar 2025 20:55:29 +0000 (21:55 +0100)]
core/namespace: stop applying mount options on private cgroupfs mount
We always unshare cgroup ns for ProtectControlGroups=private/strict,
while the mount options only apply to the cgroupfs instance
in initial cgns (c.f.
https://github.com/torvalds/linux/blob/b69bb476dee99d564d65d418e9a20acca6f32c3f/kernel/cgroup/cgroup.c#L1984)
Hence let's drop the thing wholesale.
Also, as noted in the comment already, mount_private_apivfs()
internally enforces nosuid/noexec, so drop explicit flags too.
Mike Yuan [Sun, 30 Mar 2025 16:45:27 +0000 (18:45 +0200)]
TEST-07-PID1: remove bogus test case for DelegateNamespaces=cgroup
We enable nsdelegate for cgroupfs, and hence the kernel would
always refuse writes to /sys/fs/cgroup/cgroup.pressure and friends
regardless of whether the cgns is owned by userns:
https://github.com/torvalds/linux/blob/cb82ca153949c6204af793de24b18a04236e79fd/kernel/cgroup/cgroup.c#L4132
This currently works because the mountns (thus cgroupfs) remains
non-delegated and we're actually operating on the real root
cgroup.
It appears that cgroupfs generally doesn't care about userns,
so I've yet to see a way to test this properly. Let's drop this for now,
to unblock fixes in the following commits.