Adrian Vovk [Wed, 24 Jan 2024 00:50:21 +0000 (19:50 -0500)]
core: Fail to start/stop/reload unit if frozen
Previously, unit_{start,stop,reload} would call the low-level cgroup
unfreeze function whenever a unit was started, stopped, or reloaded. It
did so with no error checking. This call would ultimately recurse up the
cgroup tree, and unfreeze all the parent cgroups of the unit, unless an
error occurred (in which case I have no idea what would happen...)
After the freeze/thaw rework in a previous commit, this can no longer
work. If we recursively thaw the parent cgroups of the unit, there may
be sibling units marked as PARENT_FROZEN which will no longer actually
have frozen parents. Fixing this is a lot more complicated than simply
disallowing start/stop/reload on a frozen unit.
Adrian Vovk [Sun, 21 Jan 2024 20:05:20 +0000 (15:05 -0500)]
core: Rework recursive freeze/thaw
This commit overhauls the way freeze/thaw works recursively:
First, it introduces new FreezerActions that are like the existing
FREEZE and THAW but indicate that the action was initiated by a parent
unit. We also refactored the code to pass these FreezerActions through
the whole call stack so that we can make use of them. FreezerState was
extended similarly, to be able to differentiate between a unit that's
frozen manually and a unit that's frozen because a parent is frozen.
Next, slices were changed to check recursively that all their child
units can be frozen before they attempt to freeze them. This is different
from the previous behavior, which would just check if the unit's type
supported freezing at all. This cleans up the code, and also ensures
that the behavior of slices corresponds to the unit's actual ability
to be frozen.
Next, we make it so that if you FREEZE a slice, it'll PARENT_FREEZE
all of its children. Similarly, if you THAW a slice it will PARENT_THAW
its children.
Finally, we use the new states available to us to refactor the code
that actually does the cgroup freezing. The code now looks at the unit's
existing freezer state and the action being requested, and decides what
next state is most appropriate. Then it puts the unit in that state.
For instance, a RUNNING unit with a request to PARENT_FREEZE will
put the unit into the PARENT_FREEZING state. As another example, a
FROZEN unit whose parent is also FROZEN will transition to
PARENT_FROZEN in response to a request to THAW.
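The two transitions named above can be sketched as a small state function. This is only an illustration: the state and action names follow this commit message, while the next_state() helper, its parent_frozen parameter, and every transition other than the two examples above are assumptions, not the actual implementation.

```python
from enum import Enum, auto


class Action(Enum):
    FREEZE = auto()
    THAW = auto()
    PARENT_FREEZE = auto()  # freeze initiated by a parent unit
    PARENT_THAW = auto()    # thaw initiated by a parent unit


class State(Enum):
    RUNNING = auto()
    FREEZING = auto()
    FROZEN = auto()
    PARENT_FREEZING = auto()
    PARENT_FROZEN = auto()
    THAWING = auto()


MANUAL = {State.FREEZING, State.FROZEN}
BY_PARENT = {State.PARENT_FREEZING, State.PARENT_FROZEN}


def next_state(current: State, action: Action, parent_frozen: bool = False) -> State:
    """Pick the next freezer state (hypothetical helper, not systemd's code)."""
    if action is Action.FREEZE:
        # Assumption: a manual freeze takes precedence over a parent-initiated one.
        return State.FROZEN if current is State.FROZEN else State.FREEZING
    if action is Action.PARENT_FREEZE:
        # Assumption: a unit that is already frozen keeps its current state.
        return current if current in MANUAL | BY_PARENT else State.PARENT_FREEZING
    if action is Action.THAW:
        if parent_frozen:
            # The cgroup stays frozen; only the reason for it changes.
            settled = current in (State.FROZEN, State.PARENT_FROZEN)
            return State.PARENT_FROZEN if settled else State.PARENT_FREEZING
        return State.RUNNING if current is State.RUNNING else State.THAWING
    # Action.PARENT_THAW. Assumption: it does not undo a manual freeze.
    if current in MANUAL:
        return current
    return State.RUNNING if current is State.RUNNING else State.THAWING
```

Passing the current state and the requested action together is what lets a manually frozen unit stay frozen when its parent thaws.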
creds-util: add a concept of "user-scoped" credentials
So far credentials are a concept for system services only: to encrypt or
decrypt a credential you must be privileged, as only then can you access
the TPM and the host key.
Let's break this up a bit: let's add "user-scoped" credentials, which
are specific to users. Internally this works by adding another step to
the acquisition of the symmetric encryption key for the credential: if a
"user-scoped" credential is used we'll generate a symmetric encryption
key K as usual, but then we'll use it to calculate
K' = HMAC(K, flags || uid || machine-id || username)
and then use the resulting K' as encryption key instead. This basically
includes the (public) user's identity in the encryption key, ensuring
that only if the right user credentials are specified the correct key
can be acquired.
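A minimal sketch of the derivation step, for illustration only. The commit message gives the formula but not the hash function or the byte layout of the fields, so SHA-256, the fixed-width little-endian integers, and the derive_user_key() name below are all assumptions.

```python
import hashlib
import hmac
import struct


def derive_user_key(k: bytes, flags: int, uid: int, machine_id: bytes, username: str) -> bytes:
    """Compute K' = HMAC(K, flags || uid || machine-id || username).

    Assumptions: SHA-256 as the hash, a 64-bit flags field and a 32-bit UID,
    both little-endian. None of these are stated in the commit message.
    """
    message = struct.pack("<QI", flags, uid) + machine_id + username.encode("utf-8")
    return hmac.new(k, message, hashlib.sha256).digest()
```

Because the UID and user name are part of the HMAC input, the same K yields a different K' for every user, so a credential encrypted for one user cannot be decrypted with another user's identity.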
Yu Watanabe [Sat, 27 Jan 2024 18:27:41 +0000 (03:27 +0900)]
nspawn: resolve network interface names before moving to container network namespace
To escape a kernel issue fixed by
https://github.com/torvalds/linux/commit/8e15aee621618a3ee3abecaf1fd8c1428098b7ef,
let's resolve provided interface names earlier, and adjust the interface
name pairs with the result.
Yu Watanabe [Sat, 27 Jan 2024 17:49:22 +0000 (02:49 +0900)]
sd-netlink: unify network interface name getter and resolvers
This makes rtnl_resolve_interface() always check the existence of the
resolved interface, which it previously did not do when a decimal formatted
ifindex was provided, e.g. "1" or "42".
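The behavior described above, sketched in Python rather than in the C of sd-netlink. resolve_interface() is a hypothetical stand-in that only handles plain names and decimal indexes; the two lookup callbacks default to the standard library's socket.if_indextoname() and socket.if_nametoindex().

```python
import socket
from typing import Callable


def resolve_interface(
    name: str,
    index_to_name: Callable[[int], str] = socket.if_indextoname,
    name_to_index: Callable[[str], int] = socket.if_nametoindex,
) -> int:
    """Return the ifindex for `name`, raising OSError if no such interface exists."""
    if name.isascii() and name.isdigit():
        ifindex = int(name)
        # Look the index up so that a number with no interface behind it is rejected.
        index_to_name(ifindex)
        return ifindex
    return name_to_index(name)
```

With this shape, "42" is accepted only if an interface with index 42 actually exists, the same as for a name.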
This is not a hot path, but it seems silly to evaluate subsequent branches,
which can never match once one has matched. Also, it makes the code harder to
read, because the reader has to first figure out that only one branch can
match.
By definition, a parameter cannot contain a comma because commas
are used to delimit parameters. So we also don't need to use parens
when the use site is delimited by commas.
Mike Yuan [Mon, 29 Jan 2024 18:07:35 +0000 (02:07 +0800)]
notify: don't exit silently when --exec but no msg
Before this commit, if --exec was used but there was no message to
send, we silently ignored --exec and exited, which is pretty
surprising. Therefore, let's emit a clear error instead.
sd-bus: tighten rules on sd_bus_query_sender_creds() a bit
Let's always derive credentials from a bus name or a connection fd if we
can, because they pin things.
Let's not go via PID really, because it's always racy to do so.
Note that this doesn't change much, since we wouldn't use such augmented
data for auth anyway (because it will be masked in the
sd_bus_creds.augmented mask as untrusted). But still, let's prefer
trusted data over untrusted data.
socket-util: start SO_PEERGROUP loop with sysconf(_SC_NGROUPS_MAX), too
We do this for getgroups_malloc() hence we should do this here too,
after all whether we do it for a socket peer or for ourselves doesn't
make too much of a difference.
[ 40.039232] testsuite-50.sh[624]: ++ systemd-dissect --make-archive /tmp/tmp.RZEq3t/minimal_0.raw
[ 40.044745] testsuite-50.sh[625]: ++ sha256sum
[ 40.066693] systemd-dissect[621]: libarchive.so.13 is not installed: libarchive.so.13: cannot open shared object file: No such file or directory
[ 40.068577] systemd-dissect[621]: Archive support not available (compiled without libarchive, or libarchive not installed?).
[ 40.092242] systemd-dissect[624]: libarchive.so.13 is not installed: libarchive.so.13: cannot open shared object file: No such file or directory
[ 40.095716] systemd-dissect[624]: Archive support not available (compiled without libarchive, or libarchive not installed?).
[ 40.100510] testsuite-50.sh[538]: + test e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 '!=' ''
[ 40.100510] testsuite-50.sh[538]: + test e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 = e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
[ 40.108249] testsuite-50.sh[627]: + tar t
[ 40.113791] testsuite-50.sh[626]: + systemd-dissect --make-archive /tmp/tmp.RZEq3t/minimal_0.raw
[ 40.120300] testsuite-50.sh[628]: + grep etc/os-release
[ 40.176288] systemd-dissect[626]: libarchive.so.13 is not installed: libarchive.so.13: cannot open shared object file: No such file or directory
[ 40.180273] systemd-dissect[626]: Archive support not available (compiled without libarchive, or libarchive not installed?).
[ 40.184017] testsuite-50.sh[627]: tar: This does not look like a tar archive
[ 40.185430] testsuite-50.sh[627]: tar: Exiting with failure status due to previous errors
Luca Boccassi [Tue, 23 Jan 2024 16:01:31 +0000 (16:01 +0000)]
core: add SYSTEMD_VERITY_SHARING env var for local development
When running an image that cannot be mounted (e.g.: key missing intentionally
for development purposes), there's a retry loop that takes some time
and slows development down. Add an env var to disable it.
Luca Boccassi [Thu, 25 Jan 2024 20:31:39 +0000 (20:31 +0000)]
sd-bus: fix exiting event loop when sd_bus_set_exit_on_disconnect is used
If sd_bus_set_exit_on_disconnect is used and the bus is part of an event
loop, and the D-Bus connection goes away (e.g.: soft-reboot), sd-bus
will always exit() the program instead of returning from the loop, as
the reference to the event is removed before it is checked.
Luca Boccassi [Fri, 26 Jan 2024 00:22:38 +0000 (00:22 +0000)]
test: unset TZ before timezone-sensitive unit tests are run
Some tests have hard-coded results that need to match, and change if
the caller has a timezone set via the TZ= environment variable, as is
the case during reproducible build tests. Unset it.
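The effect of unsetting TZ can be illustrated in a few lines of Python. This is a sketch of the idea only, not the change made to the test setup; clear_timezone_override() is a hypothetical helper.

```python
import os
import time


def clear_timezone_override() -> None:
    """Drop TZ from the environment so local-time results use the system default."""
    os.environ.pop("TZ", None)
    if hasattr(time, "tzset"):  # tzset() is only available on Unix
        time.tzset()
```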
Daan De Meyer [Wed, 24 Jan 2024 11:24:11 +0000 (12:24 +0100)]
man: Document ranges for distributions config files and local config files
Let's recommend that config files and drop-ins in /usr use the range
0-49 and config files in /etc and /run use the range 50-99 so that
files in /run and /etc will generally always override files from
/usr.
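A small sketch of why the ranges work, assuming files are applied in order of file name regardless of directory, with later files overriding earlier ones. The paths and the effective_order() helper are made up for illustration.

```python
from pathlib import PurePosixPath


def effective_order(paths: list[str]) -> list[str]:
    """Order config files by file name only, ignoring the directory they live in."""
    return sorted(paths, key=lambda p: PurePosixPath(p).name)
```

Given "/etc/systemd/system/foo.service.d/50-local.conf" and "/usr/lib/systemd/system/foo.service.d/10-vendor.conf", the vendor file sorts first, so the local file in /etc is applied last and its settings take effect.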
Mike Yuan [Mon, 22 Jan 2024 16:00:46 +0000 (00:00 +0800)]
fstab-generator: drop unapplicable options for /usr/ too
We already drop these for /sysroot/usr/ in parse_fstab
(1e9b2e4fdd8d04e3fbfadbc0b92dc138c819c221). Let's make
things consistent, and do the same for /usr/ too (after
switch-root).
Mike Yuan [Sat, 20 Jan 2024 14:16:52 +0000 (22:16 +0800)]
fstab-util: clean up fstab_filter_options
Let's get rid of the confusing goto so that the flow is more
straightforward. Note that the behavior is slightly changed:
previously, ret_filtered would be an empty string even if
the original opts passed in is NULL, but after this commit
it returns NULL too. But this shouldn't matter, as all our
code handles NULL opts gracefully.
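The NULL handling described above, shown with a simplified Python stand-in. filter_options() is hypothetical and much smaller than the C function (it takes a plain list of names and returns a tuple); it only illustrates that a missing option string now yields a missing result rather than an empty string.

```python
from typing import Optional, Sequence, Tuple


def filter_options(opts: Optional[str], names: Sequence[str]) -> Tuple[bool, Optional[str]]:
    """Remove the options listed in `names` from a comma-separated option string.

    Returns (found, filtered). With opts=None, filtered is None as well.
    """
    if opts is None:
        return False, None
    kept, found = [], False
    for word in opts.split(","):
        key = word.split("=", 1)[0]  # "x-foo=bar" is matched by its key "x-foo"
        if key in names:
            found = True
        else:
            kept.append(word)
    return found, ",".join(kept)
```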
This file is a bit misnamed. What it actually implements is one specific
BPF LSM module, that restricts file systems. As such it really should be
named after that, and not primarily by the mechanism it uses for that.
With this our glue code is now named the same way as the actual bpf code
files in src/core/bpf/, thus things become a bit more symmetric.
This is particularly relevant as we'll soon have another BPF LSM in our
tree, see #26826, and we should be able to distinguish them by name.
This commit just renames the files and does some dumb search/replace of
the string. A follow-up commit will name some functions more expressively
inside the files.
I added the filtering in 752fedbea7c02c82287c7ff2a4139f528b3f7ba8 as a way
to reduce the number of items in the tables. I thought it's "obvious", but
it might not be so.
One immediate problem is that the filter is broken, because on arm64,
os.uname().machine returns "aarch64", so we incorrectly filter out the arm
syscalls (there is just one: arm_fadvise64_64). Of course we could fix the
filter, but I think it's better to nuke it altogether. The filter only applies to
1 arm syscall and 5 s390 syscalls, and we have 500+ other syscalls, so this
"optimization" doesn't really matter. OTOH, if we get the filter wrong,
the result is bad. And also, the existence of the filter at all creates
problems for cross-builds.
I wanted to get rid of 'generate-syscall-list.py', but we need to generate a
backslash in the output. https://github.com/mesonbuild/meson/issues/1564 makes
this very very hard, since any attempt to put a backslash in an inline argument
results in the backslash being replaced by a forward slash, which doesn't quite
have the same meaning. So let's use a standalone script until
https://github.com/mesonbuild/meson/issues/1564 is resolved.
cgroup: don't enable bpf pseudo-controllers when doing a wildcard delegation
We can only delegate actual controllers, not the BPF pseudo-controllers
we defined, as there's simply no concept for that. Hence, when users set
Delegate=yes to do a wildcard delegation, only delegate the regular
controllers.
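The masking described above can be sketched with a flag set. The Controller members and the delegated_controllers() helper are illustrative names only, not the real mask definitions.

```python
from enum import Flag, auto


class Controller(Flag):
    CPU = auto()
    IO = auto()
    MEMORY = auto()
    PIDS = auto()
    BPF_FIREWALL = auto()  # pseudo-controller, implemented via BPF
    BPF_DEVICES = auto()   # pseudo-controller, implemented via BPF


REAL_CONTROLLERS = Controller.CPU | Controller.IO | Controller.MEMORY | Controller.PIDS
BPF_PSEUDO_CONTROLLERS = Controller.BPF_FIREWALL | Controller.BPF_DEVICES


def delegated_controllers(delegate_all: bool, explicit: Controller = Controller(0)) -> Controller:
    """Return the controllers to delegate; a wildcard covers real controllers only."""
    if delegate_all:
        # Delegate=yes: everything that can actually be delegated, and nothing else.
        return REAL_CONTROLLERS
    return explicit
```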
This means that we won't bother with BPF stuff for such units where it's
entirely unnecessary.