nspawn: add support for 'managed' userns mode even when we run privileged
So far, we supported two modes:
1. when running unpriv we'd get the mounts from mountfsd, and the userns
from nsresourced
2. when running priv we'd do the mounts/userns ourselves
This untangles this a bit, so that we can also use mountfsd/nsresourced
when running privileged.
I think this is generally a bit nicer, and probably something we should
switch to entirely one day, as it reduces the variety of codepaths.
With this patch the default behaviour remains unchanged, but by
selecting the new "managed" option for --private-users= the codepaths
via mountfsd/nsresourced can be explicitly requested even when running
with privs.
This mostly just reworks things so that we check for arg_userns_mode !=
USER_NAMESPACE_MANAGED rather than arg_privileged in a number of
codepaths, but it requires more fixes, too. The devil is in the details.
nspawn: support foreign mappings also when nspawn doing the mapping itself
This adds a new "foreign" value to --private-users-ownership= which is a
lot like "map", but maps from the host's foreign UID range rather than from the
host's 0.
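To make the difference between "map" and "foreign" concrete, here is a minimal sketch of the translation an idmapped mount would perform on file ownership. It assumes the foreign UID range starts at 0x7FFE0000 and spans 64K UIDs, as documented for systemd's UID allocation; the function name on_disk_to_container() is made up for illustration and is not part of nspawn.

```python
# Assumption: base of the foreign UID range as documented by systemd.
FOREIGN_UID_BASE = 0x7FFE0000
# 64K UIDs per container range.
RANGE_SIZE = 0x10000


def on_disk_to_container(uid, ownership):
    """Translate an on-disk owner UID to the UID seen inside the container.

    ownership is "map" (range starts at host UID 0) or "foreign"
    (range starts at the host's foreign UID range). Hypothetical helper.
    """
    base = {"map": 0, "foreign": FOREIGN_UID_BASE}[ownership]
    if not base <= uid < base + RANGE_SIZE:
        return None  # outside the mapped range, i.e. unmapped
    return uid - base
```

With "foreign", a file owned on disk by FOREIGN_UID_BASE + 1000 shows up as UID 1000 in the container, while a file owned by host UID 0 is not mapped at all.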
(This has nothing much to do with making unprivileged directory-based
containers work, it's just very handy that we can run privileged
containers with such a mapping too, with an easy switch)
This simply calls into mountfsd to acquire the root mount and uses it as
root for the container.
Note that this also makes one more change: previously we ran containers
directly off their backing directory. Except when we didn't, and there
were a variety of exceptions: if we had no privs, if we ran off a disk
image, if the directory was the host's root dir, and some others.
This simplifies the logic a bit: we now simply always create a temporary
directory in /tmp/ and bind mount everything there, in all code paths.
This simplifies our code a bit. After all, in order to control
propagation we need to turn the root into a mount point anyway, hence we
might just do it at one place for all cases.
systemd-mountfsd so far provided a MountImage() API call for mounting a
disk image and returning a set of mount fds. This complements the API
with a new MountDirectory() API call, that operates on a directory
instead of an image file. Now, what makes this interesting is that it
applies an idmapping from the foreign UID range to the provided target
userns, in which case unprivileged operation is allowed (well,
under some conditions: in particular the client must own a parent dir of
the provided path).
This allows container managers to run fully unprivileged from
directories – as long as those directories are owned by the foreign UID
range. Basic operation is like this:
1. acquire a transient userns from systemd-nsresourced with 64K users
2. ask systemd-mountfsd for an idmapped mount of the container dir
matching that userns
3. join the userns and bind the mount fd as root.
Note that we have to drop various sandboxing knobs from the mountfsd
service file for this to work, since the kernel's security checks that
try to ensure that an obstructed /proc/ cannot be circumvented via
mounting a new procfs will otherwise prohibit mountfsd from duplicating
the mounts properly.
However, if a non-system group with the same name already exists,
the devices were previously owned by that non-system group. That may
happen when updating systemd.
Let's avoid devices accidentally being owned by a non-system user/group.
Yu Watanabe [Wed, 22 Jan 2025 20:59:04 +0000 (05:59 +0900)]
udev-rules: ignore OWNER=/GROUP= with unknown user/group
Previously, when an unknown or invalid user/group was specified,
a token was installed with UID_INVALID/GID_INVALID. That is not only
meaningless in most cases, but also clears the previous assignment
if multiple OWNER=/GROUP= tokens exist for the same device, e.g.
With this change, a line that specifies an unknown user/group is
ignored. Hence, in the above example, the device will be owned by the
group "disk".
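The effect of the change can be modelled with a small sketch. This is not udev's rule engine; resolve_group(), the GID value 6 and the group name "nosuchgroup" are hypothetical and only illustrate how an unknown group used to clobber an earlier assignment.

```python
GID_INVALID = -1  # stand-in for the C constant of the same name


def resolve_group(group_tokens, known_groups, ignore_unknown):
    """Return the GID a device ends up with after applying GROUP= tokens in order.

    group_tokens:   group names in rule order (hypothetical input)
    known_groups:   mapping of resolvable group name -> GID
    ignore_unknown: True models the new behaviour, False the old one
    """
    gid = None
    for name in group_tokens:
        if name in known_groups:
            gid = known_groups[name]
        elif not ignore_unknown:
            # Old behaviour: a token with GID_INVALID is installed and
            # clears whatever an earlier GROUP= assigned.
            gid = GID_INVALID
        # New behaviour: the line with the unknown group is skipped.
    return gid
```

For the token sequence "disk" followed by "nosuchgroup", the old behaviour yields GID_INVALID, while the new behaviour keeps the GID of "disk".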
Yu Watanabe [Thu, 23 Jan 2025 17:16:36 +0000 (02:16 +0900)]
udev-rules: get_user_creds()/get_group_creds() return -ESRCH when user/group does not exist
This drops -ENOENT error check for get_user_creds()/get_group_creds(),
as nowadays they always return -ESRCH when the specified user/groups
cannot be found.
File system modules should be something the kernel can autoload
automatically, and according to my testing that works fine, hence let's
drop the explicit deps, in particular as systems usually stick to one fs
for these things, not both.
I asked bluca about the reason for adding it, but bluca didn't remember
anymore and was fine with me removing it. So let's remove it for
now; should issues arise, we can revert this.
mountfsd is supposed to be available during early boot already, before
systemd-tmpfiles-setup-dev-early.service completes, hence make sure
loopback devices and DM already work before that.
Yu Watanabe [Sat, 18 Jan 2025 01:40:32 +0000 (10:40 +0900)]
sd-device: use specific setters for read entries from uevent file
Previously, if e.g. DRIVER=foo was specified in the uevent file, the value was
only saved as a property, but was not set to sd_device.driver.
That was inconsistent with the case where a device is created through
a netlink uevent.
Let's always set e.g. sd_device.driver when we get DRIVER=foo,
from both the uevent file and a netlink uevent.
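As a rough illustration of the intended behaviour, the sketch below parses uevent-style KEY=VALUE lines and stores each entry both as a generic property and, for keys with a dedicated field, in that field. The Device class and apply_uevent() are hypothetical stand-ins, not the sd-device API, and only DRIVER is modelled.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Device:
    """Hypothetical stand-in for sd_device with one dedicated field."""
    driver: Optional[str] = None
    properties: Dict[str, str] = field(default_factory=dict)


def apply_uevent(device, text):
    """Apply KEY=VALUE lines to the device, using specific setters where known."""
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # skip lines that are not KEY=VALUE
        device.properties[key] = value
        if key == "DRIVER":
            # The same setter is used regardless of where the entry came from.
            device.driver = value
    return device
```

Feeding the same text from either source then leaves the device in the same state, which is the consistency the commit is after.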
Yu Watanabe [Sat, 11 Jan 2025 22:03:49 +0000 (07:03 +0900)]
sd-device: chase sysattr and refuse to read/write files outside of sysfs
This makes sd_device_get_sysattr_value()/sd_device_set_sysattr_value()
refuse to read/write files outside of sysfs for safety.
This also makes the functions
- use chase() to resolve and open symlinks in the path to the sysfs attribute,
- use delete_trailing_chars(),
- include the error code in the cache entry, so we can cache more error cases,
- refuse to cache values written to the uevent file of any device, i.e.
sd_device_set_sysattr_value(dev, "../uevent", "add") will also not
cache the value "add".
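The containment check can be sketched as follows. This is a simplified model using os.path.realpath() rather than systemd's chase(); resolve_sysattr(), its sysfs_root parameter and the choice of PermissionError are assumptions made for illustration.

```python
import os


def resolve_sysattr(syspath, sysattr, sysfs_root="/sys"):
    """Resolve a sysfs attribute path and refuse results outside sysfs_root.

    Symlinks and ".." components are resolved first, so an attribute name
    such as "../../../etc/passwd" cannot escape the sysfs tree.
    """
    root = os.path.realpath(sysfs_root)
    resolved = os.path.realpath(os.path.join(syspath, sysattr))
    if os.path.commonpath([root, resolved]) != root:
        raise PermissionError(f"{sysattr!r} resolves outside of {root}")
    return resolved
```

Resolving before checking is the important part: comparing the unresolved string against the prefix would accept paths that only leave the tree after symlinks are followed.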
Yu Watanabe [Tue, 21 Jan 2025 20:24:35 +0000 (05:24 +0900)]
catalog: modernize code
- set destructors to catalog_hash_ops,
- acquire OrderedHashmap when necessary,
- gracefully handle NULL catalog directories and output stream,
- rename function output arguments,
- add many many assertions,
- use RET_GATHER().
Andrew Sayers [Thu, 23 Jan 2025 08:06:57 +0000 (08:06 +0000)]
Clarify that Conflicts= only applies when starting units
The "vice versa" in the old text could be interpreted as either
(wrong) "stopping the former will start the latter", or
(right) "starting the latter will stop the former".
Yu Watanabe [Thu, 23 Jan 2025 09:11:30 +0000 (18:11 +0900)]
run: add --job-mode= argument (#34708)
systemctl has a --job-mode= argument, and adding the same argument to
systemd-run is useful for starting transient scopes with dependencies.
For example, if a transient scope BindsTo a service that is stopping,
specifying --job-mode=replace will wait for the service to stop before
starting it again, while the default job mode of "fail" will cause the
systemd-run invocation to fail.
Gavin Li [Thu, 10 Oct 2024 20:07:16 +0000 (16:07 -0400)]
run: add --job-mode= argument
systemctl has a --job-mode= argument, and adding the same argument to
systemd-run is useful for starting transient scopes with dependencies.
For example, if a transient scope BindsTo a service that is stopping,
specifying --job-mode=replace will wait for the service to stop before
starting it again, while the default job mode of "fail" will cause the
systemd-run invocation to fail.
Yu Watanabe [Thu, 23 Jan 2025 00:04:12 +0000 (09:04 +0900)]
core/device: do not drop backslashes in SYSTEMD_WANTS=/SYSTEMD_USER_WANTS= (#35869)
Consider the following udev rules:
```
PROGRAM="/usr/bin/systemd-escape foo-bar-baz", ENV{SYSTEMD_WANTS}+="test1@$result.service"
PROGRAM="/usr/bin/systemd-escape aaa-bbb-ccc", ENV{SYSTEMD_WANTS}+="test2@$result.service"
```
Then, a device expectedly gains a property:
```
SYSTEMD_WANTS=test1@foo\x2dbar\x2dbaz.service test2@aaa\x2dbbb\x2dccc.service
```
After the event is processed by udevd, PID1 processes the device.
Previously, the property was parsed with
`extract_first_word(EXTRACT_UNQUOTE)`, so the device unit gained the
following dependencies:
```
Wants=test1@foox2dbarx2dbaz.service test2@aaax2dbbbx2dccc.service
```
So neither `%i` nor `%I` for the template services matched the
original data, and it was hard to use `systemd-escape` in a `PROGRAM=`
udev rule token.
With this change, the property is parsed with
`extract_first_word(EXTRACT_UNQUOTE|EXTRACT_RETAIN_ESCAPE)`, hence the
device unit now gains the following dependencies:
```
Wants=test1@foo\x2dbar\x2dbaz.service test2@aaa\x2dbbb\x2dccc.service
```
and `%I` for the template services matches the original data.
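The difference between the two parsing modes can be reproduced with a simplified model. The function below is not `extract_first_word()`; it ignores quoting and only models how a backslash is treated, with `retain_escape` standing in for `EXTRACT_RETAIN_ESCAPE`. The name `split_words()` is hypothetical.

```python
def split_words(s, retain_escape=False):
    """Split s on whitespace, treating a backslash as an escape character.

    Without retain_escape the backslash is consumed and the next character
    is taken literally; with retain_escape the backslash is kept as well.
    """
    words, current = [], []
    i = 0
    while i < len(s):
        c = s[i]
        if c == "\\" and i + 1 < len(s):
            if retain_escape:
                current.append(c)
            current.append(s[i + 1])
            i += 2
            continue
        if c.isspace():
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(c)
        i += 1
    if current:
        words.append("".join(current))
    return words
```

Applied to the `SYSTEMD_WANTS=` value above, the default mode turns `foo\x2dbar` into `foox2dbar`, while `retain_escape=True` leaves the unit names intact.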
The commit reworked job merging logic so that reload jobs
won't get merged. However, they might get dropped from
transaction due to being deemed redundant, i.e. way before
it even hits job_install(). Let's make sure reload jobs
are always kept during transaction construction stage, too.
Daan De Meyer [Wed, 22 Jan 2025 14:58:13 +0000 (15:58 +0100)]
mkosi: Update to latest
With the latest mkosi, mkosi takes care of making sure it is
available within mkosi sandbox so we get rid of all the --preserve-env=
options when we invoke mkosi sandbox with sudo as these are not
required anymore. It also doesn't matter anymore if mkosi is installed
in /usr on the host so we get rid of the documentation around that as
well.
Daan De Meyer [Wed, 22 Jan 2025 21:24:36 +0000 (22:24 +0100)]
mkosi: Run two more mkosi commands with sudo
Running some mkosi commands as root and others not can lead to cache
invalidations with the latest version, so make sure we run everything
as root after we've built the tools tree.
Yu Watanabe [Sat, 11 Jan 2025 08:54:43 +0000 (17:54 +0900)]
udevadm-test: allow to specify extra directories to load udev rules files
This adds -D/--extra-rules-dir=DIR switch for 'udevadm test' command.
When specified, udev rules files in the specified directory will also be
loaded. This may be useful for debugging udev rules by copying some udev
rules files to a temporary directory.
Yu Watanabe [Mon, 6 Jan 2025 08:26:52 +0000 (17:26 +0900)]
core/device: do not drop backslashes in SYSTEMD_WANTS=/SYSTEMD_USER_WANTS=
Consider the following udev rules:
===
PROGRAM="/usr/bin/systemd-escape foo-bar-baz", ENV{SYSTEMD_WANTS}+="test1@$result.service"
PROGRAM="/usr/bin/systemd-escape aaa-bbb-ccc", ENV{SYSTEMD_WANTS}+="test2@$result.service"
===
Then, a device expectedly gains a property:
===
SYSTEMD_WANTS=test1@foo\x2dbar\x2dbaz.service test2@aaa\x2dbbb\x2dccc.service
===
After the event is processed by udevd, PID1 processes the device.
Previously, the property was parsed with extract_first_word(EXTRACT_UNQUOTE),
so the device unit gained the following dependencies:
===
Wants=test1@foox2dbarx2dbaz.service test2@aaax2dbbbx2dccc.service
===
So neither '%i' nor '%I' for the template services matched the original
data, and it was hard to use systemd-escape in a PROGRAM= udev rule token.
With this change, the property is parsed with extract_first_word(EXTRACT_UNQUOTE|EXTRACT_RETAIN_ESCAPE),
hence the device unit now gains the following dependencies:
===
Wants=test1@foo\x2dbar\x2dbaz.service test2@aaa\x2dbbb\x2dccc.service
===
and '%I' for the template services matches the original data.
errno handling for NSS is always a bit weird since NSS modules generally
are not particularly careful with it. Hence let's initialize errno
explicitly before we invoke getpwent() so that we know it's in a
reasonable state afterwards on failure, or zero if not.
We do this in most places we use NSS, including in userdb when it comes
to getgrent(), just for getpwent() we don't so far. Address that.
The getopt() parser was completely wrong: it expected an argument where
none was expected or processed.
The test cases only passed by accident because they use the "user" verb,
which is also the default verb. It was accidentally read as the argument
for --fuzzy and ignored.
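The failure mode is easy to reproduce with any getopt-style parser. The sketch below uses Python's getopt module as a stand-in for the C getopt_long() call; only the --fuzzy option and the "user" verb come from the text above, while parse() and the "foo" argument are made up for illustration.

```python
import getopt


def parse(argv, fuzzy_takes_argument):
    """Parse argv with --fuzzy declared either with or without an argument."""
    longopts = ["fuzzy=" if fuzzy_takes_argument else "fuzzy"]
    opts, rest = getopt.getopt(argv, "", longopts)
    return opts, rest


# Wrong declaration: "user" is swallowed as the argument of --fuzzy.
wrong = parse(["--fuzzy", "user", "foo"], fuzzy_takes_argument=True)

# Correct declaration: "user" stays in the positional arguments as the verb.
right = parse(["--fuzzy", "user", "foo"], fuzzy_takes_argument=False)
```

With the wrong declaration the verb disappears from the positional arguments, which goes unnoticed only as long as the swallowed word happens to equal the default verb.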
Daan De Meyer [Wed, 22 Jan 2025 13:55:45 +0000 (14:55 +0100)]
test: Make sure we run lcov from the meson source directory
In ac75c5192797082c1965ab30be4711490f2937bc, we accidentally changed
the working directory that the tools executed in the wrapper script
are invoked in. This broke our invocations of lcov. Let's explicitly
run those in the meson source directory again to fix the coverage
workflow.
Yu Watanabe [Tue, 21 Jan 2025 18:45:11 +0000 (03:45 +0900)]
networkd-test: unconditionally stop previous invocation of networkd before starting new one
When networkd is already running, creating some .network files and
friends and then starting networkd does not take effect. Let's always
restart networkd when we want to start a new invocation.