core: create per-user inaccessible node from the service manager
Previously, we'd create them from user-runtime-dir@.service. That has
one benefit: since this service runs privileged, we can create the full
set of device nodes. It has one major drawback though: it security-wise
problematic to create files/directories in directories as privileged
user in directories owned by unprivileged users, since they can use
symlinks to redirect what we want to do. As a general rule we hence
avoid this logic: only unpriv code should populate unpriv directories.
Hence, let's move this code to an appropriate place in the service
manager. This means we lose the inaccessible block device node, but
since there's already a fallback in place, this shouldn't be too bad.
nspawn: provide $container and $container_uuid in /run/host too
This has the major benefit that the entire payload of the container can
access these files there. Previously, we'd set them only as env vars,
but that meant only PID 1 could read them directly or other privileged
payload code with access to /run/1/environ.
nspawn,pid1: pass "inaccessible" nodes from cntr mgr to pid1 payload via /run/host
Let's make /run/host the sole place we pass stuff from host to container
in and place the "inaccessible" nodes in /run/host too.
In contrast to the previous two commits this is a minor compat break, but
not a relevant one I think. Previously the container manager would place
these nodes in /run/systemd/inaccessible/ and that's where PID 1 in the
container would try to add them too when missing. Container manager and
PID 1 in the container would thus manage the same dir together.
With this change the container manager now passes an immutable directory
to the container and leaves /run/systemd entirely untouched, and managed
exclusively by PID 1 inside the container, which is nice to have clear
separation on who manages what.
In order to make sure systemd then usses the /run/host/inaccesible/
nodes this commit changes PID 1 to look for that dir and if it exists
will symlink it to /run/systemd/inaccessible.
Now, this will work fine if new nspawn and new pid 1 in the container
work together. as then the symlink is created and the difference between
the two dirs won't matter.
For the case where an old nspawn invokes a new PID 1: in this case
things work as they always worked: the dir is managed together.
For the case where different container manager invokes a new PID 1: in
this case the nodes aren't typically passed in, and PID 1 in the
container will try to create them and will likely fail partially (though
gracefully) when trying to create char/block device nodes. THis is fine
though as there are fallbacks in place for that case.
For the case where a new nspawn invokes an old PID1: this is were the
(minor) incompatibily happens: in this case new nspawn will place the
nodes in the /run/host/inaccessible/ subdir, but the PID 1 in the
container won't look for them there. Since the nodes are also not
pre-created in /run/systed/inaccessible/ PID 1 will try to create them
there as if a different container manager sets them up. This is of
course not sexy, but is not a total loss, since as mentioned fallbacks
are in place anyway. Hence I think it's OK to accept this minor
incompatibility.
The sd_notify() socket that nspawn binds that the payload can use to
talk to it was previously stored in /run/systemd/nspawn/notify, which is
weird (as in the previous commit) since this makes /run/systemd
something that is cooperatively maintained by systemd inside the
container and nspawn outside of it.
We now have a better place where container managers can put the stuff
they want to pass to the payload: /run/host/, hence let's make use of
that.
This is not a compat breakage, since the sd_notify() protocol is based
on the $NOTIFY_SOCKET env var, where we place the new socket path.
nspawn/machine: move mount propagation dir to /run/host/incoming
Previously we'd use a directory /run/systemd/nspawn/incoming for
accepting mounts to propagate from the host. This is a bit weird, since
we have a shared namespace: /run/systemd/ contains both stuff managed by
the surround nspawn as well as from the systemd inside.
We now have the /run/host/ hierarchy that has special stuff we want to
pass from host to container. Let's make use of that here, and move this
directory here too.
This is not a compat breakage, since the payload never interfaces with
that directory natively: it's only nspawn and machined that need to
agree on it.
arg_system == true and getpid() == 1 hold under the very same condition
this early in the main() function (this only changes later when we start
parsing command lines, where arg_system = true is set if users invoke us
in test mode even when getpid() != 1.
Hence, let's simplify things, and merge a couple of if branches and not
pretend they were orthogonal.
homed: default to "btrfs" as fs type in the LUKS backend
Apparently both Fedora and suse default to btrfs now, it should hence be
good enough for us too.
This enables a bunch of really nice things for us, most importanly we
can resize home directories freely (i.e. both grow *and* shrink) while
online. It also allows us to add nice subvolume based home directory
snapshotting later on.
Also, whenever we mention the three supported types, alaways mention
them in alphabetical order, which is also our new order of preference.
Anita Zhang [Tue, 18 Aug 2020 06:09:38 +0000 (23:09 -0700)]
meson: add min version for libfdisk
Was trying to run src/partition/test-repart.sh on CentOS 8 and the first
resize call kept failing with ERANGE. Turned out that CentOS 8 comes
with libfdisk-devel-2.32.1 which is missing
https://github.com/karelzak/util-linux/commit/2f35c1ead621f42f32f7777232568cb03185b473
(in libfdisk 2.33 and up).
home: make libpwquality dep a runtime dlopen() one
Also, let's move the glue for this to src/shared/ so that we later can
reuse this in sysemd-firstboot.
Given that libpwquality is a more a leaf dependency, let's make it
runtime optional, so that downstream distros can downgrade their package
deps from Required to Recommended.
homework: explicitly close cryptsetup context, to not keep loopback device busy
The cryptsetup context pins the loop device even after deactivation.
Let's explicitly release the context to make sure the subsequent
loopback device detaching works cleanly.
homework: sync everything to disk before we rename LUKS loopback file into place
This how this works on Linux: when atomically creating a file we need to
fully populate it under a temporary name and then when we are fully
done, sync it and the directory it is contained in, before renaming it
to the final name.
Franck Bui [Mon, 3 Aug 2020 15:50:11 +0000 (17:50 +0200)]
log: don't explicitly re-open log for failed assertions
This was needed before commit 16e4fd87c5be06d2b7a3b368205c8c5bab9df32a added a
mode that opens the log fds for every single log message. This mode is used in
execute.c since then making the explicit call to log_open unnecessary.
resolve: lift limits on search domains count or length
glibc 2.26 lifted restrictions on search domains count or length to
unlimited. This has also been backported to 2.17 in some distributions (RHEL 7
and derivatives). Other softwares may have their own limits for search domains,
but we should not restrict what is written out any more.
journal: adjust line about when the journal begins and ends
This comes up occasionally with new users. The phrase "Logs begin ..." is
ambiguous because it can be taken to mean the logs being displayed or all logs
(the intended meaning). Let's rephrase this as "Journal begins ..." to make
this clearer.
analyze-security: include an actual syscall name in the message
This information was already available in the debug output, but I think it
is good to include it in the message in the table. This makes it easier to wrap
one's head around the allowlist/denylist filtering.
Topi Miettinen [Mon, 17 Aug 2020 09:08:57 +0000 (12:08 +0300)]
test-fs-util: skip encrypted path test if we get EACCES
Unprivileged test-fs-util fails on my system since /sys/dev/block is
inaccessible for unprivileged users, so let's skip encrypted path test if we
get EACCES or similar.
Luca Boccassi [Mon, 10 Aug 2020 10:22:30 +0000 (11:22 +0100)]
dissect: yield for 2ms when a verity device cannot be opened before retrying
If we don't succeed on the first try it's because another process is
opening the same device. Do a microsleep for 2ms to increase the
chances it has completed the next time around the loop.
Luca Boccassi [Mon, 10 Aug 2020 10:19:22 +0000 (11:19 +0100)]
dissect: wait for udev event if verity device not yet available
The symlink /dev/mapper/dm_name is created by udev after a mapper
device is set up. So libdevmapper/libcrypsetup might tell us that
a verity device exists, but the symlink we use as the source for
the mount operation might not be there yet.
Instead of falling back to a new unique device set up, wait for
the udev event matching on the expected devlink for at least 100ms
(after which the benefits of sharing a device in terms of setup
time start to disappear - on my production machines, opening a new
verity device seems to take between 150ms and 300ms)
dissect: immediately close pipes when we determined we have no data for them
This effectively makes little difference because we exit soon later
anyway, which will close the fds, too. However, it's still useful since
it means the parent will get EOF events on them in the order we process
things and isn't delayed to process the data from the pipes until the
child dies.
That way we can turn off kernel partition scanning if verity data is
available (as we don't support verity for full GPT images, only for
simple file system images).
LOOP_CONFIGURE allows us to configure a loopback device in one ioctl
instead of two, which is not just faster but also removes the race that
udev might start probing the device before we adjusted things properly.
Unfortunately LOOP_CONFIGURE is broken in regards to LO_FLAGS_PARTSCAN
as of kernel 5.8.0. This patch contains a work-around for that, to
fallback to old behaviour if partition scanning is requested but does
not work. Sucks a bit.
Let's make sure systemd-repart can still see the real device before we
replace its mount with an overlay mount, and thus order repart before
volatile-root.
Daan De Meyer [Thu, 6 Aug 2020 20:49:31 +0000 (21:49 +0100)]
bootctl: Remove dependency on machine-id.
The machine-id is used to create a few directories and setup a default
loader entry in loader.conf. Having bootctl create the directories
itself is not particularly useful as it does not put anything in them
and bootctl install is not guaranteed to be called before an initramfs
tool like kernel-install so other programs will always need to have
logic to create the directories themselves if they happen to be called
before bootctl install is called.
On top of this, when using unified kernel images, these are installed to
$BOOT/EFI/Linux which removes the need to have the directories created
by bootctl at all. This further indicates that these directories should
be created by the program that puts something in them rather than by
bootctl.
Removing the machine-id dependency allows bootctl install to be called
even when there's no machine-id in the image. This is useful for image
builders such as mkosi which don't have a machine-id when
installing systemd-boot (via bootctl) because it should only be
generated by systemd when the final image is booted.
The default entry in loader.conf based on the machine-id in loader.conf
is also removed which shouldn't be a massive loss in usability overall.