It appears LOOP_CONFIGURE in 5.8 is even more broken than initially
thought: it doesn't properly propgate lo_sizelimit to the block device
layer. :-(
Let's hence check the block device size immediately after issuing
LOOP_CONFIGURE, and if it doesn't match what we just set let's fallback
to the old ioctls.
This means LOOP_CONFIGURE currently works correctly only for the most
simply case: no partition table logic and no size limit. Sad!
(Kernel people should really be told about the concepts of tests and
even CI, one day!)
repart: if --size= is specified as "auto" determine minimal size for disk image
When assembling a disk image locally, using --size=auto can be used to
generate the minimal image based on the provided definitions. THis is
useful to prepare images that are grown on first boot.
This matches how we handle things everywhere else, i.e. in .mount units,
and similar: when a mount point dir is missing, we create it, let's do
so too when dealing with disk images.
This makes things a lot simpler, more robust, and systematic.
Wiping means writing zero sectors to disk. Hence it's better to do this
before we discard, so that the zeroes we use to overwrite are properly
discarded. If we'd do it the other way round we'd discard the data and
then reallocte it just to write zeroes.
We initialize the partition contents before the partitions actually
exist, hence to reduce confusion let's talk about "future partitions" up
to the point where they are actually realized.
Let's issue the wiping ourselves, so that we know it's done before we
write partition data onto the disk, and before the disk label
is written. Before this commit the writing of the disk label would imply
the wiping step, potentially overriding again what we just wrote into
the disk data section.
(Normally this shouldn't matter, since the partition table metadata
that the wiping process deletes is at the start and end of the disk
while we write our data to the middle, but you never know what kind of
weird signatures might exist that depart from that.)
(And effectively this ends up using the same wiping code, since that's
implemented in libblkkid, and libfdisk just acts as frontend to that
anyway. We now simply call it directly.)
repart: don't unload data we configured explicitly, and fully free all data we match to disk
The context_unload_partition_table() call is supposed to remove all
data from the loaded partitions about how we mapped it to existing
partitions on disk, but it should leave everything we parsed from the
definition files in place.
We mostly got this right, except for two cases:
1. new_uuid is parsed from the definition files and should stay
2. current_label is read from the existing partition table and should be
freed
Jan Chren [Mon, 24 Aug 2020 14:40:11 +0000 (16:40 +0200)]
man: fix a fix of a typo in systemd.service example
The fix from cb263973acf83de22a86f08fe502a9cbd6c01d2b was made the other way around,
i.e. `SIGKILL` was changed to `SIGUSR1`, but the sentence is about a "termination signal", i.e. `SIGKILL`, not `SIGUSR1`.
Clemens Gruber [Fri, 21 Aug 2020 14:03:23 +0000 (16:03 +0200)]
network: can: Fix CAN initialization
When introducing CAN-FD support, the .can_fd_mode was not initalized
with -1 and due to cm.mask containing the CAN_CTRLMODE_FD bit, it was
not ignored when FDMode was not configured but instead disabled.
The same thing happened when listen-only mode support was introduced.
On chips that do not support these features, this lead to an error:
can0: Failed to configure CAN link: Operation not supported
Fix it by intializing all the CAN related tristate variables
(.can_listen_only, .can_fd_mode and .can_non_iso) to -1.
Aurelien Jarno [Wed, 19 Aug 2020 20:44:15 +0000 (22:44 +0200)]
seccomp: add support for riscv64
This patch adds seccomp support to the riscv64 architecture. seccomp
support is available in the riscv64 kernel since version 5.5, and it
has just been added to the libseccomp library.
riscv64 uses generic syscalls like aarch64, so I used that architecture
as a reference to find which code has to be modified.
With this patch, the testsuite passes successfully, including the
test-seccomp test. The system boots and works fine with kernel 5.4 (i.e.
without seccomp support) and kernel 5.5 (i.e. with seccomp support). I
have also verified that the "SystemCallFilter=~socket" option prevents a
service to use the ping utility when running on kernel 5.5.
mount-util: tweak how we find inaccessible device nodes
On new kernels (>= 5.8) unprivileged users may create the 0:0 character
device node. Which is great, as we can use that as inaccessible device
nodes if we run unprivileged. Hence, change how we find the right
inaccessible device inodes: when the user asks for a block device node,
but we have none, try the char device node first. If that doesn't exist,
fall back to the socket node as before.
This means that:
1. in the best case we'll return a node if the right device node type
2. otherwise we hopefully at least can return a device node if one asked
for even if the type doesn't match (i.e. we return char instead of
the requested block device node)
3. in the worst case (old kernels…) we'll return a socket node
Similarly to "setup" vs. "set up", "fallback" is a noun, and "fall back"
is the verb. (This is pretty clear when we construct a sentence in the
present continous: "we are falling back" not "we are fallbacking").
meson: fix build/man/{man,html} to support page redirects
Commands like build/man/man journald.conf.d would show the installed
man page (or an error if the page cannot be found in the global search
path), and not the one in the build directory. If the man page is
a redirect, or the .html is a symlink, resolve it, build the target,
and show that.
For some reason this failed in koji build on s390x:
--- command ---
16:12:46 PATH='/builddir/build/BUILD/systemd-stable-246.1/s390x-redhat-linux-gnu:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/sbin' SYSTEMD_LANGUAGE_FALLBACK_MAP='/builddir/build/BUILD/systemd-stable-246.1/src/locale/language-fallback-map' SYSTEMD_KBD_MODEL_MAP='/builddir/build/BUILD/systemd-stable-246.1/src/locale/kbd-model-map' /builddir/build/BUILD/systemd-stable-246.1/s390x-redhat-linux-gnu/test-acl-util
--- stdout ---
-rw-r-----. 1 mockbuild mock 0 Aug 7 16:12 /tmp/test-empty.7RzmEc
other::---
--- stderr ---
Assertion 'r >= 0' failed at src/test/test-acl-util.c:42, function test_add_acls_for_user(). Aborting.
Follow the same model established for RootImage and RootImageOptions,
and allow to either append a single list of options or tuples of
partition_number:options.
let's make sure we collect the right error code from errno, otherwise
we'll see EPERM (i.e. error 1) for all errors readv() returns (since it
returns -1 on error), including EAGAIN.
core: create per-user inaccessible node from the service manager
Previously, we'd create them from user-runtime-dir@.service. That has
one benefit: since this service runs privileged, we can create the full
set of device nodes. It has one major drawback though: it security-wise
problematic to create files/directories in directories as privileged
user in directories owned by unprivileged users, since they can use
symlinks to redirect what we want to do. As a general rule we hence
avoid this logic: only unpriv code should populate unpriv directories.
Hence, let's move this code to an appropriate place in the service
manager. This means we lose the inaccessible block device node, but
since there's already a fallback in place, this shouldn't be too bad.
nspawn: provide $container and $container_uuid in /run/host too
This has the major benefit that the entire payload of the container can
access these files there. Previously, we'd set them only as env vars,
but that meant only PID 1 could read them directly or other privileged
payload code with access to /run/1/environ.
nspawn,pid1: pass "inaccessible" nodes from cntr mgr to pid1 payload via /run/host
Let's make /run/host the sole place we pass stuff from host to container
in and place the "inaccessible" nodes in /run/host too.
In contrast to the previous two commits this is a minor compat break, but
not a relevant one I think. Previously the container manager would place
these nodes in /run/systemd/inaccessible/ and that's where PID 1 in the
container would try to add them too when missing. Container manager and
PID 1 in the container would thus manage the same dir together.
With this change the container manager now passes an immutable directory
to the container and leaves /run/systemd entirely untouched, and managed
exclusively by PID 1 inside the container, which is nice to have clear
separation on who manages what.
In order to make sure systemd then usses the /run/host/inaccesible/
nodes this commit changes PID 1 to look for that dir and if it exists
will symlink it to /run/systemd/inaccessible.
Now, this will work fine if new nspawn and new pid 1 in the container
work together. as then the symlink is created and the difference between
the two dirs won't matter.
For the case where an old nspawn invokes a new PID 1: in this case
things work as they always worked: the dir is managed together.
For the case where different container manager invokes a new PID 1: in
this case the nodes aren't typically passed in, and PID 1 in the
container will try to create them and will likely fail partially (though
gracefully) when trying to create char/block device nodes. THis is fine
though as there are fallbacks in place for that case.
For the case where a new nspawn invokes an old PID1: this is were the
(minor) incompatibily happens: in this case new nspawn will place the
nodes in the /run/host/inaccessible/ subdir, but the PID 1 in the
container won't look for them there. Since the nodes are also not
pre-created in /run/systed/inaccessible/ PID 1 will try to create them
there as if a different container manager sets them up. This is of
course not sexy, but is not a total loss, since as mentioned fallbacks
are in place anyway. Hence I think it's OK to accept this minor
incompatibility.
The sd_notify() socket that nspawn binds that the payload can use to
talk to it was previously stored in /run/systemd/nspawn/notify, which is
weird (as in the previous commit) since this makes /run/systemd
something that is cooperatively maintained by systemd inside the
container and nspawn outside of it.
We now have a better place where container managers can put the stuff
they want to pass to the payload: /run/host/, hence let's make use of
that.
This is not a compat breakage, since the sd_notify() protocol is based
on the $NOTIFY_SOCKET env var, where we place the new socket path.
nspawn/machine: move mount propagation dir to /run/host/incoming
Previously we'd use a directory /run/systemd/nspawn/incoming for
accepting mounts to propagate from the host. This is a bit weird, since
we have a shared namespace: /run/systemd/ contains both stuff managed by
the surround nspawn as well as from the systemd inside.
We now have the /run/host/ hierarchy that has special stuff we want to
pass from host to container. Let's make use of that here, and move this
directory here too.
This is not a compat breakage, since the payload never interfaces with
that directory natively: it's only nspawn and machined that need to
agree on it.
arg_system == true and getpid() == 1 hold under the very same condition
this early in the main() function (this only changes later when we start
parsing command lines, where arg_system = true is set if users invoke us
in test mode even when getpid() != 1.
Hence, let's simplify things, and merge a couple of if branches and not
pretend they were orthogonal.
Luca Boccassi [Fri, 19 Jun 2020 10:26:22 +0000 (11:26 +0100)]
systemctl: add --timestamp to change timestamp print format
Timestamps for unit start/stop are recorded with microsecond granularity,
but status and show truncate to second granularity by default.
Add a --timestamp=pretty|us|utc option to allow including the microseconds
or to use the UTC TZ to all timestamps printed by systemctl.
homed: default to "btrfs" as fs type in the LUKS backend
Apparently both Fedora and suse default to btrfs now, it should hence be
good enough for us too.
This enables a bunch of really nice things for us, most importanly we
can resize home directories freely (i.e. both grow *and* shrink) while
online. It also allows us to add nice subvolume based home directory
snapshotting later on.
Also, whenever we mention the three supported types, alaways mention
them in alphabetical order, which is also our new order of preference.
Benjamin Berg [Thu, 23 Jul 2020 10:56:32 +0000 (12:56 +0200)]
mount-setup: Enable memory_recursiveprot for cgroup2
When available, enable memory_recursiveprot. Realistically it always
makes sense to delegate MemoryLow= and MemoryMin= to all children of a
slice/unit.
The kernel option is not enabled by default as it might cause
regressions in some setups. However, it is the better default in
general, and it results in a more flexible and obvious behaviour.
The alternative to using this option would be for user's to also set
DefaultMemoryLow= on slices when assigning MemoryLow=. However, this
makes the effect of MemoryLow= on some children less obvious, as it
could result in a lower protection rather than increasing it.
From the kernel documentation:
memory_recursiveprot
Recursively apply memory.min and memory.low protection to
entire subtrees, without requiring explicit downward
propagation into leaf cgroups. This allows protecting entire
subtrees from one another, while retaining free competition
within those subtrees. This should have been the default
behavior but is a mount-option to avoid regressing setups
relying on the original semantics (e.g. specifying bogusly
high 'bypass' protection values at higher tree levels).
This was added in kernel commit 8a931f801340c (mm: memcontrol:
recursive memory.low protection), which became available in 5.7 and was
subsequently fixed in kernel 5.7.7 (mm: memcontrol: handle div0 crash
race condition in memory.low).
shared/seccomp: do not use ifdef guards around textual syscall names
It is possible that we will be running with an upgraded libseccomp, in which
case libseccomp might know the syscall name, even if the number is not known at
the time when systemd is being compiled. The guard only serves to break such
upgrades, by requiring that we also recompile systemd.
For s390-specific syscalls, use a define to exclude them, so that that we don't
try to filter them on other arches.
Anita Zhang [Tue, 18 Aug 2020 06:09:38 +0000 (23:09 -0700)]
meson: add min version for libfdisk
Was trying to run src/partition/test-repart.sh on CentOS 8 and the first
resize call kept failing with ERANGE. Turned out that CentOS 8 comes
with libfdisk-devel-2.32.1 which is missing
https://github.com/karelzak/util-linux/commit/2f35c1ead621f42f32f7777232568cb03185b473
(in libfdisk 2.33 and up).