Peter Morrow [Tue, 13 Apr 2021 16:20:42 +0000 (17:20 +0100)]
core: allow services stuck in reloading state to exit
If a service is in reloading state but has exited do not delay
the final exit until the service reload timer expires. Instead allow
the service to exit immediately since we can't expect the service to
ever transition out of reloading state.
For example if a service sent RELOADING=1 but crashed before it could
send READY=1 then it should be restarted if the service had
Restart= configured.
Signed-off-by: Peter Morrow <pemorrow@linux.microsoft.com>
Igor Zhbanov [Tue, 20 Apr 2021 17:22:28 +0000 (17:22 +0000)]
journald: Retry if posix_fallocate returned -1 (EINTR)
On some conditions (particularly when mobile CPUs are going to sleep),
the posix_fallocate(), which is called when a new journal file is allocated,
can return -1 (EINTR). This is counted as a fatal error. So the journald
closes both old and journals, and simply throwing away further incoming
events, because of no log files open.
Introduce posix_fallocate_loop() that restarts the function in the case
of EINTR. Also let's make code base more uniform by returning negative
values on error.
Fix assert in test-sigbus.c that incorrectly counted positive values as
success. After changing the function return values, that will actually work.
Fixes: #19041 Signed-off-by: Igor Zhbanov <i.zhbanov@omprussia.ru>
generator: write out special systemd-fsck-usr.service
So far all file systems where checked by instances of
systemd-fsck@.service, with the exception of the root fs which was
covered by systemd-fsck-root.service. The special handling is necessary
to deal with ordering issues: we typically want the root fs to be
checked before all others, and — weirdly — allow mounting it before the
fsck done (for compat with initrd-less boots).
This adds similar special handling for /usr: if the hierarchy is placed
on a separate file system check it with a special
systemd-fsck-usr.service instead of a regular sysemd-fsck@.service
instance. Reason is again ordering: we want to allow mounting of /usr
without the root fs already being around in the initrd, to cover for
cases where the root fs is created on first boot and thus cannot be
mounted/checked before /usr.
I was worried that the text size will grow, but apparently that's not the
case:
With --optimization=2:
$ size build/src/shared/libsystemd-shared-248.a.p/gpt.c.o*
text data bss dec hex filename
3674 1104 0 4778 12aa build/src/shared/libsystemd-shared-248.a.p/gpt.c.o.old
3085 1104 0 4189 105d build/src/shared/libsystemd-shared-248.a.p/gpt.c.o
(I don't understand the generated assembly, even though it seems to work:
Disassembly of section .text.gpt_partition_type_is_usr_verity:
It is made inline in the hope that the compiler will be able to optimize
all the va_args boilerplate away, and do an efficient comparison when
the arguments are all constants.
generator: exit early when asked to generate fsck unit for / and /usr in initrd
Let's exit early if we are invoked to generate an fsck unit for the
rootfs or /usr of the initrd itself. The "systemd-root-fsck.service" and
"systemd-usr-fsck.service" units are after all for the host file
systems, and the initrd file hierarchy is from an unpacked cpio anyway.
Hence, this semantically doesn't really make sense, so quickly exit if
we detect this case. This allows us to remove some checks further down
the codepath.
In man pages, horizontal space it at premium, and everything should
generally be indented with 2 spaces to make it more likely that the
examples fit on a user's screen.
This teaches repart to look for the root block device both as the
backing for /sysroot and for /sysusr/usr.
The latter is a new addition, and starts making more sense with the next
commit. It's about supporting systems that are shipped with only a /usr/
fs, but where a root fs is allocated and formatted on first boot via
systemd-repart (or a similar tool). In this case it's useful to be able
to mount the ultimate /usr/ early on without mounting the root fs
right-away (simple because the rootfs might not exist yet, and we need
the repart data encoded in /usr/ to actually format it). Hence, instead
of requiring that we mount /sysroot/ first and /sysroot/usr/ second as
we did so far, let's rearrange things slightly:
1. We mount the /usr/ file system we discover to /sysusr/usr/
2. We mount the root file system we discover to /sysroot/
3. Once both are established we bind mount /sysusr/usr/ to /sysroot/usr/
And that' it. The first two steps can happen in either order, and we can
access /usr/ with or without a rootfs being around.
This commit implements nothing of the above. Instead, it teaches
systemd-repart to check both /sysroot/ and /sysusr/ for repart drop-ins,
and use the first of these hierarchies it finds populated. This way
systemd-repart can be spawned once /usr is mounted and it will work
correctly without root fs having to exist, or we can invoke it when the
root fs is already mounted, where it also will work correctly.
fstab-generator: if usr= is specified, mount it to /sysusr/usr/ first
This changes the fstab-generator to handle mounting of /usr/ a bit
differently than before. Instead of immediately mounting the fs to
/sysroot/usr/ we'll first mount it to /sysusr/usr/ and then add a
separate bind mount that mounts it from /sysusr/usr/ to /sysroot/usr/.
This way we can access /usr independently of the root fs, without for
waiting to be mounted via the /sysusr/ hierarchy. This is useful for
invoking systemd-repart while a root fs doesn't exist yet and for
creating it, with partition data read from the /usr/ hierarchy.
This introduces a new generic target initrd-usr-fs.target that may be
used to generically order services against /sysusr/ to become available.
dissect: ignore udev database entries from before the loopback attachment
This tries to shorten the race of device reuse a bit more: let's ignore
udev database entries that are older than the time where we started to
use a loopback device.
This doesn't fix the whole loopback device raciness mess, but it makes
the race window a bit shorter.
loop-util: track CLOCK_MONOTONIC timestamp immediately before attaching a loopback device
This is similar to the preceding work to store the uevent seqnum, but
this stores the CLOCK_MONOTONIC timestamp.
Why? This allows to validate udev database entries, to determine if they
were created *after* we attached the device.
The uevent seqnum logic allows us to validate uevent, and the timestamp
database entries, hence together we should be able to validate both
sources of truth for us.
(note that this is all racy, just a bit less racy, since we cannot
atomically attach loopback devices and get the timestamp for it, the
same way we can't get the uevent seqnum. Thus is shortens the race
window, but doesn#t close it).
sd-device: add API to query from when a udev database entry is
We already store a CLOCK_MONOTONIC timestamp for each device appearance,
let' make this queriable.
This is useful to determine whether a udev device database entry is from
a current appearance of the device or a previous one, by comparing it
with appropriately taken timestamps.
dissect: ignore old uevents when waiting for loopback partition scan
Let's drop all monitor uevent that were enqueued before we actually
started setting up the device.
This doesn't fix the race, but it makes the race window smaller: since
we cannot determine the uevent seqnum and the loopback attachment
atomically, there's a tiny window where uevents might be generated by
the device which we mistake for being associated with out use of the
loopback device.
loop-util: read kernel's uevent seqnum right before attaching a loopback device
Later, this will allow us to ignore uevents from earlier attachments a
bit better, as we can compare uevent seqnums with this boundary. It's
not a full fix for the race though, since we cannot atomically determine
the uevent and attach the device, but it at least shortens the window a
bit.
loop-util: make loop_device_make() return fd in all code paths
Previously, loop_device_make() would return the device fd in one success
code path, but not the other (where' we'd just return 0).
loop_device_open() returns it in all cases.
Hence, let's clean this up, and make sure in all success code paths of
both functions we return it (even though it strictly speaking is
redundant, since we return it in LoopDevice anyway, and currently noone
actually relies on this).
Miroslav Suchý [Tue, 20 Apr 2021 08:23:01 +0000 (10:23 +0200)]
document DefaultOOMPolicy
the `man systemd.service` say:
Defaults to the setting DefaultOOMPolicy= in systemd-system.conf(5) is set to
but there is no such line in this config.
This is the default value I extracted from
systemctl show --property=DefaultOOMPolicy
repart: add new ReadOnly= and Flags= settings for repart dropins
Let's make the GPT partition flags configurable when creating new
partitions. This is primarily useful for the read-only flag (which we
want to set for verity enabled partitions).
This adds two settings for this: Flags= and ReadOnly=, which strictly
speaking are redundant. The main reason to have both is that usually the
ReadOnly= setting is the one wants to control, and it' more generic.
Moreover we might later on introduce inherting of flags from CopyBlocks=
partitions, where one might want to control most flags as is except for
the RO flag and similar, hence let's keep them separate.
When using systemd-repart as an installer that replicates the install
medium on another medium it is useful to reference the root
partition/usr partition or verity data that is currently booted, in
particular in A/B scenarios where we have two copies and want to
reference the one we currently use. Let's add a CopyBlocks=auto for this
case: for a partition that uses that we'll copy a suitable partition
from the host.
CopyBlocks=auto finds the partition to copy like this: based on the
configured partition type uuid we determine the usual mount point (i.e.
for the /usr partition type we determine /usr/, and so on). We then
figure out the block device behind that path, through dm-verity and
dm-crypt if necessary. Finally, we compare the partition type uuid of
the partition found that way with the one we are supposed to fill and
only use it if it matches (the latter is primarily important on
dm-verity setups where a volume is likely backed by two partitions and
we need to find the right one).
This is particularly fun to use in conjunction with --image= (where
we'll restrict the device search onto the specify device, for security
reasons), as this allows "duplicating" an image like this:
If the right repart data is embedded into "source.raw" this will be able
to create and initialize a partition table on target.raw that carrries
all needed partitions, and will stream the source's file systems onto it
as configured.
repart: add high-level setting for creating dirs in formatted file systems
So far we already had the CopyFiles= option in systemd-repart drop-in
files, as a mechanism for populating freshly formatted file systems with
files and directories. This adds MakeDirectories= in similar style, and
creates simple directories as listed. The option is of course entirely
redundant, since the same can be done with CopyFiles= simply by copying
in a directory. It's kinda nice to encode the dirs to create directly in
the drop-in files however, instead of providing a directory subtree to
copy in somehere, to make the files more self-contained — since often
just creating dirs is entirely sufficient.
The main usecase for this are GPT OS images that carry only a /usr/
tree, and for which a root file system is only formatted on first boot
via repart. Without any additional CopyFiles=/MakeDirectories=
configuration these root file systems are entirely empty of course
initially. To mount in the /usr/ tree, a directory inode for /usr/ to
mount over needs to be created. systemd-nspawn will do so automatically
when booting up the image, as will the initrd during boot. However, this
requires the image to be writable – which is OK for npawn and
initrd-based boots, but there are plenty tools where read-only operation
is desirable after repart ran, before the image was booted for the first
time. Specifically, "systemd-dissect" opens the image in read-only to
inspect its contents, and this will only work of /usr/ can be properly
mounted. Moreover systemd-dissect --mount --read-only won't succeed
either if the fs is read-only.
Via MakeDirectories= we now provide a way that ensures that the image
can be mounted/inspected in a fully read-only way immediately after
systemd-repart completed. Specifically, let's consider a GPT disk image
shipping with a file usr/lib/repart.d/50-root.conf:
With this in place systemd-repart will create a root partition when run,
and add /usr and /efi into it as directory inods. This ensures that the
whole image can then be mounted truly read-only anf /usr and /efi can be
overmounted by the /usr partition and the ESP.
libfdisk appears to return NULL when encountering an empty partition
label, let's handle this sanely, and treat NULL and "" for the current
label as the same, but for the new label as distinct: there NULL means
nothing is set, and "" means an actual empty label.
This is similar to the --image= switch in the other tools, like
systemd-sysusers or systemd-tmpfiles, i.e. it apply the configuration
from the image to the image.
This is particularly useful for downloading minimized GPT image, and
then extending it to the desired size via:
Let's have one flag to request that when dissecting an image the
loopback device is made read-only, and another one to request that when
it is mounted to make it read-only. Previously both concepts were always
done read-only together.
(Of course, making the loopback device read-only but mounting it
read-write doesn't make too much sense, but the kernel should catch that
for us, no need to make restrictions from our side there)
Use-case for this: in systemd-repart we'd like to operate on images for
adding partitions. Thus we'd like to have the loopback device writable,
but if we read repart.d/ snippets from it, we want to do that read-only.