dissect-image: prefer PARTN= uevent property over "partition" sysfs attr
The kernel will send us a PARTN= uevent property with partition add
events, let's use it instead of going for the "partition" sysfs attr.
It's less racy that way and there are reports the sysfs attr shows up
after the device, which makes it even worse.
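
A minimal sketch of the lookup, assuming the uevent properties are available
as a NULL-terminated list of "KEY=VALUE" strings. The helper name
uevent_get_partn() is hypothetical and is not the function used in
dissect-image; it only illustrates reading the partition number from the
event itself rather than from sysfs.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: find PARTN= in a NULL-terminated property list.
 * Returns 0 and stores the partition number on success, -ENOENT if the
 * property is missing, -EINVAL if the value is not a positive number. */
int uevent_get_partn(const char *const *props, unsigned *ret) {
        assert(props);
        assert(ret);

        for (; *props; props++) {
                const char *value;
                char *end;
                unsigned long v;

                if (strncmp(*props, "PARTN=", 6) != 0)
                        continue;

                value = *props + 6;
                errno = 0;
                v = strtoul(value, &end, 10);
                if (errno != 0 || end == value || *end != '\0' || v == 0 || v > UINT_MAX)
                        return -EINVAL;

                *ret = (unsigned) v;
                return 0;
        }

        return -ENOENT;
}
```

A caller would fall back to the sysfs attribute only when this returns
-ENOENT.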
Peter Morrow [Tue, 13 Apr 2021 16:20:42 +0000 (17:20 +0100)]
core: allow services stuck in reloading state to exit
If a service is in reloading state but has exited do not delay
the final exit until the service reload timer expires. Instead allow
the service to exit immediately since we can't expect the service to
ever transition out of reloading state.
For example if a service sent RELOADING=1 but crashed before it could
send READY=1 then it should be restarted if the service had
Restart= configured.
Signed-off-by: Peter Morrow <pemorrow@linux.microsoft.com>
repart: don't try to extract directory of root dir when copying directories
It's OK to specify the root dir as target directory when copying
directories. However, in that case path_extract_filename() is going to
fail, because the root dir simply has no filename.
Let's address that by moving the call further down into the loop, when
we made sure that the target dir doesn't exist yet (the root dir always
exists, hence this check is sufficient).
Moreover, in the branch for copying regular files, also move the calls
down, and generate friendly error messages in case people try to
overwrite dirs with regular files (and the root dir is just a special
case of a dir).
Altogether this makes CopyFiles=/some/place:/ work, i.e. copying some
dir on the host into the root dir of the newly created fs. Previously
this would fail with an error about the inability to extract a filename
from "/", needlessly.
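
To illustrate why "/" is special, here is a simplified stand-in for a
filename extraction helper. It is not systemd's path_extract_filename(); the
name extract_last_component() and the chosen error codes are assumptions made
for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the last component of 'path' into 'buf'.
 * Returns 0 on success, -EINVAL for an empty path, -EADDRNOTAVAIL if the
 * path consists only of slashes (i.e. is the root dir and has no filename),
 * -ENAMETOOLONG if 'buf' is too small. */
int extract_last_component(const char *path, char *buf, size_t size) {
        size_t start, end;

        assert(path);
        assert(buf);

        if (path[0] == '\0')
                return -EINVAL;

        end = strlen(path);
        while (end > 0 && path[end - 1] == '/') /* drop trailing slashes */
                end--;
        if (end == 0)
                return -EADDRNOTAVAIL; /* only slashes: the root dir */

        start = end;
        while (start > 0 && path[start - 1] != '/')
                start--;

        if (end - start >= size)
                return -ENAMETOOLONG;

        memcpy(buf, path + start, end - start);
        buf[end - start] = '\0';
        return 0;
}
```

Calling such a helper only after establishing that the target does not exist
yet means it is never invoked for "/", since the root dir always exists.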
repart: prefix the correct path with root dir in log output
When we copy files into the freshly formatted file system, the mount
point prefix must be prepended to the *target* path, not the *source*
path. Not just in code but in the log message about it, too.
Igor Zhbanov [Tue, 20 Apr 2021 17:22:28 +0000 (17:22 +0000)]
journald: Retry if posix_fallocate returned -1 (EINTR)
Under some conditions (particularly when mobile CPUs are going to sleep),
posix_fallocate(), which is called when a new journal file is allocated,
can return -1 (EINTR). This is counted as a fatal error, so journald
closes both the old and new journals and simply throws away further incoming
events, because no log files are open.
Introduce posix_fallocate_loop() that restarts the function in the case
of EINTR. Also let's make the code base more uniform by returning negative
values on error.
Fix assert in test-sigbus.c that incorrectly counted positive values as
success. After changing the function return values, that will actually work.
Fixes: #19041
Signed-off-by: Igor Zhbanov <i.zhbanov@omprussia.ru>
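
A minimal sketch of such a retry loop, not the actual posix_fallocate_loop()
from the commit. The function name fallocate_retry() and the retry bound are
assumptions. Note that POSIX specifies that posix_fallocate() returns the
error number directly instead of setting errno, which is why the sketch
negates the return value.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical bound on retries, so that a persistent EINTR cannot make
 * us spin forever. Not taken from the commit. */
#define FALLOCATE_MAX_RETRIES 1000u

/* Calls posix_fallocate(), restarting it when interrupted.
 * Returns 0 on success or a negative errno-style value on failure. */
int fallocate_retry(int fd, off_t offset, off_t size) {
        assert(size > 0);

        for (unsigned i = 0; i < FALLOCATE_MAX_RETRIES; i++) {
                /* posix_fallocate() returns 0 or a positive error number */
                int r = posix_fallocate(fd, offset, size);
                if (r != EINTR)
                        return -r;
        }

        return -EINTR;
}
```

With negative return values, a caller can treat any result below zero as
failure, which is the convention the commit message refers to.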
generator: write out special systemd-fsck-usr.service
So far all file systems were checked by instances of
systemd-fsck@.service, with the exception of the root fs which was
covered by systemd-fsck-root.service. The special handling is necessary
to deal with ordering issues: we typically want the root fs to be
checked before all others, and, weirdly, allow mounting it before the
fsck is done (for compat with initrd-less boots).
This adds similar special handling for /usr: if the hierarchy is placed
on a separate file system check it with a special
systemd-fsck-usr.service instead of a regular systemd-fsck@.service
instance. The reason is again ordering: we want to allow mounting of /usr
without the root fs already being around in the initrd, to cover for
cases where the root fs is created on first boot and thus cannot be
mounted/checked before /usr.
network: enable DHCP broadcast flag if required by interface
Some interfaces require that the DHCPOFFER message is sent via broadcast
if they can't receive unicast messages before they've been configured
with an IP address.
E.g., s390 ccwgroup network interfaces operating in layer3 mode face
this limitation. This can prevent the interfaces from receiving an
IP address via DHCP, if they have been configured for layer3.
To allow DHCP over such interfaces, we're introducing a new device
property ID_NET_DHCP_BROADCAST which can be set for those.
The networkd DHCP client will check whether this property is set
for an interface, and if so will set the broadcast flag, unless
the network configuration for the interface has an explicit
RequestBroadcast setting.
Besides that, we're adding a udev rule to set this device property
for ccwgroup devices operating in layer3 mode, which is the case
if the ID_NET_DRIVER property is qeth_l3.
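
The precedence described above can be sketched as follows. This is an
illustration only, not the networkd code: the function name
dhcp_use_broadcast(), the tristate encoding, and the assumption that the
property is set to the string "1" are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* request_broadcast is a tristate: -1 means RequestBroadcast= is not
 * configured, 0 means explicitly off, 1 means explicitly on.
 * id_net_dhcp_broadcast is the value of the ID_NET_DHCP_BROADCAST device
 * property, or NULL if unset (assumed here to be "1" when enabled). */
bool dhcp_use_broadcast(int request_broadcast, const char *id_net_dhcp_broadcast) {
        assert(request_broadcast >= -1 && request_broadcast <= 1);

        /* An explicit setting in the network configuration always wins. */
        if (request_broadcast >= 0)
                return request_broadcast > 0;

        /* Otherwise fall back to the device property. */
        return id_net_dhcp_broadcast != NULL && strcmp(id_net_dhcp_broadcast, "1") == 0;
}
```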
I was worried that the text size will grow, but apparently that's not the
case:
With --optimization=2:
$ size build/src/shared/libsystemd-shared-248.a.p/gpt.c.o*
text data bss dec hex filename
3674 1104 0 4778 12aa build/src/shared/libsystemd-shared-248.a.p/gpt.c.o.old
3085 1104 0 4189 105d build/src/shared/libsystemd-shared-248.a.p/gpt.c.o
(I don't understand the generated assembly, even though it seems to work:
Disassembly of section .text.gpt_partition_type_is_usr_verity:
It is made inline in the hope that the compiler will be able to optimize
all the va_args boilerplate away, and do an efficient comparison when
the arguments are all constants.
generator: exit early when asked to generate fsck unit for / and /usr in initrd
Let's exit early if we are invoked to generate an fsck unit for the
rootfs or /usr of the initrd itself. The "systemd-fsck-root.service" and
"systemd-fsck-usr.service" units are after all for the host file
systems, and the initrd file hierarchy is from an unpacked cpio anyway.
Hence, this semantically doesn't really make sense, so quickly exit if
we detect this case. This allows us to remove some checks further down
the codepath.
In man pages, horizontal space is at a premium, and everything should
generally be indented with 2 spaces to make it more likely that the
examples fit on a user's screen.
This teaches repart to look for the root block device both as the
backing for /sysroot and for /sysusr/usr.
The latter is a new addition, and starts making more sense with the next
commit. It's about supporting systems that are shipped with only a /usr/
fs, but where a root fs is allocated and formatted on first boot via
systemd-repart (or a similar tool). In this case it's useful to be able
to mount the ultimate /usr/ early on without mounting the root fs
right away (simply because the rootfs might not exist yet, and we need
the repart data encoded in /usr/ to actually format it). Hence, instead
of requiring that we mount /sysroot/ first and /sysroot/usr/ second as
we did so far, let's rearrange things slightly:
1. We mount the /usr/ file system we discover to /sysusr/usr/
2. We mount the root file system we discover to /sysroot/
3. Once both are established we bind mount /sysusr/usr/ to /sysroot/usr/
And that's it. The first two steps can happen in either order, and we can
access /usr/ with or without a rootfs being around.
This commit implements nothing of the above. Instead, it teaches
systemd-repart to check both /sysroot/ and /sysusr/ for repart drop-ins,
and use the first of these hierarchies it finds populated. This way
systemd-repart can be spawned once /usr is mounted and it will work
correctly without root fs having to exist, or we can invoke it when the
root fs is already mounted, where it also will work correctly.
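
The "use the first populated hierarchy" logic can be sketched like this. It
is an assumption-laden illustration, not the repart implementation: the
helper names are hypothetical, and "populated" is interpreted here simply as
"the directory exists and contains at least one entry".

```c
#include <assert.h>
#include <dirent.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if 'path' is a readable directory with at least one entry
 * besides "." and "..". */
bool dir_is_populated(const char *path) {
        struct dirent *de;
        bool found = false;
        DIR *d;

        assert(path);

        d = opendir(path);
        if (!d)
                return false;

        while ((de = readdir(d))) {
                if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
                        continue;
                found = true;
                break;
        }

        closedir(d);
        return found;
}

/* Returns the first populated directory from a NULL-terminated candidate
 * list, or NULL if none qualifies. */
const char *first_populated(const char *const *candidates) {
        assert(candidates);

        for (; *candidates; candidates++)
                if (dir_is_populated(*candidates))
                        return *candidates;

        return NULL;
}
```

For the case described here, the candidate list would be "/sysroot" followed
by "/sysusr".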
fstab-generator: if usr= is specified, mount it to /sysusr/usr/ first
This changes the fstab-generator to handle mounting of /usr/ a bit
differently than before. Instead of immediately mounting the fs to
/sysroot/usr/ we'll first mount it to /sysusr/usr/ and then add a
separate bind mount that mounts it from /sysusr/usr/ to /sysroot/usr/.
This way we can access /usr independently of the root fs, via the
/sysusr/ hierarchy, without waiting for the root fs to be mounted. This is useful for
invoking systemd-repart while a root fs doesn't exist yet and for
creating it, with partition data read from the /usr/ hierarchy.
This introduces a new generic target initrd-usr-fs.target that may be
used to generically order services after /sysusr/ has become available.
dissect: ignore udev database entries from before the loopback attachment
This tries to shorten the race of device reuse a bit more: let's ignore
udev database entries that are older than the time where we started to
use a loopback device.
This doesn't fix the whole loopback device raciness mess, but it makes
the race window a bit shorter.
loop-util: track CLOCK_MONOTONIC timestamp immediately before attaching a loopback device
This is similar to the preceding work to store the uevent seqnum, but
this stores the CLOCK_MONOTONIC timestamp.
Why? This allows us to validate udev database entries, to determine if they
were created *after* we attached the device.
The uevent seqnum logic allows us to validate uevents, and the timestamp
allows us to validate database entries, hence together we should be able
to validate both sources of truth for us.
(note that this is all racy, just a bit less racy, since we cannot
atomically attach loopback devices and get the timestamp for it, the
same way we can't get the uevent seqnum. Thus it shortens the race
window, but doesn't close it).
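
A minimal sketch of the timestamp comparison, assuming microsecond
timestamps. The helper names are hypothetical, and how the database entry's
timestamp is obtained is left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Current CLOCK_MONOTONIC time in microseconds. To be called immediately
 * before attaching the loopback device. */
uint64_t now_monotonic_usec(void) {
        struct timespec ts;
        int r;

        r = clock_gettime(CLOCK_MONOTONIC, &ts);
        assert(r == 0);
        (void) r;

        return (uint64_t) ts.tv_sec * 1000000u + (uint64_t) ts.tv_nsec / 1000u;
}

/* A udev database entry is only trusted if it was created at or after the
 * point in time we recorded before the attachment. */
bool db_entry_is_current(uint64_t entry_usec, uint64_t attach_usec) {
        return entry_usec >= attach_usec;
}
```

Because the timestamp is taken before the attachment rather than atomically
with it, an entry from a previous user of the device that lands inside that
gap would still pass this check.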
sd-device: add API to query from when a udev database entry is
We already store a CLOCK_MONOTONIC timestamp for each device appearance,
let's make this queryable.
This is useful to determine whether a udev device database entry is from
a current appearance of the device or a previous one, by comparing it
with appropriately taken timestamps.
dissect: ignore old uevents when waiting for loopback partition scan
Let's drop all monitor uevents that were enqueued before we actually
started setting up the device.
This doesn't fix the race, but it makes the race window smaller: since
we cannot determine the uevent seqnum and the loopback attachment
atomically, there's a tiny window where uevents might be generated by
the device which we mistake for being associated with our use of the
loopback device.
loop-util: read kernel's uevent seqnum right before attaching a loopback device
Later, this will allow us to ignore uevents from earlier attachments a
bit better, as we can compare uevent seqnums with this boundary. It's
not a full fix for the race though, since we cannot atomically determine
the uevent and attach the device, but it at least shortens the window a
bit.
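
The boundary check can be sketched as below. The kernel exposes the current
sequence number in /sys/kernel/uevent_seqnum; the function names are
hypothetical and this is not the loop-util code itself.

```c
#include <assert.h>
#include <errno.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Reads the kernel's current uevent sequence number. To be called right
 * before attaching the loopback device. Returns 0 or a negative errno. */
int read_uevent_seqnum(uint64_t *ret) {
        FILE *f;
        int r;

        assert(ret);

        f = fopen("/sys/kernel/uevent_seqnum", "r");
        if (!f)
                return -errno;

        r = fscanf(f, "%" SCNu64, ret) == 1 ? 0 : -EIO;
        fclose(f);
        return r;
}

/* Events numbered at or below the boundary were generated before we read
 * it, hence cannot stem from our own attachment and are dropped. */
bool uevent_is_stale(uint64_t event_seqnum, uint64_t boundary) {
        return event_seqnum <= boundary;
}
```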
loop-util: make loop_device_make() return fd in all code paths
Previously, loop_device_make() would return the device fd in one success
code path, but not the other (where we'd just return 0).
loop_device_open() returns it in all cases.
Hence, let's clean this up, and make sure in all success code paths of
both functions we return it (even though it strictly speaking is
redundant, since we return it in LoopDevice anyway, and currently no one
actually relies on this).
Miroslav Suchý [Tue, 20 Apr 2021 08:23:01 +0000 (10:23 +0200)]
document DefaultOOMPolicy
`man systemd.service` says:
Defaults to the setting DefaultOOMPolicy= in systemd-system.conf(5) is set to
but there is no such line in this config.
This is the default value I extracted from
systemctl show --property=DefaultOOMPolicy
repart: add new ReadOnly= and Flags= settings for repart dropins
Let's make the GPT partition flags configurable when creating new
partitions. This is primarily useful for the read-only flag (which we
want to set for verity enabled partitions).
This adds two settings for this: Flags= and ReadOnly=, which strictly
speaking are redundant. The main reason to have both is that usually the
ReadOnly= setting is the one that one wants to control, and it's more generic.
Moreover we might later on introduce inheriting of flags from CopyBlocks=
partitions, where one might want to control most flags as is except for
the RO flag and similar, hence let's keep them separate.
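
One way the two settings could combine is sketched below. Bit 60 of the GPT
partition attributes is the read-only flag; everything else here is an
assumption for illustration, including the function name and the choice that
an explicit ReadOnly= takes precedence over the corresponding bit in Flags=.

```c
#include <assert.h>
#include <stdint.h>

/* Read-only attribute: bit 60 of the GPT partition flags. */
#define GPT_FLAG_READ_ONLY (UINT64_C(1) << 60)

/* flags: the value configured via Flags= (0 if unset).
 * read_only: tristate for ReadOnly=, -1 if unset, 0 for no, 1 for yes.
 * Assumption of this sketch: an explicit ReadOnly= overrides bit 60. */
uint64_t partition_effective_flags(uint64_t flags, int read_only) {
        assert(read_only >= -1 && read_only <= 1);

        if (read_only > 0)
                return flags | GPT_FLAG_READ_ONLY;
        if (read_only == 0)
                return flags & ~GPT_FLAG_READ_ONLY;

        return flags; /* ReadOnly= unset: keep Flags= untouched */
}
```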