firstboot: synchronously wait for systemd-vconsole-setup.service/restart job
Requested in https://github.com/systemd/systemd/pull/27755#pullrequestreview-1443489520.
I dropped the info message about the job being requested, because we get
fairly verbose logs from starting the unit, and the additional message isn't
useful.
In the unit, the ordering before systemd-vconsole-setup.service is dropped,
because now it needs to happen in parallel, while systemd-firstboot.service
is running. This means that we may potentially execute vconsole-setup twice,
but it's fairly quick, so this doesn't matter much.
Frantisek Sumsal [Wed, 24 May 2023 11:29:52 +0000 (13:29 +0200)]
tree-wide: check memstream buffer after closing the handle
When closing the FILE handle attached to a memstream, it may attempt to
do a realloc() that may fail during OOM situations, in which case we are
left with the buffer pointer pointing to NULL and buffer size > 0. For
example:
Malte Poll [Wed, 24 May 2023 09:01:25 +0000 (11:01 +0200)]
ukify: fix handling signed kernel as file
The .linux section would contain the path to the signed kernel (instead of the signed kernel itself), since the python type of the variable is used to determine how it is handled when adding the pe sections.
Co-authored-by: Otto Bittner <cobittner@posteo.net>
Mike Yuan [Mon, 22 May 2023 00:30:30 +0000 (08:30 +0800)]
core: get rid of unused Service.will_auto_restart logic
The announced new behavior for OnFailure= never worked properly,
and we've fixed the document instead in #27675.
Therefore, let's get rid of the unused logic completely. More at #27594.
The to-be-added RestartMode= option should cover the use case hopefully.
mount: check right before invoking /bin/umount if it makes sense
Notifications from /proc/self/mountinfo are async, so if we stop a
service (and while doing so get rid of the credentials mount point of
it), then it will take a while until the notification reaches us and we
actually scan the table again. In particular as we nowadays ratelimit
notifications on the table, since it's so inefficient. And as I learnt
the ratelimiting is actually quite regularly hit during shutdown, where
a flurry of umount events are genreated. Hence, let's check if a mount
point is actually a mountpoint before trying to unmount it. And if it
isn't let's wait for the notification to come in.
(This race might be triggred not just by us on ourselves btw: there are
other daemons that unmount stuff when stopping where the race also
exists, but might simply be harder to trigger: if during service
shutdown these services remove some mount then they might collide with
us doing the same. After all, we have the rule to unmount everything
mounted automatically for you during shutdown.)
In the long run we should also start making us of this when it becomes
available: https://github.com/util-linux/util-linux/issues/2132 With
that we can make issues like this go away entirely from our side of
things at least.
core: split out default network dep generation into own function
Just some simple refactoring: let's split out network dep generation
into its own function mount_add_default_network_dependencies().
This way mount_add_default_dependencies() only does preparatory stuff,
and then calls both mount_add_default_network_dependencies() and
mount_add_default_ordering_dependencies() with that, making things
nicely symmetric.
core: suppress various defaults deps for credentials mounts
The per-unit credentials mounts might show up for any kind of service,
including very very early ones. Hence let's not order them after
local-fs-pre.target, because otherwise we might trigger cyclic deps of
services that want to plug before that but still use credentials.
Daan De Meyer [Tue, 23 May 2023 14:24:47 +0000 (16:24 +0200)]
core/timer: Always use inactive_exit_timestamp if it is set
If we're doing a daemon-reload, we'll be going from TIMER_DEAD => TIMER_WAITING,
so we won't use inactive_exit_timestamp because TIMER_DEAD != UNIT_ACTIVE, even
though inactive_exit_timestamp is serialized/deserialized and will be valid after
the daemon-reload.
This issue can lead to timers never firing as we'll always calculate the next
elapse based on the current realtime on daemon-reload, so if daemon-reload happens
often enough, the elapse interval will be moved into the future every time, which
means the timer will never trigger.
To fix the issue, let's always use inactive_exit_timestamp if it is set, and only
fall back to the current realtime if it is not set.
Jan Janssen [Tue, 23 May 2023 17:00:52 +0000 (19:00 +0200)]
elf2efi: Do not emit an empty relocation section
At least shim will choke on an empty relocation section when loading the
binary. Note that the binary is still considered relocatable (just with
no base relocations to apply) as we do not set the
IMAGE_FILE_RELOCS_STRIPPED DLL characteristic.
msizanoen1 [Tue, 23 May 2023 11:46:26 +0000 (18:46 +0700)]
core: Do not check child freezability when thawing slice
We want thawing operations to still succeed even in the presence of an
unfreezable unit type (e.g. mount) appearing under a slice after the
slice was frozen. The appearance of such units should never cause the
slice thawing operation to fail to prevent potential future repeats of
https://github.com/systemd/systemd/issues/25356.
sd-boot,sd-stub: also print version after the address
The kernel, systemd, and many other things print their version during boot.
sd-boot and sd-stub are also important, so let's print the version if EFI_DEBUG.
(If !EFI_DEBUG, continue to be quiet.)
When updating the docs, I saw that that the text in HACKING.md was out of date.
Instead of trying to update the instructions there, make it shorter and refer
the reader to tools/debug-sd-boot.sh for details.
Daan De Meyer [Tue, 23 May 2023 11:25:58 +0000 (13:25 +0200)]
tree-wide: Fix false positives on newer gcc
Recent gcc versions have started to trigger false positive
maybe-uninitialized warnings. Let's make sure we initialize
variables annotated with _cleanup_ to avoid these.
Frantisek Sumsal [Tue, 23 May 2023 07:55:17 +0000 (09:55 +0200)]
json: correctly handle magic strings when parsing variant strv
We can't dereference the variant object directly, as it might be
a magic object (which has an address on a faulting page); use
json_variant_is_sensitive() instead that handles this case.
For example, with an empty array:
==1547789==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000023 (pc 0x7fd616ca9a18 bp 0x7ffcba1dc7c0 sp 0x7ffcba1dc6d0 T0)
==1547789==The signal is caused by a READ memory access.
==1547789==Hint: address points to the zero page.
SCARINESS: 10 (null-deref)
#0 0x7fd616ca9a18 in json_variant_strv ../src/shared/json.c:2190
#1 0x408332 in oci_args ../src/nspawn/nspawn-oci.c:173
#2 0x7fd616cc09ce in json_dispatch ../src/shared/json.c:4400
#3 0x40addf in oci_process ../src/nspawn/nspawn-oci.c:428
#4 0x7fd616cc09ce in json_dispatch ../src/shared/json.c:4400
#5 0x41fef5 in oci_load ../src/nspawn/nspawn-oci.c:2187
#6 0x4061e4 in LLVMFuzzerTestOneInput ../src/nspawn/fuzz-nspawn-oci.c:23
#7 0x40691c in main ../src/fuzz/fuzz-main.c:50
#8 0x7fd61564a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f)
#9 0x7fd61564a5c8 in __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x275c8)
#10 0x405da4 in _start (/home/fsumsal/repos/@systemd/systemd/build-san/fuzz-nspawn-oci+0x405da4)
DEDUP_TOKEN: json_variant_strv--oci_args--json_dispatch
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV ../src/shared/json.c:2190 in json_variant_strv
==1547789==ABORTING
Or with an empty string in an array:
../src/shared/json.c:2202:39: runtime error: member access within misaligned address 0x000000000007 for type 'struct JsonVariant', which requires 8 byte alignment
0x000000000007: note: pointer points here
<memory cannot be printed>
#0 0x7f35f4ca9bcf in json_variant_strv ../src/shared/json.c:2202
#1 0x408332 in oci_args ../src/nspawn/nspawn-oci.c:173
#2 0x7f35f4cc09ce in json_dispatch ../src/shared/json.c:4400
#3 0x40addf in oci_process ../src/nspawn/nspawn-oci.c:428
#4 0x7f35f4cc09ce in json_dispatch ../src/shared/json.c:4400
#5 0x41fef5 in oci_load ../src/nspawn/nspawn-oci.c:2187
#6 0x4061e4 in LLVMFuzzerTestOneInput ../src/nspawn/fuzz-nspawn-oci.c:23
#7 0x40691c in main ../src/fuzz/fuzz-main.c:50
#8 0x7f35f364a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f)
#9 0x7f35f364a5c8 in __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x275c8)
#10 0x405da4 in _start (/home/fsumsal/repos/@systemd/systemd/build-san/fuzz-nspawn-oci+0x405da4)
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ../src/shared/json.c:2202:39 in
Note: this happens only if json_variant_copy() in json_variant_set_source() fails.
pid1: when taking possession of passed fds check O_CLOEXEC state first
So here's the thing. One library we use (libselinux) is opening fds
behind our back when we initialize it and keeps it open. On the other
hand we want to automatically pick up all fds passed in to us, so that
we can distribute them to our services and close the rest. We pick them
up very early in our code, to ensure that we don't get confused by open
fds at that point. Except that libselinux insists on being initialized
even earlier. So suddenly we might take possession of libselinux' fds,
and then close them later when we decide no service wants them. Then
during shutdown we close down selinux and selinux closes its fds, but
since already closed long ago this ight close our fds instead. Hilarity
ensues.
I wish low-level software wouldn't do such things behind our back, but
well, let's make the best of it.
This changes the fd pick-up logic to only pick up fds that have
O_CLOEXEC unset. O_CLOEXEC must be unset for any fds passed in to us
over execve() after all. And for all our own fds we should set O_CLOEXEC
since we generally don't want to litter fd tables for execve(). Also,
libselinux thankfully appears to set O_CLOEXEC correctly on its fds,
hence the filter works.
log: propagate max log level into glibc's setlogmask()
Follow-up for: #27734
It makes sense to propagate the select log level we maintain also into
glibc, so that any code that uses syslog() directly that ends up in our
processes (libraries and such) are affected by our settings the same way
as we are ourselves.
udevadm: improve debug logging when triggering/watching events
Let's make debugging udev triggering a bit easier, by generating debug
log messages whenever we trigger a device, and also when we see the
event in pid1.
firstboot: reload manager after writing /etc/locale.conf
Requested in https://github.com/systemd/systemd/pull/27750#issuecomment-1559258861.
I didn't apply the locale configuration in firstboot itself, because
we don't have any localized messages, so that wouldn't change anything.
firstboot: process the root account after sysusers created it
We would create root account from sysusers or from firstboot, depending on
which one ran earlier. Since firstboot offers more options, in particular can
set the root password, we needed to order it earlier. This created an ugly
ordering requirement:
We want sysusers.service to create basic users, so we can create nodes in dev,
so we can operate on block devices and such, so that we can resize and remount
things. But at the same time, systemd-firstboot.service can only work if it is
run early, before systemd-sysusers.service has created /etc/passwd. We can't
have it both ways: the units that want to have a fully writable root file
system cannot be ordered before units which are required to do file system
preparation.
Instead of trying to order firstboot very early, let's let it do its thing even
if it is started later. Instead of refusing to create to the root account if
/etc/passwd and /etc/shadow exist, actually check if the account is configured.
Now sysusers writes root account with password PASSWORD_UNPROVISIONED
("!unprovisioned"), and then firstboot checks for this, and will configure root
in this case.
This allows sysusers to be executed earlier (or accounts to be set up earlier
in another way).
shared/condition: add envvar override for the check for first-boot
Before 7cd43e34c5a302ff323c013f437092d2ff5ccbbf, it was possible to use
SYSTEMD_PROC_CMDLINE=systemd.condition-first-boot to override autodetection.
But now this doesn't work anymore, and it's useful to be able to do that for
testing.
units: create /dev with --graceful first, allow sysusers to run later
We want to call systemd-tmpfiles-setup-dev.service to create /dev/fuse and
other device nodes so that module probing will work. But it is possible that
when we're in first boot, some users or groups need to be created by
systemd-sysusers first. But it is also possible that systemd-sysusers cannot
actually execute configuration because the root partition is not fully writable
yet. So let systemd-tmpfiles-setup-dev.service run earlier, possibly without
all users and groups in place. Since systemd-tmpfiles-setup-dev.service writes
to /dev only, it doesn't care how the root partition is mounted. In this early
run, some some nodes might be created with default permissions (i.e. not
accessible to non-root users or groups). This should be OK for the early boot
phase. Afterwards, we let systemd-tmpfiles-setup.service execute full
configuration. We will configure any files in /dev twice, but considering that
there's only a few of them and that the second run should only adjust ownership
and permissions, this should be OK. This way, we avoid the dependency loop.
If systemd-repart is running sufficiently early, /var/tmp might not be in place
yet. But if there is nothing to minimize, we won't even use it. Let's move the
check right before the first use.
systemd-repart[441]: Device '/' has no dm-crypt/dm-verity device, no need to look for…
systemd-repart[441]: Device /dev/sda opened and locked.
systemd-repart[441]: Sector size of device is 512 bytes. Using grain size of 4096.
systemd-repart[441]: Could not determine temporary directory: No such file or directory
systemd[1]: systemd-repart.service: Child 441 belongs to systemd-repart.service.
systemd[1]: systemd-repart.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: systemd-repart.service: Failed with result 'exit-code'.
firstboot: clarify that machine-id options are only offline, add missing docs
Let's flat out refuse to configure machine-id on a running system with
systemd-firstboot. It wouldn't work anyway, because by the time firstboot is
started, pid1 has created /etc/machine-id, possibly with "unitialized", so
firstboot wouldn't touch the file. (If --force is specified, it works. So
let's allow that in case people want to do crazy things.)
While at it, add missing descriptions of various things that were added over
time, and group descriptions of similar options together.
units: make sure proc-sys-binfmt_misc.automount is actually stopped
As with other units, stopping of the automount requires actual work,
and without the ordering dependency systemd might not execute the stop
job before shutdown.target is reached and units ordered after that are
executed.
units/systemd-repart: stop pretending that root config is executed in the initrd
I have a system with /usr/lib/repart.d/50-root.conf with GrowFileSystem=yes.
The partition wouldn't be resized in the initrd, because
ConditionDirectoryNotEmpty=|/sysusr/usr/lib/repart.d was evaluated very early,
before /sysroot was mounted. There was no ordering dependency between
systemd-repart.service and sysroot.mount. (There was After=initrd-usr-fs.target,
but it seems to be only referred to by systemd-fstab-generator, which in my
case doesn't even run, because there's no fstab.)
But in fact, we neeed to run systemd-repart in the initrd only in limited
circumstances: when we need to create the root device based on config under
sysusr.mount. If there is config on the root device, it can be executed in
the host system, early during boot. Thus, let's remove the condition on
/sysroot/…. Without an ordering dependency on sysroot.mount, it was subject to
a race condition anyway. (A race condition with a low probability of "winning",
because systemd-repart.service has no dependencies, but sysroot.mount requires
a device to be detected and the mount to happen.)
The other problem was that systemd-repart.service didn't have the ordering wrt.
initrd-switch-root.target, so it was subject to the same race condition that
was fixed for other units in 7c0e2b555968d70ac563a37e32a6931ee90961a6. (If the
systemd-repart.service/stop job is slow, we could end up not restarting
systemd-repart.service in the host system.)
With the changes here, I see systemd-repart.service/start running twice:
in the initrd it is skipped because the conditions fail, and then in the
host system it runs normally.
Note: support for /sysroot is retained in systemd-repart code. I don't see a
strong reason to remove it, since it may still be useful to people invoking
repart in the initrd in other circumstances.
> The block is reordered and split to have:
> 1. description + documentation
> 2. (optionally) conditions
> 3. all the dependencies
The dependencies for shutdown.target are listed separately because they are the
other deps are for startup, and shutdown.target only matter much later.