Daan De Meyer [Wed, 23 Nov 2022 13:12:38 +0000 (14:12 +0100)]
mkfs-util: Skip non files/directories when calling mcopy
Only files and directories are supported by vfat. When we pass a
symlink to mcopy, it will try to dereference them and copy what the
symlink points at into the vfat partition instead. Let's avoid this
by skipping all unsupported file types when establishing the list of
top level targets that mcopy should copy.
We also use RECURSE_DIR_SORT everywhere when iterating directories
to make things more reproducible.
Currently, services use mount_move_root() in order to setup the root
directory of services using a mount namespace. This relies on MS_MOVE
and chroot(). However, this has serious drawbacks even for relatively
simple mount propagation scenarios.
What systemd currently does is roughly equivalent to the following shell
code:
unshare --mount --propagation=shared
cd /
mount --make-rslave /
mkdir /new-root
mount --rbind / /new-root
cd /new-root
mount --move /new-root /
chroot .
This looks simple enough but has the consequence that two separate mount
trees exist for the lifetime of the service. The first one was created
when the mount namespace was created, and the second one when a new
mount for the rootfs was created. The first mount tree sticks around as
a shadow mount tree. Both mount trees are dependent mounts with the host
rootfs as their dominating mount.
Now, when mount propagation is triggered by the host by e.g.,
mount --bind /opt /mnt
it means that two propagation events are generated. I'm skipping over
the exact kernel details as they aren't that important. The gist is that
for every propagation event that is generated a second one is generated
for the shadow mount tree. In other words, the kernel creates two copies
for each mount that is propagated instead of one.
This isn't necessary. We can simply change the sequence above to:
unshare --mount --propagation=shared
cd /
mount --make-rslave /
mkdir /new-root
# stash fd to old rootfs
# stash fd to new rootfs
mount --rbind / /new-root
mkdir /new-root
cd /new-root
pivot_root . .
# new root is tucked under old root
# chdir into old rootfs via stashed fd
umount -l /old-root
The pivot_root allows us to get rid of the old mount tree that was
created when the mount namespace was created. So after this sequence
only one mount tree is alive. Plus, it's safer and nicer. Moving mounts
isn't pleasnt.
This patch doesn't convert nspawn yet as the requirements are more
tricky given that it wants to preserve the rootfs as a shared mount
which goes against pivot_root() requirements.
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Luca Boccassi [Wed, 23 Nov 2022 16:06:48 +0000 (16:06 +0000)]
portable: add a few more useful debug log messages
When attaching and /etc/systemd/system.attached can't be created or used
(eg: dead symlink) the logs are pretty much useless as even at debug
level there's no indication of what is going wrong.
Add some debug logs, and return a more specific error string over D-Bus.
Nick Rosbrook [Tue, 22 Nov 2022 16:30:03 +0000 (11:30 -0500)]
oomd: fix unreachable test case in test-oomd-util
This conditional with !empty_or_root(ctx->path) always returns false
because the most recent oomd_cgroup_context_acquire() call was with the
root cgroup. Make sure this test case can be reached by checking cgroup
instead of ctx->path.
While here, use an unused uid (61183) instead of the nobody uid so the
test case does not fail in unprivileged LXD containers.
Nick Rosbrook [Tue, 22 Nov 2022 15:33:55 +0000 (10:33 -0500)]
oomd: always allow root-owned cgroups to set ManagedOOMPreference
Commit 652a4efb66a ("oomd: loosen the restriction on ManagedOOMPreference")
made the change to allow ManagedOOMPreference on a cgroup candidate when
the monitored cgroup and cgroup candidate are owned by the same user.
The commit assumed that this check was sufficient to continue allowing
ManagedOOMPreference on all cgroups owned by root. However, it caused a
regression for unprivileged LXD containers where e.g. /sys/fs/cgroup is
owned by nobody (uid=65534).
Fix this by explicitly allowing the ManagedOOMPreference if uid == 0 in
oomd_fetch_cgroup_oom_preference().
hwdb: remove fuzz and deadzone for Simucube wheel bases.
For these devices the axes are setup via a special
configuration tool. udev should not apply additional
fuzz or deadzone.
Reference for the product IDs:
https://granitedevices.com/wiki/Simucube_product_USB_interface_documentation
This also indicates that there are a total of 8 axes.
kernel-install: add header to generate entry files
I was looking at a bug in bugzilla about some boot loader issue, and it was
hard to say if the boot entry files were generated by our plugin or something
else. Add a header to make this clear.
kernel-install invokes the plugins via absolute path always, so $0 gives as
the full path the location where the plugin is installed. This is what we want:
title Fedora Linux 37 (Workstation Edition)
# Boot Loader Specification type#1 entry
# File created by /usr/lib/kernel/install.d/90-loaderentry.install (systemd 252-409-g5028904^)
Daan De Meyer [Mon, 21 Nov 2022 19:41:22 +0000 (20:41 +0100)]
find-esp: Relax filesystem root directory check
When relaxed checks are requested, let's not require the efi/xbootldr
directory to be the root of the filesystem. When building images, image
builders might install all efi/xbootldr files to a regular directory
first before packing them up into a partition. To allow bootctl to be
used in such scenarios to install systemd-boot, we need to relax the
fsroot check.
Luca Boccassi [Tue, 22 Nov 2022 16:24:54 +0000 (16:24 +0000)]
repart: respect --discard=no also for block devices
It's only used to avoid BLKDISCARD on individual partitions at the moment.
It can take a lot of time to run on very slow devices, so avoid it for
them too.
sd-stub has an opportunity to handle the seed the same way sd-boot does,
which would have benefits for UKIs when sd-boot is not in use. This
commit wires that up.
It refactors the XBOOTLDR partition discovery to also find the ESP
partition, so that it access the random seed there.
Instead of just reverting that, this PR will change things so that we
strictly rely on glibc's new epoll_pwait2() wrapper (which was added
earlier this year), and drop our own manual fallback syscall wrapper.
That should nicely side-step any issues with correct syscall wrapping
definitions (which on some arch seem not to be easy, given the sigset_t
size final argument), by making this a glibc problem, not ours.
Given that the only benefit this delivers are time-outs more granular
than msec, it shouldn't really matter that we'll miss out on support
for this on systems with older glibcs.
man/journalctl: mention systemd-cat, make the description more direct
We said "query the journal". This is true but also very generic. Let's say
"print log entries from the journal" instead, so that users who are looking for
"logging" are more likely to figure out that the journalctl is the tool for
them.
Also, mention systemd-journal-remote.service which can write the journal too.
And give some hints how to figure out how to write *to* the journal.
In sd_bus_wait(), let's convert EINTR to a return code of 0, thus asking
the caller do loop again and enter sd_bus_process() again (which will
not find any queued events). This way we'll not return an error on
something that isn't really an error. This should typically make sure
things are properly handled by the caller, magically, without eating up
the event entirely, and still giving the caller time to run some code if
they want.
Yu Watanabe [Tue, 22 Nov 2022 05:24:32 +0000 (14:24 +0900)]
network: wifi: try to reconfigure when connected
Sometimes, RTM_NEWLINK message with carrier is received earlier than
NL80211_CMD_CONNECT. To make SSID= or other WiFi related settings in
[Match] section work, let's try to reconfigure the interface.
udev: make sure auto-root logic also works in UKIs booted from XBOOTLDR
If no root= switch is specified on the kernel command line we'll use the
root disk on which the partition the LoaderDevicePartUUID efi var is
located – as long as that partition is an ESP. Let's slightly liberalize
that and also allow it if that partition is an XBOOTLDR partition. This
ensures that UKIs spawned directly from XBOOTLDR work the same as those
from the ESP.
(Note that this makes no difference if sd-boot is in the mix, as in that
case LoaderDevicePartUUID is always set to the ESP, as that's where
sd-boot is located, and sd-boot will set the var first, sd-stub will
only set it later if it#s not set yet.)
tree-wide: make constant ratelimit compound actually const
The compiler should recognize that these are constant expressions, but
let's better make this explicit, so that the linker can safely share the
initializations all over the place.
Now that the random seed is used on virtualized systems, there's no
point in having a random-seed-mode toggle switch. Let's just always
require it now, with the existing logic already being there to allow not
having it if EFI itself has an RNG. In other words, the logic for this
can now be automatic.
basic/strv: check printf arguments to strv_extendf()
The second argument to _printf_() specifies where the arguments start. We need to
use 0 in two cases: when the args in a va_list and can't be checked, and with journald
logging functions which accept multiple format strings with multiple argument sets,
which the _printf_ checker does not understand. But strv_extendf() can be checked.
dlfcn-util: add static asserts ensuring our sym_xyz() func ptrs match the types from the official headers
Make sure that the sym_xyz function pointers have the types that the
functions we'll assign them have.
And of course, this found a number of incompatibilities right-away, in
particular in the bpf hookup.
(Doing this will trigger deprecation warnings from libbpf. I simply
turned them off locally now, since we are well aware of what we are
doing in that regard.)
There's one return type fix (bool → int), that actually matters I think,
as it might have created an incompatibility on some archs.
bootctl: install system token on virtualized systems
Removing the virtualization check might not be the worst thing in the
world, and would potentially get many, many more systems properly seeded
rather than not seeded. There are a few reasons to consider this:
- In most QEMU setups and most guides on how to setup QEMU, a separate
pflash file is used for nvram variables, and this generally isn't
copied around.
- We're now hashing in a timestamp, which should provide some level of
differentiation, given that EFI_TIME has a nanoseconds field.
- The kernel itself will additionally hash in: a high resolution time
stamp, a cycle counter, RDRAND output, the VMGENID uniquely
identifying the virtual machine, any other seeds from the hypervisor
(like from FDT or setup_data).
- During early boot, the RNG is reseeded quite frequently to account for
the importance of early differentiation.
So maybe the mitigating factors make the actual feared problem
significantly less likely and therefore the pros of having file-based
seeding might outweigh the cons of weird misconfigured setups having a
hypothetical problem on first boot.
Currently translated at 100.0% (193 of 193 strings)
Co-authored-by: Richard E. van der Luit <fedoraproject@veneax.nl>
Translate-URL: https://translate.fedoraproject.org/projects/systemd/master/nl/
Translation: systemd/main
Jan Janssen [Tue, 15 Nov 2022 17:53:02 +0000 (18:53 +0100)]
boot: Replace firmware security hooks directly
For some firmware, replacing their own security arch instance with our
override using ReinstallProtocolInterface() is not enough as they will
not use it. This commit goes back to how this was done before by
directly modifying the security protocols.
Jan Janssen [Mon, 14 Nov 2022 13:37:13 +0000 (14:37 +0100)]
boot: Do not require a loaded image path
If the device path to text protocol is not available (looking angrily at
Apple) we would fail to boot because we cannot get the loaded image
path. As this is only used for cosmetic purposes, we can just silently
continue.
bootctl: rework how we handle referenced but absent EFI boot entries
Follow-up for #25368.
Let's consider ENOENT an expected error, and just debug log about it
(though, let's suffix it with `, ignoring.`). All other errors will log
loudly, as they are unexpected errors.
resolved: when configuring 127.0.0.1 as per-interface DNS server, contact it via "lo" always
ussually if you specify a DNS server on some interface then we'll use
that interface to talk to it. Let's override this for localhost
addresses, as they only really make sense on "lo".
Sam James [Fri, 18 Nov 2022 07:18:18 +0000 (07:18 +0000)]
nspawn: allow sched_rr_get_interval_time64 through seccomp filter
We only allow a selected subset of syscalls from nspawn containers
and don't list any time64 variants (needed for 32-bit arches when
built using TIME_BITS=64, which is relatively new).
We allow sched_rr_get_interval which cpython's test suite makes
use of, but we don't allow sched_rr_get_interval_time64.
The test failures when run in an arm32 nspawn container on an arm64 host
were as follows:
```
======================================================================
ERROR: test_sched_rr_get_interval (test.test_posix.PosixTester.test_sched_rr_get_interval)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/var/tmp/portage/dev-lang/python-3.11.0_p1/work/Python-3.11.0/Lib/test/test_posix.py", line 1180, in test_sched_rr_get_interval
interval = posix.sched_rr_get_interval(0)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
PermissionError: [Errno 1] Operation not permitted
```
Then strace showed:
```
sched_rr_get_interval_time64(0, 0xffbbd4a0) = -1 EPERM (Operation not permitted)
```
This appears to be the only time64 syscall that isn't already included one of
the sets listed in nspawn-seccomp.c that has a non-time64 variant. Checked
over each of the time64 syscalls known to systemd and verified that none
of the others had a non-time64-variant whitelisted in nspawn other than
sched_rr_get_interval.
reuben olinsky [Tue, 1 Nov 2022 05:58:52 +0000 (22:58 -0700)]
sysupdate: Support volatile-root for finding the root partition
The existing logic can't find the root device in scenarios where
the root has been replaced with an overlay. We support looking
at "/run/systemd/volatile-root" to find the original root, similar
to what systemd-repart and gpt-auto-generator do.