Add persistent symlinks for MTD devices like SPI-NOR flash, based on the
partition names specified on the cmdline, in a Device Tree, or by other
MTD partitioning parser drivers. Using the persistent name can be
preferable to using the numbered /dev/mtdX device, as the latter can
change depending on probe order or when partitioning has changed.
Nick Rosbrook [Mon, 16 Oct 2023 17:13:57 +0000 (13:13 -0400)]
nspawn: check if we can set CoredumpReceive= before doing so
If systemd-nspawn is newer than the running systemd, we might try to set
CoredumpReceive=yes when systemd doesn't know about it yet. Try and
check if the running systemd is aware of this setting, and if not, don't
try and use it.
Fixes 411d8c72ec
("nspawn: set CoredumpReceive=yes on container's scope when --boot is set").
Daan De Meyer [Thu, 12 Oct 2023 09:20:06 +0000 (11:20 +0200)]
udev: Enable filtering the output of udevadm info --export-db
Let's support the same filtering options that we also support in
udevadm trigger in udevadm info to filter the devices produced by
--export-db.
One difference is that all properties specified by --propery-match=
have to be satisfied in udevadm info unlike udevadm trigger where just
one of them has to be satisfied.
mount-util: use mount beneath to replace previous namespace mount
Instead of mounting over, do an atomic swap using mount beneath, if
available. This way assets can be mounted again and again (e.g.:
updates) without leaking mounts.
almost all code in namespace.c only logs at debug level as it is
"library-like" code. But there are some outliers. Adjust them to match
the rest of the code
namespace: downgrade log message of error we ignore to LOG_WARNING
frankly, the log message shouldn't be there at all, but the error path
be propagated up, with a recognizable error code. But apparently this is
important to @bluca.
Lukas [Sun, 8 Oct 2023 17:45:34 +0000 (19:45 +0200)]
stub: NULL checks for DeviceHandle and FilePath
UKIs may be loaded in a way, that there can not be a device handle to
the filesystem, that contains the image, for example when using a
bootloader to load the image from a partition with a file system that is
not supported by the firmware.
With the current systemd stub, this causes a failed assertion, because
stub gets passed a NULL DeviceHandle and FilePath. Inserting two
explicit checks enables proper boot even in this case.
According to RFC 6762 section 8, an mDNS responder is supposed to announce its
records after probing.
Currently, there is a check in dns_scope_announce which returns if there are any
pending transactions. This prevents announcements from being sent out even if there
are pending non-probe transactions.
To fix this, return only if there are active probe transactions.
run: support --scope on old service managers that lack native PIDFD support
Before this we'd fail with a complaint that PIDFDs is not supported by
the service manager. Add some compat support by falling back to classic
numeric PIDs in that case.
Nick Rosbrook [Thu, 12 Oct 2023 17:39:56 +0000 (13:39 -0400)]
nspawn: set CoredumpReceive=yes on container's scope when --boot is set
When --boot is set, and --keep-unit is not, set CoredumpReceive=yes on
the scope allocated for the container. When --keep-unit is set, nspawn
does not allocate the container's unit, so the existing unit needs to
configure this setting itself.
Since systemd-nspawn@.service sets --boot and --keep-unit, add
CoredumpReceives=yes to that unit.
Nick Rosbrook [Wed, 6 Sep 2023 15:03:41 +0000 (11:03 -0400)]
coredump: add support for forwarding coredump to containers
If a process crashes within a container, try and forward the coredump to
that container. To do this, check if the crashing process is in a
different pidns, and if so, find the PID of the namespace leader. We
only proceed with forwarding if that PID belongs to a cgroup that is
descendant of another cgroup with user.delegate=1 and
user.coredump_receive=1 (i.e. Delegate=yes and CoredumpReceive=yes).
If we proceed, attach to the namespaces of the leader, and send the
coredump to systemd-coredump.socket in the container. Before this is
done, we need to translate the PID, UID, and GID, and also re-gather
procfs metadata. Translate the PID, UID, and GID to the perspective of
the container by sending an SCM_CREDENTIALS message over a socket pair
from the original systemd-coredump process, to the process forked in the
container.
If we cannot successfully forward the coredump, fallback to the current
behavior so that there is still a record of the crash on the host.
For a given PID and namespace type, this helper function gives the PID
of the leader of the namespace containing the given PID. Use this in
systemd-coredump instead of using the existing get_mount_namespace_leader.
Nick Rosbrook [Wed, 6 Sep 2023 15:01:33 +0000 (11:01 -0400)]
coredump: store crashing process UID and GID in Context
For convenience, store the crashing process's UID and GID in Context (as
uid_t and gid_t, respectively), as is currently done for the PID. This
means we can just parse the UID/GID once in save_context(), and use
those values in other places.
This is just re-factoring, and is a preparation commit for container
support.
Nick Rosbrook [Fri, 29 Sep 2023 19:39:17 +0000 (15:39 -0400)]
core: add CoredumpReceive= setting
This setting indicates that the given unit wants to receive coredumps
for processes that crash within the cgroup of this unit. This setting
requires that Delegate= is also true, and therefore is only available
where Delegate= is available.
This will be used by systemd-coredump to support forwarding coredumps to
containers.
Mike Yuan [Tue, 3 Oct 2023 12:20:55 +0000 (20:20 +0800)]
core/varlink: make sure we setup non-serialized varlink sockets
Before this PR, if m->varlink_server is not yet set up during
deserialization, we call manager_setup_varlink_server rather than
manager_varlink_init, the former of which doesn't setup varlink
addresses, but only binds to methods. This results in that
newly-added varlink addresses not getting created if deserialization
takes place.
Therefore, let's switch to manager_varlink_init, and add some
sanity checks to it in order to prevent listening on the same
address twice.
No functional changes, only moving code that is only needed in
exec_invoke, and adding new dependencies for seccomp/selinux/apparmor/pam
in meson for the sd-executor binary.
Luca Boccassi [Thu, 1 Jun 2023 18:51:42 +0000 (19:51 +0100)]
core: add systemd-executor binary
Currently we spawn services by forking a child process, doing a bunch
of work, and then exec'ing the service executable.
There are some advantages to this approach:
- quick: we immediately have access to all the enourmous amount of
state simply by virtue of sharing the memory with the parent
- easy to refactor and add features
- part of the same binary, will never be out of sync
There are however significant drawbacks:
- doing work after fork and before exec is against glibc's supported
case for several APIs we call
- copy-on-write trap: anytime any memory is touched in either parent
or child, a copy of that page will be triggered
- memory footprint of the child process will be memory footprint of
PID1, but using the cgroup memory limits of the unit
The last issue is especially problematic on resource constrained
systems where hard memory caps are enforced and swap is not allowed.
As soon as PID1 is under load, with no page out due to no swap, and a
service with a low MemoryMax= tries to start, hilarity ensues.
Add a new systemd-executor binary, that is able to receive all the
required state via memfd, deserialize it, prepare the appropriate
data structures and call exec_child.
Use posix_spawn which uses CLONE_VM + CLONE_VFORK, to ensure there is
no copy-on-write (same address space will be used, and parent process
will be frozen, until exec).
The sd-executor binary is pinned by FD on startup, so that we can
guarantee there will be no incompatibilities during upgrades.
Luca Boccassi [Tue, 10 Oct 2023 17:50:36 +0000 (18:50 +0100)]
core: fix checking for extension-releases for ExtensionImages/Directories
The parsing is done after the image has been opened, not before, as it
cannot be done on an block device. Also fix returning on any error for
ExtensionDirectories, not just ENOENT.
hibernate-resume: remove kernel/image version comparison when resuming
We already had a similar check that was removed, see 8340b762e4f597e98a72de1385e74b9be04e521d (*). The kernel supports loading of a
resume image from a different kernel version. This makes sense, because the
goal of "resume" is to replace the running system by a saved memory image, so
it doesn't really matter that the short-lived kernel is different.
By removing the check, we make the process more reliable: for example, the user
may select a different kernel from a list, or not have the previously running
kernel in /boot at all, etc. Requiring the exact same kernel version makes the
process more fragile for no benefit.
Similar reasoning holds for the image version: the image may be updated, and
for example an older kernel+initrd might be used, with an embedded VERSION_ID
that is not the latest. This is fine, and the check is not useful.
I left the check for ID/IMAGE_ID: we probably don't want to use the resume
image if the hibernation was done from a different installation.
(Note: why not check VERSION_ID/IMAGE_VERSION? Because of the following
scenario: a user has an installation of Fedora 35, and they upgrade to Fedora
36, which means that the os-release file on disk gets replaced and now
specifies VERSION_ID=36. But the running kernel is not replaced, and its
package is not removed because the running kernel version is never removed, so
we still have a boot entry that in initrd-release says VERSION_ID=35. Without
rebooting, the user does hibernation. When resuming, we want to resume, no
matter if one of the new entries with VERSION_ID=36 or one of the old entries
with VERSION_ID=35 is picked in the boot loader menu.
If the installation is image-based, i.e. it has IMAGE_ID+IMAGE_VERSION, the
situation is similar: after an upgrade, we may still have an boot entry from
before the upgrade. Using an older kernel+initrd to boot and switch-root into a
newer installation is supported and is rather common.
In fact, it is a rather common situation that the version reported by the boot
entry (or stored internally in the initrd-release in the initrd) does not match
the actual system on disk. Generally, this metadata is saved when the boot menu
entry is written and does not reflect subsequent upgrades. Various
distributions generally keep at least 3 kernels after a upgrade, and during an
upgrade only install one new, which means that after a major upgrade, generally
there will be at least two kernels which have mismatched version information.)
OTOH, I think it is useful to *write* all the details to the EFI var. As
discussed in https://github.com/systemd/systemd/issues/29037, we may want to
show this information in the boot loader. It is also useful for debugging.
(*) Also again discussed and verified in
https://github.com/systemd/systemd/pull/27330#discussion_r1234332080.
", ignored" is dropped, since this failure is likely to cause the following
check to fail. Better not to say anything then to say the misleading thing.
Fixes #10288.
I have confirmed that this does now fix cross-compilation.
It appears that changes upstream in Meson, probably mesonbuild/meson#5263, have made the original MR, #10289, work now.
This needs to be tested to ensure that it doesn't break Travis CI like when it was reverted in #10361.
Some of the entries are really configured, but we also have a bunch
of automatic entries. Calling them "config entries" is misleading, let's
use the more natural "boot entry".
While at it, rename:
config_load_entries() → config_load_type1_entries()
config_entry_add_unified() → config_load_type2_entries()
config_title_generate() → generate_boot_entry_titles()
config_entry_add_<type>() → config_add_entry_<type>()
efi/boot: make timeout changes relative to current value
When the user pressed + or -, we would set the efivar override, starting
from the default of 0. Instead, set an override that starts at the current
value. This means that when user has e.g. a configured override of 5 s, and
they press +, they get an override of 6 s. I think this is leads to a much
smoother experience for a user, who does not necessarilly need to know that
we have three levels of overrides, they just want to easily configure the
timeout with keys. If they press +, the timeout should increase, and not
jump to some low value.
Also, once an override has been set via the boot menu, i.e. the efivar is set,
do not allow unsetting the efivar from the boot menu. This way we also avoid
an unexpected "jump" to whatever the other sources of configuration specify.
The user can configure any value with the keys that they want, so we don't
need to allow unsetting.