Currently, when we deserialize an fd we do a lot of manual work. Add a
common helper that makes this more robust and uniform.
Note that this sometimes changes behaviour slightly, but in ways that
shouldn't really matter: if we fail to deserialize an fd correctly we'll
unset (i.e. set to -EBADF) the fd in the deserialized data structure.
Previously, we'd leave the old value in place.
This should not change the effective result: in either case we'll be in a
bad state afterwards, but previously we'd mix old/invalidated state with
new state, while now we explicitly reset the state to invalidated on
failure. In particular, deserialization generally starts from an empty
structure, so the old value should be unset anyway.
Another slight change is that if we fail to deserialize some object
halfway, and we have already taken one fd out of the serialized fdset,
we'll now just close it instead of returning it to/leaving it in the
fdset. Given that such "orphaned" fds are blanket closed after
deserialization finishes, this also shouldn't change behaviour IRL.
Also, the idle_pipe was previously incorrectly serialized: we'd
serialize invalidated fds, which would fail, but because parsing errors
on this were ignored during deserialization no one noticed. This is fixed.
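The behaviour described above can be illustrated with a small Python sketch. This is not the helper added by the commit: the names deserialize_fd and deserialize_pipe, the use of a plain set as the fdset, and the "<read> <write>" string format are all assumptions made for illustration. It only shows the two rules from the message: a failed parse yields -EBADF, and an fd already taken out of the fdset is closed on partial failure.

```python
import errno
import os

UNSET_FD = -errno.EBADF  # "unset" marker, following the -EBADF convention


def deserialize_fd(fdset, value):
    """Parse a serialized fd number and take it out of fdset.

    Returns the fd on success, or -EBADF if the value is malformed or the
    fd is not part of fdset (hypothetical helper, for illustration only).
    """
    try:
        fd = int(value)
    except ValueError:
        return UNSET_FD
    if fd < 0 or fd not in fdset:
        return UNSET_FD
    fdset.remove(fd)  # the caller owns the fd from here on
    return fd


def deserialize_pipe(fdset, value):
    """Deserialize a "<read> <write>" pair; never return a half-valid pair."""
    parts = value.split()
    if len(parts) != 2:
        return [UNSET_FD, UNSET_FD]
    fds = [deserialize_fd(fdset, p) for p in parts]
    if UNSET_FD in fds:
        for fd in fds:
            if fd >= 0:
                os.close(fd)  # close it rather than returning it to the fdset
        return [UNSET_FD, UNSET_FD]
    return fds
```

A caller would start from a structure whose fds are already -EBADF and simply assign the return value, so a failed parse leaves no stale fd behind.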
Ronan Pigott [Sat, 14 Oct 2023 03:22:49 +0000 (20:22 -0700)]
network: include SSID in ipv6 stable prefix address generation
The SSID fills the role of the optional Network_ID input parameter
suggested by RFC7217. Including the SSID allows networkd to generate a
different pseudorandom address for different wireless networks, which
should help to obscure the host's identity when roaming between multiple
networks.
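As a rough illustration of why adding a Network_ID changes the result, here is a Python sketch of the RFC 7217 construction RID = F(Prefix, Net_Iface, Network_ID, DAD_Counter, secret_key). SHA-256 and the length-prefixed field layout are assumptions chosen for the example; the source does not say which hash or input layout networkd uses, and the function name stable_address is hypothetical.

```python
import hashlib
import ipaddress


def stable_address(prefix, iface, network_id, dad_counter, secret_key):
    """Derive an RFC 7217-style address; network_id would be the SSID bytes."""
    net = ipaddress.IPv6Network(prefix)
    h = hashlib.sha256()  # stand-in for F(); an assumption, not networkd's choice
    fields = (
        net.network_address.packed,
        iface.encode(),
        network_id,
        dad_counter.to_bytes(1, "big"),
        secret_key,
    )
    for field in fields:
        h.update(len(field).to_bytes(2, "big"))  # length prefix keeps fields unambiguous
        h.update(field)
    host_bits = 128 - net.prefixlen
    iid = int.from_bytes(h.digest(), "big") & ((1 << host_bits) - 1)
    return ipaddress.IPv6Address(int(net.network_address) | iid)
```

With the same prefix, interface and key, two different SSIDs give two different interface identifiers, while the same SSID always gives the same one.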
repart: avoid use of uninitialized TPM2B_PUBLIC data
The 'TPM2B public' struct is only initialized if the public key
is non-NULL, however, it is unconditionally passed to
tpm2_calculate_sealing_policy, resulting in use of uninitialized
data. If the uninitialized data is lucky enough to be all zeroes,
this eventually results in an error message from
tpm2_calculate_name about an unsupported nameAlg field value.
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
cgroup: turn device cgroup controller "rwm" strings into proper flags
We generally prefer dealing with parsed data instead of original
strings; do so for the "rwm" strings too. We have to convert this to
flags for the primary backend implementation (BPF) anyway, hence we
can do this early to have simpler, shorter and more normalized code.
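The idea of parsing once and carrying flags afterwards can be sketched in Python as follows. The enum name, the flag values and the parse_permissions function are hypothetical and only mirror the "r", "w", "m" letters mentioned above; they are not the identifiers used in the actual code.

```python
import enum


class DevicePermissions(enum.IntFlag):
    """Hypothetical flag set for the "rwm" letters."""
    READ = 1
    WRITE = 2
    MKNOD = 4


_LETTERS = {
    "r": DevicePermissions.READ,
    "w": DevicePermissions.WRITE,
    "m": DevicePermissions.MKNOD,
}


def parse_permissions(s):
    """Turn an "rwm"-style string into flags; return None on an unknown letter."""
    flags = DevicePermissions(0)
    for ch in s:
        if ch not in _LETTERS:
            return None
        flags |= _LETTERS[ch]
    return flags
```

Once parsed, "rm" and "mr" compare equal, which is the kind of normalization a string representation does not give for free.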
Franck Bui [Mon, 21 Aug 2023 10:37:00 +0000 (12:37 +0200)]
meson: add build option for install path of main config files
This allows distros to install configuration file templates in /usr/lib/systemd
for example.
Currently we install "empty" config files in /etc/systemd/. They serve two
purposes:
- The file contains commented-out values that show the default settings.
- It is easier to edit the right file if it is already there, the user doesn't
have to type in the path correctly, and the basic file structure is already in
place so it's easier to edit.
Things that have happened since this approach was put in place:
- We started supporting drop-ins for config files, and drop-ins are the
recommended way to create local configuration overrides.
- We have systemd-analyze cat-config which takes care of iterating over
all possible locations (/etc, /run, /usr, /usr/local) and figuring out
the right file.
- Because of the first two points, systemd-analyze cat-config is much better,
because it takes care of finding all the drop-ins and figuring out the
precedence. Looking at files manually is still possible of course, but not
very convenient.
The disadvantages of the current approach with "empty" files in /etc:
- We clutter up /etc so it's harder to see what the local configuration actually is.
- If a user edits the file, package updates will not override the file (e.g.
systemd.rpm uses %config(noreplace)). This means that the "documented defaults"
will become stale over time, if the user ever edits the main config file.
Thus, I think that it's reasonable to:
- Install the main config file to /usr/lib so that it serves as reference for
syntax and option names and default values and is properly updated on package
upgrades.
- Recommend to users to always use drop-ins for configuration and
systemd-analyze cat-config to view the documentation.
It's not entirely clear why the UEFI calls get slower; nevertheless the
information in itself proves useful.
This commit introduces a new option "menu-disabled", which omits the
100ms delay. The option is documented throughout the manual pages as
well as the Boot Loader Specification.
v2:
- use STR_IN_SET
v3:
- drop erroneous whitespace
v4:
- add a new LoaderFeature bit,
- don't change ABI, keep TIMEOUT_* tokens the same
- move new token in the 64bit range, update API and storage for it
- change inc/dec behaviour to TIMEOUT_MIN : TIMEOUT_MENU_FORCE
- user cannot opt-in from sd-boot itself, add assert_not_reached()
v5:
- s/Menu disablement control/Menu can be disabled/
- rewrap comments to 109
- use SYNTHETIC_ERRNO(EOPNOTSUPP)
Signed-off-by: Emil Velikov <emil.velikov@collabora.com>
Frantisek Sumsal [Tue, 17 Oct 2023 10:49:03 +0000 (12:49 +0200)]
test: don't restart journal-upload on an expected fail
In c08bec1587 the journal-upload unit gained Restart=on-failure, which goes
against this one particular test that expects the unit to fail, making
the test flaky. Let's disable the automatic restarts just for this test
to make it stable once again.
Mike Yuan [Thu, 12 Oct 2023 10:38:15 +0000 (18:38 +0800)]
core/mount: allow disabling stop propagation from backing device
With file systems that have volume management functionalities or
volume managers like LVM, it's fine for the backing device of a mount
to disappear after it has been mounted. Currently, we enforce BindsTo= or
StopPropagatedFrom= on the backing device, thus prohibiting such
cases. Instead, let's make this configurable through x-systemd.device-bound.
Add persistent symlinks for MTD devices like SPI-NOR flash, based on the
partition names specified on the cmdline, in a Device Tree, or by other
MTD partitioning parser drivers. Using the persistent name can be
preferable to using the numbered /dev/mtdX device, as the latter can
change depending on probe order or when partitioning has changed.
Nick Rosbrook [Mon, 16 Oct 2023 17:13:57 +0000 (13:13 -0400)]
nspawn: check if we can set CoredumpReceive= before doing so
If systemd-nspawn is newer than the running systemd, we might try to set
CoredumpReceive=yes when systemd doesn't know about it yet. Try and
check if the running systemd is aware of this setting, and if not, don't
try and use it.
Fixes 411d8c72ec
("nspawn: set CoredumpReceive=yes on container's scope when --boot is set").
test: make sure that the default naming scheme name maps back to itself
We were testing that the C constant is defined, but we weren't actually testing
that the string name maps back to itself. This would catch the issue fixed by
the grandparent commit.
The test for the default name is moved to the test file to keep the tests
together. The define is renamed to not have "_TEST" in the name. The issue here
is complicated by the fact that we allow downstreams to inject additional
fields, so we don't know the name of the default scheme if it is not set with
-Ddefault-net-naming-scheme=, so _DEFAULT_NET_NAMING_SCHEME[_TEST] is not
defined in all cases, but at least in principle it could be used in other
places. If it exists, it is fully valid.
NEWS, man: move description of SR-IOV-R net naming to v255
https://github.com/systemd/systemd/pull/29582 adds the "v254" name. This also
changes what the default is and what "latest" refers to. Without the name, the
code could not be enabled via runtime configuration. Nevertheless, it could be
enabled at compilation time. In other words:
meson setup build -Ddefault-net-naming-scheme=v254
would work, but
net.naming-scheme=v254
would fail.
It is possible that people were using the compile-time override, so I think
we should allow "v254" scheme to stay and clearly document that it wasn't the
default.
Unfortunately, unless people manually introduced the compile-time override, we
were never actually testing the new code either. So all the pull request testing
was not useful.
Daan De Meyer [Thu, 12 Oct 2023 09:20:06 +0000 (11:20 +0200)]
udev: Enable filtering the output of udevadm info --export-db
Let's support the same filtering options that we also support in
udevadm trigger in udevadm info to filter the devices produced by
--export-db.
One difference is that all properties specified by --property-match=
have to be satisfied in udevadm info unlike udevadm trigger where just
one of them has to be satisfied.
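The difference between the two tools comes down to "all" versus "any" over the list of property filters. A minimal Python sketch, assuming a device is a dict of properties and that values are compared with shell-style globs (both are simplifications for illustration; the function names are hypothetical):

```python
import fnmatch


def _property_matches(props, key, pattern):
    """True if the device has the property and its value matches the glob."""
    value = props.get(key)
    return value is not None and fnmatch.fnmatchcase(value, pattern)


def matches_all(props, filters):
    """Semantics described for udevadm info: every filter must be satisfied."""
    return all(_property_matches(props, k, p) for k, p in filters)


def matches_any(props, filters):
    """Semantics described for udevadm trigger: one satisfied filter is enough."""
    return any(_property_matches(props, k, p) for k, p in filters)
```

A disk with SUBSYSTEM=block and DEVTYPE=disk passes matches_any for the filters SUBSYSTEM=block and DEVTYPE=partition, but not matches_all.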
mount-util: use mount beneath to replace previous namespace mount
Instead of mounting over, do an atomic swap using mount beneath, if
available. This way assets can be mounted again and again (e.g.:
updates) without leaking mounts.
Almost all code in namespace.c only logs at debug level as it is
"library-like" code. But there are some outliers. Adjust them to match
the rest of the code.
namespace: downgrade log message of error we ignore to LOG_WARNING
Frankly, the log message shouldn't be there at all; instead the error
should be propagated up, with a recognizable error code. But apparently
this is important to @bluca.
Lukas [Sun, 8 Oct 2023 17:45:34 +0000 (19:45 +0200)]
stub: NULL checks for DeviceHandle and FilePath
UKIs may be loaded in such a way that there cannot be a device handle to
the filesystem that contains the image, for example when using a
bootloader to load the image from a partition with a file system that is
not supported by the firmware.
With the current systemd stub, this causes a failed assertion, because
stub gets passed a NULL DeviceHandle and FilePath. Inserting two
explicit checks enables proper boot even in this case.
According to RFC 6762 section 8, an mDNS responder is supposed to announce its
records after probing.
Currently, there is a check in dns_scope_announce which returns if there are any
pending transactions. This prevents announcements from being sent out even if the
only pending transactions are non-probe transactions.
To fix this, return only if there are active probe transactions.
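The change in the gating condition can be sketched in Python. The Transaction class and the two function names are hypothetical stand-ins; only the before/after condition is taken from the description above.

```python
class Transaction:
    """Hypothetical stand-in for a DNS transaction in a scope."""

    def __init__(self, pending, probing):
        self.pending = pending  # still in flight
        self.probing = probing  # part of mDNS probing


def may_announce_old(transactions):
    """Old check: any pending transaction blocks the announcement."""
    return not any(t.pending for t in transactions)


def may_announce_new(transactions):
    """New check: only pending probe transactions block the announcement."""
    return not any(t.pending and t.probing for t in transactions)
```

A scope with one pending non-probe query is blocked by the old check but allowed to announce by the new one.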
run: support --scope on old service managers that lack native PIDFD support
Before this we'd fail with a complaint that PIDFDs are not supported by
the service manager. Add some compat support by falling back to classic
numeric PIDs in that case.
Nick Rosbrook [Thu, 12 Oct 2023 17:39:56 +0000 (13:39 -0400)]
nspawn: set CoredumpReceive=yes on container's scope when --boot is set
When --boot is set, and --keep-unit is not, set CoredumpReceive=yes on
the scope allocated for the container. When --keep-unit is set, nspawn
does not allocate the container's unit, so the existing unit needs to
configure this setting itself.
Since systemd-nspawn@.service sets --boot and --keep-unit, add
CoredumpReceive=yes to that unit.
Nick Rosbrook [Wed, 6 Sep 2023 15:03:41 +0000 (11:03 -0400)]
coredump: add support for forwarding coredump to containers
If a process crashes within a container, try and forward the coredump to
that container. To do this, check if the crashing process is in a
different pidns, and if so, find the PID of the namespace leader. We
only proceed with forwarding if that PID belongs to a cgroup that is
a descendant of another cgroup with user.delegate=1 and
user.coredump_receive=1 (i.e. Delegate=yes and CoredumpReceive=yes).
If we proceed, attach to the namespaces of the leader, and send the
coredump to systemd-coredump.socket in the container. Before this is
done, we need to translate the PID, UID, and GID, and also re-gather
procfs metadata. Translate the PID, UID, and GID to the perspective of
the container by sending an SCM_CREDENTIALS message over a socket pair
from the original systemd-coredump process, to the process forked in the
container.
If we cannot successfully forward the coredump, fall back to the current
behavior so that there is still a record of the crash on the host.
For a given PID and namespace type, this helper function gives the PID
of the leader of the namespace containing the given PID. Use this in
systemd-coredump instead of using the existing get_mount_namespace_leader.
Nick Rosbrook [Wed, 6 Sep 2023 15:01:33 +0000 (11:01 -0400)]
coredump: store crashing process UID and GID in Context
For convenience, store the crashing process's UID and GID in Context (as
uid_t and gid_t, respectively), as is currently done for the PID. This
means we can just parse the UID/GID once in save_context(), and use
those values in other places.
This is just refactoring, and is a preparation commit for container
support.
Nick Rosbrook [Fri, 29 Sep 2023 19:39:17 +0000 (15:39 -0400)]
core: add CoredumpReceive= setting
This setting indicates that the given unit wants to receive coredumps
for processes that crash within the cgroup of this unit. This setting
requires that Delegate= is also true, and therefore is only available
where Delegate= is available.
This will be used by systemd-coredump to support forwarding coredumps to
containers.
Mike Yuan [Tue, 3 Oct 2023 12:20:55 +0000 (20:20 +0800)]
core/varlink: make sure we setup non-serialized varlink sockets
Before this PR, if m->varlink_server is not yet set up during
deserialization, we call manager_setup_varlink_server rather than
manager_varlink_init. The former only binds to methods and doesn't set
up varlink addresses. As a result, newly-added varlink addresses are
not created if deserialization takes place.
Therefore, let's switch to manager_varlink_init, and add some
sanity checks to it in order to prevent listening on the same
address twice.
No functional changes, only moving code that is only needed in
exec_invoke, and adding new dependencies for seccomp/selinux/apparmor/pam
in meson for the sd-executor binary.
Luca Boccassi [Thu, 1 Jun 2023 18:51:42 +0000 (19:51 +0100)]
core: add systemd-executor binary
Currently we spawn services by forking a child process, doing a bunch
of work, and then exec'ing the service executable.
There are some advantages to this approach:
- quick: we immediately have access to all the enormous amount of
state simply by virtue of sharing the memory with the parent
- easy to refactor and add features
- part of the same binary, will never be out of sync
There are however significant drawbacks:
- doing work after fork and before exec is against glibc's supported
case for several APIs we call
- copy-on-write trap: anytime any memory is touched in either parent
or child, a copy of that page will be triggered
- memory footprint of the child process will be memory footprint of
PID1, but using the cgroup memory limits of the unit
The last issue is especially problematic on resource constrained
systems where hard memory caps are enforced and swap is not allowed.
As soon as PID1 is under load, with no page out due to no swap, and a
service with a low MemoryMax= tries to start, hilarity ensues.
Add a new systemd-executor binary, that is able to receive all the
required state via memfd, deserialize it, prepare the appropriate
data structures and call exec_child.
Use posix_spawn which uses CLONE_VM + CLONE_VFORK, to ensure there is
no copy-on-write (same address space will be used, and parent process
will be frozen, until exec).
The sd-executor binary is pinned by FD on startup, so that we can
guarantee there will be no incompatibilities during upgrades.
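The state hand-over can be illustrated with a toy Python sketch: the parent writes its state into an anonymous file, and the receiving side reads it back from the fd. The key=value line format and both function names are assumptions for illustration only, and the posix_spawn and fd-pinning parts are not shown. Where os.memfd_create is unavailable, an unlinked temporary file is used as a stand-in.

```python
import os
import tempfile


def serialize_to_fd(state):
    """Write key=value lines into an anonymous file and return its fd, rewound."""
    if hasattr(os, "memfd_create"):
        fd = os.memfd_create("executor-state")
    else:
        fd, path = tempfile.mkstemp()
        os.unlink(path)  # keep only the fd, similar in spirit to a memfd
    data = "".join(f"{k}={v}\n" for k, v in state.items()).encode("utf-8")
    view = memoryview(data)
    while view:  # handle short writes
        written = os.write(fd, view)
        view = view[written:]
    os.lseek(fd, 0, os.SEEK_SET)
    return fd


def deserialize_from_fd(fd):
    """Read the key=value lines back into a dict; closes the fd when done."""
    with os.fdopen(fd, "r", encoding="utf-8") as f:
        return dict(line.rstrip("\n").split("=", 1) for line in f if line.strip())
```

The toy format breaks on keys containing "=" or values containing newlines; a real serializer would need escaping.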