uid-range: add new uid_range_load_userns_by_fd() helper
This is similar to uid_range_load_userns() but instead of reading the
uid_map off a process it reads it off a userns fd.
(Of course the kernel has no API for this right now, hence we fork off a
throw-away process which joins the user namespace, and then read off the
data from there.)
image-policy: add a new image_policy_intersect() call
This new call takes two image policy objects and generates an
"intersection" policy, i.e. only allows what is allowed by both. Or in
other words it conceptually implements a binary AND of the policy flags.
(Except that it's a bit harder, due to normalization, and underspecified
flags).
We can use this later for mountfsd: a client can specify a policy, and
mountfsd can specify another policy, and we'll then apply only what both
allow.
Note that a policy generated like this might be invalid. For example, if
one policy says root must exist and be verity or luks protected, and the
other policy says root must be absent, then the intersection is invalid,
since one policy only allows what the other prohibits and vice versa.
We'll return a clear error code in that case (ENAVAIL). (This is because
we simply don't allow encoding such impossible policies in an
ImagePolicy structure, for good reasons.)
This new call is like varlink_peek_fd() (i.e. gets an fd out of the
connection but leaving it also in there), and combines ith with
F_DUPFD_CLOEXEC to make a copy of it.
We previously already had varlink_dup_fd() which was a duplicating
version for pushing an fd *into* the connection. To reduce confusion,
let's rename that one varlink_push_dup_fd() to make the symmetry to
valrink_push_fd() clear so that we have no:
varlink_peer_push_fd() → put fd in without dup'ing
varlink_peer_push_dup_fd() → same with F_DUPFD_CLOEXEC
varlink_peer_peek_fd() → get fd out without dup'ing
varlink_peer_peek_dup_fd() → same with F_DUPFD_CLOEXEC
Since e56a8790a0 debugging test-execute fails has been a royal PITA, since
we ditch all potentially useful output from the test units (that, for
the most part, run `sh -x ...`). Let's improve the situation a bit by
setting EXEC_OUTPUT_NULL only when running the single test case that
needs it, and inheriting stdout otherwise.
For example, with a purposefully introduced error we get this output
with this patch:
exec-personality-x86-64.service: About to execute: sh -x -c "c=\$\$(uname -m); test \"\$\$c\" = \"foo_bar\""
Serializing sd-executor-state to memfd.
...
Personality: x86-64
LockPersonality: no
SystemCallErrorNumber: kill
++ uname -m
+ c=x86_64
+ test x86_64 = foo_bar
Received SIGCHLD from PID 1520588 (sh).
Child 1520588 (sh) died (code=exited, status=1/FAILURE)
exec-personality-x86-64.service: Child 1520588 belongs to exec-personality-x86-64.service.
exec-personality-x86-64.service: Main process exited, code=exited, status=1/FAILURE
exec-personality-x86-64.service: Failed with result 'exit-code'.
...
Exit Status: 1
src/test/test-execute.c:456:test_exec_personality: exec-personality-x86-64.service: can_unshare=yes: exit status 1, expected 0
(test-execute-root) terminated by signal ABRT.
Assertion 'r >= 0' failed at src/test/test-execute.c:1433, function prepare_ns(). Aborting.
Aborted
But without it, we'd miss the most important part:
exec-personality-x86-64.service: About to execute: sh -x -c "c=\$\$(uname -m); test \"\$\$c\" = \"foo_bar\""
Serializing sd-executor-state to memfd.
...
Personality: x86-64
LockPersonality: no
SystemCallErrorNumber: kill
Received SIGCHLD from PID 1521365 (sh).
Child 1521365 (sh) died (code=exited, status=1/FAILURE)
exec-personality-x86-64.service: Child 1521365 belongs to exec-personality-x86-64.service.
exec-personality-x86-64.service: Main process exited, code=exited, status=1/FAILURE
exec-personality-x86-64.service: Failed with result 'exit-code'.
...
Exit Status: 1
src/test/test-execute.c:456:test_exec_personality: exec-personality-x86-64.service: can_unshare=yes: exit status 1, expected 0
(test-execute-root) terminated by signal ABRT.
Assertion 'r >= 0' failed at src/test/test-execute.c:1433, function prepare_ns(). Aborting.
Aborted
Mike Yuan [Fri, 5 Apr 2024 10:21:50 +0000 (18:21 +0800)]
core/service: make service_set_main_pidref consume pidref
Currently, the memory management of service_set_main_pidref
is a bit odd. Normally we either invalidate the original
resource on caller's side after the call succeeds, or
just pass the ownership wholly. But service_set_main_pidref
take a pointer, and calls pidref_done() internally.
Let's just make it consume the passed pidref. This is more
straightforward.
On s390x both __s390__ and __s390x__ are defined, and with the original
order we'd go through the __s390__ branch and emit a warning:
[169/2118] Compiling C object src/shared/libsystemd-shared-256.a.p/base-filesystem.c.o
../src/shared/base-filesystem.c:136:11: note: ‘#pragma message: Please add an entry above specifying whether your architecture uses /lib64/, /lib32/, or no such links.’
136 | # pragma message "Please add an entry above specifying whether your architecture uses /lib64/, /lib32/, or no such links."
| ^~~~~~~
test: account for build dir being under one of the tmpfs-ed directories
If we're running test-execute from the build directory which is under
one of the tmpfs-ed directories (i.e. /root or /tmp), test-execute might
behave strangely, since in that case manager_new() pins the system
systemd-executor binary instead of the build dir one, which may lead to
a very confusing test fails (if there's enough difference between the
system and built sd-executor binary). Let's account for that and
bind-mount the build dir under the tmpfs-ed directory if necessary.
Debugging this was pain, since the child process didn't log anything
once we closed stdout/stderr (for obvious reasons). Let's fix both
issues by switching logging to kmsg once we close stdin/stdout/stderr,
and also by making the test work fine when there are some extra FDs in
the child's environment.
core: Serialize both pid and pidfd to keep downgrades working
Currently, when downgrading from a version with pidfd support to a
version without pidfd support, all information about running processes
is lost as the newer systemd will serialized pidfds which are not recognized
by the older systemd when deserializing.
To improve the situation, let's serialize both the pid and the pidfd.
This is safe because existing versions will either replace the first
deserialized pidref with the second one or discard the second one in
favor of the first one depending on the unit and field. Older versions
that don't support pidfd's will silently discard any fields that contain
a pidfd as those will try to parse the field as a pid and since a pidfd
field will start with '@', those versions will debug error log and ignore
the value.
To make sure we reuse the existing pidfd as much as possible, the pidfd
is serialized first. Both for scopes and service main pids, if the same
pid is seen multiple times, the first pidref is kept. So by serializing
the pidfd first we make sure the original pidfd is used instead of the
new one which is opened when deserializing the first pid field.
For other control units, older versions with pidfd support will discard
the first pidfd and replace it with a new pidfd from the second pid field.
This is a slight regression on downgrades, but we make sure it doesn't
happen for future versions (and older versions when this commit is
backported) by modifying the logic to only use the first successfully
deserialized pidref so that the raw pid without pidfd is discarded instead
of it replacing the existing pidfd.
meson: set -fno-ssa-phiopt when building bpf with gcc
There are bugs in the kernel verifier that cause legitimate code
to be rejected, disabling this optimization makes bpf programs
built with a new enough gcc work again.
Hopefully, devices in PCI subsystem have some properties, thus have
their udev database file. But, that may not be true. Here, we only read
sysattrs of enumerated devices, hence it is not necessary to check if
the device is initialized or not.
Requested for the testing of F40 riscv bringup. Numbers copied from
https://uapi-group.org/specifications/specs/discoverable_partitions_specification/.
It'd be nice to do the same in TEST-58, but the code there is rather involved
and I don't have a system to test on. We can probably try that later on when F40
is available.
As it turns out libkmod has quite a bunch of deps, including various
compressing libs and similar. By turning this into a dlopen()
dependency, we can make our depchain during install time quite a bit
smaller. In particular as inside of containers kmod doesn't help anyway
as CAP_SYS_MODULE is not available anyway.
While we are at it, also share the code that sets up logging/kmod
context.
watchdog: clarify that we set the *watchdog* timeout
This makes sure we mention the word "watchdog" in every log message
related to the watchdog.
Also, this uses the expression "hardware timeout" when referring to the
primary timeout of the watchdog, as opposed to the "pretimeout".
(Not ideal wording I know, but it's preexisting to some point, I just
continued it. I think it's OK though, in particular to underline the
difference to the software watchdog logic we implement via WATCHDOG= in
sd_notify().)
We stick to debug logging because in some cases network-generator
will fall back to trying another parsing function if one fails, so
if we return an error it's not necessarily a failure.
man/notify-selfcontained-example: check argument first
This is just good style. In this particular case, if the argument is incorrect and
the function is not tested with $NOTIFY_SOCKET set, the user could not get the
proper error until running for real.
Also, remove mention of systemd. The protocol is fully generic on purpose.
- Install individual asan libraries instead of gcc
- Drop duplicate qrencode package from arch config
- Install dbus-user-session which provides default-dbus-session-bus
- Explicitly install dbus-broker on Arch Linux
Unfortunately, sd-bus-vtable.h, sd-journal.h, and sd-id128.h
have variadic macro and inline initialization of sub-object, these are
not supported in C90. So, we need to silence some errors.
mkosi: Install selinux tools in main image instead of initramfs
Also install setools-console and policycoreutils instead of setools
which pulls in the kitchen sink. Also install selinux-policy-targeted
to make sure the right policy is installed.
fuzz: check that ND options are parsed sucessfully
At that point the options have been parsed, sent and received again so
`ndisc_parse_options` should never fail there (unless ndisc_send corrupts
them somehow).
It's a follow-up to https://github.com/systemd/systemd/pull/31807
ssh-generator: create privsep dir via tmpfiles.d/ if we are told to
To make it easy to have a workable ssh-generator on various distros,
let's optionally generate the ssh privsep dir via tmpfiles.d/ drop-in.
This enables the concept with a path of /run/sshd/ as default. This is
the path Debian/Ubuntu uses, and means that we just work on those
distros. Debian/Ubuntu is the only distro (apparently?) that puts the
privsep dir under /run/, hence always needs the dir to be created
manually. Other distros don't need it that much, because they place the
dir in /usr/ (fedora, best choice!) or /var/ (others, not ideal, because
still mutable).
Also adds a longer explanation about this in NEWS, in the hope that
distro maintaines read that and maybe start cleaning this up.
I think the example should reflect the full set of lifecycle messages,
including STOPPING=1, which tells the service manager that the service
is already terminating. This is useful for reporting this information
back to the user and to suppress repeated shutdown requests.
It's not as important as the READY=1 and RELOADING=1 messages, since we
actively wait for those from the service message if the right Type= is
set. But it's still very valuable information, easy to do, and completes
the state engine.
Mike Yuan [Sun, 31 Mar 2024 12:52:39 +0000 (20:52 +0800)]
units: introduce systemd-hibernate-clear.service that clears
stale HibernateLocation EFI variable
Currently, if the HibernateLocation EFI variable exists,
but we failed to resume from it, the boot carries on
without clearing the stale variable. Therefore, the subsequent
boots would still be waiting for the device timeout,
unless the variable is purged manually.
There's no point to keep trying to resume after a successful
switch-root, because the hibernation image state
would have been invalidated by then. OTOH, we don't
want to clear the variable prematurely either,
i.e. in initrd, since if the resume device is the same
as root one, the boot won't succeed and the user might
be able to try resuming again. So, let's introduce a
unit that only runs after switch-root and clears the var.
Martin Wilck [Wed, 6 Mar 2024 10:39:00 +0000 (11:39 +0100)]
99-systemd.rules: rework SYSTEMD_READY logic for device mapper
Device mapper devices are set up in multiple steps. The first step, which
generates the initial "add" event, only creates an empty container, which is
useless for higher layers. SYSTEMD_READY should be set to 0 on this event to
avoid premature device activation.
The event that matters is the "activation" event: the first "change" event on
which DM_UDEV_DISABLE_OTHER_RULES_FLAG=1 is not set. When this event arrives,
the device is ready for being scanned by blkid and similar tools, and for being
activated by systemd.
Intermittent events with DM_UDEV_DISABLE_OTHER_RULES_FLAG=1 should be ignored
as far as systemd or higher-level block layers are concerned. Previous device
properties and symlinks should be preserved: the device shouldn't be scanned or
activated, but shouldn't be deactivated, either. In particular, SYSTEM_READY
shouldn't be set to 0 if it wasn't set before, because that might cause mounted
file systems to be unmounted. Such intermittent events may occur any time,
before or after the "activation" event.
DM_UDEV_DISABLE_OTHER_RULES_FLAG=1 can have multiple reasons. One possible reason
is that the device is suspended. There are other reasons that depend on the
device-mapper subsystem (LVM, multipath, dm-crypt, etc.).
The current systemd rule set
1) sets SYSTEMD_READY=0 if DM_UDEV_DISABLE_OTHER_RULES_FLAG is set in "add"
events;
2) imports SYSTEMD_READY from the udev db if DM_SUSPENDED is set, and jumps to systemd_end;
3) sets SYSTEMD_READY=1, otherwise.
This logic has several flaws:
* 1) can cause file systems to be unmounted if an coldplug event arrives while
a file system is suspended. This rule shouldn't be applied for coldplug events
or in general, "synthetic" add events;
* 2) evaluates DM_SUSPENDED=1, which is a device-mapper internal property.
It's wrong to infer that a device is accessible if DM_SUSPENDED=0.
The jump to systemd_end may cause properties and/or symlinks to be lost;
* 3) is superfluous, because SYSTEMD_READY=1 is equivalent with SYSTEMD_READY
being unset, and can create the wrong impression that the device was explicitly
activated.
This patch fixes the logic as follows:
- apply 1) only if DM_NAME is empty, which is only the case for the first
"genuine add" event;
- change 2) to use DM_UDEV_DISABLE_OTHER_RULES_FLAG instead of DM_SUSPENDED,
and remove the GOTO directive;
- remove 3).
Fixes: b7cf1b6 ("udev: use SYSTEMD_READY to mask uninitialized DM devices") Fixes: 35a6750 ("rules: set SYSTEMD_READY=0 on DM_UDEV_DISABLE_OTHER_RULES_FLAG=1 only with ADD event (#2747)") Signed-off-by: Martin Wilck <mwilck@suse.com>
Luca Boccassi [Fri, 29 Mar 2024 23:36:51 +0000 (23:36 +0000)]
gcrypt: dlopenify for libsystemd
gcrypt is used only for journal sealing operations in libsystemd, so it
can be made into a dlopen dependency that is used only on demand. This
allows to reduce the footprint of libsystemd in the most common cases.
Keep systemd-pull and systemd-resolved with normal linking, as they are
executables, and usually built with OpenSSL support anyway.
sd-bus-vtable: add dummy macro to support compile without GNU extension
If SD_BUS_METHOD_WITH_ARGS() is set with SD_BUS_NO_ARGS and/or SD_BUS_NO_RESULT,
then it introduces
_SD_VARARGS_FOREACH_EVEN(_SD_ECHO, NULL)
-> _SD_VARARGS_FOREACH_SEQ(_01, …, _50, NULL)
Hence, the variadic argument `...` in _SD_VARARGS_FOREACH_SEQ() has no
argument, but it is not allowed if built without GNU extension, e.g. -std=c11.
Let's introduce one more unused dummy argument to support such situation.