coredump: avoid deadlock when passing processed backtrace data
We would deadlock when passing the data back from the forked-off process that
was doing backtrace generation back to the coredump parent. This is because we
fork the child and wait for it to exit. The child tries to write too much data
to the output pipe, and and after the first 64k blocks on the parent because
the pipe is full. The bug surfaced in Fedora because of a combination of four
factors:
- 87707784c70dc9894ec613df0a6e75e732a362a3 was backported to v251.5, which
allowed coredump processing to be successful.
- 1a0281a3ebf4f8c16d40aa9e63103f16cd23bb2a was NOT backported, so the output
was very verbose.
- Fedora has the ELF package metadata available, so a lot of output can be
generated. Most other distros just don't have the information.
- gnome-calendar crashes and has a bazillion modules and 69596 bytes of output
are generated for it.
The code is changed to try to write data opportunistically. If we get partial
information, that is still logged. In is generally better to log partial
backtrace information than nothing at all.
Yu Watanabe [Tue, 18 Oct 2022 12:23:26 +0000 (21:23 +0900)]
test: skip one test for iszero_safe() on i386 without SSE2
We do not provide any numerical libraries, and iszero_safe() is only
used in parsing or formatting JSON. Hence, it is not necessary for us to
request that the function provides the same result on different systems.
manager: use target process context to set socket context
Use target process context to set socket context when using SELinuxContextFromNet
not systemd's context. Currently when using the SELinuxContextFromNet option for
a socket activated services, systemd calls getcon_raw which returns init_t and
uses the resulting context to compute the context to be passed to the
setsockcreatecon call. A socket of type init_t is created and listened on and
this means that SELinux policy cannot be written to control which processes
(SELinux types) can connect to the socket since the ref policy allows all
'types' to connect to sockets of the type init_t. When security accessors see
that any process can connect to a socket this raises serious concerns. I have
spoken with SELinux contributors in person and on the mailing list and the
consensus is that the best solution is to use the target executables context
when computing the sockets context in all cases.
Before this patch, the flow was:
'''
mac_selinux_get_child_mls_label:
peercon = getpeercon_raw(socket_fd);
if (!exec_label)
exec_label = getfilecon_raw(exe);
socket_open_fds:
if (params->selinux_context_net) #
label = mac_selinux_get_our_label(); # this part is removed
else #
label = mac_selinux_get_create_label_from_exe(path);
socket_address_listen_in_cgroup(s, &p->address, label);
analyze: use DumpUnitsMatchingPatternsByFileDescriptor
Similarly to DumpByFileDescriptor vs Dump,
DumpUnitsMatchingPatternsByFileDescriptor is used in preference. Dissimilarly,
a fallback to DumpUnitsMatchingPatterns is not done on error, because there is
no need for backwards compatibility.
The code is still more verbose than I'd like, but there are four different code
paths with slightly different rules in each case, so it's hard to make this all
very brief. Since we have a separate file dedicated to making those calls, the
verbose-but-easy-to-follow implementation should be OK.
Closes #24989.
I only did a quick test that all both variants works locally and over ssh.
Frantisek Sumsal [Mon, 17 Oct 2022 16:11:21 +0000 (18:11 +0200)]
test: call sync() before checking the test logs
Otherwise we might hit a race where we read the test log just before
it's fully written to the disk:
```
======================================================================
FAIL: test_interleaved (__main__.ExecutionResumeTest.test_interleaved)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/root/systemd/test/test-exec-deserialization.py", line 170, in test_interleaved
self.check_output(expected_output)
File "/root/systemd/test/test-exec-deserialization.py", line 111, in check_output
self.assertEqual(output, expected_output)
AssertionError: 'foo\n' != 'foo\nbar\n'
foo
+ bar
```
With some debug:
```
test_interleaved (__main__.ExecutionResumeTest.test_interleaved) ...
Assertion failed; file contents just after the assertion:
b'foo\n'
Frantisek Sumsal [Mon, 17 Oct 2022 13:00:12 +0000 (15:00 +0200)]
test: use SIGKILL to kill the container if necessary
TEST-69 uses a Python wrapper around the systemd-nspawn call, which on
error calls the `spawn.terminate()` method. However, with no arguments
it will only use SIGHUP and SIGINT signals - this might leave a stuck
container around, causing fails if the test is run again. With `force=True`
SIGKILL is used as well (if necessary).
Jan Janssen [Fri, 14 Oct 2022 09:09:12 +0000 (11:09 +0200)]
boot: Rework shim image verification
This moves the shim security arch override to the new
ReinstallProtocolInterface based interface. This also has the benefit to
reduce the time window in which we have this override active and also
actually removes it, which was not previously done.
The shim hooks themselves are also modernized too. The upcalls should
really not be neccessary if shim is happy with the provided binary.
Jan Janssen [Wed, 21 Sep 2022 10:23:36 +0000 (12:23 +0200)]
boot: Remove unused parameters from pe_kernel_info
Only the compat entry address is used now. This also now only returns
the compat entry address. If the image is native we do not need to try
calling into the entry address again as we would already have done so
from StartImage (and failed).
Jan Janssen [Wed, 21 Sep 2022 09:07:53 +0000 (11:07 +0200)]
stub: Use LoadImage/StartImage to start the kernel
This is the proper way to start any EFI binary. The fact this even ever
worked was because the kernel does not have any PE relocations.
The only downside is that the embedded kernel image has to be signed and
trusted by the firmware under secure boot. A future commit will try to
deal with that.
Jan Janssen [Wed, 21 Sep 2022 08:42:40 +0000 (10:42 +0200)]
stub: Rename image parameter
This is really the parent image for the kernel that is to be run.
Renaming it as such prevents confusion with any image handles that are
about to be created.
Frantisek Sumsal [Mon, 17 Oct 2022 12:31:25 +0000 (14:31 +0200)]
test: ignore gcov errors in TEST-34
TEST-34 complains in `test_check_writable` when running with gcov, as
the build directory tree is not writable with DynamicUser=true. As I had
no luck with $GCOV_PREFIX and other runtime gcov configuration, let's
just ignore the gcov errors for this test.
core: simplify the return convention in manager_load_unit()
This function was returning 0 or 1 on success. It has many callers, and it
wasn't clear if any of them care about the distinction. It turns out they don't
and the return values were done for convenience because manager_load_unit_prepare()
returns 0 or 1. Let's invert the code in the static function to follow the usual
pattern where 0 means "no work was done" and 1 means "work was done", and make
the non-static function always return 0 to make the code easier to read, and
also add comments that explain what is happening.
This adds two more phases to the PCR boot phase logic: "sysinit" +
"final".
The "sysinit" one is placed between sysinit.target and basic.target.
It's good to have a milestone in this place, since this is after all
file systems/LUKS volumes are in place (which sooner or later should
result in measurements of their own) and before services are started
(where we should be able to rely on them to be complete).
This is particularly useful to make certain secrets available for
mounting secondary file systems, but making them unavailable later.
This breaks API in a way (as measurements during runtime will change),
but given that the pcrphase stuff wasn't realeased yet should be OK.
Jan Janssen [Sun, 16 Oct 2022 07:36:21 +0000 (09:36 +0200)]
stub: Fix booting with old kernels
This fixes a regression introduced in e1636807 that removed setting this
value as it seemingly was not used by the kernel and would actively
break above 4G boots. But old kernels (4.18 in particular) will not boot
properly if it is not filled out by us.
The original issue was using the truncated value to then jump into the
kernel entry point, which we do not do anymore. So setting this value
again on newer kernels is fine.
gpt-auto: rename all functions that operate on a DissectedPartition object add_partition_xyz()
The function for handling regular mounts based on DissectedPartition
objects is called add_partition_mount(), so let's follow this scheme for
all other functions that handle them, too. This nicely separates out the
low-level functions (which get split up args) from the high-level
functions (which get a DissectedPartition object): the latter are called
add_partition_xyz() the former just add_xyz().
This makes naming a bit more systematic. No change in behaviour.
msizanoen1 [Sat, 8 Oct 2022 12:41:18 +0000 (19:41 +0700)]
journald: harden against forward clock jumps before unclean shutdown
Try harder to inherit the sequence number and ID from the old journal
file before rotating it away.
This helps the libsystemd journal file selection code make better decisions
even in the face of massive incorrect forward clock jumps prior to an
unclean shutdown.
TEST-15: add test for transient units with drop-ins
We want to test four things:
- that the transient units are successfully started when drop-ins exist
- that the transient setings override the defaults
- the drop-ins override the transient settings (the same as for a normal unit)
- that things are the same before and after a reload
To make things more fun, we start and stop units in two different ways: via
systemctl and via a direct busctl invocation. This gives us a bit more coverage
of different code paths.
TEST-15: also test hierarchical drop-ins for slices
Slices are worth testing too, because they don't need a fragment path so they
behave slightly differently than service units. I'm making this a separate
patch from the actual tests that I wanted to add later because it's complex
enough on its own.
TEST-15: allow helper functions to accept other unit types
clear_services() is renamed to clear_units() and now takes a full
unit name including the suffix as an argument.
_clear_service() is renamed to clear_unit() and changed likewise.
create_service() didn't have the same underscore prefix, and I don't think
it's useful or needed for a local function, so it is removed.
In https://github.com/containers/podman/issues/16107, starting of a transient
slice unit fails because there's a "global" drop-in
/usr/lib/systemd/user/slice.d/10-oomd-per-slice-defaults.conf (provided by
systemd-oomd-defaults package to install some default oomd policy). This means
that the unit_is_pristine() check fails and starting of the unit is forbidden.
It seems pretty clear to me that dropins at any other level then the unit
should be ignored in this check: we now have multiple layers of drop-ins
(for each level of the cgroup path, and also "global" ones for a specific
unit type). If we install a "global" drop-in, we wouldn't be able to start
any transient units of that type, which seems undesired.
In principle we could reject dropins at the unit level, but I don't think that
is useful. The whole reason for drop-ins is that they are "add ons", and there
isn't any particular reason to disallow them for transient units. It would also
make things harder to implement and describe: one place for drop-ins is good,
but another is bad. (And as a corner case: for instanciated units, a drop-in
in the template would be acceptable, but a instance-specific drop-in bad?)
Thus, $subject.
While at it, adjust the message. All the conditions in unit_is_pristine()
essentially mean that it wasn't loaded (e.g. it might be in an error state),
and that it doesn't have a fragment path (now that drop-ins are acceptable).
If there's a job for it, it necessarilly must have been loaded. If it is
merged into another unit, it also was loaded and found to be an alias.
Based on the discussion in the bugs, it seems that the current message
is far from obvious ;)
seccomp: drop per arch conditionalization in filter groups
We list plenty of arch-specific syscalls in our filter groups, treat the
s390 syscalls the same.
We handle gracefully anyway if some syscall doesn't exist locally on the
kernel or arch, let's rely on it. This has the benefit that
"systemd-analyze" will comprehensively tell you the syscalls filtered on
any arch for any arch.
Anita Zhang [Wed, 5 Oct 2022 08:40:40 +0000 (01:40 -0700)]
core: only allow systemd-oomd to use SubscribeManagedOOMCGroups
Attempt to address
https://github.com/systemd/systemd/issues/20330#issuecomment-1210028422.
Summary of the comment: Unprivileged users can potentially cause a denial of
service during systemd-oomd unit subscriptions by spamming requests to
SubscribeManagedOOMCGroups. As systemd-oomd.service is the only unit that
should be accessing this method, add a check on the caller's unit name to deter
them from successfully using this method.
msizanoen1 [Wed, 12 Oct 2022 06:40:05 +0000 (13:40 +0700)]
shared/logs-show: do not overwrite journal time in export format with source timestamps
Using _SOURCE_{MONOTONIC,REALTIME}_TIMESTAMP in place of the results of
sd_journal_get_{monotonic,realtime}_usecs in export formats might cause
internal inconsistency of realtime timestamp values within a journal export,
violating the export file format and causing systemd-journal-remote to
mass-generate journal files.
Fix this by using the real journal timestamps for
__{REALTIME,MONOTONIC}_TIMESTAMP.
NEWS: rework the description of systemd-measure a bit again
Try to separate the description so that changes are described first, and the
discussion follows separately. Remove some repeated verbose descriptions of the
subject: if one sentence describes that UKI contains an signature and describes
it in detail, the next sentence can just say "the signature" without
elaborating. Also, we don't do version-keying yet, so don't say "future"
kernels — older kernels will work too.
Yu Watanabe [Fri, 14 Oct 2022 07:18:35 +0000 (16:18 +0900)]
udev-builtin-kmod: support to run without arguments
If no module name is provided, then try to load modules based on the
device modealias.
Previously, MODALIAS property is passed as an argument, but it may
contain quotation. Hence, unfortunately the modalias may be modified
and cannot load expected modules.
install: include full type name in special UnitFilePresetMode values
Typically the _MAX and _INVALID special enum values use the full type as
prefix, even if the actual values of the enum might not. Let's follow
this rule here too.
install: make InstallChange enum type a proper type
We can just make this an enum, as long as we ensure it has enough range,
which we can do by adding -ERRNO_MAX as one possible value (at least on
GNU C). We already do that at multiple other places, so let's do this
here too.