Yu Watanabe [Mon, 8 Mar 2021 06:39:53 +0000 (15:39 +0900)]
sd-event: re-check new epoll events when a child event is queued
Previously, when a process outputs something and exit just after
epoll_wait() but before process_child(), then the IO event is ignored
even if the IO event has higher priority. See #18190.
This can be solved by checking epoll event again after process_child().
However, there exists a possibility that another process outputs and
exits just after process_child() but before the second epoll_wait().
When the IO event has lower priority than the child event, still IO
event is processed.
So, this makes new epoll events and child events are checked in a loop
until no new event is detected. To prevent an infinite loop, the number
of maximum trial is set to 10.
Luca Boccassi [Sun, 14 Mar 2021 12:36:15 +0000 (12:36 +0000)]
man: specify that ProtectProc= does not work with root/cap_sys_ptrace
When using hidepid=invisible on procfs, the kernel will check if the
gid of the process trying to access /proc is the same as the gid of
the process that mounted the /proc instance, or if it has the ptrace
capability:
Given we set up the /proc instance as root for system services,
The same restriction applies to CAP_SYS_PTRACE, if a process runs with
it then hidepid=invisible has no effect.
ProtectProc effectively can only be used with User= or DynamicUser=yes,
without CAP_SYS_PTRACE.
Update the documentation to explicitly state these limitations.
Daan De Meyer [Fri, 12 Mar 2021 22:09:44 +0000 (22:09 +0000)]
boot: Move console declarations to missing_efi.h
These were added to eficonex.h in gnu-efi 3.0.13. Let's move them
to missing_efi.h behind an appropriate guard to fix the build with
recent versions of gnu-efi.
Kevin Backhouse [Fri, 12 Mar 2021 17:00:56 +0000 (18:00 +0100)]
ask-password-api: fix error handling on invalid unicode character
The integer overflow happens when utf8_encoded_valid_unichar() returns an error
code. The error code is a negative number: -22. This overflows when it is
assigned to `z` (type `size_t`). This can cause an infinite loop if the value
of `q` is 22 or larger.
To reproduce the bug, you need to run `systemd-ask-password` and enter an
invalid unicode character, followed by a backspace character.
Yu Watanabe [Mon, 8 Mar 2021 06:39:53 +0000 (15:39 +0900)]
sd-event: re-check new epoll events when a child event is queued
Previously, when a process outputs something and exit just after
epoll_wait() but before process_child(), then the IO event is ignored
even if the IO event has higher priority. See #18190.
This can be solved by checking epoll event again after process_child().
However, there exists a possibility that another process outputs and
exits just after process_child() but before the second epoll_wait().
When the IO event has lower priority than the child event, still IO
event is processed.
So, this makes new epoll events and child events are checked in a loop
until no new event is detected. To prevent an infinite loop, the number
of maximum trial is set to 10.
Frantisek Sumsal [Thu, 11 Mar 2021 11:49:00 +0000 (12:49 +0100)]
repart: fix the loop dev support check
Since f17bdf8264e231fa31c769bff2475ef698487d0b the test-repart was
effectively disabled, since `/dev/loop-control` is a character special
file, whereas `-f` works only on regular files. Even though we could use
`-c` to check specifically for character special files, let's use `-e`
just in case.
Michal Sekletar [Tue, 9 Mar 2021 16:22:32 +0000 (17:22 +0100)]
install: refactor find_symlinks() and don't search for symlinks recursively
After all we are only interested in symlinks either in top-level config
directory or in .wants and .requires sub-directories.
As a bonus this should speed up ListUnitFiles() roughly 3-4x on systems
with a lot of units that use drop-ins (e.g. SSH jump hosts with a lot of
user session scopes).
Tables with only one column aren't really tables, they are lists. And if
each cell only consists of a single word, they are probably better
written in a single line. Hence, shorten the man page a bit, and list
boot loader spec partition types in a simple sentence.
Also, drop "root-secondary" from the list. When dissecting images we'll
upgrade "root-secondary" to "root" if we mount it, and do so only if
"root" doesn't exist. Hence never mention "root-secondary" as we never
will mount a partition under that id.
This makes sure nspawn's --volatile=yes switch works again: there we
have a read-only image that is overmounted by a tmpfs (with the
exception of /usr). This we need to mkdir all mount points even though
the image is read-only.
Hence, let's drop the optimizatio of avoiding mkdir() on images that are
read-only, it's wrong and misleading here, since the image itself might
be read-only but our mounts are not.
dissect-image: clean up meaning of DISSECT_IMAGE_MKDIR
Previously handling of DISSECT_IMAGE_MKDIR was pretty weird and broken:
it would control both if we create the top-level mount point when
mounting an image, and the inner mount points for images that consist of
multiple file systems. However, the latter is redundant, since 1f0f82f1311e4c52152b8e2b6f266258709c137d does this too, a few lines
further up – unconditionally!
Hence, let's make the meaning of DISSECT_IMAGE_MKDIR more strict: it
shall be only about the top-level mount point, not about the inner ones
(where we'll continue to create what is missing alwayway). Having a
separate flag for the top-level mount point is relevant, since the mount
point dir created by it will remain on the host fs – unlike the
directories we create inside the image, which will stay within the
image.
This slightly change of meaning is actually inline with what the flag is
actually used for and documented in systemd-dissect.
shared/fstab-util: teach fstab_filter_options() a mode where all values are returned
Apart from tests, the new argument isn't used anywhere, so there should be no
functional change. Note that the two arms of the big conditional are switched, so the
diff is artificially inflated. The actual code change is rather small. I dropped the
path which extracts ret_value manually, because it wasn't supporting unescaping of the
escape character properly.
shared/fstab-util: pass through the escape character
… when not used to escape the separator (,) or the escape character (\).
This mostly restores behaviour from before 0645b83a40d1c782f173c4d8440ab2fc82a75006,
but still allows "," to be escaped.
basic/extract-word: allow escape character to be escaped
With EXTRACT_UNESCAPE_SEPARATORS, backslash is used to escape the separator.
But it wasn't possible to insert the backslash itself. Let's allow this and
add test.
basic/extract_word: try to explain what the various options do
A test for stripping of escaped backslashes without any flags was explicitly
added back in 4034a06ddb82ec9868cd52496fef2f5faa25575f. So it seems to be on
purpose, though I would say that this is at least surprising and hence deserves
a comment.
In test-extract-word, add tests for standalone EXTRACT_UNESCAPE_SEPARATORS.
Only behaviour combined with EXTRACT_CUNESCAPE was tested.
shared/fstab-util: immediately drop empty options again
In the conversion from strv_split() to strv_split_full() done in 7bb553bb98a57b4e03804f8192bdc5a534325582, EXTRACT_DONT_COALESCE_SEPARATORS was
added. I think this was just by mistake… We never look for "empty options", so
whether we immediately ignore the extra separator or store the empty string in
strv, should make no difference.
generators: warn but ignore failure to write timeouts
When we failed to split the options (because of disallowed quoting syntax, which
might be a bug in its own), we would silently fail. Instead, let's emit a warning.
Since we ignore the value if we cannot parse it anyway, let's ignore this error
too.
This creates a "well known" for sgx_enclave ownership. By doing this here we
avoid the risk that various projects making use of the device will provide
similar-but-slightly-incompatible installation instructions, in particular
using different group names.
ACLs are actually a better approach to grant access to users, but not in all
cases, so we want to provide a standard group anyway.
Mode is 0o660, not 0o666 because this is very new code and distributions are
likely to not want to give full access to all users. This might change in the
future, but being conservative is a good default in the beginning.
Rules for /dev/sgx_provision will be provided by libsg-ae-pce:
https://github.com/intel/linux-sgx/issues/678.
Frantisek Sumsal [Wed, 10 Mar 2021 15:41:35 +0000 (16:41 +0100)]
coredump: omit coredump info when -q is used with the `debug` verb
Skip printing the coredump info table when using the `debug` verb in
combination with the `-q/--quiet` option. Useful when trying to gather
coredump info non-interactively via scripted gdb commands.
Frantisek Sumsal [Wed, 10 Mar 2021 11:30:04 +0000 (12:30 +0100)]
test: fix permissions of the ASan udev workaround
otherwise udev complains about the file being world-writable:
systemd-udevd[228]: Configuration file /etc/udev/rules.d/00-set-LD_PRELOAD.rules is marked world-writable. Please remove world writability permission bits. Proceeding anyway.
The patch seems to cause usb devices to get some attributes set from the parent
PCI device. 'hwdb' builtin has support for breaking iteration upwards on usb
devices. But when '--subsystem=foo' is specified, iteration is continued. I'm
sure it *could* be figured out, but it seems hard to get all the combinations
correct. So let's revert to functional status quo ante, even if does the lookup
more than once unnecessarily.
When running TEST-22 under ASan, there's a chain of events which causes
`stat` to output an extraneous ASan error message, causing following
fail:
```
+ test -d /tmp/d/1
++ stat -c %U:%G:%a /tmp/d/1
==82==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.
+ test = daemon:daemon:755
.//usr/lib/systemd/tests/testdata/units/testsuite-22.02.sh: line 24: test: =: unary operator expected
```
This is caused by `stat` calling nss which in Arch's configuration calls
the nss-systemd module, that pulls in libasan which causes the $LD_PRELOAD
error message, since `stat` is an uninstrumented binary.
The $LD_PRELOAD variable is explicitly unset for all testsuite-* services
since it causes various issues when calling uninstrumented libraries, so
setting it globally is not an option. Another option would be to set
$LD_PRELOAD for each `stat` call, but that would unnecessarily clutter
the test code.
socket-util: refuse ifnames with embedded '%' as invalid
So Linux has this (insane — in my opinion) "feature" that if you name a
network interface "foo%d" then it will automatically look for the
interface starting with "foo…" with the lowest number that is not used
yet and allocates that.
We should never clash with this "magic" handling of ifnames, hence
refuse this, since otherwise we never know what the name is we end up
with.
We should probably switch things from a deny list to an allow list
sooner or later and be much stricter. Since the kernel directly enforces
only very few rules on the names, we'd need to do some research what is
safe and what is not first, though.
sd-netlink: use setsockopt_int() also for NETLINK_ADD/DROP_MEMBERSHIP
We use 'unsigned' as the type, but netlink(7) says the type is 'int'.
It doesn't really matter, since they are both the same size. Let's use
our helper to shorten the code a bit.
PID1 already logs about the service being started, so this line isn't necessary
in normal use. Also, by the time it is emitted, the service has already
signalled readiness, so let's not say "starting" but "started".
We drop the reference stored in Manager.managed_oom_varlink_request in two code paths:
vl_disconnect() which is installed as a disconnect callback, and in manager_varlink_done().
But we also make a disconnect from manager_varlink_done(). So we end up with the following
call stack:
(gdb) bt
0 vl_disconnect (s=0x112c7b0, link=0xea0070, userdata=0xe9bcc0) at ../src/core/core-varlink.c:414
1 0x00007f1366e9d5ac in varlink_detach_server (v=0xea0070) at ../src/shared/varlink.c:1210
2 0x00007f1366e9d664 in varlink_close (v=0xea0070) at ../src/shared/varlink.c:1228
3 0x00007f1366e9d6b5 in varlink_close_unref (v=0xea0070) at ../src/shared/varlink.c:1240
4 0x0000000000524629 in manager_varlink_done (m=0xe9bcc0) at ../src/core/core-varlink.c:479
5 0x000000000048ef7b in manager_free (m=0xe9bcc0) at ../src/core/manager.c:1357
6 0x000000000042602c in main (argc=5, argv=0x7fff439c43d8) at ../src/core/main.c:2909
When we enter vl_disconnect(), m->managed_oom_varlink_request.n_ref==1.
When we exit from vl_discconect(), m->managed_oom_varlink_request==NULL. But
varlink_close_unref() has a copy of the pointer in *v. When we continue executing
varlink_close_unref(), this pointer is dangling, and the call to varlink_unref()
is done with an invalid pointer.
timedated: fix skipping of comments in config file
Reading file '/usr/lib/systemd/ntp-units.d/80-systemd-timesync.list'
Failed to add NTP service "# This file is part of systemd.", ignoring: Invalid argument
Failed to add NTP service "# See systemd-timedated.service(8) for more information.", ignoring: Invalid argument
efi-loader: make efi_loader_entry_name_valid() check a bit stricter
Previously we'd just check if the ID was no-empty an no longer than
FILENAME_MAX. The latter was probably a mistake, given the comment next
to it. Instead of fixing that to check for NAME_MAX let's instead just
switch over to filename_is_valid() which odes a similar check, plus a
some minor additional checks. After all we do want that valid EFI boot
menu entry ids are usable as filenames.
This fixes two checks where we compare string sizes when validating with
FILENAME_MAX. In both cases the check apparently wants to check if the
name fits in a filename, but that's not actually what FILENAME_MAX can
be used for, as it — in contrast to what the name suggests — actually
encodes the maximum length of a path.
In both cases the stricter change doesn't actually change much, but the
use of FILENAME_MAX is still misleading and typically wrong.
Yu Watanabe [Mon, 8 Mar 2021 03:00:32 +0000 (12:00 +0900)]
seccomp: do not ignore deny-listed syscalls with errno when list is allow-list
Previously, if the hashmap is allow-list and a new deny-listed syscall
is added, seccomp_parse_syscall_filter() simply drop the new syscall
from hashmap even if error number is specified.
This makes 'allow-list' hashmap store two types of entries:
- allow-listed syscalls, which are stored with negative value (-1).
- deny-listed syscalls, which are stored with specified errno.
Yu Watanabe [Mon, 8 Mar 2021 02:54:05 +0000 (11:54 +0900)]
core: drop meaningless parse_syscall_and_errno() calls
parse_syscall_and_errno() does not check the validity of syscall name or
syscall group name, but it just split into syscall name and errno.
So, it is not necessary to call it for SystemCallLog=.