Add support for conditions on the machines firmware
This allows to limit units to machines that run on a certain firmware
type. For device tree defined machines checking against the machine's
compatible is also possible.
network: neighbor: Always add neighbors with replace
We were duplicating setting flags for the message and a combination of
NLM_F_APPEND and NLM_F_CREATE which does not make sense. We should have
been using NLM_F_REPLACE and NLM_F_CREATE since the kernel can
dynamically create neighbors prior to us adding an entry. Otherwise, we
can end up with cases where the message will time out after ~25s even
though the neighbor still gets added. This delays the rest of the setup
of the interface even though the error is ultimately ignored.
system-conf: drop reference to ShutdownWatchdogUsec=
Commit 65224c1d0e50667a87c2c4f840c49d4918718f80 renamed ShutdownWatchdogUsec
into RebootWatchdogUsec but left a reference of ShutdownWatchdogUsec in
system.conf.
Julia Kartseva [Mon, 16 Nov 2020 08:26:44 +0000 (00:26 -0800)]
tests: add test program for SocketBind{Allow|Deny}=
Verify that service exited correctly if valid ports are passed to
SocketBind{Allow|Deny}=
Use `ncat` program starting a listening service binding to a specified
port, e.g.
"timeout --preserve-status -sSIGTERM 1s /bin/nc -l -p ${port} -vv"
Julia Kartseva [Mon, 26 Apr 2021 02:10:40 +0000 (19:10 -0700)]
core, bpf: add socket-bind feature to unit
Add supported and install unit interface for socket-bind feature.
supported verifies that
- unified cgroup hierarchy (cgroup v2) is used
- BPF_FRAMEWORK (libbpf + clang + llvm + bpftool) was available in
compile time
- kernel supports BPF_PROG_TYPE_CGROUP_SOCK_ADDR
- bpf programs can be loaded into kernel
- bpf link can be used
install:
- load bpf_object from bpf skeleton
- resize rules map to fit socket_bind_allow and socket_bind deny rules
from cgroup context
- populate cgroup-bpf maps with rules
- get bpf programs from bpf skeleton
- attach programs to unit cgroup using bpf link
- save bpf link in the unit
* Add `bpf-framework` feature gate with 'auto', 'true' and 'false' choices
* Add libbpf [0] dependency
* Search for clang llvm-strip and bpftool binaries in compile time to
generate bpf skeleton.
For libbpf [0], make 0.2.0 [1] the minimum required version.
If libbpf is satisfied, set HAVE_LIBBPF config option to 1.
If `bpf-framework` feature gate is set to 'auto', means that whether
bpf feature is enabled or now is defined by the presence of all of
libbpf, clang, llvm and bpftool in build
environment.
With 'auto' all dependencies are optional.
If the gate is set to `true`, make all of the libbpf, clang and llvm
dependencies mandatory.
If it's set to `false`, set `BPF_FRAMEWORK` to false and make libbpf
dependency optional.
libbpf dependency is dynamic followed by the common pattern in systemd.
meson, bpf: add build rule for socket_bind program
Julia Kartseva [Sat, 14 Nov 2020 01:02:50 +0000 (17:02 -0800)]
bpf: add build script for bpf programs
Add a build script to compile bpf source code. A program in restricted
C is compiled into an object file. Object file is converted to BPF
skeleton [0] header file.
If build with custom meson build rule, the target header will reside in
build/ directory (not in source tree), e.g the path for socket_bind:
`build/src/core/bpf/socket_bind/socket-bind.skel.h`
Script runs the phases:
* clang to generate *.o from restricted C
* llvm-strip to remove useless DWARF info
* bpf skeleton generation with bpftool
These phases are logged to stderr for debug purposes.
To include BTF debug information, -g option is passed to clang.
Julia Kartseva [Sat, 14 Nov 2020 01:40:17 +0000 (17:40 -0800)]
bpf: add socket-bind BPF program code sources
Introduce BPF program compiled from BPF source code in
restricted C - socket-bind.
It addresses feature request [0].
The goal is to allow systemd services to bind(2) only to a predefined set
of ports. This prevents assigning socket address with unallowed port
to a socket and creating servers listening on that port.
This compliments firewalling feature presenting in systemd:
whereas cgroup/{egress|ingress} hooks act on packets, this doesn't
protect from untrusted service or payload hijacking an important port.
While ports in 0-1023 range are restricted to root only, 1024-65535
range is not protected by any mean.
Performance is another aspect of socket_bind feature since per-packet
cost can be eliminated for some port-based filtering policies.
The feature is implemented with cgroup/bind{4|6} hooks [1].
In contrast to the present systemd approach using raw bpf instructions,
this program is compiled from sources. Stretch goal is to
make bpf ecosystem in systemd more friendly for developer and to clear
path for more BPF programs.
Specifying the test number manually is tedious and prone to errors (as
recently proven). Since we have all the necessary data to work out the
test number, let's do it automagically.
test-unit-serialize: add a very basic test that command deserialization works
We should test both serialization and deserialization works properly.
But the serialization/deserialization code is deeply entwined with the
manager state, and I think quite a bit of refactoring will be required before
this is possible. But let's at least add this simple test for now.
After 4b30f2e135ee84041bb597edca7225858f4ef4fb, reading stable_secret
sysctl property fails with -ENOMEM, instead of -EIO.
This is due to read_full_virtual_file() uses read() as the backend while
read_one_line_file() uses fgetc(). And each functions return different
error on fails.
Anyway, the failure is harmless here. So, the log message and comment is
updated.
homectl: pick up cached/credential store/env var passwords *before* issuing first request
Previously, we'd generally attempt the operation first, without any
passwords, and only query for a password if that operation then fails
and asks for one. This is done to improve compatibility with
password-less authentication schemes, such as security tokens and
similar.
This patch modifies this slightly: if a password can be acquired cheaply
via the keyring password cache, the $CREDENTIALS_PATH credential store,
or the $PASSWORD/$PIN environment variables, acquire it *before* issuing
the first requested.
This should save us a pointless roundtrip, and should never hurt.
tests: use setfacl to give $SUDO_USER read permissions on artifacts
We have to invoke the tests as superuser, and not being able to read
the journal as the invoking user is annoying. I don't think there are
any security considerations here, since the invoking user can already
put arbitrary code in the Makefile and test scripts which get executed
with root privileges.
fstab-generator: clean up mount point flags handling
Let's rename MountpointsFlags → MountPointFlags. In most of our codebase
we name things mount_point/MountPoint rather than mountpoint/Mountpoint,
do so here too.
Also, prefix the enum values with "MOUNT_". The fact the enum values
weren#t prefixed was pretty unique in our codebase, and pretty
surprising. Let's fix that.
This is just refactoring, no actual change in behaviour
test: move the logic to support /skipped into shared logic
The logic to query test state was rather complex. I don't quite grok the point
of ret=$((ret+1))… But afaics, the precise result was always ignored by the
caller anyway.
oomd works way better with swap, so let's make the test less flaky by
configuring a swap device for it. This also allows us to drop the ugly
`cat`s from the load-generating script.
Jan Synacek [Wed, 3 Jun 2020 08:33:21 +0000 (10:33 +0200)]
install: warn if WantedBy targets don't exist
Currently, if [Install] section contains WantedBy=target that doesn't exist,
systemd creates the symlinks anyway. That is just user-unfriendly.
Let's be nice and warn about installing non-existent targets.
dissect-image: prefer PARTN= uevent property over "partition" sysfs attr
The kernel will send us a PARTN= uevent proprty with partition add
events, let's use it instead of going for the "partition" sysfs attr.
It's less racy that way and there are reports the sysfs attr shows up
after the device, which makes it evern worse.
Peter Morrow [Tue, 13 Apr 2021 16:20:42 +0000 (17:20 +0100)]
core: allow services stuck in reloading state to exit
If a service is in reloading state but has exited do not delay
the final exit until the service reload timer expires. Instead allow
the service to exit immediately since we can't expect the service to
ever transition out of reloading state.
For example if a service sent RELOADING=1 but crashed before it could
send READY=1 then it should be restarted if the service had
Restart= configured.
Signed-off-by: Peter Morrow <pemorrow@linux.microsoft.com>
repart: don't try to extract directory of root dir when copying directories
It's OK to specify the root dir as target directory when copying
directories. However, in that case path_extract_filename() is going to
fail, because the root dir simply has not filename.
Let's address that by moving the call further down into the loop, when
we made sure that the target dir doesn't exist yet (the root dir always
exists, hence this check is sufficient).
Moreover, in the branch for copying regular files, also move the calls
down, and generate friendly error messages in case people try to
overwrite dirs with regular files (and the root dir is just a special
case of a dir).
Altogether this makes CopyFiles=/some/place:/ work, i.e. copying some
dir on the host into the root dir of the newly created fs. Previously
this would fail with an error about the inability to extract a filename
from "/", needlessly.
repart: prefix the correct path with root dir in log output
When we copy files into the freshly formatted file system, the mount
point prefix must be prepended to the *target* path, not the *source*
path. Not just in code but in the log message about it, too.
Igor Zhbanov [Tue, 20 Apr 2021 17:22:28 +0000 (17:22 +0000)]
journald: Retry if posix_fallocate returned -1 (EINTR)
On some conditions (particularly when mobile CPUs are going to sleep),
the posix_fallocate(), which is called when a new journal file is allocated,
can return -1 (EINTR). This is counted as a fatal error. So the journald
closes both old and journals, and simply throwing away further incoming
events, because of no log files open.
Introduce posix_fallocate_loop() that restarts the function in the case
of EINTR. Also let's make code base more uniform by returning negative
values on error.
Fix assert in test-sigbus.c that incorrectly counted positive values as
success. After changing the function return values, that will actually work.
Fixes: #19041 Signed-off-by: Igor Zhbanov <i.zhbanov@omprussia.ru>
generator: write out special systemd-fsck-usr.service
So far all file systems where checked by instances of
systemd-fsck@.service, with the exception of the root fs which was
covered by systemd-fsck-root.service. The special handling is necessary
to deal with ordering issues: we typically want the root fs to be
checked before all others, and — weirdly — allow mounting it before the
fsck done (for compat with initrd-less boots).
This adds similar special handling for /usr: if the hierarchy is placed
on a separate file system check it with a special
systemd-fsck-usr.service instead of a regular sysemd-fsck@.service
instance. Reason is again ordering: we want to allow mounting of /usr
without the root fs already being around in the initrd, to cover for
cases where the root fs is created on first boot and thus cannot be
mounted/checked before /usr.
network: enable DHCP broadcast flag if required by interface
Some interfaces require that the DHCPOFFER message is sent via broadcast
if they can't receive unicast messages before they've been configured
with an IP address.
E.g., s390 ccwgroup network interfaces operating in layer3 mode face
this limitation. This can prevent the interfaces from receiving an
IP address via DHCP, if the have been configured for layer3.
To allow DHCP over such interfaces, we're introducing a new device
property ID_NET_DHCP_BROADCAST which can be set for those.
The networkd DHCP client will check whether this property is set
for an interface, and if so will set the broadcast flag, unless
the network configuration for the interface has an explicit
RequestBroadcast setting.
Besides that, we're adding a udev rule to set this device property
for ccwgroup devices operating in layer3 mode, which is the case
if the ID_NET_DRIVER property is qeth_l3.
I was worried that the text size will grow, but apparently that's not the
case:
With --optimization=2:
$ size build/src/shared/libsystemd-shared-248.a.p/gpt.c.o*
text data bss dec hex filename
3674 1104 0 4778 12aa build/src/shared/libsystemd-shared-248.a.p/gpt.c.o.old
3085 1104 0 4189 105d build/src/shared/libsystemd-shared-248.a.p/gpt.c.o
(I don't understand the generated assembly, even though it seems to work:
Disassembly of section .text.gpt_partition_type_is_usr_verity:
It is made inline in the hope that the compiler will be able to optimize
all the va_args boilerplate away, and do an efficient comparison when
the arguments are all constants.