bpf-devices: if a device node is referenced which doesn't exist, downgrade log message
Currently in many of our test cases you'll see a warning about a tun
device not being around. Let's make that quiet, since if there's no such
device there's no point in adding it to a policy anyway, and it makes
useless noise go away.
We keep the warning as a warning if a device node is missing for other
errors than ENOENT.
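A minimal sketch of the level selection described above. The helper name is hypothetical and not taken from the actual patch; it only illustrates treating ENOENT as a debug-level event while other errors remain warnings:

```c
#include <errno.h>
#include <syslog.h>

/* Hypothetical helper (not the real systemd function): pick a log level for
 * a failed lookup of a device node. A missing node (ENOENT) only gets a
 * debug message, any other error stays a warning. */
static int device_node_log_level(int error) {
        return error == ENOENT ? LOG_DEBUG : LOG_WARNING;
}
```

A caller would pass the positive errno value it got from the failed lookup and feed the result to its logging function.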
bpf-devices: normalize the return handling of functions that put together policy
Under some conditions we suppress generating BPF programs. Let's
systematically return 0 when we do this, and 1 if we actually did
something, instead of second-guessing this in the caller.
This is not only more correct, but allows us to suppress BPF programs in
more cases in later commits.
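The return convention can be illustrated roughly as follows. The function and its parameters are made up for this sketch and do not mirror the real bpf-devices code:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a function that puts together policy. Returns:
 *   < 0  negative errno on failure
 *     0  generation was suppressed, nothing was added
 *     1  a rule was actually added */
static int policy_add_rule(bool suppress, size_t *n_rules) {
        if (!n_rules)
                return -EINVAL;

        if (suppress)
                return 0;

        (*n_rules)++;
        return 1;
}
```

With this shape the caller can simply test `r > 0` to learn whether a program needs to be installed at all.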
bpf-devices: normalize how we pass around major/minor values
It is somewhat unclear whether major/minor of device nodes are supposed
to be "unsigned" or "dev_t". Various codebases assume the latter, but
glibc's major()/minor() actually return a value typed as
"unsigned". On glibc dev_t is actually 64bit even if the kernel only
exposes 32bit. Hence this distinction kinda matters.
Let's clean up the handling a bit: let's follow glibc's type
system here, and use unsigned (and not int).
Also let's pass invalid major/minor values around as UINT_MAX rather
than via pointers, to match how we usually do this, and to shorten our
code a bit. This is safe: since the Linux dev_t space is only 32bit,
we can't possibly have a valid major or minor this high, given
they must be smaller in size. While other archs disagree on the types of
major/minor, they also tend to have similar limits. In fact on FreeBSD
for example major()/minor() return a signed int, which would hence also
mean that UINT_MAX cannot be a valid major or minor.
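As a rough illustration of the UINT_MAX convention, here is a sketch for Linux/glibc. The helper name is hypothetical, and treating an unset value as "matches anything" is an assumption made for this example only:

```c
#include <limits.h>
#include <stdbool.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Hypothetical helper: does the device number "rdev" match the requested
 * major/minor? Both are passed as plain unsigned values, following glibc's
 * major()/minor() return type. UINT_MAX means "not set" and, as an
 * assumption of this sketch, matches any value. */
static bool device_matches(dev_t rdev, unsigned maj, unsigned min) {
        if (maj != UINT_MAX && major(rdev) != maj)
                return false;

        if (min != UINT_MAX && minor(rdev) != min)
                return false;

        return true;
}
```

No pointer or separate "is valid" boolean is needed, since UINT_MAX cannot collide with a real major or minor.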
test: adjust test-path to fail gracefully with the new pidfd_spawn stuff
Since 2e106312e2 the test unit fails with 'resources' result instead of
'exit-code', which the test didn't account for when running unprivileged.
Before 2e106312e2:
$ /root/systemd/build/test-path
Failed to start transient scope unit: Interactive authentication required.
Couldn't allocate a scope unit for this test, proceeding without.
...
-.slice: Failed to enable/disable controllers on cgroup /user.slice/user-1000.slice/session-1.scope, ignoring: Permission denied
app.slice: Failed to create cgroup /user.slice/user-1000.slice/session-1.scope/app.slice: Permission denied
-.slice: Failed to enable/disable controllers on cgroup /user.slice/user-1000.slice/session-1.scope, ignoring: Permission denied
app.slice: Failed to create cgroup /user.slice/user-1000.slice/session-1.scope/app.slice: Permission denied
...
line 151: path-exists.path: state = running; result = success (left: 29986250)
line 151: path-exists.service: state = start; result = success
path-exists.service: Main process exited, code=exited, status=219/CGROUP
path-exists.service: Failed with result 'exit-code'.
line 151: path-exists.path: state = running; result = success (left: 29985948)
line 151: path-exists.service: state = failed; result = exit-code
Failed to start service path-exists.service, aborting test: failed/exit-code
After 2e106312e2:
$ /root/systemd/build/test-path
Failed to start transient scope unit: Interactive authentication required.
Couldn't allocate a scope unit for this test, proceeding without.
...
-.slice: Failed to enable/disable controllers on cgroup /user.slice/user-1000.slice/session-1.scope, ignoring: Permission denied
app.slice: Failed to create cgroup /user.slice/user-1000.slice/session-1.scope/app.slice: Permission denied
-.slice: Failed to enable/disable controllers on cgroup /user.slice/user-1000.slice/session-1.scope, ignoring: Permission denied
app.slice: Failed to create cgroup /user.slice/user-1000.slice/session-1.scope/app.slice: Permission denied
path-exists.service: Failed to spawn executor: No such file or directory
path-exists.service: Failed to spawn 'start' task: No such file or directory
path-exists.service: Failed with result 'resources'.
packit: temporarily build systemd without BPF stuff
The kernel-tools meta-package was retired in Rawhide, but its
replacement has not landed, yet. Until that happens, let's build without
the bpf-framework stuff.
Daan De Meyer [Thu, 8 Feb 2024 09:54:54 +0000 (10:54 +0100)]
Add systemd.default_debug_tty=
Let's allow configuring the debug tty independently of enabling/disabling
the debug shell. This allows mkosi to configure the correct tty while
leaving enabling/disabling the debug tty to the user.
sysext: rename "directory_name" field to "full_identifier"
So the field contains simply the full name of the command being invoked,
hence rename the field to match the contents, and to mirror the
"short_identifier" field.
Interestingly, the field is apparently not actually used by anything
though! But we are not going to remove it, since a follow-up commit will
start making use of it.
Yu Watanabe [Thu, 8 Feb 2024 03:47:39 +0000 (12:47 +0900)]
network: make Reload bus method synchronous
Prompted by https://github.com/systemd/systemd/pull/30085#discussion_r1401534107.
Note that, like the Reconfigure bus method, even if reconfiguration of an
interface is triggered by the Reload method, the method only waits for the link
to enter the configuring state (or the unmanaged state if no matching .network
file exists). Users still need to invoke systemd-networkd-wait-online if it is
necessary to wait for the interface to enter the configured state after the
Reload method.
As described in https://github.com/systemd/systemd/issues/31235, the preset
state for systemd-homed-activate.service was unclear. On the one hand, we have
a preset with 'enable systemd-homed.service', and systemd-homed.service has
'Also=systemd-homed-activate.service systemd-homed-firstboot.service', so
'preset systemd-homed.service' would also enable those two services, but
'preset systemd-homed-activate.service' would disable it, because the presets
don't say it is enabled. It seems that this configuration is internally
inconsistent. As described in the issue, maybe systemctl should be smarter
here, or warn about such configs. Either way, let's make our config consistent.
Luca Boccassi [Wed, 7 Feb 2024 00:36:39 +0000 (00:36 +0000)]
portable: add --copy=mixed to copy images and link profiles
This new mode copies resources provided by the client, so that they
remain available for inspect/detach even if the original images are
deleted, but symlinks the profile as that is owned by the OS, so that
updates are automatically applied.
man: mention that preset-all is performed during early boot
The intro of systemd-firstboot is rewritten to make it clearer how it fits into
the big picture. systemd itself handles some of the machine-id and preset setup,
and systemd-firstboot.service is used to interactively fill in the blanks.
sd-dhcp6-client: allow setting send-release when client is running
The send-release option only affects the client when it is stopping. There
is no reason to disallow setting this option while the client is
running.
A user might want to delay the decision of sending a RELEASE message to
a later stage where the client is already running.
process-util: use only the least significant byte from personality()
The personality() syscall returns a 32-bit value where the top three
bytes are reserved for flags that emulate historical or architectural
quirks, and only the least significant byte reflects the actual
personality we're interested in (in opinionated_personality()).
Use the newly defined mask in the corresponding test as well, otherwise
the test fails on some more "exotic" architectures that set some of the
"quirk" flags:
~# uname -m
armv7l
~# build/test-seccomp
...
/* test_lock_personality */
current personality=0x0
safe_personality(PERSONALITY_INVALID)=0x800000
Assertion '(unsigned long) safe_personality(current) == current' failed at src/test/test-seccomp.c:970, function test_lock_personality(). Aborting.
lockpersonalityseccomp terminated by signal ABRT.
Assertion 'wait_for_terminate_and_check("lockpersonalityseccomp", pid, WAIT_LOG) == EXIT_SUCCESS' failed at src/test/test-seccomp.c:996, function test_lock_personality(). Aborting.
Aborted (core dumped)
See: personality(2) and comments in sys/personality.h
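A small sketch of the masking, using the values from the log above. The macro and function names are placeholders chosen for this example and are not necessarily the ones the commit defines:

```c
/* Assumed name: mask that keeps only the least significant byte of the
 * value returned by personality(). */
#define PERSONALITY_BASE_MASK 0xFFUL

/* Strip the "quirk" flag bits stored in the upper three bytes, leaving
 * only the personality proper. */
static unsigned long personality_base(unsigned long p) {
        return p & PERSONALITY_BASE_MASK;
}
```

Applied to the 0x800000 seen on armv7l this yields 0x0, which is what the comparison in the test expects.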
Yu Watanabe [Fri, 2 Feb 2024 04:08:35 +0000 (13:08 +0900)]
network: set 'removing' flag to remembered object
Previously, if address_remove() or friends were called with a temporary
object, the removing flag was assigned to the temporary object, and was
not set on the remembered object. Hence, e.g.
route_is_ready_to_configure() wrongly judged that an address required for a
route was (still) ready, and networkd failed to configure the route.
After the commit, the Address objects remembered by a Link are always those
provided by the kernel. Hence, it is not necessary to set the flag, as it is
always ignored by the kernel, and the kernel sets the flag on notification if
necessary.
This is in preparation for https://github.com/systemd/systemd/pull/30360 to be
merged in a future release. As described there:
nscd is known to be racy [1] and it was already deprecated and later dropped
in Fedora a while back [1,2]. We don't need to support obsolete stuff in
systemd, and the cache in systemd-resolved provides a better solution anyway.
Note that our "support" is only the signal to flush the cache that we send at
various points. Nscd itself may still exist, dropping it is a decision to be
made in glibc.
Mike Yuan [Sun, 4 Feb 2024 15:22:46 +0000 (23:22 +0800)]
core: reuse credential dir across start and start-post if populated, fresh otherwise
Currently, exec_setup_credential() always rewrites all credentials
upon exec_invoke(), i.e. invocation of each ExecCommand, and within
a single tmpfs instance. This is problematic though:
* When writing each tmp cred file, we essentially double the size
of the credential. Therefore, if one cred is bigger than half
of CREDENTIALS_TOTAL_SIZE_MAX, confusing ENOSPC occurs (see also
https://github.com/systemd/systemd/pull/24734#issuecomment-1925440546)
* Credentials are a unit-wide thing and thus should not change
during the whole lifetime of the main process. However, if e.g.
an on-disk credential or SetCredential= in the unit file
changes between ExecStart= and ExecStartPost=,
the credentials are overwritten when the latter gets to run,
and the already-running main process suddenly sees
completely different creds.
So, let's try to reuse the final cred dir if the main process has started
and the tmpfs has been populated, so that the creds used are stable
across all ExecStart= and ExecStartPost=-s. We still want to retain
the ability of updating creds through ExecStartPre= though, therefore
we forcibly use a fresh cred dir for those. 'Fresh' means to actually
unmount the old tmpfs first, so the first problem goes away, too.
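The decision described above could be sketched like this. The enum and function are invented for illustration and are not the actual exec_setup_credential() logic:

```c
#include <stdbool.h>

/* Hypothetical command kinds, only the ones discussed above. */
typedef enum {
        CMD_START_PRE,
        CMD_START,
        CMD_START_POST,
} CommandKind;

/* Returns true if a fresh credential dir should be set up (unmounting any
 * old tmpfs first), false if the existing populated one can be reused. */
static bool want_fresh_credential_dir(CommandKind kind, bool main_started, bool populated) {
        /* ExecStartPre= always gets a fresh dir, so creds can still be updated there. */
        if (kind == CMD_START_PRE)
                return true;

        /* Reuse only if the main process has started and the tmpfs is populated. */
        return !(main_started && populated);
}
```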
Felix Riemann [Fri, 2 Feb 2024 17:08:52 +0000 (18:08 +0100)]
cryptenroll: Fix reading keyfile from socket
systemd-cryptenroll uses the READ_FULL_FILE_CONNECT_SOCKET flag when
reading the keyfile to also allow reading it from a socket. But it also
sets the offset to 0, which causes an unnecessary seek to the beginning of
the newly opened keyfile and disables socket support again, as sockets do
not support seeking.
Disable seeking entirely to remove the unneeded seek and restore support
for reading the keyfile from a socket again as with systemd-cryptsetup.
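Why the seek breaks socket support can be demonstrated with a short standalone sketch. It is independent of the systemd code, and the function names are made up for this example:

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 0 if seeking to offset 0 works on fd, or a negative errno. */
static int try_seek_to_start(int fd) {
        if (lseek(fd, 0, SEEK_SET) < 0)
                return -errno;

        return 0;
}

/* Creates a connected AF_UNIX socket pair and reports the result of
 * seeking on one end of it. */
static int seek_on_socket(void) {
        int fds[2], r;

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
                return -errno;

        r = try_seek_to_start(fds[0]);

        close(fds[0]);
        close(fds[1]);
        return r;
}
```

On Linux seek_on_socket() returns -ESPIPE, so even a seek to offset 0 is enough to make reading from a socket fail.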
Also= lists units which should be enabled/disabled together with the first unit.
But userdbd is independent of homed, we shouldn't e.g. disable it even if homed
is disabled.
load-fragment: set PATH_CHECK_NON_API_VFS flag at various other places
I tried to be conservative here, and hence in doubt I left the flag off,
but in some cases I really can't see any reason why it would make sense
to specify paths into API VFS, hence add it there, to lock things down
a bit.
parse-helpers: add new PATH_CHECK_NON_API_VFS flag
In various contexts it's a bit icky to allow paths below /proc/, /sys/,
/dev/, i.e. file hierarchies where API VFS are placed. Let's add a new
flag for path_simplify_and_warn() to check for this and refuse a path if
it is located in one of these hierarchies.
Enable this when parsing WorkingDirectory=.
This is inspired by CVE-2024-21626, which uses trickery around the cwd
and /proc/self/fd/.
AFAICS we are not actually vulnerable to the same issue as explained in
the CVE since we execute the WorkingDirectory= setting very late, i.e.
long after we set up the new mount namespace. But let's filter out icky
stuff earlier rather than later, as an extra safety precaution.
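A simplified sketch of such a check follows. It is not the code behind PATH_CHECK_NON_API_VFS; the helper name is hypothetical, it assumes the path is already absolute and simplified, and refusing the top-level directories themselves is a choice made for this example:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if "path" is /proc, /sys or /dev, or lies
 * below one of them. Assumes an absolute, already simplified path. */
static bool path_below_api_vfs(const char *path) {
        static const char *const prefixes[] = { "/proc", "/sys", "/dev" };

        assert(path);

        for (size_t i = 0; i < sizeof(prefixes) / sizeof(prefixes[0]); i++) {
                size_t n = strlen(prefixes[i]);

                /* Match whole path components only, so that e.g. "/system"
                 * is not mistaken for something below "/sys". */
                if (strncmp(path, prefixes[i], n) == 0 &&
                    (path[n] == '/' || path[n] == '\0'))
                        return true;
        }

        return false;
}
```

A parser for a setting such as WorkingDirectory= would reject the value whenever this returns true.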
Luca Boccassi [Fri, 12 Jan 2024 21:32:20 +0000 (21:32 +0000)]
core: add support for pidfd_spawn
Added in glibc 2.39, pidfd_spawn() allows cloning into a cgroup and getting
a pidfd back instead of a pid. This removes race conditions for
both changing cgroups and getting a reliable reference for the
child process.
We already use __VA_OPT__ in multiple places, which was introduced in
gcc 8 [0], so let's bump the baseline to reflect that. I chose gcc 8.4,
as that was the lowest 8.x version I could easily get my hands on when I
verified this (on Ubuntu Focal with the gcc-8 package).
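For reference, a tiny example of what __VA_OPT__ does. The macros are invented for this sketch and do not come from the systemd tree; compilers older than gcc 8 will not accept them:

```c
#include <stdio.h>

/* Expands to 1 if any variadic argument is given, to 0 otherwise:
 * the tokens inside __VA_OPT__() only appear when __VA_ARGS__ is non-empty. */
#define HAS_ARGS(...) (0 __VA_OPT__(+ 1))

/* Formats into a char array with a fixed prefix. The comma before
 * __VA_ARGS__ is emitted only if there are variadic arguments, so the
 * macro also works with a bare format string. */
#define FORMAT_MSG(buf, fmt, ...) \
        snprintf(buf, sizeof(buf), "msg: " fmt __VA_OPT__(,) __VA_ARGS__)
```

Both FORMAT_MSG(buf, "plain") and FORMAT_MSG(buf, "n=%d", 42) compile, where buf is a char array; without __VA_OPT__ the first form would leave a dangling comma.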
Mike Yuan [Sun, 4 Feb 2024 11:36:06 +0000 (19:36 +0800)]
core/service: don't setup credentials for ExecCondition= and ExecReload=
This seems to be a mistake in #27279. I believe credentials should
not be made available to condition or reload tasks. In most cases
they're irrelevant to the actual job of the service. Also, currently
the first ExecCondition= or ExecReload= cannot access creds anyway,
making the incompatibility introduced negligible.
If people actually come up with valid use cases, we can always
revisit this.