If we created the dir successfully, we let chmod_and_chown_at() do its thing
and shouldn't go into the part where we check if the existing directory has the
right permissions and ownership and possibly adjust them. The code was doing
that, by relying on the fact that chmod_and_chown_at() does not return -EEXIST.
That's probably true, but seems unnecessarilly complicated.
William Roberts [Fri, 24 Feb 2023 20:11:16 +0000 (14:11 -0600)]
tpm2: add support for a trusted SRK
Prevent attackers from spoofing the tpmKey portion of the AuthSession by
adding a trusted key to the LUKS header metadata. Also, use a persistent
object rather than a transient object.
This provides the following benifits:
1. No way to MITM the tpmKey portion of the session, see [1] for
details.
2. Strengthens the encrypted sessions, note that the bindKey could be
dropped now.
3. Speed, once it's created we just use it.
4. Owner Auth is needed to call create primary, so using the SRK
creates a scratch space for normal users.
This is a "first to set" model, in where the first person to set the key
in the LUKS header wins. Thus, setup should be done in a known good
state. If an SRK, which is a primary key at a special persistent
address, is found, it will use whatever is there. If not, it creates an
SRK. The SRK follows the convetions used through the tpm2-software
organization code on GitHub [2], however, a split has occured between
Windows and Linux with respect to SRK templates. The Linux SRK is
generated with the unique field size set to 0, in Windows, it properly
sets the size to key size in bytes and the unique data to all 0's of that
size. Note the proper templates for SRKs is covered in spec [3].
However, the most important thing, is that both SRKs are passwordless,
and thus they should be interchangable. If Windows is the first to make
the SRK, systemd will gladly accept it and vice-versa.
1. Without the bindKey being utilized, an attacker was able to intercept
this and fake a key, thus being able to decrypt and encrypt traffic as
needed. Introduction of the bindKey strengthened this, but allows for
the attacker to brute force AES128CFB using pin guesses. Introduction of
the salt increases the difficulty of this attack as well as DA attacks
on the TPM objects itself.
pam_nologin looks for /etc/nologin and /run/nologin.
user-sessions creates (and removes) /run/nologin, but also removes
/etc/nologin. (This behaviour is unchanged since the introduction
of the binary in e92787416c691c3f34f47349e5eae3fa68eae856.)
By not removing pam_nologin we fully drop compatibility with PAM < 1.1.
This has the advantage that now /etc/nologin can be used by administrator to
disable user logins, e.g. for extended maintanance. We already specified
PAM >= 1.1.2 as dependency, so this was already covered.
We now modify our cmdline to use '=' for all arguments,
but didn't change early setup check to work with that.
So every daemon-reexec does a full setup, thus breaking
running user sessions.
Daan De Meyer [Thu, 22 Dec 2022 13:59:56 +0000 (14:59 +0100)]
find-esp: Add openat() like helpers that operate on fds
We also rework the internals of find-esp to work on directory file
descriptors instead of absolute paths and do a lot of general cleanups.
By passing the parent directory file descriptor to verify_fsroot_dir()
along with the name of the directory we're operating on, we can get rid
of the fallback that goes via path to open the parent directory if '..'
fails due to permission errors.
Daan De Meyer [Thu, 30 Mar 2023 08:39:53 +0000 (10:39 +0200)]
btrfs-util: Add btrfs_get_block_device_at()
Let's make btrfs_get_block_device_fd() more generic by renaming it
to btrfs_get_block_device_at() so it can operate on only paths, dir_fd
and path, or only on fd by using xopenat().
Let's always operate on paths without resolving the final component.
If the path is a symlink, it could point to a vendor default in /usr,
in which case we definitely do not want to modify the vendor defaults.
To avoid this from happening, we replace the symlink with our own file
instead of modifying the file the symlink points at.
Frantisek Sumsal [Fri, 31 Mar 2023 16:42:38 +0000 (18:42 +0200)]
test: set ReadWritePaths= for test-.services when built w/ coverage
Let's make the dropin, to make the build dir writable for gcov, a bit
more generic, so it can be used by all units starting with prefix test-.
This should help with a bunch of recent reports about missing coverage I
got, as well as with existing test units using DynamicUser=true.
This might feel a bit like a magic trick from behind the curtains, but I
want to touch the actual tests as little as possible, since it makes them
unnecessarily messy (see the various workarounds for sanitizers), and
the coverage reports are generated only in a specific CI job anyway.
core: skip deps on oomd if v2 or memory unavailable
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2055664 Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2172146
User report that systemd repeatedly logs about not being able to start oomd
when booted with v1:
Feb 20 16:52:33 systemd[1]: systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer was skipped because of a failed condition check (ConditionControlGroupController=v2).
Feb 20 16:52:34 systemd[1]: systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer was skipped because of a failed condition check (ConditionControlGroupController=v2).
Feb 20 16:52:34 systemd[1]: systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer was skipped because of a failed condition check (ConditionControlGroupController=v2).
Feb 20 16:52:34 systemd[1]: systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer was skipped because of a failed condition check (ConditionControlGroupController=v2).
Feb 20 16:52:34 systemd[1]: systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer was skipped because of a failed condition check (ConditionControlGroupController=v2).
Feb 20 16:52:34 systemd[1]: systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer was skipped because of a failed condition check (ConditionControlGroupController=v2).
Feb 20 16:52:34 systemd[2067491]: Queued start job for default target default.target.
Feb 20 16:52:34 systemd[1]: systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer was skipped because of a failed condition check (ConditionControlGroupController=v2).
Feb 20 16:52:34 systemd[2067491]: Created slice app.slice - User Application Slice.
Feb 20 16:52:34 systemd[1]: systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer was skipped because of a failed condition check (ConditionControlGroupController=v2).
systemd-oomd.service that pulls systemd-oomd.socket in (because it requires
it); systemd-oomd.service itself is pulled by user@.service because
systemd-oomd package installs an override config file that sets
ManagedOOMMemoryPressure=kill.
Add a check to the code that adds the implicit dependency to skip the
dep if we cannot start it. The check is done exactly the same as in oomd
itself.
Thomas Blume [Tue, 14 Mar 2023 14:21:29 +0000 (15:21 +0100)]
test: use setpriv instead of su for user switch from root
systemd-repart needs to find mkfs.ext4 for the test.
This is located in the directory /usr/sbin on openSUSE Tumbleweed.
But since the variable ALWAYS_SET_PATH in /etc/login.defs is set to yes,
su re-initializes the $PATH variable and removes /usr/sbin.
Hence, mkfs.ext4 is not found and the test fails.
Using setpriv instead of su fixes this issue and is more appropriate to
do the switch user task from root.
[zjs: move setpriv to $BASICTOOLS and force-push to retrigger CI]
TODO: drop items regarding swap-for-hibernate-only-use
I doubt we should bother. Swap always makes sense, and having a swap
partition for hibernate only without using it all the time just makes
the system worse overall.
bootctl: clean up handling of files with no version information
get_file_version() would return:
- various negative errors if the file could not be accessed or if it was not a
regular file
- 0/NULL if the file was too small
- -ESRCH or -EINVAL if the file did not contain the marker
- -ENOMEM or permissions errors
- 1 if the marker was found
bootctl status iterates over /EFI/{systemd,BOOT}/*.efi and checks if the files
contain a systemd-boot version tag. Resource or permission errors should be
fatal, but lack of version information should be silently ignored.
OTOH, when updating or installing bootloader files, the version is expected
to be present.
get_file_version() is changed to return -ESRCH if the version is unavailable,
and other errnos for permission or resource errors.
The logging is reworked to always display an error if encountered, but also
to log the status at debug level what the result of the version inquiry is.
This makes it figure out what is going on:
/efi/EFI/systemd/systemd-bootx64.efi: EFI binary LoaderInfo marker: "systemd-boot 253-6.fc38"
/efi/EFI/BOOT/BOOTfbx64.efi: EFI binary has no LoaderInfo marker.
/efi/EFI/BOOT/BOOTIA32.EFI: EFI binary has no LoaderInfo marker.
/efi/EFI/BOOT/BOOTX64.EFI: EFI binary LoaderInfo marker: "systemd-boot 253-6.fc38"
Frantisek Sumsal [Thu, 30 Mar 2023 17:26:53 +0000 (19:26 +0200)]
coverage: add a wrapper for execveat()
gcov provides wrappers for the exec*() calls but there's none for execveat(),
which means we lose all coverage prior to the call. To mitigate this, let's
add a simple execveat() wrapper in gcov's style[0], which dumps and resets
the coverage data when needed.
This applies only when we're built with -Dfexecve=true.
We have three states:
- ENABLE_COREDUMP and systemd-coredump is installed,
- ENABLE_COREDUMP but systemd-coredump is not installed,
- !ENABLE_COREDUMP.
In the last case we would not do any coredumping-related setup in pid1, which
means that coredumps would go to to the working directory of the process, but
actually limits are set to 0. This is inherited by children of pid1.
As discussed extensively in https://github.com/systemd/systemd/pull/26607, this
default is bad: dumps are written to arbitrary directories and not cleaned up.
Nevertheless, the kernel cannot really fix it. It doesn't know where to write,
and it doesn't know when that place would become available. It is only the
userspace that can tell this to the kernel. So the only sensible change in the
kernel would be to default to '|/bin/false', i.e. do what we do now.
In the middle case, we disabled writing of coredumps via a pattern, but raise
the RLIMIT_CORE. We need to raise the limit because we can't raise it later
after processes have been forked off. This means we behave correctly, but allow
coredumping to be enabled at a later point without a reboot.
This patch makes the last case behave like the middle case. This means that
even if systemd is compiled with systemd-coredump, it still does the usual
setup. If users want to restore the kernel default, they need to provide two
drop-in files:
for sysctl.d, with 'kernel.core_pattern=core'
for systemd.conf, with 'DefaultLimitCORE=0'.
The general idea is that pid1 does the safe thing. A distro may want to use
something different than the systemd-coredump machinery, and then that would
could packaged together with the drop-ins to change the configuration.
Luca Boccassi [Tue, 28 Mar 2023 15:19:47 +0000 (16:19 +0100)]
sysext: stop storing under /usr/lib[/local]/extensions/
sysexts are meant to extend /usr. All extension images and directories are opened and merged in a
single, read-only overlayfs layer, mounted on /usr.
So far, we had fallback storage directories in /usr/lib/extensions and /usr/local/lib/extensions.
This is problematic for three reasons.
Firstly, technically, for directory-based extensions the kernel will reject
creating such an overlay, as there is a recursion problem. It actively
validates that a lowerdir is not a child of another lowerdir, and fails with
-ELOOP if it is. So having a sysext /usr/lib/extensions/myextdir/ would result
in an overlayfs config lowerdir=/usr/lib/extensions/myextdir/usr/:/usr which is
not allowed, as indicated by Christian the kernel performs this check:
/*
* Check if this layer root is a descendant of:
* - another layer of this overlayfs instance
* - upper/work dir of any overlayfs instance
*/
<...>
/* Walk back ancestors to root (inclusive) looking for traps */
while (!err && parent != next) {
if (is_lower && ovl_lookup_trap_inode(sb, parent)) {
err = -ELOOP;
pr_err("overlapping %s path\n", name);
Secondly, there's a confusing aspect to this recursive storage. If you
have /usr/lib/extensions/myext.raw which contains /usr/lib/extensions/mynested.raw
'systemd-sysext merge' will only pick up the first one, but both will appear in
the merged root under /usr/lib/extensions/. So you have two extension images, both
appear in your merged filesystem, but only one is actually in use.
Finally, there's a conceptual aspect: the idea behind sysexts and hermetic /usr
is that the /usr tree is not modified locally, but owned by the vendor. Dropping
extensions in /usr thus goes contrary to this foundational concept.
Yu Watanabe [Mon, 27 Mar 2023 19:14:03 +0000 (04:14 +0900)]
test-kernel-install: several cleanups
- allow to run without $PROJECT_BUILD_ROOT,
- drop unnecessary export for bootctl,
- enable -x option to show commands,
- use 'test ! -e' to check the nonexistence of files,
- show more debugging logs.
units: let's establish the coredump socket before writting core_pattern sysctl
It's a bit nicer if we only write the sysctl core_pattern once the
coredump socket is established, since it's the backend for the handler.
Given the systemd-coredump.socket basically has no dependencies that run
before it this should not really make things slower or so, it just
removes the tiny window where core pattern is in effect that wants to
connect to the backend socket but cannot.
The status quo isn't terrible, and not too different in effect: either
way, until the socket unit is up we won't process coredumps. It's mostly
what kind of behaviour you get then: an error due to /bin/false being
invoked, or an error because systemd-coredump can't connect to its
socket. After this patch we'll exclusively see the former.
Let's use more _cleanup_ expressions. Various other modernizations. No
actual code changes, except for maybe a conversion to use heap memory
when generating an array of fds, instead of stack as before. Given that
fdstores are typically user controlled, that should be a wise idea.