Previous behaviour can be restored via --expand-environment=no:
$ systemd-run -q --scope --expand-environment=no echo '$TERM'
$TERM
Fixes #22948.
At some level, this is a compat break. Fortunately --scope is not very widely
used, so I think we can get away with this. Having different syntax depending
on whether --scope was used or not was bad UX.
run: add --expand-environment=no to disable server-side envvar expansion
This uses StartExecEx to get the equivalent of ExecStart=:. StartExecEx was
added in b3d593673c5b8b0b7d781fd26ab2062ca6e7dbdb, so this will not work with
older systemds.
A hint is emitted if we get an error indicating lack of support. PID1 returns
SD_BUS_ERROR_PROPERTY_READ_ONLY, but I'm checking for
SD_BUS_ERROR_UNKNOWN_PROPERTY too for safety.
Just refactoring, in preparation for future changes.
(Though I think it'd be reasonable to do anyway, those functions were
awfully long.)
'git diff' displays this badly. The middle part of start_transient_service()
is moved to make_transient_service_unit(), and the middle part of
start_transient_trigger() is moved to make_transient_trigger_unit().
start_transient_service() would return two ints: one normally and one via
*retval. We can just return one int and propagate it directly, because we
use DEFINE_MAIN_FUNCTION_WITH_POSITIVE_FAILURE().
The property name is called ExecStartEx, but we have to write it as ExecStart=
in the unit file. :(
Bug introduced in b3d593673c5b8b0b7d781fd26ab2062ca6e7dbdb when ex-properties
were initially added.
In addition, we cannot escape $ as $$, because when ":" is used, we wouldn't
unescape $$ back to $.
Our escaping of '$' is '$$', not '\$'. We would write unit files that
were not valid:
$ systemd-run --user bash -c 'echo $$; sleep 1000'
Running as unit: run-r1c7c45b5b69f487c86ae205e12100808.service
$ systemctl cat --user run-r1c7c45b5b69f487c86ae205e12100808
# /run/user/1000/systemd/transient/run-r1c7c45b5b69f487c86ae205e12100808.service
...
ExecStart="/usr/bin/bash" "-c" "echo \$\$\; sleep 1000"
Similarly, ';' cannot be escaped as '\;'. Only a handful of characters
listed in "Supported escapes" is allowed.
Escaping of "'" can be done, but it's not useful because we use double quotes
around the string anyway whenever we do escaping.
unit_write_setting() is called all over the place. In a great majority of
places we write either fixed strings or something that we generate ourselves,
so no escaping or quoting is needed. (And it's not allowed, e.g.
'Type="oneshot"' would not work.) But if we forgot to add escaping or quoting
for a free-style string, it would probably allow writing a unit file that would
be read completely wrong. I looked over various places where
unit_write_setting() is called, and I couldn't find any place where
quoting/escaping was forgotten. But trying to figure out the full
ramifications of this change is not easy.
__builtin_popcount() is a bit of a mouthful, so let's provide a helper.
Using _Generic has the advantage that if a type other then the ones on
the list is given, compilation will fail. This is nice, because if by any
change we pass a wider type, it is rejected immediately instead of being
truncated.
log.h is also needed. It is included transitively, but let's include it
directly.
test-core-unit: add new test file for unit_escape_setting() and friends
None of the existing test files fit very well. test-unit-serialize is
pretty close, but it does special cgroup setup, which we don't need in
this case. I hope we can add more tests in the future for this basic
functionality, so I'm adding a brand new file names after the source file
it's testing.
Move the tests that link to libcore into a separate subgroup.
They are special and it makes sense to keep them together. While
at it, make the list alphabetical.
Also, merge the list additions into one. No idea why it was like that.
manager: remove transient unit directory during startup
I was testing transient units and user@.service crashed. I restarted it, and
tried to create a transient unit. It failed because
/run/user/1000/systemd/transient/ remained after the previous aborted run:
Failed to start transient service unit: Unit run-u0.service was already loaded or has a fragment file.
Remove the directory during initial startup so we don't get confused by our own
files.
core: a more informative error when SetProperties/StartTransientUnit fails
I was changing how some properties are appended to the StartTransientUnit call
and messed up the message contents. When something is wrong with how the
message is structed, we would return a very generic
"Failed to start transient service unit: No such device or address".
Mention that it was property setting that failed, and translate ENXIO to a
different message. bus_unit_set_properties() or any of the children it calls
may also return other errors, in particular EBADMSG or ENOMEM, but the error
message that is generated for those is understandable, so we don't need to
"translate" them explicitly.
bus_unit_set_properties() is called from two places, so it seems nicer to
generate the message internally, rather than ask the caller to do that. Also,
now bus_unit_set_properties() always sets <error>, which is nicer for the
callers.
man: move description of command line substitution out of ExecStart=
The description was split — part was under ExecStart= and part in "Command lines".
Now the whole generic part is moved to the separate section, and under ExecStart=
only the stuff that is specific to that option is described.
This just moves the text and removes some repetitions.
The function had a provision for NULL input, and would return NULL, but that
looks like an error and all callers pass in a non-NULL arg and report oom on
NULL. So assert that the input is non-NULL.
All callers specifed the output buffer, so we can simplify the logic to only
make an allocation if appropriate and change the return type to 'const *'.
If we created the dir successfully, we let chmod_and_chown_at() do its thing
and shouldn't go into the part where we check if the existing directory has the
right permissions and ownership and possibly adjust them. The code was doing
that, by relying on the fact that chmod_and_chown_at() does not return -EEXIST.
That's probably true, but seems unnecessarilly complicated.
William Roberts [Fri, 24 Feb 2023 20:11:16 +0000 (14:11 -0600)]
tpm2: add support for a trusted SRK
Prevent attackers from spoofing the tpmKey portion of the AuthSession by
adding a trusted key to the LUKS header metadata. Also, use a persistent
object rather than a transient object.
This provides the following benifits:
1. No way to MITM the tpmKey portion of the session, see [1] for
details.
2. Strengthens the encrypted sessions, note that the bindKey could be
dropped now.
3. Speed, once it's created we just use it.
4. Owner Auth is needed to call create primary, so using the SRK
creates a scratch space for normal users.
This is a "first to set" model, in where the first person to set the key
in the LUKS header wins. Thus, setup should be done in a known good
state. If an SRK, which is a primary key at a special persistent
address, is found, it will use whatever is there. If not, it creates an
SRK. The SRK follows the convetions used through the tpm2-software
organization code on GitHub [2], however, a split has occured between
Windows and Linux with respect to SRK templates. The Linux SRK is
generated with the unique field size set to 0, in Windows, it properly
sets the size to key size in bytes and the unique data to all 0's of that
size. Note the proper templates for SRKs is covered in spec [3].
However, the most important thing, is that both SRKs are passwordless,
and thus they should be interchangable. If Windows is the first to make
the SRK, systemd will gladly accept it and vice-versa.
1. Without the bindKey being utilized, an attacker was able to intercept
this and fake a key, thus being able to decrypt and encrypt traffic as
needed. Introduction of the bindKey strengthened this, but allows for
the attacker to brute force AES128CFB using pin guesses. Introduction of
the salt increases the difficulty of this attack as well as DA attacks
on the TPM objects itself.
pam_nologin looks for /etc/nologin and /run/nologin.
user-sessions creates (and removes) /run/nologin, but also removes
/etc/nologin. (This behaviour is unchanged since the introduction
of the binary in e92787416c691c3f34f47349e5eae3fa68eae856.)
By not removing pam_nologin we fully drop compatibility with PAM < 1.1.
This has the advantage that now /etc/nologin can be used by administrator to
disable user logins, e.g. for extended maintanance. We already specified
PAM >= 1.1.2 as dependency, so this was already covered.
We now modify our cmdline to use '=' for all arguments,
but didn't change early setup check to work with that.
So every daemon-reexec does a full setup, thus breaking
running user sessions.
Daan De Meyer [Thu, 22 Dec 2022 13:59:56 +0000 (14:59 +0100)]
find-esp: Add openat() like helpers that operate on fds
We also rework the internals of find-esp to work on directory file
descriptors instead of absolute paths and do a lot of general cleanups.
By passing the parent directory file descriptor to verify_fsroot_dir()
along with the name of the directory we're operating on, we can get rid
of the fallback that goes via path to open the parent directory if '..'
fails due to permission errors.
Daan De Meyer [Thu, 30 Mar 2023 08:39:53 +0000 (10:39 +0200)]
btrfs-util: Add btrfs_get_block_device_at()
Let's make btrfs_get_block_device_fd() more generic by renaming it
to btrfs_get_block_device_at() so it can operate on only paths, dir_fd
and path, or only on fd by using xopenat().
Let's always operate on paths without resolving the final component.
If the path is a symlink, it could point to a vendor default in /usr,
in which case we definitely do not want to modify the vendor defaults.
To avoid this from happening, we replace the symlink with our own file
instead of modifying the file the symlink points at.
Frantisek Sumsal [Fri, 31 Mar 2023 16:42:38 +0000 (18:42 +0200)]
test: set ReadWritePaths= for test-.services when built w/ coverage
Let's make the dropin, to make the build dir writable for gcov, a bit
more generic, so it can be used by all units starting with prefix test-.
This should help with a bunch of recent reports about missing coverage I
got, as well as with existing test units using DynamicUser=true.
This might feel a bit like a magic trick from behind the curtains, but I
want to touch the actual tests as little as possible, since it makes them
unnecessarily messy (see the various workarounds for sanitizers), and
the coverage reports are generated only in a specific CI job anyway.
core: skip deps on oomd if v2 or memory unavailable
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2055664 Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2172146
User report that systemd repeatedly logs about not being able to start oomd
when booted with v1:
Feb 20 16:52:33 systemd[1]: systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer was skipped because of a failed condition check (ConditionControlGroupController=v2).
Feb 20 16:52:34 systemd[1]: systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer was skipped because of a failed condition check (ConditionControlGroupController=v2).
Feb 20 16:52:34 systemd[1]: systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer was skipped because of a failed condition check (ConditionControlGroupController=v2).
Feb 20 16:52:34 systemd[1]: systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer was skipped because of a failed condition check (ConditionControlGroupController=v2).
Feb 20 16:52:34 systemd[1]: systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer was skipped because of a failed condition check (ConditionControlGroupController=v2).
Feb 20 16:52:34 systemd[1]: systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer was skipped because of a failed condition check (ConditionControlGroupController=v2).
Feb 20 16:52:34 systemd[2067491]: Queued start job for default target default.target.
Feb 20 16:52:34 systemd[1]: systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer was skipped because of a failed condition check (ConditionControlGroupController=v2).
Feb 20 16:52:34 systemd[2067491]: Created slice app.slice - User Application Slice.
Feb 20 16:52:34 systemd[1]: systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer was skipped because of a failed condition check (ConditionControlGroupController=v2).
systemd-oomd.service that pulls systemd-oomd.socket in (because it requires
it); systemd-oomd.service itself is pulled by user@.service because
systemd-oomd package installs an override config file that sets
ManagedOOMMemoryPressure=kill.
Add a check to the code that adds the implicit dependency to skip the
dep if we cannot start it. The check is done exactly the same as in oomd
itself.
Thomas Blume [Tue, 14 Mar 2023 14:21:29 +0000 (15:21 +0100)]
test: use setpriv instead of su for user switch from root
systemd-repart needs to find mkfs.ext4 for the test.
This is located in the directory /usr/sbin on openSUSE Tumbleweed.
But since the variable ALWAYS_SET_PATH in /etc/login.defs is set to yes,
su re-initializes the $PATH variable and removes /usr/sbin.
Hence, mkfs.ext4 is not found and the test fails.
Using setpriv instead of su fixes this issue and is more appropriate to
do the switch user task from root.
[zjs: move setpriv to $BASICTOOLS and force-push to retrigger CI]
TODO: drop items regarding swap-for-hibernate-only-use
I doubt we should bother. Swap always makes sense, and having a swap
partition for hibernate only without using it all the time just makes
the system worse overall.
bootctl: clean up handling of files with no version information
get_file_version() would return:
- various negative errors if the file could not be accessed or if it was not a
regular file
- 0/NULL if the file was too small
- -ESRCH or -EINVAL if the file did not contain the marker
- -ENOMEM or permissions errors
- 1 if the marker was found
bootctl status iterates over /EFI/{systemd,BOOT}/*.efi and checks if the files
contain a systemd-boot version tag. Resource or permission errors should be
fatal, but lack of version information should be silently ignored.
OTOH, when updating or installing bootloader files, the version is expected
to be present.
get_file_version() is changed to return -ESRCH if the version is unavailable,
and other errnos for permission or resource errors.
The logging is reworked to always display an error if encountered, but also
to log the status at debug level what the result of the version inquiry is.
This makes it figure out what is going on:
/efi/EFI/systemd/systemd-bootx64.efi: EFI binary LoaderInfo marker: "systemd-boot 253-6.fc38"
/efi/EFI/BOOT/BOOTfbx64.efi: EFI binary has no LoaderInfo marker.
/efi/EFI/BOOT/BOOTIA32.EFI: EFI binary has no LoaderInfo marker.
/efi/EFI/BOOT/BOOTX64.EFI: EFI binary LoaderInfo marker: "systemd-boot 253-6.fc38"
Frantisek Sumsal [Thu, 30 Mar 2023 17:26:53 +0000 (19:26 +0200)]
coverage: add a wrapper for execveat()
gcov provides wrappers for the exec*() calls but there's none for execveat(),
which means we lose all coverage prior to the call. To mitigate this, let's
add a simple execveat() wrapper in gcov's style[0], which dumps and resets
the coverage data when needed.
This applies only when we're built with -Dfexecve=true.
We have three states:
- ENABLE_COREDUMP and systemd-coredump is installed,
- ENABLE_COREDUMP but systemd-coredump is not installed,
- !ENABLE_COREDUMP.
In the last case we would not do any coredumping-related setup in pid1, which
means that coredumps would go to to the working directory of the process, but
actually limits are set to 0. This is inherited by children of pid1.
As discussed extensively in https://github.com/systemd/systemd/pull/26607, this
default is bad: dumps are written to arbitrary directories and not cleaned up.
Nevertheless, the kernel cannot really fix it. It doesn't know where to write,
and it doesn't know when that place would become available. It is only the
userspace that can tell this to the kernel. So the only sensible change in the
kernel would be to default to '|/bin/false', i.e. do what we do now.
In the middle case, we disabled writing of coredumps via a pattern, but raise
the RLIMIT_CORE. We need to raise the limit because we can't raise it later
after processes have been forked off. This means we behave correctly, but allow
coredumping to be enabled at a later point without a reboot.
This patch makes the last case behave like the middle case. This means that
even if systemd is compiled with systemd-coredump, it still does the usual
setup. If users want to restore the kernel default, they need to provide two
drop-in files:
for sysctl.d, with 'kernel.core_pattern=core'
for systemd.conf, with 'DefaultLimitCORE=0'.
The general idea is that pid1 does the safe thing. A distro may want to use
something different than the systemd-coredump machinery, and then that would
could packaged together with the drop-ins to change the configuration.