So far, when outputting information about copy progress we'd suppress the
digit after the dot if it was zero. That makes the progress bar a bit
"jumpy", because sometimes there are two more character cells used than
other times. Let's just always output one digit after the dot here, to
avoid this.
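A minimal sketch of the idea, not the actual implementation: the helper name `format_progress` is hypothetical, and the point is only that a fixed precision keeps the trailing ".0" instead of dropping it.

```python
def format_progress(done: int, total: int) -> str:
    """Return a percentage that always carries exactly one fractional digit."""
    if total <= 0:
        return "n/a"
    percent = 100.0 * done / total
    # ".1f" never drops the digit after the dot, even when it is zero.
    return f"{percent:.1f}%"


if __name__ == "__main__":
    for done in (0, 333, 500, 1000):
        print(format_progress(done, 1000))
```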
core: Use oom_group_kill attribute if OOMPolicy=kill
For managed oom kills, we check the user.oomd_ooms property which
reports how many times systemd-oomd recursively killed the entire
cgroup. For kernel OOM kills, we check the oom_kill property from
memory.events which reports how many processes were killed by the
kernel OOM killer in the corresponding cgroup and its child cgroups.
For units with Delegate=yes, this is problematic, because OOM kills
in child cgroups that were handled by the delegated unit will still
be treated as unit OOM kills by systemd.
Specifically, if a systemd instance is managing the delegated cgroup and
memory.oom.group=1 is set on both the service cgroup and the child
cgroup, and the child cgroup is OOM killed and this is handled by the
systemd instance running inside the delegated unit, then when the unit
exits later it will still be treated as oom-killed, because oom_kill in
memory.events contains the OOM kills that happened in the child cgroup.
To allow addressing this, the oom_group_kill property was added to the
memory.events and memory.events.local files which allows reading how many
times the entire cgroup was oom killed by the kernel if memory.oom.group=1.
If we read this from memory.events.local, we know how many times the unit's
entire cgroup (plus child cgroups) got oom killed by the kernel. This matches
what we report for systemd-oomd managed oom kills and avoids reporting the
unit as oom-killed if a child cgroup was oom killed by the kernel due to
having memory.oom.group=1 set on it.
Since this is only available from kernel 5.12 onwards, we fall back to
reading the oom_kill field from memory.events if the oom_group_kill property
is not available.
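The fallback described above can be sketched as follows. This is an illustration only: the function names are hypothetical, the sample file contents in the usage part are made up, and the real logic lives in C inside the service manager.

```python
def parse_events(text: str) -> dict:
    """Parse the flat "key value" lines of a cgroup v2 memory.events file."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            events[parts[0]] = int(parts[1])
    return events


def kernel_oom_kills(events_local: str, events: str) -> int:
    """Prefer oom_group_kill from memory.events.local, else fall back to
    oom_kill from memory.events (older kernels lack the former field)."""
    local = parse_events(events_local)
    if "oom_group_kill" in local:
        return local["oom_group_kill"]
    return parse_events(events).get("oom_kill", 0)


if __name__ == "__main__":
    # Made-up contents: a child cgroup was killed, the unit's own group was not.
    local = "low 0\nhigh 0\nmax 3\noom 0\noom_kill 0\noom_group_kill 0\n"
    hierarchical = "low 0\nhigh 0\nmax 3\noom 1\noom_kill 4\noom_group_kill 1\n"
    print(kernel_oom_kills(local, hierarchical))
```

With the new field present the unit is not reported as OOM killed in this scenario, whereas the oom_kill fallback would count the kills from the child cgroup.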
So, arch-chroot currently uses a rather cursed setup:
it sets up a PID namespace, but mounts /proc/ from the outside
into the chroot tree, and then calls chroot(2), essentially
making it somewhere between chroot(8) and a full-blown
container. Hence, the PID dirs in /proc/ reveal the outer world.
The offending commit switched chroot detection to compare
/proc/1/root and /proc/OUR_PID/root, exhibiting the faulty behavior
where the mentioned environment now gets deemed to be non-chroot.
Now, this is very much an issue in arch-chroot. However,
if /proc/ is to be properly associated with the pidns,
then we'd treat it as a container and no longer a chroot.
Also, the previous logic feels more readable and reported
errors in proc_mounted() more honestly. Hence I opted
for reverting the change here. Still note that the culprit
(once again :/) lies in arch-chroot's pidns implementation,
not systemd.
firewall-util: remove iptables/libiptc backend support (#38976)
This removes iptables/libiptc backend support in firewall-util, as
already announced by 5c68c51045c27d77b7afc211df7304a958d8cf24.
This also drops the meaningless `FirewallContext` wrapper.
core: if we cannot decode a TPM credential skip over it for ImportCredential=
Let's skip over credentials we cannot decode when they are found with
ImportCredential=. When installing an OS on some disk and using that
disk on a different machine than assumed we'll otherwise end up with a
broken boot, because the credentials cannot be decoded when starting
systemd-firstboot. Let's handle this somewhat gracefully.
This leaves handling for LoadCredential=/SetCredential= as it is (i.e.
failure to decrypt results in service failure), because it is a lot more
explicit and focussed as opposed to ImportCredential= which looks
everywhere, uses globs and so on and is hence very vague and unfocussed.
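The difference between the two policies can be illustrated with a toy model. Everything here is hypothetical (the `DecodeError` type, the `decode` stand-in and both function names); it only shows "skip on import, fail on explicit load", not how the service manager implements it.

```python
class DecodeError(Exception):
    """Stand-in for a credential that cannot be decrypted on this machine."""


def decode(blob: bytes) -> bytes:
    # Toy decoder: pretend only blobs with this prefix belong to our TPM.
    if not blob.startswith(b"ok:"):
        raise DecodeError("credential was sealed for a different machine")
    return blob[3:]


def import_credentials(found: dict) -> dict:
    """Glob-style import: skip anything we cannot decode, keep the rest."""
    imported = {}
    for name, blob in found.items():
        try:
            imported[name] = decode(blob)
        except DecodeError:
            continue  # vague, broad match: tolerate undecodable entries
    return imported


def load_credential(blob: bytes) -> bytes:
    """Explicit load: a decode failure propagates and fails the caller."""
    return decode(blob)


if __name__ == "__main__":
    found = {"firstboot.locale": b"ok:C.UTF-8", "firstboot.keymap": b"bad"}
    print(import_credentials(found))
```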
creds-util: tweak error code generation in decrypt_credential_and_warn() a bit, and add a comment listing it
Let's make some specific conditions more recognizable via error codes of
their own, and in particular remove confusion between EREMOTE as
returned by tpm2_unseal() and by us.
Nick Rosbrook [Thu, 18 Sep 2025 13:16:02 +0000 (09:16 -0400)]
basic: validate timezones in get_timezones()
Depending on the packaging of tzdata, /usr/share/zoneinfo/tzdata.zi may
reference zones or links that are not actually present on the system.
E.g. on Debian and Ubuntu, there is a tzdata-legacy package that
contains "legacy" zones and links, but they are still referenced in
/usr/share/zoneinfo/tzdata.zi shipped by the main tzdata package.
Right now, get_timezones() does not validate timezones when building the
list, which makes the following possible:
Since mountfsd was added in 702a52f4b5d49cce11e2adbc740deb3b644e2de0 the
caps bounding set line has been commented out. That's an accident. Fix
that. (We need to add a bunch of caps to the list.)
units: explicitly reset TTY before running stuff on console
This adds TTYReset=yes to all units which run directly on the TTY. We
already had this in place for the gettys, but this adds it for the rest
that basically has StandardInput=tty + StandardOutput=tty set.
Originally, for these tools it wasn't necessary to reset the TTY,
because we after all already reset /dev/console very very early on once,
during PID1's early initialization, and hence there's no real reason to
do it again for these early boot services. But that's actually not
right, because since #36666 the TTY we reset from PID 1 is typically
/dev/console but the TTY those services are invoked on is typically the
resolved version of that, i.e. wherever that points. Now you might
think: if one is just an alias to the other, why does it matter to reset
this again? Well, because it's only a half-assed alias, and as it turns
out TIOCSWINSZ is not propagated from one to the other, i.e. the terminal
dimensions we initialize for /dev/console don't propagate to whatever
that points to.
One option to address that would be to immediately propagate this down
ourselves (or to fix the kernel for it), but it felt safer to simply do
the reset again before the use; after all these are one-off services,
and there's no point in optimizing much here. Moreover, it's probably
safer to give the guarantee that when the firstboot stuff (which after
all queries for passwords to set) runs, it definitely has a properly
reset terminal.
It's not well-formed to begin with. And util-linux's mount(8)
is pretty much ubiquitously employed, hence it will be rejected
elsewhere too. Just stop pretending it is valid just because
the glibc parser is sloppy.
Let's synchronize the buffer sizes used when passing around the disk
images, i.e. size both our internal buffers and the pipe buffers the
same (so that we can always write()/read() everything in one go,
except for the noise compression inserts).
Let's also increase the buffer sizes from 16K to 128K, which made a
difference for me, because it reduces the number of syscalls quite a
bit.
This changes the instances of lexical to lexicographic, thus making it easier
to grep for instances of lexicographic order, since there's only one variant of
the word to consider.
Lexicographic is chosen since there are slightly fewer instances of lexical and
lexicographic seems a better fit than lexical after checking a few
dictionaries.
The words lexical, lexicographic, and lexicographical are synonyms in
computing, meaning an alphabetical order. Both the Oxford dictionary and
Merriam-Webster make no distinction between lexicographic and lexicographical,
with only Wiktionary adding a more precise meaning of
"Meeting lexicographical standards or requirements; worthy of being
included in a dictionary." [1]
Since, outside of computing, lexicographic(al) has the more specific meaning
pertaining to lexicography, i.e. the editing or making of dictionaries [2], and
lexical only has this as a secondary meaning after its linguistic meaning [3],
lexicographic fits the meaning of including and ordering entries better.
chase: mask away CHASE_MUST_BE_REGULAR in chase_openat()
We pin the parent directory of the specified directory via CHASE_PARENT,
but if we do that we really should mask off CHASE_MUST_BE_REGULAR,
because a parent dir of course is a dir, nothing else. The
CHASE_MUST_BE_REGULAR after all should apply to the file created in that
dir, not to the parent.
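As a rough illustration of the masking, with made-up bit values (the real CHASE_* constants are defined in the C sources and are not reproduced here):

```python
# Hypothetical bit values, for illustration only.
CHASE_PARENT = 1 << 0
CHASE_MUST_BE_REGULAR = 1 << 1
CHASE_NOFOLLOW = 1 << 2


def flags_for_parent_lookup(flags: int) -> int:
    """Flags used to pin the parent directory: request the parent and drop
    the "must be a regular file" check, which only applies to the final file."""
    return (flags | CHASE_PARENT) & ~CHASE_MUST_BE_REGULAR


if __name__ == "__main__":
    caller_flags = CHASE_MUST_BE_REGULAR | CHASE_NOFOLLOW
    print(bin(flags_for_parent_lookup(caller_flags)))
```

The caller's original flags stay untouched, so the regular-file check can still be applied to the file that is opened or created inside the pinned directory.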
sd-json: make sure JSON_BUILD_STRING_UNDERSCORIFY() maps + to _, too
This is ultimately preparation for making systemd-creds's --with-key=
switch also accessible via Varlink, because it uses "+" inside the
enum name. It makes sense to allow this generally however.
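A sketch of the mapping in Python, assuming the macro previously turned "-" into "_" and now treats "+" the same way; the function name `underscorify` is hypothetical and the C macro may handle further characters not shown here.

```python
# Translation table mapping both "-" and "+" to "_".
_UNDERSCORIFY = str.maketrans("-+", "__")


def underscorify(name: str) -> str:
    """Turn an option value into an identifier-friendly enum name."""
    return name.translate(_UNDERSCORIFY)


if __name__ == "__main__":
    for value in ("host", "host+tpm2", "tpm2-absent"):
        print(value, "->", underscorify(value))
```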
sd-boot: allow configuration of log levels (#38701)
This allows for more liberal usage of logging functionality as messages
will no longer always show up on screen, regardless of urgency. The log
level to use can be configured through an SMBIOS type 11 string
(`io.systemd.boot.loglevel=`) or by using the `log-level` option in
loader.conf. Valid values are debug, info, notice, warning, err, crit,
alert, and emerg. By default, info will be used.
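How such a setting might be interpreted can be sketched like this. The numeric values follow the usual syslog ordering, which is an assumption about sd-boot's internals, as is the fallback to the default for unknown strings; both function names are hypothetical.

```python
# Syslog-style ordering: lower number means more urgent (assumed here).
LOG_LEVELS = {
    "emerg": 0, "alert": 1, "crit": 2, "err": 3,
    "warning": 4, "notice": 5, "info": 6, "debug": 7,
}
DEFAULT_LEVEL = LOG_LEVELS["info"]


def parse_log_level(value: str) -> int:
    """Map a configured level name to its number, defaulting to info."""
    return LOG_LEVELS.get(value.strip(), DEFAULT_LEVEL)


def should_show(message_level: int, max_level: int) -> bool:
    """A message is shown only if it is at least as urgent as max_level."""
    return message_level <= max_level


if __name__ == "__main__":
    max_level = parse_log_level("warning")
    print(should_show(LOG_LEVELS["err"], max_level))
    print(should_show(LOG_LEVELS["info"], max_level))
```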
basic/efivars: read EFI variables using one read(), not two (#38864)
In https://github.com/systemd/systemd/issues/38842 it is reported that
we're again having trouble accessing EFI variables:
```
[ 292.212415] H (udev-worker)[253]: Reading EFI variable /sys/firmware/efi/efivars/LoaderDevicePartUUID-4a67b082-0a4c-41cf-b6c7-440b29bb8c4f.
...
[ 344.397961] H (udev-worker)[253]: Detected slow EFI variable read access on LoaderDevicePartUUID-4a67b082-0a4c-41cf-b6c7-440b29bb8c4f: 52.185510s
```
We don't know what causes the slowdown, but it seems reasonable to avoid
unnecessary read() calls. We would read the 4-byte attr first, and then
the actual value later. But our code always reads the value (and
discards the attr in all cases except one, when _writing_ the variable),
so let's optimize for the case where we read the value and read the
whole contents in one read().
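The single-read approach can be sketched as follows: fetch the whole efivarfs file at once, then split off the leading 4-byte attribute word in memory. The function names and the `max_size` bound are hypothetical; the actual change is in the C helper.

```python
import struct


def split_efivar(buf: bytes):
    """Split a raw efivarfs buffer into (attributes, value).

    The file starts with a 32-bit little-endian attribute mask,
    followed by the variable's payload."""
    if len(buf) < 4:
        raise ValueError("EFI variable buffer shorter than attribute header")
    (attr,) = struct.unpack_from("<I", buf, 0)
    return attr, buf[4:]


def read_efivar(path: str, max_size: int = 64 * 1024):
    """Read attributes and value with one read() on an unbuffered file."""
    with open(path, "rb", buffering=0) as f:
        buf = f.read(max_size)  # one read() call on the raw file object
    return split_efivar(buf)


if __name__ == "__main__":
    print(split_efivar(b"\x07\x00\x00\x00" + "hi".encode("utf-16-le")))
```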
Tobias Heider [Mon, 25 Aug 2025 14:07:54 +0000 (16:07 +0200)]
stub: fix file path handling for loaded kernel
- Actually pass the new memory file path to parent_loaded_image->FilePath
- Restore old parent_loaded_image if Linux returns
- Pass the same kernel_file_path in load_via_boot_services path
- s/Re-use/Patch/ in the comment explaining what we are doing
Felix Pehla [Sat, 23 Aug 2025 15:27:20 +0000 (17:27 +0200)]
sd-boot: efi-log: use log levels internally
Change log_internal() to receive a log level from which a text color is
derived, rather than the text color directly, and adjust various log_*
macros to use them internally.
Implements the ability to add recovery keys to existing user accounts
via homectl update --recovery-key=yes. Previously, recovery keys could
only be configured during initial user creation, requiring users to
recreate their entire home directory to add recovery keys later.