bootspec: honour profile number when sorting properly
This corrects sorting of menu entries regarding profile numbers:
1. If the profile number is unset, let's treat this identically to profile
0 when ordering entries, because an item with no profile is
conceptually the same as an item with only a profile 0.
2. Let's take the profile number into account also if sort keys are
used. This makes profiles work sensibly in type 1 entries, via
the recently added "profile" stanza.
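A minimal sketch of the two ordering rules above, written in Python rather than the C of the actual implementation. The `Entry` type, its fields and `order_key()` are hypothetical names used only for illustration, and the real comparison looks at more fields than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    # Hypothetical stand-in for a boot menu entry; not the real bootspec struct.
    title: str
    sort_key: Optional[str] = None
    profile: Optional[int] = None  # None means "no profile set"


def order_key(entry: Entry) -> tuple:
    # Rule 1: an unset profile is ordered exactly like profile 0.
    profile = 0 if entry.profile is None else entry.profile
    # Rule 2: the profile number is honoured even when a sort key is present.
    return (entry.sort_key or "", profile)


entries = [
    Entry("fedora (profile 1)", sort_key="fedora", profile=1),
    Entry("fedora", sort_key="fedora"),
    Entry("arch", sort_key="arch", profile=0),
]
ordered = sorted(entries, key=order_key)
```

Here the entry without a profile sorts ahead of its "profile 1" sibling, exactly as an explicit profile 0 would.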
Also change the order of flags to be more logical. First the option
to specify at what fields we look, then the option to specify how we
return their name, then the value, and finally what to do if the value
is missing.
Michael Vogt [Fri, 13 Mar 2026 13:46:37 +0000 (14:46 +0100)]
coccinelle: add checks for pointer access without NULL check
The fix in 8f1751a111 made me wonder if we could automatically detect
pointers that are dereferenced where this might not be safe. Systemd
is already using a lot of `assert(dst)` and this change now forces
us to use them.
So this commit (ab)uses coccinelle to flag any pointer parameter
dereference not preceded by assert(param), ASSERT_PTR(param), or an
explicit NULL check. It adds integration into meson as a new "coccinelle"
test suite (just like clang-tidy) and is run in CI. The check is not
perfect but seems a reasonable heuristic.
For this RFC commit the check is scoped to a subset: it excludes 25 dirs
right now and includes around 100. About 300 warnings are left, busywork
that I am happy to do if there is agreement that it is worth it.
With this in place we would have caught the bug from 8f1751a111 in CI:
```
FAIL: check-pointer-deref.cocci found issues in systemd/src/boot:
diff -u -p systemd/src/boot/measure.c /tmp/nothing/measure.c
--- systemd/src/boot/measure.c
+++ /tmp/nothing/measure.c
@@ -312,7 +312,6 @@ EFI_STATUS tpm_log_tagged_event(
if (err != EFI_SUCCESS)
return err;
- *ret_measured = true;
return EFI_SUCCESS;
}
```
This also adds a new POINTER_MAY_BE_NULL() for the cases when the
called function will do the NULL check (like `iovec_is_set()`).
Enabling locking by default would constitute a major footgun and
compatibility break on upgrades. This functionality is useful, but it
requires the rest of the system to be "ported" to use systemd-imds
first. The user or distro should opt in to "locked" mode only after
doing the integration work.
Luca Boccassi [Thu, 26 Mar 2026 15:36:43 +0000 (15:36 +0000)]
labeler: update to latest commit and limit file-based label to 5 (#41358)
When doing large refactors or large changes the bot spams
labels left and right, making the PR unreadable. Use the
new option to limit the bot to a max of 5 file-based
labels. If more than 5 would be set, all file-based labels
are skipped.
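The all-or-nothing rule described above can be sketched as follows. This is an illustration in Python, not the labeler's actual implementation; the function and constant names are made up for this example.

```python
MAX_FILE_BASED_LABELS = 5  # the limit named in the commit message


def file_based_labels_to_apply(matched, limit=MAX_FILE_BASED_LABELS):
    # If more labels match than the limit allows, apply none of them
    # rather than an arbitrary subset.
    if len(matched) > limit:
        return []
    return list(matched)
```

A PR matching exactly 5 file-based labels still gets all of them; a sixth match drops the whole set.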
optionally run a software TPM at boot as fallback for TPM less machines (#41016)
In various scenarios it's useful to be able to run a software TPM as
fallback on a machine if it doesn't natively have hardware for it, in
order to provide somewhat systematic interfacing for legacy machines.
This adds the infrastructure for it.
Relevant parts:
1. On EFI systems sd-stub will now generate a random secret on first
boot and store it in a persistent NV variable, which is marked
inaccessible to the OS. It then derives a per-OS secret from that which
is passed to the OS via the initrd logic. This is generally useful, but
in particular is intended to secure the software TPM at least a bit: it
provides better security than nothing (i.e. the only protection in place
is that the firmware protections work, but this is also what shim relies
on, hence maybe not too bad), and allows swtpm to encrypt its storage
with something.
2. systemd-tpm2-generator is extended to optionally start an swtpm if no
tpm hw is detected. Because this is of course a major downgrade in
security, this has to be requested explicitly at boot via a kernel
cmdline switch.
3. This optionally mounts the ESP from the initrd. This is general
infrastructure, and has been requested before, but is particularly
interesting in the context of software TPMs: the state needs to be
stored somewhere, and that before the rootfs can be unlocked.
4. This introduces a special new separator measurement for PCRs 0-7 that
isolates all measurements from pre-os/uefi world from those done by the
OS. I added this for three reasons:
- in the swtpm case we'll not have any pre-os/uefi measurements, and we
need to be able to determine cleanly that this is the case. this hence
is supposed to play a similar role as the usual firmware separator
measurement, that however cleanly fixates to the PCRs even the case
where the firmware measurements don't exist at all
- this is a very comprehensive fix for #40567
- not all firmwares generate the firmware separator at all, but it is
essential to seal off firmware variables from OS generated ones. This
can fill the void to some degree.
6. This introduces a new kernel cmdline switch
systemd.tpm2-measured-os=1, which allows force enabling all our
measurement logic, even if UKI TPM measurements are not done. This is
supposed to be used for the swtpm case so that one gets all the
measurements even without having the early boot verified boot chain in
place.
Benefits of all this: systems that care about TPMs have a (lower
security) compat glue in place that allows supporting legacy hw the same
way as modern hw in many ways, so that remote attestation and other uses
can reasonably work with the same codepaths.
Also see: https://github.com/lxc/incus-os/pull/667 regarding prior
similar work.
units: make use of nvpcrs only after the NV anchor completion measurement is done
This makes sure we don't use the "hardware" or "verity" nvpcrs before
the NV anchor measurement is done.
This is mostly to avoid confusing output, and to indirectly ensure the
nvpcr allocation in tpm2-setup is the load bearing one, but it should
not be load bearing for security afaics.
This has been requested previously for PCR 7 (#40567), but let's do that
for all firmware owned PCRs, since some firmwares forget to measure
their own separator. Let's hence measure our own guaranteed one.
creds-util: only lock against public key PCR stuff if we are booted with UEFI supporting TPMs
The UKI public key PCR stuff only works if we get PCR measurements from
the pre-boot environment, hence automatically disable the logic by
default if we don't have that.
pcrlock: deal with firmwares which understand TPM but where no TPM is available
This is a potentially common case in VMs: firmwares might know the
concept of TPMs, but the hardware is not enabled in the specific VM.
Let's handle this case nicely.
So far we always conditioned our TPM magic on the UKI having detected
TPM support in the firmware. This is a bit limiting when we want to
support a software TPM that is not visible to the firmware. Hence let's
split this up, and add a separate control that can be set via the kernel
command line. However, as before, let's by default inherit the firmware
TPM discovery state into it, to retain the current behaviour unless
overridden.
With this in place, boot with "systemd.tpm2_measured_os=1
systemd.tpm2_software_fallback=1" on the kernel cmdline to get the swtpm
fallback and then a measured OS based on it.
tree-wide: relax TPM available checks for many cases
In many cases it's essential to know if the firmware supports a TPM, but
in others we should accept it if the firmware doesn't have TPM support,
in particular if we want to run the OS with a software TPM.
Hence, add tpm2_is_mostly_supported() as function similar to
tpm2_is_fully_supported(), with the only difference that the former
doesn't insist on a firmware supported TPM. Then, change a number of
users over to this (but not all).
tpm2-generator: if requested run things with an swtpm
We want to start the software TPM fallback only if no real hw is
available and if the user opts in to this behaviour. Add a generator
that drives all this, based on kernel command line configuration.
tpm2: add "systemd-tpm2-swtpm" wrapper for "swtpm"
For TPM-less systems it's sometimes valuable to have a fill-in software
TPM running from early boot on, so that TPM-based functionality can
"just work" and rely on TPM semantics, even if it's at a substantially
weaker security level.
This adds a wrapper around swtpm. It's a binary that chainloads swtpm
but does a few preparatory steps and integrates into systemd's logic
otherwise.
All this is then exposed as systemd-tpm2-swtpm.service.
The service is not hooked into much yet, that is added in later commits.
Luca Boccassi [Thu, 26 Mar 2026 15:07:56 +0000 (15:07 +0000)]
labeler: limit file-based label to 5
When doing large refactors or large changes the bot spams
labels left and right, making the PR unreadable. Use the
new option to limit the bot to a max of 5 file-based
labels. If more than 5 would be set, all file-based labels
are skipped.
Mike Yuan [Wed, 25 Mar 2026 15:00:14 +0000 (16:00 +0100)]
creds: if newline is explicitly requested, skip tty check
Before this commit, the > 0 state of arg_newline tristate is
simply ignored.
Yes, this is a minor compat break, but I'd argue the previous
behavior was not useful as "yes" is treated the same as "auto".
An issue also reported that it was quite surprising.
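A simplified Python sketch of the tristate handling after this change. The function name is hypothetical, and the real tool may take further conditions into account in "auto" mode; only the three-way split is taken from the text above.

```python
import io
import sys


def should_print_newline(arg_newline, stream=sys.stdout):
    # arg_newline is a tristate: < 0 means "auto", 0 means "no", > 0 means "yes".
    if arg_newline > 0:
        return True  # explicitly requested: skip the tty check
    if arg_newline == 0:
        return False
    return stream.isatty()  # "auto": only when writing to a terminal


pipe = io.StringIO()  # stands in for output that is not a terminal
explicit = should_print_newline(1, pipe)
automatic = should_print_newline(-1, pipe)
```

With the old behaviour "yes" and "auto" would both have ended up at the tty check; here only "auto" does.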
Michael Vogt [Thu, 26 Mar 2026 12:25:02 +0000 (13:25 +0100)]
core: drop incorrect comment about SD_BUS_VTABLE_CAPABILITY(CAP_SYS_BOOT)
The comment about `SD_BUS_VTABLE_CAPABILITY(CAP_SYS_BOOT)` is
incorrect. To quote @YHNdnzj:
```
nah, the capability-based permission model is a legacy from kdbus. We cannot do it race-freely without it. Please simply drop the comment.
```
The manager_do_set_objective() was doing a bunch of work to
check the `root` path that can already be done via
`json_dispatch_path` so instead of duplicating use the helper.
gpt-auto-generator: generate an initrd ESP mount if it makes sense
We need to store state persistently for the software TPM (i.e. the root
key). But given that TPMs are generally used to unlock the rootfs, this
storage cannot be on the rootfs. Hence let's use the ESP instead, as the
next best thing, that is guaranteed to exist during early boot, given we
just were booted from it.
This defines automatic logic for this, but does not cause the ESP mount
job to be enqueued (since typically we don't actually want that
mounted); enqueuing it is left to the actual services that need it.
Note that the mount here is set up quite differently from the one from
the host: since initrds are short-lived anyway, it seemed pointless to
use autofs. Moreover this uses a fixed place to mount the ESP, inspired
by the /sysroot/ + /sysusr/ mount naming. All that to simplify things a
bit for the consumers (which is mostly swtpm)
Cynthia [Tue, 17 Mar 2026 22:30:31 +0000 (23:30 +0100)]
kernel-install(uki): filter comments from cmdline
This change aligns the behaviour of UKI generation with the behaviour
of BLS. The latter filters out lines starting with a #, allowing users
to add comments and/or temporarily remove some flags from the kernel
command line.
The kernel-install test has been adjusted to use a multiline cmdline
with a comment in it. Without this patch, the test fails.
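A minimal sketch of the filtering described above, in Python. `filter_cmdline()` is a hypothetical helper, and joining the remaining lines with single spaces is an assumption of this example rather than something the commit message states.

```python
def filter_cmdline(text):
    # Drop empty lines and lines starting with '#', then join the
    # remaining lines into a single kernel command line.
    kept = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        kept.append(line.strip())
    return " ".join(kept)


cmdline = filter_cmdline("root=LABEL=root rw\n# quiet splash\nconsole=ttyS0\n")
```

Commenting out `quiet splash` this way temporarily removes those flags without deleting them from the file.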
This adds a new cloud IMDS client, that can cover AWS, Azure, GCP,
Hetzner imds to varying degrees. Each cloud has this and it's very basic
functionality, hence I think it makes sense to add this to systemd.
Since the clouds are all different this tries hard to do the abstract
common logic in code, but encode the endpoint details, and well-known
keys in hwdb, attached to the DMI id device.
Why all this?
* Efficiency: we can schedule this in the initrd, at the earliest points
possible, without unnecessary delays
* Robustness: imds is typically slow and/or heavily rate-limited:
systemd-imdsd as single entrypoint can deal with that, and provide a
reliable, cached interface
* Security: the idea is that systemd-imdsd is the only service being
able to access the IMDS HTTP, and the host carries a blackhole route for
it otherwise. That way sensitive info can be kept away from clients, and
requires polkit auth for access
* Simplicity: extraction of systemd's system credentials from IMDS
userdata happens with systemd's own infra, and for many usecases that
should already be enough.
* Measurements: before accepting the IMDS userdata, it can be measured
into a PCR, as any other configuration input for the system
Daan De Meyer [Thu, 26 Mar 2026 12:33:38 +0000 (12:33 +0000)]
ci: Support multi-line review comments in claude-review
Pass side, start_line, and start_side through to createReviewComment()
when present, enabling multi-line review comments. Update the prompt to
document all positioning fields using JSON Schema and make line required.
They document it here:
https://octokit.github.io/rest.js/v22/#pulls-create-review-comment
but apparently that's out of date and this doesn't work anymore.
Luca Boccassi [Fri, 13 Mar 2026 01:52:12 +0000 (01:52 +0000)]
boot: add checks for invalid splash images in UKI
A malformed BMP with 8-bit depth but a smaller color
map would cause out-of-bounds reads. This is not a real
problem as the image is signed, but better to be safe.
imds: add generator that hooks in IMDS logic on cloud guests
The infrastructure added in the previous commits added support for IMDS
client functionality, but didn't really enable the logic by default
on suitable hosts.
This commit adds a generator that automatically hooks the IMDS
functionality into the boot process if it detects that the system is
running on a compliant cloud system. It enables both the imds daemon and
the client.
This automatically measures the IMDS 'userdata' into PCR 12, i.e. where
we measure the other owner-supplied configuration, such as confexts and
credentials and similar.
(Why 12? It's really about who owns the data and what it is for.
PCRs/NvPCRs are scarce hence there's a strong incentive to not go
overboard with new allocations, and IMDS userdata in purpose and owner
is very very similar to confexts and credentials, hence let's reuse the
PCR for this purpose.)
imds: add "systemd-imds" tool that is a simple client to "systemd-imdsd"
This is a client tool to the systemd-imdsd@.service added in the
previous commit. It's mostly just a 1:1 IPC client via Varlink. It can
be used to query any IMDS key, but its primary use case is to acquire
the "userdata" from IMDS. Moreover, if invoked with the --import switch
it will check if the userdata contains a list of system credentials. If
so, it will import them into the local credstore. If the userdata does
not look like a list of system credentials no operation is executed,
under the assumption the data is intended for cloud-init instead.
It also imports a couple of other fields, if available and recognized,
such as SSH keys and the hostname.
imds: add new systemd-imdsd.service that makes IMDS data accessible locally
This service's job is to talk to a VM associated IMDS service provided
by the local Cloud. It tries to abstract the protocol differences
various IMDS implementations implement, but does *not* really try to
abstract more than a few basic fields of the actual IMDS metadata.
IMDS access is wrapped in a Varlink API that local clients can talk to.
If possible this makes use of the IMDS endpoint information that has
been added to hwdb in the preceding commit. However, endpoint info can
also be provided via kernel command line and credentials. For debugging
purposes we also accept them via environment variables and command line
arguments.
This adds a concept of early-boot networking, just enough to be able to
talk to the IMDS service. It is minimally configurable via a kernel
cmdline option (and a build-time option): the user may choose between
"locked" and "unlocked" mode. In the former mode direct access to IMDS via
HTTPS is blocked via a prohibit route (and thus all IMDS communication
has to be done via systemd-imdsd@.service). In the latter case no such
lockdown takes place, and IMDS may be acquired both via this new service
and directly. The latter is typically a good idea for compatibility with
current systems, the former is preferable for secure installations.
This adds a hardware database that contains information about IMDS
functionality of various clouds, keyed off the SMBIOS identification of
each. Currently this contains information about 6 major clouds, but the
idea is that this grows to include more and more major clouds.
Nothing uses this data yet, that's added in a later commit.
Daan De Meyer [Thu, 26 Mar 2026 08:36:51 +0000 (09:36 +0100)]
ci: Use path instead of file in claude-review prompt as JSON key
In https://github.com/systemd/systemd/pull/40980 claude hallucinated
and used "path" instead of "file" as the JSON key. Since "path" is
arguably more correct than "file" anyway, let's switch to that.
Patrick Wicki [Fri, 20 Mar 2026 14:56:56 +0000 (15:56 +0100)]
tpm2-util: fix PCR bank guessing without EFI
Since 7643e4a89 efi_get_active_pcr_banks() is used to determine the
active PCR banks. Without EFI support, this returns -EOPNOTSUPP. This in
turn leads to cryptenroll and cryptsetup attach failures unless the PCR
bank is explicitly set, i.e.
$ systemd-cryptenroll $LUKS_PART --tpm2-device=auto --tpm2-pcrs='7'
[...]
Could not read pcr values: Operation not supported
But it works fine with --tpm2-pcrs='7:sha256'.
Similarly, unsealing during cryptsetup attach also fails if the bank
needs to be determined:
Failed to unseal secret using TPM2: Operation not supported
Catch the -EOPNOTSUPP and fall back to the guessing strategy.
Signed-off-by: Patrick Wicki <patrick.wicki@subset.ch>
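The fallback pattern can be sketched in Python as below. All function names are hypothetical stand-ins: `fake_efi_query()` simulates a system without EFI support, and `guess_pcr_banks()` is only a placeholder for the guessing strategy, whose real logic is not shown here.

```python
import errno


def fake_efi_query():
    # Stand-in for the EFI-based lookup on a system without EFI support.
    raise OSError(errno.EOPNOTSUPP, "Operation not supported")


def guess_pcr_banks():
    # Placeholder for the guessing strategy; the returned value is an assumption.
    return ["sha256"]


def pick_pcr_banks(query=fake_efi_query, guess=guess_pcr_banks):
    try:
        return query()
    except OSError as e:
        if e.errno != errno.EOPNOTSUPP:
            raise  # any other error is still fatal
        return guess()  # no EFI: fall back to guessing


banks = pick_pcr_banks()
```

Only the "not supported" error triggers the fallback, so genuine failures of the lookup are not masked.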
resolved: move resetting of {etc_hosts|static_records}_last to manager_dispatch_reload_signal()
This addresses
https://github.com/systemd/systemd/pull/41213#pullrequestreview-4002247053
which I somehow missed earlier.
Claude found a real issue for the case of manager_etc_hosts_flush().
We'll do the equivalent change in manager_static_records_flush() too,
even though it's not really necessary there, simply to keep things
nicely mirrored.
Mike Yuan [Wed, 25 Mar 2026 17:30:14 +0000 (18:30 +0100)]
memory-util: avoid passing invalid pointer to memcmp() when length == 16
If length is exactly 16, the loop would finish with length == 0,
but we'd carry on to the memcmp() check, where the 'p + 16' passed
would be invalid memory. memcmp() demands valid pointers even if
size is specified to 0, hence let's catch this ourselves.
random-seed: when we have a reasonable RNG then create random seed file if missing
Previously we'd never write the ESP random seed file (or initialize the
random seed EFI table) if it didn't already exist. Let's adjust this a
bit, and also create it fresh if we have a "good" random source, i.e. if
the EFI table already existed or if the RNG protocol is implemented by
EFI.
This is useful as it increases the chance the random seed table is
valid, and we can use it as source for randomness in later stages.
Michael Vogt [Wed, 25 Mar 2026 10:53:36 +0000 (11:53 +0100)]
varlink: comment that "more" flag IDL comment is API
External tools that use the systemd varlink ecosystem require
to know if a specific varlink method supports/requires the
"more" flag from the IDL. This is tracked upstream in
https://github.com/varlink/varlink.github.io/issues/26
As an intermediate step systemd adds the (very nice) comments
```
# [Requires 'more' flag]
or
# [Supports 'more' flag]
```
to the various methods.
This commit extends the comment around the code that adds the
comment to clarify that this should be considered API and that
the comment should not be changed as external tools (like e.g.
the varlink-http-bridge) rely on it.
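To illustrate why the wording matters, here is a sketch of how an external tool might detect the marker. This is a hypothetical consumer written in Python, not code from systemd or the varlink-http-bridge; it assumes the tool already has the comment lines attached to a method.

```python
import re

# Matches the marker comments quoted above, e.g. "# [Requires 'more' flag]".
MORE_FLAG_RE = re.compile(r"^\s*#\s*\[(Requires|Supports) 'more' flag\]\s*$")


def more_flag_mode(comment_lines):
    # Returns "requires", "supports", or None if no marker is present.
    for line in comment_lines:
        m = MORE_FLAG_RE.match(line)
        if m:
            return m.group(1).lower()
    return None
```

A parser like this breaks silently if the comment text changes, which is the reason for treating it as API.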
For some reason the IMDS PR for the first time triggers an issue with
the DEFER_VOID_CALL() logic relying on assert() and being placed in
macro.h, let's hence move this elsewhere.
I can measure a throughput difference between using buffer sizes
4096 and 8184 on my hardware, so it really seems that this is doing
something beyond buggy firmware.
Original PR https://github.com/systemd/systemd/pull/17635 didn't give any
explanation for the limit of 4096, but that's probably what was supported by
the kernel drivers at the time.
A web search shows that Cisco VIC 15000 supports 16k, so allow up to that.
Ronan Pigott [Wed, 11 Mar 2026 17:52:49 +0000 (10:52 -0700)]
resolved: use the SOA to find chain of trust quicker
sd-resolved does dnssec "backwards" compared to most resolvers.
A typical strategy is to start from the DNS root and gather the
requisite keys on the way down, but sd-resolved requests the final
answer it wants and then goes searching for the requisite keys later.
We don't know in advance under which names we should expect to find
those keys, because we don't know the zone cuts a priori, but we can use
what we have found in prior responses to make an educated guess. This
was more or less the intent of 47690634f157, but it was partially
regressed in d840783db520 while fixing a bug handling totally empty
responses.
Fixes #37472
Ref: 47690634f157 ("resolved: don't request the SOA for every dns label")
Fixes: d840783db520 ("resolved: always progress DS queries")
Frantisek Sumsal [Tue, 24 Mar 2026 13:29:27 +0000 (14:29 +0100)]
homectl: apply all --member-of= groups from a comma-separated list
Commit 0e1ede4b4b6d1ce6b5b6cda5f803e4f1b5aa4a03 introduced a bug where
we'd always fetch the "original" (empty) list of groups when processing
a comma-separated list of groups from the --member-of= option, so only
the last group from the list would get applied. This bug was then later
(in 316e9887f2a48bd1c4efa3e31b4bfbaeb22de3a3) refactored into a separate
function.
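The class of bug described above can be illustrated with a small Python sketch. This is not the homectl code; both functions are hypothetical and only reproduce the symptom (only the last group surviving) and the shape of the fix.

```python
def apply_member_of_buggy(original, arg):
    result = list(original)
    for group in arg.split(","):
        # Bug: each iteration starts again from the original list,
        # so groups added in earlier iterations are thrown away.
        result = list(original) + [group]
    return result


def apply_member_of_fixed(original, arg):
    result = list(original)
    for group in arg.split(","):
        # Fix: extend the list built up so far.
        if group not in result:
            result.append(group)
    return result


buggy = apply_member_of_buggy([], "wheel,audio,video")
fixed = apply_member_of_fixed([], "wheel,audio,video")
```

With an empty original list, the buggy variant ends up with just the last group, while the fixed one keeps all three.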
resolved: resolve insecure answers with unsupported sig algorithms (#40778)
sd-resolved does not support all the permissible DNSSEC signature
algorithms, and some are intentionally unsupported as a matter of
policy. Answers that can only be validated via unsupported algorithms
should be treated as if they were unsigned, per RFC4035 § 5.2.
Previously, sd-resolved tried to properly record insecure answers for
unsupported algorithms, but did not record this status for each of the
auxiliary DNSSEC transactions, so the primary transaction had no way to
know if there was a plausible DNSKEY with an unsupported signature
algorithm in the chain of trust.
This commit adds the insecure DNSKEYs that use unsupported algorithms to
the list of validated keys for each transaction, so that dependent
transactions can learn that a plausible chain of trust exists, even if
no authenticated one does, and report the insecure answer.
shared/verbs: allow multiple verbs to be handled by a single function
With the uintptr_t data parameter, it is actually quite nice to have
VERB(do_impl, "name-a", …)
VERB(do_impl, "name-b", …)
int do_impl(…) { … }
To make this work, the do_impl_data struct needs to have a unique name and
we also need to suppress the warning about the forward declaration for
do_impl being repeated. I think it's fine to suppress the warning, it's
not needed for anything. If somebody declares the function with the same
name by mistake, the implementations are going to conflict too.
vmspawn: add disk type selection for root and extra drives (#41301)
vmspawn previously hardcoded virtio-blk for all drives. This adds
--image-disk-type= to select the root disk type (virtio-blk,
virtio-scsi, or nvme) and allows per-drive overrides via a
colon-separated prefix on --extra-drive=. The format and disk type
prefixes can appear in any order since their value sets don't overlap.
For virtio-scsi, a single shared controller is created with drives
attached as scsi-hd devices. For nvme, each drive gets its own
controller. Both have serial number length limits (30 and 20 characters
respectively), so long filenames are replaced with a truncated SHA-256
hex digest.
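A sketch of the serial number handling, in Python. The limits come from the text above; hashing the file name (rather than, say, the full path) and truncating the digest to exactly the limit are assumptions of this example, and `drive_serial()` is a hypothetical helper.

```python
import hashlib

# Limits as stated above; the dictionary keys are just labels for this sketch.
SERIAL_LIMITS = {"virtio-scsi": 30, "nvme": 20}


def drive_serial(filename, disk_type):
    limit = SERIAL_LIMITS[disk_type]
    if len(filename) <= limit:
        return filename  # short enough to be used as-is
    # Too long: replace with a truncated SHA-256 hex digest.
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    return digest[:limit]


long_name = "a-very-long-disk-image-file-name-for-testing.raw"
nvme_serial = drive_serial(long_name, "nvme")
```

Using a digest instead of plain truncation keeps serials distinct for long file names that share a common prefix.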