Philip Withnall [Fri, 29 May 2026 14:34:56 +0000 (15:34 +0100)]
sysupdate: Add varlink CheckNew() method
This is the first varlink method added to sysupdate. The D-Bus interface
(via sysupdated) will remain for now; the varlink interface will exist
in parallel.
This method can be called via:
```
varlinkctl call ./path/to/systemd-sysupdate \
io.systemd.SysUpdate.CheckNew \
'{"target":{"class":"host"}}'
SYSTEMD_SYSUPDATE_NO_VERIFY=1 \
varlinkctl call ./path/to/systemd-sysupdate \
io.systemd.SysUpdate.CheckNew \
'{"target":{"class":"component","name":"some-component"}}'
```
This includes some changes to run the integration tests again using the
varlink interface rather than running `systemd-sysupdate` directly, to
test the new interface.
This adds the scaffolding for being able to call sysupdate via varlink,
but it doesn’t yet define or implement any methods. Those will come in
following commits.
The existing `systemd-sysupdate.service` and `systemd-sysupdate.timer`
(which periodically ran `systemd-sysupdate update`) have been renamed
to `systemd-sysupdate-update.{service,timer}` to make way for a new
`systemd-sysupdate@.service` and `systemd-sysupdate.socket` file to
handle varlink activation.
Philip Withnall [Thu, 4 Jun 2026 15:47:37 +0000 (16:47 +0100)]
sd-json: Fix validation of optional fields within a mandatory struct
If a varlink method takes a struct/object as a parameter, and it’s
marked as `SD_JSON_MANDATORY`, and it has an optional field inside it
which is *not* marked as `SD_JSON_MANDATORY`, we want to not require
that field to be set.
Previously, due to using the `merged_flags` from both the mandatory
struct and the optional field, `SD_JSON_MANDATORY` was effectively
always set on the optional field even if we didn’t want it. This
resulted in an error being emitted if the mandatory struct was provided
in a varlink call, but without the optional field.
Fix that by validating the field’s presence only against its own flags
and not also the flags of its parent.
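The difference between the two checks can be sketched in C as follows. This is a minimal illustration and not the sd-json code: `FIELD_MANDATORY`, `Field` and both helper functions are hypothetical names standing in for `SD_JSON_MANDATORY` and the dispatch logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dispatch flag such as SD_JSON_MANDATORY. */
enum { FIELD_MANDATORY = 1 << 0 };

typedef struct Field {
        const char *name;
        unsigned flags;   /* flags declared on this field only */
        bool present;     /* whether the caller supplied the field */
} Field;

/* Old behaviour: the parent's flags leak into the presence check. */
static bool field_missing_merged(const Field *f, unsigned parent_flags) {
        assert(f);

        unsigned merged_flags = f->flags | parent_flags;
        return !f->present && (merged_flags & FIELD_MANDATORY);
}

/* Fixed behaviour: only the field's own flags decide whether it is required. */
static bool field_missing_own(const Field *f) {
        assert(f);

        return !f->present && (f->flags & FIELD_MANDATORY);
}
```

With a mandatory parent struct and an absent optional field, the merged check reports an error while the own-flags check does not.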
Philip Withnall [Thu, 4 Jun 2026 15:06:15 +0000 (16:06 +0100)]
sysupdate: Factor some Target handling code out of sysupdated
This will be used in upcoming commits to varlinkify `systemd-sysupdate`;
it will need a way to identify targets over varlink, and the existing
way with a `Target` over D-Bus seems to work quite well.
Philip Withnall [Tue, 2 Jun 2026 11:59:04 +0000 (12:59 +0100)]
sysupdate: Minor fix to a cleanup function on an error path
`process_image()` has historically used `umount_and_freep` to clean up
the mounted directory locally, but callers to it have used
`umount_and_rmdir_and_freep`.
No directory is created after any of the error return paths in
`process_image()`, so it should probably be using
`umount_and_rmdir_and_freep` too.
Philip Withnall [Fri, 29 May 2026 13:40:08 +0000 (14:40 +0100)]
sysupdate: Move global arg_* variables into Context
This is another step towards varlinkifying the program, as it means the
various verb implementations are no longer relying on global state from
the command line.
As part of this, move init of the `Context` struct into a new
`context_from_cmdline()` function.
Additionally pass some context into config parsing `userdata` arguments,
as various config parsers were using `arg_root` via a sneaky `extern`.
Philip Withnall [Fri, 29 May 2026 12:26:47 +0000 (13:26 +0100)]
sysupdate: Change Context to be stack allocated
There’s no need for it to be heap allocated — there’s only ever one
instance of it, and it’s allocated for the lifetime of a `verb_*()`
function.
Simplify things a bit by making it stack allocated. This will also help
with upcoming commits where we introduce derived context structs to help
with varlinkifying sysupdate. By allowing `Context` to be stack
allocated we can include it in the derived context structs.
As part of this, rename `context_make_{offline,online}()` to
`context_load_{offline,online}()` for clarity (since they no longer init
the struct).
Philip Withnall [Fri, 29 May 2026 11:59:54 +0000 (12:59 +0100)]
sysupdate: Factor process_image() into context_make_{offline,online}()
`process_image()` is always called immediately before (almost) every
`context_make_online()` or `context_make_offline()`, and the structures
it allocates have the same lifetime as `Context`, so we might as well
factor them all together to reduce duplication.
This will also simplify the following commit, which changes heap
allocation of `Context`s, and simplify upcoming changes to factor out
`arg_*` handling.
The call in `verb_pending_or_reboot()` is safe because it already
validates that `arg_image` is `NULL`, hence `process_image()` will bail
out early.
Philip Withnall [Tue, 23 Jun 2026 15:23:54 +0000 (16:23 +0100)]
sysupdate: Factor context creation out of installdb_cleanup_component()
This makes it like all the other verbs and therefore easier to refactor.
At the same time, remove the separate `component` argument and instead
use the `component` set on the `Context`. This guards against bugs, as
various parts of the `Context` state depend on the component (for
example, `installdb_fd`) and overriding the component without also
overriding its dependent variables will lead to bugs.
Wang Yu [Fri, 26 Jun 2026 04:07:12 +0000 (12:07 +0800)]
man: fix first argument in Environment= expansion example
The example states that the first /bin/echo invocation (using ${ONE})
receives the argument 'one' (with literal single quotes). However,
Environment=ONE='one' strips the syntactic single quotes during
unquoting — see systemd.syntax(7), "Quotes themselves are removed" —
so ONE holds the value one, and ${ONE} (exact-value substitution,
always a single argument) yields the argument one without quotes.
Fede2782 [Fri, 26 Jun 2026 08:06:26 +0000 (10:06 +0200)]
hwdb: correctly map Bluetooth Key on MSI Modern 15 H AI C1MG laptop
Previously the key was unknown, so add the correct mapping, as this model does not follow the
general case for MSI laptops.
[ 192.562000] atkbd serio0: Unknown key released (translated set 2, code 0xd7 on isa0060/serio0).
[ 192.562011] atkbd serio0: Use 'setkeycodes e057 <keycode>' to make it known.
For now, add it as a definition specific to this model; it can be generalized to other MSI
laptops if the same issue turns out to be present elsewhere.
pcrlock: reject device path node shorter than its header
event_log_record_extract_firmware_description() walks the device path
of a UEFI_IMAGE_LOAD_EVENT taken from the firmware TPM2 measurement log.
The per-node loop checks the remaining bytes against the node and its
declared length, but never that dp->length covers the 4-byte node header
offsetof(packed_EFI_DEVICE_PATH, path).
For a Media/File-Path node with length 3, the file-name extraction
computes dp->length - offsetof(packed_EFI_DEVICE_PATH, path) == 3 - 4,
which wraps to SIZE_MAX. utf16_to_utf8() treats SIZE_MAX as unbounded
and runs char16_strlen() over dp->path, reading past the log buffer; a
length of 0 also leaves dp non-advancing.
efi_get_boot_option() in src/shared/efi-api.c already rejects such nodes
with "if (dpath->length < 4) break;"; do the same here.
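The wrap-around and the guard against it can be sketched as follows. This is an illustration under assumptions, not the pcrlock code: `DevicePathNode` is a simplified, host-endian stand-in for `packed_EFI_DEVICE_PATH`, and `node_payload_size()` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical mirror of a packed EFI device path node. */
typedef struct DevicePathNode {
        uint8_t type;
        uint8_t sub_type;
        uint16_t length;  /* total node length, including the 4-byte header */
        uint8_t path[];
} __attribute__((packed)) DevicePathNode;

/* Returns the payload size in bytes, or -1 if the node is malformed. */
static long node_payload_size(const DevicePathNode *dp, size_t remaining) {
        assert(dp);

        if (remaining < offsetof(DevicePathNode, path))
                return -1; /* not even a full header left in the buffer */
        if (dp->length < offsetof(DevicePathNode, path))
                return -1; /* the subtraction below would wrap around */
        if (dp->length > remaining)
                return -1; /* node claims more bytes than the buffer holds */

        return (long) (dp->length - offsetof(DevicePathNode, path));
}
```

Rejecting `length < 4` also covers the `length == 0` case, so a caller advancing by `dp->length` always makes progress.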
vmspawn: deliver credentials via initrd cpio under SEV-SNP (#42272)
Re-enables `--set-credential=` / `--load-credential=` under
`--coco=sev-snp` by packaging credentials into a cpio appended to the
initrd, mirroring what `systemd-stub` does for ESP-sourced credentials.
The initrd is covered by the launch measurement via `kernel-hashes=on`,
so the credentials are too.
Tested end-to-end on an SNP-capable host: credentials passed via
`--set-credential=` land in `/run/credentials/@encrypted/` inside the
guest.
dongshengyuan [Tue, 16 Jun 2026 01:19:15 +0000 (09:19 +0800)]
nss-resolve: fix blank array checks and improve NSS status codes
Use sd_json_variant_is_blank_array() instead of is_blank_object() for
p.addresses and p.names, which are declared as JSON arrays. The wrong
predicate never triggered, allowing empty arrays to bypass the guards:
for p.names this caused a size_t underflow leading to an out-of-bounds
heap write; for p.addresses it returned success with no addresses.
Add explicit n_addresses == 0 guards after the family-filter loops so
entries with unsupported families also return NOTFOUND rather than
crashing on a NULL dereference.
In gethostbyname3_r (family-specific entry point), return NO_DATA for
all zero-address results — both blank array and all-filtered — since
both mean "name resolved, no record of the requested family". Keep
HOST_NOT_FOUND in gethostbyname4_r (both-families) where a blank or
all-unsupported result genuinely means the name was not found.
Signed-off-by: dongshengyuan <dongshengyuan@uniontech.com>
Co-developed-by: Claude Opus 4.8 <noreply@anthropic.com>
Yu Watanabe [Thu, 25 Jun 2026 17:49:59 +0000 (02:49 +0900)]
journal: Prevent total log loss on unclean shutdown at high write rates (#42639)
In Meta production we have been considering using journald more widely
for some time. One of the blockers to doing that which I have noticed is
that often journald seems to have vastly less data after lockups/power
failures compared to plain files, which is not great when debugging
outages.
At low write rates this tends to be hard to reproduce, but when
writing thousands of messages a second, an unclean shutdown can leave
behind an active journal file with a header that records an arena
larger than the data that actually reached disk. journalctl then
discards the entire file(!), completely ignoring that there is a huge
amount of data which is actually perfectly readable.
The reason for that is that the journal header is updated on every
append, while the file size and newly written arena contents are only
made durable on the filesystem's own schedule. After a crash, the header
can therefore describe writes which were logically completed by journald
but whose backing data or file metadata never reached disk.
Take the following example of how this can happen at high log rates:
1. journald appends objects into an mmap()ed arena, periodically growing
the file with fallocate() in FILE_SIZE_INCREASE (8M) steps and advancing
the header's arena_size tail pointers as it goes along.
2. The header is dirtied on every append, and its arena_size is advanced
at each fallocate(). It is, from the kernel's perspective, an ordinary
data page and is only made durable by the kernel's periodic page cache
writeback on its own schedule. The file's length, by contrast, is
metadata, made durable only when the filesystem commits a transaction
(or on an fsync(), which journald does not issue between sync
intervals).
3. journald marks journals NOCOW, so the header's data block is
overwritten in place and is decoupled from the size metadata. Nothing
orders the two with respect to each other. Writeback therefore can
routinely persist a header whose arena_size has run ahead of the file
length recorded on disk.
4. Power is lost. On the next boot the persisted header reflects an
arena_size and tail pointers which have been advanced for appends.
However their payload and the file metadata were never committed, so
header_size + arena_size now points well past the end of the file as it
exists on disk.
5. journal_file_verify_header() then rejects this with -ENODATA:
That is correct when opening for writing, because we must not append to
a file whose recorded state we cannot trust, and the caller must rotate
it away. But the same check also runs on read only opens, where it is
actively harmful. In the case of journalctl, the entire file is skipped,
even though the data hash table, the field hash table, and the head of
the array all are present and fully intact, and the great majority of
entries are physically present. In fact, only a very small part of the
most recently written tail is missing, but everything before is
readable. This results in mistakenly rejecting the entire file as
corrupt.
This happens extremely frequently on machines with high write rates
during power cuts or lockups. In testing writing ~7500 msg/s through
journald and then cutting power, I reproduced it in ten out of ten
attempts across different machines.
In each case, the header was left claiming ~296M of arena while only
~192-208M had reached disk. In this case, journalctl reports that it has
recovered 0 of ~335000 messages. Whether a given crash trips the
condition depends on where it falls relative to the header's writeback,
but when it does, the loss today is total. After this patch the vast
majority of messages can be retrieved.
Let's fix this by keeping the rejection for writing, but for read-only
opens, let's just clamp the arena to the real file size and skip the
consistency checks on the now unreliable tail pointers. The reader will
walk the entry array chain from its intact head and stop at the
truncation point by the bounds check that already exists, so there's no
need to do any more than that there.
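The policy described above (reject for writing, clamp for reading) can be sketched as follows. This is a simplified model, not the code in journal_file_verify_header(): `usable_arena_size()` and its parameters are hypothetical, and returning -ENODATA for a file shorter than its header is an assumption made for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: decides how much of the arena may be used.
 * Returns 0 and stores the usable size in *ret, or -ENODATA if the file must be rejected. */
static int usable_arena_size(
                uint64_t header_size,
                uint64_t arena_size,
                uint64_t file_size,
                bool writable,
                uint64_t *ret) {

        assert(ret);

        if (file_size < header_size)
                return -ENODATA; /* not even the header is complete */

        if (arena_size > file_size - header_size) {
                if (writable)
                        return -ENODATA; /* never append to a file we cannot trust */

                /* Read-only: clamp the arena to what actually exists on disk. */
                arena_size = file_size - header_size;
        }

        *ret = arena_size;
        return 0;
}
```

Comparing `arena_size` against `file_size - header_size` rather than adding `header_size + arena_size` keeps the check free of integer overflow.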
Shihao Ren [Thu, 25 Jun 2026 07:15:29 +0000 (15:15 +0800)]
analyze: don't treat user-scope services as running as root in `security`
`systemd-analyze security --user foo.service` currently flags units
without `User=` as running as root. For user manager instances this is
impossible: per systemd.exec(5), switching user identity is not
permitted there, so the service always runs under the calling user's
UID.
Track the runtime scope inside SecurityInfo and short-circuit
security_info_runs_privileged() and assess_user() for
RUNTIME_SCOPE_USER, so that User=/DynamicUser=, SupplementaryGroups=
and RemoveIPC= are no longer marked as if the service ran as root in
both the bus-backed and --offline paths.
unit-name: introduce "strict" mode for unit name mangling (#42638)
unit_name_mangle_with_suffix() is quite benevolent by default and allows
the unit to "transition" into a different unit type than what's
requested via its suffix argument. For example, calling
unit_name_mangle_with_suffix() with "/foo/bar" as a unit name and
".service" as a suffix would give you "foo-bar.mount", without any
warning or error.
This could then lead to quite confusing errors in certain situations:
```
~# systemd-run --remain-after-exit --unit /foo/bar true
Failed to start transient service unit: Cannot set property RemainAfterExit, or unknown property.
```
Given we can't change the default behaviour of
unit_name_mangle_with_suffix() as some parts of systemd already depend
on its "benevolence" (like systemctl), let's introduce a new flag -
UNIT_NAME_MANGLE_STRICT - that checks if the mangled/resolved unit
name's suffix matches the requested one and errors out if not.
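The core of such a strict check can be sketched as a plain suffix comparison. This is a minimal illustration, not the implementation behind UNIT_NAME_MANGLE_STRICT: `unit_name_check_suffix()` is a hypothetical helper that only looks at the already mangled name.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical strict check: the mangled name must end in the requested suffix. */
static int unit_name_check_suffix(const char *mangled, const char *suffix) {
        assert(mangled);
        assert(suffix);

        size_t n = strlen(mangled), m = strlen(suffix);

        if (n < m || strcmp(mangled + n - m, suffix) != 0)
                return -EINVAL; /* e.g. "foo-bar.mount" when ".service" was requested */

        return 0;
}
```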
With the flag used throughout systemd-run's code, the error in the above
case is now a bit more clear:
```
~# build/systemd-run --remain-after-exit --unit /foo/bar true
Path "/foo/bar" resolves to unit type "mount", but "service" is expected as unit.
Failed to mangle unit name: Invalid argument
```
Resolves: #39996
dongshengyuan [Thu, 25 Jun 2026 03:30:25 +0000 (11:30 +0800)]
homed: fix home_unlocking_finish reporting success as failure
In home_unlocking_finish(), the success path calls operation_result_unref()
with the local variable r and the uninitialized error object. If either
user_record_good_authentication() or home_save_record() fails (both are
logged as "ignoring"), r is left negative and the D-Bus caller receives
an error reply despite the home having been unlocked successfully.
This causes PAM to reject the session even though the home directory is
mounted and accessible.
Fix by passing 0 and NULL — consistent with every other success path in
the file (home_locking_finish(), home_activation_finish(), etc.).
Chris Down [Thu, 18 Jun 2026 07:07:04 +0000 (16:07 +0900)]
journal: Tolerate lost tail hash chain nodes
The data and field hash table chains have the same problem the previous
commit fixed for entry array chains. New data and field objects are
linked at the tail of their hash bucket by patching the previous tail
object's next_hash_offset in place, so after a crash a persisted
predecessor (or the bucket head) can point at an object whose body never
reached disk.
journal_file_find_data_object_with_hash() and
journal_file_find_field_object_with_hash() walk those chains while
resolving matches, and on -EADDRNOTAVAIL/-EBADMSG from
journal_file_move_to_object() they simply return the error directly.
That propagates up to real_journal_next(), which discards the whole file
from the query.
Give those two lookups the same tolerance: on a read-only file, treat an
unreadable chain node as the end of the bucket chain.
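The tolerance can be sketched on a simplified in-memory model. The names below are hypothetical and the model is not the journal file format: nodes are array elements, index 0 means "no node", and any index at or beyond `n_on_disk` stands for an object whose body never reached disk. Returning the error for writable files is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Node {
        int key;
        size_t next; /* index of the next node in the bucket, 0 terminates the chain */
} Node;

/* Returns the index of the matching node, 0 if there is none,
 * or -EADDRNOTAVAIL if the chain is broken and the file is writable. */
static long chain_find(const Node *nodes, size_t n_on_disk, size_t head, int key, bool read_only) {
        assert(nodes);

        for (size_t i = head; i != 0; i = nodes[i].next) {
                if (i >= n_on_disk) {
                        if (read_only)
                                return 0;      /* lost tail: treat as the end of the chain */
                        return -EADDRNOTAVAIL; /* otherwise report the broken chain */
                }

                if (nodes[i].key == key)
                        return (long) i;
        }

        return 0;
}
```

A read-only lookup therefore still finds every object that precedes the lost tail and reports "not found" for anything beyond it.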
Chris Down [Wed, 17 Jun 2026 11:45:11 +0000 (19:45 +0800)]
journal: Recover filtered journal queries after crash truncated writes
generic_array_get(), which is used for the unfiltered iteration path in
the previous commit, treats a chain pointer that resolves past the end of
the file as the end of the chain. In that case, moving to the missing
array object returns -EADDRNOTAVAIL (or -EBADMSG), and it either stops
(going downwards) or steps back to the previous array (going upwards).
However, generic_array_bisect(), which is used for filtered or seeking
reads, does not. On -EADDRNOTAVAIL/-EBADMSG from
journal_file_move_to_object(), it instead returns the error directly to
the caller, which propagates out through
sd_journal_next()/sd_journal_previous() and aborts the query.
The per-data entry array chain has the same issue as the global one,
since n_entries and entry_array_offset are (re)written in place as
entries are linked, and thus after a crash they can reference more
arrays than actually reached the disk. That is to say in practical
terms, a journal recovered for reading by the previous commit could
nevertheless still drop matching entries from `journalctl FIELD=value`,
and a seqnum or time seek into the lost region could fail outright.
Let's give generic_array_bisect() the same tolerance generic_array_get()
already has. That is, when moving to an entry array object fails, treat
the chain as ending at the previous array. This means that the result
matches what generic_array_get() would yield for the same file.
Luca Boccassi [Wed, 24 Jun 2026 12:41:06 +0000 (13:41 +0100)]
journal-remote: fix hostname double-free on request_meta() error paths
request_handler() owns the hostname var and passes it by value to
request_meta(), which hands it to source_new(), which stores it in
source->importer.name without copying. If build_accept_encoding()
then fails, the hostname var is freed, and then the caller's
_cleanup_free_ frees it a second time.
* f7762b7143 sandbox: Preserve net caps across user namespace before unsharing net
* 582eadee34 Revert "Put build history into the output directory"
* 5ef262bc53 action: don't fail if apk cannot be downloaded
* bdd341ff9b Lock the package cache during package manager invocations
* da49fe976c Put build history into the output directory
* 1c392f1918 tests: Use unique machine names
* e4f4026e30 tests: Reduce VM RAM size
* de41a5e03e Don't leak gpg-agent when signing with gpg
* 1bc5d61e1d ci: Pin openSUSE to second-to-last Tumbleweed snapshot
* c4d565a009 test: Use the main build's snapshot for extension builds
* 718b06c866 tests: ignore masked units in check-and-shutdown
* 0dc5ecbc02 ci: enable postmarketOS in integration testing
* d4c6761ad3 action: install apk to /usr/bin
* 9980f31309 mkosi-vm: add systemd-efistub to postmarketOS config
* 5640ace38f mkosi.conf: add grub to postmarketOS
* 6741b440c0 mkosi-initrd: add sulogin, device-mapper to postmarketOS initrd
* c3575c035c mkosi-tools: add missing packages to postmarketOS tools tree
* 0774bc2498 mkosi-tools: add apk-tools to tools trees for Arch and OpenSuSE
| * bb87e48401 curl: Retry on failures
|/
* 41fea1dd8d dnf: Work around librepo rejecting valid repomd signatures cross-distro
* 647e3b610b dnf: Proper repository metadata signature requirement
* 46d907cce2 dnf: Don't skip unavailable repositories during makecache
* a91e89c3b7 run_locale_gen: noop if output_format is confext
* 30329e401b tests: Make integration tests runnable locally
* be549f04db config: Don't propagate $MKOSI_DNF when using a tools tree
* 42ed648981 build(deps): bump actions/upload-artifact from 7.0.0 to 7.0.1
* fd5eedd62b build(deps): bump aws-actions/configure-aws-credentials
* 86733c703d tree: check for root when copying SELinux attributes as well
* de2256f8fe Skip security.ima xattrs when copying tree as non-root
| * 08ebf6d678 vmspawn: Exclude secure-boot unless requested
|/
* 1d3c51e36d obs workflow: do not build aarch64/i586
Luca Boccassi [Fri, 8 May 2026 13:21:33 +0000 (14:21 +0100)]
homectl: retry DeactivateHome on transient busy errors
When 'homectl deactivate' is called immediately after a preceding
operation, the umount inside systemd-homework can fail with EBUSY
because something briefly holds a reference to the home mount (e.g. a
concurrent inspect). systemd-homed already handles this gracefully
by moving the home into the 'lingering' state and retrying deactivation
after 15 seconds, but the bus reply for the original DeactivateHome
call returns the org.freedesktop.home1.HomeBusy error immediately,
which makes TEST-46-HOMED flaky.
Fix homectl to follow homed and retry for up to 30 seconds on HomeBusy
and add a test case trying to make the issue more reproducible.
shared/tpm2-util: use a define instead of a const static variable
Let's do the standard thing. The 'static const' variable requires space
and results in less efficient code (moving from memory instead of a const insertion).
This doesn't matter much, but let's follow the standard pattern.
Paul Meyer [Tue, 23 Jun 2026 14:07:34 +0000 (16:07 +0200)]
tpm2: re-manufacture software TPM when state dir is incomplete
setup_swtpm() decided whether a software TPM had already been
manufactured by checking whether the state directory was empty. But
manufacture_swtpm() writes swtpm's config files before forking
swtpm_setup, so an interrupted manufacture leaves the directory
non-empty yet without a usable TPM. The next boot then mistook it for a
complete TPM and started swtpm against a broken state directory.
Keying off a swtpm state file like tpm2-00.permall is no better, as
swtpm_setup gives no guarantee any single one is written atomically or
last. Instead, have manufacture_swtpm() write a marker (.manufactured)
as its very last step, once swtpm_setup has exited successfully, and
gate on it: re-manufacture when it is missing in the initrd, and refuse
rather than start a broken TPM outside it.
Paul Meyer [Wed, 24 Jun 2026 06:58:10 +0000 (08:58 +0200)]
tpm2: write swtpm config files atomically via the state directory fd
Open the swtpm state directory once and write the three config files
relative to that fd with WRITE_STRING_FILE_ATOMIC, rather than by path
with a plain truncating write. Writing atomically ensures a crash or a
concurrent reader never observes a half-written config file, and
operating through a single directory fd lets later steps reuse it.
Luca Boccassi [Thu, 25 Jun 2026 11:16:52 +0000 (12:16 +0100)]
Translations update from Fedora Weblate (#42749)
Translations update from [Fedora
Weblate](https://translate.fedoraproject.org) for
[systemd/main](https://translate.fedoraproject.org/projects/systemd/main/).
Frantisek Sumsal [Wed, 17 Jun 2026 12:09:43 +0000 (14:09 +0200)]
unit-name: introduce "strict" mode for unit name mangling
unit_name_mangle_with_suffix() is quite benevolent by default and allows
the unit to "transition" into a different unit type than what's
requested via its suffix argument. For example, calling
unit_name_mangle_with_suffix() with "/foo/bar" as a unit name and
".service" as a suffix would give you "foo-bar.mount", without any
warning or error.
This could then lead to quite confusing errors in certain situations:
~# systemd-run --remain-after-exit --unit /foo/bar true
Failed to start transient service unit: Cannot set property RemainAfterExit, or unknown property.
Given we can't change the default behaviour of
unit_name_mangle_with_suffix() as some parts of systemd already depend
on its "benevolence" (like systemctl), let's introduce a new flag -
UNIT_NAME_MANGLE_STRICT - that checks if the mangled/resolved unit
name's suffix matches the requested one and errors out if not.
With the flag used throughout systemd-run's code, the error in the above
case is now a bit more clear:
~# build/systemd-run --remain-after-exit --unit /foo/bar true
Path "/foo/bar" resolves to unit type "mount", but "service" is expected as unit.
Failed to mangle unit name: Invalid argument
Currently translated at 100.0% (286 of 286 strings)
Co-authored-by: Fco. Javier F. Serrador <fserrador@gmail.com>
Translate-URL: https://translate.fedoraproject.org/projects/systemd/main/es/
Translation: systemd/main
The new polkit will return a new detail regarding a successful
authentication: the actual result type, which we can use to
see whether the user authenticated as admin. This can be used
to grant additional privileges.
Paul Meyer [Wed, 24 Jun 2026 10:43:40 +0000 (12:43 +0200)]
units: harden systemd-report-sign-plain@.service
Apply sandboxing. The plain backend needs a writable StateDirectory and
/dev/urandom for key generation. The service must stay root (the
private key is root-only), but everything else is locked down.
journald: bound field length in extra-fields reader
client_context_read_extra_fields() reads a 64-bit field length v from
the per-unit log-extra-fields file. n = sizeof(uint64_t) + v overflows
when v is near UINT64_MAX, so the "left < n" check is bypassed and the
following memchr() scans v bytes past the buffer. Bound v against the
remaining bytes instead, which cannot overflow.
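The overflow and the safe form of the check can be sketched as follows. Both functions are hypothetical and only model the arithmetic, with `v` being the field length read from the file and `left` the number of bytes remaining in the buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Overflow-prone check: sizeof(uint64_t) + v wraps around for huge v. */
static bool field_fits_naive(uint64_t v, uint64_t left) {
        uint64_t n = sizeof(uint64_t) + v;
        return left >= n;
}

/* Overflow-safe check: compare v against what remains after the length prefix. */
static bool field_fits(uint64_t v, uint64_t left) {
        if (left < sizeof(uint64_t))
                return false;

        return v <= left - sizeof(uint64_t);
}

/* Small self-check showing where the two differ. */
static void field_fits_selfcheck(void) {
        assert(field_fits_naive(UINT64_MAX - 3, 16)); /* wrongly accepted */
        assert(!field_fits(UINT64_MAX - 3, 16));      /* correctly rejected */
}
```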
Luca Boccassi [Wed, 24 Jun 2026 12:56:37 +0000 (13:56 +0100)]
uid-range: fix out-of-bounds write in uid_range_partition()
uid_range_partition() filled the grown entries[] buffer backwards in
place. The backward-fill invariant (the write cursor stays above the
read index) only holds when every source entry contributes at least
one partition; an entry with nr < size contributes zero, so the cursor
stalls while the read index keeps descending. A later multi-part
entry's writes then overwrite the still-live zero-part slot, the
corrupted slot is re-read as a one-part entry, and the next
range->entries[--t] underflows.
Add a forward compaction first pass that drops the zero-part entries
before the backward fill.
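The two-pass approach can be sketched as follows. This is a simplified model and not the uid-range code: `Entry` and `entries_partition()` are hypothetical, and the sketch assumes that each entry is split into chunks of exactly `size` IDs with any remainder dropped.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct Entry {
        uint32_t start;
        uint32_t nr;
} Entry;

/* Splits every entry into chunks of exactly `size` IDs, in place. */
static int entries_partition(Entry **entries, size_t *n_entries, uint32_t size) {
        size_t total = 0, n = 0;

        assert(entries);
        assert(n_entries);
        assert(*entries || *n_entries == 0);
        assert(size > 0);

        Entry *e = *entries;

        for (size_t i = 0; i < *n_entries; i++)
                total += e[i].nr / size;

        if (total > *n_entries) { /* only ever grow, so no live entry is cut off */
                e = realloc(*entries, total * sizeof(Entry));
                if (!e)
                        return -ENOMEM;
                *entries = e;
        }

        /* Pass 1: forward compaction drops entries that yield zero chunks. */
        for (size_t i = 0; i < *n_entries; i++)
                if (e[i].nr >= size)
                        e[n++] = e[i];

        /* Pass 2: backward fill. Every remaining entry yields at least one chunk,
         * so the write cursor t never drops below the read index i. */
        size_t t = total;
        for (size_t i = n; i-- > 0;) {
                Entry src = e[i];

                for (uint32_t k = src.nr / size; k-- > 0;)
                        e[--t] = (Entry) { .start = src.start + k * size, .nr = size };
        }
        assert(t == 0);

        *n_entries = total;
        return 0;
}
```

Without the compaction pass, a zero-chunk entry in the middle would let the write cursor catch up with the read index, which is the failure mode described above.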
Luca Boccassi [Wed, 24 Jun 2026 12:20:56 +0000 (13:20 +0100)]
dhcp6: reject IA_PD_PREFIX with invalid prefix length
dhcp6_option_parse_ia_pdprefix() validates the lifetimes but never the
prefixlen byte, so a delegated prefix with prefixlen == 0 or > 128 is
stored in the lease and handed over.
RFC 8415 defines the prefix length as 1 to 128, and the send-side
option_append_pd_prefix() already rejects 0, so reject the out-of-range
values on the receive path too.
Luca Boccassi [Wed, 24 Jun 2026 12:53:07 +0000 (13:53 +0100)]
sd-lldp-rx: keep object ref around event callbacks
If the user callback set via sd_lldp_rx_set_callback() drops the last
reference to the sd_lldp_rx object, the object is freed while the event
handler is still using it. Take a ref to keep the objects alive as long
as they are needed.
dongshengyuan [Thu, 25 Jun 2026 08:40:28 +0000 (16:40 +0800)]
systemctl: fix continue placement in clean-or-freeze error handling
When sd_bus_call() fails, the continue was inside the
'if (ret == EXIT_SUCCESS)' guard, so only the first failure skipped
adding the unit to the job waiter. On the second and subsequent
failures, the unit was still passed to bus_wait_for_units_add_unit()
despite no job being started, causing bus_wait_for_units_run() to
hang indefinitely.
Move continue outside the guard so any failure skips the waiter
registration. The guard still prevents ret from being overwritten by
a later error code.
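The corrected control flow can be sketched as follows. The function is a hypothetical model, not the systemctl code: `results[i]` stands for the outcome of the bus call for unit i (negative on failure) and `waited[i]` records whether the unit was registered with the waiter.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the first error seen, or 0 if every call succeeded. */
static int process_units(const int *results, int *waited, size_t n) {
        int ret = 0;

        assert(results);
        assert(waited);

        for (size_t i = 0; i < n; i++) {
                waited[i] = 0;

                if (results[i] < 0) {
                        if (ret == 0)
                                ret = results[i]; /* remember only the first error */
                        continue;                 /* but skip the waiter on every failure */
                }

                waited[i] = 1; /* a job was started, so wait for this unit */
        }

        return ret;
}
```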
Michael Vogt [Tue, 23 Jun 2026 15:34:18 +0000 (17:34 +0200)]
basic: add assert() when doing pointer deref
Lennart reminded me in [1] that we need to add assert() in functions
that do pointer access. For the simple `*p` pointer dereferences
we even have an automatic coccinelle script that ensures that as
part of the automatic code checks.
However for deref in the `p->` style this is not supported right
now and adding it to coccinelle is hard because it's too slow for
this kind of check. So I created a (slightly messy) tree-sitter
python script to see how many asserts we are currently missing.
This commit is the result of running it over the `src/basic`
dir and fixing the flagged issues. I plan to tidy it up and
add it to the checks too but this is orthogonal to this commit.
dongshengyuan [Thu, 25 Jun 2026 08:40:05 +0000 (16:40 +0800)]
core: fix fd leak in exec_shared_runtime_deserialize_one
The userns/netns/ipcns fdpairs were declared as plain int arrays without
_cleanup_close_pair_. If exec_shared_runtime_add() fails (e.g. OOM on
hashmap_ensure_put), the already-opened fds are leaked.
Since exec_shared_runtime_add() uses TAKE_FD on success, the array
entries are reset to -1 after ownership transfer, so adding
_cleanup_close_pair_ is safe and closes the fds only when they were
never consumed.
dongshengyuan [Thu, 25 Jun 2026 08:19:14 +0000 (16:19 +0800)]
network: roll back ipv6ll_address on link_ipv6ll_gained() failure
If link_ipv6ll_gained() fails after ipv6ll_address is set, the address
remains non-null and the null-guard in address_update() never triggers
again, permanently suppressing SLAAC, DHCPv6 and RA on that link.
Clear ipv6ll_address on the failure path so the guard can fire when
the address is re-announced.
dongshengyuan [Thu, 25 Jun 2026 08:01:42 +0000 (16:01 +0800)]
sd-journal: fix memzero size in data hash table setup
journal_file_setup_data_hash_table() allocates s * sizeof(HashItem)
bytes for the hash table but then only zeroes s bytes, leaving 15/16 of
the entries uninitialized. This corrupts the hash chain in any newly
created journal file.
The adjacent journal_file_setup_field_hash_table() already uses the
correct size.
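The size mismatch can be sketched as follows. `Item` is a hypothetical 16-byte struct chosen to match the 15/16 figure above, and both helpers are illustrative only, not the journal code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte table entry. */
typedef struct Item {
        uint64_t head;
        uint64_t tail;
} Item;

/* Zeroes a table of n items. */
static void table_clear(Item *t, size_t n) {
        assert(t);

        /* The size is in bytes: memset(t, 0, n) would clear only n bytes,
         * i.e. one sixteenth of the table. */
        memset(t, 0, n * sizeof(Item));
}

/* Returns true if all n items are zero. */
static bool table_is_zeroed(const Item *t, size_t n) {
        assert(t);

        for (size_t i = 0; i < n; i++)
                if (t[i].head != 0 || t[i].tail != 0)
                        return false;

        return true;
}
```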
Paul Meyer [Wed, 17 Jun 2026 16:03:55 +0000 (18:03 +0200)]
units: harden systemd-tpm2-swtpm.service
Lock down the software TPM service: restrict the runtime directory (which
holds the AES key sealing swtpm's state) to 0700, and apply the usual
sandboxing (NoNewPrivileges, MemoryDenyWriteExecute, ProtectSystem-adjacent
Protect*/Restrict* knobs, PrivateNetwork, PrivateTmp, a @system-service
syscall filter, etc.).
A few common knobs can't be used here: the service must keep CAP_SYS_ADMIN
(needed for the ioctl that creates the vtpm proxy device on /dev/vtpmx),
and it needs runtime access to the ESP and its backing block device at a
path only known at runtime, which rules out PrivateDevices=, DevicePolicy=,
ProtectSystem= and User=/DynamicUser=.
Paul Meyer [Tue, 23 Jun 2026 12:46:24 +0000 (14:46 +0200)]
tpm2: stop the software TPM before the ESP is unmounted on shutdown
swtpm keeps its state on the ESP (--tpmstate=dir=) and thus holds it
busy for as long as it runs, but nothing ensured it was stopped before
the ESP was unmounted on shutdown, leaving boot.mount failing to
unmount.
Two things were missing:
- systemd-tpm2-swtpm.service has DefaultDependencies=no, which strips
the implicit shutdown.target membership, so it was torn down late
rather than stopped in an ordered manner. Add
Conflicts=/Before=shutdown.target, as the sibling
systemd-tpm2-setup{,-early}.service units already do.
- The generator only ordered the service
After=boot.automount/efi.automount. Ordering after the .automount
units is enough for start-up, but only an ordering against the actual
.mount units makes the service stop (releasing the ESP) before the
file system is unmounted. Add boot.mount/efi.mount to the After= line;
this is a no-op at start-up, as the mount has no job of its own there
(it is triggered on access via the automount).
Paul Meyer [Tue, 23 Jun 2026 12:40:51 +0000 (14:40 +0200)]
test: add TEST-92-TPM2-SWTPM for the software TPM fallback
Boot a VM in EFI mode without a hardware/firmware TPM and with
systemd.tpm2_software_fallback=yes, so systemd-tpm2-generator manufactures a
software TPM on the ESP in the initrd and chainloads swtpm. Assert the service
starts, the vtpm-proxy device shows up, and a systemd-creds TPM2 seal/unseal
round-trip works. Then reboot and confirm the sealed secret still unseals,
i.e. the TPM state persisted on the ESP across the reboot.
Luca Boccassi [Wed, 24 Jun 2026 18:02:06 +0000 (19:02 +0100)]
dhcp-message-dump: guard against negative option type before indexing
dhcp_option_type_from_code() returns _DHCP_OPTION_TYPE_INVALID (-EINVAL)
for the PAD and END option codes, and dump_dhcp_option_one() uses the
returned value directly as an index into the functions[] table. Those
codes are excluded by an assert() at the top of the function, but
assert() compiles down to __builtin_unreachable() under NDEBUG, so a
negative array index read is reachable there (and trips static
analyzers). Bail out explicitly on the error return.
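The guard described above can be sketched roughly as follows. This is a minimal illustration, not the actual resolved/DHCP code: OptType, opt_type_from_code(), dump_one() and the two dump callbacks are hypothetical stand-ins; only the idea (check for the negative error value before indexing the table) is taken from the commit message.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the option type enum; the invalid value is a
 * negative errno, as in the commit message. */
typedef enum {
    OPT_TYPE_FLAG,
    OPT_TYPE_STRING,
    OPT_TYPE_MAX,
    OPT_TYPE_INVALID = -EINVAL,
} OptType;

typedef int (*dump_fn)(void);

/* Hypothetical dump callbacks; they return distinct values so the
 * dispatch can be observed. */
static int dump_flag(void)   { return 1; }
static int dump_string(void) { return 2; }

static const dump_fn functions[OPT_TYPE_MAX] = {
    [OPT_TYPE_FLAG]   = dump_flag,
    [OPT_TYPE_STRING] = dump_string,
};

/* Hypothetical lookup: PAD (0) and END (255) carry no payload and thus
 * have no type. The odd/even mapping is for illustration only. */
static OptType opt_type_from_code(unsigned code) {
    if (code == 0 || code == 255)
        return OPT_TYPE_INVALID;
    return code % 2 ? OPT_TYPE_FLAG : OPT_TYPE_STRING;
}

static int dump_one(unsigned code) {
    OptType t = opt_type_from_code(code);

    /* Explicit runtime check: unlike an assert(), this survives NDEBUG
     * builds, so the table is never indexed with a negative value. */
    if (t < 0)
        return (int) t;
    if ((size_t) t >= sizeof(functions) / sizeof(functions[0]) || !functions[t])
        return -EINVAL;

    return functions[t]();
}
```

With this shape, dump_one(0) and dump_one(255) return -EINVAL instead of reading before the start of functions[].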
Luca Boccassi [Wed, 24 Jun 2026 11:24:37 +0000 (12:24 +0100)]
hostname-setup: avoid O(N^2) string building in wildcard substitution
Building the result one char at a time via strextendn() is O(N^2)
because each call rescans and reallocs the buffer. With lines up to
LONG_LINE_MAX this caused a timeout in fuzz-hostname-setup. Use
GREEDY_REALLOC_APPEND to make it linear.
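As a rough illustration of why tracking the capacity makes the loop linear, here is a minimal sketch using a hand-rolled growable buffer. Buf, buf_append() and substitute() are hypothetical names, and the substitution (every '?' replaced by one fill character) is a simplification; this is not the GREEDY_REALLOC_APPEND implementation itself.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *data;
    size_t len;   /* bytes in use, excluding the trailing NUL */
    size_t cap;   /* bytes allocated */
} Buf;

/* Append n bytes and keep the buffer NUL-terminated. The capacity doubles
 * when exhausted, so appending N single chars costs O(N) in total and the
 * existing contents are never rescanned. Returns 0 on success, -1 on error. */
static int buf_append(Buf *b, const char *s, size_t n) {
    if (n > SIZE_MAX - b->len - 1)
        return -1;

    size_t need = b->len + n + 1;
    if (need > b->cap) {
        size_t cap = b->cap ? b->cap : 16;
        while (cap < need) {
            if (cap > SIZE_MAX / 2) {
                cap = need;
                break;
            }
            cap *= 2;
        }

        char *p = realloc(b->data, cap);
        if (!p)
            return -1;
        b->data = p;
        b->cap = cap;
    }

    memcpy(b->data + b->len, s, n);
    b->len += n;
    b->data[b->len] = '\0';
    return 0;
}

/* Simplified, hypothetical wildcard substitution: each '?' becomes 'fill'.
 * Returns a newly allocated string, or NULL on allocation failure. */
static char *substitute(const char *pattern, char fill) {
    Buf b = {0};

    /* Zero-length append so that an empty pattern still yields "". */
    if (buf_append(&b, "", 0) < 0)
        return NULL;

    for (const char *p = pattern; *p; p++) {
        char c = *p == '?' ? fill : *p;
        if (buf_append(&b, &c, 1) < 0) {
            free(b.data);
            return NULL;
        }
    }
    return b.data;
}
```

The caller owns the returned string and releases it with free().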
Luca Boccassi [Wed, 24 Jun 2026 12:43:14 +0000 (13:43 +0100)]
resolved: fix potential use-after-free when freeing DNS extra stub listeners
dns_stub_listener_extra_free() frees the listener while DnsQuery and
DnsStream objects still keep pointers to it. On a reload the extra
listeners are freed before dns_stream_disconnect_all() and
dns_query_free() run, and dns_query_free() then dereferences those
pointers.
Luca Boccassi [Wed, 24 Jun 2026 12:54:05 +0000 (13:54 +0100)]
resolved: avoid dangling hashmap entry on RegisterService failure
bus_method_register_service() inserted the DnssdRegisteredService into
m->dnssd_registered_services before assigning service->manager and
before the sd_bus_track_new()/sd_bus_track_add_sender() calls. If either
call failed, the destructor ran with service->manager still NULL, so its
guarded hashmap_remove() was skipped and the freed service was left in
the hashmap.
LucasTavaresA [Wed, 24 Jun 2026 12:21:29 +0000 (09:21 -0300)]
hwdb: map Brazilian ThinkPad T14 Gen 1 slash key to KEY_RO
On Lenovo ThinkPad T14 Gen 1 AMD model 20UES5TQ00 with the Brazilian
keyboard, the physical slash/question key reports as KEY_RIGHTCTRL.
This keyboard layout has no physical Right Ctrl key in that position. The
key after Space is AltGr, then PrtSc, then the slash/question key. Map the
AT keyboard scancode 0x9d to KEY_RO, matching the ABNT slash/question key
used by Brazilian keyboard layouts.
Verified with evtest:
Event: type 4 (EV_MSC), code 4 (MSC_SCAN), value 9d
Event: type 1 (EV_KEY), code 97 (KEY_RIGHTCTRL), value 1
After applying the hwdb mapping, the key reports as KEY_RO.
DMI: svnLENOVO:pn20UES5TQ00:pvrThinkPadT14Gen1
AT keyboard scancode: 0x9d
Luca Boccassi [Wed, 24 Jun 2026 12:46:04 +0000 (13:46 +0100)]
string-util: check for short input in previous_ansi_sequence()
ellipsize_mem() scans backwards for ANSI escape sequences and calls
previous_ansi_sequence(s, t - s, ...) as t walks down toward s. When
t reaches s + 1 the helper is invoked with length == 1 and computes
'length - 2', which wraps around to SIZE_MAX.
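The class of bug can be illustrated with a small sketch. ends_with_csi() is a hypothetical helper, not previous_ansi_sequence() itself; it only shows the pattern of checking the length before subtracting from a size_t.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: does the buffer end with the two-byte CSI
 * introducer "ESC ["? */
static bool ends_with_csi(const char *s, size_t length) {
    /* size_t is unsigned, so for length 0 or 1 the expression
     * 'length - 2' would wrap around to a huge value instead of going
     * negative. Reject short input before subtracting. */
    if (length < 2)
        return false;

    return s[length - 2] == '\x1b' && s[length - 1] == '[';
}
```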
TODO: drop bootctl link + sysupdate integration item
This is now implemented: sysupdate calls out to the
/run/systemd/sysupdate/notify/ Varlink directory on completion, and bootctl
binds a socket there that links a UKI plus extras staged below
/var/lib/systemd/uki/ (with .v/ vpick support) via "bootctl link-auto".
test: verify bootctl link-auto and io.systemd.BootControl.LinkAuto
Add a TEST-87 testcase exercising "bootctl link-auto" and the equivalent
io.systemd.BootControl.LinkAuto() Varlink method: a UKI plus extras are staged
below the search directories and we assert the kernel and sidecar resources
are linked into $BOOT. Covered: plain kernel.efi + extras.d/, versioned
kernel.efi.v/ and extras .v/ resolved via vpick, directory priority
(/etc wins over /run), the no-op case when nothing is staged, and the Varlink
method including its empty reply when there is nothing to link.
test: verify sysupdate invokes the notification callout directory
Extend TEST-72-SYSUPDATE with a check that, after a successful update,
systemd-sysupdate connects to every socket linked into
/run/systemd/sysupdate/notify/ and invokes
io.systemd.SysUpdate.Notify.OnCompletedUpdate(). A tiny recorder socket is
hooked into that directory; it captures the request and replies with success.
We assert the recorded call carries the expected method, version and resource
list, and that a subsequent no-op update emits no notification.
sysext: refresh sysexts and confexts on completed system update
Bind the io.systemd.SysUpdate.Notify.OnCompletedUpdate() method in the
sysext Varlink server. systemd-sysext provides a single Varlink service
covering both the sysext and confext image classes, so one notification
refreshes both (equivalent to "systemd-sysext refresh" plus
"systemd-confext refresh"). Hook a socket into
/run/systemd/sysupdate/notify/ via systemd-sysupdate-notify-sysext.socket,
enabled by default via the preset.
bootctl: add link-auto/LinkAuto and auto-link on completed system update
Add a "bootctl link-auto" verb and a matching io.systemd.BootControl.LinkAuto()
Varlink method that behave exactly like "bootctl link" / Link(), except that
the UKI and extra resources are discovered automatically instead of being
passed in. The following directories are searched, in decreasing priority:
/etc/systemd/uki/, /run/systemd/uki/, /var/lib/systemd/uki/ (where
systemd-sysupdate stages downloaded resources), /usr/local/lib/systemd/uki/
and /usr/lib/systemd/uki/.
- the UKI is taken from kernel.efi, or the best version in kernel.efi.v/
  (resolved via vpick, without honouring boot-counting suffixes), from the
  highest-priority directory that has one;
- extra resources are picked up from extras.d/, matching *.sysext.raw,
  *.confext.raw and *.cred, each either as a plain file or as a versioned
  *.v/ directory resolved via vpick, combined across all directories with
  higher-priority directories winning on conflicts.
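The "higher-priority directories win" rule for extras can be sketched as below. This is an illustration only: DirListing, Pick and merge_by_priority() are hypothetical, the directory contents are passed in as plain arrays rather than read from disk, and treating two entries as conflicting when their file names are equal is an assumption, not something the commit message spells out.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *dir;            /* directory the listing came from */
    const char *const *names;   /* NULL-terminated list of file names */
} DirListing;

typedef struct {
    const char *name;
    const char *dir;
} Pick;

/* 'dirs' is ordered by decreasing priority. An entry from a later
 * (lower-priority) directory is skipped if an entry of the same name was
 * already picked. Returns the number of picks written to 'out'. */
static size_t merge_by_priority(const DirListing *dirs, size_t n_dirs,
                                Pick *out, size_t out_max) {
    size_t n = 0;

    for (size_t i = 0; i < n_dirs; i++)
        for (const char *const *p = dirs[i].names; *p; p++) {
            int seen = 0;

            for (size_t j = 0; j < n; j++)
                if (strcmp(out[j].name, *p) == 0) {
                    seen = 1;
                    break;
                }
            if (seen)
                continue;
            if (n >= out_max)
                return n;

            out[n++] = (Pick) { .name = *p, .dir = dirs[i].dir };
        }

    return n;
}
```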
Everything is resolved relative to the pinned root directory fd. Files passed
via --extra= on the command line are linked in addition to the auto-discovered
ones.
Also bind io.systemd.SysUpdate.Notify.OnCompletedUpdate() in the boot control
Varlink server, which simply does the same as LinkAuto(), and hook a socket
into /run/systemd/sysupdate/notify/ via systemd-sysupdate-notify-bootctl.socket
(enabled by default via the preset) so a freshly downloaded kernel is linked
into $BOOT automatically after a sysupdate run.
pcrlock: recompute PCR policy on completed system update
Bind the io.systemd.SysUpdate.Notify.OnCompletedUpdate() method in the
pcrlock Varlink server and hook a socket into
/run/systemd/sysupdate/notify/ via systemd-sysupdate-notify-pcrlock.socket,
enabled by default via the preset. When sysupdate signals a completed
update, we unconditionally re-run make-policy, since the set of measured
components may have changed.
sysupdate: notify hook subscribers after a successful update
Define a new io.systemd.SysUpdate.Notify Varlink interface with a single
OnCompletedUpdate() method, and after sysupdate successfully installs an
update, invoke that method on every socket linked into
/run/systemd/sysupdate/notify/ via varlink_execute_directory(). This
gives other components a hook to react to applied updates (e.g. recompute
a TPM policy, link a freshly downloaded kernel, refresh extensions).
The notification carries the component name, the installed version and the
list of updated resources (transfer id + on-disk path). Subscribers are
free to ignore the parameters and just treat the call as a trigger.
Setting SYSTEMD_SYSUPDATE_FORCE_NOTIFY=1 forces the notification to be sent
even when no update was applied (in which case no resource list is included),
so follow-up work can be triggered unconditionally.
Mirror how chaseat() works these days: instead of a single toplevel_fd that
serves as both the root (chroot) boundary and the directory that resolution
starts from, path_pick() now takes a separate root_fd and dir_fd. This lets
callers resolve a path relative to a specific directory fd while confining
symlink and absolute-path resolution to a root directory fd.
All existing callers are updated to pass the same fd for both, preserving
their current behaviour.
Paul Meyer [Sat, 23 May 2026 15:37:40 +0000 (17:37 +0200)]
man: document SEV-SNP credential delivery via initrd cpio
Under --coco=sev-snp, credentials no longer flow through SMBIOS/fw_cfg
(which the guest PID1 discards as unmeasured in confidential VMs) but
through a cpio archive appended to the initrd, landing in the @system
bucket via the new /.extra/system_credentials/ initrd path. Update
systemd-vmspawn(1) to describe this and the guest systemd version
requirement.
Paul Meyer [Sat, 23 May 2026 15:05:56 +0000 (17:05 +0200)]
vmspawn: deliver credentials via initrd cpio under SEV-SNP
Previously, --load-credential / --set-credential were rejected outright
under --coco=sev-snp because the SMBIOS type-11 transport isn't covered
by the launch measurement. PID1 wouldn't have accepted those credentials
anyway (import_credentials_smbios() refuses any SMBIOS-sourced credentials
under a confidential VM).
Instead, when SNP is in use and credentials are present, synthesize a
newc cpio archive containing each credential at
.extra/system_credentials/<id>.cred and append it to the initrd list.
The existing merge_initrds() path then concatenates it into the single
initrd file QEMU loads, which kernel-hashes=on covers in the SEV-SNP
launch digest. PID1's import_credentials_boot() picks them up from the
trusted /.extra/system_credentials/ path and routes them to the @system
bucket, so units can consume them via LoadCredential= unchanged.
Direct kernel boot (--linux=) is already required under SNP, so the
initrd is always under our control here. The cpio synthesis happens
after all internal machine_credential_add()/machine_credential_load()
call sites so the archive captures the complete credential set (journal
forwarding, vmm.notify_socket, ssh ephemeral keys, etc.).
The cpio path is intentionally scoped to SNP: it requires a guest PID1
that knows about /.extra/system_credentials/, and we don't want to
regress credential delivery for non-CoCo guests running older systemd
versions. Switching can be considered once the new path is widely
available.
Paul Meyer [Sat, 23 May 2026 14:25:55 +0000 (16:25 +0200)]
shared: add userspace cpio writer for credentials
Add a small newc-format cpio encoder that builds an archive with each
credential as a file under .extra/system_credentials/<id>.cred and
writes it to a temp file. This mirrors what systemd-stub produces from
ESP credentials, so PID1's import_credentials_boot() picks them up
unchanged via the new /.extra/system_credentials/ initrd path.
Motivated by vmspawn under SEV-SNP, where SMBIOS credentials aren't
covered by the launch measurement and are discarded by PID1 in
confidential guests, so they must be delivered via the measured initrd
instead. The writer lives in src/shared/ so other host-side tooling
can reuse it.
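For readers unfamiliar with the newc format, a single regular-file entry can be sketched as follows. This is not the writer added in src/shared/: cpio_newc_entry() and pad4() are hypothetical names, and the sketch writes into a caller-provided buffer instead of a temp file. The layout itself is the standard newc one: a 110-byte ASCII header ("070701" plus thirteen 8-digit hex fields), the NUL-terminated name, then the data, with header+name and data each padded to a multiple of 4 bytes.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CPIO_NEWC_HEADER_SIZE 110u

/* Number of padding bytes needed to reach the next multiple of 4. */
static size_t pad4(size_t n) {
    return (4 - n % 4) % 4;
}

/* Encode one regular file as a newc entry into 'out'. Returns the number
 * of bytes written (always a multiple of 4), or 0 if it does not fit. */
static size_t cpio_newc_entry(uint8_t *out, size_t out_size,
                              const char *name,
                              const void *data, uint32_t size,
                              uint32_t ino, uint32_t mode) {
    size_t namesize = strlen(name) + 1;                 /* includes NUL */
    size_t head = CPIO_NEWC_HEADER_SIZE + namesize;
    size_t data_off = head + pad4(head);
    char hdr[CPIO_NEWC_HEADER_SIZE + 1];

    if (data_off > out_size)
        return 0;
    if (size > out_size - data_off || pad4(size) > out_size - data_off - size)
        return 0;

    /* Fields: ino, mode, uid, gid, nlink, mtime, filesize, devmajor,
     * devminor, rdevmajor, rdevminor, namesize, check. */
    snprintf(hdr, sizeof(hdr),
             "070701%08X%08X%08X%08X%08X%08X%08X%08X%08X%08X%08X%08X%08X",
             (unsigned) ino, (unsigned) mode, 0u, 0u, 1u, 0u,
             (unsigned) size, 0u, 0u, 0u, 0u, (unsigned) namesize, 0u);

    size_t total = data_off + size + pad4(size);
    memset(out, 0, total);                              /* zero the padding */
    memcpy(out, hdr, CPIO_NEWC_HEADER_SIZE);
    memcpy(out + CPIO_NEWC_HEADER_SIZE, name, namesize);
    if (size > 0)
        memcpy(out + data_off, data, size);

    return total;
}
```

A complete archive additionally ends with an entry named "TRAILER!!!", and a real writer would typically also emit entries for the parent directories.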
Paul Meyer [Wed, 3 Jun 2026 11:23:27 +0000 (13:23 +0200)]
core: import trusted initrd credentials
PID1's import_credentials_boot() so far always treated initrd-delivered
credentials as untrusted: anything found under /.extra/credentials/
was routed into ENCRYPTED_CREDENTIALS_DIRECTORY, forcing consumers to
use LoadCredentialEncrypted= and provide credentials in systemd-creds
encrypted form. That matches the trust model for systemd-stub, which
sources credentials from the EFI System Partition (mountable and editable
offline).
Host-side producers that take responsibility for the trust of the cpio
they hand to the kernel have a different model: e.g. systemd-vmspawn
builds an initrd-credentials cpio whose bytes are covered by the SEV-SNP
launch measurement via QEMU's kernel-hashes=on (or, in non-confidential
setups, by the host itself being the trust root). Forcing those through
the @encrypted bucket would require null-key wrapping on the host and
LoadCredentialEncrypted= on the consumer side, a needless API split for
unit files that should otherwise be portable between confidential and
non-confidential boots.
Extend import_credentials_boot() to import credentials from
/.extra/system_credentials/, routed into SYSTEM_CREDENTIALS_DIRECTORY.
Consumers access these via LoadCredential= directly, with no
encrypted-credential ceremony.
The per-directory walk is hoisted into a small static helper so the new
target bucket can share the existing copy/validation logic; behavior
for the existing untrusted paths is unchanged.
Also set the new RECURSE_DIR_MUST_BE_REGULAR flag as a drive-by.
sysupdate: automatically clean up orphaned files after auto-update (#42714)
This adds an operation equivalent to "systemd-sysupdate cleanup" after
an update has completed (regardless of whether that update was entirely
successful). This ensures that any orphaned files are automatically
cleaned up if they are no longer referenced by any transfer file's
patterns.
sysupdate: automatically clean up orphaned files after auto-update
This adds an operation equivalent to "systemd-sysupdate cleanup" after
an update has completed (regardless of whether that update was entirely
successful). This ensures that any orphaned files are automatically
cleaned up if they are no longer referenced by any transfer file's
patterns.
This happens in test units with many commands, so reset the timer when
a command completes and the test advances. The number of Exec
instructions is bounded so this will terminate jobs that are really
stuck anyway.
tunaichao [Wed, 24 Jun 2026 06:01:06 +0000 (14:01 +0800)]
core: pin restrict-fsaccess initramfs_s_dev store width to skeleton field
The clear-store in restrict_fsaccess_clear_initramfs_trust() writes a fixed
4 bytes (*(uint32_t *)(p + INITRAMFS_S_DEV_OFF) = 0). INITRAMFS_S_DEV_OFF is
derived from the skeleton, so the offset tracks any field widening, but the
store width does not: were initramfs_s_dev widened (e.g. __u32 -> __u64) in
the BPF program, the store would clear only the low 4 bytes and silently
leave the initramfs trust window partially open. That is exactly the class
of bug the mirror-struct asserts (removed earlier in this branch) guarded
against.
Add a compile-time assert pinning the store width to the skeleton field
width (sizeof_field(typeof_field(struct restrict_fsaccess_bpf, bss[0]),
initramfs_s_dev) == sizeof(uint32_t)), so widening the field fails the build
instead of clearing half of it.
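The width pin can be sketched like this. struct demo_bss is a hypothetical stand-in for the generated skeleton's bss struct, and SIZEOF_FIELD() is a local illustration macro rather than systemd's sizeof_field()/typeof_field() helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the skeleton's generated bss struct. */
struct demo_bss {
    uint32_t initramfs_s_dev;
    uint32_t other_global;
};

/* Size of a struct member without needing an instance. */
#define SIZEOF_FIELD(type, member) sizeof(((type *) 0)->member)

/* If the field were ever widened (e.g. to uint64_t), the 4-byte store
 * below would clear only part of it. Fail the build instead. */
static_assert(SIZEOF_FIELD(struct demo_bss, initramfs_s_dev) == sizeof(uint32_t),
              "store width must match the width of initramfs_s_dev");

static void clear_initramfs_s_dev(void *map) {
    uint8_t *p = map;

    *(uint32_t *) (p + offsetof(struct demo_bss, initramfs_s_dev)) = 0;
}
```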
tunaichao [Tue, 23 Jun 2026 07:45:49 +0000 (15:45 +0800)]
core: derive restrict-fsaccess initramfs_s_dev offset from skeleton
Building with -Dbpf=enabled -Dbpf_compiler=gcc (GCC's BPF backend) fails on
the static assertions in bpf-restrict-fsaccess.c, introduced in 68fe7fa4d6:
The hand-written struct restrict_fsaccess_bss lists the BPF .bss globals in
source declaration order and asserts that its layout matches the skeleton's
generated bss struct. bpftool gen skeleton emits that struct from the BTF
.bss DATASEC, whose member order reflects the physical order the compiler
placed the variables, not the source order. clang preserves declaration
order, so the asserts pass; gcc reorders .bss globals, so initramfs_s_dev no
longer sits at offset 0 and the asserts fail.
This is more than a build break: restrict_fsaccess_clear_initramfs_trust()
clears initramfs_s_dev by mmap()ing the .bss map and storing 0 at a hardcoded
offset 0. Under the gcc layout that store would clobber the wrong global,
silently leaving the initramfs trust window open after switch_root instead of
closing it. The asserts were correctly catching this.
Fix it by deriving the offset from the generated skeleton instead of a mirror
struct: drop struct restrict_fsaccess_bss and the four field-order
assert_cc()s, take INITRAMFS_S_DEV_OFF from the skeleton's bss struct
(offsetof(typeof_field(struct restrict_fsaccess_bpf, bss[0]),
initramfs_s_dev)), and store at p + INITRAMFS_S_DEV_OFF. The offset is a
compile-time constant, so clang (offset 0) is unchanged while gcc tracks the
real layout. A retained assert_cc() documents the 4-byte alignment the
single-store atomicity relies on.
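A minimal sketch of the offset derivation, under the assumption of a layout where the field is not first. struct demo_bss and its 'enforce' member are hypothetical; the real code takes the layout from the bpftool-generated skeleton.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout as a reordering compiler might emit it: the field
 * of interest no longer sits at offset 0. */
struct demo_bss {
    uint32_t enforce;           /* hypothetical neighbouring global */
    uint32_t initramfs_s_dev;
};

/* Derived from the struct, so it follows whatever layout was generated. */
#define INITRAMFS_S_DEV_OFF offsetof(struct demo_bss, initramfs_s_dev)

/* The single 4-byte store relies on natural alignment. */
static_assert(INITRAMFS_S_DEV_OFF % sizeof(uint32_t) == 0,
              "initramfs_s_dev must be 4-byte aligned");

static void clear_initramfs_s_dev(void *map) {
    uint8_t *p = map;

    /* A hardcoded offset 0 would clobber 'enforce' in this layout. */
    *(uint32_t *) (p + INITRAMFS_S_DEV_OFF) = 0;
}
```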
sd-varlink: mark varlink sockets via xattrs (#42454)
Linux 7.0 added the ability to mark socket inodes with xattrs. Let's use
that to clearly mark all our Varlink sockets as being varlink related.
This is then used to implement a very useful new command "varlinkctl
list-sockets" which lists all varlink entrypoint sockets marked this
way.
By marking not just the entrypoint inodes but also the connection
sockets properly, we can one day add an eBPF-based "varlinkctl trace"
command that watches varlink sockets for traffic, but that's material
for a later PR.
Frantisek Sumsal [Tue, 23 Jun 2026 19:29:53 +0000 (21:29 +0200)]
test: skip fdstore tests if test-fdstore is not available
When the test suite is run in the "standalone" mode, the minimal
container might not contain the test-fdstore binary that's needed for a
couple of tests. Since installing systemd-tests into the minimal
container pulls in a lot of other dependencies, let's just skip the
affected tests instead to avoid this.