core: use proper service type of TEST-07-PID.user-namespace-path.sh
TEST-07-PID.user-namespace-path.sh is flaky as Type=simple is used
(implicitly), explicitly use Type=exec instead to ensure the namespaces
are created before starting another service reusing the same namespaces.
varlink-idl: add infra to test our enum parsers against varlink IDL enums
In many cases we want to expose enums for which we have the usual
xyz_to_string()/xyz_from_string() via Varlink as enums. Let's add some
infra to test the tables against each other, to automatically detect
when they deviate.
In order to implement this properly, let's export/introduce clean
json_underscorefy()/json_dashify(), for dealing with the fact that our
enums usually use dash separates ames, but Varlink doesn't allow that.
(This does not add the test cases for all enum types we expose right
now, but only adds the general infra).
This allows a service to reuse the user namespace created for an
existing service, similarly to NetworkNamespacePath=. The configuration
is the initial user namespace (e.g. ID mapping) is preserved.
* 8e2833a5b6 Automatically figure out the name of the top-level tar dir
* dffbf2beba Make sure fallback source is listed first
* 1d3b892105 Enable sysupdate and sysupdated
Add support for nvindex-based additional PCRs for TPM2, aka "NvPCRs" (#39463)
This is based on the code from #33276, but is cleaned up, and goes for a
modified approach:
the original PR allocated nvindexes fully dynamically, and that created
big headaches, because the assignments needed to be propagated into the
early boot process, and that meant stuffing them as sidecards to the
boot UKIs.
The TCG then offered us a fixed nvindex range assigned to us, and
happily said yes to that, but since then the discussion stalled, we
couldn't get any answer from TCG on this anymore.
This code uses the range that was hinted to us to use, but not
officially assigned to us by default, but makes it build time
configurable so that downstreams can change this.
(This does *not* make it runtime configurable, because that's really
hard, because of the early boot issue again).
This PR comes with a CI test and full docs. And I think this is really a
version should that be merged.
tpm2-util: add infra for allocating nvindex-based PCRs (aka "NvPCRs")
We'd like to measure various additional things into PCRs, but all
available ones to the OS are already used for various purposes. Hence,
let's introduce a new concept of "NV Index based PCRs", i.e. let's use
TPM2 nv indexes of type TPM2_NT_EXTEND that mostly behave like real
PCRs, but which we can allocate relatively freely from the nv index
space. Let's call these "fake" PCRs "NvPCRs".
My original intention was to get a fixed NV index range assigned from
the TCG, either for Linux or for systemd as a project, but this stalled
with no further updates from the TCG for more than a year and a half
now. I was told an NV index range to use though, even if it never was
officially assigned, hence this PR uses this by default. But the range
is configurable at build time, on purpose, so that downstreams have some
flexibility to change this if they want. To abstract the actual nvindex
number away we introduce a naming concept, so that nvindexes are
referenced by name string rather than number.
NvPCRs are defined in little JSON snippets in /usr/lib/nvpcr/*.nvpcr,
that match up index number and name, as well as pick a hash algorithm.
There's one complication: these nvindex (like any nvindex) can be
deleted by anyone with access to the TPM, and then be recreated. This
could be used to reset the NvPCRs to zero during runtime, which defeats
the whole point of them. Our way out: we measure a secret as first thing
after creation into the NvPCRs. (Or actually, we measure a per-NvPCR
secret we derive from a system secret via an HMAC of the NvPCR name) and
the nvindex handle). This "anchoring" secret is stored in /run/ +
/var/lib/ + ESP/XBOOTLDR (the latter encrypted as credential, locked to
the TPM), to make it available at the whole runtime of the OS.
tpm2-util: make tpm2_undefine_policy_nv_index() generic
We can use this to remove any kind of nvindex, hence give it a generic
name.
Also instead of passing "NONE" as session if none is specified, pass
PASSWORD instead, so that the function actually becomes useful if no
session is specified (the only user so far, pcrlock always provides a
session, hence this is no change in behaviour).
creds-util: add automatic mode for tpm2 based creds
This reworkds TPM2 based creds a bit. Instead of mapping the key type
"tpm2" directly to a TPM2 key without PK, let's map it to an "automatic"
key type that either picks PK or doesn't, depending on what's available.
That should make things easier to grok for people, as the nitty gritty
details of PK or not PK are made autmatic. Moreover it gives us more
leverage to change the TPM2 enrollment types later (for example, we
definitely want to start pinning SRK, and hook up pcrlock too, for
creds, which we currently don't).
This hence adds a new _CRED_AUTO_TPM2
pseudo-type we automatically maps to CRED_AES256_GCM_BY_TPM2_HMAC_WITH_PK
or CRED_AES256_GCM_BY_TPM2_HMAC depending if PK as available. Similar,
_CRED_AUTO_HOST_AND_TPM2 is added, which does the same for the
host/nonhost cred type.
This does not introduce any new type on the wire, it just changes how we
select the right key type.
To make the code more readable this also adds some categorization macros
for the keys, instead of repeating the list of key types at multiple
places.
man/ukify: mention all functionality in intro, add example of direct boot
Over the time, the functionality in ukify has grown. This should all be briefly
mentioned in the first section so the user does't have to read the whole page
to figure out what types of functionality are implemnted.
Also add an example of direct kernel boot. It's a nifty technology (and frankly
underutilized, considering how cool it is is).
man/sd-boot: add some meat to the direct kernel boot example
Unfortunately qemu still default to BIOS boot, so for the direct kernel
boot with an efi file to be of any use, the complex param used to switch
to UEFI mode needs to be provided.
Yu Watanabe [Sun, 26 Oct 2025 07:33:11 +0000 (16:33 +0900)]
pe-binary: drop pe_hash() and friends when OpenSSL support is disabled
These three functions are currently only used by sbsign, which requires
OpenSSL. Moreover, pe_hash() and uki_hash() anyway do not work if
OpenSSL is disabled. Let's only declare them when OpenSSL support is
enabled.
docs: add comment about requiring the mount hierarchy to be mounted MS_SHARED
This has been tripping up container manager people. let's document this
explicitly.
(Note that the container interface could really use some updates, i.e.
it was written before a time where cgroup namespacing was a thing. But I
am too lazy to fix that now, so let's just add this once facet.)
doc: indicate Type=oneshot also detects invocation failures
Type `simple` explicitly mentions that invocation failures like a missing binary
or `User=` name won’t get detected – whereas type `exec` mentions that it does.
Type `oneshot` refers to being similar to `simple`, which could lead one to
assume it doesn’t detect such invocation failures either – it seems however it
does.
Indicate this my changing its wording to be similar to `exec`.
Signed-off-by: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name>
Virtual block devices are a bit weird: they have no parent device, and
thus cannot be related to the subsystem they belong to, except by
pattern matching their name. This is OK to do if one knows what to look
for. However for tools that do not want to carry a list of known
subsystems with their appropriate matching patters this sucks. Let's
introduce a new ID_BLOCK_SUBSYSTEM property we can set on block devices
that carries an explicit string for this. Do so for a small number of
key subsystems: DM, loopback and zram.
repart: if device node is specified as "-", calculate needed disk space
So far repart always required specification of a device node. And if
none was specified, then we'd fine the node backing the root fs. Let's
optionally allow that the device node is explicitly not specified (i.e.
specified as "-" or ""), in which case we'll just print the size of the
minimal image given the definitions.
repart: move definitions + dry_run + empty fields into Context
This is preparation for making this eventually available via Varlink,
where we'd like to create Context object for each call that we can free
once it is done, but not inherit state from an earlier call.
Also fixes a couple of cases where we accessed arg_node, but where we
should have accessed the Context-specific copy in .node.
Yu Watanabe [Sun, 26 Oct 2025 04:12:01 +0000 (13:12 +0900)]
libcryptsetup: drop several unnecessary checks for existences of functions by libcryptsetyp
The functions crypt_set_metadata_size() and friends are supported since
libcryptsetup-2.0.
This also merges checks for functions used for supporting libcryptsetup
plugins with others.
Moreover, check existence of one more function (crypt_logf) that is used in
libcryptsetup plugins.
Let's use the proper uint32_t parsers initially, so that the usual logic
of formatting integers as decimal strings, works too for uids/gids. Not
because it made any sense to encode them like that, but just to be
systematic here.
sd-json: make sure all dispatch helpers do something sensible in case of "null" JSON value
Most of our dispatch helpers already do something useful in case they
are invoked on a null JSON value: they translate this to the appropriate
niche value for the type, if there is one.
Add the same for *all* dispatchers we have, to make this fully
systematic.
For various types it's not always clear which niche value to pick. I
opted for UINT{8,16,32,64}_MAX for the various unsigned integers, which
maps our own use in most cases. I opted for -1 for the various signed
integer types. For arrays/blobs of stuff I opted for the empty
array/blob, and for booleans I opted for false.
Of course, in various cases this is not going to be the right niche
value, but that's entirely fine, after all before a json value reaches a
dispatcher function it must pass one of two type checks first:
1. Either the .type field of sd_json_dispatch_field must be
_SD_JSON_VARIANT_TYPE_INVALID to not do a type check at all
2. Or the .type field is set, but then the SD_JSON_NULLABLE flag must be
set in .flags.
This means, accidentally generating the niche values on null is not
really likely.
Daan De Meyer [Wed, 29 Oct 2025 21:39:48 +0000 (22:39 +0100)]
parse-util: Add parse_capability_set()
Let's extract common capability parsing code into a generic function
parse_capability_set() with a comprehensive set of unit tests.
We also replace usages of UINT64_MAX with CAP_MASK_UNSET where
applicable and replace the default value of CapabilityBoundingSet
with CAP_MASK_ALL which more clearly identifies that it is initialized
to all capabilities.
AI (copilot) was used to extract the generic function and write the
unit tests, with manual review and fixing afterwards to make sure
everything was correct.
Luca Boccassi [Fri, 31 Oct 2025 16:46:49 +0000 (16:46 +0000)]
test: add test case for verity deferred removal without sharing
I recently found out (the hard way) that on an older version
there was a bug when the verity sharing is disabled: the
deferred close flag was not set correctly, so verity devices
were leaked.
This is not an issue in main currently, but add a test case
to cover it just in case, to avoid future regressions.
systemctl: downgrade or silence warnings for --now
When calling systemctl enable/disable/reenable --now, we'd always fail with
error when operating offline. This seemly overly restricitive. In particular,
if systemd is not running at all, the service is not running either, so
complaining that we can't stop it is completely unnecessary. But even when
operating in a chroot where systemd is not running, let's just emit a warning
and return success. It's fairly common to have installation or package scripts
which do such calls and not starting/restarting the service in those scenarios
is the desired and expected operation. (If --now is called in combination
with --global or --root=, keep returning an error.)
Also make the messages nicer. I was adding some docs to tell the user to run
'systemctl enable --now', and checked how the command can fail, and the error
message that the user might see in some common scenarios was too complicated.
Split it up to be nicer.
Daan De Meyer [Fri, 31 Oct 2025 21:30:46 +0000 (22:30 +0100)]
core: Add RootDirectoryFileDescriptor= (#39480)
RootDirectory= but via a open_tree() file descriptor. This allows
setting up the execution environment for a service by the client in a
mount namespace and then starting a transient unit in that execution
environment using the new property.
We also add --root-directory= and --same-root-dir= to systemd-run to
have it run services within the given root directory. As systemd-run
might be invoked from a different mount namespace than what systemd is
running in, systemd-run opens the given path with open_tree() and then
sends it to systemd using the new RootDirectoryFileDescriptor= property.
Yu Watanabe [Fri, 31 Oct 2025 14:03:14 +0000 (23:03 +0900)]
reread-partition-table: take exclusive lock when requested
Before aa47d8ade18cc4a079fef5a1aaa37d763507104e, we took an exclusive lock
for the whole block device, but with the commit, a shared lock is taken.
That causes, during we requesting the kernel to reread partition table,
udev workers can process the block device or its partitions.
Let's make udev workers not process block devices during rereading
partition table again.