man/ukify: mention all functionality in intro, add example of direct boot
Over the time, the functionality in ukify has grown. This should all be briefly
mentioned in the first section so the user does't have to read the whole page
to figure out what types of functionality are implemnted.
Also add an example of direct kernel boot. It's a nifty technology (and frankly
underutilized, considering how cool it is is).
man/sd-boot: add some meat to the direct kernel boot example
Unfortunately qemu still default to BIOS boot, so for the direct kernel
boot with an efi file to be of any use, the complex param used to switch
to UEFI mode needs to be provided.
Yu Watanabe [Sun, 26 Oct 2025 07:33:11 +0000 (16:33 +0900)]
pe-binary: drop pe_hash() and friends when OpenSSL support is disabled
These three functions are currently only used by sbsign, which requires
OpenSSL. Moreover, pe_hash() and uki_hash() anyway do not work if
OpenSSL is disabled. Let's only declare them when OpenSSL support is
enabled.
docs: add comment about requiring the mount hierarchy to be mounted MS_SHARED
This has been tripping up container manager people. let's document this
explicitly.
(Note that the container interface could really use some updates, i.e.
it was written before a time where cgroup namespacing was a thing. But I
am too lazy to fix that now, so let's just add this once facet.)
doc: indicate Type=oneshot also detects invocation failures
Type `simple` explicitly mentions that invocation failures like a missing binary
or `User=` name won’t get detected – whereas type `exec` mentions that it does.
Type `oneshot` refers to being similar to `simple`, which could lead one to
assume it doesn’t detect such invocation failures either – it seems however it
does.
Indicate this my changing its wording to be similar to `exec`.
Signed-off-by: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name>
Virtual block devices are a bit weird: they have no parent device, and
thus cannot be related to the subsystem they belong to, except by
pattern matching their name. This is OK to do if one knows what to look
for. However for tools that do not want to carry a list of known
subsystems with their appropriate matching patters this sucks. Let's
introduce a new ID_BLOCK_SUBSYSTEM property we can set on block devices
that carries an explicit string for this. Do so for a small number of
key subsystems: DM, loopback and zram.
repart: if device node is specified as "-", calculate needed disk space
So far repart always required specification of a device node. And if
none was specified, then we'd fine the node backing the root fs. Let's
optionally allow that the device node is explicitly not specified (i.e.
specified as "-" or ""), in which case we'll just print the size of the
minimal image given the definitions.
repart: move definitions + dry_run + empty fields into Context
This is preparation for making this eventually available via Varlink,
where we'd like to create Context object for each call that we can free
once it is done, but not inherit state from an earlier call.
Also fixes a couple of cases where we accessed arg_node, but where we
should have accessed the Context-specific copy in .node.
Yu Watanabe [Sun, 26 Oct 2025 04:12:01 +0000 (13:12 +0900)]
libcryptsetup: drop several unnecessary checks for existences of functions by libcryptsetyp
The functions crypt_set_metadata_size() and friends are supported since
libcryptsetup-2.0.
This also merges checks for functions used for supporting libcryptsetup
plugins with others.
Moreover, check existence of one more function (crypt_logf) that is used in
libcryptsetup plugins.
Let's use the proper uint32_t parsers initially, so that the usual logic
of formatting integers as decimal strings, works too for uids/gids. Not
because it made any sense to encode them like that, but just to be
systematic here.
sd-json: make sure all dispatch helpers do something sensible in case of "null" JSON value
Most of our dispatch helpers already do something useful in case they
are invoked on a null JSON value: they translate this to the appropriate
niche value for the type, if there is one.
Add the same for *all* dispatchers we have, to make this fully
systematic.
For various types it's not always clear which niche value to pick. I
opted for UINT{8,16,32,64}_MAX for the various unsigned integers, which
maps our own use in most cases. I opted for -1 for the various signed
integer types. For arrays/blobs of stuff I opted for the empty
array/blob, and for booleans I opted for false.
Of course, in various cases this is not going to be the right niche
value, but that's entirely fine, after all before a json value reaches a
dispatcher function it must pass one of two type checks first:
1. Either the .type field of sd_json_dispatch_field must be
_SD_JSON_VARIANT_TYPE_INVALID to not do a type check at all
2. Or the .type field is set, but then the SD_JSON_NULLABLE flag must be
set in .flags.
This means, accidentally generating the niche values on null is not
really likely.
Daan De Meyer [Wed, 29 Oct 2025 21:39:48 +0000 (22:39 +0100)]
parse-util: Add parse_capability_set()
Let's extract common capability parsing code into a generic function
parse_capability_set() with a comprehensive set of unit tests.
We also replace usages of UINT64_MAX with CAP_MASK_UNSET where
applicable and replace the default value of CapabilityBoundingSet
with CAP_MASK_ALL which more clearly identifies that it is initialized
to all capabilities.
AI (copilot) was used to extract the generic function and write the
unit tests, with manual review and fixing afterwards to make sure
everything was correct.
Luca Boccassi [Fri, 31 Oct 2025 16:46:49 +0000 (16:46 +0000)]
test: add test case for verity deferred removal without sharing
I recently found out (the hard way) that on an older version
there was a bug when the verity sharing is disabled: the
deferred close flag was not set correctly, so verity devices
were leaked.
This is not an issue in main currently, but add a test case
to cover it just in case, to avoid future regressions.
systemctl: downgrade or silence warnings for --now
When calling systemctl enable/disable/reenable --now, we'd always fail with
error when operating offline. This seemly overly restricitive. In particular,
if systemd is not running at all, the service is not running either, so
complaining that we can't stop it is completely unnecessary. But even when
operating in a chroot where systemd is not running, let's just emit a warning
and return success. It's fairly common to have installation or package scripts
which do such calls and not starting/restarting the service in those scenarios
is the desired and expected operation. (If --now is called in combination
with --global or --root=, keep returning an error.)
Also make the messages nicer. I was adding some docs to tell the user to run
'systemctl enable --now', and checked how the command can fail, and the error
message that the user might see in some common scenarios was too complicated.
Split it up to be nicer.
Daan De Meyer [Fri, 31 Oct 2025 21:30:46 +0000 (22:30 +0100)]
core: Add RootDirectoryFileDescriptor= (#39480)
RootDirectory= but via a open_tree() file descriptor. This allows
setting up the execution environment for a service by the client in a
mount namespace and then starting a transient unit in that execution
environment using the new property.
We also add --root-directory= and --same-root-dir= to systemd-run to
have it run services within the given root directory. As systemd-run
might be invoked from a different mount namespace than what systemd is
running in, systemd-run opens the given path with open_tree() and then
sends it to systemd using the new RootDirectoryFileDescriptor= property.
Yu Watanabe [Fri, 31 Oct 2025 14:03:14 +0000 (23:03 +0900)]
reread-partition-table: take exclusive lock when requested
Before aa47d8ade18cc4a079fef5a1aaa37d763507104e, we took an exclusive lock
for the whole block device, but with the commit, a shared lock is taken.
That causes, during we requesting the kernel to reread partition table,
udev workers can process the block device or its partitions.
Let's make udev workers not process block devices during rereading
partition table again.
Daan De Meyer [Tue, 28 Oct 2025 22:47:26 +0000 (23:47 +0100)]
core: Add RootDirectoryFileDescriptor=
RootDirectory= but via a open_tree() file descriptor. This allows
setting up the execution environment for a service by the client in
a mount namespace and then starting a transient unit in that execution
environment using the new property.
We also add --root-directory= and --same-root-dir= to systemd-run to
have it run services within the given root directory. As systemd-run
might be invoked from a different mount namespace than what systemd is
running in, systemd-run opens the given path with open_tree() and then
sends it to systemd using the new RootDirectoryFileDescriptor= property.
importd: support OS tree "mangling" unpriv too (#39406)
Split out of #38728
(background: os tree "mangling" is what we do if a tarball with an OS
image inside it if is nested inside an extra top-level dir inside the
tarball, which we need to "mangle" and move everything inside one level
up)
This also drops the mkosi testuser from the wheel and systemd-journal
groups as the integration tests rely on the testuser not being to read
the full journal.
Mike Yuan [Thu, 30 Oct 2025 14:38:19 +0000 (15:38 +0100)]
core/exec-invoke: switch keep_fds to heap allocation
Hardcoding total size of the array is error-prone, especially
considering the exeuctable_fd is added far below, so the '4' is
not entirely obvious. Also we seldomly do VLAs.
Mike Yuan [Wed, 29 Oct 2025 20:25:42 +0000 (21:25 +0100)]
core/service: only pass socket fds to control processes
If socket is used as stdio, we'd currently imply EXEC_PASS_FDS
and dump the whole set of fds to the control processes. This is
pretty much unexpected and unnecessary though, instead let's
pass only the socket fds.
Yes, this is a compat break, but a relatively minor one I'd
argue. And we can always revisit things if users do complain.
Mike Yuan [Wed, 29 Oct 2025 20:20:26 +0000 (21:20 +0100)]
core/execute: merge n_storage_fds and n_extra_fds into stashed_fds
The distinction between fdstore and extra fds is only meaningful
to struct Service. As far as executor is concerned they're just
some fds to pass to the service. Let's just merge it hence,
for the sake of simplicity.