Virtual block devices are a bit weird: they have no parent device, and
thus cannot be related to the subsystem they belong to, except by
pattern matching their name. This is OK to do if one knows what to look
for. However for tools that do not want to carry a list of known
subsystems with their appropriate matching patters this sucks. Let's
introduce a new ID_BLOCK_SUBSYSTEM property we can set on block devices
that carries an explicit string for this. Do so for a small number of
key subsystems: DM, loopback and zram.
repart: if device node is specified as "-", calculate needed disk space
So far repart always required specification of a device node. And if
none was specified, then we'd fine the node backing the root fs. Let's
optionally allow that the device node is explicitly not specified (i.e.
specified as "-" or ""), in which case we'll just print the size of the
minimal image given the definitions.
repart: move definitions + dry_run + empty fields into Context
This is preparation for making this eventually available via Varlink,
where we'd like to create Context object for each call that we can free
once it is done, but not inherit state from an earlier call.
Also fixes a couple of cases where we accessed arg_node, but where we
should have accessed the Context-specific copy in .node.
Let's use the proper uint32_t parsers initially, so that the usual logic
of formatting integers as decimal strings, works too for uids/gids. Not
because it made any sense to encode them like that, but just to be
systematic here.
sd-json: make sure all dispatch helpers do something sensible in case of "null" JSON value
Most of our dispatch helpers already do something useful in case they
are invoked on a null JSON value: they translate this to the appropriate
niche value for the type, if there is one.
Add the same for *all* dispatchers we have, to make this fully
systematic.
For various types it's not always clear which niche value to pick. I
opted for UINT{8,16,32,64}_MAX for the various unsigned integers, which
maps our own use in most cases. I opted for -1 for the various signed
integer types. For arrays/blobs of stuff I opted for the empty
array/blob, and for booleans I opted for false.
Of course, in various cases this is not going to be the right niche
value, but that's entirely fine, after all before a json value reaches a
dispatcher function it must pass one of two type checks first:
1. Either the .type field of sd_json_dispatch_field must be
_SD_JSON_VARIANT_TYPE_INVALID to not do a type check at all
2. Or the .type field is set, but then the SD_JSON_NULLABLE flag must be
set in .flags.
This means, accidentally generating the niche values on null is not
really likely.
Daan De Meyer [Wed, 29 Oct 2025 21:39:48 +0000 (22:39 +0100)]
parse-util: Add parse_capability_set()
Let's extract common capability parsing code into a generic function
parse_capability_set() with a comprehensive set of unit tests.
We also replace usages of UINT64_MAX with CAP_MASK_UNSET where
applicable and replace the default value of CapabilityBoundingSet
with CAP_MASK_ALL which more clearly identifies that it is initialized
to all capabilities.
AI (copilot) was used to extract the generic function and write the
unit tests, with manual review and fixing afterwards to make sure
everything was correct.
Luca Boccassi [Fri, 31 Oct 2025 16:46:49 +0000 (16:46 +0000)]
test: add test case for verity deferred removal without sharing
I recently found out (the hard way) that on an older version
there was a bug when the verity sharing is disabled: the
deferred close flag was not set correctly, so verity devices
were leaked.
This is not an issue in main currently, but add a test case
to cover it just in case, to avoid future regressions.
systemctl: downgrade or silence warnings for --now
When calling systemctl enable/disable/reenable --now, we'd always fail with
error when operating offline. This seemly overly restricitive. In particular,
if systemd is not running at all, the service is not running either, so
complaining that we can't stop it is completely unnecessary. But even when
operating in a chroot where systemd is not running, let's just emit a warning
and return success. It's fairly common to have installation or package scripts
which do such calls and not starting/restarting the service in those scenarios
is the desired and expected operation. (If --now is called in combination
with --global or --root=, keep returning an error.)
Also make the messages nicer. I was adding some docs to tell the user to run
'systemctl enable --now', and checked how the command can fail, and the error
message that the user might see in some common scenarios was too complicated.
Split it up to be nicer.
Daan De Meyer [Fri, 31 Oct 2025 21:30:46 +0000 (22:30 +0100)]
core: Add RootDirectoryFileDescriptor= (#39480)
RootDirectory= but via a open_tree() file descriptor. This allows
setting up the execution environment for a service by the client in a
mount namespace and then starting a transient unit in that execution
environment using the new property.
We also add --root-directory= and --same-root-dir= to systemd-run to
have it run services within the given root directory. As systemd-run
might be invoked from a different mount namespace than what systemd is
running in, systemd-run opens the given path with open_tree() and then
sends it to systemd using the new RootDirectoryFileDescriptor= property.
Yu Watanabe [Fri, 31 Oct 2025 14:03:14 +0000 (23:03 +0900)]
reread-partition-table: take exclusive lock when requested
Before aa47d8ade18cc4a079fef5a1aaa37d763507104e, we took an exclusive lock
for the whole block device, but with the commit, a shared lock is taken.
That causes, during we requesting the kernel to reread partition table,
udev workers can process the block device or its partitions.
Let's make udev workers not process block devices during rereading
partition table again.
Daan De Meyer [Tue, 28 Oct 2025 22:47:26 +0000 (23:47 +0100)]
core: Add RootDirectoryFileDescriptor=
RootDirectory= but via a open_tree() file descriptor. This allows
setting up the execution environment for a service by the client in
a mount namespace and then starting a transient unit in that execution
environment using the new property.
We also add --root-directory= and --same-root-dir= to systemd-run to
have it run services within the given root directory. As systemd-run
might be invoked from a different mount namespace than what systemd is
running in, systemd-run opens the given path with open_tree() and then
sends it to systemd using the new RootDirectoryFileDescriptor= property.
importd: support OS tree "mangling" unpriv too (#39406)
Split out of #38728
(background: os tree "mangling" is what we do if a tarball with an OS
image inside it if is nested inside an extra top-level dir inside the
tarball, which we need to "mangle" and move everything inside one level
up)
This also drops the mkosi testuser from the wheel and systemd-journal
groups as the integration tests rely on the testuser not being to read
the full journal.
Mike Yuan [Thu, 30 Oct 2025 14:38:19 +0000 (15:38 +0100)]
core/exec-invoke: switch keep_fds to heap allocation
Hardcoding total size of the array is error-prone, especially
considering the exeuctable_fd is added far below, so the '4' is
not entirely obvious. Also we seldomly do VLAs.
Mike Yuan [Wed, 29 Oct 2025 20:25:42 +0000 (21:25 +0100)]
core/service: only pass socket fds to control processes
If socket is used as stdio, we'd currently imply EXEC_PASS_FDS
and dump the whole set of fds to the control processes. This is
pretty much unexpected and unnecessary though, instead let's
pass only the socket fds.
Yes, this is a compat break, but a relatively minor one I'd
argue. And we can always revisit things if users do complain.
Mike Yuan [Wed, 29 Oct 2025 20:20:26 +0000 (21:20 +0100)]
core/execute: merge n_storage_fds and n_extra_fds into stashed_fds
The distinction between fdstore and extra fds is only meaningful
to struct Service. As far as executor is concerned they're just
some fds to pass to the service. Let's just merge it hence,
for the sake of simplicity.
Daan De Meyer [Thu, 30 Oct 2025 11:28:19 +0000 (12:28 +0100)]
run0: Add --empower
--empower gives full privileges to a non-root user. Currently this
includes all capabilities but we leave the option open to add more
privileges via this option in the future.
Why is this useful? When running privileged development or debugging
commands from your home directory (think bpftrace, strace and such),
you want any files written by these tools to be owned by your current
user, and not by the root user. run0 --empower will allow you to run
all privileged operations (assuming the tools check for capabilities
and not UIDs), while any files written by the tools will still be owned
by the current user.
This creates a chicken-and-egg problem: we stuff the pcrlock policy into
a credential in the ESP, but credentials get measured into PCR 12, hence
PCR 12 is both input and output of the pcrlock logic, which makes
impossible to calculate.
Let's drop PCR 12 for now.
(We might want to pass the policy some other way one day, to avoid this,
but that's something for another day.)
Note that this still allows locking to PCR12 if people want to (for
example because they don't need this for the rootfs, and hence need no
cred passing via the ESP), this hence only changes the default, nothing
more.
Yu Watanabe [Thu, 23 Oct 2025 02:19:52 +0000 (11:19 +0900)]
network/sysctl: logs when per-link IPMasquerade= setting changes the global IPv6Forwarding= setting
All other cases, settings on different interfaces are completely
independent. But IPMasquerade=yes on an interface enables the global
IPv6Forwarding= setting, and hence affects other interfaces.
Let's log about that.
Prompted by https://github.com/systemd/systemd/issues/39304#issuecomment-3430382233.
* ea1d871ecd Add missing networkd socket units
* b76b5da2e6 Merge #214 `Drop backwards compat logic from integration tests script`
* 7208fa2b1b Require systemd-rpm-macros for build
* 2e1a6c7474 Require python3-zstandard in ELN
* 79c9db1bc8 Require systemd-libs and systemd-shared to be in the same version
* db38445a7e Drop two patches with workaround (selinux, kernel)
* 593a204189 Version 258.1
* a3e9e27982 Change '%{systemd}' to systemd in Conflicts/Provides/Requires/Recommends
* 88877a4184 Require systemd-networkd and systemd-udev to be in the same version
* 8a446daec7 Version 258 💝
* cceac93491 Pre-create /etc/userdb directory
* b442086d5f Version 258~rc4
* 327e54e421 Add to patch to create userdb root directory with correct label
* 2289d65726 Fix unit name in scriptlet
* 5acde9f1fd Add workaround patch to hopefully pass podman CI tests
* 1f5ed0da1f Version 258~rc3
* 50936458a7 obs: move recipe files in place
* 1bdb4efe40 obs: switch to xz for compression
* be7a4d0863 Version 258~rc2
* 2ace9416e8 obs: also use version with tilde for Source0
* 8d1645af75 Use again %{version} when building in OBS
* 98cc5fd91a Version 258~rc1
* ed7d2f1132 Add "test" that LTO effectively removes unused code from shared lib
* 40b38a04d2 Build docs on 64-bit architectures only
* 5d30fd3b26 Version 257.7
Daan De Meyer [Tue, 28 Oct 2025 21:54:14 +0000 (22:54 +0100)]
mount-util: Iterate mountinfo backwards when unmounting
Submounts will always be located further in the mountinfo file, so
when we're unmounting, iterating backwards is likely to be more
efficient than iterating forwards. It'll also reduce the amount of
EBUSY debug logging we'll get since we'll stop trying to unmount
parent mounts with submounts which will always fail with EBUSY.
If you're using `udevadm monitor` from a script, without a tty, then
libc defaults to being fully-buffered, and won't flush stdout after
newlines. This is fine for tools that dump a bunch of data and then
exit immediately. It's a problem for tools like `udevadm monitor` which
have long pauses: the buffered data can get stuck in the buffer for an
unbounded amount of time.
In the Cockpit project we've been working around this for some time with
`stdbuf` which is a `LD_PRELOAD` hack to change the libc buffering
behaviour, but we'd like to stop doing that.
Let's make sure we flush the buffer after each event.