json: add builder/dispatcher for PidRef → JSON and back
So far, at the one place we sent a PID over Varlink we did so as a
simple numeric pid_t value. That's of course is racy, since classic PIDs
are recycled too eagerly.
Let's address that, by passing around JSON objects distantly resembling our
PidRef structure. Note that this JSON object does *not* contain the
pidfd, however, but just the pidfd inode number if known.
I originally planned to include the pidfd in some direct form, but I
figured that's not really the best idea, since we always need a
side-channel of some form for that (i.e. AF_UNIX/SCM_RIGHTS), but we
should be able to report about PIDs even without that.
Moreover, while sending the pid number and pidfd id around should always
be OK to do, it's a lot more problematic to always send a pidfd around,
since that implies that fd passing is on and it is OK to install fds
remotely in some IPC peers fd table. For example, when doing a wild dump
of service manager service state we really shouldn't end up with a bunch
of fds installed in our client's fd table.
Hence, all in all I think it is cleaner to define a structure carrying
pid number and pidfd inode id, wich is passed directly as JSON. And then
optionally, in a separate field also pass around a pidfd where it makes
sense.
Note that sending around pidfds is not that beneficial anymore if we
have the pidfd inode id, because we can always securely and reliably get
a pidfd back from a pair of pid + inode id: first we do pidfd_open() on
the pid, and then we check if it is really the right one by comparing
.st_ino after fstat().
This logic is implemented gracefully: if for some reason pidfd/pidfd
inode nrs are not available (too old kernel), we'll fall back to plain
PID numbers.
The dispatching logic knows two distinct levels of validation of the
provided PID data: if SD_JSON_STRICT is specified we'll acquire a pidfd
for the PID, thus verifying it currently exists and failing if it
doesn't. If the flag is not set, well just store the provided info
as-is, will try to acquire a pidfd for it, but not fail if we cannot.
Both modes are important in different contexts.
Also note that in addition to the pidfd inode nr we always store the
current boot ID of the system in the JSON object, since only the
combination of pidfd inode nr and boot ID of the system really is a
world-wide unique reference to a process.
When dispatching a JSON pid field we operate somewhat gracefully: we
either support the triplet structure of pid, pid inode nr, boot id, or
we accept a simple classic UNIX pid.
varlink-idl: introduce c/.h file for common varlink IDL structures
Some structures we'll use in various varlink interfaces, move them to a
common .c/.h file. For now this is only the dual timestamp object, but
there will be more soon.
Uday Shankar [Thu, 10 Oct 2024 20:29:10 +0000 (14:29 -0600)]
udev: allow persistent storage rules for ublk devices
Tools such as lsblk which query the udev database instead of probing
devices directly fail when run on ublk devices. For instance, in the
following commands, the partition type is missing, despite the fact that
/dev/ublkb0 was just partitioned with a single Linux filesystem type
partition.
$ lsblk /dev/ublkb0
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
ublkb0 259:0 0 31.3G 0 disk
└─ublkb0p1 259:1 0 31.2G 0 part
$ lsblk -o pkname,parttype /dev/ublkb0
PKNAME PARTTYPE
ublkb0
This happens because ublk devices are missing from a couple of
whitelists in the udev rules which are responsible for populating the
database with the data lsblk is looking for. Add the ublk devices to
these whitelists.
David Rheinsberg [Fri, 11 Oct 2024 07:53:25 +0000 (09:53 +0200)]
docs/DESKTOP_ENVIRONMENTS: fix formatting
The annotation about omittance is meant to be about the `RANDOM` string.
However, the current formatting makes it look like the entire naming
scheme is optional. Fix this.
Yu Watanabe [Thu, 10 Oct 2024 03:30:41 +0000 (12:30 +0900)]
sd-netlink: various cleanups
- use uint8_t, uint16_t, and so on, rather than unsigned char, unsigned
short, and so on, respectively,
- rename output parameters to ret or ret_xyz,
- add several missing assertions.
man: reword comment a bit regarding ExecStartPre= multiple commands
The documentation claimed that ExecStartPre=/ExecStartPost= accepts
multiple command lines, in contrast to ExecStart=. This is half an
untruth, because ExecStart= allows that too – as long as Type=oneshot is
set.
Hence, reword this a bit, and do not emphasize the contrast.
Yu Watanabe [Wed, 9 Oct 2024 01:07:31 +0000 (10:07 +0900)]
login: provide delayed action in ScheduledShutdown property
Even though we can get the existence of delayed action through
PreparingForShutdownWithMetadata property or friends, for consistency
with CancelScheduledShutdown() method, it is better to also provide the
information through ScheduledShutdown property.
Tobias Fleig [Tue, 8 Oct 2024 14:54:43 +0000 (07:54 -0700)]
stub: Add support for .initrd addon files
Teaches systemd-stub how to load additional initrds from addon files.
This is very similar to the support for .ucode sections in addon files,
but with different ordering. Initrds from addons have a chance to
overwrite files from the base initrd in the UKI.
WilliButz [Fri, 4 Oct 2024 17:51:57 +0000 (19:51 +0200)]
repart: derive hash partition size from SizeMaxBytes= of data sibling
This change makes it possible for repart to create dm-verity hash
partitions for a custom amount of protected data. When the property
`SizeMaxBytes=` is specified for a dm-verity data partition, the size
of the corresponding hash partition is set to accommodate hash data
for this maximum size, rather than the actual contents its data
sibling. However, the contained hash data continues to be generated
from said sibling.
hwdb: move key 66/65 handling from specific to generic HP laptop coverage
This takes the idea from #18595 and implements it based on our current
hwdb: the original PR suggested the keys 66/65 are a generic HP thing,
and not limited to specific laptops. The current specific laptop entries
do not contradict that claim.
Hence, let's move them from the specific sections matching some HP
laptops to the generic section matching all.
This uses the correct key names, which have long been fixed (which used
to be a problem our CI was tripped off by).
This is not tested, but I think fairly risk-less, and should allow us to
get rid of a really old PR.
Chen Guanqiao [Wed, 2 Oct 2024 05:10:21 +0000 (13:10 +0800)]
mount: optimize mountinfo traversal by decoupling device discovery
In mount_load_proc_self_mountinfo(), device_found_node() is synchronously called
during the traversal of mountinfo entries. When there are a large number of
mount points, and the device types are not significantly different, this results
in excessive time consumption during device discovery, causing a performance
bottleneck. This issue is particularly prominent on servers with a large number
of cores in IDC.
This patch decouples device discovery from the mountinfo traversal process,
avoiding redundant device operations. As a result, it significantly improves
performance, especially in environments with numerous mount points.
The documentation says the option takes a boolean or one of the "self"
and "identity". But the parser uses private_users_from_string() which
also accepts "off". Let's drop the implicit support of "off".
Yu Watanabe [Mon, 7 Oct 2024 21:19:04 +0000 (06:19 +0900)]
core: suppress one debugging log
Otherwise, the log is shown even when getting properties.
Even though it is in the debug level, that's quite noisy.
[ 338.785847] TEST-55-OOMD.sh[1624]: Oct 07 16:35:15 H systemd[1]: TEST-55-OOMD-testmunch.service: Unit not running in private mount namespace, cannot live mount
[ 338.786985] TEST-55-OOMD.sh[1624]: Oct 07 16:35:17 H systemd[1]: TEST-55-OOMD-testmunch.service: Unit not running in private mount namespace, cannot live mount
[ 338.787412] TEST-55-OOMD.sh[1624]: Oct 07 16:35:20 H systemd[1]: TEST-55-OOMD-testmunch.service: Unit not running in private mount namespace, cannot live mount
[ 338.791776] TEST-55-OOMD.sh[1624]: Oct 07 16:35:22 H systemd[1]: TEST-55-OOMD-testmunch.service: Unit not running in private mount namespace, cannot live mount
[ 338.792938] TEST-55-OOMD.sh[1624]: Oct 07 16:35:24 H systemd[1]: TEST-55-OOMD-testmunch.service: Unit not running in private mount namespace, cannot live mount
[ 338.793225] TEST-55-OOMD.sh[1624]: Oct 07 16:35:26 H systemd[1]: TEST-55-OOMD-testmunch.service: Unit not running in private mount namespace, cannot live mount
[ 338.793424] TEST-55-OOMD.sh[1624]: Oct 07 16:35:28 H systemd[1]: TEST-55-OOMD-testmunch.service: Unit not running in private mount namespace, cannot live mount
[ 338.796448] TEST-55-OOMD.sh[1624]: Oct 07 16:35:31 H systemd[1]: TEST-55-OOMD-testmunch.service: Unit not running in private mount namespace, cannot live mount
[ 338.797997] TEST-55-OOMD.sh[1624]: Oct 07 16:35:33 H systemd[1]: TEST-55-OOMD-testmunch.service: Unit not running in private mount namespace, cannot live mount
[ 338.799206] TEST-55-OOMD.sh[1624]: Oct 07 16:35:35 H systemd[1]: TEST-55-OOMD-testmunch.service: Unit not running in private mount namespace, cannot live mount
The method was added with migration of resources in mind (e.g. process's
allocated memory will follow it to the new scope), however, such a
resource migration is not in cgroup semantics. The method may thus have
the intended users and others could be guided to StartTransientUnit().
Since this API was advertised in a regular release, start the removal
with a deprecation message to callers.
Eventually, the goal is to remove the method to clean up DBus API and
simplify code (removal of cgroup_context_copy()).
Part of DBus docs is retained to satisfy build checks.
Catch up with the nice little toys the kernel fs developers have added
for us. Preferably, let's make use of the new F_DUPFD_QUERY fcntl() call
that checks whether two fds are just duplicates of each other
(duplicates as in dup(), not as in open() of the same inode, i.e.
whether they share a single file offset and so on).
This API is much nicer, since it is a core kernel feature, unlike the
kcmp() call we so far used, which is part of the (optional)
checkpoint/restore stuff.