Ryan Wilson [Fri, 11 Oct 2024 20:38:58 +0000 (13:38 -0700)]
Add integration test for ExtraFileDescriptors after daemon-reexec
This commit adds a corresponding integration test for ExtraFileDescriptors
after systemctl daemon-reexec. This ensures systemd keeps the file
descriptors while the service manager is restarting and that we don't lose
the ability to restart the service correctly.
Matteo Croce [Fri, 4 Oct 2024 23:39:37 +0000 (01:39 +0200)]
timer: add unit tests for DeferReactivation
Create a unit test for systemd timer DeferReactivation config option.
The test works by creating a timer which fires every 5 seconds and
starts a unit which runs for 5 seconds.
With DeferReactivation=true, the timer must fire every 5+5 seconds,
instead of the 5 it fires normally.
As we need at least two timer runs to check if the delta is correct,
the test duration on success will be at least 20 seconds.
To be safe, the test script waits 35 seconds: this is enough to get
at least three runs but low enough to avoid clogging the CI.
Arthur Shau [Thu, 14 Mar 2024 19:43:13 +0000 (12:43 -0700)]
timer: introduce DeferReactivation setting
By default, in instances where timers are running on a realtime schedule,
if a service takes longer to run than the interval of a timer, the
service will immediately start again when the previous invocation finishes.
This is caused by the fact that the next elapse is calculated based on
the last trigger time, which, combined with the fact that the interval
is shorter than the runtime of the service, causes that elapse to be in
the past, which in turn means the timer will trigger as soon as the
service finishes running.
This behavior can be changed by enabling the new DeferReactivation setting,
which will cause the next calendar elapse to be calculated based on when
the trigger unit enters inactivity, rather than the last trigger time.
Thus, if a timer is on a realtime interval, the trigger will always
adhere to that specified interval.
E.g. if you have a timer that runs on a minutely interval, the setting
guarantees that triggers will happen at *:*:00 times, whereas by default
this may skew depending on how long the service runs.
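The difference between the two modes can be illustrated with a small model.
This is a simplified sketch, not systemd's implementation: times are plain
seconds, the calendar expression is reduced to a fixed interval, and the
function names are hypothetical.

```python
def next_calendar_elapse(after: int, interval: int = 60) -> int:
    """Return the first multiple of `interval` strictly after `after` (seconds)."""
    return (after // interval + 1) * interval


def next_trigger(last_trigger: int, inactive_at: int,
                 defer_reactivation: bool, interval: int = 60) -> int:
    """Model when the timer starts the unit again.

    last_trigger: when the timer last started the unit
    inactive_at:  when the unit finished running
    """
    # Hypothetical model: the base of the calculation is the only difference.
    base = inactive_at if defer_reactivation else last_trigger
    elapse = next_calendar_elapse(base, interval)
    # An elapse that is already in the past fires as soon as the unit is inactive.
    return max(elapse, inactive_at)


# Minutely timer, service started at 0 s and running for 90 s.
print(next_trigger(0, 90, defer_reactivation=False))  # 90
print(next_trigger(0, 90, defer_reactivation=True))   # 120
```

In this model the default mode restarts the service at 90 s, off the
minute boundary, while the deferred mode waits for the next boundary at 120 s.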
Let's remove stack directories and their lock files by workers if
possible.
Now, lock files must be created before creating stack directories, hence
lock files are moved to /run/udev/links.lock/ , e.g.,
Before:
/run/udev/links/disk\x2fby-diskseq\x2f1/.lock
After:
/run/udev/links.lock/disk\x2fby-diskseq\x2f1
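The two layouts can be sketched as follows. This is only an illustration of
the path change: the helper names are hypothetical, and the escaping is
reduced to replacing "/" with \x2f, which is all the example above needs.

```python
def escape_link_name(link: str) -> str:
    # Simplification for this sketch: only "/" is escaped (as \x2f).
    return link.replace("/", "\\x2f")


def lock_path_before(link: str) -> str:
    # Old layout: the lock file lives inside the stack directory.
    return "/run/udev/links/" + escape_link_name(link) + "/.lock"


def lock_path_after(link: str) -> str:
    # New layout: the lock file lives in a separate tree, so it can be
    # created before the stack directory exists.
    return "/run/udev/links.lock/" + escape_link_name(link)


print(lock_path_before("disk/by-diskseq/1"))
print(lock_path_after("disk/by-diskseq/1"))
```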
Matteo Croce [Fri, 11 Oct 2024 16:26:58 +0000 (18:26 +0200)]
report bpf_current_task_under_cgroup() errors to userspace
bpf_current_task_under_cgroup() returns 1 if the task is under the
specified cgroup, 0 if not, negative if an error happens.
Differentiate the 1 and -1 cases, and report to userspace when we get
an error.
An error like this is unlikely; the only common one is that
userspace doesn't populate the map, and the call returns -EAGAIN.
Tested by mocking the return value of bpf_current_task_under_cgroup():
Enumeration completed
enp1s0f0np0: Configuring with /etc/systemd/network/20-test.network.
Sysctl monitor BPF returned error: Link number out of range
Sysctl monitor BPF returned error: No CSI structure available
Sysctl monitor BPF returned error: Invalid exchange
Sysctl monitor BPF returned error: Exchange full
Sysctl monitor BPF returned error: Invalid request code
Sysctl monitor BPF returned error: Unknown error 58
Sysctl monitor BPF returned error: Device not a stream
Sysctl monitor BPF returned error: Timer expired
Sysctl monitor BPF returned error: Machine is not on the network
Sysctl monitor BPF returned error: Object is remote
Sysctl monitor BPF returned error: Advertise error
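The three-way result can be modelled as below. This is a userspace sketch
in Python, not the BPF program itself; the function name and the two
non-error strings are hypothetical, and only the error prefix is taken from
the log above. It shows why a plain truth test is not enough: a negative
value is also "true".

```python
import errno
import os


def describe_bpf_result(ret: int) -> str:
    """Map the helper's return value to a message, keeping 1 and <0 apart."""
    if ret < 0:
        # Negative values are -errno; the text comes from the C library.
        return "Sysctl monitor BPF returned error: " + os.strerror(-ret)
    if ret == 1:
        return "task is under the cgroup"
    return "task is not under the cgroup"


print(describe_bpf_result(1))
print(describe_bpf_result(0))
print(describe_bpf_result(-errno.EAGAIN))
```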
network/address: warn but ignore Broadcast= setting for an IPv6 address
Previously, the below was refused and the IPv6 address would not be assigned.
===
[Address]
Address=2001:db8:0:f101::15/64
Broadcast=192.168.0.255
===
However, in the following case, networkd warned that the broadcast
address would be ignored, and the IPv6 address would be configured.
===
[Address]
Broadcast=192.168.0.255
Address=2001:db8:0:f101::15/64
===
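An order-independent check in the spirit of the subject line (warn, ignore
Broadcast=, still configure the address) could look like the sketch below.
It is not networkd's parser: the function name, the warning text and the
tuple-based input are assumptions made for illustration.

```python
import ipaddress


def parse_address_section(settings):
    """settings: list of (key, value) pairs in file order.

    Returns (address, broadcast, warnings).
    """
    # Collect all keys first, so the result cannot depend on their order.
    values = dict(settings)
    address = ipaddress.ip_interface(values["Address"])
    broadcast = values.get("Broadcast")
    warnings = []
    if broadcast is not None and address.version == 6:
        # Warn but keep going: the address itself is still valid.
        warnings.append("Broadcast= is ignored for IPv6 address " + str(address))
        broadcast = None
    return address, broadcast, warnings


first = [("Address", "2001:db8:0:f101::15/64"), ("Broadcast", "192.168.0.255")]
second = list(reversed(first))
print(parse_address_section(first) == parse_address_section(second))
```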
I don't think list is particularly useful here. The passed fds are
constant for the lifetime of service, and with this commit we track
the number of extra fds in a dedicated var anyway.
This is a new syscall provided by the kernel used to implement faster
uprobes. It's not supposed to be called by userspace, but only by kernel
generated uprobe code.
It should be fine to allow this, as the kernel authenticates the
invocation itself, and we shouldn't break compat with things.
Note that this allowlisting is not sufficient to make uretprobe() work.
libseccomp must be taught the syscall too, but this can happen
independently.
smbios: move validation of SMBIOS table sizes fully into get_smbios_table()
We do half a validation currently ourselves (i.e. check the header fits
into the rest of the data), and leave the other half to the
caller (i.e. check the table fits into the rest of the data).
get_smbios_table() is changed to accept the minimum object size and
validates it before returning a table.
Daan De Meyer [Thu, 10 Oct 2024 13:54:57 +0000 (15:54 +0200)]
stdio-bridge: Use customized log message for forwarding bus
Let's more clearly indicate that we failed to set up the server
which forwards messages from the remote client to the local bus
instead of logging a generic bus client message.
Daan De Meyer [Wed, 9 Oct 2024 10:10:44 +0000 (12:10 +0200)]
bus-util: Move geteuid() check out of bus_connect_system_systemd()
Let's move this check to bus_connect_transport_systemd() so that
bus_connect_system_systemd() will only ever connect to the
private manager bus instance and fail otherwise.
Daan De Meyer [Wed, 9 Oct 2024 09:44:34 +0000 (11:44 +0200)]
bus-util: Drop fallback to system/user bus if manager bus doesn't work
We have various callsites that explicitly need the manager bus and
won't work with the system bus, like daemon-reexec and friends which
can't properly wait until the operation has finished unless using the
manager bus.
If we silently fall back to the system bus for these operations, we
can end up with rather hard to debug issues so let's remove the fallback
as it was added back in 2013 in a6aa89122d2fa5e811a72200773068c13bfffea2
without a clear explanation of why it was needed (I expect as a fallback
if kdbus wasn't available but that's not a thing anymore these days).
Daan De Meyer [Wed, 9 Oct 2024 14:37:06 +0000 (16:37 +0200)]
update-utmp: Make reconnect logic more robust
We might also fail to connect to the private manager bus itself if
the daemon-reexec is still ongoing, so let's handle that as well by
retrying on ECONNREFUSED.
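The retry idea can be sketched generically as below. This is not the
update-utmp code: the function name, the attempt count and the delay are
hypothetical, and `connect` stands for any callable that opens the
connection. Python raises ConnectionRefusedError for ECONNREFUSED.

```python
import time


def connect_with_retry(connect, attempts: int = 5, delay: float = 0.1):
    """Call connect() until it succeeds, retrying only on ECONNREFUSED."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionRefusedError:
            # Give up after the last attempt; other errors propagate at once.
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
```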
Daan De Meyer [Wed, 9 Oct 2024 12:49:07 +0000 (14:49 +0200)]
mkosi: Fix up ownership of testuser home directory on first boot
When building unprivileged, the testuser home directory ends up
owned by root:root because mkosi can't chown directories to other
owners when running unprivileged. So let's fix up the testuser
ownership on first boot with tmpfiles instead.
json: add builder/dispatcher for PidRef → JSON and back
So far, at the one place we sent a PID over Varlink we did so as a
simple numeric pid_t value. That is of course racy, since classic PIDs
are recycled too eagerly.
Let's address that, by passing around JSON objects distantly resembling our
PidRef structure. Note that this JSON object does *not* contain the
pidfd, however, but just the pidfd inode number if known.
I originally planned to include the pidfd in some direct form, but I
figured that's not really the best idea, since we always need a
side-channel of some form for that (i.e. AF_UNIX/SCM_RIGHTS), but we
should be able to report about PIDs even without that.
Moreover, while sending the pid number and pidfd id around should always
be OK to do, it's a lot more problematic to always send a pidfd around,
since that implies that fd passing is on and it is OK to install fds
remotely in some IPC peer's fd table. For example, when doing a wild dump
of service manager service state we really shouldn't end up with a bunch
of fds installed in our client's fd table.
Hence, all in all I think it is cleaner to define a structure carrying
pid number and pidfd inode id, which is passed directly as JSON. And then
optionally, in a separate field also pass around a pidfd where it makes
sense.
Note that sending around pidfds is not that beneficial anymore if we
have the pidfd inode id, because we can always securely and reliably get
a pidfd back from a pair of pid + inode id: first we do pidfd_open() on
the pid, and then we check if it is really the right one by comparing
.st_ino after fstat().
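The check described above can be sketched in Python, whose os.pidfd_open()
wraps the same syscall (Linux only, Python 3.9 or newer). The helper names
are hypothetical, and unlike a real implementation this sketch closes the
pidfd right away instead of handing it to the caller. Inode numbers only
identify a process on kernels where each pidfd has its own inode.

```python
import os
from typing import Optional


def pidfd_inode(pid: int) -> Optional[int]:
    """Return the inode number of a fresh pidfd for `pid`.

    Returns None if no pidfd could be obtained (process gone, or no
    pidfd support on this platform).
    """
    opener = getattr(os, "pidfd_open", None)
    if opener is None:
        return None
    try:
        fd = opener(pid)
    except OSError:
        return None
    try:
        return os.fstat(fd).st_ino
    finally:
        os.close(fd)


def pid_matches_inode(pid: int, expected_inode: int) -> Optional[bool]:
    """True/False if the PID could be checked, None if it could not."""
    inode = pidfd_inode(pid)
    if inode is None:
        return None
    return inode == expected_inode
```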
This logic is implemented gracefully: if for some reason pidfd/pidfd
inode nrs are not available (too old kernel), we'll fall back to plain
PID numbers.
The dispatching logic knows two distinct levels of validation of the
provided PID data: if SD_JSON_STRICT is specified we'll acquire a pidfd
for the PID, thus verifying it currently exists and failing if it
doesn't. If the flag is not set, we'll just store the provided info
as-is, and try to acquire a pidfd for it, but not fail if we cannot.
Both modes are important in different contexts.
Also note that in addition to the pidfd inode nr we always store the
current boot ID of the system in the JSON object, since only the
combination of pidfd inode nr and boot ID of the system really is a
world-wide unique reference to a process.
When dispatching a JSON pid field we operate somewhat gracefully: we
either support the triplet structure of pid, pid inode nr, boot id, or
we accept a simple classic UNIX pid.
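Accepting both shapes could look roughly like this. It is a sketch in
Python rather than the C dispatcher, and the key names "pid", "pidfdId"
and "bootId" as well as the PidRefInfo type are assumptions made for this
example.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class PidRefInfo:
    pid: int
    pidfd_id: Optional[int] = None   # pidfd inode number, if known
    boot_id: Optional[str] = None    # boot ID the inode number belongs to


def dispatch_pid_field(value) -> PidRefInfo:
    """Accept either a plain numeric PID or the pid/inode/boot-ID triplet."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        fields = {"pid": value}
    elif isinstance(value, dict):
        fields = value
    else:
        raise TypeError("expected a PID number or a JSON object")

    pid = fields.get("pid")
    if not isinstance(pid, int) or isinstance(pid, bool) or pid <= 0:
        raise ValueError("'pid' must be a positive integer")
    # "pidfdId" and "bootId" are assumed key names for this sketch.
    return PidRefInfo(pid, fields.get("pidfdId"), fields.get("bootId"))


print(dispatch_pid_field(json.loads("4711")))
print(dispatch_pid_field(json.loads(
    '{"pid": 4711, "pidfdId": 1234, "bootId": "example-boot-id"}')))
```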
varlink-idl: introduce c/.h file for common varlink IDL structures
We'll use some structures in various varlink interfaces, so move them to a
common .c/.h file. For now this is only the dual timestamp object, but
there will be more soon.
Daan De Meyer [Thu, 10 Oct 2024 20:37:39 +0000 (22:37 +0200)]
rpm/systemd-update-helper: Use systemctl reload to reexec/reload user managers
Let's always use systemctl reload to reexec and reload user managers
now that it always implies a reexec. This moves all the job management
logic to pid 1 instead of bash and reduces the complexity of the logic
as we remove systemd-run, pam and systemd-stdio-bridge from the equation.
Mike Yuan [Thu, 10 Oct 2024 19:32:17 +0000 (21:32 +0200)]
units/{user,capsule}@.service: issue daemon-reexec when notify-reloading
Closes #28367 (but not really in the exact form, see below)
We have the problem of restarting all user manager instances
after upgrade. Current approaches involve systemctl kill
with SIGRTMIN+25, which is async and feels rather ugly [1][2];
or systemctl --machine=user@ --user, which requires entering
each user session. Neither is particularly elegant.
Instead, let's just signal daemon-reexec when user@.service
is reloaded from the system manager. Our long-term goal of dropping
daemon-reload in favor of reexec (see TODO) is unlikely to happen
due to user dbus restrictions, but here the synchronization
is done via READY=1.
#28367 would not really work for us now I come to think about it,
because all processes will be reparented to pid1 as soon as
original user manager process exits. This alternative approach
seems good enough for our use case.
Mike Yuan [Thu, 10 Oct 2024 19:06:35 +0000 (21:06 +0200)]
core/manager-serialize: drop serialization for Manager.ready_sent
This field indicates whether READY=1 has been sent to
the service manager/supervisor. Whenever we reload/reexec/soft-reboot,
manager_send_reloading() always resets it to false first,
so that READY=1 is sent after reloading finishes. Hence
we always end up with "false" anyway. Kill it.
The offending commit wrongly assumed that the second READY=1
notification is for system scope only, but it also serves the purpose
of flushing out previous STATUS= containing user unit job status.
Uday Shankar [Thu, 10 Oct 2024 20:29:10 +0000 (14:29 -0600)]
udev: allow persistent storage rules for ublk devices
Tools such as lsblk which query the udev database instead of probing
devices directly fail when run on ublk devices. For instance, in the
following commands, the partition type is missing, despite the fact that
/dev/ublkb0 was just partitioned with a single Linux filesystem type
partition.
$ lsblk /dev/ublkb0
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
ublkb0 259:0 0 31.3G 0 disk
└─ublkb0p1 259:1 0 31.2G 0 part
$ lsblk -o pkname,parttype /dev/ublkb0
PKNAME PARTTYPE
ublkb0
This happens because ublk devices are missing from a couple of
whitelists in the udev rules which are responsible for populating the
database with the data lsblk is looking for. Add the ublk devices to
these whitelists.
David Rheinsberg [Fri, 11 Oct 2024 07:53:25 +0000 (09:53 +0200)]
docs/DESKTOP_ENVIRONMENTS: fix formatting
The annotation about omittance is meant to be about the `RANDOM` string.
However, the current formatting makes it look like the entire naming
scheme is optional. Fix this.
Yu Watanabe [Thu, 10 Oct 2024 03:30:41 +0000 (12:30 +0900)]
sd-netlink: various cleanups
- use uint8_t, uint16_t, and so on, rather than unsigned char, unsigned
short, and so on, respectively,
- rename output parameters to ret or ret_xyz,
- add several missing assertions.
man: reword comment a bit regarding ExecStartPre= multiple commands
The documentation claimed that ExecStartPre=/ExecStartPost= accepts
multiple command lines, in contrast to ExecStart=. This is half an
untruth, because ExecStart= allows that too – as long as Type=oneshot is
set.
Hence, reword this a bit, and do not emphasize the contrast.
Ivan Kruglov [Thu, 10 Oct 2024 09:51:57 +0000 (11:51 +0200)]
machine: switch to use PidRef when lookup machine by pid in dbus and varlink interfaces
This commit introduces manager_get_machine_by_pidref() as a replacement for manager_get_machine_by_pid()
and moves the surrounding code to use PidRef.