maia x. [Mon, 6 Jan 2025 18:31:44 +0000 (10:31 -0800)]
core: reload confexts when reloading notify-reload services
`ExtensionImages=` and `ExtensionDirectories=` now let you specify
vpick-named extensions; however, since they just get set up once when
the service is started, you can't see newer versions without restarting
the service entirely. Here, also reload confext extensions when you
reload a service. This allows you to deploy a new version of some
configuration and have it picked up at reload time without interruption
to your workload.
Right now, we would only reload confext extensions and leave the sysext
ones behind, since it didn't seem prudent to swap out what is likely
program code at reload. This is made possible by only going for the
`SYSTEMD_CONFEXT_HIERARCHIES` overlays (which only contains `/etc`).
Implementation wise, this uses the new kernel API and two collaborating
child processes under the host & child namespaces in order to gather the
right FDs needed:
- (1) In child, set up the extension images and directories in a slave
mountns, and obtain their FDs.
- (2) Fork into a grandchild under target process namespace, and do a
"fake" unmount to obtain the FD of the underlying target folder
say /etc).
- (3) In the child again, set up new overlay under host NS rights.
We do not want to do I/O heavy jobs inline in PID1 blocking the state
machine, so add separate async states to handle this case.
maia x. [Tue, 17 Dec 2024 22:24:02 +0000 (14:24 -0800)]
mkosi: add socat to minimal-0/1 images
Include 'socat' as part of the tools built with minimal-0/minimal-1
images used for integration testing. This would allow notify-reload
services to signal reloading without using the systemd-notify tool
and other socket communication test cases.
Note that 'ncat', although included in the minimal image, cannot be
used because it dies before systemd can figure out the unit it's
from. See 67e5622bfe037603e47f38cf02e58d522b3c18f7.
journald: rename primary object from "Server" to "Manager"
In all our daemons the primary entrypoint object is called "Manager".
But so far there was one exception: in journald it was called "Server".
Let's normalize that, and stick to the same nomenclature everywhere, to
make journald less special.
Mike Yuan [Tue, 13 May 2025 20:58:02 +0000 (22:58 +0200)]
fork-journal: use char* const* for strv input param
This is compatible with char** and is what I originally
asked for in
https://github.com/systemd/systemd/pull/36858#discussion_r2086792739
Someone needs to read better ;-)
Yu Watanabe [Tue, 13 May 2025 14:02:13 +0000 (23:02 +0900)]
login,udev: avoid race between systemd-logind and systemd-udevd in setting ACLs
Previously, both udevd and logind modifies ACLs of a device node. Hence,
there exists a race something like the following:
1. udevd reads an old state file,
2. logind updates the state file, and apply new ACLs,
3. udevd applies ACLs based on the old state file.
This makes logind not update ACLs but trigger uevents for relevant
devices to make ACLs updated by udevd.
Yu Watanabe [Tue, 13 May 2025 14:50:22 +0000 (23:50 +0900)]
login: do not call manager_process_seat_device() more than once per event
When udevd broadcasts an event for e.g. a graphics device with master-of-seat
tag, then previously manager_process_seat_device() was called twice for
the event.
With this commit, the function is called only once even for an event for
such device.
This returns to the original approach proposed in
https://github.com/systemd/systemd/pull/17270. After review, the approach was
changed to use sd_pid_get_owner_uid() instead. Back then, when running in a
typical graphical session, sd_pid_get_owner_uid() would usually return the user
UID, and when running under sudo, geteuid() would return 0, so we'd trigger the
secure path.
sudo may allocate a new session if is invoked outside of a session (depending
on the PAM config). Since nowadays desktop environments usually start the user
shell through user units, the typical shell in a terminal emulator is not part
of a session, and when sudo is invoked, a new session is allocated, and
sd_pid_get_owner_uid() returns 0 too. Technically, the code still works as
documented in the man page, but in the common case, it doesn't do the expected
thing.
$ build/test-sd-login |& rg 'get_(owner_uid|cgroup|session)'
sd_pid_get_session(0) → No data available
sd_pid_get_owner_uid(0) → 1000
sd_pid_get_cgroup(0) → /user.slice/user-1000.slice/user@1000.service/app.slice/app-ghostty-transient-5088.scope/surfaces/556FAF50BA40.scope
I think it's worth checking for sudo because it is a common case used by users.
There obviously are other mechanims, so the man page is extended to say that
only some common mechanisms are supported, and to (again) recommend setting
SYSTEMD_LESSSECURE explicitly. The other option would be to set "secure mode"
by default. But this would create an inconvenience for users doing the right
thing, running systemctl and other tools directly, because then they can't run
privileged commands from the pager, e.g. to save the output to a file. (Or the
user would need to explicitly set SYSTEMD_LESSSECURE. One option would be to
set it always in the environment and to rely on sudo and other tools stripping
it from the environment before running privileged code. But that is also fairly
fragile and it obviously relies on the user doing a complicated setup to
support a fairly common use case. I think this decreases usability of the
system quite a bit. I don't think we should build solutions that work in
priniciple, but are painfully inconvenient in common cases.)
compress: deal with zstd decoder issues gracefully
If zstd frames are corrupted the initial size returned for the current
frame might be wrong. Don#t assert() on that, but handle it gracefully,
as EBADMSG
logs-show: use memory_startswith() rather than startswith()
Let's be strict here: this data is conceptually not NUL terminated,
hence use memory_startswith() rather than startswith() (which implies
NUL termination). All other similar cases in logs-show.c got this right.
Fix the remaining three, too.
journal-upload-journal: handle partially written fields gracefully
With the more efficient sync semantics it's more likely that
journal-upload-journal will try to read a partially written message.
Previously we'd fail then. Let's instead treat this gracefully,
expecting that this is either the end or will be fixed shortly (and
we'll get notified via inotify about it and recheck).
journal-remote: destroy event sources before MHD context
The MHD context owns the fd we watch via our event source, hence when we
destroy the context before the event source the event source might still
reference the fd that is now invalid. Hence swap the order.
journald: make journal Varlink IPC accessible to unpriv clients
The Synchronize() function is just too useful for clients, so that we
can make "systemd-run -v --user" actually useful. Hence let's make the
socket accessible without privs. Deny most method calls however, except
for the Synchronize() call.
Previously, if the Synchronize() varlink call is issued we'd wait for
journald to become idle before returning success. That is problematic
however: on a busy system journald might never become idle. Hence, let's
beef up the logic to ensure that we do not wait longer than necessary:
i.e. we make sure we process any data enqueued before the sync request
was submitted, but not more.
Implementing this isn't trivial unfortunately. To deal with this
reasonably, we need to determine somehow for incoming log messages
whether they are from before or after the point in time where the sync
requested was received.
For AF_UNIX/SOCK_DGRAM we can use SO_TIMESTAMP to directly compare
timestamps of incoming messages with the timestamp of the sync request
(unfortunately only CLOCK_REALTIME).
For AF_UNIX/SOCK_STREAM we can call SIOCINQ at the moment we initiate
the sync, and then continue processing incoming traffic, counting down
the bytes until the SIOCINQ returned bytes have been processed. All
further data must have been enqueued later hence.
With those two mechanisms in place we can relatively reliably
synchronize the journal.
This also adds a boolean argument "offline" to the Synchronize() call,
which controls whether to offline the journal after processing the
pending messages. it defaults to true, for compat with the status quo
ante. But for most cases the offlining is probably not necessary, and is
cheaper to do without, hence allow not to do it.
journald: downgrade event source priority of kmsg to same as native/syslog inputs
So far we schduled kmsg events at higher priority than native/syslog
ones. But that's quite problematic, since it means that kmsg events can
drown out native/syslog log events. And this actually shows up in some
CI tests.
Address that, and schedule all three sources at the same priority, so
that the earlier event always is processed first, regarding which
protocol is used.
journalctl: optionally delay --follow exit for a journal synchronization
Let's optionally issue a Varlink Synchronize() call in --follow mode
when asked to terminate. This is useful so that the tool can be called
and it is guaranteed it processed all messages generated before the
request to exit before it exits.
We want this in "systemd-run -v" in particular, so that we can be sure
we are not missing any log output from the invoked service before it
exits
Allow callers to synchronize on the point in time where the journal file
watches are fully established, in --follow mode.
Tools can invoke journalctl using this, knowing that any log message
happening after the READY=1 is definitely going to be processed by the
journalctl invocation.
sd-netlink: allow configuration of flags parameter when creating message object
We soon want to add for sock_diag(7) netlink sockets. Those reuse the
same message type codes for request and response but with different
message formats. Hence we need to look at NLM_F_REQUEST to determine
which message policy to apply. Hence it is essential to know the flags
parameters right away when creating a message, since we cannot do early
validation otherwise.
This only adds support for setting the flags value right at the moment
of creation of the message object, it does not otherwise add
sock_diag(7) support, that is added in a later message.
This also corrects the flag for synthetic NLMSG_ERROR messages which
should not have the NLM_F_REQUEST flag set (since they are responses,
not requests).
Let's extend pid1's varlink interface and add a Describe method to get
the global Manager object information as a JSON object
(io.systemd.Manager.Describe).
Because the new varlink interface should be available on both the user
managers and the system manager, we also make the necessary changes to
expose a varlink server on user managers.
This PR is first part of https://github.com/systemd/systemd/pull/33965
with minimal changes to address feedback.
Add support for a sysfail boot entry. Sysfail boot entries can be used
for optional tweaking the automatic selection order in case a failure
state of the system in some form is detected (boot firmware failure
etc).
The EFI variable `LoaderEntrySysFail` contains the sysfail boot loader
entry to use. It can be set using bootctl:
```
$ bootctl set-sysfail sysfail.conf
```
The `LoaderEntrySysFail` EFI variable would be unset automatically
during next boot by `systemd-boot-clear-sysfail.service` if no system
failure occured, otherwise it would be kept as it is and a system
failure reason will be saved to `LoaderSysFailReason` EFI variable.
`sysfail_check()` expected to be extented to support possibleconditions
when we should boot sysfail("recovery") boot entry.
Also add support for using a sysfail boot entry in case of UEFI firmware
capsule update failure [1]. The status of a firmware update is obtained
from the EFI System Resource Table (ESRT), which provides an optional
mechanism for identifying device and system firmware resources for the
purposes of targeting firmware updates to those resources.
Current implementation uses the value of LastAttemptStatus field from
ESRT, which describes the result of the last firmware update attempt for
the firmware resource entry. The field is updated each time an
`UpdateCapsule()` is attempted for an ESRT entry and is preserved across
reboots (non-volatile).
This can be be used in setups with support for A/B OTA updates, where
the boot firmware and Linux/RootFS might be updated synchronously.
The check is activated by adding "sysfail-firmware-upd" to loader.conf
Let's extend pid1's varlink interface and add a Describe method to
get the global Manager object information as a JSON object
(io.systemd.Manager.Describe).
Because the new varlink interface should be available on both the
user managers and the system manager, we also make the necessary
changes to expose a varlink server on user managers.
This breaks the rule stated at the beginning of help_sudo_mode():
> NB: Let's not go overboard with short options: we try to keep a modicum of compatibility with
> sudo's short switches, hence please do not introduce new short switches unless they have a roughly
> equivalent purpose on sudo. Use long options for everything private to run0.
Mike Yuan [Mon, 12 May 2025 14:10:03 +0000 (16:10 +0200)]
core: accept "|" ExecStart= prefix to spawn target user's shell; teach run0 about the new logic (#37071)
I've always been reluctant to invoke the current user's shell in another
user's context, hence was fully grounded in `sudo -i`. With this bit in
place `run0` will finally be feature-complete on my side ;-)
Yu Watanabe [Mon, 12 May 2025 14:04:49 +0000 (23:04 +0900)]
udev: sort received events by their seqnum (#37314)
The kernel sometimes sends uevents in a random order, so previously the
received events were not sorted by their seqnum. We determine which
event is ready for processing by using the assumption that queued events
are sorted by their seqnum. Let's sort the received events before queue
them, to make events processed in a correct ordering.
Igor Opaniuk [Thu, 23 Jan 2025 12:31:04 +0000 (13:31 +0100)]
sd-boot: use sysfail entry for UEFI firmware update failure
Add support for using a sysfail boot entry in case of UEFI firmware
capsule update failure [1]. The status of a firmware update is obtained from
the EFI System Resource Table (ESRT), which provides an optional mechanism
for identifying device and system firmware resources for the purposes of
targeting firmware updates to those resources.
Current implementation uses the value of LastAttemptStatus field from
ESRT, which describes the result of the last firmware update attempt for
the firmware resource entry. The field is updated each time an
UpdateCapsule() is attempted for an ESRT entry and is preserved across
reboots (non-volatile).
This can be be used in setups with support for A/B OTA updates, where
the boot firmware and Linux/RootFS might be updated synchronously.
[1] https://uefi.org/specs/UEFI/2.10/23_Firmware_Update_and_Reporting.html Signed-off-by: Igor Opaniuk <igor.opaniuk@foundries.io>
Igor Opaniuk [Mon, 24 Mar 2025 14:33:16 +0000 (15:33 +0100)]
bootctl: configure a sysfail entry
You can configure the sysfail boot entry using the bootctl command:
$ bootctl set-sysfail sysfail.conf
The value will be stored in the `LoaderEntrySysFail` EFI variable.
The `LoaderEntrySysFail` EFI variable would be unset automatically
during next boot by `systemd-boot-clear-sysfail.service` if no
system failure occured, otherwise it would be kept as it is and a system
failure reason will be saved to `LoaderSysFailReason` EFI variable.
Signed-off-by: Igor Opaniuk <igor.opaniuk@foundries.io>
Igor Opaniuk [Mon, 24 Mar 2025 14:30:49 +0000 (15:30 +0100)]
sd-boot: add support for a sysfail entry
Add support for a sysfail boot entry. Sysfail boot entries can be
used for optional tweaking the automatic selection order in case a
failure state of the system in some form is detected (boot firmware
failure etc).
The EFI variable `LoaderEntrySysFail` holds the boot loader entry to
be used in the event of a system failure. If a failure occurs, the reason
will be stored in the `LoaderSysFailReason` EFI variable.
sysfail_check() expected to be extented to support possible
conditions when we should boot sysfail("recovery") boot entry.
Signed-off-by: Igor Opaniuk <igor.opaniuk@foundries.io>
This mostly makes sure we do something reasonable when our tool is
called from a boot of an entry that was already marked as definitely
"bad" on a previous boot. Such an entry we can return into a "good"
state, but we cannot return it into an "indeterminate" state, because
the status quo ante is already known.
Daan De Meyer [Sat, 10 May 2025 20:19:22 +0000 (22:19 +0200)]
meson: Don't create static library target unless option is enabled
While we don't build these by default, all the source files still
get added to the compile_commands.json file by meson, which can confuse
tools as they might end up analyzing the source files twice or analyzing
the wrong one.
To avoid this issue, only define the static library target if the
corresponding option is enabled.
Daan De Meyer [Fri, 9 May 2025 18:48:51 +0000 (20:48 +0200)]
meson: Remove unneeded include directories
meson by default adds the current source and build directory as include
directories. Because we structure our meson code by gathering a giant dict
of everything we want to do and then doing all the actual target generation
in the top level meson.build, this behavior does not make sense at all because
we end up adding the top level repository directory as an include directory
which is never what we want.
At the same time, let's also make sure the top level directory of the build
directory is not an include directory, by moving the version.h generation
into the src/version subdirectory and then adding the src/version subdirectory
of the build directory as an include directory instead of the top level
repository directory.
Making this change means that language servers such as clangd can't get
confused when they automatically insert an #include line and insert
"#include "src/basic/fs-util.h" instead of "#include "fs-util.h".