missing: change our close_range() syscall wrapper to match glibc's
So glibc exposes a close_range() syscall wrapper now, but they decided
to use "unsigned" as type for the fds. Which is a bit weird, because fds
are universally understood to be "int". The kernel internally uses
"unsigned", both for close() and for close_range(), but weirdly,
userspace didn't fix that for close_range() unlike what they did for
close()... Weird.
But anyway, let's follow suit, and make our wrapper match glibc's.
Michal Koutný [Fri, 9 Feb 2024 15:03:00 +0000 (16:03 +0100)]
service: Demote log level of NotifyAccess= messages to debug
The situation is a service like
Type=notify
NotifyAccess=main
and the service uses some of the systemd helper utilities, e.g.
coredumpctl. The service process will pass NOTIFY_SOCKET to the helper
child (accidentally) and the result is a spurious notification and
the warning message:
> Jan 18 09:38:01 host systemd[1]: sdnotify.service: Got notification message from PID 13736, but reception only permitted for main PID 13549
Notifications from helpers seem like an unintentional composition of the
commit c118b577fa ("coredumpctl: define main through macro") and commit
6b636c2d27 ("main-func: send main exit code to parent via sd_notify() on
exit"). The former used the handy macro for a main function, the latter
equipped any main function with the notification. (Further extended in
the commit 623a00020f ("notify: Add EXIT_STATUS field").)
Since notifications from systemd utilities are meant to extend the
rudimentary exit()/wait() pair in general, they may happen to land in a
service's NOTIFY_SOCKET. Tone down the messages for notifications that
won't match NotifyAccess=.
For the other verbs turning off JSON mode makes sense, but for "call"
not so much: after all, the content of a method call reply is JSON, which
we couldn't really show any other way.
Hence, when JSON output was not configured otherwise in "call", default
to the same as -j.
It exposes the varlink_collect() call we internally provide: it collects
all responses of a method call that is issued with the "more" method
call flag. It then returns the result as a single JSON array.
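A minimal sketch of the collect idea, not the actual varlink_collect()
implementation: it assumes replies arrive as already-parsed JSON objects
and that, as in the Varlink protocol, every reply except the last carries
"continues": true. The function name collect_replies() and the sample
data are hypothetical.

```python
import json


def collect_replies(replies):
    """Gather the parameters of every reply to a "more" call into one list."""
    collected = []
    for reply in replies:
        # An error reply ends the call; surface it to the caller.
        if "error" in reply:
            raise RuntimeError(reply["error"])
        collected.append(reply.get("parameters", {}))
        # The final reply has no "continues" flag (or has it set to false).
        if not reply.get("continues", False):
            break
    return collected


# Hypothetical stream of replies to a single method call issued with "more".
stream = [
    {"parameters": {"name": "a"}, "continues": True},
    {"parameters": {"name": "b"}, "continues": True},
    {"parameters": {"name": "c"}},
]

# The caller sees one JSON array rather than three separate replies.
print(json.dumps(collect_replies(stream)))
```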
This reworks varlink_collect() so that it is not just a wrapper around
varlink_observe(), varlink_bind_reply() and others. It becomes a first
class operation.
This has various benefits:
1. Memory management is normalized: the reply json variant is now
tracked as part of the varlink object, and thus we do not pass
ownership to the caller. This is just like we do it for simple method
calls and removes a lot of confusion.
2. The bind reply/user data pointer can be used for user stuff, we'll
not silently override this.
3. We enforce an overall time-out on the whole operation, so that
this synchronous call no longer blocks forever.
units: enable MaxConnectionsPerSource= for all our Accept=yes units
Let's make sure that users cannot DoS services for other users so
easily, and enable MaxConnectionsPerSource= by default for all of them.
Note that this is mostly paranoia for systemd-pcrextend.socket and
systemd-sysext.socket: the socket is only accessible to root anyway,
hence the accounting shouldn't change anything. But this is just a
safety net, in preparation for opening up some functionality of these
services sooner or later.
pid1: make MaxConnectionsPerSource= also work for AF_UNIX sockets
The setting currently puts limits on connections per IP address and
AF_VSOCK CID. Let's extend it to cover AF_UNIX too, where it puts a limit
on connections per UID.
This is particularly useful for the various Accept=yes Varlink services
we now have, as it means the number of per-user instance services
cannot grow without bounds.
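The per-UID accounting described above can be sketched roughly as
follows. This is an illustration only, not how pid1 implements it: the
class PerUidLimiter and its methods are hypothetical, and a real AF_UNIX
server would first have to look up the peer's UID (for example via
SO_PEERCRED) before consulting such a table.

```python
from collections import Counter


class PerUidLimiter:
    """Track active connections per UID and refuse those over the limit."""

    def __init__(self, max_per_uid):
        self.max_per_uid = max_per_uid
        self._active = Counter()

    def try_accept(self, uid):
        # Refuse the connection if this UID already holds its full quota.
        if self._active[uid] >= self.max_per_uid:
            return False
        self._active[uid] += 1
        return True

    def release(self, uid):
        # Called when a connection (or its per-connection service) goes away.
        if self._active[uid] > 0:
            self._active[uid] -= 1
            if self._active[uid] == 0:
                del self._active[uid]


limiter = PerUidLimiter(max_per_uid=2)
print(limiter.try_accept(1000))  # first connection of UID 1000
print(limiter.try_accept(1000))  # second connection of UID 1000
print(limiter.try_accept(1000))  # third one exceeds the limit
print(limiter.try_accept(1001))  # other users are unaffected
```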
Eric Daigle [Fri, 9 Feb 2024 07:09:34 +0000 (23:09 -0800)]
firstboot: validate keymap entry
As described in #30940, systemd-firstboot currently does not perform
any validation on keymap entry, allowing nonexistent keymaps to be
written to /etc/vconsole.conf. This commit adds validation checks
based on those already performed on locale entry, preventing invalid
keymaps from being set.
Franck Bui [Wed, 7 Feb 2024 12:41:48 +0000 (13:41 +0100)]
gpt-auto-generator: be more defensive when checking the presence of ESP in fstab
Looking for the ESP node is useful to shortcut things but if we're told that
the node is not referenced in fstab that doesn't necessarily mean that ESP is
not mounted via fstab. Indeed the check is not reliable in all cases. Firstly
because it assumes that udev already set the symlinks up. This is not the case
for initrd-less boots. Secondly the devname of the ESP partition can be wrongly
constructed by the dissect code. For example, the approach which consists in
appending "p<partnum>" suffix to construct the partition devname from the disk
devname doesn't work for DM devices.
Hence this patch makes the logic more defensive and mounts neither the ESP
nor XBOOTLDR automatically if any path in fstab that starts with /efi or
/boot exists.
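A rough sketch of such a defensive check, under stated assumptions: it is
not the generator's code, the helper names are hypothetical, and it
assumes that the mount point is the second whitespace-separated field of
each fstab line and that "starts with" is meant per path component (so
/bootstrap does not count as being below /boot).

```python
def path_startswith(path, prefix):
    """True if `path` equals `prefix` or lies below it, component-wise."""
    # Normalize duplicate and trailing slashes before comparing.
    path = "/" + "/".join(part for part in path.split("/") if part)
    return path == prefix or path.startswith(prefix + "/")


def fstab_mentions_boot_paths(fstab_text, prefixes=("/efi", "/boot")):
    """Return True if any fstab mount point is at or below one of `prefixes`."""
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        if any(path_startswith(fields[1], prefix) for prefix in prefixes):
            return True
    return False


example = """
# <device>      <mount point>  <type>  <options>
UUID=1234-ABCD  /boot/efi      vfat    umask=0077  0 2
/dev/sda2       /              ext4    defaults    0 1
"""

# True here would mean: do not auto-mount the ESP or XBOOTLDR.
print(fstab_mentions_boot_paths(example))
```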
Yu Watanabe [Tue, 2 Jan 2024 19:28:25 +0000 (04:28 +0900)]
logs-show: get timestamp and boot ID only when necessary
Previously, get_display_timestamp() was called unconditionally, even if
we were going to show logs in e.g. json format.
This drops the unnecessary call of get_display_timestamp().
This also makes the journal fields in each entry be parsed only once in
output_short(). output_verbose() still parses them twice, though.
This should improve performance of dumping journals.
Replaces #29365.
Co-authored-by: Costa Tsaousis <costa@netdata.cloud>
Yu Watanabe [Tue, 2 Jan 2024 19:28:11 +0000 (04:28 +0900)]
sd-journal: drop to use Hashmap to manage journal files per boot ID
As reported at https://github.com/systemd/systemd/pull/30209#issuecomment-1831344431,
using a hashmap in a frequently called function reduces performance.
Let's replace it with a single array and bsearch.
Replaces #29366.
Co-authored-by: Costa Tsaousis <costa@netdata.cloud>
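As an illustration of the lookup pattern the commit above describes (a
sorted array searched by bisection instead of a hash map), here is a small
sketch. It is not the sd-journal code: the class BootIdIndex is
hypothetical, boot IDs are shown as plain strings, and for simplicity it
keeps two parallel lists rather than one array of records.

```python
import bisect


class BootIdIndex:
    """Map boot IDs to values using sorted lists and binary search."""

    def __init__(self):
        self._keys = []    # boot IDs, kept sorted
        self._values = []  # values, in the same order as the keys

    def insert(self, boot_id, value):
        i = bisect.bisect_left(self._keys, boot_id)
        if i < len(self._keys) and self._keys[i] == boot_id:
            self._values[i] = value  # key already present: update in place
        else:
            self._keys.insert(i, boot_id)
            self._values.insert(i, value)

    def find(self, boot_id):
        # Binary search: O(log n) comparisons, no hashing involved.
        i = bisect.bisect_left(self._keys, boot_id)
        if i < len(self._keys) and self._keys[i] == boot_id:
            return self._values[i]
        return None


index = BootIdIndex()
index.insert("c0ffee", "journal-3")
index.insert("00beef", "journal-1")
index.insert("7ac0de", "journal-2")
print(index.find("7ac0de"))
print(index.find("ffffff"))
```

With only a handful of boot IDs per lookup table, the sorted array tends
to be cheap to search and cache-friendly, which is the kind of trade-off
the commit message alludes to.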
Yu Watanabe [Sat, 20 Jan 2024 13:14:14 +0000 (22:14 +0900)]
journalctl: call all cleanup functions before raise()
Note, even with this, memory allocated internally by glibc is not freed.
But, at least, memory explicitly allocated by us is freed cleanly even
when Ctrl-C is pressed during 'journalctl --follow'.
Yu Watanabe [Tue, 2 Jan 2024 19:27:59 +0000 (04:27 +0900)]
sd-journal: cache last entry offset and journal file state
When the offset of the last entry object (or last object for journal
files generated by an old journald) is not changed, the timestamps
that journal_file_read_tail_timestamp() would update are unchanged.
So, we can skip calling fstat() in the function.
Since the journal header is always mapped, we can read the offset and
journal file state without calling fstat().
Still, when the last entry offset is changed, we may need to call fstat()
to read the entry object. But, hopefully the number of fstat() calls
can be reduced.
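The caching idea can be sketched as follows. This is only an
illustration under assumptions, not the sd-journal implementation: the
class and callback names are hypothetical, read_tail_offset stands in for
the cheap read from the mapped header, and read_timestamp stands in for
the more expensive path that may need fstat().

```python
class TailTimestampCache:
    """Re-read the tail timestamp only when the tail offset has changed."""

    def __init__(self, read_tail_offset, read_timestamp):
        self._read_tail_offset = read_tail_offset  # cheap: mapped header
        self._read_timestamp = read_timestamp      # expensive: may fstat()
        self._cached_offset = None
        self._cached_timestamp = None
        self.expensive_reads = 0

    def get(self):
        offset = self._read_tail_offset()
        if offset != self._cached_offset:
            # The file grew (or changed): refresh the cached timestamp.
            self._cached_timestamp = self._read_timestamp(offset)
            self._cached_offset = offset
            self.expensive_reads += 1
        return self._cached_timestamp


# Hypothetical stand-ins for a journal file whose tail moves once.
state = {"offset": 100}
cache = TailTimestampCache(lambda: state["offset"], lambda off: off * 10)

cache.get()
cache.get()            # offset unchanged: served from the cache
state["offset"] = 200
cache.get()            # offset changed: one more expensive read
print(cache.expensive_reads)
```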
Mike Yuan [Wed, 31 Jan 2024 17:25:49 +0000 (01:25 +0800)]
core/service: allow RestartForceExitStatus= for oneshot services
I think this was just overlooked in #13754, which removed
the restriction of Restart= on Type=oneshot services.
There's no reason to prevent RestartForceExitStatus=
now that Restart= has been allowed.
Mike Yuan [Wed, 31 Jan 2024 17:47:35 +0000 (01:47 +0800)]
core/service: make error msg match with conditions
This was discussed in
https://github.com/systemd/systemd/pull/13754#discussion_r333395362.
I think we should actually list "success" Restart= settings instead.
There are more error statuses than success ones after all, and this
list hasn't really changed for quite some time.
Daan De Meyer [Mon, 25 Dec 2023 22:11:22 +0000 (23:11 +0100)]
repart: Add --generate-fstab= and --generate-crypttab= options
These can be used along with two new settings MountPoint= and
EncryptedVolume= to write fstab and crypttab entries to the given
paths respectively in the root directory that repart is operating on.
This is useful to cover scenarios that aren't covered by the
Discoverable Partitions Spec. For example when one wants to mount
/home as a separate btrfs subvolume. Because multiple btrfs subvolumes
can be mounted from the same partition, we allow specifying MountPoint=
multiple times to add multiple entries for the same partition.
test: make the MemoryHigh= limit a bit more generous with sanitizers
When we're running with sanitizers, sd-executor might pull in a
significant chunk of shared libraries on startup, that can cause a lot
of memory pressure and put us in the front when sd-oomd decides to go on
a killing spree. This is exacerbated further on Arch Linux when built
with gcc, as Arch ships unstripped gcc-libs so sd-executor pulls in over
30M of additional shared libs on startup:
Yu Watanabe [Tue, 2 Jan 2024 19:30:29 +0000 (04:30 +0900)]
sd-journal: do not read unnecessary object
In journal_file_next_entry(), if the passed offset matches an entry object,
then generic_array_bisect() returns the object, but the object we
requested is the next (or previous) object. Hence, we should not validate
the object returned by generic_array_bisect(), otherwise it may fail
when the journal is corrupted.
Note the validity of the entry object that should be returned by
journal_file_next_entry() will be checked in the following generic_array_get().
So, when journal_file_next_entry() succeeds, the returned object is
always validated.
Yu Watanabe [Tue, 2 Jan 2024 19:30:24 +0000 (04:30 +0900)]
sd-journal: always put verified object into the chain cache
Let's consider the case that
- the first array contains valid entries,
- all entries in the second array are corrupted.
Then, when we are going upwards, and a call of generic_array_bisect()
matches the last entry of the first array, then the second array was
cached with last_index == UINT64_MAX, instead of the first array with
its last entry.
Hence, the next time generic_array_bisect() is called, the call of
test() always fails. So, the cache entry is mostly meaningless.
Luca Boccassi [Wed, 11 Oct 2023 18:23:40 +0000 (19:23 +0100)]
repart: support OpenSSL engines/providers for signing
The new provider API requires providers, which are not
widely available and don't work very well yet, so also use a
fallback with the legacy engine API.
bpf-devices: if a device node is referenced which doesn't exist, downgrade log message
Currently in many of our test cases you'll see a warning about a tun
device not being around. Let's make that quiet, since if there's no such
device there's no point in adding it to a policy anyway, and it makes
useless noise go away.
We keep the warning as a warning if a device node is missing for other
errors than ENOENT.
bpf-devices: normalize the return handling of functions that put together policy
Under some conditions we suppress generating BPF programs. Let's
systematically return 0 when we do this, and 1 if we actually did
something, instead of second guessing this in the caller.
This is not only more correct, but allows us to suppress BPF programs in
more cases in later commits.
bpf-devices: normalize how we pass around major/minor values
It is somewhat unclear whether major/minor of device nodes are supposed
to be "unsigned" or "dev_t". Various codebases assume the latter, but
glibc's major()/minor() macros actually return a value typed
"unsigned". On glibc dev_t is actually 64bit even if the kernel only
exposes 32bit. Hence this distinction kinda matters.
Let's clean up the handling a bit: let's follow glibc's type
system here, and use unsigned (and not int).
Also let's pass invalid major/minor values around as UINT_MAX rather
than via pointers, to match how we usually do this, and to shorten our
code a bit. This is safe: since the Linux dev_t space is only 32bit,
we can't possibly have a valid major or minor this high, given
they must be smaller in size. While other archs disagree on the types of
major/minor, they also tend to have similar limits. In fact on FreeBSD
for example major()/minor() returns a signed int. Which would hence also
mean that UINT_MAX cannot be a valid major or minor.
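The arithmetic behind the "UINT_MAX is safe as an invalid marker"
argument can be checked with a few lines. This sketch assumes the
kernel's internal split of its 32bit dev_t into a 12bit major and a 20bit
minor; the constant and function names are made up for illustration and
do not come from the patch.

```python
UINT_MAX = 2**32 - 1

# Assumed kernel-internal layout of the 32bit dev_t.
KERNEL_MINOR_BITS = 20
KERNEL_MAJOR_BITS = 32 - KERNEL_MINOR_BITS

MAX_MAJOR = (1 << KERNEL_MAJOR_BITS) - 1
MAX_MINOR = (1 << KERNEL_MINOR_BITS) - 1


def device_number_is_set(value):
    """Treat UINT_MAX as the 'no major/minor given' marker."""
    return value != UINT_MAX


def describe(major, minor):
    """Render a major/minor pair, using '*' for an unset component."""
    ma = str(major) if device_number_is_set(major) else "*"
    mi = str(minor) if device_number_is_set(minor) else "*"
    return f"{ma}:{mi}"


# Neither field can ever reach UINT_MAX, so the marker cannot collide
# with a real device number.
print(MAX_MAJOR < UINT_MAX and MAX_MINOR < UINT_MAX)
print(describe(10, 200))
print(describe(10, UINT_MAX))
```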
dev-setup: normalize logging around lock_dev_console()
Previously this function would log loudly in some cases but not in
others. Clean this up, and don't log at all, matching our coding style
which says we should either log in all error cases or in none.
Both callers of this function do logging already, hence no need to
duplicate it here.
test: adjust test-path to fail gracefully with the new pidfd_spawn stuff
Since 2e106312e2 the test unit fails with 'resources' result instead of
'exit-code', which the test didn't account for when running unprivileged.
Before 2e106312e2:
$ /root/systemd/build/test-path
Failed to start transient scope unit: Interactive authentication required.
Couldn't allocate a scope unit for this test, proceeding without.
...
-.slice: Failed to enable/disable controllers on cgroup /user.slice/user-1000.slice/session-1.scope, ignoring: Permission denied
app.slice: Failed to create cgroup /user.slice/user-1000.slice/session-1.scope/app.slice: Permission denied
-.slice: Failed to enable/disable controllers on cgroup /user.slice/user-1000.slice/session-1.scope, ignoring: Permission denied
app.slice: Failed to create cgroup /user.slice/user-1000.slice/session-1.scope/app.slice: Permission denied
...
line 151: path-exists.path: state = running; result = success (left: 29986250)
line 151: path-exists.service: state = start; result = success
path-exists.service: Main process exited, code=exited, status=219/CGROUP
path-exists.service: Failed with result 'exit-code'.
line 151: path-exists.path: state = running; result = success (left: 29985948)
line 151: path-exists.service: state = failed; result = exit-code
Failed to start service path-exists.service, aborting test: failed/exit-code
After 2e106312e2:
$ /root/systemd/build/test-path
Failed to start transient scope unit: Interactive authentication required.
Couldn't allocate a scope unit for this test, proceeding without.
...
-.slice: Failed to enable/disable controllers on cgroup /user.slice/user-1000.slice/session-1.scope, ignoring: Permission denied
app.slice: Failed to create cgroup /user.slice/user-1000.slice/session-1.scope/app.slice: Permission denied
-.slice: Failed to enable/disable controllers on cgroup /user.slice/user-1000.slice/session-1.scope, ignoring: Permission denied
app.slice: Failed to create cgroup /user.slice/user-1000.slice/session-1.scope/app.slice: Permission denied
path-exists.service: Failed to spawn executor: No such file or directory
path-exists.service: Failed to spawn 'start' task: No such file or directory
path-exists.service: Failed with result 'resources'.
packit: temporarily build systemd without BPF stuff
The kernel-tools meta-package was retired in Rawhide, but its
replacement has not landed yet. Until that happens, let's build without
the bpf-framework stuff.
Daan De Meyer [Thu, 8 Feb 2024 09:54:54 +0000 (10:54 +0100)]
Add systemd.default_debug_tty=
Let's allow configuring the debug tty independently of enabling/disabling
the debug shell. This allows mkosi to configure the correct tty while
leaving enabling/disabling the debug shell to the user.
sysext: rename "directory_name" field to "full_identifier"
So the field contains simply the full name of the command being invoked,
hence rename the field to match the contents, and to mirror the
"short_identifier" field.
Interestingly, the field is apparently not actually used by anything
though! But we are not going to remove it, since a follow-up commit will
start making use of it.