Daan De Meyer [Sat, 23 Nov 2024 10:36:54 +0000 (11:36 +0100)]
repart: Take configured minimum and maximum size into account for Minimize=
- Let's check if the minimum size we got is larger than the configured
maximum partition size and fail early if it is.
- Let's make sure for writable filesystems that we make the minimal
filesystem at least as large as the minimum partition size, to allow
creating minimal filesystems with a minimum size.
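The two checks above can be sketched roughly as follows. This is an illustration only, not the code in repart: the function name, its parameters and the use of an exception are all hypothetical, and sizes are assumed to be plain byte counts.

```python
def minimized_partition_size(fs_min_size, size_min=None, size_max=None, writable=True):
    """Return the size to use for a minimized partition, in bytes.

    fs_min_size: smallest size the filesystem content was measured to fit in.
    size_min / size_max: configured partition size bounds (None means unset).
    All names here are hypothetical and only mirror the commit message.
    """
    # Fail early: the minimal filesystem cannot fit under the configured maximum.
    if size_max is not None and fs_min_size > size_max:
        raise ValueError(
            f"minimal filesystem size {fs_min_size} exceeds configured maximum {size_max}"
        )
    # Writable filesystems are made at least as large as the configured minimum.
    if writable and size_min is not None:
        return max(fs_min_size, size_min)
    return fs_min_size
```

With a measured minimum of 100 bytes and a configured minimum of 500, a writable filesystem would come out at 500, while a read-only one would stay at 100.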
Daan De Meyer [Fri, 13 Dec 2024 13:48:07 +0000 (13:48 +0000)]
core: Bind mount notify socket to /run/host/notify in sandboxed units (#35573)
To be able to run systemd in a Type=notify transient unit, the notify
socket can't be bind mounted to /run/systemd/notify as systemd in the
transient unit wants to use that as its own notify socket which
conflicts with systemd on the host.
Instead, for sandboxed units, let's bind mount the notify socket to
/run/host/notify as documented in the container interface. Since we
don't guarantee a stable location for the notify socket and insist users
use $NOTIFY_SOCKET to get its path, this is safe to do.
Daan De Meyer [Wed, 11 Dec 2024 18:45:28 +0000 (18:45 +0000)]
core: Bind mount notify socket to /run/host/notify in sandboxed units
To be able to run systemd in a Type=notify transient unit, the notify
socket can't be bind mounted to /run/systemd/notify as systemd in the
transient unit wants to use that as its own notify socket which conflicts
with systemd on the host.
Instead, for sandboxed units, let's bind mount the notify socket to
/run/host/notify as documented in the container interface. Since we don't
guarantee a stable location for the notify socket and insist users use
$NOTIFY_SOCKET to get its path, this is safe to do.
Soumyadeep Ghosh [Thu, 12 Dec 2024 13:20:33 +0000 (18:50 +0530)]
hwdb: move down touchpad toggle section from generic to product specific
Adding `KEYBOARD_KEY_76` to the generic section causes a regression
on the MSI GF63. Moving it down to the product-specific section fixes this.
This commit also adds a probable key code for MSI GF63 touchpad toggling.
Luca Boccassi [Fri, 13 Dec 2024 12:25:13 +0000 (12:25 +0000)]
core: Add PrivateUsers=full (#35183)
Recently, PrivateUsers=identity was added to support mapping the first
65536 UIDs/GIDs from parent to the child namespace and mapping the other
UID/GIDs to the nobody user.
However, there are use cases where users have UIDs/GIDs > 65536 and need
to do a similar identity mapping. Moreover, in some of those cases,
users want a full identity mapping from 0 -> UID_MAX.
To support this, we add PrivateUsers=full that does identity mapping for
all available UID/GIDs.
Note to differentiate ourselves from the init user namespace, we need to
set up the uid_map/gid_map like:
```
0 0 1
1 1 UINT32_MAX - 1
```
as the init user namespace uses `0 0 UINT32_MAX` and some applications
- like systemd itself - determine if it's a non-init user namespace based
on uid_map/gid_map files.
Note systemd will remove this heuristic in running_in_userns() in
version 258 (https://github.com/systemd/systemd/pull/35382) and use the
namespace inode instead. But some users may be running a container image
with an older systemd < 258, so we keep this hack until version 259 for
version N-1 compatibility.
In addition to mapping the whole UID/GID space, we also set
/proc/pid/setgroups to "allow". While we usually set "deny" to avoid
security issues with dropping supplementary groups
(https://lwn.net/Articles/626665/), this ends up breaking dbus-broker
when running /sbin/init in full OS containers.
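As a rough sketch of why the mapping is split into two entries: a heuristic that treats a single `0 0 UINT32_MAX` entry as "init user namespace" no longer matches the split map, even though both cover the same ID range. The helper names below are hypothetical, and the heuristic is a simplified stand-in for the kind of check described above, not systemd's actual running_in_userns() code.

```python
UINT32_MAX = 2**32 - 1


def full_identity_map():
    # Two entries instead of one, so the map text differs from the init
    # user namespace's single "0 0 UINT32_MAX" entry.
    return [(0, 0, 1), (1, 1, UINT32_MAX - 1)]


def format_map(entries):
    # Render entries in the "inside outside count" form used by uid_map/gid_map.
    return "".join(f"{inside} {outside} {count}\n" for inside, outside, count in entries)


def looks_like_init_userns(map_text):
    # Simplified heuristic (an assumption for illustration): exactly one
    # entry that maps the whole 32-bit range onto itself.
    entries = [
        tuple(int(field) for field in line.split())
        for line in map_text.splitlines()
        if line.strip()
    ]
    return entries == [(0, 0, UINT32_MAX)]


print(format_map(full_identity_map()), end="")
print(looks_like_init_userns(format_map(full_identity_map())))
```

The two entries together still cover `UINT32_MAX` IDs starting at 0, so nothing is left unmapped compared to the single-entry form.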
Florian Schmaus [Sat, 16 Nov 2024 09:29:35 +0000 (10:29 +0100)]
logind: let system-wide idle begin at the time logind was initialized
Initialize the start of the system-wide idle time with the time logind was
initialized and not with the start of the Unix epoch. This means that systemd
will not report an unreasonably long idle time (around 54 years at the time of
writing this), especially in early boot, while no login manager session,
e.g., gdm, has had a chance to provide a more accurate start of the idle period.
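The "around 54 years" figure follows from simple arithmetic, sketched here. The helper name and the 30-second example are made up for illustration; the timestamp is this commit's author date.

```python
from datetime import datetime, timezone

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def idle_years(idle_since, now):
    # Both arguments are seconds since the Unix epoch.
    return (now - idle_since) / SECONDS_PER_YEAR


commit_time = datetime(2024, 11, 16, 9, 29, 35, tzinfo=timezone.utc).timestamp()

# Idle start left at the epoch (0): decades of apparent idle time.
print(int(idle_years(0, commit_time)))

# Idle start set to a hypothetical initialization time 30 seconds earlier:
# a tiny fraction of a year.
print(idle_years(commit_time - 30, commit_time))
```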
Luca Boccassi [Thu, 12 Dec 2024 16:46:11 +0000 (16:46 +0000)]
mkosi: update debian commit reference
* e8b7c9a4dd Install 81-net-bridge.rules
* 50d2997a07 Install systemd-creds bash completion
* ff0c42823c test: fix flaky boot-and-services test
* 2a19dee4ba test: fix flaky boot-and-services test
* a15a0bfe60 Update changelog for 257-2 release
* c24eafcb7e Backport patches to fix test failures
* 29840f9b68 udev: install dmi_memory_id and its rules on riscv64
* 44893bdb32 Update changelog for 257-1 release
* 7f71d995fb Update symbols file for v257
* 2dd2b80499 Update upstream source from tag 'upstream/257'
* 51a3271a85 Update changelog for 257~rc3-1 release
* 8e687227c5 Update symbols for 257~rc3
* c9bae527d6 Drop patches, merged upstream
* e8cf329870 Update upstream source from tag 'upstream/257_rc3'
* 794457516d autopkgtest: fix one more tzdata dependency
* 16bb143da1 Bump version in tzdata dependency due to p-u upload
* f2ddf70604 sysctl: Add file trigger on /usr/lib/sysctl.d to restart systemd-sysctl
* 79260cb0f4 Increase minimum sections in stub PE header on arm64/armhf/riscv64 to 500
* ed3af24635 systemd-ukfy: recommend systemd-boot-efi for the stub
Ryan Wilson [Sat, 30 Nov 2024 22:14:35 +0000 (14:14 -0800)]
core: Set /proc/pid/setgroups to allow for PrivateUsers=full
When trying to run dbus-broker in a systemd unit with PrivateUsers=full,
we see dbus-broker fails with EPERM at `util_audit_drop_permissions`.
The root cause is dbus-broker calls the setgroups() system call and this
is disallowed via systemd's implementation of PrivateUsers= by setting
/proc/pid/setgroups = deny. This is done to remediate potential privilege
escalation vulnerabilities in user namespaces where an attacker can remove
supplementary groups and gain access to resources where those groups are
restricted.
However, for OS-like containers, setgroups() is a pretty common API and
disabling it is not feasible. So we allow setgroups() by setting
/proc/pid/setgroups to allow in PrivateUsers=full. Note security conscious
users can still use SystemCallFilter= to disable setgroups() if they want
to specifically prevent this system call.
Luca Boccassi [Wed, 11 Dec 2024 20:44:25 +0000 (20:44 +0000)]
semaphore: skip some tests
semaphore CI runs are always very close to the limit of 1hr, and often
time out when it's particularly oversubscribed.
Skip some low-value test cases to shorten the runtime.
This function was listed in the public sd-varlink.h header, but not
actually made public. Fix that. It's quite useful, the comment in it
describes the usecase nicely.
Yu Watanabe [Wed, 11 Dec 2024 20:11:46 +0000 (05:11 +0900)]
libfido2-util: show also verity features when listing FIDO2 devices (#35295)
This way, users don't have to check those features using an external
program, or wait for later failure when trying to enroll using an
unsupported feature.
E.g.:
```
# systemd-cryptenroll --fido2-device list
PATH MANUFACTURER PRODUCT RK CLIENTPIN UP UV
/dev/hidraw2 Yubico YubiKey OTP+FIDO+CCID yes no yes no
```
This introduces a new unit condition check that matches if a specific
kmod module is loaded. This should be generally useful, but there's one
usecase in particular: we can optimize modprobe@.service with this and
avoid forking out a bunch of modprobe requests during boot for the same
kmods.
Checking if a kernel module is loaded is more complicated than just
checking if /sys/module/$MODULE/ exists, since kernel modules typically
take a while to initialize and we must check that this is complete (by
checking if the sysfs attr "initstate" is "live").
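A minimal sketch of the check described above, assuming the sysfs layout named in the text. The function name and the `sysfs_root` parameter are hypothetical (the latter exists only so the sketch can be tried against a scratch directory), and this is not the implementation in systemd. Note that this sketch reports false whenever "initstate" is missing or unreadable.

```python
from pathlib import Path


def kernel_module_is_live(module, sysfs_root="/sys"):
    # A module directory may exist before initialization has finished,
    # so look at the "initstate" attribute rather than the directory.
    initstate = Path(sysfs_root) / "module" / module / "initstate"
    try:
        return initstate.read_text().strip() == "live"
    except OSError:
        # No such module, or no readable initstate attribute.
        return False
```

Calling `kernel_module_is_live("loop")` on a running system would then distinguish a fully initialized module from one that is still coming up.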
Tobias Klauser [Wed, 11 Dec 2024 14:10:39 +0000 (15:10 +0100)]
profile.d: don't bail if $SHELL_* variables are unset
If - for whatever reason - a script uses set -u (nounset) and includes
/etc/profile.d/70-systemd-shell-extra.sh (e.g. transitively via
/etc/profile) the script would fail with:
/etc/profile.d/70-systemd-shell-extra.sh: line 15: SHELL_PROMPT_PREFIX: unbound variable
Now that we have an explicit userns check we can drop the heuristic for
it, given that it's kinda wrong (because mapping the full host UID range
into a userns is actually a thing people do).
Hence, just delete the code and only keep the userns inode check in
place.
Now that we have a reliable pidns check I don't think we really should
look for cgroupns anymore, it's too weak a check. I mean, if I myself
would implement a desktop app sandbox (like flatpak) I'd always enable
cgroupns, simply to hide the host cgroup hierarchy.
Mike Yuan [Tue, 10 Dec 2024 15:14:34 +0000 (16:14 +0100)]
basic/fileio: clean up executable_is_script() a bit
- Rename to script_get_shebang_interpreter and return
-EMEDIUMTYPE if the executable is not a script.
We nowadays utilize the scheme of making ret param
of getters optional, and use them directly as checkers.
- Don't unnecessarily read the whole line, but check
only the shebang first.
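The "check only the shebang first" idea can be sketched like this. It is a Python illustration with a hypothetical function name, not the C helper itself; it assumes Linux, where `errno.EMEDIUMTYPE` is defined, and it returns the whole rest of the shebang line rather than parsing out the interpreter.

```python
import errno


def script_shebang_line(path):
    with open(path, "rb") as f:
        # Read just the first two bytes; most executables are rejected
        # here without reading a whole line.
        if f.read(2) != b"#!":
            raise OSError(errno.EMEDIUMTYPE, "not a script", path)
        # Only now read the remainder of the first line.
        line = f.readline()
    return line.strip().decode(errors="replace")
```

A caller that only wants a yes/no answer can catch `OSError` and compare its `errno` against `errno.EMEDIUMTYPE`, which mirrors using the getter directly as a checker.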
man: document unprivileged is not for reading properties
Document the fact that read-only properties may not have the flag
SD_BUS_VTABLE_UNPRIVILEGED as that is not obvious especially given the
flag is accepted for writable properties.
Based on the check in `add_object_vtable_internal` called by
`sd_bus_add_object_vtable` (as of the current tip of the main branch f7f5ba019206cacd486b0892fec76f70f525e04d):
    case _SD_BUS_VTABLE_PROPERTY: {
            [...]
            if ([...] ||
                [...]
                (v->flags & SD_BUS_VTABLE_UNPRIVILEGED && v->type == _SD_BUS_VTABLE_PROPERTY)) {
                    r = -EINVAL;
                    goto fail;
            }
(where `_SD_BUS_VTABLE_PROPERTY` means read-only property whereas
`_SD_BUS_VTABLE_WRITABLE_PROPERTY` maps to writable property).
This was implemented in the commit adacb9575a09981fcf11279f2f661e3fc21e58ff ("bus: introduce "trusted" bus
concept and encode access control in object vtables") where
`SD_BUS_VTABLE_UNPRIVILEGED` was introduced:
Writable properties are also subject to SD_BUS_VTABLE_UNPRIVILEGED
and SD_BUS_VTABLE_CAPABILITY() for controlling write access to them.
Note however that read access is unrestricted, as PropertiesChanged
messages might send out the values anyway as an unrestricted
broadcast.
libfido2-util: show also verity features when listing FIDO2 devices
This way, users don't have to check those features using an external program, or
wait for later failure when trying to enroll using an unsupported feature.
Mike Yuan [Wed, 27 Nov 2024 23:40:11 +0000 (00:40 +0100)]
process-util: make sure we don't report ppid == 0
Previously, if pid == 0 and we're PID 1, get_process_ppid()
would set ret to getppid(), i.e. 0, which is inconsistent
with the case where pid is explicitly set to 1. Ensure we always
handle this case by returning -EADDRNOTAVAIL.
Luca Boccassi [Wed, 11 Dec 2024 13:40:10 +0000 (13:40 +0000)]
test-fd-util: compare FDs to /bin/sh instead of /dev/null
/dev/null is a character device, so same_fd() in the fallback path
that compares fstat will fail, as that bails out if the fd refers
to a char device. This happens on kernels without F_DUPFD_QUERY and
without kcmp.
/* test_same_fd */
Assertion 'same_fd(d, e) > 0' failed at src/test/test-fd-util.c:111, function test_same_fd(). Aborting.
Luca Boccassi [Wed, 11 Dec 2024 12:01:18 +0000 (12:01 +0000)]
test-fd-util: skip test when lacking privileges to create a new namespace
To reproduce, as an unprivileged user start a docker container and build
and run the unit tests inside it:
$ docker run --rm -ti debian:bookworm bash
...
/* test_close_all_fds */
Successfully forked off '(caf-plain)' as PID 10496.
Skipping PR_SET_MM, as we don't have privileges.
(caf-plain) succeeded.
Failed to fork off '(caf-noproc)': Operation not permitted
Assertion 'r >= 0' failed at src/test/test-fd-util.c:392, function test_close_all_fds(). Aborting.
Luca Boccassi [Wed, 11 Dec 2024 12:10:13 +0000 (12:10 +0000)]
test-capability: CAP_LINUX_IMMUTABLE is not available in unprivileged containers
have ambient caps: yes
Capabilities:cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap=ep
Failed to drop auxiliary groups list: Operation not permitted
Failed to change group ID: Operation not permitted
Capabilities:cap_dac_override,cap_net_raw=ep
Capabilities:cap_dac_override=ep
Successfully forked off '(getambient)' as PID 12505.
Skipping PR_SET_MM, as we don't have privileges.
Ambient capability cap_linux_immutable requested but missing from bounding set, suppressing automatically.
Assertion 'x < 0 || FLAGS_SET(c, UINT64_C(1) << CAP_LINUX_IMMUTABLE)' failed at src/test/test-capability.c:273, function test_capability_get_ambient(). Aborting.
(getambient) terminated by signal ABRT.
src/test/test-capability.c:258: Assertion failed: expected "r" to succeed, but got error: Protocol error
logind: drop one duplicate param in manager_is_inhibited()
In the review in https://github.com/systemd/systemd/pull/30307#pullrequestreview-2255002732
removal of the excessive boolean parameters was requested. We don't need
a separate boolean param here, since we always pass true with a uid and
false otherwise.