varlink: tweak what we include in "system error" messages
We so far only included the numeric Linux errno. That's pretty Linux
specific however. Hence, let's improve things and include an origin
string, that clearly marks Linux as origin. Also, include the string
name of the error.
Take these two fields into account when translating back, too. So that
we prefer going by symbolic name rather than by numeric id.
sd-json: make it safe to call sd_json_dispatch_full() with a NULL table
This is useful for generating good errors when dispatching varlink
methods that take no parameters, as we'll still generate precise errors
in that case, taking a NULL table as equivalent as one with no
entries.
userdb: define new 64K "foreign UID" range (#35932)
This is establish the basic concepts for #35685, in the hope to get this
merged first.
This defines a special, fixed 64K UID range that is supposed to be used
by directory container images on disk, that is mapped to a dynamic UID
range at runtime (via idmapped mounts).
This enables a world where each container can run with a dynamic UID
range, but this in no way leaks onto the disk, thus making supposedly
dynamic, transient UID range assignments persistent.
This is infrastructure later used for the primary part of #35685: unpriv
container execution with directory images inside user's home dirs, that
are assigned to this special "foreign UID range".
This PR only defines the ranges, synthesizes NSS records for them via
userdb, and then exposes them in a new "systemd-dissect --shift" command
that can re-chown a container directory tree into this range (and in
fact any range).
This comes with docs. But no tests. There are tests in #35685 that cover
all this, but they are more comprehensive and also test nspawn's hook-up
with this, hence are excluded from this PR.
Daan De Meyer [Thu, 9 Jan 2025 14:45:41 +0000 (15:45 +0100)]
fmf: Use one fewer than number of available CPUs again
This effectively reverts b8582198ca1e6fe390f7169e623a9130b68a6b36
as I can not get the testing farm bare metal machines working
downstream and even if I managed to, without also using the testing
farm bare metal machines upstream (for which there is no capacity),
the setup would very quickly bitrot anyway so we'll just run the
container based tests for now.
Stash the subscriber list when we disconenct from the bus (#35406)
If we unexpectly disconnect from the bus, systemd would end up dropping
the list of subscribers, which breaks the ability of clients like logind
to monitor the state of units.
Stash the list of subscribers into the deserialized state in the event
of a disconnect so that when we recover we can renew the broken
subscriptions.
pam: add session class "none" to disable logind sessions (#35171)
pam_systemd is used to create logind sessions and to apply extended
attributes from json user records. Not every application that creates a
pam session expects a login scope, but may be interested in the extended
attributes of json user records. Session class "none" implements this
service by disabling logind for this session altogether.
Daan De Meyer [Thu, 9 Jan 2025 10:59:58 +0000 (11:59 +0100)]
TEST-06-SELINUX: Add knob to allow checking for AVCs (#35921)
When running the integration tests downstream, it's useful to be able to
test that a new systemd version doesn't introduce any AVC denials, so
let's add a knob to make that possible.
Daan De Meyer [Wed, 8 Jan 2025 12:31:11 +0000 (13:31 +0100)]
TEST-06-SELINUX: Add knob to allow checking for AVCs
When running the integration tests downstream, it's useful to be
able to test that a new systemd version doesn't introduce any AVC
denials, so let's add a knob to make that possible.
Daan De Meyer [Thu, 9 Jan 2025 10:28:15 +0000 (11:28 +0100)]
test: Only plug in integration-test-setup.sh in interactive mode
If we're not running interactively, there's no point in the features
from integration-test-setup.sh which are intended for interactive
development and debugging so lets skip adding it in that case.
Yu Watanabe [Tue, 7 Jan 2025 18:23:29 +0000 (03:23 +0900)]
sd-device: make sd_device_new_from_path() accept relative path to device node
Even though udevadm accepts relative syspath, previously, udevadm
could not use relative path to device node:
===
$ cd /dev
$ udevadm info sda
Bad argument "sda", expected an absolute path in /dev/ or /sys/ or a unit name: Invalid argument
$ udevadm info /usr/../dev/sda
Unknown device "/usr/../dev/sda": No such device
===
With this change, both the above cases work fine.
Note, still sd_device_new_from_devname() requires absolute path starts
with /dev/, for safety.
Daan De Meyer [Wed, 8 Jan 2025 21:20:42 +0000 (22:20 +0100)]
fmf: Use different heuristic for number of process with many CPUs
Downstream we sometimes end up with machines with lots of CPUs which
leads to running out of memory when trying to run the tests in VMs.
So let's switch to a different heuristic when we have lots of CPUs to
avoid running out of memory.
Ronan Pigott [Thu, 28 Nov 2024 19:53:32 +0000 (12:53 -0700)]
dbus: stash the subscriber list when we disconenct from the bus
If we unexpectly disconnect from the bus, systemd would end up dropping
the list of subscribers, which breaks the ability of clients like logind
to monitor the state of units.
Stash the list of subscribers into the deserialized state in the event
of a disconnect so that when we recover we can renew the broken
subscriptions.
hwids: add a new efi firmware type of device entry (#35747)
This change adds a new firmware type device entry for the .hwids
section.
It also adds compile time validations and appropriate unit tests for
them.
chid_match() and related helpers have been updated accordingly.
Duplicate of https://github.com/systemd/systemd/pull/35281
Last review feedback's from this above PR has been incorporated and
merged.
Remove no longer needed login-options override. Fixes agetty autologin.
The need for -o was introduced in db6aeda to set the -p flag for login.
Setting -o overrides agettys built-in handling of arguments, so "-- \\u" was needed to mimic it.
This broke the autologin-feature, since the -f (noauth) flag is not passed to login [1].
But with 3d2157e, the -p flag is dropped, but the full change wasn't reverted,
leaving autologin still broken - But for no reason since agetty does the right thing.
This makes the UID range configurable via build time options, but of
course it really shouldn't be changed. The default range I picked is
outside even of IPAs current (ridiculously large) allocation ranges,
hence hopefully minimizes conflicts.
nsresource: optionally mangle userns names passed to nsresourced (#35900)
We enforce quite strict rules on naming userns we assign uid ranges to
for users. So strict that they are hard to get right for clients. hence,
let's optionally mangle provided strings so that they work for us.
This should make it much easier to work with the API, as something
reasonable happens regarldess what kind of garbage a client sets as
name.
mangling the name is opt-in for clients, so that there's tight control
for the client on the name, but also "fire and forget".
pid1: allow removal of foreign-owned subcgroups of cgroups owned by some user (#35922)
This improves operation in unprivileged userns environments, where
unpriv user code might invoke a container with a delegated userns UID
range, and thus ends up with a subcgroup owned by another UID. With this
patch any user is always allowed to remove their own cgroups even if it
has subcgroups owned by other users.
This removes a DoS of sorts, and enforces the rule that users strictly
own everything below cgroups they own.
test: add testcase that verifies we can safely delete subcgroups owned by other users if we own the parent
This is a test for the previous commits: we create an unpriv, delegated cgroup in
--user mode, then create a subcgroup that is owned by some other user
(to mimic the case where an unpriv user got a userns with delegated UIDs
assigned), and then try to stop the unit. traditionally this would fail,
because our unpriv systemd --user instance can't remove the subcrroup
owned by someone else. With the earlier patches this is addressed.
pid1: add D-Bus API for removing delegated subcgroups
When running unprivileged containers, we run into a scenario where an
unpriv owned cgroup has a subcgroup delegated to another user (i.e. the
container's own UIDs). When the owner of that cgroup dies without
cleaning it up then the unpriv service manager might encounter a cgroup
it cannot delete anymore.
Let's address that: let's expose a method call on the service manager
(primarly in PID1) that can be used to delete a subcgroup of a unit one
owns. This would then allow the unpriv service manager to ask the priv
service manager to get rid of such a cgroup.
This commit only adds the method call, the next commit then adds the
code that makes use of this.
pid1: allow moving processes in a userns owned by the user, too
Let's liberalize process migration a bit. Previously, PID 1 would only
allow you to move processes into your own cgroups, if those processes
are owned by you too. This is now slightly relaxed: it's now also OK if
the processes are in a userns owned by you.
This makes process migration more useful in context of unpriv userns.
nl6720 [Wed, 8 Jan 2025 14:19:33 +0000 (16:19 +0200)]
dissect-image: mount the ESP with fmask=0177 (#35871)
Avoid showing the files on the ESP (i.e. a FAT formatted volume) as
executable by removing the execute permission from them.
IMO this makes the colored output of `ls` more sensible since the file
system will be mounted with `noexec` anyway.
Add a `fstype_can_fmask_dmask` function that checks if a file system
type can use the `fmask` and `dmask` mount options.
This replaces `fstype_can_umask` since it was only used in
`partition_pick_mount_options` which only cares about the file system
support for fmask & dmask now.
It somewhat reduces the coverage of the feature since there are more file
systems that support umask as opposed to those supporting dmask & dmask,
but it should not be much of an issue since fmask & dmask are supported
by vfat, exfat and ntfs3.
nsresourced: add ability to mangle specified name if necessary
Let's optionally mangle any passed name on the server side so that it is
useful for identifying a userns, if it isn't suitable for that
right-away. This mostly means truncating it if too long.
It's just too nasty to leave this to the client side, since they'd have
to understand the precise rules for naming userns then.
While we are at it, add full Varlink IDL comments.
namespace-util: return recognizable error if namespace_open_by_type() fails because ns type not supported
This makes sure the the codepath that derives an nsfd from a pid works
the same for the pidfd case and the non-pidfd case: if we can verify
that /proc/ is mounted but the /proc/$PID/ns/ files are missing, we can
assume the ns type is not supported by the kernel. Hence return the same
ENOPKG error in this case as we already do in the pidfd ioctl based
codepath.
Mike Yuan [Tue, 7 Jan 2025 17:28:33 +0000 (18:28 +0100)]
Bump minimum kernel baseline to 5.4, recommended version to 5.7
As requested, a list of kernel version to feature mapping
for kernels older than minimum baseline is also included,
in order to ease potential backport work.
Luca Boccassi [Tue, 7 Jan 2025 00:40:02 +0000 (00:40 +0000)]
obs: also trigger Fedora package builds
The package is logistically separated, as the rpm sources conflict from Fedora
conflict with the rpm sources from SUSE (some files have the same name and
location but different, incompatible content), so Fedora builds can't be
triggered from the same package. The result is the same.
This is something I think we should have added a long time ago: a
flavour of open() that safely ensures the inode we are opening is a
regular file, before we open it. It does this by means of pinning the
inode via O_PATH first, and after verification actually opening it.
This ports some code over to this, but sooner or later we should
probably use this a lot more, so that we don't accidentally open weird
stuff such as device nodes or pipes, where we should not.
pretty-print: drop extra ';' from progress reporting end sequence
This corrects the closing sequence for the ConEmu progress reporting
final sequence. We by mistake sent two final ;;, where only one was
expected. The terminals I tested this with didn't care, but Ghostty
apparently does. Let's fix things and generate the closing sequence as
per doc: