Luca Boccassi [Sat, 2 Nov 2024 12:04:49 +0000 (12:04 +0000)]
Add support for id-mapped mounts to Exec directories (#34078)
Currently, bind-mounted directories within a user/mount namespace get
the uid/gid stored on their files. If the host creates a file in the
source directory, it will still show as root in the namespace.
Id-mapping is a filesystem feature that allows a mount namespace to show
a different uid than what is actually stored on a file. Add support for
id-mappings to exec directories, so that the files within the mount
namespace are owned by the unprivileged uid/gid.
In the host namespace, creating a file "test":
```
root@abeltran-test:/var/lib/andresstatedir# ls -lah
total 8.0K
drwxr-xr-x 2 root root 4.0K Aug 21 23:48 .
drwx------ 3 root root 4.0K Aug 21 23:47 ..
-rw-r--r-- 1 root root 0 Aug 21 23:48 test
```
Within the unit namespace:
```
root@abeltran-test:/var/lib/sampleservice# ls -lah
total 4.0K
drwxr-xr-x 2 63750 63750 4.0K Aug 21 23:48 .
drwxr-xr-x 3 root root 60 Aug 21 23:47 ..
-rw-r--r-- 1 63750 63750 0 Aug 21 23:48 test
```
```
root@abeltran-test:/# mount | grep and
/dev/sda1 on /var/lib/private/andresstatedir type ext4 (rw,nosuid,noexec,relatime,idmapped,discard,errors=remount-ro,commit=30)
```
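The effect shown in the listings above can be modelled as a plain range translation. This is only a conceptual sketch, not kernel or systemd code: the function name, the tuple layout of a mapping, and the single-id range used in the usage note are assumptions made for illustration, and 65534 is assumed as the overflow id shown for unmapped owners.

```python
OVERFLOW_ID = 65534  # assumed id displayed for owners that fall outside every mapping


def shown_id(stored_id, mappings):
    """Translate an on-disk uid/gid into the id visible through the mount.

    mappings is a list of (stored_start, shown_start, count) ranges
    (a hypothetical layout chosen for this sketch).
    """
    for stored_start, shown_start, count in mappings:
        if stored_start <= stored_id < stored_start + count:
            # Keep the offset within the range, shift the base.
            return shown_start + (stored_id - stored_start)
    return OVERFLOW_ID
```

With a mapping such as `[(0, 63750, 1)]`, a file stored as uid 0 on the host would be reported as 63750 inside the unit's namespace, which matches the two `ls` listings.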
Luca Boccassi [Sat, 2 Nov 2024 11:27:28 +0000 (11:27 +0000)]
logind: respect SD_LOGIND_ROOT_CHECK_INHIBITORS with weak blockers (#34969)
The check for the old flag was not restored when the weak blocker was
added, add it back. Also skip polkit check for root for the weak
blocker, to keep compatibility with the previous behaviour.
Luca Boccassi [Thu, 31 Oct 2024 16:02:38 +0000 (16:02 +0000)]
logind: respect SD_LOGIND_ROOT_CHECK_INHIBITORS with weak blockers
The check for the old flag was not restored when the weak
blocker was added, add it back. Also skip polkit check for
root for the weak blocker, to keep compatibility with the
previous behaviour.
Daan De Meyer [Fri, 1 Nov 2024 12:05:46 +0000 (13:05 +0100)]
mkosi: Set BuildSourcesEphemeral=no in mkosi.clangd
We're just running a language server so no need to put a writable
overlay on top of the build sources to prevent modifications. This
hopefully helps the language server track modifications to the source
files better.
Luca Boccassi [Fri, 1 Nov 2024 12:25:35 +0000 (12:25 +0000)]
coredump: lock down EnterNamespace= mount even more (#34975)
Let's disable symlink following if we attach a container's mount tree to
our own mount namespace. We after all mount the tree to a different
location in the mount tree than where it was inside the container, hence
symlinks (if they exist) will all point to the wrong places (even if
relative, some might point to other places). And since symlink attacks
are a thing, and we let libdw operate on the tree, let's lock this down
as much as we can and simply disable symlink traversal entirely.
This makes use of the new TIOCGPTPEER pty ioctl() for directly opening a
PTY peer, without going via path names. This is nice because it closes a
race around allocating and opening the peer. And also has the nice
benefit that if we acquired an fd originating from some other
namespace/container, we can directly derive the peer fd from it, without
having to reenter the namespace again.
Luca Boccassi [Mon, 28 Oct 2024 19:58:58 +0000 (19:58 +0000)]
core: add read-only flag for exec directories
When an exec directory is shared between services, this allows one of the
services to be the producer of files, and the other the consumer, without
letting the consumer modify the shared files.
This will be especially useful in conjunction with id-mapped exec directories
so that fully sandboxed services can share directories in one direction, safely.
Adrian Vovk [Fri, 2 Feb 2024 03:53:09 +0000 (22:53 -0500)]
homed: Allow user to change parts of their record
This allows an unprivileged user that is active at the console to change
the fields that are in the selfModifiable allowlists (introduced in a
previous commit) without authenticating as a system administrator.
Administrators can disable this behavior per-user by setting the
relevant selfModifiable allowlists, or system-wide by changing the
policy of the org.freedesktop.home1.update-home-by-owner Polkit action.
coredump: lock down EnterNamespace= mount even more
Let's disable symlink following if we attach a container's mount tree to
our own mount namespace. We after all mount the tree to a different
location in the mount tree than where it was inside the container, hence
symlinks (if they exist) will all point to the wrong places (even if
relative, some might point to other places). And since symlink attacks
are a thing, and we let libdw operate on the tree, let's lock this down
as much as we can and simply disable symlink traversal entirely.
coredump: rework protocol between coredump pattern handler and processing service (#34970)
In
https://github.com/systemd/systemd/commit/68511cebe58977ea68ae4f57c6462e979efd1cff
the ability to pass the
coredump's mount namespace fd from the coredump pattern handler was added
to systemd-coredump. For this the protocol was augmented, in an attempt to
provide both forward and backward compatibility.
The protocol as of v256: one or more datagrams with journal log fields
about the coredump are sent via a SOCK_SEQPACKET connection. It is
finished with a zero length datagram which carries the coredump fd (this
last datagram is called "sentinel" sometimes).
The protocol after
https://github.com/systemd/systemd/commit/68511cebe58977ea68ae4f57c6462e979efd1cff
is extended
so that after the sentinel a 2nd sentinel is sent, with a pair of fds:
the coredump fd *again* and a mount fd (acquired via open_tree()) of the
container's mount tree. It's a bit ugly to send the coredump fd a 2nd
time, but more importantly the implementation didn't work: since on
SOCK_SEQPACKET a zero sized datagram cannot be distinguished from EOF
(which is a Linux API design mistake), an early EOF would be
misunderstood as a zero size datagram lacking any fd, which resulted in
protocol termination.
Moreover, I think if we touch the protocol we should make the move to
pidfds at the same time.
All of the above is what this protocol rework addresses.
1. A pidfd is now sent as well
2. The protocol is now payload, followed by the coredump fd datagram (as
before). But now followed by a second empty datagram with a pidfd,
and a third empty datagram with the mount tree fd. Of these, the last
two, or only the last, are optional. Thus, it's now a stream of payload
datagrams with one, two or three fd-laden datagrams as sentinel. If
we read the 2nd or 3rd sentinel without an attached fd we assume this
is actually an EOF (whether it actually is one or not doesn't matter
here). This should provide nice up and down compatibility.
3. The mount_tree_fd is moved into the Context object. The pidfd is
placed there too, as a PidRef. Thus the data we pass around is now
the coredump fd plus the context, which is simpler and makes a lot
more semantical sense I think.
4. The "first" boolean is replaced by an explicit state engine enum.
Instead of passing a boolean picking the destruction method, just have
different functions. That's much nicer in the context of _cleanup_, and how
we usually do things.
Use pidref to acquire some fields. This just makes use of the pidref
helpers we already have. We acquire a lot of other data via classic pids
still, but for that we first have to write race-free pidref getters,
hence leave that for another time.
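The datagram sequence described in the protocol rework above can be sketched with a socket pair. This is a minimal illustration under stated assumptions, not the systemd implementation: the function names are hypothetical, and the receiver simply applies the rule that an empty datagram without an attached fd is treated as EOF.

```python
import socket


def send_coredump(sock, fields, fds):
    """Send payload datagrams, then one empty fd-carrying datagram per fd."""
    for field in fields:
        sock.send(field)
    for fd in fds:
        # Zero-length datagram whose only content is the fd (a "sentinel").
        socket.send_fds(sock, [b""], [fd])


def receive_coredump(sock):
    """Collect payload fields and sentinel fds until EOF."""
    fields, fds = [], []
    while True:
        data, received, _flags, _addr = socket.recv_fds(sock, 65536, 1)
        if data:
            fields.append(data)
        elif received:
            fds.append(received[0])
        else:
            # Empty and no fd: indistinguishable from EOF, so stop here.
            break
    return fields, fds
```

A caller would create the pair with `socket.socketpair(socket.AF_UNIX, socket.SOCK_SEQPACKET)`, close the sending side after the last sentinel, and then check how many fds arrived: one (coredump fd only), two (plus pidfd) or three (plus mount tree fd).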
coredump: rework protocol between coredump pattern handler and processing service
In 68511cebe58977ea68ae4f57c6462e979efd1cff the ability to pass the
coredump's mount namespace fd from the coredump pattern handler was added
to systemd-coredump. For this the protocol was augmented, in an attempt to
provide both forward and backward compatibility.
The protocol as of v256: one or more datagrams with journal log fields
about the coredump are sent via a SOCK_SEQPACKET connection. It is
finished with a zero length datagram which carries the coredump fd (this
last datagram is called "sentinel" sometimes).
The protocol after 68511cebe58977ea68ae4f57c6462e979efd1cff is extended
so that after the sentinel a 2nd sentinel is sent, with a pair of fds:
the coredump fd *again* and a mount fd (acquired via open_tree()) of the
container's mount tree. It's a bit ugly to send the coredump fd a 2nd
time, but more importantly the implementation didn't work: since on
SOCK_SEQPACKET a zero sized datagram cannot be distinguished from EOF
(which is a Linux API design mistake), an early EOF would be
misunderstood as a zero size datagram lacking any fd, which resulted in
protocol termination.
Moreover, I think if we touch the protocol we should make the move to
pidfds at the same time.
All of the above is what this protocol rework addresses.
1. A pidfd is now sent as well
2. The protocol is now payload, followed by the coredump fd datagram (as
before). But now followed by a second empty datagram with a pidfd,
and a third empty datagram with the mount tree fd. Of these, the last
two, or only the last, are optional. Thus, it's now a stream of payload
datagrams with one, two or three fd-laden datagrams as sentinel. If
we read the 2nd or 3rd sentinel without an attached fd we assume this
is actually an EOF (whether it actually is one or not doesn't matter
here). This should provide nice up and down compatibility.
3. The mount_tree_fd is moved into the Context object. The pidfd is
placed there too, as a PidRef. Thus the data we pass around is now
the coredump fd plus the context, which is simpler and makes a lot
more semantical sense I think.
4. The "first" boolean is replaced by an explicit state engine enum.
Let's rename this local variable, since we are not operating on the
coredump process here after all, but on the leader of the namespace the
coredump process is in, which is quite different, hence let's make this
very clear via the name.
The detailed error response is already logged, hence it is not necessary to
log it again with the errno converted from the error response, which is
typically less informative, e.g.
===
varlink-26-26: Setting state idle-server
varlink-26-26: Received message: {"method":"io.systemd.UserDatabase.GetUserRecord","parameters":{"service":""}}
varlink-26-26: Changing state idle-server → processing-method
varlink-26-26: Sending message: {"error":"io.systemd.UserDatabase.BadService","parameters":{}}
varlink-26-26: Changing state processing-method → processed-method
varlink-26-26: Callback for io.systemd.UserDatabase.GetUserRecord returned error: Invalid request descriptor
varlink-26-26: Changing state processed-method → idle-server
varlink-26-26: Got POLLHUP from socket.
===
Luca Boccassi [Thu, 31 Oct 2024 21:10:28 +0000 (21:10 +0000)]
Rework sysupdate meson options (#34832)
systemd-sysupdated is still unstable and we'd like to make breaking
changes to it even after the v257 release, so we document it as such and
disable building it by default in release builds. The distro can still
opt in, and we still build it in developer mode so it has CI coverage.
meson: add separate option for sysupdated, disable in release builds
This commit introduces a build-time option to enable/disable sysupdated
separately from sysupdate. 'auto' translates to enabled by default in
developer builds.
Mike Gilbert [Thu, 24 Oct 2024 16:24:35 +0000 (12:24 -0400)]
posix_spawn_wrapper: do not set POSIX_SPAWN_SETSIGDEF flag
Setting this flag is a noop without a corresponding call to
posix_spawnattr_setsigdefault.
If we call posix_spawnattr_setsigdefault with a full signal set,
it causes glibc's posix_spawn implementation to call sigaction 63 times,
once for each signal. That seems wasteful.
This feature is really only useful for signals which have their
disposition set to SIG_IGN. Otherwise the disposition gets set to
SIG_DFL automatically, either by clone(CLONE_CLEAR_SIGHAND) or the
subsequent execve.
As far as I can tell, systemd does not have any signals set to SIG_IGN
under normal operating conditions.
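The point about SIG_IGN can be observed from Python, whose `os.posix_spawn()` exposes the same mechanism through its `setsigdef` argument. This is a small sketch, not systemd's wrapper: the helper name and the choice of SIGUSR1 are assumptions, and the helper temporarily changes the caller's own SIGUSR1 disposition, so it has to run in the main thread.

```python
import os
import signal
import sys

# The child exits 0 if it inherited SIGUSR1 as ignored, 1 otherwise.
CHILD = (
    "import signal, sys; "
    "sys.exit(0 if signal.getsignal(signal.SIGUSR1) == signal.SIG_IGN else 1)"
)


def child_ignores_sigusr1(reset):
    """Spawn a child while SIGUSR1 is ignored; report whether it still is there."""
    previous = signal.signal(signal.SIGUSR1, signal.SIG_IGN)
    try:
        # Only request a reset for the one signal we care about.
        kwargs = {"setsigdef": [signal.SIGUSR1]} if reset else {}
        pid = os.posix_spawn(
            sys.executable, [sys.executable, "-c", CHILD], os.environ, **kwargs
        )
        _, status = os.waitpid(pid, 0)
    finally:
        signal.signal(signal.SIGUSR1, previous)
    return os.waitstatus_to_exitcode(status) == 0
```

Without `setsigdef` the ignored disposition survives into the child; with it, the signal is back at its default. A signal that merely had a handler installed would be reset by the exec either way, which is why a full signal set buys nothing.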
Mike Yuan [Thu, 31 Oct 2024 14:45:15 +0000 (15:45 +0100)]
systemctl: don't fall back to immediate shutdown silently if we cannot schedule one
The previous behavior of systemctl --when= seems absurd, i.e.
if we fail to schedule shutdown in the future it's performed
immediately. Let's instead hard fail, which also removes the need
of specializing on certain errnos (preparation for later commits).
boot: stop appending NUL to .sdmagic and .sbat sections
Those text sections had a trailing NUL byte. It's debatable whether this is a
good idea or not. Correctly written consumers will look at the section size so
they wouldn't need this. Shim doesn't use a trailing NUL, so let's follow suit.
898e9edc469f87fdb6018128bac29eef0a5fe698 reworked this code, but didn't actually
change the logic. We have always been appending the trailing zero by using a
NUL-terminated string as the section contents. (I checked this with v253.18
from before the elf2efi rework.)
.sdmagic contains a string like "#### LoaderInfo: systemd-boot 257~devel ####",
which changes with each version, so previous versions would compare unequal
anyway, so we don't need to worry about backwards compatibility.
Of these, BEL causes problems in some corner case scenarios (see #34604).
Hence switch away from that wherever we use it, and prefer the \x1b\x5c
instead. That's preferable over \x9c, since the latter is also a valid
UTF-8 codepoint. See discussion here for example:
After commit d2ebf5cc1d59e29139f06efaa3a9b2c184cdaa25, sd_varlink_error()
returns a negative errno, hence the function always returns a negative errno
on failure.
The test container exits quickly, hence when varlinkctl is called, the
container may already be terminated. Let's make the container run
indefinitely.
Also, this makes the test remove the os-release files after the container is started.
sd-varlink: change sd_varlink_error() to always return an error
Let's make sure that sd_varlink_error() always returns an error code, so
that we can use it in a style "return sd_varlink_error(…);" everywhere,
which has two effects: return a good error reply to clients, and exit
the current stack frame with a failure code.
Interestingly sd_varlink_error_invalid_parameter() already worked like
this in some cases, but sd_varlink_error() itself didn't.
This is an alternative to the error handling tweak proposed in #34882,
but I think is a lot more generically useful, since it establishes a
pattern.
I checked our codebase, and this change should generally be OK without
breaking callsites, since the current callers (with exception of the
machined case from #34882) called sd_varlink_error() in the outermost
varlink method call dispatch stack frame, where this behaviour change
does not alter anything.
This is similar, btw, to how sd_bus_error_setf() and friends always return
error codes too, synthesized from their parameters.
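The "return sd_varlink_error(…);" style can be illustrated with a language-neutral model. Everything below is hypothetical and is not the sd-varlink API: `Link`, `reply_error()` and `get_user_record()` are stand-ins invented for this sketch, and using EBADR as the returned code is an assumption made for illustration.

```python
import errno


class Link:
    """Hypothetical stand-in for a varlink connection; it only records replies."""

    def __init__(self):
        self.replies = []


def reply_error(link, error_id, parameters=None):
    """Queue an error reply and always return a negative errno-style code."""
    link.replies.append({"error": error_id, "parameters": parameters or {}})
    return -errno.EBADR  # assumed code for this sketch


def get_user_record(link, parameters):
    """Hypothetical method handler using the 'return reply_error(...)' style."""
    if not parameters.get("service"):
        # One statement both answers the client and fails this stack frame.
        return reply_error(link, "io.systemd.UserDatabase.BadService")
    link.replies.append({"parameters": {"record": {}}})
    return 0
```

Because the helper never returns success, a nested helper can propagate the failure upwards with a single `return`, and the caller does not need to remember to convert a "reply sent" result into an error.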
All our public headers strive for C90 compatibility with a few
extensions, and thus avoid stdbool.h and bool.
The sd_json_format_enabled() helper seems like a poor place to start
requiring stdbool.h now.
Also drop __extension__ since we are not using it anywhere else in very
similar inline functions.
(And we probably should drop any _sd_const declarations on inline
functions. Given that the compiler has the function implementation
around always, because it's in the header there's really no reason to
specify this manually, the compiler can trivially figure this out on its
own. But that's for another time.)