coredump: lock down EnterNamespace= mount even more
Let's disable symlink following if we attach a container's mount tree to
our own mount namespace. We afte rall mount the tree to a different
location in the mount tree than where it was inside the container, hence
symlinks (if they exist) will all point to the wrong places (even if
relative, some might point to other places). And since symlink attacks
are a thing, and we let libdw operate on the tree, let's lock this down
as much as we can and simply disable symlink traversal entirely.
coredump: rework protocol between coredump pattern handler and processing service (#34970)
In
https://github.com/systemd/systemd/commit/68511cebe58977ea68ae4f57c6462e979efd1cff
the ability to pass the
coredump's mount namespace fd from the coredump patter handler was added
to systemd-coredump. For this the protocol was augmented, in attempt to
provide both forward and backward compatibility.
The protocol as of v256: one or more datagrams with journal log fields
about the coredump are sent via an SOCK_SEQPACKET connection. It is
finished with a zero length datagram which carries the coredump fd (this
last datagram is called "sentinel" sometimes).
The protocol after
https://github.com/systemd/systemd/commit/68511cebe58977ea68ae4f57c6462e979efd1cff
is extended
so that after the sentinal a 2nd sentinel is sent, with a pair of fds:
the coredump fd *again* and a mount fd (acquired via open_tree()) of the
container's mount tree. It's a bit ugly to send the coredump fd a 2nd
time, but what's more important the implementation didn't work: since on
SOCK_SEQPACKET a zero sized datagram cannot be distinguished from EOF
(which is a Linux API design mistake), an early EOF would be
misunderstood as a zero size datagram lacking any fd, which resulted in
protocol termination.
Moreover, I think if we touch the protocol we should make the move to
pidfs at the same time.
All of the above is what this protocol rework addresses.
1. A pidfd is now sent as well
2. The protocol is now payload, followed by the coredump fd datagram (as
before). But now followed by a second empty datagram with a pidfd,
and a third empty datagram with the mount tree fd. Of this the latter
two or last are optional. Thus, it's now a stream of payload
datagrams with one, two or three fd-laden datagrams as sentinel. If
we read the 2nd or 3rd sentinel without an attached fd we assume this
is actually an EOF (whether it actually is one or not doesn't matter
here). This should provide nice up and down compatibility.
3. The mount_tree_fd is moved into the Context object. The pidfd is
placed there too, as a PidRef. Thus the data we pass around is now
the coredump fd plus the context, which is simpler and makes a lot
more semantical sense I think.
4. The "first" boolean is replaced by an explicit state engine enum
instead of passing a boolean picking the destruction method just have
different functions. That's much nicer in context of _cleanup_, and how
we usually do things.
Use pidref to acquire some fields. This just makes use of the pidref
helpers we already have. We acquire a lot of other data via classic pids
still, but for that we first have to write race-free pidref getters,
hence leave that for another time.
coredump: rework protocol between coredump pattern handler and processing service
In 68511cebe58977ea68ae4f57c6462e979efd1cff the ability to pass the
coredump's mount namespace fd from the coredump patter handler was added
to systemd-coredump. For this the protocol was augmented, in attempt to
provide both forward and backward compatibility.
The protocol as of v256: one or more datagrams with journal log fields
about the coredump are sent via an SOCK_SEQPACKET connection. It is
finished with a zero length datagram which carries the coredump fd (this
last datagram is called "sentinel" sometimes).
The protocol after 68511cebe58977ea68ae4f57c6462e979efd1cff is extended
so that after the sentinal a 2nd sentinel is sent, with a pair of fds:
the coredump fd *again* and a mount fd (acquired via open_tree()) of the
container's mount tree. It's a bit ugly to send the coredump fd a 2nd
time, but what's more important the implementation didn't work: since on
SOCK_SEQPACKET a zero sized datagram cannot be distinguished from EOF
(which is a Linux API design mistake), an early EOF would be
misunderstood as a zero size datagram lacking any fd, which resulted in
protocol termination.
Moreover, I think if we touch the protocol we should make the move to
pidfs at the same time.
All of the above is what this protocol rework addresses.
1. A pidfd is now sent as well
2. The protocol is now payload, followed by the coredump fd datagram (as
before). But now followed by a second empty datagram with a pidfd,
and a third empty datagram with the mount tree fd. Of this the latter
two or last are optional. Thus, it's now a stream of payload
datagrams with one, two or three fd-laden datagrams as sentinel. If
we read the 2nd or 3rd sentinel without an attached fd we assume this
is actually an EOF (whether it actually is one or not doesn't matter
here). This should provide nice up and down compatibility.
3. The mount_tree_fd is moved into the Context object. The pidfd is
placed there too, as a PidRef. Thus the data we pass around is now
the coredump fd plus the context, which is simpler and makes a lot
more semantical sense I think.
4. The "first" boolean is replaced by an explicit state engine enum
Let's rename this local variable, since we are not operating on the
coredump process here after all, but on the leader of the namespace the
coredump process in, which is quite different, hence let's make this
very clear via the name.
The detailed error response is already logged, hence not necessary to
log again with the errno converted from the error response, which typically
less informative, e.g.
===
varlink-26-26: Setting state idle-server
varlink-26-26: Received message: {"method":"io.systemd.UserDatabase.GetUserRecord","parameters":{"service":""}}
varlink-26-26: Changing state idle-server → processing-method
varlink-26-26: Sending message: {"error":"io.systemd.UserDatabase.BadService","parameters":{}}
varlink-26-26: Changing state processing-method → processed-method
varlink-26-26: Callback for io.systemd.UserDatabase.GetUserRecord returned error: Invalid request descriptor
varlink-26-26: Changing state processed-method → idle-server
varlink-26-26: Got POLLHUP from socket.
===
Luca Boccassi [Thu, 31 Oct 2024 21:10:28 +0000 (21:10 +0000)]
Rework sysupdate meson options (#34832)
systemd-sysupdated is still unstable and we'd like to make breaking
changes to it even after the v257 release, so we document it as such and
disable building it by default in release builds. The distro can still
opt-in, and we still build it in developer mode so it has CI coverage
meson: add separate option for sysupdated, disable in release builds
This commit introduces a build-time option to enable/disable sysupdated
separately from sysupdate. 'auto' translated to enabled by default in
developer builds.
Mike Gilbert [Thu, 24 Oct 2024 16:24:35 +0000 (12:24 -0400)]
posix_spawn_wrapper: do not set POSIX_SPAWN_SETSIGDEF flag
Setting this flag is a noop without a corresponding call to
posix_spawnattr_setsigdefault.
If we call posix_spawnattr_setsigdefault with a full signal set,
it causes glibc's posix_spawn implementation to call sigaction 63 times,
once for each signal. That seems wasteful.
This feature is really only useful for signals which have their
disposition set to SIG_IGN. Otherwise the dispostion gets set to
SIG_DFL automatically, either by clone(CLONE_CLEAR_SIGHAND) or the
subsequent execve.
As far as I can tell, systemd does not have any signals set to SIG_IGN
under normal operating conditions.
Mike Yuan [Thu, 31 Oct 2024 14:45:15 +0000 (15:45 +0100)]
systemctl: don't fall back to immediate shutdown silently if we cannot schedule one
The previous behavior of systemctl --when= seems absurd, i.e.
if we fail to schedule shutdown in the future it's performed
immediately. Let's instead hard fail, which also removes the need
of specializing on certain errnos (preparation for later commits).
boot: stop appending NUL to .sdmagic and .sbat sections
Those text sections had a trailing NUL byte. It's debatable whether this is a
good idea or not. Correctly written consumers will look at the section size so
they wouldn't need this. Shim doesn't use a trailing NUL, so let's follow suit.
898e9edc469f87fdb6018128bac29eef0a5fe698 reworked this code, but didn't actually
change the logic. We have always been appending the trailing zero by using a
NUL-terminated string as the section contents. (I checked this with v253.18
from before the elf2efi rework.)
.sdmagic contains a string like "#### LoaderInfo: systemd-boot 257~devel ####",
which changes with each version, so previous versions would compare unequal
anyway, so we don't need to worry about backwards compatibility.
After the commit d2ebf5cc1d59e29139f06efaa3a9b2c184cdaa25, sd_varlink_error()
returns negative errno, hence the function always return negative errno
on failure.
The test container exits shortly, hence when varlinkctl is called, the
container may be already terminated. Let's make the container live
infinitely.
Also, this makes the os-release files removed after the container is started.
sd-varlink: change sd_varlink_error() to always return an error
Let's make sure that sd_varlink_error() always returns an error code, so
that we can use it in a style "return sd_varlink_error(…);" everywhere,
which has two effects: return a good error reply to clients, and exit
the current stack frame with a failure code.
Interestingly sd_varlink_error_invalid_parameter() already worked like
this in some cases, but sd_varlink_error() itself didn't.
This is an alternative to the error handling tweak proposed in #34882,
but I think is a lot more generically useful, since it establishes a
pattern.
I checked our codebase, and this change should generally be OK without
breaking callsites, since the current callers (with exception of the
machined case from #34882) called sd_varlink_error() in the outermost
varlink method call dispatch stack frame, where this behaviour change
does not alter anything.
This is similar btw, how sd_bus_error_setf() and friends always return
error codes too, synthesized from its parameters.
All our public headers strive to C90 compatibility with a few
extensions, and thus avoided stdbool.h and bool.
The sd_json_format_enabled() helper seems like a poor place to start
requiring stdbool.h now.
Also drop __extension__ since we are not using it anywhere else in very
similar inline functions.
(And we probably should drop any _sd_const declarations on inline
functions. Given that the compiler has the function implementation
around always, because it's in the header there's really no reason to
specify this manually, the compiler can trivially figure this out on its
own. But that's for another time.)
Yu Watanabe [Sun, 27 Oct 2024 07:38:24 +0000 (16:38 +0900)]
network/netdev: replace old NetDev object with newer one on reload
Then, when a .netdev file of a stacked netdev is modified, the netdev
can be reconfigured with the updated setting by something like the
following way:
```
ip link del vlan99
networkctl reload
```
Note, removing the vlan interface in the above example may not be necessary,
e.g. when only VLAN flags, egress mapping, or ingress mapping are updated.
But, it is necessary when VLAN ID is updated.
Michael Ferrari [Wed, 9 Oct 2024 15:30:44 +0000 (17:30 +0200)]
firstboot: generalize prompt_loop more
Allows unifying the custom logic for the hostname and root shell. Root
password prompting remains separate as it's logic is substantially
different to the other prompts.
Daan De Meyer [Wed, 30 Oct 2024 12:53:31 +0000 (13:53 +0100)]
ask-password: Allow configuring the keyring timeout via an environment variable
In mkosi, we want an easy way to set the keyring timeout for every
tool we invoke that might use systemd-ask-password to query for a
password which is then stored in the kernel keyring. Let's make this
possible via a new $SYSTEMD_ASK_PASSWORD_KEYRING_TIMEOUT_SEC environment
variable.
Using an environment variable means we don't have to modify every separate
tool to add a CLI option allowing to specify the timeout. In mkosi specifically,
we'll set up a new session keyring for the mkosi process linked to the user keyring
so that any pins in the user keyring are used if available, and otherwise we'll query
for and store password in mkosi's session keyring with a zero timeout so that they stay
in the keyring until the mkosi process exits at which point they're removed from the
keyring.
Łukasz Stelmach [Tue, 29 Oct 2024 14:53:45 +0000 (15:53 +0100)]
core: make mount(8) and swapon(8) inherit SMACK label from systemd
By default mount(8), umount(8), swapon(8) and swapoff(8) should run with
with the SMACK label inherited from systemd rather than the default one
meant for services.
Several netdevs cannot set IFLA_ADDRESS or IFLA_MTU attribute on update.
Currently, the vtable field is unused, as we do not support updating
existing netdevs. Preparation for later commits.
Yu Watanabe [Sun, 27 Oct 2024 07:29:24 +0000 (16:29 +0900)]
network: drop no-op cleanup
- network_load() is always called with an empty OrderedHashmap, renamed the output
parameter to 'ret'.
- When netdev_load() is called on startup, the hashmap is NULL. When it is
called on reloading, the hashmap is not cleaned up.
Hence, then these cleanups are always no-op. Let's drop them.
Yu Watanabe [Wed, 23 Oct 2024 19:40:45 +0000 (04:40 +0900)]
network: process queued remove requests before networkd is stopped
This makes networkd process all queued remove requests when a
terminating or restarting signal is received. Otherwise, e.g. DHCPv4
address will not be removed on stop, especially when
KeepConfiguration=no.
cryptenroll,homectl,journalctl: adjust messages before qrcodes
Users will generally know what a qrcode is, so let's not treat them as dumb and
explain that it can be scanned. OTOH, we should say what the qrcode contains
and it is useful to give a hint why the users would want to scan it. Reword
messages accordingly.
(Also, don't say "to your phone", when somebody might be using a stolen phone,
or something else then a phone.)
Adrian Vovk [Sat, 19 Oct 2024 01:04:41 +0000 (21:04 -0400)]
man: warn that sysupdate's API is unstable
There's still some breaking changes we want to make to sysupdated, but
they'll potentially take months and we don't want to block the systemd
release for that long. So, we can instead mark sysupdate's API as
unstable
Let's make ConfigurationDirectory= a bit less "special-casey", by hiding
the fact that it's the only per-service dir we do not do chown()ing for
inside of a new EXEC_DIRECTORY_TYPE_SHALL_CHOWN() helper.