cgroup: when we unload a unit, also update all its parent's members mask
This way we can corectly ensure that when a unit that requires some
controller goes away, we propagate the removal of it all the way up, so
that the controller is turned off in all the parents too.
cgroup: drastically simplify caching of cgroups members mask
Previously we tried to be smart: when a new unit appeared and it only
added controllers to the cgroup mask we'd update the cached members mask
in all parents by ORing in the controller flags in their cached values.
Unfortunately this was quite broken, as we missed some conditions when
this cache had to be reset (for example, when a unit got unloaded),
moreover the optimization doesn't work when a controller is removed
anyway (as in that case there's no other way for the parent to iterate
though all children if any other, remaining child unit still needs it).
Hence, let's simplify the logic substantially: instead of updating the
cache on the right events (which we didn't get right), let's simply
invalidate the cache, and generate it lazily when we encounter it later.
This should actually result in better behaviour as we don't have to
calculate the new members mask for a whole subtree whever we have the
suspicion something changed, but can delay it to the point where we
actually need the members mask.
This allows us to simplify things quite a bit, which is good, since
validating this cache for correctness is hard enough.
cgroup: extend reasons when we realize the enable mask
After creating a cgroup we need to initialize its
"cgroup.subtree_control" file with the controllers its children want to
use. Currently we do so whenever the mkdir() on the cgroup succeeded,
i.e. when we know the cgroup is "fresh". Let's update the condition
slightly that we also do so when internally we assume a cgroup doesn't
exist yet, even if it already does (maybe left-over from a previous
run).
This shouldn't change anything IRL but make things a bit more robust.
cgroup: in unit_invalidate_cgroup() actually modify invalidation mask
Previously this would manipulate the realization mask for invalidating
the realization. This is a bit ugly though as the realization mask's
primary purpose to is to reflect in which hierarchies a cgroup currently
exists, and it's probably a good idea to keep that in sync with
realities.
We nowadays have the an explicit fields for invalidating cgroup
controller information, the "cgroup_invalidated_mask", let's use this
one instead.
The effect is pretty much the same, as the main consumer of these masks
(unit_has_mask_realize()) checks both anyway.
cgroup: be more careful with which controllers we can enable/disable on a cgroup
This changes cg_enable_everywhere() to return which controllers are
enabled for the specified cgroup. This information is then used to
correctly track the enablement mask currently in effect for a unit.
Moreover, when we try to turn off a controller, and this works, then
this is indicates that the parent unit might succesfully turn it off
now, too as our unit might have kept it busy.
So far, when realizing cgroups, i.e. when syncing up the kernel
representation of relevant cgroups with our own idea we would strictly
work from the root to the leaves. This is generally a good approach, as
when controllers are enabled this has to happen in root-to-leaves order.
However, when controllers are disabled this has to happen in the
opposite order: in leaves-to-root order (this is because controllers can
only be enabled in a child if it is already enabled in the parent, and
if it shall be disabled in the parent then it has to be disabled in the
child first, otherwise it is considered busy when it is attempted to
remove it in the parent).
To make things complicated when invalidating a unit's cgroup membershup
systemd can actually turn off some controllers previously turned on at
the very same time as it turns on other controllers previously turned
off. In such a case we have to work up leaves-to-root *and*
root-to-leaves right after each other. With this patch this is
implemented: we still generally operate root-to-leaves, but as soon as
we noticed we successfully turned off a controller previously turned on
for a cgroup we'll re-enqueue the cgroup realization for all parents of
a unit, thus implementing leaves-to-root where necessary.
cgroup: fine tune when to apply cgroup attributes to the root cgroup
Let's tweak when precisely to apply cgroup attributes on the root
cgroup.
With this we now follow the following rules:
1. On cgroupsv2 we never apply any regular cgroups to the host root,
since the attributes generally do not exist there.
2. On cgroupsv1 we do not apply any "weight" or "shares" style
attributes to the host root cgroup, since they don't make much sense
on the top level where there's only one group, hence no need to
compare weights against each other. The other attributes are applied
to the host root cgroup however.
3. In any case we don't apply attributes to the root of container
environments (and --user roots), under the assumption that this is
managed by the manager further up. (Note that on cgroupsv2 this is
even enforced by the kernel)
4. BPF pseudo-attributes are applied in all cases (since we can have as
many of them as we want)
Let's emphasize that this function checks for the host root cgroup, i.e.
returns false for the root cgroup when we run in a container where
CLONE_NEWCGROUP is used. There has been some confusion around this
already, for example cgroup_context_apply() uses the function
incorrectly (which we'll fix in a later commit).
cgroup: only install cgroup release agent when we own the root cgroup
If we run in a container we shouldn't patch around this, and most likely
we can't anyway, and there's not much point in complaining about this.
Hence let's strictly say: the agent is private property of the host's
system instance, nothing else.
The way this is used drifted a bit from the original intent. Let's update
the description and add some examples to inspire people to texts that look
less bad during initial boot.
Ideally, coccinelle would strip unnecessary braces too. But I do not see any
option in coccinelle for this, so instead, I edited the patch text using
search&replace to remove the braces. Unfortunately this is not fully automatic,
in particular it didn't deal well with if-else-if-else blocks and ifdefs, so
there is an increased likelikehood be some bugs in such spots.
I also removed part of the patch that coccinelle generated for udev, where we
returns -1 for failure. This should be fixed independently.
Tim Ruffing [Wed, 21 Nov 2018 20:41:15 +0000 (21:41 +0100)]
vconsole: Don't skip udev call for dummy device
Kernel 4.19 supports deferred console takeover [0], i.e., fbcon will
take over the console only when the first text is displayed on the
console. Before that event, only the dummy console is active. Our
currently udev rules call systemd-vconsole on every vtcon except for
dummy consoles. Thus the exception for dummy consoles prevents a call
to systemd-vconsole when no text is displayed on the console, and as a
consequence, the keymap will not be set in that case. This is wrong and
leads to issues when keyboard input is expected without text on the
console, e.g., when a graphical password prompt is used in the boot
process.
Yann E. MORIN [Wed, 21 Nov 2018 17:09:04 +0000 (18:09 +0100)]
basic/user-util: properly protect use of gshadow
Commit 100d5f6ee6 (user-util: add new wrappers for [...] database
files), ammended by commit 4f07ffa8f5 (Use #if instead of #ifdef for
ENABLE_GSHADOW) moved code from sysuser to basic/user-util.
In doing so, the combination of both commits properly propagated the
ENABLE_GSHADOW conditions around the function manipulating gshadow, but
they forgot to protect the inclusion of the gshadow.h header.
Fix that to be able to build on C libraries that do not provide gshadow
(e.g. uClibc-ng, where it does not exist.)
Checking if resume= is configured is a good idea, but it turns out we cannot do
it reliably:
- the code only supported boot options with sd-boot, and it's not very widely
used. This means that for most systemd we could only check the current
commandline, not the next one.
- Various systems resume without, e.g. Debian has
/etc/initramfs-tools/conf.d/resume in the initramfs.
Making those checks better would be possible with enough effort, but there'll
be always new systems that boot in a slightly different way and we would need
to keep adding new cases. Longer term, we want to rely on autodetecting the
resume partition, and then checks like this will not be necessary at all. It is
quite clear from the number of bug reports that the number of poeple impacted
by this is quite high now, so let's just drop this.
Franck Bui [Mon, 27 Aug 2018 21:16:10 +0000 (23:16 +0200)]
logind: stop managing VT switches if no sessions are registered on that VT
When no sessions are registered on a given VT (anymore), we should always let
the kernel processes VT switching (instead of simply emitting a warning)
otherwise the requests sent by the kernel are simply ignored making the VT
switch requested by users simply impossible.
Even if it shouldn't happen, this case was encountered in issue #9754, so
better to be safe than sorry.
Franck Bui [Mon, 27 Aug 2018 20:13:21 +0000 (22:13 +0200)]
logind: become the controlling terminal process before restoring VT
Basically when a session ends, logind notices and restores VT_AUTO so the
kernel takes back VT-switching over.
logind achieves that by watching the process that took control of the session
(via the "TakeControl" D-Bus method), aka "the watched process", which can
be different from the one that initially opened the VT aka "the terminal
controlling process".
In this case the terminal controlling process can exit after the watched one
did and while logind is restoring the VT.
Even if logind took care to re-open the VT in case the VT was already in HUP
state, it wasn't enough because the terminal controlling process could have
exited right after, leaving the VT in HUP state and in VT_PROCESS mode making
further VT-switching impossible.
This patch fixes this situation by forcing logind to become the terminal
controlling process.
systemd already sets the umask (see e3b8d0637dd755b3426f3363b2cdad63f738116c). When
running under systemd, we don't need to set it. And when *not* running under
systemd, for example during development, there is no reason to override the user
config. Let's just drop those calls.
Michal Koutný [Fri, 2 Nov 2018 19:56:08 +0000 (20:56 +0100)]
core: Detect initial timer state from serialized data
We keep a mark whether a single-shot timer was triggered in the caller's
variable initial. When such a timer elapses while we are
serializing/deserializing the inner state, we consider the timer
incorrectly as elapsed and don't trigger it later.
This patch exploits last_trigger timestamp that we already serialize,
hence we can eliminate the argument initial completely.
This removes the call to log_close(), and refactors how fork() is done. Now
the parent also goes through normal cleanup. This isn't necessary to use the
macro, but it feels cleaner this way.
This removes a call to log_close(). I don't think this should matter.
The call to mac_selinux_init() is moved after parse_argv(). We probably
don't need selinux when printing help().
For PID 1 we adjust the umask to 0, but generators should not run that
way, given that they might be implemented as shell scripts and such.
Let's hence explicitly adjust the umask for them.
We already do this for unit generators. Let's do this for env
generators, too.
tests: skip test_exec_ambientcapabilities on Travis CI under ASan
Let's not bother contributors with spurious failures nobody can't
seem to reproduce. There is an issue about that where we're trying
to figure out what's going on: https://github.com/systemd/systemd/issues/10696.