Chris Down [Thu, 3 Oct 2019 12:21:29 +0000 (13:21 +0100)]
cgroup: analyze: Report memory configurations that deviate from systemd
This is the most basic consumer of the new systemd-vs-kernel checker,
both acting as a reasonable standalone exerciser of the code, and also
as a way for easy inspection of deviations from systemd internal state.
Chris Down [Mon, 30 Sep 2019 15:13:32 +0000 (16:13 +0100)]
cgroup: Allow checking systemd-internal limits against the kernel
We currently don't have any mitigations against another privileged user
on the system messing with the cgroup hierarchy, bringing the system out
of line with what we've set in systemd. We also don't have any real way
to surface this to the user (we do have logs, but you have to know to
look in the first place).
There are a few possible solutions:
1. Maintaining our own cgroup tree with the new fsopen API and having a
read-only copy for everyone else. However, there are some
complications on this front, and this may be infeasible in some
environments. I'd rate this as a longer term effort that's tangential
to this patch.
2. Actively checking for changes with {fa,i}notify and changing them
back afterwards to match our configuration again. This is also
possible, but it's also good to have a way to do passive monitoring
of the situation without taking hard action. Also, currently daemons
like senpai do actually need to modify the tree behind systemd's
back (although hopefully this should be more integrated soon).
This patch implements another option, where one can, on demand, monitor
deviations in cgroup memory configuration from systemd's internal state.
Currently the only consumer is `systemd-analyze dump`, but the interface
is generic enough that it can also be exposed elsewhere later (for
example, over D-Bus).
Currently only memory limit style properties are supported, but later I
also plan to expand this out to other properties that systemd should
have ultimate control over.
hwdb: Add trackpoint rules for Lenovo Thinkpad 70, 80, 90
Extend the existing rules to match the Thinkpad models for the
previous 3 generations. It will work if a Synaptic Trackpoint is
built into the notebook. It will not work for Elantech trackpoints.
Dan Streetman [Sun, 29 Sep 2019 21:16:55 +0000 (17:16 -0400)]
src/core/automount: use DirectoryMode when calling mkdir -p
mkdir -p is called both when setting up the autofs mount, as well
as after being notified that the real mount unit should be called.
However the first mkdir -p is hardcoded with 0555, while the second
uses the value specified to DirectoryMode in the automount unit; the
second mkdir -p is only needed when called from coldplug, so under
normal operation the dirs are incorrectly created with mode 0555.
This replaces the hardcoded 0555 mode with the value of DirectoryMode.
sd-dhcp-client: merge client_send_release() into sd_dhcp_client_send_release()
The public function and the implementation were split into two for
no particular reason.
We would assert() on the internal state of the client. This should not be done
in a function that is directly called from a public function. (I.e., we should
not crash if the public function is called at the wrong time.)
assert() is changed to assert_return().
And before anyone asks: I put the assert_returns() *above* the internal
variables on purpose. This makes it easier to see that the assert_returns()
are about the state that is passed in, and if they are not satisfied, the
function returns immediately. The compiler doesn't care either way, so
the ordering that is clearest to the reader should be chosen.
It seems that other parts of link_stop_clients() should be skipped
when restarting, but I don't know enough about those other clients to have
an opinion if it is better to stop&start them on restart or not.
Anyway, that can be done in later patches now that the support for restarts
is there.
The code supports SIGTERM and SIGINT to termiante the process. It would
be possible to reporpose one of those signals for the restart operation,
but I think it's better to use a completely different signal to avoid
misunderstandings.
core: rework how logging level is calculated for kill operations
Setting the log level based on the signal made sense when signals that
were used were fixed. Since we allow signals to be configured, it doesn't
make sense to log at notice level about e.g. a restart or stop operation
just because the signal used is different.
This avoids messages like:
six.service: Killing process 210356 (sleep) with signal SIGINT.
core: add support for RestartKillSignal= to override signal used for restart jobs
v2:
- if RestartKillSignal= is not specified, fall back to KillSignal=. This is necessary
to preserve backwards compatibility (and keep KillSignal= generally useful).
nspawn: rename UNIFIED_CGROUP_HIERARCHY to SYSTEMD_NSPAWN_UNIFIED_HIERARCHY
We should never have used an unprefixed environment variable name.
All other systemd-nspawn variables have the "SYSTEMD_NSPAWN_" prefix,
and all other systemd variables have the "SYSTEMD_" prefix.
The new variable name takes precedence, but we fall back to checking the
old one. If only the old one is found, a warning is emitted.
In addition, SYSTEMD_NSPAWN_UNIFIED_HIERARCHY="" is accepted as an override
to avoid looking for the old variable name.
We have a variable with the same name ($UNIFIED_CGROUP_HIERARCHY) in tests,
which governs both systemd-nspawn and qemu behaviour. It is not renamed.
nspawn: consistenly fail if parsing the environment fails
We would parse the environment twice (to re-apply settings after reading
config from disk), but we would not check the return code first time.
This means that for some settings we would ignore invalid values, while
for others, we'd fail at some point.
Let's just consistently fail. Those environment variables define important
aspects of behaviour, and it is better for the user if we ignore invalid
values. (Unknown settings are still ignored, so forward compatibility is
maintained.)
Jay Strict [Thu, 26 Sep 2019 13:54:29 +0000 (15:54 +0200)]
cryptsetup: bump minimum libcryptsetup version to v2.0.1
libcryptsetup v2.0.1 introduced new API calls, supporting 64 bit wide
integers for `keyfile_offset`. This change invokes the new function
call, gets rid of the warning that was added in #7689, and removes
redundant #ifdefery and constant definitions.
See https://gitlab.com/cryptsetup/cryptsetup/issues/359.
Chris Down [Mon, 30 Sep 2019 17:36:13 +0000 (18:36 +0100)]
cgroup: Mark memory protections as explicitly set in transient units
A later version of the DefaultMemory{Low,Min} patch changed these to
require explicitly setting memory_foo_set, but we only set that in
load-fragment, not dbus-cgroup.
Without these, we may fall back to either DefaultMemoryFoo or
CGROUP_LIMIT_MIN when we really shouldn't.
Chris Down [Mon, 30 Sep 2019 17:25:09 +0000 (18:25 +0100)]
cgroup: Respect DefaultMemoryMin when setting memory.min
This is an oversight from https://github.com/systemd/systemd/pull/12332.
Sadly the tests didn't catch it since it requires a real cgroup
hierarchy to see, and it wasn't seen in prod since we're only currently
using DefaultMemoryLow, not DefaultMemoryMin. :-(
Chris Down [Wed, 25 Sep 2019 16:09:38 +0000 (17:09 +0100)]
util-lib: Don't propagate EACCES from find_binary PATH lookup to caller
On one of my test machines, test-path-util was failing because the
find_binary("xxxx-xxxx") was returning -EACCES instead of -ENOENT. This
happens because the PATH entry on that host contains a directory which
the user in question doesn't have access to. Typically applications
ignore permission errors when searching through PATH, for example in
bash:
$ whoami
cdown
$ PATH=/root:/bin type sh
sh is /bin/sh
This behaviour is present on zsh and other shells as well, though. This
patch brings our PATH search behaviour closer to other major Unix tools.
units: make systemd-binfmt.service easier to work with no autofs
See https://bugzilla.redhat.com/show_bug.cgi?id=1731772:
when autofs4 is disabled in the kernel,
proc-sys-fs-binfmt_misc.automount is not started, so the binfmt_misc module is
never loaded. If we added a dependency on proc-sys-fs-binfmt_misc.mount
to systemd-binfmt.service, things would work even if autofs4 was disabled, but
we would unconditionally pull in the module and mount, which we don't want to do.
(Right now we ony load the module if some binfmt is configured.)
But let's make it easier to handle this case by doing two changes:
1. order systemd-binfmt.service after the .mount unit (so that the .service
can count on the mount if both units are pulled in, even if .automount
is skipped)
2. add [Install] section to the service unit. This way the user can do
'systemctl enable proc-sys-fs-binfmt_misc.mount' to get the appropriate behaviour.
basic/arphrd: stop discriminating against NETROM and CISCO
ARPHRD_NETROM was excluded, most likely just because it is protocol No. 0,
and ARPHRD_CISCO was reported under its alias name "HDLC". Let's just
allow defined aliases under the main name.
basic: massively reduce the size of arphdr lookup functions
Our biggest object in libsystemd was a table full of zeros, for the arphdr
names. Let's use a switch (which gcc nicely optimizes for us), instead a
table with a gap between 826 and 65534:
Pavel Hrdina [Mon, 29 Jul 2019 15:50:05 +0000 (17:50 +0200)]
cgroup: introduce support for cgroup v2 CPUSET controller
Introduce support for configuring cpus and mems for processes using
cgroup v2 CPUSET controller. This allows users to limit which cpus
and memory NUMA nodes can be used by processes to better utilize
system resources.
The cgroup v2 interfaces to control it are cpuset.cpus and cpuset.mems
where the requested configuration is written. However, it doesn't mean
that the requested configuration will be actually used as parent cgroup
may limit the cpus or mems as well. In order to reflect the real
configuration cgroup v2 provides read-only files cpuset.cpus.effective
and cpuset.mems.effective which are exported to users as well.
We have the problem that many early boot or late shutdown issues are harder
to solve than they could be because we have no logs. When journald is not
running, messages are redirected to /dev/kmsg. It is also the time when many
things happen in a rapid succession, so we tend to hit the kernel printk
ratelimit fairly reliably. The end result is that we get no logs from the time
where they would be most useful. Thus let's disable the kernels ratelimit.
Once the system is up and running, the ratelimit is not a problem. But during
normal runtime, things also log to journald, and not to /dev/kmsg, so the
ratelimit is not useful. Hence, there doesn't seem to be much point in trying
to restore the ratelimit after boot is finished and journald is up and running.
See kernel's commit 750afe7babd117daabebf4855da18e4418ea845e for the
description of the kenrel interface. Our setting has lower precedence than
explicit configuration on the kenrel command line.
"ratelimit" is a real word, so we don't need to use the other form anywhere.
We had both forms in various places, let's standarize on the shorter and more
correct one.
Georg Müller [Fri, 20 Sep 2019 08:23:45 +0000 (10:23 +0200)]
sd-radv: if lifetime < SD_RADV_DEFAULT_MAX_TIMEOUT_USEC, adjust timeout (#13491)
The RFC states that lifetime (AdvDefaultLifetime) must be at least
MaxRtrAdvInterval (which more or less corresponds to SD_RADV_DEFAULT_MAX_TIMEOUT_USEC
in systemd).
To fulfill this limit, virtually lower MaxRtrAdvInterval and MinRtrAdvInterval
accordingly.
Also check that min is not lower than 3s and max is not lower than 4s.
!r is the same r == 0, so this was short-circuiting the comparison when
streq(a->iff, b->iff) or streq(a->off, b->off). Before the parent commit which
moved those comparisons to the end, this was short-circuiting quite a bit
of the comparison function.