core: implement OOMPolicy= and watch cgroups for OOM killings
This adds a new per-service OOMPolicy= (along with a global
DefaultOOMPolicy=) that controls what to do if a process of the service
is killed by the kernel's OOM killer. It has three different values:
"continue" (old behaviour), "stop" (terminate the service), "kill" (let
the kernel kill all the service's processes).
On top of that, track OOM killer events per unit: generate a per-unit
structured, recognizable log message when we see an OOM killer event,
and put the service in a failure state if an OOM killer event was seen
and the selected policy was not "continue". A new "result" is defined
for this case: "oom-kill".
All of this relies on new cgroupv2 kernel functionality: the
"memory.events" notification interface and the "memory.oom.group"
attribute (which makes the kernel kill all cgroup processes
automatically).
Let's rename the .cgroup_inotify_wd field of the Unit object to
.cgroup_control_inotify_wd. Let's similarly rename the hashmap
.cgroup_inotify_wd_unit of the Manager object to
.cgroup_control_inotify_wd_unit.
Why? As preparation for a later commit that allows us to watch the
"memory.events" cgroup attribute file in addition to the "cgroup.events"
file we already watch with the fields above. In that later commit we'll
add new fields "cgroup_memory_inotify_wd" to Unit and
"cgroup_memory_inotify_wd_unit" to Manager, that are used to watch these
other events file.
So far the priorities for cgroup empty event handling were pretty weird.
The raw events (on cgroupsv2 from inotify, on cgroupsv1 from the agent
dgram socket) where scheduled at a lower priority than the cgroup empty
queue dispatcher. Let's swap that and ensure that we can coalesce events
more agressively: let's process the raw events at higher priority than
the cgroup empty event (which remains at the same prio).
man: say that .link NamePolicy= should be empty for Name= to take effect
The description of NamePolicy= implied this, but didn't spell it out. It's a
very common use case, so let's add a bit of explanation and ehance the example
a bit.
Inspired by https://bugzilla.redhat.com/show_bug.cgi?id=1695894.
Peter A. Bigot [Mon, 30 Apr 2018 12:05:29 +0000 (07:05 -0500)]
units: add time-set.target
time-sync.target is supposed to indicate system clock is synchronized
with a remote clock, but as used through 241 it only provided a system
clock that was updated based on a locally-maintained timestamp. Systems
that are powered off for extended periods would not come up with
accurate time.
Retain the existing behavior using a new time-set.target leaving
time-sync.target for cases where accuracy is required.
Jonas DOREL [Mon, 8 Apr 2019 06:19:58 +0000 (08:19 +0200)]
man: correct units path usage according to FHS (#11388)
According to the Filesystem Hierarchy Standard, "The /usr/local hierarchy is for use by the system administrator when installing software locally. It needs to be safe from being overwritten when the system software is updated". So it should not be used by installed packages.
Jussi Pakkanen [Sat, 6 Apr 2019 19:59:06 +0000 (21:59 +0200)]
meson: drop misplaced -Wl,--undefined argument
Ld's man page says the following:
-u symbol
--undefined=symbol
Force symbol to be entered in the output file as an undefined symbol. Doing
this may, for example, trigger linking of additional modules from standard
libraries. -u may be repeated with different option arguments to enter
additional undefined symbols. This option is equivalent to the "EXTERN"
linker script command.
If this option is being used to force additional modules to be pulled into
the link, and if it is an error for the symbol to remain undefined, then the
option --require-defined should be used instead.
This would imply that it always requires an argument, which this does not
pass. Thus it will grab the next argument on the command line as its
argument. Before it took one of the many -lrt args (presumably) and now it
grabs something other random linker argument and things break.
[zj: this line was added in the first version of the meson configuration back
in 5c23128daba7236a6080383b2a5649033cfef85c. AFAICT, this was a mistake. No
such flag appeared in Makefile.am at the time.]
nspawn: create boot_id and kmsg files for overmounting in /run, not /tmp
/tmp might not be mounted at all yet (given that we support
SYSTEMD_NSPAWN_TMPFS_TMP=0 to turn this off), and /tmp is a dir systemd
usually tries to unmount during shutdown (unlike /run), and we shouldn't
keep it busy. Hence let's just move these deleted files to /run so that
we don't keep /tmp needlessly busy.
test-journal: move tests to /var/tmp/ and set FS_NOCOW_FL
The journal files might not be tiny hence let's write them to /var/tmp/
instead of /tmp. Also, let's turn on NOCOW on the files, as these tests
might apparently be slow on btrfs.
udev-util: allocate an event loop of our own for waiting
We can't use the per-thread default one here, as it might already be
running (for example, that's the case in portabled), and our event loops
are not recursive, hence running them a second time is not OK.
There are environments where /lib might not be necessary (think:
statically compiled portable service binary), hence don't insist on it
if the image is read-only.
basic/log: log any available location information in log_syntax()
We would log "(null):0: Failed to parse system call, ignoring: rseq" from
log_syntax_internal() from log_syntax() from seccomp_parse_syscall_filter_full()
from seccomp_parse_syscall_filter() from config_parse_syscall_filter(),
when generating the built-in @default whitelist. Since it was not based on the
unit file, we would not pass a file name.
So let's make sure that log_syntax() does not print "(null)" pointer (which is
iffy and ugly), and use the unit name as fallback or nothing if both are missing.
In principle, one of the two should be always available, since why use log_syntax()
otherwise, but let's make things more resilient by guarding against this case too.
log_syntax() is called from a thousand places, and often in error path, so it's
hard to verify all callers.
core: imply NNP and SUID/SGID restriction for DynamicUser=yes service
Let's be safe, rather than sorry. This way DynamicUser=yes services can
neither take benefit of, nor create SUID/SGID binaries.
Given that DynamicUser= is a recent addition only we should be able to
get away with turning this on, even though this is strictly speaking a
binary compatibility breakage.