Chris Down [Mon, 3 Dec 2018 14:38:06 +0000 (14:38 +0000)]
cgroup: Add DisableControllers= directive to disable controller in subtree
Some controllers (like the CPU controller) have a performance cost that
is non-trivial on certain workloads. While this can be mitigated and
improved to an extent, there will for some controllers always be some
overheads associated with the benefits gained from the controller.
Inside Facebook, the fix applied has been to disable the CPU controller
forcibly with `cgroup_disable=cpu` on the kernel command line.
This presents a problem: to disable or reenable the controller, a reboot
is required, but this is quite cumbersome and slow to do for many
thousands of machines, especially machines where disabling/enabling a
stateful service on a machine is a matter of several minutes.
Currently systemd provides some configuration knobs for these in the
form of `[Default]CPUAccounting`, `[Default]MemoryAccounting`, and the
like. The limitation of these is that Default*Accounting is overridable
by individual services, of which any one could decide to reenable a
controller within the hierarchy at any point just by using a controller
feature implicitly (e.g. `CPUWeight`), even if the use of that CPU
feature could just be opportunistic. Since many services are provided by
the distribution, or by upstream teams at a particular organisation,
it's not a sustainable solution to simply try to find and remove
offending directives from these units.
This commit presents a more direct solution -- a DisableControllers=
directive that forcibly disallows a controller from being enabled within
a subtree.
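As a rough illustration of "forcibly disallows ... within a subtree", the
effective controller set of a unit can be thought of as what it requests,
minus everything disabled by the unit itself or any ancestor. The sketch below
is a simplified model, not systemd's code: the controller bits, struct unit and
both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical controller bits; systemd's real mask type and values may differ. */
enum {
        CTRL_CPU    = 1u << 0,
        CTRL_IO     = 1u << 1,
        CTRL_MEMORY = 1u << 2,
        CTRL_PIDS   = 1u << 3,
};

/* Hypothetical, simplified stand-in for a unit in the cgroup tree. */
struct unit {
        const struct unit *parent;  /* NULL for the root */
        unsigned requested;         /* controllers the unit asks for, e.g. via CPUWeight= */
        unsigned disabled;          /* controllers listed in DisableControllers= */
};

/* Collect everything disabled by the unit itself or by any of its ancestors. */
static unsigned unit_disabled_mask(const struct unit *u) {
        unsigned mask = 0;

        for (; u; u = u->parent)
                mask |= u->disabled;

        return mask;
}

/* What the unit may actually get: its requests minus the inherited disable mask. */
static unsigned unit_effective_mask(const struct unit *u) {
        assert(u);
        return u->requested & ~unit_disabled_mask(u);
}
```

With a slice that disables CTRL_CPU and a service below it that requests
CTRL_CPU | CTRL_MEMORY, unit_effective_mask() on the service yields only
CTRL_MEMORY, no matter what the service's own unit file says.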
Chris Down [Tue, 27 Nov 2018 15:49:41 +0000 (15:49 +0000)]
cgroup: Traverse leaves to realised cgroup to release controllers
This adds a depth-first version of unit_realize_cgroup_now which can
only do depth-first disabling of controllers, in preparation for the
DisableControllers= directive.
Chris Down [Mon, 26 Nov 2018 13:45:26 +0000 (13:45 +0000)]
cgroup: Rework unit_realize_cgroup_now to explicitly be breadth-first
systemd currently doesn't really expend much effort in disabling
controllers. unit_realize_cgroup_now *may* be able to disable a
controller in the basic case when using cgroup v2, but generally won't
manage as downstream dependents may still use it.
This code doesn't add any logic to fix that, but it starts the process
of moving to have a breadth-first version of unit_realize_cgroup_now for
enabling, and a depth-first version of unit_realize_cgroup_now for
disabling.
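The ordering matters because on cgroup v2 a controller can only be enabled for
a child once the parent has it enabled, and can only be dropped from a parent
once no child uses it anymore. A minimal sketch of the two traversal orders,
using a hypothetical struct node rather than systemd's Unit type, and plain
recursive walks that visit parents first for enabling and leaves first for
disabling:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical, simplified cgroup node; not systemd's Unit type. */
struct node {
        const char *name;
        struct node *children[MAX_CHILDREN];
        size_t n_children;
};

/* Callback invoked once per node, in traversal order. */
typedef void (*visit_fn)(struct node *n, void *userdata);

/* Enabling: handle the node first, then descend (parents before children). */
static void walk_enable(struct node *n, visit_fn visit, void *userdata) {
        assert(n);
        visit(n, userdata);
        for (size_t i = 0; i < n->n_children; i++)
                walk_enable(n->children[i], visit, userdata);
}

/* Disabling: descend first, then handle the node (leaves before parents). */
static void walk_disable(struct node *n, visit_fn visit, void *userdata) {
        assert(n);
        for (size_t i = 0; i < n->n_children; i++)
                walk_disable(n->children[i], visit, userdata);
        visit(n, userdata);
}

/* Example visitor: appends the first character of each node name to a buffer. */
struct trace {
        char buf[16];
        size_t len;
};

static void record(struct node *n, void *userdata) {
        struct trace *t = userdata;

        if (t->len + 1 < sizeof(t->buf)) {
                t->buf[t->len++] = n->name[0];
                t->buf[t->len] = '\0';
        }
}
```

For a root "a" with children "b" and "c", walk_enable() visits a, b, c while
walk_disable() visits b, c, a.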
util-lib: split out all temporary file related calls into tmpfiles-util.c
This splits out a bunch of functions from fileio.c that have to do with
temporary files, simply to make the header files a bit shorter and to
group things more nicely.
No code changes, just some rearranging of source files.
basic/socket-util: use c-escaping to print unprintable socket paths
We are pretty careful to reject abstract sockets that are too long to fit in
the address structure as a NUL-terminated string. And since we parse sockets as
strings, it is not possible to embed a NUL in the address either. But we
might receive an external socket (abstract or not), and we want to be able to
print its address in all cases. We would call socket_address_verify() and
refuse to print various sockets that the kernel considers legit.
Let's do the strict verification only in case of socket addresses we parse and
open ourselves, and do less strict verification when printing addresses of
existing sockets, and use c-escaping to print embedded NULs and such.
More tests are added.
This should make LGTM happier because one FIXME comment is removed.
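To illustrate what c-escaping buys us when printing, here is a minimal sketch
that turns an arbitrary byte sequence (such as an abstract socket name with an
embedded NUL) into a printable string. The function name and the choice of
\xNN escapes are assumptions made for the example; systemd's own escaping
helpers may pick different escape sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Escape n bytes of src into dst (of size dst_size), C-style. Printable ASCII
 * is copied, a backslash is doubled, everything else becomes \xNN.
 * Returns 0 on success, -1 if dst is too small. */
static int escape_bytes(const unsigned char *src, size_t n, char *dst, size_t dst_size) {
        size_t o = 0;

        assert(src || n == 0);
        assert(dst);

        for (size_t i = 0; i < n; i++) {
                unsigned char c = src[i];

                if (c == '\\') {
                        if (o + 2 >= dst_size)
                                return -1;
                        dst[o++] = '\\';
                        dst[o++] = '\\';
                } else if (c >= 0x20 && c < 0x7f) {
                        if (o + 1 >= dst_size)
                                return -1;
                        dst[o++] = (char) c;
                } else {
                        /* Four characters plus the NUL that snprintf() appends. */
                        if (o + 4 >= dst_size)
                                return -1;
                        snprintf(dst + o, 5, "\\x%02x", (unsigned) c);
                        o += 4;
                }
        }

        if (o >= dst_size)
                return -1;

        dst[o] = '\0';
        return 0;
}
```

The bytes { 0, 'f', 'o', 'o', 1 } come out as the eleven-character string
\x00foo\x01, which can be logged safely.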
format-table: before outputting a color, check if colors are available
This is in many cases redundant, as a similar check is done by various
callers already, but in other cases (where we read the color from a
static table for example), it's nice to let the color check be done by
the table code itself, and since it doesn't hurt in the other cases just
do it again.
parse-util: allow parse_boolean() to take a NULL argument
It's pretty useful to allow parse_boolean() to take a NULL argument and
return an error in that case, rather than abort. Making this a runtime
error rather than a programming error allows us to shorten code
elsewhere.
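A minimal sketch of the idea follows. parse_boolean_sketch() and env_flag()
are hypothetical names, and the list of accepted spellings is an assumption
for the example. The point is only that NULL yields -EINVAL, so a caller can
pass a possibly-NULL pointer straight through without a separate check.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <strings.h>

/* Returns 1 for true, 0 for false, -EINVAL for NULL or unrecognized input. */
static int parse_boolean_sketch(const char *v) {
        static const char *const yes[] = { "1", "yes", "y", "true", "t", "on" };
        static const char *const no[]  = { "0", "no", "n", "false", "f", "off" };

        if (!v)
                return -EINVAL;  /* runtime error, not an assertion failure */

        for (size_t i = 0; i < sizeof(yes) / sizeof(yes[0]); i++)
                if (strcasecmp(v, yes[i]) == 0)
                        return 1;

        for (size_t i = 0; i < sizeof(no) / sizeof(no[0]); i++)
                if (strcasecmp(v, no[i]) == 0)
                        return 0;

        return -EINVAL;
}

/* Caller that benefits: getenv() may return NULL, and no extra check is needed. */
static int env_flag(const char *name) {
        assert(name);
        return parse_boolean_sketch(getenv(name));
}
```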
seccomp-util: drop process_vm_readv from @debug group
It's already part of @ipc, no need to have it in both. Given that @ipc
is much more popular (as it is part of @system-service for example),
let's not define it a second time.
Split out part of mount-util.c into mountpoint-util.c
The idea is that anything which is related to actually manipulating mounts is
in mount-util.c, but functions for mountpoint introspection are moved to the
new file. Anything which requires libmount must be in mount-util.c.
This was supposed to be a preparation for further changes, with no functional
difference, but it results in a significant change in linkage:
dev-setup: generalize logic we use to create "inaccessible" device nodes
Let's generalize this, so that we can use this in nspawn later on, which
is pretty useful as we need to be able to mask files from the inner
child of nspawn too, where the host's /run/systemd/inaccessible
directory is not visible anymore. Moreover, if nspawn can create these
nodes on its own before the payload this means the payload can run with
fewer privileges.
cgroup: use device_path_parse_major_minor() also for block device paths
We need major/minor numbers not only when we populate the "devices"
cgroup controller but also for the io/blkio one, hence let's use the
same logic for both.
stat-util: add new APIs device_path_make_{major_minor|canonical}() and device_path_parse_major_minor()
device_path_make_{major_minor|canonical}() generate device node paths
given a mode_t and a dev_t. We have similar code all over the place,
let's unify this in one place. The former will generate a "/dev/char/"
or "/dev/block/" path, and never go to disk. The latter then goes to disk
and resolves that path to the actual path of the device node.
device_path_parse_major_minor() reverses device_path_make_major_minor(),
also without going to disk.
We have similar code doing something like this at various places, let's
unify this in a single set of functions. This also allows us to teach
them special tricks, for example handling of the
/run/systemd/inaccessible/{blk|chr} device nodes, which we use for
masking device nodes, and which do not exist in /dev/char/* and
/dev/block/*.
Previously we'd allow pattern expressions such as "char-input" to match
all input devices. Internally, this would look up the right major to
test in /proc/devices. With this commit the syntax is slightly extended:
- "char-*" can be used to match any kind of character device, and
similarly "block-*". This expression would work previously already, but
instead of actually installing a wildcard match it would install many
individual matches for everything listed in /proc/devices.
- "char-<MAJOR>" with "<MAJOR>" being a numerical parameter works now
too. This allows clients to install whitelist items by specifying the
major directly.
The main reason to add these is to provide limited compat support for
clients that for some reason contain whitelists with major/minor numbers
(such as OCI containers).
core: add special handling for devices cgroup allow lists for /dev/block/* and /dev/char/* device nodes
This adds some code to handle /dev/block/* and /dev/char/* device node
paths specially: instead of actually stat()ing them we'll just parse the
major/minor numbers from the name. This is a useful 'hack' to allow clients
to install whitelists for devices that don't actually have to exist.
Also, let's similarly handle /run/systemd/inaccessible/{blk|chr}. This
allows us to simplify our built-in default whitelist to not require an
"ignore_enoent" mode for these nodes.
In general we should be careful with hardcoding major/minor numbers, but
in this case this should be safe.
path-util: port path_join() over to path_join_many()
We should probably drop path_join() entirely in the long run (and
then rename path_join_many() to it?), but for now let's make one a
wrapper for the other.
parse-util: rework parse_dev() based on safe_atou() and DEVICE_MAJOR_VALID()/DEVICE_MINOR_VALID()
Let's be a bit more careful when parsing major/minor pairs, and filter
out more corner cases. This also means using safe_atou() rather than
sscanf() to avoid weird negative unsigned handling and such.
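To show the kind of corner cases this is about: sscanf("%u") happily accepts
"-1" and wraps it around, whereas a strict parser rejects it. Below is a
minimal sketch of a "MAJOR:MINOR" parser built on strtoul(). The names
parse_unsigned_field() and parse_dev_sketch() are hypothetical, and the 12-bit
major / 20-bit minor limits are the Linux kernel's dev_t layout, stated here as
an assumption rather than taken from the commit.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Assumed kernel limits: 12 bits of major, 20 bits of minor. */
#define SKETCH_MAJOR_MAX ((1u << 12) - 1)
#define SKETCH_MINOR_MAX ((1u << 20) - 1)

/* Strict unsigned parse: must start with a digit (so no sign and no leading
 * whitespace) and must be followed by exactly 'terminator'. */
static int parse_unsigned_field(const char *s, char terminator,
                                unsigned *ret, const char **ret_end) {
        char *end;
        unsigned long v;

        if (!isdigit((unsigned char) s[0]))
                return -EINVAL;

        errno = 0;
        v = strtoul(s, &end, 10);
        if (errno != 0 || v > UINT_MAX)
                return -ERANGE;
        if (*end != terminator)
                return -EINVAL;

        *ret = (unsigned) v;
        *ret_end = end;
        return 0;
}

/* Parse "MAJOR:MINOR" and check both values against the assumed kernel ranges. */
static int parse_dev_sketch(const char *s, unsigned *ret_major, unsigned *ret_minor) {
        const char *p;
        unsigned maj, min;
        int r;

        assert(s);
        assert(ret_major);
        assert(ret_minor);

        r = parse_unsigned_field(s, ':', &maj, &p);
        if (r < 0)
                return r;

        r = parse_unsigned_field(p + 1, '\0', &min, &p);
        if (r < 0)
                return r;

        if (maj > SKETCH_MAJOR_MAX || min > SKETCH_MINOR_MAX)
                return -ERANGE;

        *ret_major = maj;
        *ret_minor = min;
        return 0;
}
```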
stat-util: add macros for checking whether major and minor values are in range
As it turns out glibc and the Linux kernel have different ideas about
the size of dev_t and how many bits exist for the major and the minor.
When validating major/minor numbers we should check against the kernel's
actual sizes, hence add macros for this.
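For reference, the Linux kernel packs dev_t into 32 bits, with 12 bits of
major and 20 bits of minor (MINORBITS in linux/kdev_t.h), while glibc's dev_t
is wider. A minimal sketch of such range checks follows, written as inline
functions with hypothetical names rather than the actual macros added by this
commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KERNEL_MAJOR_BITS 12
#define KERNEL_MINOR_BITS 20

/* The kernel's internal dev_t is a 32-bit value. */
static_assert(KERNEL_MAJOR_BITS + KERNEL_MINOR_BITS == 32,
              "major and minor bits must add up to the kernel's 32-bit dev_t");

/* True if x fits into the kernel's major number space. */
static inline bool device_major_in_range(uint64_t x) {
        return x < (UINT64_C(1) << KERNEL_MAJOR_BITS);
}

/* True if x fits into the kernel's minor number space. */
static inline bool device_minor_in_range(uint64_t x) {
        return x < (UINT64_C(1) << KERNEL_MINOR_BITS);
}
```

Taking uint64_t parameters sidesteps the signedness question of the caller's
type: a negative value converted to uint64_t is far out of range and is
rejected like any other oversized number.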