# Control Group APIs and Delegation

So you are wondering about resource management with systemd, you know Linux
control groups (cgroups) a bit and are trying to integrate your software with
what systemd has to offer there. Here's a bit of documentation about the
concepts and interfaces involved with this.

What's described here has been part of systemd and documented since v205
times. However, it has been updated and improved substantially since, even
though the concepts stayed mostly the same. This is an attempt to provide more
comprehensive, up-to-date information about all this, particularly in light of
the poor implementations of the systemd-facing components of current container
managers.

Before you read on, please make sure you read the low-level [kernel
documentation about
cgroupsv2](https://www.kernel.org/doc/Documentation/cgroup-v2.txt). This
documentation then adds in the higher-level view from systemd.

This document augments the existing documentation we already have:

* [The New Control Group Interfaces](https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/)
* [Writing VM and Container Managers](https://www.freedesktop.org/wiki/Software/systemd/writing-vm-managers/)

These wiki documents are currently not as up to date as they should be, but the
basic concepts still fully apply. You should read them too if you do something
with cgroups and systemd, in particular as they shine more light on the various
D-Bus APIs provided. (That said, sooner or later we should probably fold that
wiki documentation into this very document, too.)

## Two Key Design Rules

Much of the philosophy behind these concepts is based on a couple of basic
design ideas of cgroupsv2 (which we try to adapt to cgroupsv1 too, as far as we
can). Specifically, two cgroupsv2 rules are the most relevant:

1. The **no-processes-in-inner-nodes** rule: this means that it's not permitted
to have processes directly attached to a cgroup that also has child cgroups and
vice versa. A cgroup is either an inner node or a leaf node of the tree, and if
it's an inner node it may not contain processes directly, and if it's a leaf
node then it may not have child cgroups. (Note that there are some minor
exceptions to this rule, though: the root cgroup is special and allows both
processes and children — which is used in particular to maintain kernel
threads.)

2. The **single-writer** rule: this means that each cgroup only has a single
writer, i.e. a single process managing it. It's OK if different cgroups have
different processes managing them. However, only a single process should own a
specific cgroup, and when it does, that ownership is exclusive, and nothing else
should manipulate it at the same time. This rule ensures that various pieces of
software don't step on each other's toes constantly.

These two rules have various effects. For example, one corollary of this is: if
your container manager creates and manages cgroups in the system's root cgroup
you violate rule #2, as the root cgroup is managed by systemd and hence off
limits to everybody else.

Note that rule #1 is generally enforced by the kernel if cgroupsv2 is used: as
soon as you add a process to a cgroup it is ensured the rule is not
violated. On cgroupsv1 this rule didn't exist, and hence isn't enforced, even
though it's a good thing to follow it there too. Rule #2 is not enforced on
either cgroupsv1 or cgroupsv2 (this is UNIX after all, in the general case
root can do anything, modulo SELinux and friends), but if you ignore it you'll
be in constant pain as various pieces of software will fight over cgroup
ownership.

Note that cgroupsv1 is currently the most widely deployed implementation of all
of this, even though it's semantically broken in many ways, and in many cases
doesn't actually do what people think it does. cgroupsv2 is where things are
going, and most new kernel features in this area are only added to cgroupsv2,
and not cgroupsv1 anymore. For example, cgroupsv2 provides proper cgroup-empty
notifications, has support for all kinds of per-cgroup BPF magic, supports
secure delegation of cgroup trees to less privileged processes, and so on, none
of which are available on cgroupsv1.

## Three Different Tree Setups 🌳

systemd supports three different modes in which cgroups are set up. Specifically:

1. **Unified** — this is the simplest mode, and exposes a pure cgroupsv2
logic. In this mode `/sys/fs/cgroup` is the only mounted cgroup API file system
and all available controllers are exclusively exposed through it.

2. **Legacy** — this is the traditional cgroupsv1 mode. In this mode the
various controllers each get their own cgroup file system mounted to
`/sys/fs/cgroup/<controller>/`. On top of that, systemd maintains its own
cgroup hierarchy for management purposes as `/sys/fs/cgroup/systemd/`.

3. **Hybrid** — this is a hybrid between the unified and legacy mode. It's set
up mostly like legacy, except that there's also an additional hierarchy
`/sys/fs/cgroup/unified/` that contains the cgroupsv2 hierarchy. In this mode
compatibility with cgroupsv1 is retained while some cgroupsv2 features are
available too. This mode is a stopgap. Don't bother with this too much unless
you have too much free time.

To say this clearly, legacy and hybrid modes have no future. If you develop
software today and don't focus on the unified mode, then you are writing
software for yesterday, not tomorrow. They are primarily supported for
compatibility reasons and will not receive new features. Sorry.

Superficially, in legacy and hybrid modes it might appear that the parallel
cgroup hierarchies for each controller are orthogonal to each other. In
systemd they are not: the hierarchies of all controllers are always kept in
sync (at least mostly: sub-trees might be suppressed in certain hierarchies if
no controller usage is required for them). The fact that systemd keeps these
hierarchies in sync means that the legacy and hybrid hierarchies are
conceptually very close to the unified hierarchy. In particular this allows us
to talk about one specific cgroup and actually mean the same cgroup in all
available controller hierarchies. E.g. if we talk about the cgroup `/foo/bar/`
then we actually mean `/sys/fs/cgroup/cpu/foo/bar/` as well as
`/sys/fs/cgroup/memory/foo/bar/`, `/sys/fs/cgroup/pids/foo/bar/`, and so on,
all at once. Note that in cgroupsv2 the controller hierarchies aren't
orthogonal, hence thinking about them as orthogonal won't help you in the long
run anyway.

If you wonder how to detect which of these three modes is currently used, use
`statfs()` on `/sys/fs/cgroup/`. If it reports `CGROUP2_SUPER_MAGIC` in its
`.f_type` field, then you are in unified mode. If it reports `TMPFS_MAGIC` then
you are either in legacy or hybrid mode. To distinguish these two cases, run
`statfs()` again on `/sys/fs/cgroup/unified/`. If that succeeds and reports
`CGROUP2_SUPER_MAGIC` you are in hybrid mode, otherwise not.

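To make that concrete, here's a minimal C sketch of this check. It is purely
illustrative: error handling is kept to a minimum, so don't copy it verbatim.

```c
/* Minimal sketch: detect unified vs. hybrid vs. legacy mode via statfs(). */
#include <linux/magic.h>   /* CGROUP2_SUPER_MAGIC, TMPFS_MAGIC */
#include <stdio.h>
#include <sys/statfs.h>

int main(void) {
        struct statfs fs;

        if (statfs("/sys/fs/cgroup/", &fs) < 0) {
                perror("statfs(/sys/fs/cgroup/)");
                return 1;
        }

        if (fs.f_type == CGROUP2_SUPER_MAGIC)
                puts("unified");
        else if (fs.f_type == TMPFS_MAGIC) {
                /* Legacy or hybrid: check for the extra unified hierarchy. */
                if (statfs("/sys/fs/cgroup/unified/", &fs) >= 0 &&
                    fs.f_type == CGROUP2_SUPER_MAGIC)
                        puts("hybrid");
                else
                        puts("legacy");
        } else
                puts("unknown");

        return 0;
}
```
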
## systemd's Unit Types

The low-level kernel cgroups feature is exposed in systemd in three different
"unit" types. Specifically:

1. 💼 The `.service` unit type. This unit type is for units encapsulating
   processes systemd itself starts. Units of these types have cgroups that are
   the leaves of the cgroup tree the systemd instance manages (though possibly
   they might contain a sub-tree of their own managed by something else, made
   possible by the concept of delegation, see below). Service units are usually
   instantiated based on a unit file on disk that describes the command line to
   invoke and other properties of the service. However, service units may also
   be declared and started programmatically at runtime through a D-Bus API
   (which is called *transient* services).

2. 👓 The `.scope` unit type. This is very similar to `.service`. The main
   difference: the processes the units of this type encapsulate are forked off
   by some unrelated manager process, and that manager asked systemd to expose
   them as a unit. Unlike services, scopes can only be declared and started
   programmatically, i.e. are always transient. That's because they encapsulate
   processes forked off by something else, i.e. existing runtime objects, and
   hence cannot really be defined fully in 'offline' concepts such as unit
   files.

3. 🔪 The `.slice` unit type. Units of this type do not directly contain any
   processes. Units of this type make up the inner nodes of the part of the
   cgroup tree the systemd instance manages. Much like services, slices can be
   defined either on disk with unit files or programmatically as transient
   units.

Slices expose the trunk and branches of a tree, and scopes and services are
attached to those branches as leaves. The idea is that scopes and services can
be moved around though, i.e. assigned to a different slice if needed.

The naming of slice units directly maps to the cgroup tree path. This is not
the case for service and scope units, however. A slice named `foo-bar-baz.slice`
maps to a cgroup `/foo.slice/foo-bar.slice/foo-bar-baz.slice/`. A service
`quux.service` which is attached to the slice `foo-bar-baz.slice` maps to the
cgroup `/foo.slice/foo-bar.slice/foo-bar-baz.slice/quux.service/`.

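If it helps, here is a purely illustrative sketch of how such a slice name
expands into its cgroup path. The function name is made up, and escaping and
error handling are ignored:

```c
/* Expand a slice unit name such as "foo-bar-baz.slice" into the cgroup path
 * "/foo.slice/foo-bar.slice/foo-bar-baz.slice". Illustrative only. */
#include <stdio.h>
#include <string.h>

static void slice_to_cgroup_path(const char *slice, char *out, size_t n) {
        const char *dot = strstr(slice, ".slice");
        size_t len = dot ? (size_t) (dot - slice) : strlen(slice);
        size_t used = 0;

        out[0] = '\0';

        /* "-.slice" is the root slice and maps to "/" */
        if (len == 1 && slice[0] == '-') {
                snprintf(out, n, "/");
                return;
        }

        /* Each "-"-separated prefix becomes one ".slice" path component. */
        for (size_t i = 1; i <= len && used < n; i++)
                if (i == len || slice[i] == '-')
                        used += snprintf(out + used, n - used, "/%.*s.slice",
                                         (int) i, slice);
}

int main(void) {
        char path[256];

        slice_to_cgroup_path("foo-bar-baz.slice", path, sizeof(path));
        puts(path);   /* prints /foo.slice/foo-bar.slice/foo-bar-baz.slice */
        return 0;
}
```
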
By default systemd sets up four slice units:

1. `-.slice` is the root slice, i.e. the parent of everything else. On the host
   system it maps directly to the top-level directory of cgroupsv2.

2. `system.slice` is where system services are by default placed, unless
   configured otherwise.

3. `user.slice` is where user sessions are placed. Each user gets a slice of
   its own below that.

4. `machines.slice` is where VMs and containers are supposed to be
   placed. `systemd-nspawn` makes use of this by default, and you're very
   welcome to place your containers and VMs there too if you hack on managers
   for those.

Users may define any number of additional slices they like though; the four
above are just the defaults.

## Delegation

Container managers and suchlike often want to control cgroups directly using
the raw kernel APIs. That's entirely fine and supported, as long as proper
*delegation* is followed. Delegation is a concept we inherited from cgroupsv2,
but we expose it on cgroupsv1 too. Delegation means that some parts of the
cgroup tree may be managed by different managers than others. As long as it is
clear which manager manages which part of the tree, each one can do whatever it
wants within its sub-graph of the tree.

Only sub-trees can be delegated (though whoever decides to request a sub-tree
can delegate sub-sub-trees further to somebody else if they like
it). Delegation takes place at a specific cgroup: in systemd there's a
`Delegate=` property you can set for a service or scope unit. If you do, it's
the cut-off point for systemd's cgroup management: the unit itself is managed
by systemd, i.e. all its attributes are managed exclusively by systemd;
however, your program may create/remove sub-cgroups inside it freely, and those
then become the exclusive property of your program, systemd won't touch them —
all attributes of *those* sub-cgroups can be manipulated freely and exclusively
by your program.

By turning on the `Delegate=` property for a scope or service you get a few
guarantees:

1. systemd won't fiddle with your sub-tree of the cgroup tree anymore. It won't
   change attributes of any cgroups below it, nor will it create or remove any
   cgroups thereunder, nor migrate processes across the boundaries of that
   sub-tree as it deems useful anymore.

2. If your service makes use of the `User=` functionality, then the sub-tree
   will be `chown()`ed to the indicated user so that it can correctly create
   cgroups below it. Note however that systemd will do that only in the unified
   hierarchy (in unified and hybrid mode) as well as on systemd's own private
   hierarchy (in legacy and hybrid mode). It won't pass ownership of the legacy
   controller hierarchies. Delegation to less privileged processes is not safe
   in cgroupsv1 (as a limitation of the kernel), hence systemd won't facilitate
   access to it.

3. Any BPF IP filter programs systemd installs will be installed with
   `BPF_F_ALLOW_MULTI` so that your program can install additional ones.

In unit files the `Delegate=` property is superficially exposed as a
boolean. However, since v236 it optionally takes a list of controller names
instead. If so, delegation is requested for the listed controllers
specifically. Note that this only encodes a request. Depending on various
parameters it might happen that your service actually will get fewer
controllers delegated (for example, because the controller is not available on
the current kernel or was turned off) or more. If no list is specified
(i.e. the property is simply set to `yes`) then all available controllers are
delegated.

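For illustration, this is roughly what the two forms look like in a unit file.
The unit and binary names are made up for this example:

```ini
# mycontainerd.service (hypothetical example)
[Service]
ExecStart=/usr/bin/mycontainerd
# Request delegation of all available controllers:
Delegate=yes
# Or, since v236, request specific controllers only, e.g.:
# Delegate=pids memory cpu
```
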
Let's stress one thing: delegation is available on scope and service units
only. It's expressly not available on slice units. Why is that? Because slice
units are our *inner* nodes of the cgroup trees and we freely attach services
and scopes to them. If we allowed delegation on slice units then this would
mean that both systemd and your own manager would create/delete cgroups below
the slice unit, and that conflicts with the single-writer rule.

So, if you want to do your own raw kernel-level cgroup access, then allocate a
scope unit, or a service unit (or just use the service unit you already have
for your service code), and turn on delegation for it.

## Three Scenarios

Let's say you write a container manager, and you wonder what to do regarding
cgroups for it, as you want your manager to be able to run on systemd systems.

You basically have three options:

1. 😊 The *integration-is-good* option. For this, you register each container
   you have either as a systemd service (i.e. let systemd invoke the executor
   binary for you) or a systemd scope (i.e. your manager executes the binary
   directly, but then tells systemd about it). In this mode the administrator
   can use the usual systemd resource management commands individually on
   containers. By turning on `Delegate=` for these scopes or services you make
   it possible to run cgroup-enabled programs in your containers, for example a
   systemd instance running inside it. This option has two sub-options:

   a. You register the service or scope transiently directly by contacting
      systemd via D-Bus. In this case systemd will just manage the unit for you
      and nothing else. (A minimal sketch of such a call is shown right after
      this list.)

   b. Instead you register the service or scope through `systemd-machined`
      (also via D-Bus). This mini-daemon is basically just a proxy for the same
      operations as in a. The main benefit of this: this way you let the system
      know that what you are registering is a container, and this opens up
      certain additional integration points. For example, `journalctl -M` can
      then be used to directly look into any container's journal logs (should
      the container run systemd inside), or `systemctl -M` can be used to
      directly invoke systemd operations inside the containers. Moreover tools
      like "ps" can then show you to which container a process belongs (`ps -eo
      pid,comm,machine`), and even gnome-system-monitor supports it.

2. 🙁 The *i-like-islands* option. If all you care about is your own cgroup
   tree, you want to have as little to do with systemd as possible, and you
   have no interest in integration with the rest of the system, then this is a
   valid option. For this all you have to do is turn on `Delegate=` for your
   main manager daemon. Then figure out the cgroup systemd placed your daemon
   in: you can now freely create sub-cgroups beneath it. Don't forget the
   *no-processes-in-inner-nodes* rule however: you have to move your main
   daemon process out of that cgroup (and into a sub-cgroup) before you can
   start further processes in any of your sub-cgroups.

3. 🙁 The *i-like-continents* option. In this option you'd leave your manager
   daemon where it is, and would not turn on delegation on its unit. However,
   the first thing you do is register a new scope unit with systemd, and that
   scope unit would have `Delegate=` turned on, and then you place all your
   containers underneath it. From systemd's PoV there'd be two units: your
   manager service and the big scope that contains all your containers in one.

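To make option 1a a bit more concrete, here's a rough sd-bus sketch of such a
call (not a complete implementation): it registers an already running process
as a transient scope with `Delegate=` turned on. The unit name
`payload-xyz.scope` is made up, error handling is minimal, and a real manager
would likely want to set further properties:

```c
/* Rough sketch: register a running PID as a transient, delegated scope unit
 * via systemd's StartTransientUnit() D-Bus call, using sd-bus. */
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <systemd/sd-bus.h>

int register_delegated_scope(pid_t pid) {
        sd_bus_error error = SD_BUS_ERROR_NULL;
        sd_bus_message *m = NULL, *reply = NULL;
        sd_bus *bus = NULL;
        int r;

        r = sd_bus_default_system(&bus);
        if (r < 0)
                goto finish;

        r = sd_bus_message_new_method_call(
                        bus, &m,
                        "org.freedesktop.systemd1",
                        "/org/freedesktop/systemd1",
                        "org.freedesktop.systemd1.Manager",
                        "StartTransientUnit");
        if (r < 0)
                goto finish;

        /* Unit name and job mode */
        r = sd_bus_message_append(m, "ss", "payload-xyz.scope", "fail");
        if (r < 0)
                goto finish;

        /* Properties: the PID to put into the scope, plus Delegate=yes */
        r = sd_bus_message_open_container(m, 'a', "(sv)");
        if (r >= 0)
                r = sd_bus_message_append(m, "(sv)", "PIDs", "au", 1, (uint32_t) pid);
        if (r >= 0)
                r = sd_bus_message_append(m, "(sv)", "Delegate", "b", 1);
        if (r >= 0)
                r = sd_bus_message_close_container(m);
        if (r < 0)
                goto finish;

        /* No auxiliary units */
        r = sd_bus_message_append(m, "a(sa(sv))", 0);
        if (r < 0)
                goto finish;

        r = sd_bus_call(bus, m, 0, &error, &reply);
        if (r < 0)
                fprintf(stderr, "StartTransientUnit failed: %s\n",
                        error.message ? error.message : "unknown error");

finish:
        sd_bus_error_free(&error);
        sd_bus_message_unref(reply);
        sd_bus_message_unref(m);
        sd_bus_unref(bus);
        return r;
}
```
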
BTW: if for whatever reason you say "I hate D-Bus, I'll never call any D-Bus
API, kthxbye", then options #1 and #3 are not available, as they generally
involve talking to systemd from your program code, via D-Bus. You still have
option #2 in that case however, as you can simply set `Delegate=` in your
service's unit file and you are done and have your own sub-tree. In fact, #2 is
the one option that allows you to completely ignore systemd's existence: you
can entirely generically follow the single rule that you just use the cgroup
you are started in, and everything below it, whatever that might be. That said,
maybe if you dislike D-Bus and systemd that much, the better approach might be
to work on that, and widen your horizon a bit. You are welcome.

## Controller Support

systemd supports a number of controllers (but not all). Specifically, supported
are:

* on cgroupsv1: `cpu`, `cpuacct`, `blkio`, `memory`, `devices`, `pids`
* on cgroupsv2: `cpu`, `io`, `memory`, `pids`

It is our intention to natively support all cgroupsv2 controllers that might
come up sooner or later. However, regarding cgroupsv1: at this point we will
not add support for any other controllers anymore. This means systemd currently
does not and will never manage the following controllers on cgroupsv1:
`freezer`, `cpuset`, `net_cls`, `perf_event`, `net_prio`, `hugetlb`. Why not?
Depending on the case, either their API semantics or implementations aren't
really usable, or it's very clear they have no future on cgroupsv2, and we
won't add new code for stuff that clearly has no future.

Effectively this means that all those mentioned cgroupsv1 controllers are up
for grabs: systemd won't manage them, and hence won't delegate them to your
code (however, systemd will still mount their hierarchies, simply because it
mounts all controller hierarchies it finds available in the kernel). If you
decide to use them, then that's fine, but systemd won't help you with it (but
also won't interfere with it). To be nice to other tenants it might be wise to
replicate the cgroup hierarchies of the other controllers in them too, but of
course that's between you and those other tenants, and systemd won't
care. Replicating the cgroup hierarchies in those unsupported controllers would
mean replicating the full cgroup paths in them, and hence the prefixing
`.slice` components too, as otherwise the hierarchies will start being
orthogonal after all, and that's not really desirable. One more thing: systemd
will clean up after you in the hierarchies it manages: if your daemon goes
down, its cgroups will be removed too. You basically get the guarantee that you
start with a pristine cgroup sub-tree for your service or scope whenever it is
started. This is not the case however in the hierarchies systemd doesn't
manage. This means that your programs should be ready to deal with left-over
cgroups in them from previous runs, and be extra careful with them as they
might still carry settings that are no longer valid.

Note a particular asymmetry here: if your systemd version doesn't support a
specific controller on cgroupsv1 you can still make use of it for delegation,
by directly fiddling with its hierarchy and replicating the cgroup tree there
as necessary (as suggested above). However, on cgroupsv2 this is different:
separately mounted hierarchies are not available, and delegation always has to
happen through systemd itself. This means: when you update your kernel and it
adds a new, so far unseen controller, and you want to use it for delegation,
then you also need to update systemd to a version that groks it.

## systemd as Container Payload

systemd can happily run as a container payload's PID 1. Note that systemd
unconditionally needs write access to the cgroup tree however, hence you need
to delegate a sub-tree to it. Note that there's nothing too special you have to
do beyond that: just invoke systemd as PID 1 inside the root of the delegated
cgroup sub-tree, and it will figure out the rest: it will determine the cgroup
it is running in and take possession of it. It won't interfere with any cgroup
outside of the sub-tree it was invoked in. Use of `CLONE_NEWCGROUP` is hence
optional (but of course wise).

Note one particular asymmetry here though: systemd will try to take possession
of the root cgroup you pass to it *in* *full*, i.e. it will not only
create/remove child cgroups below it, it will also attempt to manage the
attributes of it. OTOH as mentioned above, when delegating a cgroup tree to
somebody else it only passes the rights to create/remove sub-cgroups, but will
insist on managing the delegated cgroup tree's top-level attributes. Or in
other words: systemd is *greedy* when accepting delegated cgroup trees and also
*greedy* when delegating them to others: it insists on managing attributes on
the specific cgroup in both cases. A container manager that is itself a payload
of a host systemd and wants to run a systemd instance as its own container
payload hence needs to insert an extra level in the hierarchy in between, so
that the systemd on the host and the one in the container won't fight for the
attributes. That said, you likely should do that anyway, due to the
no-processes-in-inner-nodes rule, see below.

When systemd runs as container payload it will make use of all hierarchies it
has write access to. For legacy mode you need to make at least
`/sys/fs/cgroup/systemd/` available; all other hierarchies are optional. For
hybrid mode you need to add `/sys/fs/cgroup/unified/`. Finally, for fully
unified mode you (of course, I guess) need to provide only `/sys/fs/cgroup/`
itself.

## Some Dos

1. ⚡ If you go for implementation option 1a or 1b (as in the list above), then
   each of your containers will have its own systemd-managed unit and hence
   cgroup with possibly further sub-cgroups below. Typically the first process
   running in that unit will be some kind of executor program, which will in
   turn fork off the payload processes of the container. In this case don't
   forget that there are two levels of delegation involved: first, systemd
   delegates a cgroup sub-tree to your executor. And then your executor should
   delegate a sub-tree further down to the container payload. Oh, and because
   of the no-processes-in-inner-nodes rule, your executor needs to migrate
   itself to a sub-cgroup of the cgroup it got delegated, too. Most likely you
   hence want a two-pronged approach: below the cgroup you got started in, you
   want one cgroup maybe called `supervisor/` where your manager runs, and then
   for each container a sibling cgroup of that maybe called `payload-xyz/`. (A
   rough sketch of this layout follows right after this list.)

2. ⚡ Don't forget that the cgroups you create have to have names that are
   suitable as UNIX file names, and that they live in the same namespace as the
   various kernel attribute files. Hence, when you want to allow the user
   arbitrary naming, you might need to escape some of the names (for example,
   you really don't want to create a cgroup named `tasks`, just because the
   user created a container by that name, because `tasks` after all is a magic
   attribute in cgroupsv1, and your `mkdir()` will hence fail with `EEXIST`).
   In systemd we do escaping by prefixing names that might collide with a
   kernel attribute name with an underscore. You might want to do the same, but
   how you do it is really up to you. Just do it, and be careful.

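To illustrate the two-pronged layout from Do #1 (and the process migration
required by option #2 above), here's a rough sketch, assuming unified mode and
a unit with `Delegate=yes`. The hard-coded base path and the `supervisor/` and
`payload-xyz/` names are purely illustrative; a real manager would discover its
own cgroup from `/proc/self/cgroup` instead:

```c
/* Rough sketch: create supervisor/ and payload-xyz/ below the delegated
 * cgroup, and move ourselves into supervisor/ so the delegated cgroup itself
 * stays an inner node (no-processes-in-inner-nodes rule). */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_string(const char *path, const char *s) {
        int fd = open(path, O_WRONLY | O_CLOEXEC);
        if (fd < 0)
                return -1;
        ssize_t n = write(fd, s, strlen(s));
        close(fd);
        return n < 0 ? -1 : 0;
}

int main(void) {
        const char *base = "/sys/fs/cgroup/system.slice/my-manager.service";
        char path[256], pid[32];

        /* One cgroup for the manager itself ... */
        snprintf(path, sizeof(path), "%s/supervisor", base);
        if (mkdir(path, 0755) < 0 && errno != EEXIST)
                return 1;

        /* ... and one sibling cgroup per container payload. */
        snprintf(path, sizeof(path), "%s/payload-xyz", base);
        if (mkdir(path, 0755) < 0 && errno != EEXIST)
                return 1;

        /* Move ourselves out of the delegated cgroup, into supervisor/. */
        snprintf(path, sizeof(path), "%s/supervisor/cgroup.procs", base);
        snprintf(pid, sizeof(pid), "%d", (int) getpid());
        return write_string(path, pid) < 0 ? 1 : 0;
}
```
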
## Some Don'ts

1. 🚫 Never create your own cgroups below arbitrary cgroups systemd manages, i.e.
   cgroups you haven't set `Delegate=` in. Specifically: 🔥 don't create your
   own cgroups below the root cgroup 🔥. That's owned by systemd, and you will
   step on systemd's toes if you ignore that, and systemd will step on
   yours. Get your own delegated sub-tree; you may create as many cgroups there
   as you like. Seriously, if you create cgroups directly in the cgroup root,
   then all you do is ask for trouble.

2. 🚫 Don't attempt to set `Delegate=` in slice units, and in particular not in
   `-.slice`. It's not supported, and will generate an error.

3. 🚫 Never *write* to any of the attributes of a cgroup systemd created for
   you. It's systemd's private property. You are welcome to manipulate the
   attributes of cgroups you created in your own delegated sub-tree, but the
   cgroup tree of systemd itself is off limits for you. It's fine to *read*
   from any attribute you like, however. That's totally OK and welcome.

4. 🚫 If you don't use `CLONE_NEWCGROUP` when delegating a sub-tree to a
   container payload running systemd, then don't get the idea that you can bind
   mount only a sub-tree of the host's cgroup tree into the container. Part of
   the cgroup API is that `/proc/$PID/cgroup` reports the cgroup path of every
   process, and hence any path below `/sys/fs/cgroup/` needs to match what
   `/proc/$PID/cgroup` of the payload processes reports. What you can do safely
   however, is mount the upper parts of the cgroup tree read-only or even
   replace them with an intermediary `tmpfs`, as long as the path to the
   delegated sub-tree remains accessible as-is.

5. ⚡ Think twice before delegating cgroupsv1 controllers to less privileged
   containers. It's not safe; you basically allow your containers to freeze the
   system with that and worse. Delegation is a strong point of cgroupsv2 though,
   and there it's safe to treat delegation boundaries as privilege boundaries.

And that's it for now. If you have further questions, refer to the systemd
mailing list.

— Berlin, 2018-04-20