---
title: Control Group APIs and Delegation
category: Interfaces
---

# Control Group APIs and Delegation

*Intended audience: hackers working on userspace subsystems that require direct
cgroup access, such as container managers and similar.*

So you are wondering about resource management with systemd, you know Linux
control groups (cgroups) a bit and are trying to integrate your software with
what systemd has to offer there. Here's a bit of documentation about the
concepts and interfaces involved with this.

What's described here has been part of systemd and documented since v205
times. However, it has been updated and improved substantially, even though
the concepts stayed mostly the same. This is an attempt to provide more
comprehensive up-to-date information about all this, in particular in light of
the poor implementations of the components interfacing with systemd of current
container managers.

Before you read on, please make sure you read the low-level [kernel
documentation about
cgroup v2](https://www.kernel.org/doc/Documentation/cgroup-v2.txt). This
documentation then adds in the higher-level view from systemd.

This document augments the existing documentation we already have:

* [The New Control Group Interfaces](https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/)
* [Writing VM and Container Managers](https://www.freedesktop.org/wiki/Software/systemd/writing-vm-managers/)

These wiki documents are not as up to date as they should be, currently, but
the basic concepts still fully apply. You should read them too, if you do
something with cgroups and systemd, in particular as they shine more light on
the various D-Bus APIs provided. (That said, sooner or later we should probably
fold that wiki documentation into this very document, too.)

## Two Key Design Rules

Much of the philosophy behind these concepts is based on a couple of basic
design ideas of cgroup v2 (which we however try to adapt as far as we can to
cgroup v1 too). Specifically two cgroup v2 rules are the most relevant:

1. The **no-processes-in-inner-nodes** rule: this means that it's not permitted
to have processes directly attached to a cgroup that also has child cgroups and
vice versa. A cgroup is either an inner node or a leaf node of the tree, and if
it's an inner node it may not contain processes directly, and if it's a leaf
node then it may not have child cgroups. (Note that there are some minor
exceptions to this rule, though. E.g. the root cgroup is special and allows
both processes and children — which is used in particular to maintain kernel
threads.)

2. The **single-writer** rule: this means that each cgroup only has a single
writer, i.e. a single process managing it. It's OK if different cgroups have
different processes managing them. However, only a single process should own a
specific cgroup, and when it does that ownership is exclusive, and nothing else
should manipulate it at the same time. This rule ensures that various pieces of
software don't step on each other's toes constantly.

These two rules have various effects. For example, one corollary of this is: if
your container manager creates and manages cgroups in the system's root cgroup
you violate rule #2, as the root cgroup is managed by systemd and hence off
limits to everybody else.

Note that rule #1 is generally enforced by the kernel if cgroup v2 is used: as
soon as you add a process to a cgroup it is ensured the rule is not
violated. On cgroup v1 this rule didn't exist, and hence isn't enforced, even
though it's a good thing to follow it there too. Rule #2 is not enforced on
either cgroup v1 or cgroup v2 (this is UNIX after all, in the general case
root can do anything, modulo SELinux and friends), but if you ignore it you'll
be in constant pain as various pieces of software will fight over cgroup
ownership.

Note that cgroup v1 is currently the most deployed implementation, even though
it's semantically broken in many ways, and in many cases doesn't actually do
what people think it does. cgroup v2 is where things are going, and most new
kernel features in this area are only added to cgroup v2, and not cgroup v1
anymore. For example cgroup v2 provides proper cgroup-empty notifications, has
support for all kinds of per-cgroup BPF magic, supports secure delegation of
cgroup trees to less privileged processes and so on, which all are not
available on cgroup v1.

## Three Different Tree Setups 🌳

systemd supports three different modes in which cgroups are set up. Specifically:

1. **Unified** — this is the simplest mode, and exposes a pure cgroup v2
logic. In this mode `/sys/fs/cgroup` is the only mounted cgroup API file system
and all available controllers are exclusively exposed through it.

2. **Legacy** — this is the traditional cgroup v1 mode. In this mode the
various controllers each get their own cgroup file system mounted to
`/sys/fs/cgroup/<controller>/`. On top of that systemd manages its own cgroup
hierarchy for managing purposes as `/sys/fs/cgroup/systemd/`.

3. **Hybrid** — this is a hybrid between the unified and legacy mode. It's set
up mostly like legacy, except that there's also an additional hierarchy
`/sys/fs/cgroup/unified/` that contains the cgroup v2 hierarchy. (Note that in
this mode the unified hierarchy won't have controllers attached, the
controllers are all mounted as separate hierarchies as in legacy mode,
i.e. `/sys/fs/cgroup/unified/` is purely and exclusively about core cgroup v2
functionality and not about resource management.) In this mode compatibility
with cgroup v1 is retained while some cgroup v2 features are available
too. This mode is a stopgap. Don't bother with this too much unless you have
too much free time.

To say this clearly, legacy and hybrid modes have no future. If you develop
software today and don't focus on the unified mode, then you are writing
software for yesterday, not tomorrow. They are primarily supported for
compatibility reasons and will not receive new features. Sorry.

Superficially, in legacy and hybrid modes it might appear that the parallel
cgroup hierarchies for each controller are orthogonal to each other. In
systemd they are not: the hierarchies of all controllers are always kept in
sync (at least mostly: sub-trees might be suppressed in certain hierarchies if
no controller usage is required for them). The fact that systemd keeps these
hierarchies in sync means that the legacy and hybrid hierarchies are
conceptually very close to the unified hierarchy. In particular this allows us
to talk of one specific cgroup and actually mean the same cgroup in all
available controller hierarchies. E.g. if we talk about the cgroup `/foo/bar/`
then we actually mean `/sys/fs/cgroup/cpu/foo/bar/` as well as
`/sys/fs/cgroup/memory/foo/bar/`, `/sys/fs/cgroup/pids/foo/bar/`, and so on.
Note that in cgroup v2 the controller hierarchies aren't orthogonal, hence
thinking about them as orthogonal won't help you in the long run anyway.

If you wonder how to detect which of these three modes is currently used, use
`statfs()` on `/sys/fs/cgroup/`. If it reports `CGROUP2_SUPER_MAGIC` in its
`.f_type` field, then you are in unified mode. If it reports `TMPFS_MAGIC` then
you are either in legacy or hybrid mode. To distinguish these two cases, run
`statfs()` again on `/sys/fs/cgroup/unified/`. If that succeeds and reports
`CGROUP2_SUPER_MAGIC` you are in hybrid mode, otherwise not.

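Here's a minimal sketch of that detection logic in C, assuming the magic
constants from `<linux/magic.h>` (error handling kept deliberately short):

```c
#include <stdio.h>
#include <sys/vfs.h>
#include <linux/magic.h>

/* Detect which of the three cgroup setups is in use, following the statfs()
 * logic described above. Returns a static string, or NULL if undetermined. */
static const char *detect_cgroup_mode(void) {
        struct statfs fs;

        if (statfs("/sys/fs/cgroup/", &fs) < 0)
                return NULL;

        if (fs.f_type == CGROUP2_SUPER_MAGIC)
                return "unified";

        if (fs.f_type != TMPFS_MAGIC)
                return NULL; /* unexpected file system type */

        /* Legacy or hybrid: check for the additional cgroup v2 hierarchy. */
        if (statfs("/sys/fs/cgroup/unified/", &fs) >= 0 &&
            fs.f_type == CGROUP2_SUPER_MAGIC)
                return "hybrid";

        return "legacy";
}

int main(void) {
        const char *mode = detect_cgroup_mode();
        printf("cgroup setup: %s\n", mode ? mode : "unknown");
        return 0;
}
```
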
## systemd's Unit Types

The low-level kernel cgroups feature is exposed in systemd in three different
"unit" types. Specifically:

1. 💼 The `.service` unit type. This unit type is for units encapsulating
   processes systemd itself starts. Units of these types have cgroups that are
   the leaves of the cgroup tree the systemd instance manages (though possibly
   they might contain a sub-tree of their own managed by something else, made
   possible by the concept of delegation, see below). Service units are usually
   instantiated based on a unit file on disk that describes the command line to
   invoke and other properties of the service. However, service units may also
   be declared and started programmatically at runtime through a D-Bus API
   (which is called *transient* services).

2. 👓 The `.scope` unit type. This is very similar to `.service`. The main
   difference: the processes the units of this type encapsulate are forked off
   by some unrelated manager process, and that manager asked systemd to expose
   them as a unit. Unlike services, scopes can only be declared and started
   programmatically, i.e. are always transient. That's because they encapsulate
   processes forked off by something else, i.e. existing runtime objects, and
   hence cannot really be defined fully in 'offline' concepts such as unit
   files.

3. 🔪 The `.slice` unit type. Units of this type do not directly contain any
   processes. Units of this type are the inner nodes of part of the cgroup tree
   the systemd instance manages. Much like services, slices can be defined
   either on disk with unit files or programmatically as transient units.

Slices expose the trunk and branches of a tree, and scopes and services are
attached to those branches as leaves. The idea is that scopes and services can
be moved around though, i.e. assigned to a different slice if needed.

The naming of slice units directly maps to the cgroup tree path. This is not
the case for service and scope units however. A slice named `foo-bar-baz.slice`
maps to a cgroup `/foo.slice/foo-bar.slice/foo-bar-baz.slice/`. A service
`quux.service` which is attached to the slice `foo-bar-baz.slice` maps to the
cgroup `/foo.slice/foo-bar.slice/foo-bar-baz.slice/quux.service/`.

By default systemd sets up four slice units:

1. `-.slice` is the root slice, i.e. the parent of everything else. On the host
   system it maps directly to the top-level directory of cgroup v2.

2. `system.slice` is where system services are by default placed, unless
   configured otherwise.

3. `user.slice` is where user sessions are placed. Each user gets a slice of
   its own below that.

4. `machine.slice` is where VMs and containers are supposed to be
   placed. `systemd-nspawn` makes use of this by default, and you're very
   welcome to place your containers and VMs there too if you hack on managers
   for those.

Users may define any amount of additional slices they like though, the four
above are just the defaults.

## Delegation

Container managers and suchlike often want to control cgroups directly using
the raw kernel APIs. That's entirely fine and supported, as long as proper
*delegation* is followed. Delegation is a concept we inherited from cgroup v2,
but we expose it on cgroup v1 too. Delegation means that some parts of the
cgroup tree may be managed by different managers than others. As long as it is
clear which manager manages which part of the tree each one can do within its
sub-graph of the tree whatever it wants.

Only sub-trees can be delegated (though whoever decides to request a sub-tree
can delegate sub-sub-trees further to somebody else if they like). Delegation
takes place at a specific cgroup: in systemd there's a `Delegate=` property you
can set for a service or scope unit. If you do, it's the cut-off point for
systemd's cgroup management: the unit itself is managed by systemd, i.e. all
its attributes are managed exclusively by systemd, however your program may
create/remove sub-cgroups inside it freely, and those then become exclusive
property of your program, systemd won't touch them — all attributes of *those*
sub-cgroups can be manipulated freely and exclusively by your program.

By turning on the `Delegate=` property for a scope or service you get a few
guarantees:

1. systemd won't fiddle with your sub-tree of the cgroup tree anymore. It won't
   change attributes of any cgroups below it, nor will it create or remove any
   cgroups thereunder, nor migrate processes across the boundaries of that
   sub-tree as it deems useful anymore.

2. If your service makes use of the `User=` functionality, then the sub-tree
   will be `chown()`ed to the indicated user so that it can correctly create
   cgroups below it. Note however that systemd will do that only in the unified
   hierarchy (in unified and hybrid mode) as well as on systemd's own private
   hierarchy (in legacy and hybrid mode). It won't pass ownership of the legacy
   controller hierarchies. Delegation to less privileged processes is not safe
   in cgroup v1 (as a limitation of the kernel), hence systemd won't facilitate
   access to it.

3. Any BPF IP filter programs systemd installs will be installed with
   `BPF_F_ALLOW_MULTI` so that your program can install additional ones.

In unit files the `Delegate=` property is superficially exposed as a
boolean. However, since v236 it optionally takes a list of controller names
instead. If so, delegation is requested for the listed controllers
specifically. Note that this only encodes a request. Depending on various
parameters it might happen that your service actually will get fewer
controllers delegated (for example, because the controller is not available on
the current kernel or was turned off) or more. If no list is specified
(i.e. the property is simply set to `yes`) then all available controllers are
delegated.

Let's stress one thing: delegation is available on scope and service units
only. It's expressly not available on slice units. Why? Because slice units are
our *inner* nodes of the cgroup trees and we freely attach services and scopes
to them. If we'd allow delegation on slice units then this would mean that
both systemd and your own manager would create/delete cgroups below the slice
unit and that conflicts with the single-writer rule.

So, if you want to do your own raw cgroups kernel level access, then allocate a
scope unit, or a service unit (or just use the service unit you already have
for your service code), and turn on delegation for it.

(OK, here's one caveat: if you turn on delegation for a service, and that
service has `ExecStartPost=`, `ExecReload=`, `ExecStop=` or `ExecStopPost=`
set, then these commands will be executed within the `.control/` sub-cgroup of
your service's cgroup. This is necessary because by turning on delegation we
have to assume that the cgroup delegated to your service is now an *inner*
cgroup, which means that it may not directly contain any processes. Hence, if
your service has any of these four settings set, you must be prepared that a
`.control/` sub-cgroup might appear, managed by the service manager. This also
means that your service code should have moved itself further down the cgroup
tree by the time it notifies the service manager about start-up readiness, so
that the service's main cgroup is definitely an inner node by the time the
service manager might start `ExecStartPost=`.)

## Three Scenarios

Let's say you write a container manager, and you wonder what to do regarding
cgroups for it, as you want your manager to be able to run on systemd systems.

You basically have three options:

1. 😊 The *integration-is-good* option. For this, you register each container
   you have either as a systemd service (i.e. let systemd invoke the executor
   binary for you) or a systemd scope (i.e. your manager executes the binary
   directly, but then tells systemd about it). In this mode the administrator
   can use the usual systemd resource management and reporting commands
   individually on those containers. By turning on `Delegate=` for these scopes
   or services you make it possible to run cgroup-enabled programs in your
   containers, for example a nested systemd instance. This option has two
   sub-options:

   a. You transiently register the service or scope by directly contacting
      systemd via D-Bus. In this case systemd will just manage the unit for you
      and nothing else. (A sketch of such a call follows after this list.)

   b. Instead you register the service or scope through `systemd-machined`
      (also via D-Bus). This mini-daemon is basically just a proxy for the same
      operations as in a. The main benefit of this: this way you let the system
      know that what you are registering is a container, and this opens up
      certain additional integration points. For example, `journalctl -M` can
      then be used to directly look into any container's journal logs (should
      the container run systemd inside), or `systemctl -M` can be used to
      directly invoke systemd operations inside the containers. Moreover tools
      like "ps" can then show you to which container a process belongs (`ps -eo
      pid,comm,machine`), and even gnome-system-monitor supports it.

2. 🙁 The *i-like-islands* option. If all you care about is your own cgroup
   tree, you want to have as little to do with systemd as possible, and you
   have no interest in integration with the rest of the system, then this is a
   valid option. For this all you have to do is turn on `Delegate=` for your
   main manager daemon. Then figure out the cgroup systemd placed your daemon
   in: you can now freely create sub-cgroups beneath it. Don't forget the
   *no-processes-in-inner-nodes* rule however: you have to move your main
   daemon process out of that cgroup (and into a sub-cgroup) before you can
   start further processes in any of your sub-cgroups.

3. 🙁 The *i-like-continents* option. In this option you'd leave your manager
   daemon where it is, and would not turn on delegation on its unit. However,
   as the first thing you register a new scope unit with systemd, and that
   scope unit would have `Delegate=` turned on, and then you place all your
   containers underneath it. From systemd's PoV there'd be two units: your
   manager service and the big scope that contains all your containers in one.

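For sub-option 1a, here's a hedged sketch in C of what registering such a
transient scope over D-Bus could look like, using sd-bus (link with
`-lsystemd`). It wraps an already-forked-off payload PID in a new scope with
delegation turned on; the unit name `payload-xyz.scope` and the choice of
`machine.slice` are purely illustrative, and real code would also want to wait
for the returned job to complete:

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <systemd/sd-bus.h>

/* Sketch: ask the service manager to wrap an existing process in a transient
 * scope unit with delegation enabled. Unit and slice names are made up. */
static int register_scope(pid_t payload_pid) {
        sd_bus *bus = NULL;
        sd_bus_message *m = NULL, *reply = NULL;
        sd_bus_error error = SD_BUS_ERROR_NULL;
        int r;

        r = sd_bus_open_system(&bus);
        if (r < 0)
                goto finish;

        r = sd_bus_message_new_method_call(bus, &m,
                                           "org.freedesktop.systemd1",
                                           "/org/freedesktop/systemd1",
                                           "org.freedesktop.systemd1.Manager",
                                           "StartTransientUnit");
        if (r < 0)
                goto finish;

        /* Unit name and job mode */
        r = sd_bus_message_append(m, "ss", "payload-xyz.scope", "fail");
        if (r < 0)
                goto finish;

        /* Properties: the PID to adopt, delegation, and the target slice */
        r = sd_bus_message_append(m, "a(sv)", 3,
                                  "PIDs", "au", 1, (uint32_t) payload_pid,
                                  "Delegate", "b", 1,
                                  "Slice", "s", "machine.slice");
        if (r < 0)
                goto finish;

        /* No auxiliary units */
        r = sd_bus_message_append(m, "a(sa(sv))", 0);
        if (r < 0)
                goto finish;

        r = sd_bus_call(bus, m, 0, &error, &reply);
        if (r < 0)
                fprintf(stderr, "StartTransientUnit failed: %s\n",
                        error.message ? error.message : "unknown error");

finish:
        sd_bus_error_free(&error);
        sd_bus_message_unref(reply);
        sd_bus_message_unref(m);
        sd_bus_unref(bus);
        return r;
}
```
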
BTW: if for whatever reason you say "I hate D-Bus, I'll never call any D-Bus
API, kthxbye", then options #1 and #3 are not available, as they generally
involve talking to systemd from your program code, via D-Bus. You still have
option #2 in that case however, as you can simply set `Delegate=` in your
service's unit file and you are done and have your own sub-tree. In fact, #2 is
the one option that allows you to completely ignore systemd's existence: you
can entirely generically follow the single rule that you just use the cgroup
you are started in, and everything below it, whatever that might be (a sketch
of that pattern follows below). That said, maybe if you dislike D-Bus and
systemd that much, the better approach might be to work on that, and widen
your horizon a bit. You are welcome.

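To illustrate that pattern behind option #2, here's a minimal sketch, assuming
the unified hierarchy and a service started with `Delegate=yes`: the daemon
determines the cgroup it was started in from `/proc/self/cgroup`, creates a
sub-cgroup (the name `supervisor` is just an example), and moves itself into
it so that the delegated cgroup itself stays an inner node:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
        char line[4096], path[4200], sub[4300], procs[4400];
        FILE *f;

        /* On the unified hierarchy /proc/self/cgroup contains a single line
         * of the form "0::/system.slice/mymanager.service". */
        f = fopen("/proc/self/cgroup", "re");
        if (!f)
                return 1;
        if (!fgets(line, sizeof(line), f)) {
                fclose(f);
                return 1;
        }
        fclose(f);
        line[strcspn(line, "\n")] = 0;

        const char *rel = strstr(line, "::");
        if (!rel)
                return 1;
        rel += 2;
        snprintf(path, sizeof(path), "/sys/fs/cgroup%s", rel);

        /* Create a sub-cgroup for ourselves and move into it, so that further
         * sub-cgroups can be created as siblings without violating the
         * no-processes-in-inner-nodes rule. */
        snprintf(sub, sizeof(sub), "%s/supervisor", path);
        if (mkdir(sub, 0755) < 0 && errno != EEXIST)
                return 1;

        snprintf(procs, sizeof(procs), "%s/cgroup.procs", sub);
        f = fopen(procs, "we");
        if (!f)
                return 1;
        fprintf(f, "%d\n", (int) getpid());
        fclose(f);

        /* From here on, payload cgroups can be created next to "supervisor/"
         * below the delegated cgroup, and processes started inside them. */
        return 0;
}
```
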
## Controller Support

systemd supports a number of controllers (but not all). Specifically, supported
are:

* on cgroup v1: `cpu`, `cpuacct`, `blkio`, `memory`, `devices`, `pids`
* on cgroup v2: `cpu`, `io`, `memory`, `pids`

It is our intention to natively support all cgroup v2 controllers as they are
added to the kernel. However, regarding cgroup v1: at this point we will not
add support for any other controllers anymore. This means systemd currently
does not and will never manage the following controllers on cgroup v1:
`freezer`, `cpuset`, `net_cls`, `perf_event`, `net_prio`, `hugetlb`. Why not?
Depending on the case, either their API semantics or implementations aren't
really usable, or it's very clear they have no future on cgroup v2, and we
won't add new code for stuff that clearly has no future.

Effectively this means that all those mentioned cgroup v1 controllers are up
for grabs: systemd won't manage them, and hence won't delegate them to your
code (however, systemd will still mount their hierarchies, simply because it
mounts all controller hierarchies it finds available in the kernel). If you
decide to use them, then that's fine, but systemd won't help you with it (but
also won't interfere with it). To be nice to other tenants it might be wise to
replicate the cgroup hierarchies of the other controllers in them too however,
but of course that's between you and those other tenants, and systemd won't
care. Replicating the cgroup hierarchies in those unsupported controllers would
mean replicating the full cgroup paths in them, and hence the prefixing
`.slice` components too, otherwise the hierarchies will start being orthogonal
after all, and that's not really desirable (a rough sketch of such replication
follows below).

One more thing: systemd will clean up after you in the hierarchies it manages:
if your daemon goes down, its cgroups will be removed too. You basically get
the guarantee that you start with a pristine cgroup sub-tree for your service
or scope whenever it is started. This is not the case however in the
hierarchies systemd doesn't manage. This means that your programs should be
ready to deal with left-over cgroups in them — from previous runs, and be extra
careful with them as they might still carry settings that might not be valid
anymore.

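As a rough sketch of that replication, assuming you already know your
delegated cgroup path (e.g. from `/proc/self/cgroup`) and picking `net_cls`
purely as an example of a controller systemd doesn't manage, creating the
directories component by component might look like this:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Sketch: replicate a delegated cgroup path (e.g.
 * "/machine.slice/payload-xyz.scope") in a controller hierarchy systemd
 * doesn't manage, similar to "mkdir -p". */
static int replicate_path(const char *controller, const char *cgroup_path) {
        char buf[4096];
        int n;

        n = snprintf(buf, sizeof(buf), "/sys/fs/cgroup/%s", controller);
        if (n < 0 || (size_t) n >= sizeof(buf))
                return -ENAMETOOLONG;

        for (const char *p = cgroup_path; *p; p++) {
                /* Each time we hit a '/' the path collected so far names a
                 * parent cgroup that needs to exist first. */
                if (*p == '/' && p != cgroup_path)
                        if (mkdir(buf, 0755) < 0 && errno != EEXIST)
                                return -errno;

                if ((size_t) n + 1 >= sizeof(buf))
                        return -ENAMETOOLONG;
                buf[n++] = *p;
                buf[n] = 0;
        }

        if (mkdir(buf, 0755) < 0 && errno != EEXIST)
                return -errno;

        return 0;
}
```

A call like `replicate_path("net_cls", "/machine.slice/payload-xyz.scope")`
would then create the matching directories in that hierarchy.
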
Note a particular asymmetry here: if your systemd version doesn't support a
specific controller on cgroup v1 you can still make use of it for delegation,
by directly fiddling with its hierarchy and replicating the cgroup tree there
as necessary (as suggested above). However, on cgroup v2 this is different:
separately mounted hierarchies are not available, and delegation always has to
happen through systemd itself. This means: when you update your kernel and it
adds a new, so far unseen controller, and you want to use it for delegation,
then you also need to update systemd to a version that groks it.

## systemd as Container Payload

systemd can happily run as a container payload's PID 1. Note that systemd
unconditionally needs write access to the cgroup tree however, hence you need
to delegate a sub-tree to it. Note that there's nothing too special you have to
do beyond that: just invoke systemd as PID 1 inside the root of the delegated
cgroup sub-tree, and it will figure out the rest: it will determine the cgroup
it is running in and take possession of it. It won't interfere with any cgroup
outside of the sub-tree it was invoked in. Use of `CLONE_NEWCGROUP` is hence
optional (but of course wise).

Note one particular asymmetry here though: systemd will try to take possession
of the root cgroup you pass to it *in* *full*, i.e. it will not only
create/remove child cgroups below it, it will also attempt to manage the
attributes of it. OTOH as mentioned above, when delegating a cgroup tree to
somebody else it only passes the rights to create/remove sub-cgroups, but will
insist on managing the delegated cgroup tree's top-level attributes. Or in
other words: systemd is *greedy* when accepting delegated cgroup trees and also
*greedy* when delegating them to others: it insists on managing attributes on
the specific cgroup in both cases. A container manager that is itself a payload
of a host systemd which wants to run a systemd as its own container payload
instead hence needs to insert an extra level in the hierarchy in between, so
that the systemd on the host and the one in the container won't fight for the
attributes. That said, you likely should do that anyway, due to the
no-processes-in-inner-nodes rule, see below.

When systemd runs as container payload it will make use of all hierarchies it
has write access to. For legacy mode you need to make at least
`/sys/fs/cgroup/systemd/` available, all other hierarchies are optional. For
hybrid mode you need to add `/sys/fs/cgroup/unified/`. Finally, for fully
unified mode you (of course, I guess) need to provide only `/sys/fs/cgroup/`
itself.

## Some Dos

1. ⚡ If you go for implementation option 1a or 1b (as in the list above), then
   each of your containers will have its own systemd-managed unit and hence
   cgroup with possibly further sub-cgroups below. Typically the first process
   running in that unit will be some kind of executor program, which will in
   turn fork off the payload processes of the container. In this case don't
   forget that there are two levels of delegation involved: first, systemd
   delegates a cgroup sub-tree to your executor. And then your executor should
   delegate a sub-tree further down to the container payload. Oh, and because
   of the no-processes-in-inner-nodes rule, your executor needs to migrate
   itself to a sub-cgroup of the cgroup it got delegated, too. Most likely you
   hence want a two-pronged approach: below the cgroup you got started in, you
   want one cgroup maybe called `supervisor/` where your manager runs in and
   then for each container a sibling cgroup of that maybe called
   `payload-xyz/`.

2. ⚡ Don't forget that the cgroups you create have to have names that are
   suitable as UNIX file names, and that they live in the same namespace as the
   various kernel attribute files. Hence, when you want to allow the user
   arbitrary naming, you might need to escape some of the names (for example,
   you really don't want to create a cgroup named `tasks`, just because the
   user created a container by that name, because `tasks` after all is a magic
   attribute in cgroup v1, and your `mkdir()` will hence fail with `EEXIST`).
   In systemd we do escaping by prefixing names that might collide with a
   kernel attribute name with an underscore. You might want to do the same, but
   this is really up to you how you do it. Just do it, and be careful. (A small
   sketch of such an escaping scheme follows after this list.)

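Here's a small sketch of such an escaping scheme, loosely modeled on the
convention described above; the list of reserved names is illustrative only
and deliberately incomplete, not the exact set systemd checks:

```c
#include <stdio.h>
#include <string.h>

/* Sketch: escape a user-supplied name before using it as a cgroup directory
 * name, by prefixing an underscore whenever it could collide with a kernel
 * attribute file. The reserved list is only an example. */
static void escape_cgroup_name(const char *name, char *out, size_t out_size) {
        static const char *const reserved[] = {
                "tasks", "notify_on_release", "release_agent",
        };
        int collides = 0;

        for (size_t i = 0; i < sizeof(reserved) / sizeof(reserved[0]); i++)
                if (strcmp(name, reserved[i]) == 0)
                        collides = 1;

        /* "cgroup."-prefixed names are attribute files on cgroup v2, and names
         * already starting with "_" would be ambiguous with escaped ones. */
        if (strncmp(name, "cgroup.", 7) == 0 || name[0] == '_')
                collides = 1;

        snprintf(out, out_size, "%s%s", collides ? "_" : "", name);
}

int main(void) {
        char buf[256];
        escape_cgroup_name("tasks", buf, sizeof(buf));
        printf("%s\n", buf); /* prints "_tasks" */
        return 0;
}
```
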
## Some Don'ts

1. 🚫 Never create your own cgroups below arbitrary cgroups systemd manages,
   i.e. cgroups you haven't set `Delegate=` in. Specifically: 🔥 don't create
   your own cgroups below the root cgroup 🔥. That's owned by systemd, and you
   will step on systemd's toes if you ignore that, and systemd will step on
   yours. Get your own delegated sub-tree, you may create as many cgroups there
   as you like. Seriously, if you create cgroups directly in the cgroup root,
   then all you do is ask for trouble.

2. 🚫 Don't attempt to set `Delegate=` in slice units, and in particular not in
   `-.slice`. It's not supported, and will generate an error.

3. 🚫 Never *write* to any of the attributes of a cgroup systemd created for
   you. It's systemd's private property. You are welcome to manipulate the
   attributes of cgroups you created in your own delegated sub-tree, but the
   cgroup tree of systemd itself is off limits for you. It's fine to *read*
   from any attribute you like however. That's totally OK and welcome.

4. 🚫 When not using `CLONE_NEWCGROUP` when delegating a sub-tree to a
   container payload running systemd, then don't get the idea that you can bind
   mount only a sub-tree of the host's cgroup tree into the container. Part of
   the cgroup API is that `/proc/$PID/cgroup` reports the cgroup path of every
   process, and hence any path below `/sys/fs/cgroup/` needs to match what
   `/proc/$PID/cgroup` of the payload processes reports. What you can do safely
   however, is mount the upper parts of the cgroup tree read-only (or even
   replace the middle bits with an intermediary `tmpfs` — but be careful not to
   break the `statfs()` detection logic discussed above), as long as the path
   to the delegated sub-tree remains accessible as-is.

5. ⚡ Currently, the algorithm for mapping between slice/scope/service unit
   naming and their cgroup paths is not considered public API of systemd, and
   may change in future versions. This means: it's best to avoid implementing a
   local logic of translating cgroup paths to slice/scope/service names in your
   program, or vice versa — it's likely going to break sooner or later. Use the
   appropriate D-Bus API calls for that instead, so that systemd translates
   this for you. (Specifically: each Unit object has a `ControlGroup` property
   to get the cgroup for a unit. The method `GetUnitByControlGroup()` may be
   used to get the unit for a cgroup. A sketch of the latter call follows after
   this list.)

6. ⚡ Think twice before delegating cgroup v1 controllers to less privileged
   containers. It's not safe, you basically allow your containers to freeze the
   system with that and worse. Delegation is a strongpoint of cgroup v2 though,
   and there it's safe to treat delegation boundaries as privilege boundaries.

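For instance, here's a hedged sd-bus sketch of resolving a cgroup path to its
owning unit via `GetUnitByControlGroup()`; the argument and return signatures
used here (a cgroup path string in, a unit object path out) are assumptions,
so double-check them against the D-Bus introspection data before relying on
this:

```c
#include <stdio.h>
#include <systemd/sd-bus.h>

/* Sketch: look up the unit owning a given cgroup path. The "s" -> "o"
 * signature is an assumption; consult systemd's D-Bus API documentation for
 * the authoritative definition. */
int main(int argc, char **argv) {
        sd_bus *bus = NULL;
        sd_bus_message *reply = NULL;
        sd_bus_error error = SD_BUS_ERROR_NULL;
        const char *unit_path = NULL;
        int r;

        if (argc < 2) {
                fprintf(stderr, "Usage: %s /system.slice/foo.service\n", argv[0]);
                return 1;
        }

        r = sd_bus_open_system(&bus);
        if (r < 0)
                return 1;

        r = sd_bus_call_method(bus,
                               "org.freedesktop.systemd1",
                               "/org/freedesktop/systemd1",
                               "org.freedesktop.systemd1.Manager",
                               "GetUnitByControlGroup",
                               &error, &reply,
                               "s", argv[1]);
        if (r >= 0)
                r = sd_bus_message_read(reply, "o", &unit_path);
        if (r >= 0)
                printf("unit object: %s\n", unit_path);
        else
                fprintf(stderr, "lookup failed: %s\n",
                        error.message ? error.message : "unknown error");

        sd_bus_error_free(&error);
        sd_bus_message_unref(reply);
        sd_bus_unref(bus);
        return r < 0 ? 1 : 0;
}
```
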
And that's it for now. If you have further questions, refer to the systemd
mailing list.

— Berlin, 2018-04-20