Commit | Line | Data |
---|---|---|
c3e270f4 FB |
1 | --- |
2 | title: Control Group APIs and Delegation | |
4cdca0af | 3 | category: Interfaces |
b41a3f66 | 4 | layout: default |
0aff7b75 | 5 | SPDX-License-Identifier: LGPL-2.1-or-later |
c3e270f4 FB |
6 | --- |
7 | ||
e30eaff3 LP |
8 | # Control Group APIs and Delegation |
9 | ||
1e46eb59 LP |
10 | *Intended audience: hackers working on userspace subsystems that require direct |
11 | cgroup access, such as container managers and similar.* | |
12 | ||
e30eaff3 LP |
13 | So you are wondering about resource management with systemd, you know Linux |
14 | control groups (cgroups) a bit and are trying to integrate your software with | |
15 | what systemd has to offer there. Here's a bit of documentation about the | |
16 | concepts and interfaces involved with this. | |
17 | ||
18 | What's described here has been part of systemd and documented since v205. | |
5b24525a | 19 | However, it has been updated and improved substantially, even |
e30eaff3 LP |
20 | though the concepts stayed mostly the same. This is an attempt to provide more |
21 | comprehensive up-to-date information about all this, particularly in light of the | |
22 | poor implementations of the components interfacing with systemd of current | |
23 | container managers. | |
24 | ||
bb6d563a ZJS |
25 | Before you read on, please make sure you read the low-level kernel |
26 | documentation about the | |
0e685823 | 27 | [unified cgroup hierarchy](https://docs.kernel.org/admin-guide/cgroup-v2.html). |
bb6d563a | 28 | This document then adds in the higher-level view from systemd. |
e30eaff3 LP |
29 | |
30 | This document augments the existing documentation we already have: | |
31 | ||
a25d9395 BF |
32 | * [The New Control Group Interfaces](https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface) |
33 | * [Writing VM and Container Managers](https://www.freedesktop.org/wiki/Software/systemd/writing-vm-managers) | |
e30eaff3 LP |
34 | |
35 | These wiki documents are not as up to date as they should be, currently, but | |
36 | the basic concepts still fully apply. You should read them too, if you do something | |
37 | with cgroups and systemd, in particular as they shine more light on the various | |
38 | D-Bus APIs provided. (That said, sooner or later we should probably fold that | |
39 | wiki documentation into this very document, too.) | |
40 | ||
41 | ## Two Key Design Rules | |
42 | ||
43 | Much of the philosophy behind these concepts is based on a couple of basic | |
4e1dfa45 CD |
44 | design ideas of cgroup v2 (which we however try to adapt as far as we can to |
45 | cgroup v1 too). Specifically two cgroup v2 rules are the most relevant: | |
e30eaff3 LP |
46 | |
47 | 1. The **no-processes-in-inner-nodes** rule: this means that it's not permitted | |
48 | to have processes directly attached to a cgroup that also has child cgroups and | |
49 | vice versa. A cgroup is either an inner node or a leaf node of the tree, and if | |
50 | it's an inner node it may not contain processes directly, and if it's a leaf | |
51 | node then it may not have child cgroups. (Note that there are some minor | |
5b24525a | 52 | exceptions to this rule, though. E.g. the root cgroup is special and allows |
e30eaff3 LP |
53 | both processes and children — which is used in particular to maintain kernel |
54 | threads.) | |
55 | ||
56 | 2. The **single-writer** rule: this means that each cgroup only has a single | |
57 | writer, i.e. a single process managing it. It's OK if different cgroups have | |
58 | different processes managing them. However, only a single process should own a | |
59 | specific cgroup, and when it does that ownership is exclusive, and nothing else | |
60 | should manipulate it at the same time. This rule ensures that various pieces of | |
61 | software don't step on each other's toes constantly. | |
62 | ||
63 | These two rules have various effects. For example, one corollary of this is: if | |
64 | your container manager creates and manages cgroups in the system's root cgroup | |
65 | you violate rule #2, as the root cgroup is managed by systemd and hence off | |
66 | limits to everybody else. | |
67 | ||
4e1dfa45 | 68 | Note that rule #1 is generally enforced by the kernel if cgroup v2 is used: as |
e30eaff3 | 69 | soon as you add a process to a cgroup it is ensured the rule is not |
4e1dfa45 | 70 | violated. On cgroup v1 this rule didn't exist, and hence isn't enforced, even |
e30eaff3 | 71 | though it's a good thing to follow it then too. Rule #2 is not enforced on |
4e1dfa45 | 72 | either cgroup v1 or cgroup v2 (this is UNIX after all, in the general case |
e30eaff3 LP |
73 | root can do anything, modulo SELinux and friends), but if you ignore it you'll |
74 | be in constant pain as various pieces of software will fight over cgroup | |
75 | ownership. | |
76 | ||
4e1dfa45 | 77 | Note that cgroup v1 is currently the most deployed implementation, even though |
5b24525a | 78 | it's semantically broken in many ways, and in many cases doesn't actually do |
4e1dfa45 CD |
79 | what people think it does. cgroup v2 is where things are going, and most new |
80 | kernel features in this area are only added to cgroup v2, and not cgroup v1 | |
5c7a4f21 | 81 | anymore. For example, cgroup v2 provides proper cgroup-empty notifications, has |
5b24525a ZJS |
82 | support for all kinds of per-cgroup BPF magic, supports secure delegation of |
83 | cgroup trees to less privileged processes and so on, which all are not | |
4e1dfa45 | 84 | available on cgroup v1. |
e30eaff3 LP |
85 | |
86 | ## Three Different Tree Setups 🌳 | |
87 | ||
88 | systemd supports three different modes for how cgroups are set up. Specifically: | |
89 | ||
4e1dfa45 | 90 | 1. **Unified** — this is the simplest mode, and exposes a pure cgroup v2 |
e30eaff3 LP |
91 | logic. In this mode `/sys/fs/cgroup` is the only mounted cgroup API file system |
92 | and all available controllers are exclusively exposed through it. | |
93 | ||
4e1dfa45 | 94 | 2. **Legacy** — this is the traditional cgroup v1 mode. In this mode the |
e30eaff3 LP |
95 | various controllers each get their own cgroup file system mounted to |
96 | `/sys/fs/cgroup/<controller>/`. On top of that systemd manages its own cgroup | |
97 | hierarchy for managing purposes as `/sys/fs/cgroup/systemd/`. | |
98 | ||
99 | 3. **Hybrid** — this is a hybrid between the unified and legacy mode. It's set | |
100 | up mostly like legacy, except that there's also an additional hierarchy | |
4e1dfa45 | 101 | `/sys/fs/cgroup/unified/` that contains the cgroup v2 hierarchy. (Note that in |
9afd5740 LP |
102 | this mode the unified hierarchy won't have controllers attached, the |
103 | controllers are all mounted as separate hierarchies as in legacy mode, | |
4e1dfa45 | 104 | i.e. `/sys/fs/cgroup/unified/` is purely and exclusively about core cgroup v2 |
9afd5740 | 105 | functionality and not about resource management.) In this mode compatibility |
4e1dfa45 | 106 | with cgroup v1 is retained while some cgroup v2 features are available |
9afd5740 LP |
107 | too. This mode is a stopgap. Don't bother with this too much unless you have |
108 | too much free time. | |
e30eaff3 LP |
109 | |
110 | To say this clearly, legacy and hybrid modes have no future. If you develop | |
111 | software today and don't focus on the unified mode, then you are writing | |
112 | software for yesterday, not tomorrow. They are primarily supported for | |
113 | compatibility reasons and will not receive new features. Sorry. | |
114 | ||
115 | Superficially, in legacy and hybrid modes it might appear that the parallel | |
116 | cgroup hierarchies for each controller are orthogonal to each other. In | |
117 | systemd they are not: the hierarchies of all controllers are always kept in | |
118 | sync (at least mostly: sub-trees might be suppressed in certain hierarchies if | |
119 | no controller usage is required for them). The fact that systemd keeps these | |
120 | hierarchies in sync means that the legacy and hybrid hierarchies are | |
121 | conceptually very close to the unified hierarchy. In particular this allows us | |
5b24525a ZJS |
122 | to talk of one specific cgroup and actually mean the same cgroup in all |
123 | available controller hierarchies. E.g. if we talk about the cgroup `/foo/bar/` | |
124 | then we actually mean `/sys/fs/cgroup/cpu/foo/bar/` as well as | |
125 | `/sys/fs/cgroup/memory/foo/bar/`, `/sys/fs/cgroup/pids/foo/bar/`, and so on. | |
4e1dfa45 | 126 | Note that in cgroup v2 the controller hierarchies aren't orthogonal, hence |
e30eaff3 LP |
127 | thinking about them as orthogonal won't help you in the long run anyway. |
128 | ||
129 | If you wonder how to detect which of these three modes is currently used, use | |
130 | `statfs()` on `/sys/fs/cgroup/`. If it reports `CGROUP2_SUPER_MAGIC` in its | |
131 | `.f_type` field, then you are in unified mode. If it reports `TMPFS_MAGIC` then | |
b2454670 | 132 | you are either in legacy or hybrid mode. To distinguish these two cases, run |
e30eaff3 LP |
133 | `statfs()` again on `/sys/fs/cgroup/unified/`. If that succeeds and reports |
134 | `CGROUP2_SUPER_MAGIC` you are in hybrid mode, otherwise not. | |
7833a46c | 135 | From a shell, you can check the `Type` in `stat -f /sys/fs/cgroup` and |
ed1de710 | 136 | `stat -f /sys/fs/cgroup/unified`. |
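
The detection logic above can be sketched in Python as follows. This is a minimal illustration, not systemd's own implementation: Python's standard library does not expose the `f_type` field of `statfs()`, so the sketch assumes GNU `stat` is installed and reads the filesystem type via `stat -f -c %t`. The function names are made up for this example; the magic values are the ones defined in `<linux/magic.h>`.

```python
import subprocess

CGROUP2_SUPER_MAGIC = 0x63677270  # from <linux/magic.h>
TMPFS_MAGIC = 0x01021994          # from <linux/magic.h>


def fs_type(path):
    """Return the filesystem magic of `path`, or None if it cannot be queried.

    Assumes GNU stat: `stat -f -c %t` prints the statfs() f_type in hex.
    """
    try:
        out = subprocess.run(
            ["stat", "-f", "-c", "%t", path],
            capture_output=True, text=True, check=True,
        ).stdout
        return int(out.strip(), 16)
    except (OSError, subprocess.CalledProcessError, ValueError):
        return None


def detect_cgroup_mode(fs_type_of=fs_type):
    """Classify the setup as 'unified', 'hybrid', 'legacy' or 'unknown'."""
    root = fs_type_of("/sys/fs/cgroup/")
    if root == CGROUP2_SUPER_MAGIC:
        return "unified"
    if root == TMPFS_MAGIC:
        # Legacy and hybrid both mount a tmpfs here; only hybrid has a
        # cgroup v2 file system mounted on the unified/ subdirectory.
        if fs_type_of("/sys/fs/cgroup/unified/") == CGROUP2_SUPER_MAGIC:
            return "hybrid"
        return "legacy"
    return "unknown"
```

Calling `detect_cgroup_mode()` without arguments queries the running system. The `fs_type_of` parameter only exists so the classification can be exercised without a real cgroup mount.
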
e30eaff3 LP |
137 | |
138 | ## systemd's Unit Types | |
139 | ||
140 | The low-level kernel cgroups feature is exposed in systemd in three different | |
141 | "unit" types. Specifically: | |
142 | ||
143 | 1. 💼 The `.service` unit type. This unit type is for units encapsulating | |
144 | processes systemd itself starts. Units of these types have cgroups that are | |
145 | the leaves of the cgroup tree the systemd instance manages (though possibly | |
146 | they might contain a sub-tree of their own managed by something else, made | |
147 | possible by the concept of delegation, see below). Service units are usually | |
148 | instantiated based on a unit file on disk that describes the command line to | |
149 | invoke and other properties of the service. However, service units may also | |
150 | be declared and started programmatically at runtime through a D-Bus API | |
151 | (which is called *transient* services). | |
152 | ||
153 | 2. 👓 The `.scope` unit type. This is very similar to `.service`. The main | |
154 | difference: the processes the units of this type encapsulate are forked off | |
155 | by some unrelated manager process, and that manager asked systemd to expose | |
156 | them as a unit. Unlike services, scopes can only be declared and started | |
157 | programmatically, i.e. are always transient. That's because they encapsulate | |
158 | processes forked off by something else, i.e. existing runtime objects, and | |
159 | hence cannot really be defined fully in 'offline' concepts such as unit | |
160 | files. | |
161 | ||
162 | 3. 🔪 The `.slice` unit type. Units of this type do not directly contain any | |
163 | processes. Units of this type are the inner nodes of part of the cgroup tree | |
164 | the systemd instance manages. Much like services, slices can be defined | |
165 | either on disk with unit files or programmatically as transient units. | |
166 | ||
167 | Slices expose the trunk and branches of a tree, and scopes and services are | |
168 | attached to those branches as leaves. The idea is that scopes and services can | |
169 | be moved around though, i.e. assigned to a different slice if needed. | |
170 | ||
171 | The naming of slice units directly maps to the cgroup tree path. This is not | |
172 | the case for service and scope units however. A slice named `foo-bar-baz.slice` | |
173 | maps to a cgroup `/foo.slice/foo-bar.slice/foo-bar-baz.slice/`. A service | |
174 | `quux.service` which is attached to the slice `foo-bar-baz.slice` maps to the | |
175 | cgroup `/foo.slice/foo-bar.slice/foo-bar-baz.slice/quux.service/`. | |
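
To make the naming rule concrete, here is a small Python sketch of the expansion described above. It is an illustration only: the function names are hypothetical, it ignores escaping and any other special cases, and, as noted in the "Some Don'ts" section of this document, the mapping is not public API, so real programs should ask systemd for the cgroup path instead of computing it.

```python
def slice_to_cgroup_path(slice_name):
    """Expand a slice unit name into its cgroup path (illustration only)."""
    if not slice_name.endswith(".slice"):
        raise ValueError("not a slice unit name: %r" % slice_name)
    if slice_name == "-.slice":
        return "/"  # the root slice maps to the top of the tree
    parts = slice_name[: -len(".slice")].split("-")
    # Every dash-separated prefix names one parent slice on the way down.
    prefixes = ["-".join(parts[: i + 1]) + ".slice" for i in range(len(parts))]
    return "/" + "/".join(prefixes) + "/"


def unit_cgroup_path(unit_name, slice_name):
    """Return the cgroup path of a service or scope attached to a slice."""
    return slice_to_cgroup_path(slice_name) + unit_name + "/"
```

With these helpers, `unit_cgroup_path("quux.service", "foo-bar-baz.slice")` reproduces the path given in the example above.
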
176 | ||
177 | By default systemd sets up four slice units: | |
178 | ||
179 | 1. `-.slice` is the root slice, i.e. the parent of everything else. On the host | |
4e1dfa45 | 180 | system it maps directly to the top-level directory of cgroup v2. |
e30eaff3 LP |
181 | |
182 | 2. `system.slice` is where system services are by default placed, unless | |
183 | configured otherwise. | |
184 | ||
185 | 3. `user.slice` is where user sessions are placed. Each user gets a slice of | |
186 | its own below that. | |
187 | ||
188 | 4. `machines.slice` is where VMs and containers are supposed to be | |
189 | placed. `systemd-nspawn` makes use of this by default, and you're very welcome | |
190 | to place your containers and VMs there too if you hack on managers for those. | |
191 | ||
192 | Users may define any number of additional slices they like though, the four | |
193 | above are just the defaults. | |
194 | ||
195 | ## Delegation | |
196 | ||
197 | Container managers and suchlike often want to control cgroups directly using | |
198 | the raw kernel APIs. That's entirely fine and supported, as long as proper | |
4e1dfa45 CD |
199 | *delegation* is followed. Delegation is a concept we inherited from cgroup v2, |
200 | but we expose it on cgroup v1 too. Delegation means that some parts of the | |
e30eaff3 LP |
201 | cgroup tree may be managed by different managers than others. As long as it is |
202 | clear which manager manages which part of the tree each one can do within its | |
203 | sub-graph of the tree whatever it wants. | |
204 | ||
205 | Only sub-trees can be delegated (though whoever decides to request a sub-tree | |
5b24525a ZJS |
206 | can delegate sub-sub-trees further to somebody else if they like). Delegation |
207 | takes place at a specific cgroup: in systemd there's a `Delegate=` property you | |
208 | can set for a service or scope unit. If you do, it's the cut-off point for | |
209 | systemd's cgroup management: the unit itself is managed by systemd, i.e. all | |
210 | its attributes are managed exclusively by systemd, however your program may | |
211 | create/remove sub-cgroups inside it freely, and those then become exclusive | |
212 | property of your program, systemd won't touch them — all attributes of *those* | |
213 | sub-cgroups can be manipulated freely and exclusively by your program. | |
e30eaff3 LP |
214 | |
215 | By turning on the `Delegate=` property for a scope or service you get a few | |
216 | guarantees: | |
217 | ||
218 | 1. systemd won't fiddle with your sub-tree of the cgroup tree anymore. It won't | |
219 | change attributes of any cgroups below it, nor will it create or remove any | |
220 | cgroups thereunder, nor migrate processes across the boundaries of that | |
221 | sub-tree as it deems useful anymore. | |
222 | ||
223 | 2. If your service makes use of the `User=` functionality, then the sub-tree | |
224 | will be `chown()`ed to the indicated user so that it can correctly create | |
225 | cgroups below it. Note however that systemd will do that only in the unified | |
226 | hierarchy (in unified and hybrid mode) as well as on systemd's own private | |
227 | hierarchy (in legacy and hybrid mode). It won't pass ownership of the legacy | |
7833a46c | 228 | controller hierarchies. Delegation to less privileged processes is not safe |
4e1dfa45 | 229 | in cgroup v1 (as a limitation of the kernel), hence systemd won't facilitate |
e30eaff3 LP |
230 | access to it. |
231 | ||
232 | 3. Any BPF IP filter programs systemd installs will be installed with | |
233 | `BPF_F_ALLOW_MULTI` so that your program can install additional ones. | |
234 | ||
235 | In unit files the `Delegate=` property is superficially exposed as a | |
236 | boolean. However, since v236 it optionally takes a list of controller names | |
237 | instead. If so, delegation is requested for listed controllers | |
1a31d050 | 238 | specifically. Note that this only encodes a request. Depending on various |
e30eaff3 LP |
239 | parameters it might happen that your service actually will get fewer |
240 | controllers delegated (for example, because the controller is not available on | |
241 | the current kernel or was turned off) or more. If no list is specified | |
242 | (i.e. the property simply set to `yes`) then all available controllers are | |
243 | delegated. | |
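
Since the controller list is only a request, a program may want to check what it actually received. The following Python sketch assumes cgroup v2, where the kernel lists the controllers available to a cgroup in its `cgroup.controllers` file. The function names are hypothetical and the directory passed in is whatever cgroup was delegated to you.

```python
import os


def delegated_controllers(cgroup_dir):
    """Return the set of controllers listed in cgroup.controllers (cgroup v2)."""
    with open(os.path.join(cgroup_dir, "cgroup.controllers")) as f:
        return set(f.read().split())


def missing_controllers(cgroup_dir, wanted):
    """Return the requested controllers that are not available in cgroup_dir."""
    return set(wanted) - delegated_controllers(cgroup_dir)
```

A non-empty result from `missing_controllers()` means the program got fewer controllers than it asked for and should degrade gracefully or report the problem.
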
244 | ||
245 | Let's stress one thing: delegation is available on scope and service units | |
5b24525a | 246 | only. It's expressly not available on slice units. Why? Because slice units are |
7833a46c | 247 | our *inner* nodes of the cgroup trees and we freely attach services and scopes |
e5988600 | 248 | to them. If we'd allow delegation on slice units then this would mean that |
5b24525a ZJS |
249 | both systemd and your own manager would create/delete cgroups below the slice |
250 | unit and that conflicts with the single-writer rule. | |
e30eaff3 LP |
251 | |
252 | So, if you want to do your own raw cgroups kernel level access, then allocate a | |
253 | scope unit, or a service unit (or just use the service unit you already have | |
254 | for your service code), and turn on delegation for it. | |
255 | ||
200aa358 LP |
256 | The service manager sets the `user.delegate` extended attribute (readable via |
257 | `getxattr(2)` and related calls) to the character `1` on cgroup directories | |
258 | where delegation is enabled (and removes it on those cgroups where it is | |
259 | not). This may be used by service programs to determine whether a cgroup tree | |
260 | was delegated to them. Note that this is only supported on kernels 5.6 and | |
261 | newer in combination with systemd 251 and newer. | |
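
A check for this attribute might look like the sketch below. It relies on `os.getxattr()`, which Python only provides on Linux. The function name and the injectable `getxattr` parameter are choices made for this example, and treating every error as "not delegated" is an assumption that may be too coarse for some programs.

```python
import os


def is_delegated(cgroup_dir, getxattr=None):
    """Return True if cgroup_dir carries the user.delegate xattr set to "1"."""
    if getxattr is None:
        getxattr = os.getxattr  # Linux-only in the Python standard library
    try:
        return getxattr(cgroup_dir, "user.delegate") == b"1"
    except OSError:
        # Attribute missing, xattrs unsupported, or directory inaccessible.
        return False
```
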
262 | ||
e2391ce0 LP |
263 | (OK, here's one caveat: if you turn on delegation for a service, and that |
264 | service has `ExecStartPost=`, `ExecReload=`, `ExecStop=` or `ExecStopPost=` | |
265 | set, then these commands will be executed within the `.control/` sub-cgroup of | |
266 | your service's cgroup. This is necessary because by turning on delegation we | |
267 | have to assume that the cgroup delegated to your service is now an *inner* | |
268 | cgroup, which means that it may not directly contain any processes. Hence, if | |
269 | your service has any of these four settings set, you must be prepared that a | |
270 | `.control/` subcgroup might appear, managed by the service manager. This also | |
271 | means that your service code should have moved itself further down the cgroup | |
272 | tree by the time it notifies the service manager about start-up readiness, so | |
273 | that the service's main cgroup is definitely an inner node by the time the | |
a8b993dc LP |
274 | service manager might start `ExecStartPost=`. Starting with systemd 254 you may |
275 | also use `DelegateSubgroup=` to let the service manager put your initial | |
276 | service process into a subgroup right away.) | |
e2391ce0 | 277 | |
1d7150ec LP |
278 | (Also note, if you intend to use "threaded" cgroups — as added in Linux 4.14 —, |
279 | then you should do that *two* levels down from the main service cgroup you |
280 | turned delegation on for. Why that? You need one level so that systemd can | |
281 | properly create the `.control` subgroup, as described above. But that one | |
282 | cannot be threaded, since that would mean `.control` has to be threaded too — | |
283 | this is a requirement of threaded cgroups: either a cgroup and all its siblings | |
284 | are threaded or none –, but systemd expects it to be a regular cgroup. Thus you | |
285 | have to nest a second cgroup beneath it which then can be threaded.) | |
286 | ||
e30eaff3 LP |
287 | ## Three Scenarios |
288 | ||
289 | Let's say you write a container manager, and you wonder what to do regarding | |
290 | cgroups for it, as you want your manager to be able to run on systemd systems. | |
291 | ||
292 | You basically have three options: | |
293 | ||
5b24525a ZJS |
294 | 1. 😊 The *integration-is-good* option. For this, you register each container |
295 | you have either as a systemd service (i.e. let systemd invoke the executor | |
296 | binary for you) or a systemd scope (i.e. your manager executes the binary | |
297 | directly, but then tells systemd about it). In this mode the administrator |
298 | can use the usual systemd resource management and reporting commands | |
299 | individually on those containers. By turning on `Delegate=` for these scopes | |
300 | or services you make it possible to run cgroup-enabled programs in your | |
301 | containers, for example a nested systemd instance. This option has two | |
302 | sub-options: | |
e30eaff3 | 303 | |
5b24525a ZJS |
304 | a. You transiently register the service or scope by directly contacting |
305 | systemd via D-Bus. In this case systemd will just manage the unit for you | |
306 | and nothing else. | |
e30eaff3 LP |
307 | |
308 | b. Instead you register the service or scope through `systemd-machined` | |
309 | (also via D-Bus). This mini-daemon is basically just a proxy for the same | |
310 | operations as in a. The main benefit of this: this way you let the system | |
311 | know that what you are registering is a container, and this opens up | |
312 | certain additional integration points. For example, `journalctl -M` can | |
313 | then be used to directly look into any container's journal logs (should | |
314 | the container run systemd inside), or `systemctl -M` can be used to | |
315 | directly invoke systemd operations inside the containers. Moreover tools | |
316 | like "ps" can then show you to which container a process belongs (`ps -eo | |
317 | pid,comm,machine`), and even gnome-system-monitor supports it. | |
318 | ||
319 | 2. 🙁 The *i-like-islands* option. If all you care about is your own cgroup tree, | |
320 | and you want to have to do as little as possible with systemd and have no | |
321 | interest in integration with the rest of the system, then this is a valid | |
322 | option. For this all you have to do is turn on `Delegate=` for your main | |
323 | manager daemon. Then figure out the cgroup systemd placed your daemon in: | |
324 | you can now freely create sub-cgroups beneath it. Don't forget the | |
325 | *no-processes-in-inner-nodes* rule however: you have to move your main | |
326 | daemon process out of that cgroup (and into a sub-cgroup) before you can | |
327 | start further processes in any of your sub-cgroups. | |
328 | ||
329 | 3. 🙁 The *i-like-continents* option. In this option you'd leave your manager | |
330 | daemon where it is, and would not turn on delegation on its unit. However, | |
4db9e01f AM |
331 | as you start your first managed process (a container, for example) you would |
332 | register a new scope unit with systemd, and that scope unit would have | |
333 | `Delegate=` turned on, and it would contain the PID of this process; all | |
334 | your managed processes subsequently created should also be moved into this | |
335 | scope. From systemd's PoV there'd be two units: your manager service and the | |
336 | big scope that contains all your managed processes in one. | |
e30eaff3 LP |
337 | |
338 | BTW: if for whatever reason you say "I hate D-Bus, I'll never call any D-Bus | |
339 | API, kthxbye", then options #1 and #3 are not available, as they generally | |
340 | involve talking to systemd from your program code, via D-Bus. You still have | |
341 | option #2 in that case however, as you can simply set `Delegate=` in your | |
342 | service's unit file and you are done and have your own sub-tree. In fact, #2 is | |
343 | the one option that allows you to completely ignore systemd's existence: you | |
344 | can entirely generically follow the single rule that you just use the cgroup | |
345 | you are started in, and everything below it, whatever that might be. That said, | |
346 | maybe if you dislike D-Bus and systemd that much, the better approach might be | |
347 | to work on that, and widen your horizon a bit. You are welcome. | |
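
For option #2, the rule "use the cgroup you are started in, and everything below it" can be sketched in Python as follows. The sketch assumes unified mode (or the unified hierarchy in hybrid mode), where `/proc/self/cgroup` reports the cgroup v2 path on the line starting with `0::`. The function names and the sub-cgroup name are hypothetical.

```python
import os


def own_cgroup_path(proc_cgroup_text):
    """Extract the cgroup v2 path from the contents of /proc/self/cgroup.

    Returns None if no "0::" line (unified hierarchy) is present.
    """
    for line in proc_cgroup_text.splitlines():
        if line.startswith("0::"):
            return line[len("0::"):]
    return None


def move_into_subgroup(cgroup_dir, name, pid):
    """Create sub-cgroup `name` below cgroup_dir and move `pid` into it."""
    sub = os.path.join(cgroup_dir, name)
    os.makedirs(sub, exist_ok=True)
    # Writing a PID to cgroup.procs migrates that process into the cgroup.
    with open(os.path.join(sub, "cgroup.procs"), "w") as f:
        f.write(str(pid))
    return sub
```

A daemon started with `Delegate=` turned on could read `/proc/self/cgroup`, pass the result of `own_cgroup_path()` appended to `/sys/fs/cgroup` as `cgroup_dir`, and call `move_into_subgroup(cgroup_dir, "supervisor", os.getpid())` before creating further sub-cgroups for its payload. This keeps the delegated cgroup itself free of processes, as the no-processes-in-inner-nodes rule requires.
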
348 | ||
349 | ## Controller Support | |
350 | ||
351 | systemd supports a number of controllers (but not all). Specifically, supported | |
352 | are: | |
353 | ||
4e1dfa45 CD |
354 | * on cgroup v1: `cpu`, `cpuacct`, `blkio`, `memory`, `devices`, `pids` |
355 | * on cgroup v2: `cpu`, `io`, `memory`, `pids` | |
e30eaff3 | 356 | |
4e1dfa45 CD |
357 | It is our intention to natively support all cgroup v2 controllers as they are |
358 | added to the kernel. However, regarding cgroup v1: at this point we will not | |
5b24525a | 359 | add support for any other controllers anymore. This means systemd currently |
4e1dfa45 | 360 | does not and will never manage the following controllers on cgroup v1: |
e30eaff3 LP |
361 | `freezer`, `cpuset`, `net_cls`, `perf_event`, `net_prio`, `hugetlb`. Why not? |
362 | Depending on the case, either their API semantics or implementations aren't | |
4e1dfa45 | 363 | really usable, or it's very clear they have no future on cgroup v2, and we |
e30eaff3 LP |
364 | won't add new code for stuff that clearly has no future. |
365 | ||
4e1dfa45 | 366 | Effectively this means that all those mentioned cgroup v1 controllers are up |
e30eaff3 LP |
367 | for grabs: systemd won't manage them, and hence won't delegate them to your |
368 | code (however, systemd will still mount their hierarchies, simply because it | |
369 | mounts all controller hierarchies it finds available in the kernel). If you | |
370 | decide to use them, then that's fine, but systemd won't help you with it (but | |
371 | also not interfere with it). To be nice to other tenants it might be wise to | |
372 | replicate the cgroup hierarchies of the other controllers in them too however, | |
373 | but of course that's between you and those other tenants, and systemd won't | |
374 | care. Replicating the cgroup hierarchies in those unsupported controllers would | |
375 | mean replicating the full cgroup paths in them, and hence the prefixing | |
376 | `.slice` components too, otherwise the hierarchies will start being orthogonal | |
f223fd6a | 377 | after all, and that's not really desirable. One more thing: systemd will clean |
e30eaff3 LP |
378 | up after you in the hierarchies it manages: if your daemon goes down, its |
379 | cgroups will be removed too. You basically get the guarantee that you start | |
380 | with a pristine cgroup sub-tree for your service or scope whenever it is | |
381 | started. This is not the case however in the hierarchies systemd doesn't | |
382 | manage. This means that your programs should be ready to deal with left-over | |
383 | cgroups in them — from previous runs, and be extra careful with them as they | |
384 | might still carry settings that might not be valid anymore. | |
385 | ||
386 | Note a particular asymmetry here: if your systemd version doesn't support a | |
4e1dfa45 | 387 | specific controller on cgroup v1 you can still make use of it for delegation, |
e30eaff3 | 388 | by directly fiddling with its hierarchy and replicating the cgroup tree there |
4e1dfa45 | 389 | as necessary (as suggested above). However, on cgroup v2 this is different: |
e30eaff3 LP |
390 | separately mounted hierarchies are not available, and delegation always has to | |
391 | happen through systemd itself. This means: when you update your kernel and it | |
392 | adds a new, so far unseen controller, and you want to use it for delegation, | |
393 | then you also need to update systemd to a version that groks it. | |
394 | ||
395 | ## systemd as Container Payload | |
396 | ||
397 | systemd can happily run as a container payload's PID 1. Note that systemd | |
398 | unconditionally needs write access to the cgroup tree however, hence you need | |
399 | to delegate a sub-tree to it. Note that there's nothing too special you have to | |
400 | do beyond that: just invoke systemd as PID 1 inside the root of the delegated | |
401 | cgroup sub-tree, and it will figure out the rest: it will determine the cgroup | |
402 | it is running in and take possession of it. It won't interfere with any cgroup | |
403 | outside of the sub-tree it was invoked in. Use of `CLONE_NEWCGROUP` is hence | |
404 | optional (but of course wise). | |
405 | ||
406 | Note one particular asymmetry here though: systemd will try to take possession | |
407 | of the root cgroup you pass to it *in* *full*, i.e. it will not only | |
e5988600 | 408 | create/remove child cgroups below it, it will also attempt to manage the |
e30eaff3 LP |
409 | attributes of it. OTOH as mentioned above, when delegating a cgroup tree to |
410 | somebody else it only passes the rights to create/remove sub-cgroups, but will | |
411 | insist on managing the delegated cgroup tree's top-level attributes. Or in | |
412 | other words: systemd is *greedy* when accepting delegated cgroup trees and also | |
413 | *greedy* when delegating them to others: it insists on managing attributes on | |
414 | the specific cgroup in both cases. A container manager that is itself a payload | |
415 | of a host systemd which wants to run a systemd as its own container payload | |
416 | instead hence needs to insert an extra level in the hierarchy in between, so | |
417 | that the systemd on the host and the one in the container won't fight for the | |
418 | attributes. That said, you likely should do that anyway, due to the | |
419 | no-processes-in-inner-nodes rule, see below. | |
420 | ||
421 | When systemd runs as container payload it will make use of all hierarchies it | |
422 | has write access to. For legacy mode you need to make at least | |
423 | `/sys/fs/cgroup/systemd/` available, all other hierarchies are optional. For | |
424 | hybrid mode you need to add `/sys/fs/cgroup/unified/`. Finally, for fully | |
425 | unified you (of course, I guess) need to provide only `/sys/fs/cgroup/` itself. | |
426 | ||
427 | ## Some Dos | |
428 | ||
429 | 1. ⚡ If you go for implementation option 1a or 1b (as in the list above), then | |
430 | each of your containers will have its own systemd-managed unit and hence | |
431 | cgroup with possibly further sub-cgroups below. Typically the first process | |
432 | running in that unit will be some kind of executor program, which will in | |
433 | turn fork off the payload processes of the container. In this case don't | |
434 | forget that there are two levels of delegation involved: first, systemd | |
435 | delegates a group sub-tree to your executor. And then your executor should | |
436 | delegate a sub-tree further down to the container payload. Oh, and because | |
437 | of the no-processes-in-inner-nodes rule, your executor needs to migrate itself | |
438 | to a sub-cgroup of the cgroup it got delegated, too. Most likely you hence | |
439 | want a two-pronged approach: below the cgroup you got started in, you want | |
440 | one cgroup maybe called `supervisor/` where your manager runs in and then | |
441 | for each container a sibling cgroup of that maybe called `payload-xyz/`. | |
442 | ||
443 | 2. ⚡ Don't forget that the cgroups you create have to have names that are | |
444 | suitable as UNIX file names, and that they live in the same namespace as the | |
445 | various kernel attribute files. Hence, when you want to allow the user | |
446 | arbitrary naming, you might need to escape some of the names (for example, | |
447 | you really don't want to create a cgroup named `tasks`, just because the | |
448 | user created a container by that name, because `tasks` after all is a magic | |
4e1dfa45 | 449 | attribute in cgroup v1, and your `mkdir()` will hence fail with `EEXIST`). In |
e30eaff3 LP |
450 | systemd we do escaping by prefixing names that might collide with a kernel |
451 | attribute name with an underscore. You might want to do the same, but this | |
452 | is really up to you how you do it. Just do it, and be careful. | |
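
The underscore-prefix idea from item 2 can be sketched in Python like this. It is a simplified illustration and not systemd's exact algorithm: the lists of reserved names and prefixes below are hypothetical and incomplete, so extend them for the kernels and controllers you target. Names that already start with an underscore are escaped too, so that unescaping stays unambiguous.

```python
# Hypothetical, incomplete lists: extend them to match the kernels you target.
RESERVED_NAMES = {"tasks", "notify_on_release", "release_agent"}
RESERVED_PREFIXES = (
    "cgroup.", "cpu.", "cpuacct.", "blkio.", "io.",
    "memory.", "devices.", "pids.",
)


def escape_cgroup_name(name):
    """Return a directory name that cannot collide with a kernel attribute."""
    if name in ("", ".", "..") or "/" in name:
        raise ValueError("not usable as a file name: %r" % name)
    if (
        name.startswith("_")
        or name in RESERVED_NAMES
        or name.startswith(RESERVED_PREFIXES)
    ):
        return "_" + name
    return name


def unescape_cgroup_name(name):
    """Reverse escape_cgroup_name()."""
    return name[1:] if name.startswith("_") else name
```

For example, a container the user named `tasks` would get the cgroup directory `_tasks`, while an ordinary name such as `web` is used unchanged.
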
453 | ||
454 | ## Some Don'ts | |
455 | ||
456 | 1. 🚫 Never create your own cgroups below arbitrary cgroups systemd manages, i.e. | |
457 | cgroups you haven't set `Delegate=` in. Specifically: 🔥 don't create your | |
458 | own cgroups below the root cgroup 🔥. That's owned by systemd, and you will | |
459 | step on systemd's toes if you ignore that, and systemd will step on | |
460 | yours. Get your own delegated sub-tree, you may create as many cgroups there | |
461 | as you like. Seriously, if you create cgroups directly in the cgroup root, | |
462 | then all you do is ask for trouble. | |
463 | ||
464 | 2. 🚫 Don't attempt to set `Delegate=` in slice units, and in particular not in | |
465 | `-.slice`. It's not supported, and will generate an error. | |
466 | ||
467 | 3. 🚫 Never *write* to any of the attributes of a cgroup systemd created for | |
468 | you. It's systemd's private property. You are welcome to manipulate the | |
469 | attributes of cgroups you created in your own delegated sub-tree, but the | |
470 | cgroup tree of systemd itself is off limits for you. It's fine to *read* | |
471 | from any attribute you like however. That's totally OK and welcome. | |
472 | ||
d11623e9 LP |
473 | 4. 🚫 If you don't use `CLONE_NEWCGROUP` when delegating a sub-tree to a |
474 | container payload running systemd, don't get the idea that you can bind | |
475 | mount only a sub-tree of the host's cgroup tree into the container. Part of | |
476 | the cgroup API is that `/proc/$PID/cgroup` reports the cgroup path of every | |
e30eaff3 LP |
477 | process, and hence any path below `/sys/fs/cgroup/` needs to match what |
478 | `/proc/$PID/cgroup` of the payload processes reports. What you can do safely | |
d11623e9 LP |
479 | however, is mount the upper parts of the cgroup tree read-only (or even |
480 | replace the middle bits with an intermediary `tmpfs` — but be careful not to | |
481 | break the `statfs()` detection logic discussed above), as long as the path | |
482 | to the delegated sub-tree remains accessible as-is. | |
e30eaff3 | 483 | |
3ee9b2f6 LP |
484 | 5. ⚡ Currently, the algorithm for mapping between slice/scope/service unit |
485 | naming and their cgroup paths is not considered public API of systemd, and | |
486 | may change in future versions. This means: it's best to avoid implementing a | |
487 | local logic of translating cgroup paths to slice/scope/service names in your | |
488 | program, or vice versa — it's likely going to break sooner or later. Use the | |
489 | appropriate D-Bus API calls for that instead, so that systemd translates | |
490 | this for you. (Specifically: each Unit object has a `ControlGroup` property | |
491 | to get the cgroup for a unit. The method `GetUnitByControlGroup()` may be | |
492 | used to get the unit for a cgroup.) | |
493 | ||
4e1dfa45 | 494 | 6. ⚡ Think twice before delegating cgroup v1 controllers to less privileged |
e30eaff3 | 495 | containers. It's not safe, you basically allow your containers to freeze the |
4e1dfa45 | 496 | system with that and worse. Delegation is a strong point of cgroup v2 though, |
e30eaff3 LP |
497 | and there it's safe to treat delegation boundaries as privilege boundaries. |
498 | ||
499 | And that's it for now. If you have further questions, refer to the systemd | |
500 | mailing list. | |
501 | ||
502 | — Berlin, 2018-04-20 |