---
title: Control Group APIs and Delegation
category: Interfaces
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---

# Control Group APIs and Delegation

*Intended audience: hackers working on userspace subsystems that require direct
cgroup access, such as container managers and similar.*

So you are wondering about resource management with systemd, you know Linux
control groups (cgroups) a bit and are trying to integrate your software with
what systemd has to offer there. Here's a bit of documentation about the
concepts and interfaces involved with this.

What's described here has been part of systemd and documented since v205
times. However, it has been updated and improved substantially, even
though the concepts stayed mostly the same. This is an attempt to provide more
comprehensive, up-to-date information about all this, particularly in light of
the poor implementations of the systemd-facing components of current
container managers.

Before you read on, please make sure you read the low-level kernel
documentation about the
[unified cgroup hierarchy](https://docs.kernel.org/admin-guide/cgroup-v2.html).
This document then adds in the higher-level view from systemd.

This document augments the existing documentation we already have:

* [The New Control Group Interfaces](/CONTROL_GROUP_INTERFACE)
* [Writing VM and Container Managers](/WRITING_VM_AND_CONTAINER_MANAGERS)

These wiki documents are not as up to date as they should be, currently, but
the basic concepts still fully apply. You should read them too, if you do something
with cgroups and systemd, in particular as they shine more light on the various
D-Bus APIs provided. (That said, sooner or later we should probably fold that
wiki documentation into this very document, too.)

## Two Key Design Rules

Much of the philosophy behind these concepts is based on a couple of basic
design ideas of cgroup v2 (which we however try to adapt as far as we can to
cgroup v1 too). Specifically two cgroup v2 rules are the most relevant:

1. The **no-processes-in-inner-nodes** rule: this means that it's not permitted
to have processes directly attached to a cgroup that also has child cgroups and
vice versa. A cgroup is either an inner node or a leaf node of the tree, and if
it's an inner node it may not contain processes directly, and if it's a leaf
node then it may not have child cgroups. (Note that there are some minor
exceptions to this rule, though. E.g. the root cgroup is special and allows
both processes and children — which is used in particular to maintain kernel
threads.)

2. The **single-writer** rule: this means that each cgroup only has a single
writer, i.e. a single process managing it. It's OK if different cgroups have
different processes managing them. However, only a single process should own a
specific cgroup, and when it does that ownership is exclusive, and nothing else
should manipulate it at the same time. This rule ensures that various pieces of
software don't step on each other's toes constantly.

These two rules have various effects. For example, one corollary of this is: if
your container manager creates and manages cgroups in the system's root cgroup
you violate rule #2, as the root cgroup is managed by systemd and hence off
limits to everybody else.

Note that rule #1 is generally enforced by the kernel if cgroup v2 is used: as
soon as you add a process to a cgroup it is ensured the rule is not
violated. On cgroup v1 this rule didn't exist, and hence isn't enforced, even
though it's a good thing to follow it there too. Rule #2 is enforced on neither
cgroup v1 nor cgroup v2 (this is UNIX after all, in the general case
root can do anything, modulo SELinux and friends), but if you ignore it you'll
be in constant pain as various pieces of software will fight over cgroup
ownership.

Note that cgroup v1 is currently the most deployed implementation, even though
it's semantically broken in many ways, and in many cases doesn't actually do
what people think it does. cgroup v2 is where things are going, and most new
kernel features in this area are only added to cgroup v2, and not cgroup v1
anymore. For example, cgroup v2 provides proper cgroup-empty notifications, has
support for all kinds of per-cgroup BPF magic, supports secure delegation of
cgroup trees to less privileged processes and so on, which all are not
available on cgroup v1.

## Three Different Tree Setups 🌳

systemd supports three different modes in which cgroups are set up. Specifically:

1. **Unified** — this is the simplest mode, and exposes a pure cgroup v2
logic. In this mode `/sys/fs/cgroup` is the only mounted cgroup API file system
and all available controllers are exclusively exposed through it.

2. **Legacy** — this is the traditional cgroup v1 mode. In this mode the
various controllers each get their own cgroup file system mounted to
`/sys/fs/cgroup/<controller>/`. On top of that systemd manages its own cgroup
hierarchy for management purposes at `/sys/fs/cgroup/systemd/`.

3. **Hybrid** — this is a hybrid between the unified and legacy mode. It's set
up mostly like legacy, except that there's also an additional hierarchy
`/sys/fs/cgroup/unified/` that contains the cgroup v2 hierarchy. (Note that in
this mode the unified hierarchy won't have controllers attached, the
controllers are all mounted as separate hierarchies as in legacy mode,
i.e. `/sys/fs/cgroup/unified/` is purely and exclusively about core cgroup v2
functionality and not about resource management.) In this mode compatibility
with cgroup v1 is retained while some cgroup v2 features are available
too. This mode is a stopgap. Don't bother with this too much unless you have
too much free time.

To say this clearly, legacy and hybrid modes have no future. If you develop
software today and don't focus on the unified mode, then you are writing
software for yesterday, not tomorrow. They are primarily supported for
compatibility reasons and will not receive new features. Sorry.

Superficially, in legacy and hybrid modes it might appear that the parallel
cgroup hierarchies for each controller are orthogonal to each other. In
systemd they are not: the hierarchies of all controllers are always kept in
sync (at least mostly: sub-trees might be suppressed in certain hierarchies if
no controller usage is required for them). The fact that systemd keeps these
hierarchies in sync means that the legacy and hybrid hierarchies are
conceptually very close to the unified hierarchy. In particular this allows us
to talk of one specific cgroup and actually mean the same cgroup in all
available controller hierarchies. E.g. if we talk about the cgroup `/foo/bar/`
then we actually mean `/sys/fs/cgroup/cpu/foo/bar/` as well as
`/sys/fs/cgroup/memory/foo/bar/`, `/sys/fs/cgroup/pids/foo/bar/`, and so on.
Note that in cgroup v2 the controller hierarchies aren't orthogonal, hence
thinking about them as orthogonal won't help you in the long run anyway.

If you wonder how to detect which of these three modes is currently used, use
`statfs()` on `/sys/fs/cgroup/`. If it reports `CGROUP2_SUPER_MAGIC` in its
`.f_type` field, then you are in unified mode. If it reports `TMPFS_MAGIC` then
you are either in legacy or hybrid mode. To distinguish these two cases, run
`statfs()` again on `/sys/fs/cgroup/unified/`. If that succeeds and reports
`CGROUP2_SUPER_MAGIC` you are in hybrid mode, otherwise not.
From a shell, you can check the `Type` in `stat -f /sys/fs/cgroup` and
`stat -f /sys/fs/cgroup/unified`.
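
A minimal C sketch of that detection logic might look like this (relying on
`CGROUP2_SUPER_MAGIC` and `TMPFS_MAGIC` as defined in `linux/magic.h`):

```c
/* Sketch: detect which of the three cgroup setups is in use, via statfs(),
 * as described above. */
#include <linux/magic.h>   /* CGROUP2_SUPER_MAGIC, TMPFS_MAGIC */
#include <stdio.h>
#include <sys/vfs.h>       /* statfs() */

int main(void) {
        struct statfs fs;

        if (statfs("/sys/fs/cgroup/", &fs) < 0) {
                perror("statfs /sys/fs/cgroup/");
                return 1;
        }

        if (fs.f_type == CGROUP2_SUPER_MAGIC)
                puts("unified");
        else if (fs.f_type == TMPFS_MAGIC) {
                /* Legacy or hybrid: look for a cgroup v2 hierarchy at
                 * /sys/fs/cgroup/unified/. */
                if (statfs("/sys/fs/cgroup/unified/", &fs) >= 0 &&
                    fs.f_type == CGROUP2_SUPER_MAGIC)
                        puts("hybrid");
                else
                        puts("legacy");
        } else
                puts("unexpected file system type");

        return 0;
}
```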

## systemd's Unit Types

The low-level kernel cgroups feature is exposed in systemd in three different
"unit" types. Specifically:

1. 💼 The `.service` unit type. This unit type is for units encapsulating
processes systemd itself starts. Units of these types have cgroups that are
the leaves of the cgroup tree the systemd instance manages (though possibly
they might contain a sub-tree of their own managed by something else, made
possible by the concept of delegation, see below). Service units are usually
instantiated based on a unit file on disk that describes the command line to
invoke and other properties of the service. However, service units may also
be declared and started programmatically at runtime through a D-Bus API
(these are called *transient* services).

2. 👓 The `.scope` unit type. This is very similar to `.service`. The main
difference: the processes the units of this type encapsulate are forked off
by some unrelated manager process, and that manager asked systemd to expose
them as a unit. Unlike services, scopes can only be declared and started
programmatically, i.e. are always transient. That's because they encapsulate
processes forked off by something else, i.e. existing runtime objects, and
hence cannot really be defined fully in 'offline' concepts such as unit
files.

3. 🔪 The `.slice` unit type. Units of this type do not directly contain any
processes. Units of this type are the inner nodes of the cgroup tree
the systemd instance manages. Much like services, slices can be defined
either on disk with unit files or programmatically as transient units.

Slices expose the trunk and branches of a tree, and scopes and services are
attached to those branches as leaves. The idea is that scopes and services can
be moved around though, i.e. assigned to a different slice if needed.

The naming of slice units directly maps to the cgroup tree path. This is not
the case for service and scope units, however: their position in the tree is
determined by the slice they are attached to. A slice named `foo-bar-baz.slice`
maps to a cgroup `/foo.slice/foo-bar.slice/foo-bar-baz.slice/`. A service
`quux.service` which is attached to the slice `foo-bar-baz.slice` maps to the
cgroup `/foo.slice/foo-bar.slice/foo-bar-baz.slice/quux.service/`.

By default systemd sets up four slice units:

1. `-.slice` is the root slice, i.e. the parent of everything else. On the host
system it maps directly to the top-level directory of cgroup v2.

2. `system.slice` is where system services are by default placed, unless
configured otherwise.

3. `user.slice` is where user sessions are placed. Each user gets a slice of
its own below that.

4. `machines.slice` is where VMs and containers are supposed to be
placed. `systemd-nspawn` makes use of this by default, and you're very welcome
to place your containers and VMs there too if you hack on managers for those.

Users may define any number of additional slices they like though, the four
above are just the defaults.

## Delegation

Container managers and suchlike often want to control cgroups directly using
the raw kernel APIs. That's entirely fine and supported, as long as proper
*delegation* is followed. Delegation is a concept we inherited from cgroup v2,
but we expose it on cgroup v1 too. Delegation means that some parts of the
cgroup tree may be managed by different managers than others. As long as it is
clear which manager manages which part of the tree each one can do within its
sub-graph of the tree whatever it wants.

Only sub-trees can be delegated (though whoever decides to request a sub-tree
can delegate sub-sub-trees further to somebody else if they like). Delegation
takes place at a specific cgroup: in systemd there's a `Delegate=` property you
can set for a service or scope unit. If you do, it's the cut-off point for
systemd's cgroup management: the unit itself is managed by systemd, i.e. all
its attributes are managed exclusively by systemd, however your program may
create/remove sub-cgroups inside it freely, and those then become the exclusive
property of your program, systemd won't touch them — all attributes of *those*
sub-cgroups can be manipulated freely and exclusively by your program.

By turning on the `Delegate=` property for a scope or service you get a few
guarantees:

1. systemd won't fiddle with your sub-tree of the cgroup tree anymore. It won't
change attributes of any cgroups below it, nor will it create or remove any
cgroups thereunder, nor migrate processes across the boundaries of that
sub-tree as it deems useful anymore.

2. If your service makes use of the `User=` functionality, then the sub-tree
will be `chown()`ed to the indicated user so that it can correctly create
cgroups below it. Note however that systemd will do that only in the unified
hierarchy (in unified and hybrid mode) as well as on systemd's own private
hierarchy (in legacy and hybrid mode). It won't pass ownership of the legacy
controller hierarchies. Delegation to less privileged processes is not safe
in cgroup v1 (as a limitation of the kernel), hence systemd won't facilitate
access to it.

3. Any BPF IP filter programs systemd installs will be installed with
`BPF_F_ALLOW_MULTI` so that your program can install additional ones.

In unit files the `Delegate=` property is superficially exposed as a
boolean. However, since v236 it optionally takes a list of controller names
instead. If so, delegation is requested for the listed controllers
specifically. Note that this only encodes a request. Depending on various
parameters it might happen that your service actually will get fewer
controllers delegated (for example, because the controller is not available on
the current kernel or was turned off) or more. If no list is specified
(i.e. the property is simply set to `yes`) then all available controllers are
delegated.
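
For example, a hypothetical unit requesting delegation could look like this
(the unit name and binary are made up for illustration):

```ini
# my-container-manager.service (hypothetical example)
[Service]
ExecStart=/usr/bin/my-container-manager
# Request delegation of all available controllers:
Delegate=yes
# Alternatively, request a specific set of controllers only:
# Delegate=pids memory cpu
```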

Let's stress one thing: delegation is available on scope and service units
only. It's expressly not available on slice units. Why? Because slice units are
our *inner* nodes of the cgroup trees and we freely attach services and scopes
to them. If we allowed delegation on slice units then this would mean that
both systemd and your own manager would create/delete cgroups below the slice
unit and that conflicts with the single-writer rule.

So, if you want to do your own raw cgroups kernel level access, then allocate a
scope unit, or a service unit (or just use the service unit you already have
for your service code), and turn on delegation for it.

The service manager sets the `user.delegate` extended attribute (readable via
`getxattr(2)` and related calls) to the character `1` on cgroup directories
where delegation is enabled (and removes it on those cgroups where it is
not). This may be used by service programs to determine whether a cgroup tree
was delegated to them. Note that this is only supported on kernels 5.6 and
newer in combination with systemd 251 and newer.
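
A small C sketch of such a check (the cgroup path here is purely illustrative;
a real service would determine its own cgroup, e.g. from `/proc/self/cgroup`):

```c
/* Sketch: check whether delegation is enabled for a cgroup by reading the
 * user.delegate extended attribute described above (systemd >= 251,
 * kernel >= 5.6). */
#include <stdio.h>
#include <sys/types.h>
#include <sys/xattr.h>

int main(void) {
        char value[2] = {0};
        ssize_t n;

        /* Illustrative path; substitute the cgroup your service runs in. */
        n = getxattr("/sys/fs/cgroup/system.slice/example.service",
                     "user.delegate", value, sizeof(value));

        if (n == 1 && value[0] == '1')
                puts("delegation enabled");
        else
                puts("delegation not enabled (or attribute unsupported)");

        return 0;
}
```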

(OK, here's one caveat: if you turn on delegation for a service, and that
service has `ExecStartPost=`, `ExecReload=`, `ExecStop=` or `ExecStopPost=`
set, then these commands will be executed within the `.control/` sub-cgroup of
your service's cgroup. This is necessary because by turning on delegation we
have to assume that the cgroup delegated to your service is now an *inner*
cgroup, which means that it may not directly contain any processes. Hence, if
your service has any of these four settings set, you must be prepared for the
possibility that a `.control/` sub-cgroup might appear, managed by the service
manager. This also means that your service code should have moved itself
further down the cgroup tree by the time it notifies the service manager about
start-up readiness, so that the service's main cgroup is definitely an inner
node by the time the service manager might start `ExecStartPost=`. Starting
with systemd 254 you may also use `DelegateSubgroup=` to let the service
manager put your initial service process into a subgroup right away.)

(Also note, if you intend to use "threaded" cgroups — as added in Linux 4.14 —,
then you should do that *two* levels down from the main service cgroup you
turned delegation on for. Why that? You need one level so that systemd can
properly create the `.control` subgroup, as described above. But that one
cannot be threaded, since that would mean `.control` has to be threaded too —
this is a requirement of threaded cgroups: either a cgroup and all its siblings
are threaded or none –, but systemd expects it to be a regular cgroup. Thus you
have to nest a second cgroup beneath it which then can be threaded.)
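
To illustrate that nesting, here's a rough sketch; the base path and the
sub-cgroup names are made up, and in practice you'd derive the base path from
`/proc/self/cgroup` rather than hard-coding it:

```c
/* Sketch: create a regular sub-cgroup one level below the delegated cgroup,
 * and a second level below that which is then switched to the "threaded"
 * type by writing to its cgroup.type file (cgroup v2, Linux >= 4.14). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_string(const char *path, const char *s) {
        int fd = open(path, O_WRONLY | O_CLOEXEC);
        if (fd < 0)
                return -1;
        ssize_t n = write(fd, s, strlen(s));
        close(fd);
        return n < 0 ? -1 : 0;
}

int main(void) {
        /* Illustrative path of the delegated cgroup. */
        const char *base = "/sys/fs/cgroup/system.slice/example.service";
        char path[4096];

        /* First level: stays a regular cgroup (a sibling of a potential .control/). */
        snprintf(path, sizeof(path), "%s/payload", base);
        if (mkdir(path, 0755) < 0)
                perror("mkdir payload");

        /* Second level: this one may become threaded. */
        snprintf(path, sizeof(path), "%s/payload/threads", base);
        if (mkdir(path, 0755) < 0)
                perror("mkdir threads");

        snprintf(path, sizeof(path), "%s/payload/threads/cgroup.type", base);
        if (write_string(path, "threaded") < 0)
                perror("write cgroup.type");

        return 0;
}
```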

## Three Scenarios

Let's say you write a container manager, and you wonder what to do regarding
cgroups for it, as you want your manager to be able to run on systemd systems.

You basically have three options:

1. 😊 The *integration-is-good* option. For this, you register each container
you have either as a systemd service (i.e. let systemd invoke the executor
binary for you) or a systemd scope (i.e. your manager executes the binary
directly, but then tells systemd about it). In this mode the administrator
can use the usual systemd resource management and reporting commands
individually on those containers. By turning on `Delegate=` for these scopes
or services you make it possible to run cgroup-enabled programs in your
containers, for example a nested systemd instance. This option has two
sub-options:

   a. You transiently register the service or scope by directly contacting
   systemd via D-Bus (a rough sketch of this follows after this list). In this
   case systemd will just manage the unit for you and nothing else.

   b. Instead you register the service or scope through `systemd-machined`
   (also via D-Bus). This mini-daemon is basically just a proxy for the same
   operations as in a. The main benefit of this: this way you let the system
   know that what you are registering is a container, and this opens up
   certain additional integration points. For example, `journalctl -M` can
   then be used to directly look into any container's journal logs (should
   the container run systemd inside), or `systemctl -M` can be used to
   directly invoke systemd operations inside the containers. Moreover tools
   like "ps" can then show you to which container a process belongs (`ps -eo
   pid,comm,machine`), and even gnome-system-monitor supports it.

2. 🙁 The *i-like-islands* option. If all you care about is your own cgroup
tree, you want to have as little to do with systemd as possible, and you have
no interest in integration with the rest of the system, then this is a valid
option. For this all you have to do is turn on `Delegate=` for your main
manager daemon. Then figure out the cgroup systemd placed your daemon in:
you can now freely create sub-cgroups beneath it. Don't forget the
*no-processes-in-inner-nodes* rule however: you have to move your main
daemon process out of that cgroup (and into a sub-cgroup) before you can
start further processes in any of your sub-cgroups.

3. 🙁 The *i-like-continents* option. In this option you'd leave your manager
daemon where it is, and would not turn on delegation on its unit. However,
as you start your first managed process (a container, for example) you would
register a new scope unit with systemd, and that scope unit would have
`Delegate=` turned on, and it would contain the PID of this process; all
your managed processes subsequently created should also be moved into this
scope. From systemd's PoV there'd be two units: your manager service and the
big scope that contains all your managed processes in one.

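To make sub-option 1a a bit more concrete, here's a rough sd-bus sketch of
registering an already-forked process as a transient scope with delegation
enabled. The unit name and slice are made up for illustration, error handling
is abridged, and a real manager would pick its own set of properties; link
against libsystemd (e.g. `pkg-config --cflags --libs libsystemd`):

```c
/* Sketch: register an existing process as a transient scope unit with
 * Delegate=yes, via systemd's StartTransientUnit() D-Bus call. */
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <systemd/sd-bus.h>
#include <unistd.h>

static int register_scope(pid_t pid) {
        sd_bus *bus = NULL;
        sd_bus_message *m = NULL, *reply = NULL;
        sd_bus_error error = SD_BUS_ERROR_NULL;
        int r;

        r = sd_bus_default_system(&bus);
        if (r < 0)
                goto finish;

        r = sd_bus_message_new_method_call(
                        bus, &m,
                        "org.freedesktop.systemd1",
                        "/org/freedesktop/systemd1",
                        "org.freedesktop.systemd1.Manager",
                        "StartTransientUnit");
        if (r < 0)
                goto finish;

        /* Unit name (made up for this example) and job mode. */
        r = sd_bus_message_append(m, "ss", "payload-xyz.scope", "fail");
        if (r < 0)
                goto finish;

        /* Properties of the new scope unit. */
        r = sd_bus_message_open_container(m, 'a', "(sv)");
        if (r >= 0)
                r = sd_bus_message_append(m, "(sv)", "PIDs", "au", 1, (uint32_t) pid);
        if (r >= 0)
                r = sd_bus_message_append(m, "(sv)", "Delegate", "b", 1);
        if (r >= 0)
                r = sd_bus_message_append(m, "(sv)", "Slice", "s", "machine.slice");
        if (r >= 0)
                r = sd_bus_message_close_container(m);
        if (r < 0)
                goto finish;

        /* No auxiliary units. */
        r = sd_bus_message_append(m, "a(sa(sv))", 0);
        if (r < 0)
                goto finish;

        r = sd_bus_call(bus, m, 0, &error, &reply);
        if (r < 0)
                fprintf(stderr, "StartTransientUnit: %s\n",
                        error.message ? error.message : "unknown error");

finish:
        sd_bus_error_free(&error);
        sd_bus_message_unref(reply);
        sd_bus_message_unref(m);
        sd_bus_unref(bus);
        return r;
}

int main(void) {
        /* For demonstration, put the calling process itself into a new scope. */
        return register_scope(getpid()) < 0 ? 1 : 0;
}
```
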
BTW: if for whatever reason you say "I hate D-Bus, I'll never call any D-Bus
API, kthxbye", then options #1 and #3 are not available, as they generally
involve talking to systemd from your program code, via D-Bus. You still have
option #2 in that case however, as you can simply set `Delegate=` in your
service's unit file and you are done and have your own sub-tree. In fact, #2 is
the one option that allows you to completely ignore systemd's existence: you
can entirely generically follow the single rule that you just use the cgroup
you are started in, and everything below it, whatever that might be. That said,
maybe if you dislike D-Bus and systemd that much, the better approach might be
to work on that, and widen your horizon a bit. You are welcome.

## Controller Support

systemd supports a number of controllers (but not all). Specifically, supported
are:

* on cgroup v1: `cpu`, `cpuacct`, `blkio`, `memory`, `devices`, `pids`
* on cgroup v2: `cpu`, `io`, `memory`, `pids`

It is our intention to natively support all cgroup v2 controllers as they are
added to the kernel. However, regarding cgroup v1: at this point we will not
add support for any other controllers anymore. This means systemd currently
does not and will never manage the following controllers on cgroup v1:
`freezer`, `cpuset`, `net_cls`, `perf_event`, `net_prio`, `hugetlb`. Why not?
Depending on the case, either their API semantics or implementations aren't
really usable, or it's very clear they have no future on cgroup v2, and we
won't add new code for stuff that clearly has no future.

Effectively this means that all those mentioned cgroup v1 controllers are up
for grabs: systemd won't manage them, and hence won't delegate them to your
code (however, systemd will still mount their hierarchies, simply because it
mounts all controller hierarchies it finds available in the kernel). If you
decide to use them, then that's fine, but systemd won't help you with it (but
also won't interfere with it). To be nice to other tenants it might be wise to
replicate the cgroup hierarchies of the other controllers in them too however,
but of course that's between you and those other tenants, and systemd won't
care. Replicating the cgroup hierarchies in those unsupported controllers would
mean replicating the full cgroup paths in them, and hence the prefixing
`.slice` components too, otherwise the hierarchies will start being orthogonal
after all, and that's not really desirable. One more thing: systemd will clean
up after you in the hierarchies it manages: if your daemon goes down, its
cgroups will be removed too. You basically get the guarantee that you start
with a pristine cgroup sub-tree for your service or scope whenever it is
started. This is not the case however in the hierarchies systemd doesn't
manage. This means that your programs should be ready to deal with left-over
cgroups in them from previous runs, and be extra careful with them as they
might still carry settings that might not be valid anymore.

Note a particular asymmetry here: if your systemd version doesn't support a
specific controller on cgroup v1 you can still make use of it for delegation,
by directly fiddling with its hierarchy and replicating the cgroup tree there
as necessary (as suggested above). However, on cgroup v2 this is different:
separately mounted hierarchies are not available, and delegation always has to
happen through systemd itself. This means: when you update your kernel and it
adds a new, so far unseen controller, and you want to use it for delegation,
then you also need to update systemd to a version that groks it.

## systemd as Container Payload

systemd can happily run as a container payload's PID 1. Note that systemd
unconditionally needs write access to the cgroup tree however, hence you need
to delegate a sub-tree to it. Note that there's nothing too special you have to
do beyond that: just invoke systemd as PID 1 inside the root of the delegated
cgroup sub-tree, and it will figure out the rest: it will determine the cgroup
it is running in and take possession of it. It won't interfere with any cgroup
outside of the sub-tree it was invoked in. Use of `CLONE_NEWCGROUP` is hence
optional (but of course wise).

Note one particular asymmetry here though: systemd will try to take possession
of the root cgroup you pass to it *in* *full*, i.e. it will not only
create/remove child cgroups below it, it will also attempt to manage the
attributes of it. OTOH as mentioned above, when delegating a cgroup tree to
somebody else it only passes the rights to create/remove sub-cgroups, but will
insist on managing the delegated cgroup tree's top-level attributes. Or in
other words: systemd is *greedy* when accepting delegated cgroup trees and also
*greedy* when delegating them to others: it insists on managing attributes on
the specific cgroup in both cases. A container manager that is itself a payload
of a host systemd and wants to run a systemd instance as its own container
payload hence needs to insert an extra level in the hierarchy in between, so
that the systemd on the host and the one in the container won't fight for the
attributes. That said, you likely should do that anyway, due to the
no-processes-in-inner-nodes rule, see below.

When systemd runs as container payload it will make use of all hierarchies it
has write access to. For legacy mode you need to make at least
`/sys/fs/cgroup/systemd/` available, all other hierarchies are optional. For
hybrid mode you need to add `/sys/fs/cgroup/unified/`. Finally, for fully
unified mode you (of course, I guess) need to provide only `/sys/fs/cgroup/`
itself.

## Some Dos

1. ⚡ If you go for implementation option 1a or 1b (as in the list above), then
each of your containers will have its own systemd-managed unit and hence
cgroup with possibly further sub-cgroups below. Typically the first process
running in that unit will be some kind of executor program, which will in
turn fork off the payload processes of the container. In this case don't
forget that there are two levels of delegation involved: first, systemd
delegates a cgroup sub-tree to your executor. And then your executor should
delegate a sub-tree further down to the container payload. Oh, and because
of the no-processes-in-inner-nodes rule, your executor needs to migrate itself
to a sub-cgroup of the cgroup it got delegated, too. Most likely you hence
want a two-pronged approach: below the cgroup you got started in, you want
one cgroup maybe called `supervisor/` where your manager runs in and then
for each container a sibling cgroup of that maybe called `payload-xyz/`
(see the sketch after this list for the first step of that).

2. ⚡ Don't forget that the cgroups you create have to have names that are
suitable as UNIX file names, and that they live in the same namespace as the
various kernel attribute files. Hence, when you want to allow the user
arbitrary naming, you might need to escape some of the names (for example,
you really don't want to create a cgroup named `tasks`, just because the
user created a container by that name, because `tasks` after all is a magic
attribute in cgroup v1, and your `mkdir()` will hence fail with `EEXIST`). In
systemd we do escaping by prefixing names that might collide with a kernel
attribute name with an underscore. You might want to do the same, but this
is really up to you how you do it. Just do it, and be careful.
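
To make the first point above more concrete, here's a minimal sketch (cgroup v2
unified mode assumed, error handling abridged) of how an executor might move
itself into a `supervisor/` sub-cgroup of the cgroup that was delegated to it,
before creating payload cgroups as siblings of it:

```c
/* Sketch: find the cgroup we were started in, create a "supervisor" sub-cgroup
 * and move ourselves into it, so that the delegated cgroup can serve as an
 * inner node for per-payload sibling cgroups. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
        char line[4096], path[4096];
        FILE *f;

        /* On the unified hierarchy /proc/self/cgroup has a single "0::/..." line. */
        f = fopen("/proc/self/cgroup", "re");
        if (!f || !fgets(line, sizeof(line), f)) {
                perror("reading /proc/self/cgroup");
                return 1;
        }
        fclose(f);

        char *p = strchr(line, ':');
        p = p ? strchr(p + 1, ':') : NULL;
        if (!p) {
                fprintf(stderr, "unexpected /proc/self/cgroup format\n");
                return 1;
        }
        p++;                        /* now points at "/system.slice/..." */
        p[strcspn(p, "\n")] = 0;

        /* Create the supervisor sub-cgroup below the delegated cgroup. */
        snprintf(path, sizeof(path), "/sys/fs/cgroup%s/supervisor", p);
        if (mkdir(path, 0755) < 0 && errno != EEXIST) {
                perror("mkdir supervisor");
                return 1;
        }

        /* Move ourselves into it by writing our PID to its cgroup.procs. */
        snprintf(path, sizeof(path), "/sys/fs/cgroup%s/supervisor/cgroup.procs", p);
        f = fopen(path, "we");
        if (!f || fprintf(f, "%d\n", (int) getpid()) < 0) {
                perror("writing cgroup.procs");
                return 1;
        }
        fclose(f);

        /* Payload cgroups (e.g. "payload-xyz/") can now be created as siblings. */
        return 0;
}
```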

## Some Don'ts

1. 🚫 Never create your own cgroups below arbitrary cgroups systemd manages, i.e.
cgroups you haven't set `Delegate=` in. Specifically: 🔥 don't create your
own cgroups below the root cgroup 🔥. That's owned by systemd, and you will
step on systemd's toes if you ignore that, and systemd will step on
yours. Get your own delegated sub-tree, you may create as many cgroups there
as you like. Seriously, if you create cgroups directly in the cgroup root,
then all you do is ask for trouble.

2. 🚫 Don't attempt to set `Delegate=` in slice units, and in particular not in
`-.slice`. It's not supported, and will generate an error.

3. 🚫 Never *write* to any of the attributes of a cgroup systemd created for
you. It's systemd's private property. You are welcome to manipulate the
attributes of cgroups you created in your own delegated sub-tree, but the
cgroup tree of systemd itself is off limits for you. It's fine to *read*
from any attribute you like however. That's totally OK and welcome.

4. 🚫 When not using `CLONE_NEWCGROUP` while delegating a sub-tree to a
container payload running systemd, then don't get the idea that you can bind
mount only a sub-tree of the host's cgroup tree into the container. Part of
the cgroup API is that `/proc/$PID/cgroup` reports the cgroup path of every
process, and hence any path below `/sys/fs/cgroup/` needs to match what
`/proc/$PID/cgroup` of the payload processes reports. What you can do safely
however, is mount the upper parts of the cgroup tree read-only (or even
replace the middle bits with an intermediary `tmpfs` — but be careful not to
break the `statfs()` detection logic discussed above), as long as the path
to the delegated sub-tree remains accessible as-is.

5. ⚡ Currently, the algorithm for mapping between slice/scope/service unit
naming and their cgroup paths is not considered public API of systemd, and
may change in future versions. This means: it's best to avoid implementing a
local logic of translating cgroup paths to slice/scope/service names in your
program, or vice versa — it's likely going to break sooner or later. Use the
appropriate D-Bus API calls for that instead, so that systemd translates
this for you. (Specifically: each Unit object has a `ControlGroup` property
to get the cgroup for a unit. The method `GetUnitByControlGroup()` may be
used to get the unit for a cgroup. A sketch of this follows after this list.)

6. ⚡ Think twice before delegating cgroup v1 controllers to less privileged
containers. It's not safe, you basically allow your containers to freeze the
system with that and worse. Delegation is a strong point of cgroup v2 though,
and there it's safe to treat delegation boundaries as privilege boundaries.
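
As an illustration of don't #5, here's a rough sd-bus sketch that asks systemd
for the translation instead of computing it locally, using `GetUnitByPID()` and
the `ControlGroup` property mentioned above (the `Service` interface is assumed
here; scope and slice units expose the same property on their own interfaces):

```c
/* Sketch: map the calling process to its systemd unit and read the unit's
 * ControlGroup property, instead of translating names to cgroup paths locally.
 * Compile with: cc example.c $(pkg-config --cflags --libs libsystemd) */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <systemd/sd-bus.h>
#include <unistd.h>

int main(void) {
        sd_bus *bus = NULL;
        sd_bus_error error = SD_BUS_ERROR_NULL;
        sd_bus_message *reply = NULL;
        const char *unit_path;
        char *cgroup = NULL;
        int r;

        r = sd_bus_default_system(&bus);
        if (r < 0)
                return 1;

        /* Ask the manager which unit object manages our PID. */
        r = sd_bus_call_method(bus,
                               "org.freedesktop.systemd1",
                               "/org/freedesktop/systemd1",
                               "org.freedesktop.systemd1.Manager",
                               "GetUnitByPID",
                               &error, &reply,
                               "u", (uint32_t) getpid());
        if (r < 0) {
                fprintf(stderr, "GetUnitByPID: %s\n", error.message);
                goto finish;
        }

        r = sd_bus_message_read(reply, "o", &unit_path);
        if (r < 0)
                goto finish;

        /* Read the unit's ControlGroup property (assuming a service unit here). */
        r = sd_bus_get_property_string(bus,
                                       "org.freedesktop.systemd1",
                                       unit_path,
                                       "org.freedesktop.systemd1.Service",
                                       "ControlGroup",
                                       &error, &cgroup);
        if (r < 0) {
                fprintf(stderr, "ControlGroup: %s\n", error.message);
                goto finish;
        }

        printf("running in cgroup %s\n", cgroup);

finish:
        free(cgroup);
        sd_bus_message_unref(reply);
        sd_bus_error_free(&error);
        sd_bus_unref(bus);
        return r < 0 ? 1 : 0;
}
```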

And that's it for now. If you have further questions, refer to the systemd
mailing list.

— Berlin, 2018-04-20