1 .\" Copyright (C) 2015 Serge Hallyn <serge@hallyn.com>
2 .\" and Copyright (C) 2016, 2017 Michael Kerrisk <mtk.manpages@gmail.com>
3 .\"
4 .\" %%%LICENSE_START(VERBATIM)
5 .\" Permission is granted to make and distribute verbatim copies of this
6 .\" manual provided the copyright notice and this permission notice are
7 .\" preserved on all copies.
8 .\"
9 .\" Permission is granted to copy and distribute modified versions of this
10 .\" manual under the conditions for verbatim copying, provided that the
11 .\" entire resulting derived work is distributed under the terms of a
12 .\" permission notice identical to this one.
13 .\"
14 .\" Since the Linux kernel and libraries are constantly changing, this
15 .\" manual page may be incorrect or out-of-date. The author(s) assume no
16 .\" responsibility for errors or omissions, or for damages resulting from
17 .\" the use of the information contained herein. The author(s) may not
18 .\" have taken the same level of care in the production of this manual,
19 .\" which is licensed free of charge, as they might when working
20 .\" professionally.
21 .\"
22 .\" Formatted or processed versions of this manual, if unaccompanied by
23 .\" the source, must acknowledge the copyright and authors of this work.
24 .\" %%%LICENSE_END
25 .\"
26 .TH CGROUPS 7 2018-02-02 "Linux" "Linux Programmer's Manual"
27 .SH NAME
28 cgroups \- Linux control groups
29 .SH DESCRIPTION
30 Control groups, usually referred to as cgroups,
31 are a Linux kernel feature which allows processes to
32 be organized into hierarchical groups whose usage of
33 various types of resources can then be limited and monitored.
34 The kernel's cgroup interface is provided through
35 a pseudo-filesystem called cgroupfs.
36 Grouping is implemented in the core cgroup kernel code,
37 while resource tracking and limits are implemented in
38 a set of per-resource-type subsystems (memory, CPU, and so on).
39 .\"
40 .SS Terminology
41 A
42 .I cgroup
43 is a collection of processes that are bound to a set of
44 limits or parameters defined via the cgroup filesystem.
45 .PP
46 A
47 .I subsystem
48 is a kernel component that modifies the behavior of
49 the processes in a cgroup.
50 Various subsystems have been implemented, making it possible to do things
51 such as limiting the amount of CPU time and memory available to a cgroup,
52 accounting for the CPU time used by a cgroup,
53 and freezing and resuming execution of the processes in a cgroup.
54 Subsystems are sometimes also known as
55 .IR "resource controllers"
56 (or simply, controllers).
57 .PP
58 The cgroups for a controller are arranged in a
59 .IR hierarchy .
60 This hierarchy is defined by creating, removing, and
61 renaming subdirectories within the cgroup filesystem.
62 At each level of the hierarchy, attributes (e.g., limits) can be defined.
63 The limits, control, and accounting provided by cgroups generally have
64 effect throughout the subhierarchy underneath the cgroup where the
65 attributes are defined.
66 Thus, for example, the limits placed on
67 a cgroup at a higher level in the hierarchy cannot be exceeded
68 by descendant cgroups.
69 .\"
70 .SS Cgroups version 1 and version 2
71 The initial release of the cgroups implementation was in Linux 2.6.24.
72 Over time, various cgroup controllers have been added
73 to allow the management of various types of resources.
74 However, the development of these controllers was largely uncoordinated,
75 with the result that many inconsistencies arose between controllers
76 and management of the cgroup hierarchies became rather complex.
77 (A longer description of these problems can be found in
78 the kernel source file
79 .IR Documentation/cgroup\-v2.txt .)
80 .PP
81 Because of the problems with the initial cgroups implementation
82 (cgroups version 1),
83 starting in Linux 3.10, work began on a new,
84 orthogonal implementation to remedy these problems.
85 Initially marked experimental, and hidden behind the
86 .I "\-o\ __DEVEL__sane_behavior"
87 mount option, the new version (cgroups version 2)
88 was eventually made official with the release of Linux 4.5.
89 Differences between the two versions are described in the text below.
90 .PP
91 Although cgroups v2 is intended as a replacement for cgroups v1,
92 the older system continues to exist
93 (and for compatibility reasons is unlikely to be removed).
94 Currently, cgroups v2 implements only a subset of the controllers
95 available in cgroups v1.
96 The two systems are implemented so that both v1 controllers and
97 v2 controllers can be mounted on the same system.
98 Thus, for example, it is possible to use those controllers
99 that are supported under version 2,
100 while also using version 1 controllers
101 where version 2 does not yet support those controllers.
102 The only restriction here is that a controller can't be simultaneously
103 employed in both a cgroups v1 hierarchy and in the cgroups v2 hierarchy.
104 .\"
105 .SH CGROUPS VERSION 1
106 Under cgroups v1, each controller may be mounted against a separate
107 cgroup filesystem that provides its own hierarchical organization of the
108 processes on the system.
109 It is also possible to comount multiple (or even all) cgroups v1 controllers
110 against the same cgroup filesystem, meaning that the comounted controllers
111 manage the same hierarchical organization of processes.
112 .PP
113 For each mounted hierarchy,
114 the directory tree mirrors the control group hierarchy.
115 Each control group is represented by a directory, with each of its child
116 cgroups represented as a child directory.
117 For instance,
118 .IR /user/joe/1.session
119 represents control group
120 .IR 1.session ,
121 which is a child of cgroup
122 .IR joe ,
123 which is a child of
124 .IR /user .
125 Under each cgroup directory is a set of files which can be read or
126 written to, reflecting resource limits and a few general cgroup
127 properties.
128 .\"
129 .SS Tasks (threads) versus processes
130 In cgroups v1, a distinction is drawn between
131 .I processes
132 and
133 .IR tasks .
134 In this view, a process can consist of multiple tasks
135 (more commonly called threads, from a user-space perspective,
136 and called such in the remainder of this man page).
137 In cgroups v1, it is possible to independently manipulate
138 the cgroup memberships of the threads in a process.
139 .PP
140 The cgroups v1 ability to split threads across different cgroups
141 caused problems in some cases.
142 For example, it made no sense for the
143 .I memory
144 controller,
145 since all of the threads of a process share a single address space.
146 Because of these problems,
147 the ability to independently manipulate the cgroup memberships
148 of the threads in a process was removed in the initial cgroups v2
149 implementation, and subsequently restored in a more limited form
150 (see the discussion of "thread mode" below).
151 .\"
152 .SS Mounting v1 controllers
153 The use of cgroups requires a kernel built with the
154 .BR CONFIG_CGROUPS
155 option.
156 In addition, each of the v1 controllers has an associated
157 configuration option that must be set in order to employ that controller.
158 .PP
159 In order to use a v1 controller,
160 it must be mounted against a cgroup filesystem.
161 The usual place for such mounts is under a
162 .BR tmpfs (5)
163 filesystem mounted at
164 .IR /sys/fs/cgroup .
165 Thus, one might mount the
166 .I cpu
167 controller as follows:
168 .PP
169 .in +4n
170 .EX
171 mount \-t cgroup \-o cpu none /sys/fs/cgroup/cpu
172 .EE
173 .in
174 .PP
175 It is possible to comount multiple controllers against the same hierarchy.
176 For example, here the
177 .IR cpu
178 and
179 .IR cpuacct
180 controllers are comounted against a single hierarchy:
181 .PP
182 .in +4n
183 .EX
184 mount \-t cgroup \-o cpu,cpuacct none /sys/fs/cgroup/cpu,cpuacct
185 .EE
186 .in
187 .PP
188 Comounting controllers has the effect that a process is in the same cgroup for
189 all of the comounted controllers.
190 Separately mounting controllers allows a process to
191 be in cgroup
192 .I /foo1
193 for one controller while being in
194 .I /foo2/foo3
195 for another.
196 .PP
197 It is possible to comount all v1 controllers against the same hierarchy:
198 .PP
199 .in +4n
200 .EX
201 mount \-t cgroup \-o all cgroup /sys/fs/cgroup
202 .EE
203 .in
204 .PP
205 (One can achieve the same result by omitting
206 .IR "\-o all" ,
207 since it is the default if no controllers are explicitly specified.)
208 .PP
209 It is not possible to mount the same controller
210 against multiple cgroup hierarchies.
211 For example, it is not possible to mount both the
212 .I cpu
213 and
214 .I cpuacct
215 controllers against one hierarchy, and to mount the
216 .I cpu
217 controller alone against another hierarchy.
218 It is possible to create multiple mount points with exactly
219 the same set of comounted controllers.
220 However, in this case all that results is multiple mount points
221 providing a view of the same hierarchy.
222 .PP
223 Note that on many systems, the v1 controllers are automatically mounted under
224 .IR /sys/fs/cgroup ;
225 in particular,
226 .BR systemd (1)
227 automatically creates such mount points.
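.PP
To check which cgroup filesystems are already mounted
(for example, by
.BR systemd (1)),
one can filter a mounts-format file such as
.IR /proc/mounts .
The following is a minimal sketch rather than part of the original page:
the function name
.I list_cgroup_mounts
is hypothetical, and the code assumes the usual field order of
device, mount point, filesystem type, and mount options.
For v1 mounts, the comounted controller names appear among the mount options.

```shell
# list_cgroup_mounts [FILE]
# Print "<fstype> <mount point> <options>" for every cgroup (v1) or
# cgroup2 (v2) entry in a mounts-format file (default: /proc/mounts).
list_cgroup_mounts() {
    awk '$3 == "cgroup" || $3 == "cgroup2" { print $3, $2, $4 }' \
        "${1:-/proc/mounts}"
}
```

Calling the function with no argument inspects the running system.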
228 .\"
229 .SS Unmounting v1 controllers
230 A mounted cgroup filesystem can be unmounted using the
231 .BR umount (8)
232 command, as in the following example:
233 .PP
234 .in +4n
235 .EX
236 umount /sys/fs/cgroup/pids
237 .EE
238 .in
239 .PP
240 .IR "But note well" :
241 a cgroup filesystem is unmounted only if it is not busy,
242 that is, it has no child cgroups.
243 If this is not the case, then the only effect of the
244 .BR umount (8)
245 is to make the mount invisible.
246 Thus, to ensure that the mount point is really removed,
247 one must first remove all child cgroups,
248 which in turn can be done only after all member processes
249 have been moved from those cgroups to the root cgroup.
250 .\"
251 .SS Cgroups version 1 controllers
252 Each of the cgroups version 1 controllers is governed
253 by a kernel configuration option (listed below).
254 Additionally, the availability of the cgroups feature is governed by the
255 .BR CONFIG_CGROUPS
256 kernel configuration option.
257 .TP
258 .IR cpu " (since Linux 2.6.24; " \fBCONFIG_CGROUP_SCHED\fP )
259 Cgroups can be guaranteed a minimum number of "CPU shares"
260 when a system is busy.
261 This does not limit a cgroup's CPU usage if the CPUs are not busy.
262 For further information, see
263 .IR Documentation/scheduler/sched-design-CFS.txt .
264 .IP
265 In Linux 3.2,
266 this controller was extended to provide CPU "bandwidth" control.
267 If the kernel is configured with
268 .BR CONFIG_CFS_BANDWIDTH ,
269 then within each scheduling period
270 (defined via a file in the cgroup directory), it is possible to define
271 an upper limit on the CPU time allocated to the processes in a cgroup.
272 This upper limit applies even if there is no other competition for the CPU.
273 Further information can be found in the kernel source file
274 .IR Documentation/scheduler/sched\-bwc.txt .
275 .TP
276 .IR cpuacct " (since Linux 2.6.24; " \fBCONFIG_CGROUP_CPUACCT\fP )
277 This provides accounting for CPU usage by groups of processes.
278 .IP
279 Further information can be found in the kernel source file
280 .IR Documentation/cgroup\-v1/cpuacct.txt .
281 .TP
282 .IR cpuset " (since Linux 2.6.24; " \fBCONFIG_CPUSETS\fP )
283 This controller can be used to bind the processes in a cgroup to
284 a specified set of CPUs and NUMA nodes.
285 .IP
286 Further information can be found in the kernel source file
287 .IR Documentation/cgroup\-v1/cpusets.txt .
288 .TP
289 .IR memory " (since Linux 2.6.25; " \fBCONFIG_MEMCG\fP )
290 The memory controller supports reporting and limiting of process memory, kernel
291 memory, and swap used by cgroups.
292 .IP
293 Further information can be found in the kernel source file
294 .IR Documentation/cgroup\-v1/memory.txt .
295 .TP
296 .IR devices " (since Linux 2.6.26; " \fBCONFIG_CGROUP_DEVICE\fP )
297 This supports controlling which processes may create (mknod) devices as
298 well as open them for reading or writing.
299 The policies may be specified as whitelists and blacklists.
300 Hierarchy is enforced, so new rules must not
301 violate existing rules for the target or ancestor cgroups.
302 .IP
303 Further information can be found in the kernel source file
304 .IR Documentation/cgroup-v1/devices.txt .
305 .TP
306 .IR freezer " (since Linux 2.6.28; " \fBCONFIG_CGROUP_FREEZER\fP )
307 The
308 .IR freezer
309 cgroup can suspend and restore (resume) all processes in a cgroup.
310 Freezing a cgroup
311 .I /A
312 also causes its children, for example, processes in
313 .IR /A/B ,
314 to be frozen.
315 .IP
316 Further information can be found in the kernel source file
317 .IR Documentation/cgroup-v1/freezer-subsystem.txt .
318 .TP
319 .IR net_cls " (since Linux 2.6.29; " \fBCONFIG_CGROUP_NET_CLASSID\fP )
320 This places a classid, specified for the cgroup, on network packets
321 created by a cgroup.
322 These classids can then be used in firewall rules,
323 as well as used to shape traffic using
324 .BR tc (8).
325 This applies only to packets
326 leaving the cgroup, not to traffic arriving at the cgroup.
327 .IP
328 Further information can be found in the kernel source file
329 .IR Documentation/cgroup-v1/net_cls.txt .
330 .TP
331 .IR blkio " (since Linux 2.6.33; " \fBCONFIG_BLK_CGROUP\fP )
332 The
333 .I blkio
334 cgroup controls and limits access to specified block devices by
335 applying I/O control in the form of throttling and upper limits against leaf
336 nodes and intermediate nodes in the storage hierarchy.
337 .IP
338 Two policies are available.
339 The first is a proportional-weight time-based division
340 of disk implemented with CFQ.
341 This is in effect for leaf nodes using CFQ.
342 The second is a throttling policy which specifies
343 upper I/O rate limits on a device.
344 .IP
345 Further information can be found in the kernel source file
346 .IR Documentation/cgroup-v1/blkio-controller.txt .
347 .TP
348 .IR perf_event " (since Linux 2.6.39; " \fBCONFIG_CGROUP_PERF\fP )
349 This controller allows
350 .I perf
351 monitoring of the set of processes grouped in a cgroup.
352 .IP
353 Further information can be found in the kernel source file
354 .IR tools/perf/Documentation/perf-record.txt .
355 .TP
356 .IR net_prio " (since Linux 3.3; " \fBCONFIG_CGROUP_NET_PRIO\fP )
357 This allows priorities to be specified, per network interface, for cgroups.
358 .IP
359 Further information can be found in the kernel source file
360 .IR Documentation/cgroup-v1/net_prio.txt .
361 .TP
362 .IR hugetlb " (since Linux 3.5; " \fBCONFIG_CGROUP_HUGETLB\fP )
363 This supports limiting the use of huge pages by cgroups.
364 .IP
365 Further information can be found in the kernel source file
366 .IR Documentation/cgroup-v1/hugetlb.txt .
367 .TP
368 .IR pids " (since Linux 4.3; " \fBCONFIG_CGROUP_PIDS\fP )
369 This controller permits limiting the number of processes that may be created
370 in a cgroup (and its descendants).
371 .IP
372 Further information can be found in the kernel source file
373 .IR Documentation/cgroup-v1/pids.txt .
374 .TP
375 .IR rdma " (since Linux 4.11; " \fBCONFIG_CGROUP_RDMA\fP )
376 The RDMA controller permits limiting the use of
377 RDMA/IB-specific resources per cgroup.
378 .IP
379 Further information can be found in the kernel source file
380 .IR Documentation/cgroup-v1/rdma.txt .
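.PP
The CPU bandwidth limit described for the
.I cpu
controller above is expressed through two files that this page does not name.
The sketch below assumes they are
.I cpu.cfs_quota_us
and
.I cpu.cfs_period_us
(both in microseconds, with a quota of \-1 meaning "no limit"),
as described in the kernel's bandwidth-control documentation.
The function name
.I cfs_cpu_percent
is hypothetical.
It reports the upper limit as a percentage of one CPU.

```shell
# cfs_cpu_percent CGROUP_DIR
# Print the CPU-time ceiling of a v1 cpu cgroup as a percentage of one
# CPU, or "unlimited" when no quota is set (a negative quota).
cfs_cpu_percent() (
    quota=$(cat "$1/cpu.cfs_quota_us") || exit 1
    period=$(cat "$1/cpu.cfs_period_us") || exit 1
    if [ "$quota" -lt 0 ]; then
        echo unlimited
    else
        # Integer arithmetic: quota/period scaled to a percentage.
        echo $(( quota * 100 / period ))
    fi
)
```

For example, a quota of 50000 with a period of 100000 yields 50,
that is, at most half of one CPU in each scheduling period.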
381 .\"
382 .SS Creating cgroups and moving processes
383 A cgroup filesystem initially contains a single root cgroup, '/',
384 which all processes belong to.
385 A new cgroup is created by creating a directory in the cgroup filesystem:
386 .PP
387 .in +4n
388 .EX
389 mkdir /sys/fs/cgroup/cpu/cg1
390 .EE
391 .in
392 .PP
393 This creates a new empty cgroup.
394 .PP
395 A process may be moved to this cgroup by writing its PID into the cgroup's
396 .I cgroup.procs
397 file:
398 .PP
399 .in +4n
400 .EX
401 echo $$ > /sys/fs/cgroup/cpu/cg1/cgroup.procs
402 .EE
403 .in
404 .PP
405 Only one PID at a time should be written to this file.
406 .PP
407 Writing the value 0 to a
408 .IR cgroup.procs
409 file causes the writing process to be moved to the corresponding cgroup.
410 .PP
411 When writing a PID into the
412 .I cgroup.procs
413 file, all threads in the process are moved into the new cgroup at once.
414 .PP
415 Within a hierarchy, a process can be a member of exactly one cgroup.
416 Writing a process's PID to a
417 .IR cgroup.procs
418 file automatically removes it from the cgroup of
419 which it was previously a member.
420 .PP
421 The
422 .I cgroup.procs
423 file can be read to obtain a list of the processes that are
424 members of a cgroup.
425 The returned list of PIDs is not guaranteed to be in order.
426 Nor is it guaranteed to be free of duplicates.
427 (For example, a PID may be recycled while reading from the list.)
428 .PP
429 In cgroups v1, an individual thread can be moved to
430 another cgroup by writing its thread ID
431 (i.e., the kernel thread ID returned by
432 .BR clone (2)
433 and
434 .BR gettid (2))
435 to the
436 .IR tasks
437 file in a cgroup directory.
438 This file can be read to discover the set of threads
439 that are members of the cgroup.
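.PP
The rules above for the
.I cgroup.procs
file (one PID per write; a list that may be unordered and contain duplicates)
can be wrapped in two small helpers.
This is a sketch under those stated rules only;
the function names
.I cgroup_pids
and
.I cgroup_add_pids
are hypothetical.

```shell
# cgroup_pids CGROUP_DIR
# Print the member PIDs of a cgroup, sorted numerically and with
# duplicates removed, since the kernel guarantees neither property.
cgroup_pids() {
    sort -n -u "$1/cgroup.procs"
}

# cgroup_add_pids CGROUP_DIR PID...
# Move each PID into the cgroup, issuing one write per PID.
cgroup_add_pids() (
    dir=$1
    shift
    for pid in "$@"; do
        echo "$pid" > "$dir/cgroup.procs" || exit 1
    done
)
```

For instance,
.I "cgroup_add_pids /sys/fs/cgroup/cpu/cg1 $$"
moves the calling shell, and
.I "cgroup_pids /sys/fs/cgroup/cpu/cg1"
then lists the members.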
440 .\"
441 .SS Removing cgroups
442 To remove a cgroup,
443 it must first have no child cgroups and contain no (nonzombie) processes.
444 So long as that is the case, one can simply
445 remove the corresponding directory pathname.
446 Note that files in a cgroup directory cannot and need not be
447 removed.
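.PP
Because a cgroup can be removed only once it has no child cgroups,
a whole subtree has to be removed deepest-first.
The following sketch (the function name
.I remove_cgroup_tree
is hypothetical) relies on
.BR find (1)
visiting directories in depth-first order;
it assumes that all member processes have already been moved elsewhere,
since
.BR rmdir (1)
fails for any cgroup that still has members.

```shell
# remove_cgroup_tree CGROUP_DIR
# Remove CGROUP_DIR and all of its descendant cgroups, deepest first.
# Succeeds only if CGROUP_DIR itself is gone afterward.
remove_cgroup_tree() {
    find "$1" -depth -type d -exec rmdir {} \; && [ ! -d "$1" ]
}
```

The interface files inside each cgroup directory are skipped,
consistent with the note above that they cannot and need not be removed.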
448 .\"
449 .SS Cgroups v1 release notification
450 Two files can be used to determine whether the kernel provides
451 notifications when a cgroup becomes empty.
452 A cgroup is considered to be empty when it contains no child
453 cgroups and no member processes.
454 .PP
455 A special file in the root directory of each cgroup hierarchy,
456 .IR release_agent ,
457 can be used to register the pathname of a program that may be invoked when
458 a cgroup in the hierarchy becomes empty.
459 The pathname of the newly empty cgroup (relative to the cgroup mount point)
460 is provided as the sole command-line argument when the
461 .IR release_agent
462 program is invoked.
463 The
464 .IR release_agent
465 program might remove the cgroup directory,
466 or perhaps repopulate it with a process.
467 .PP
468 The default value of the
469 .IR release_agent
470 file is empty, meaning that no release agent is invoked.
471 .PP
472 The content of the
473 .I release_agent
474 file can also be specified via a mount option when the
475 cgroup filesystem is mounted:
476 .PP
477 .in +4n
478 .EX
479 mount -o release_agent=pathname ...
480 .EE
481 .in
482 .PP
483 Whether or not the
484 .IR release_agent
485 program is invoked when a particular cgroup becomes empty is determined
486 by the value in the
487 .IR notify_on_release
488 file in the corresponding cgroup directory.
489 If this file contains the value 0, then the
490 .IR release_agent
491 program is not invoked.
492 If it contains the value 1, the
493 .IR release_agent
494 program is invoked.
495 The default value for this file in the root cgroup is 0.
496 At the time when a new cgroup is created,
497 the value in this file is inherited from the corresponding file
498 in the parent cgroup.
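.PP
As an illustration of the mechanism described above,
the sketch below shows the two halves of a simple setup:
turning on notification for one cgroup,
and the body of a release agent that removes the newly empty cgroup.
Both function names are hypothetical.
A real agent receives only the relative cgroup pathname as its argument,
so it must already know the mount point;
here the mount point is passed explicitly as an assumption of the example.

```shell
# enable_release_notification CGROUP_DIR
# Ask the kernel to invoke the release agent when this cgroup empties.
enable_release_notification() {
    echo 1 > "$1/notify_on_release"
}

# release_agent_main MOUNT_POINT RELATIVE_CGROUP_PATH
# Remove the newly empty cgroup named by the release notification.
release_agent_main() {
    rmdir "$1/$2"
}
```

An agent script registered in
.I release_agent
might then consist of the single call
.IR "release_agent_main /sys/fs/cgroup/cpu ""$1""" .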
499 .\"
500 .SS Cgroup v1 named hierarchies
501 In cgroups v1,
502 it is possible to mount a cgroup hierarchy that has no attached controllers:
503 .PP
504 .in +4n
505 .EX
506 mount -t cgroup -o none,name=somename none /some/mount/point
507 .EE
508 .in
509 .PP
510 Multiple instances of such hierarchies can be mounted;
511 each hierarchy must have a unique name.
512 The only purpose of such hierarchies is to track processes.
513 (See the discussion of release notification above.)
514 An example of this is the
515 .I name=systemd
516 cgroup hierarchy that is used by
517 .BR systemd (1)
518 to track services and user sessions.
519 .\"
520 .SH CGROUPS VERSION 2
521 In cgroups v2,
522 all mounted controllers reside in a single unified hierarchy.
523 While (different) controllers may be simultaneously
524 mounted under the v1 and v2 hierarchies,
525 it is not possible to mount the same controller simultaneously
526 under both the v1 and the v2 hierarchies.
527 .PP
528 The new behaviors in cgroups v2 are summarized here,
529 and in some cases elaborated in the following subsections.
530 .IP 1. 3
531 Cgroups v2 provides a unified hierarchy against
532 which all controllers are mounted.
533 .IP 2.
534 "Internal" processes are not permitted.
535 With the exception of the root cgroup, processes may reside
536 only in leaf nodes (cgroups that do not themselves contain child cgroups).
537 The details are somewhat more subtle than this, and are described below.
538 .IP 3.
539 Active controllers must be specified via the files
540 .IR cgroup.controllers
541 and
542 .IR cgroup.subtree_control .
543 .IP 4.
544 The
545 .I tasks
546 file has been removed.
547 In addition, the
548 .I cgroup.clone_children
549 file that is employed by the
550 .I cpuset
551 controller has been removed.
552 .IP 5.
553 An improved mechanism for notification of empty cgroups is provided by the
554 .IR cgroup.events
555 file.
556 .PP
557 For more changes, see the
558 .I Documentation/cgroup-v2.txt
559 file in the kernel source.
560 .PP
561 Some of the new behaviors listed above saw subsequent modification with
562 the addition in Linux 4.14 of "thread mode" (described below).
563 .\"
564 .SS Cgroups v2 unified hierarchy
565 In cgroups v1, the ability to mount different controllers
566 against different hierarchies was intended to allow great flexibility
567 for application design.
568 In practice, though, the flexibility turned out to be less useful than expected,
569 and in many cases added complexity.
570 Therefore, in cgroups v2,
571 all available controllers are mounted against a single hierarchy.
572 The available controllers are automatically mounted,
573 meaning that it is not necessary (or possible) to specify the controllers
574 when mounting the cgroup v2 filesystem using a command such as the following:
575 .PP
576 .in +4n
577 .EX
578 mount -t cgroup2 none /mnt/cgroup2
579 .EE
580 .in
581 .PP
582 A cgroup v2 controller is available only if it is not currently in use
583 via a mount against a cgroup v1 hierarchy.
584 Or, to put things another way, it is not possible to employ
585 the same controller against both a v1 hierarchy and the unified v2 hierarchy.
586 This means that it may be necessary first to unmount a v1 controller
587 (as described above) before that controller is available in v2.
588 Since
589 .BR systemd (1)
590 makes heavy use of some v1 controllers by default,
591 it can in some cases be simpler to boot the system with
592 selected v1 controllers disabled.
593 To do this, specify the
594 .IR cgroup_no_v1=list
595 option on the kernel boot command line;
596 .I list
597 is a comma-separated list of the names of the controllers to disable,
598 or the word
599 .I all
600 to disable all v1 controllers.
601 (This situation is correctly handled by
602 .BR systemd (1),
603 which falls back to operating without the specified controllers.)
604 .PP
605 Note that on many modern systems,
606 .BR systemd (1)
607 automatically mounts the
608 .I cgroup2
609 filesystem at
610 .I /sys/fs/cgroup/unified
611 during the boot process.
612 .\"
613 .SS Cgroups v2 controllers
614 The following controllers, documented in the kernel source file
615 .IR Documentation/cgroup-v2.txt ,
616 are supported in cgroups version 2:
617 .TP
618 .IR io " (since Linux 4.5)"
619 This is the successor of the version 1
620 .I blkio
621 controller.
622 .TP
623 .IR memory " (since Linux 4.5)"
624 This is the successor of the version 1
625 .I memory
626 controller.
627 .TP
628 .IR pids " (since Linux 4.5)"
629 This is the same as the version 1
630 .I pids
631 controller.
632 .TP
633 .IR perf_event " (since Linux 4.11)"
634 This is the same as the version 1
635 .I perf_event
636 controller.
637 .TP
638 .IR rdma " (since Linux 4.11)"
639 This is the same as the version 1
640 .I rdma
641 controller.
642 .TP
643 .IR cpu " (since Linux 4.15)"
644 This is the successor to the version 1
645 .I cpu
646 and
647 .I cpuacct
648 controllers.
649 .\"
650 .SS Cgroups v2 subtree control
651 Each cgroup in the v2 hierarchy contains the following two files:
652 .TP
653 .IR cgroup.controllers
654 This read-only file exposes a list of the controllers that are
655 .I available
656 in this cgroup.
657 The contents of this file match the contents of the
658 .I cgroup.subtree_control
659 file in the parent cgroup.
660 .TP
661 .I cgroup.subtree_control
662 This is a list of controllers that are
663 .IR active
664 .RI ( enabled )
665 in the cgroup.
666 The set of controllers in this file is a subset of the set in the
667 .IR cgroup.controllers
668 of this cgroup.
669 The set of active controllers is modified by writing strings to this file
670 containing space-delimited controller names,
671 each preceded by '+' (to enable a controller)
672 or '\-' (to disable a controller), as in the following example:
673 .IP
674 .in +4n
675 .EX
676 echo '+pids -memory' > x/y/cgroup.subtree_control
677 .EE
678 .in
679 .IP
680 An attempt to enable a controller
681 that is not present in
682 .I cgroup.controllers
683 leads to an
684 .B ENOENT
685 error when writing to the
686 .I cgroup.subtree_control
687 file.
688 .PP
689 Because the list of controllers in
690 .I cgroup.subtree_control
691 is a subset of those in
692 .IR cgroup.controllers ,
693 a controller that has been disabled in one cgroup in the hierarchy
694 can never be re-enabled in the subtree below that cgroup.
695 .PP
696 A cgroup's
697 .I cgroup.subtree_control
698 file determines the set of controllers that are exercised in the
699 .I child
700 cgroups.
701 When a controller (e.g.,
702 .IR pids )
703 is present in the
704 .I cgroup.subtree_control
705 file of a parent cgroup,
706 then the corresponding controller-interface files (e.g.,
707 .IR pids.max )
708 are automatically created in the children of that cgroup
709 and can be used to exert resource control in the child cgroups.
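.PP
Since enabling a controller that is not listed in
.I cgroup.controllers
fails with
.BR ENOENT ,
a script can check availability before writing to
.IR cgroup.subtree_control .
The following is one possible sketch;
the function name
.I enable_controllers
is hypothetical.

```shell
# enable_controllers CGROUP_DIR CONTROLLER...
# Enable the named controllers for the children of CGROUP_DIR, after
# checking that each one is listed in cgroup.controllers.
enable_controllers() (
    dir=$1
    shift
    [ $# -gt 0 ] || exit 0
    # Pad with spaces so each name can be matched as a whole word.
    available=" $(tr '\n' ' ' < "$dir/cgroup.controllers") "
    request=
    for c in "$@"; do
        case $available in
            *" $c "*) request="$request +$c" ;;
            *) echo "controller not available: $c" >&2; exit 1 ;;
        esac
    done
    # One write containing space-delimited "+name" tokens.
    echo "${request# }" > "$dir/cgroup.subtree_control"
)
```

For example,
.I "enable_controllers x/y pids memory"
writes the string "+pids +memory" to
.IR x/y/cgroup.subtree_control .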
710 .\"
711 .SS Cgroups v2 """no internal processes""" rule
712 Cgroups v2 enforces a so-called "no internal processes" rule.
713 Roughly speaking, this rule means that,
714 with the exception of the root cgroup, processes may reside
715 only in leaf nodes (cgroups that do not themselves contain child cgroups).
716 This avoids the need to decide how to partition resources between
717 processes which are members of cgroup A and processes in child cgroups of A.
718 .PP
719 For instance, if cgroup
720 .I /cg1/cg2
721 exists, then a process may reside in
722 .IR /cg1/cg2 ,
723 but not in
724 .IR /cg1 .
725 This is to avoid an ambiguity in cgroups v1
726 with respect to the delegation of resources between processes in
727 .I /cg1
728 and its child cgroups.
729 The recommended approach in cgroups v2 is to create a subdirectory called
730 .I leaf
731 for any nonleaf cgroup which should contain processes, but no child cgroups.
732 Thus, processes which previously would have gone into
733 .I /cg1
734 would now go into
735 .IR /cg1/leaf .
736 This has the advantage of making explicit
737 the relationship between processes in
738 .I /cg1/leaf
739 and
740 .IR /cg1 's
741 other children.
742 .PP
743 The "no internal processes" rule is in fact more subtle than stated above.
744 More precisely, the rule is that a (nonroot) cgroup can't both
745 (1) have member processes, and
746 (2) distribute resources into child cgroups\(emthat is, have a nonempty
747 .I cgroup.subtree_control
748 file.
749 Thus, it
750 .I is
751 possible for a cgroup to have both member processes and child cgroups,
752 but before controllers can be enabled for that cgroup,
753 the member processes must be moved out of the cgroup
754 (e.g., perhaps into the child cgroups).
755 .PP
756 With the Linux 4.14 addition of "thread mode" (described below),
757 the "no internal processes" rule has been relaxed in some cases.
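.PP
The recommended
.I leaf
arrangement described above can be sketched as follows.
The function name
.I push_to_leaf
is hypothetical, and the child name
.I leaf
is simply the convention suggested in this section.
The membership list is read in full before any process is moved,
because the list changes as processes leave the cgroup.

```shell
# push_to_leaf CGROUP_DIR
# Create CGROUP_DIR/leaf and move every member process of CGROUP_DIR
# into it, so that CGROUP_DIR itself has no member processes.
push_to_leaf() (
    cg=$1
    mkdir -p "$cg/leaf" || exit 1
    # Snapshot the membership first.
    pids=$(cat "$cg/cgroup.procs") || exit 1
    for pid in $pids; do
        # A process may exit before it is moved; ignore that failure.
        echo "$pid" > "$cg/leaf/cgroup.procs" 2>/dev/null || true
    done
)
```

Once the cgroup has no member processes of its own,
controllers can be enabled in its
.I cgroup.subtree_control
file.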
758 .\"
759 .SS Cgroups v2 cgroup.events file
760 With cgroups v2, a new mechanism is provided to obtain notification
761 about when a cgroup becomes empty.
762 The cgroups v1
763 .IR release_agent
764 and
765 .IR notify_on_release
766 files are removed, and replaced by a new, more general-purpose file,
767 .IR cgroup.events .
768 This read-only file contains key-value pairs
769 (delimited by newline characters, with the key and value separated by spaces)
770 that identify events or state for a cgroup.
771 Currently, only one key appears in this file,
772 .IR populated ,
773 which has either the value 0,
774 meaning that the cgroup (and its descendants)
775 contain no (nonzombie) processes,
776 or 1, meaning that the cgroup (or one of its descendants) contains member processes.
777 .PP
778 The
779 .IR cgroup.events
780 file can be monitored, in order to receive notification when a cgroup
781 transitions between the populated and unpopulated states (or vice versa).
782 When monitoring this file using
783 .BR inotify (7),
784 transitions generate
785 .BR IN_MODIFY
786 events, and when monitoring the file using
787 .BR poll (2),
788 transitions cause the bits
789 .B POLLPRI
790 and
791 .B POLLERR
792 to be returned in the
793 .IR revents
794 field.
795 .PP
796 The cgroups v2 release-notification mechanism provided by the
797 .I populated
798 field of the
799 .I cgroup.events
800 file offers at least two advantages over the cgroups v1
801 .IR release_agent
802 mechanism.
803 First, it allows for cheaper notification,
804 since a single process can monitor multiple
805 .IR cgroup.events
806 files.
807 By contrast, the cgroups v1 mechanism requires the creation
808 of a process for each notification.
809 Second, notification can be delegated to a process that lives inside
810 a container associated with the newly empty cgroup.
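.PP
Reading the
.I populated
key from
.I cgroup.events
takes only a line of
.BR awk (1).
The sketch below assumes the key-value layout described above
(one pair per line, key and value separated by spaces);
the function name
.I cgroup_is_populated
is hypothetical.

```shell
# cgroup_is_populated CGROUP_DIR
# Succeed (exit status 0) if the "populated" key in cgroup.events is 1.
cgroup_is_populated() {
    [ "$(awk '$1 == "populated" { print $2 }' "$1/cgroup.events")" = 1 ]
}
```

A monitoring process would typically call such a check
each time it is notified of a change to the file.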
811 .\"
812 .SS Cgroups v2 cgroup.stat file
813 .\" commit ec39225cca42c05ac36853d11d28f877fde5c42e
814 Each cgroup in the v2 hierarchy contains a read-only
815 .IR cgroup.stat
816 file (first introduced in Linux 4.14)
817 that consists of lines containing key-value pairs.
818 The following keys currently appear in this file:
819 .TP
820 .I nr_descendants
821 This is the total number of visible (i.e., living) descendant cgroups
822 underneath this cgroup.
823 .TP
824 .I nr_dying_descendants
825 This is the total number of dying descendant cgroups
826 underneath this cgroup.
827 A cgroup enters the dying state after being deleted.
828 It remains in that state for an undefined period
829 (which will depend on system load)
830 while resources are freed before the cgroup is destroyed.
831 Note that the presence of some cgroups in the dying state is normal,
832 and is not indicative of any problem.
833 .IP
834 A process can't be made a member of a dying cgroup,
835 and a dying cgroup can't be brought back to life.
836 .\"
837 .SS Limiting the number of descendant cgroups
838 Each cgroup in the v2 hierarchy contains the following files,
839 which can be used to view and set limits on the number
840 of descendant cgroups under that cgroup:
841 .TP
842 .IR cgroup.max.depth " (since Linux 4.14)"
843 .\" commit 1a926e0bbab83bae8207d05a533173425e0496d1
844 This file defines a limit on the depth of nesting of descendant cgroups.
845 A value of 0 in this file means that no descendant cgroups can be created.
846 An attempt to create a descendant whose nesting level exceeds
847 the limit fails
848 .RI ( mkdir (2)
849 fails with the error
850 .BR EAGAIN ).
851 .IP
852 Writing the string
853 .IR """max"""
854 to this file means that no limit is imposed.
855 The default value in this file is
856 .IR """max""" .
857 .TP
858 .IR cgroup.max.descendants " (since Linux 4.14)"
859 .\" commit 1a926e0bbab83bae8207d05a533173425e0496d1
860 This file defines a limit on the number of live descendant cgroups that
861 this cgroup may have.
862 An attempt to create more descendants than allowed by the limit fails
863 .RI ( mkdir (2)
864 fails with the error
865 .BR EAGAIN ).
866 .IP
867 Writing the string
868 .IR """max"""
869 to this file means that no limit is imposed.
870 The default value in this file is
871 .IR """max""" .
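.PP
The following sketch shows one way of setting either of these limits
from the shell and reading back the resulting value.
The helper name
.I cgroup_set_limit
and the cgroup path in the usage comments are hypothetical.
.PP
.in +4n
.EX
# Write value $3 to limit file $2 in cgroup directory $1,
# then print the value now stored in that file.
cgroup_set_limit() {
    echo "$3" > "$1/$2" && cat "$1/$2"
}

# Hypothetical usage (the cgroup path is an assumption):
# cgroup_set_limit /sys/fs/cgroup/unified/grp cgroup.max.depth 2
# cgroup_set_limit /sys/fs/cgroup/unified/grp cgroup.max.descendants max
.EE
.in
.PP
Once a numeric limit is in place, a
.BR mkdir (2)
that would exceed it is expected to fail with
.BR EAGAIN ,
as described above.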
872 .\"
873 .SH CGROUPS DELEGATION: DELEGATING A HIERARCHY TO A LESS PRIVILEGED USER
874 In the context of cgroups,
875 delegation means passing management of some subtree
876 of the cgroup hierarchy to a nonprivileged user.
877 Cgroups v1 provides support for delegation based on file permissions
878 in the cgroup hierarchy but with less strict containment rules than v2
879 (as noted below).
880 Cgroups v2 supports delegation with containment by explicit design.
881 The focus of the discussion in this section is on delegation in cgroups v2,
882 with some differences for cgroups v1 noted along the way.
883 .PP
884 Some terminology is required in order to describe delegation.
885 A
886 .I delegater
887 is a privileged user (i.e., root) who owns a parent cgroup.
888 A
889 .I delegatee
890 is a nonprivileged user who will be granted the permissions needed
891 to manage some subhierarchy under that parent cgroup,
892 known as the
893 .IR "delegated subtree" .
894 .PP
895 To perform delegation,
896 the delegater makes certain directories and files writable by the delegatee,
897 typically by changing the ownership of the objects to be the user ID
898 of the delegatee.
899 Assuming that we want to delegate the hierarchy rooted at (say)
900 .I /dlgt_grp
901 and that there are not yet any child cgroups under that cgroup,
902 the ownership of the following is changed to the user ID of the delegatee:
903 .TP
904 .IR /dlgt_grp
905 Changing the ownership of the root of the subtree means that any new
906 cgroups created under the subtree (and the files they contain)
907 will also be owned by the delegatee.
908 .TP
909 .IR /dlgt_grp/cgroup.procs
910 Changing the ownership of this file means that the delegatee
911 can move processes into the root of the delegated subtree.
912 .TP
913 .IR /dlgt_grp/cgroup.subtree_control " (cgroups v2 only)"
914 Changing the ownership of this file means that the delegatee
915 can enable controllers (that are present in
916 .IR /dlgt_grp/cgroup.controllers )
917 in order to further redistribute resources at lower levels in the subtree.
918 (As an alternative to changing the ownership of this file,
919 the delegater might instead add selected controllers to this file.)
920 .TP
921 .IR /dlgt_grp/cgroup.threads " (cgroups v2 only)"
922 Changing the ownership of this file is necessary if a threaded subtree
923 is being delegated (see the description of "thread mode", below).
924 This permits the delegatee to write thread IDs to the file.
925 (The ownership of this file can also be changed when delegating
926 a domain subtree, but currently this serves no purpose,
927 since, as described below, it is not possible to move a thread between
928 domain cgroups by writing its thread ID to the
929 .IR cgroup.threads
930 file.)
931 .IP
932 In cgroups v1, the corresponding file that should instead be delegated is the
933 .I tasks
934 file.
935 .PP
936 The delegater should
937 .I not
938 change the ownership of any of the controller interface files (e.g.,
939 .IR pids.max ,
940 .IR memory.high )
941 in
942 .IR dlgt_grp .
943 Those files are used from the next level above the delegated subtree
944 in order to distribute resources into the subtree,
945 and the delegatee should not have permission to change
946 the resources that are distributed into the delegated subtree.
947 .PP
948 See also the discussion of the
949 .IR /sys/kernel/cgroup/delegate
950 file in NOTES for information about further delegatable files in cgroups v2.
951 .PP
952 After the aforementioned steps have been performed,
953 the delegatee can create child cgroups within the delegated subtree
954 (the cgroup subdirectories and the files they contain
955 will be owned by the delegatee)
956 and move processes between cgroups in the subtree.
957 If some controllers are present in
958 .IR dlgt_grp/cgroup.subtree_control ,
959 or the ownership of that file was passed to the delegatee,
960 the delegatee can also control the further redistribution
961 of the corresponding resources into the delegated subtree.
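.PP
The ownership changes described in this section might be scripted
roughly as follows.
This is a sketch only: the function name
.I delegate_subtree
is hypothetical, and the list of files is hard-coded from the
cgroups v2 files named above rather than being read from
.IR /sys/kernel/cgroup/delegate .
.PP
.in +4n
.EX
# Give user $2 ownership of the delegatable objects in cgroup
# directory $1.  Controller interface files (e.g., pids.max)
# are deliberately left untouched.
delegate_subtree() {
    chown "$2" "$1" || return 1
    for f in cgroup.procs cgroup.subtree_control cgroup.threads; do
        if [ -e "$1/$f" ]; then
            chown "$2" "$1/$f" || return 1
        fi
    done
}

# Hypothetical usage, run by the delegater (path and user name
# are assumptions):
# delegate_subtree /sys/fs/cgroup/unified/dlgt_grp cecilia
.EE
.in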
962 .\"
963 .SS Cgroups v2 delegation: nsdelegate and cgroup namespaces
964 Starting with Linux 4.13,
965 .\" commit 5136f6365ce3eace5a926e10f16ed2a233db5ba9
966 there is a second way to perform cgroup delegation in the cgroups v2 hierarchy.
967 This is done by mounting or remounting the cgroup v2 filesystem with the
968 .I nsdelegate
969 mount option.
970 For example, if the cgroup v2 filesystem has already been mounted,
971 we can remount it with the
972 .I nsdelegate
973 option as follows:
974 .PP
975 .in +4n
976 .EX
977 mount -t cgroup2 -o remount,nsdelegate \\
978 none /sys/fs/cgroup/unified
979 .EE
980 .in
981 .\"
982 .\" Alternatively, we could boot the kernel with the options:
983 .\"
984 .\" cgroup_no_v1=all systemd.legacy_systemd_cgroup_controller
985 .\"
986 .\" The effect of the latter option is to prevent systemd from employing
987 .\" its "hybrid" cgroup mode, where it tries to make use of cgroups v2.
988 .PP
989 The effect of this mount option is to cause cgroup namespaces
990 to automatically become delegation boundaries.
991 More specifically,
992 the following restrictions apply for processes inside the cgroup namespace:
993 .IP * 3
994 Writes to controller interface files in the root directory of the namespace
995 will fail with the error
996 .BR EPERM .
997 Processes inside the cgroup namespace can still write to delegatable
998 files in the root directory of the cgroup namespace such as
999 .IR cgroup.procs
1000 and
1001 .IR cgroup.subtree_control ,
1002 and can create a subhierarchy underneath the root directory.
1003 .IP *
1004 Attempts to migrate processes across the namespace boundary are denied
1005 (with the error
1006 .BR ENOENT ).
1007 Processes inside the cgroup namespace can still
1008 (subject to the containment rules described below)
1009 move processes between cgroups
1010 .I within
1011 the subhierarchy under the namespace root.
1012 .PP
1013 The ability to define cgroup namespaces as delegation boundaries
1014 makes cgroup namespaces more useful.
1015 To understand why, suppose that we already have one cgroup hierarchy
1016 that has been delegated to a nonprivileged user,
1017 .IR cecilia ,
1018 using the older delegation technique described above.
1019 Suppose further that
1020 .I cecilia
1021 wanted to further delegate a subhierarchy
1022 under the existing delegated hierarchy.
1023 (For example, the delegated hierarchy might be associated with
1024 an unprivileged container run by
1025 .IR cecilia .)
1026 Even if a cgroup namespace was employed,
1027 because both hierarchies are owned by the unprivileged user
1028 .IR cecilia ,
1029 the following illegitimate actions could be performed:
1030 .IP * 3
1031 A process in the inferior hierarchy could change the
1032 resource controller settings in the root directory of that hierarchy.
1033 (These resource controller settings are intended to allow control to
1034 be exercised from the
1035 .I parent
1036 cgroup;
1037 a process inside the child cgroup should not be allowed to modify them.)
1038 .IP *
1039 A process inside the inferior hierarchy could move processes
1040 into and out of the inferior hierarchy if the cgroups in the
1041 superior hierarchy were somehow visible.
1042 .PP
1043 Employing the
1044 .I nsdelegate
1045 mount option prevents both of these possibilities.
1046 .PP
1047 The
1048 .I nsdelegate
1049 mount option only has an effect when performed in
1050 the initial mount namespace;
1051 in other mount namespaces, the option is silently ignored.
1052 .PP
1053 .IR Note :
1054 On some systems,
1055 .BR systemd (1)
1056 automatically mounts the cgroup v2 filesystem.
1057 In order to experiment with the
1058 .I nsdelegate
1059 operation, it may be useful to boot the kernel with
1060 the following command-line options:
1061 .PP
1062 .in +4n
1063 .EX
1064 cgroup_no_v1=all systemd.legacy_systemd_cgroup_controller
1065 .EE
1066 .in
1067 .PP
1068 These options cause the kernel to boot with the cgroups v1 controllers
1069 disabled (meaning that the controllers are available in the v2 hierarchy),
1070 and tell
1071 .BR systemd (1)
1072 not to mount and use the cgroup v2 hierarchy,
1073 so that the v2 hierarchy can be manually mounted
1074 with the desired options after boot-up.
1075 .\"
1076 .SS Cgroup delegation containment rules
1077 Some delegation
1078 .IR "containment rules"
1079 ensure that the delegatee can move processes between cgroups within the
1080 delegated subtree,
1081 but can't move processes from outside the delegated subtree into
1082 the subtree or vice versa.
1083 A nonprivileged process (i.e., the delegatee) can write the PID of
1084 a "target" process into a
1085 .IR cgroup.procs
1086 file only if all of the following are true:
1087 .IP * 3
1088 The writer has write permission on the
1089 .I cgroup.procs
1090 file in the destination cgroup.
1091 .IP *
1092 The writer has write permission on the
1093 .I cgroup.procs
1094 file in the nearest common ancestor of the source and destination cgroups.
1095 Note that in some cases,
1096 the nearest common ancestor may be the source or destination cgroup itself.
1097 This requirement is not enforced for cgroups v1 hierarchies,
1098 with the consequence that containment in v1 is less strict than in v2.
1099 (For example, in cgroups v1 the user that owns two distinct
1100 delegated subhierarchies can move a process between the hierarchies.)
1101 .IP *
1102 If the cgroup v2 filesystem was mounted with the
1103 .I nsdelegate
1104 option, the writer must be able to see the source and destination cgroups
1105 from its cgroup namespace.
1106 .IP *
1107 In cgroups v1:
1108 the effective UID of the writer (i.e., the delegatee) matches the
1109 real user ID or the saved set-user-ID of the target process.
1110 Before Linux 4.11,
1111 .\" commit 576dd464505fc53d501bb94569db76f220104d28
1112 this requirement also applied in cgroups v2.
1113 (This was a historical requirement inherited from cgroups v1
1114 that was later deemed unnecessary,
1115 since the other rules suffice for containment in cgroups v2.)
1116 .PP
1117 .IR Note :
1118 one consequence of these delegation containment rules is that the
1119 unprivileged delegatee can't place the first process into
1120 the delegated subtree;
1121 instead, the delegater must place the first process
1122 (a process owned by the delegatee) into the delegated subtree.
1123 .\"
1124 .SH CGROUPS VERSION 2 THREAD MODE
1125 Among the restrictions imposed by cgroups v2 that were not present
1126 in cgroups v1 are the following:
1127 .IP * 3
1128 .IR "No thread-granularity control" :
1129 all of the threads of a process must be in the same cgroup.
1130 .IP *
1131 .IR "No internal processes" :
1132 a cgroup can't both have member processes and
1133 exercise controllers on child cgroups.
1134 .PP
1135 Both of these restrictions were added because
1136 their absence had caused problems
1137 in cgroups v1.
1138 In particular, the cgroups v1 ability to allow thread-level granularity
1139 for cgroup membership made no sense for some controllers.
1140 (A notable example was the
1141 .I memory
1142 controller: since threads share an address space,
1143 it made no sense to split threads across different
1144 .I memory
1145 cgroups.)
1146 .PP
1147 Notwithstanding the initial design decision in cgroups v2,
1148 there were use cases for certain controllers, notably the
1149 .IR cpu
1150 controller,
1151 for which thread-level granularity of control was meaningful and useful.
1152 To accommodate such use cases, Linux 4.14 added
1153 .I "thread mode"
1154 for cgroups v2.
1155 .PP
1156 Thread mode allows the following:
1157 .IP * 3
1158 The creation of
1159 .IR "threaded subtrees"
1160 in which the threads of a process may
1161 be spread across cgroups inside the tree.
1162 (A threaded subtree may contain multiple multithreaded processes.)
1163 .IP *
1164 The concept of
1165 .IR "threaded controllers" ,
1166 which can distribute resources across the cgroups in a threaded subtree.
1167 .IP *
1168 A relaxation of the "no internal processes" rule,
1169 so that, within a threaded subtree,
1170 a cgroup can both contain member threads and
1171 exercise resource control over child cgroups.
1172 .PP
1173 With the addition of thread mode,
1174 each nonroot cgroup now contains a new file,
1175 .IR cgroup.type ,
1176 that exposes, and in some circumstances can be used to change,
1177 the "type" of a cgroup.
1178 This file contains one of the following type values:
1179 .TP
1180 .I "domain"
1181 This is a normal v2 cgroup that provides process-granularity control.
1182 If a process is a member of this cgroup,
1183 then all threads of the process are (by definition) in the same cgroup.
1184 This is the default cgroup type,
1185 and provides the same behavior that was provided for
1186 cgroups in the initial cgroups v2 implementation.
1187 .TP
1188 .I "threaded"
1189 This cgroup is a member of a threaded subtree.
1190 Threads can be added to this cgroup,
1191 and controllers can be enabled for the cgroup.
1192 .TP
1193 .I "domain threaded"
1194 This is a domain cgroup that serves as the root of a threaded subtree.
1195 This cgroup type is also known as "threaded root".
1196 .TP
1197 .I "domain invalid"
1198 This is a cgroup inside a threaded subtree
1199 that is in an "invalid" state.
1200 Processes can't be added to the cgroup,
1201 and controllers can't be enabled for the cgroup.
1202 The only thing that can be done with this cgroup (other than deleting it)
1203 is to convert it to a
1204 .IR threaded
1205 cgroup by writing the string
1206 .IR """threaded"""
1207 to the
1208 .I cgroup.type
1209 file.
1210 .IP
1211 The rationale for the existence of this "interim" type
1212 during the creation of a threaded subtree
1213 (rather than the kernel simply immediately converting all cgroups
1214 under the threaded root to the type
1215 .IR threaded )
1216 is to allow for
1217 possible future extensions to the thread mode model.
1218 .\"
1219 .SS Threaded versus domain controllers
1220 With the addition of thread mode,
1221 cgroups v2 now distinguishes two types of resource controllers:
1222 .IP * 3
1223 .I Threaded
1224 .\" In the kernel source, look for ".threaded[ \t]*= true" in
1225 .\" initializations of struct cgroup_subsys
1226 controllers: these controllers support thread-granularity for
1227 resource control and can be enabled inside threaded subtrees,
1228 with the result that the corresponding controller-interface files
1229 appear inside the cgroups in the threaded subtree.
1230 As at Linux 4.19, the following controllers are threaded:
1231 .IR cpu ,
1232 .IR perf_event ,
1233 and
1234 .IR pids .
1235 .IP *
1236 .I Domain
1237 controllers: these controllers support only process granularity
1238 for resource control.
1239 From the perspective of a domain controller,
1240 all threads of a process are always in the same cgroup.
1241 Domain controllers can't be enabled inside a threaded subtree.
1242 .\"
1243 .SS Creating a threaded subtree
1244 There are two pathways that lead to the creation of a threaded subtree.
1245 The first pathway proceeds as follows:
1246 .IP 1. 3
1247 We write the string
1248 .IR """threaded"""
1249 to the
1250 .I cgroup.type
1251 file of a cgroup
1252 .IR y/z
1253 that currently has the type
1254 .IR domain .
1255 This has the following effects:
1256 .RS
1257 .IP * 3
1258 The type of the cgroup
1259 .IR y/z
1260 becomes
1261 .IR threaded .
1262 .IP *
1263 The type of the parent cgroup,
1264 .IR y ,
1265 becomes
1266 .IR "domain threaded" .
1267 The parent cgroup is the root of a threaded subtree
1268 (also known as the "threaded root").
1269 .IP *
1270 All other cgroups under
1271 .IR y
1272 that were not already of type
1273 .IR threaded
1274 (because they were inside already existing threaded subtrees
1275 under the new threaded root)
1276 are converted to type
1277 .IR "domain invalid" .
1278 Any subsequently created cgroups under
1279 .I y
1280 will also have the type
1281 .IR "domain invalid" .
1282 .RE
1283 .IP 2.
1284 We write the string
1285 .IR """threaded"""
1286 to each of the
1287 .IR "domain invalid"
1288 cgroups under
1289 .IR y ,
1290 in order to convert them to the type
1291 .IR threaded .
1292 As a consequence of this step, all cgroups under the threaded root
1293 now have the type
1294 .IR threaded
1295 and the threaded subtree is now fully usable.
1296 The requirement to write
1297 .IR """threaded"""
1298 to each of these cgroups is somewhat cumbersome,
1299 but allows for possible future extensions to the thread-mode model.
1300 .PP
1301 The second way of creating a threaded subtree is as follows:
1302 .IP 1. 3
1303 In an existing cgroup,
1304 .IR z ,
1305 that currently has the type
1306 .IR domain ,
1307 we (1) enable one or more threaded controllers and
1308 (2) make a process a member of
1309 .IR z .
1310 (These two steps can be done in either order.)
1311 This has the following consequences:
1312 .RS
1313 .IP * 3
1314 The type of
1315 .I z
1316 becomes
1317 .IR "domain threaded" .
1318 .IP *
1319 All of the descendant cgroups of
1320 .I z
1321 that were not already of type
1322 .IR threaded
1323 are converted to type
1324 .IR "domain invalid" .
1325 .RE
1326 .IP 2.
1327 As before, we make the threaded subtree usable by writing the string
1328 .IR """threaded"""
1329 to each of the
1330 .IR "domain invalid"
1331 cgroups under
1332 .IR z ,
1333 in order to convert them to the type
1334 .IR threaded .
1335 .PP
1336 One of the consequences of the above pathways to creating a threaded subtree
1337 is that the threaded root cgroup can be a parent only to
1338 .I threaded
1339 (and
1340 .IR "domain invalid" )
1341 cgroups.
1342 The threaded root cgroup can't be a parent of a
1343 .I domain
1344 cgroup, and a
1345 .I threaded
1346 cgroup
1347 can't have a sibling that is a
1348 .I domain
1349 cgroup.
1350 .\"
1351 .SS Using a threaded subtree
1352 Within a threaded subtree, threaded controllers can be enabled
1353 in each subgroup whose type has been changed to
1354 .IR threaded ;
1355 upon doing so, the corresponding controller interface files
1356 appear in the children of that cgroup.
1357 .PP
1358 A process can be moved into a threaded subtree by writing its PID to the
1359 .I cgroup.procs
1360 file in one of the cgroups inside the tree.
1361 This has the effect of making all of the threads
1362 in the process members of the corresponding cgroup
1363 and makes the process a member of the threaded subtree.
1364 The threads of the process can then be spread across
1365 the threaded subtree by writing their thread IDs (see
1366 .BR gettid (2))
1367 to the
1368 .I cgroup.threads
1369 files in different cgroups inside the subtree.
1370 The threads of a process must all reside in the same threaded subtree.
1371 .PP
1372 As with writing to
1373 .IR cgroup.procs ,
1374 some containment rules apply when writing to the
1375 .I cgroup.threads
1376 file:
1377 .IP * 3
1378 The writer must have write permission on the
1379 .I cgroup.threads
1380 file in the destination cgroup.
1381 .IP *
1382 The writer must have write permission on the
1383 .I cgroup.procs
1384 file in the common ancestor of the source and destination cgroups.
1385 (In some cases,
1386 the common ancestor may be the source or destination cgroup itself.)
1387 .IP *
1388 The source and destination cgroups must be in the same threaded subtree.
1389 (Outside a threaded subtree, an attempt to move a thread by writing
1390 its thread ID to the
1391 .I cgroup.threads
1392 file in a different
1393 .I domain
1394 cgroup fails with the error
1395 .BR EOPNOTSUPP .)
1396 .PP
1397 The
1398 .I cgroup.threads
1399 file is present in each cgroup (including
1400 .I domain
1401 cgroups) and can be read in order to discover the set of threads
1402 that is present in the cgroup.
1403 The set of thread IDs obtained when reading this file
1404 is not guaranteed to be ordered or free of duplicates.
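.PP
Because the list is not guaranteed to be ordered or free of duplicates,
a reader may want to normalize it.
One possible approach, using a hypothetical helper named
.IR cgroup_threads ,
is sketched here:
.PP
.in +4n
.EX
# Print the thread IDs listed in the cgroup.threads file $1,
# sorted numerically and with duplicates removed.
cgroup_threads() {
    sort -n -u "$1"
}

# Hypothetical usage (the cgroup path is an assumption):
# cgroup_threads /sys/fs/cgroup/unified/grp/cgroup.threads
.EE
.in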
1405 .PP
1406 The
1407 .I cgroup.procs
1408 file in the threaded root shows the PIDs of all processes
1409 that are members of the threaded subtree.
1410 The
1411 .I cgroup.procs
1412 files in the other cgroups in the subtree are not readable.
1413 .PP
1414 Domain controllers can't be enabled in a threaded subtree;
1415 no controller-interface files appear inside the cgroups underneath the
1416 threaded root.
1417 From the point of view of a domain controller,
1418 threaded subtrees are invisible:
1419 a multithreaded process inside a threaded subtree appears to a domain
1420 controller as a process that resides in the threaded root cgroup.
1421 .PP
1422 Within a threaded subtree, the "no internal processes" rule does not apply:
1423 a cgroup can both contain member processes (or threads)
1424 and exercise controllers on child cgroups.
1425 .\"
1426 .SS Rules for writing to cgroup.type and creating threaded subtrees
1427 A number of rules apply when writing to the
1428 .I cgroup.type
1429 file:
1430 .IP * 3
1431 Only the string
1432 .IR """threaded"""
1433 may be written.
1434 In other words, the only explicit transition that is possible is to convert a
1435 .I domain
1436 cgroup to type
1437 .IR threaded .
1438 .IP *
1439 The effect of writing
1440 .IR """threaded"""
1441 depends on the current value in
1442 .IR cgroup.type ,
1443 as follows:
1444 .RS
1445 .IP \(bu 3
1446 .IR domain
1447 or
1448 .IR "domain threaded" :
1449 start the creation of a threaded subtree
1450 (whose root is the parent of this cgroup) via
1451 the first of the pathways described above;
1452 .IP \(bu
1453 .IR "domain\ invalid" :
1454 convert this cgroup (which is inside a threaded subtree) to a usable (i.e.,
1455 .IR threaded )
1456 state;
1457 .IP \(bu
1458 .IR threaded :
1459 no effect (a "no-op").
1460 .RE
1461 .IP *
1462 We can't write to a
1463 .I cgroup.type
1464 file if the parent's type is
1465 .IR "domain invalid" .
1466 In other words, the cgroups of a threaded subtree must be converted to the
1467 .I threaded
1468 state in a top-down manner.
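.PP
The top-down conversion can be sketched in the shell as follows.
The function name
.I convert_invalid_to_threaded
is hypothetical,
and the sketch relies on the fact that
.BR find (1)
lists a directory before the subdirectories that it contains.
.PP
.in +4n
.EX
# Convert every "domain invalid" cgroup below directory $1 to
# "threaded".  find prints each directory before its children,
# so parents are converted before their descendants.
convert_invalid_to_threaded() {
    find "$1" -mindepth 1 -type d | while IFS= read -r dir; do
        if [ "$(cat "$dir/cgroup.type")" = "domain invalid" ]; then
            echo threaded > "$dir/cgroup.type"
        fi
    done
}

# Hypothetical usage, where y is the threaded root
# (the path is an assumption):
# convert_invalid_to_threaded /sys/fs/cgroup/unified/y
.EE
.in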
1469 .PP
1470 There are also some constraints that must be satisfied
1471 in order to create a threaded subtree rooted at the cgroup
1472 .IR x :
1473 .IP * 3
1474 There can be no member processes in the descendant cgroups of
1475 .IR x .
1476 (The cgroup
1477 .I x
1478 can itself have member processes.)
1479 .IP *
1480 No domain controllers may be enabled in
1481 .IR x 's
1482 .IR cgroup.subtree_control
1483 file.
1484 .PP
1485 If any of the above constraints is violated, then an attempt to write
1486 .IR """threaded"""
1487 to a
1488 .IR cgroup.type
1489 file fails with the error
1490 .BR ENOTSUP .
1491 .\"
1492 .SS The """domain threaded""" cgroup type
1493 According to the pathways described above,
1494 the type of a cgroup can change to
1495 .IR "domain threaded"
1496 in either of the following cases:
1497 .IP * 3
1498 The string
1499 .IR """threaded"""
1500 is written to a child cgroup.
1501 .IP *
1502 A threaded controller is enabled inside the cgroup and
1503 a process is made a member of the cgroup.
1504 .PP
1505 A
1506 .IR "domain threaded"
1507 cgroup,
1508 .IR x ,
1509 can revert to the type
1510 .IR domain
1511 if the above conditions no longer hold true\(emthat is, if all
1512 .I threaded
1513 child cgroups of
1514 .I x
1515 are removed and either
1516 .I x
1517 no longer has threaded controllers enabled or
1518 no longer has member processes.
1519 .PP
1520 When a
1521 .IR "domain threaded"
1522 cgroup
1523 .IR x
1524 reverts to the type
1525 .IR domain :
1526 .IP * 3
1527 All
1528 .IR "domain invalid"
1529 descendants of
1530 .I x
1531 that are not in lower-level threaded subtrees revert to the type
1532 .IR domain .
1533 .IP *
1534 The root cgroups in any lower-level threaded subtrees revert to the type
1535 .IR "domain threaded" .
1536 .\"
1537 .SS Exceptions for the root cgroup
1538 The root cgroup of the v2 hierarchy is treated exceptionally:
1539 it can be the parent of both
1540 .I domain
1541 and
1542 .I threaded
1543 cgroups.
1544 If the string
1545 .I """threaded"""
1546 is written to the
1547 .I cgroup.type
1548 file of one of the children of the root cgroup, then
1549 .IP * 3
1550 The type of that cgroup becomes
1551 .IR threaded .
1552 .IP *
1553 The type of any descendants of that cgroup that
1554 are not part of lower-level threaded subtrees changes to
1555 .IR "domain invalid" .
1556 .PP
1557 Note that in this case, there is no cgroup whose type becomes
1558 .IR "domain threaded" .
1559 (Notionally, the root cgroup can be considered as the threaded root
1560 for the cgroup whose type was changed to
1561 .IR threaded .)
1562 .PP
1563 The aim of this exceptional treatment for the root cgroup is to
1564 allow a threaded cgroup that employs the
1565 .I cpu
1566 controller to be placed as high as possible in the hierarchy,
1567 so as to minimize the (small) cost of traversing the cgroup hierarchy.
1568 .\"
1569 .SS The cgroups v2 """cpu""" controller and realtime threads
1570 As at Linux 4.19, the cgroups v2
1571 .I cpu
1572 controller does not support control of realtime threads
1573 (specifically threads scheduled under any of the policies
1574 .BR SCHED_FIFO ,
1575 .BR SCHED_RR ,
1576 and
1577 .BR SCHED_DEADLINE ;
1578 see
1579 .BR sched (7)).
1580 Therefore, the
1581 .I cpu
1582 controller can be enabled in the root cgroup only
1583 if all realtime threads are in the root cgroup.
1584 (If there are realtime threads in nonroot cgroups, then a
1585 .BR write (2)
1586 of the string
1587 .IR """+cpu"""
1588 to the
1589 .I cgroup.subtree_control
1590 file fails with the error
1591 .BR EINVAL .)
1592 .PP
1593 On some systems,
1594 .BR systemd (1)
1595 places certain realtime threads in nonroot cgroups in the v2 hierarchy.
1596 On such systems,
1597 these threads must first be moved to the root cgroup before the
1598 .I cpu
1599 controller can be enabled.
1600 .\"
1601 .SH ERRORS
1602 The following errors can occur for
1603 .BR mount (2):
1604 .TP
1605 .B EBUSY
1606 An attempt to mount a cgroup version 1 filesystem specified neither the
1607 .I name=
1608 option (to mount a named hierarchy) nor a controller name (or
1609 .IR all ).
1610 .SH NOTES
1611 A child process created via
1612 .BR fork (2)
1613 inherits its parent's cgroup memberships.
1614 A process's cgroup memberships are preserved across
1615 .BR execve (2).
1616 .\"
1617 .SS /proc files
1618 .TP
1619 .IR /proc/cgroups " (since Linux 2.6.24)"
1620 This file contains information about the controllers
1621 that are compiled into the kernel.
1622 An example of the contents of this file (reformatted for readability)
1623 is the following:
1624 .IP
1625 .in +4n
1626 .EX
1627 #subsys_name hierarchy num_cgroups enabled
1628 cpuset 4 1 1
1629 cpu 8 1 1
1630 cpuacct 8 1 1
1631 blkio 6 1 1
1632 memory 3 1 1
1633 devices 10 84 1
1634 freezer 7 1 1
1635 net_cls 9 1 1
1636 perf_event 5 1 1
1637 net_prio 9 1 1
1638 hugetlb 0 1 0
1639 pids 2 1 1
1640 .EE
1641 .in
1642 .IP
1643 The fields in this file are, from left to right:
1644 .RS
1645 .IP 1. 3
1646 The name of the controller.
1647 .IP 2.
1648 The unique ID of the cgroup hierarchy on which this controller is mounted.
1649 If multiple cgroups v1 controllers are bound to the same hierarchy,
1650 then each will show the same hierarchy ID in this field.
1651 The value in this field will be 0 if:
1652 .RS 5
1653 .IP a) 3
1654 the controller is not mounted on a cgroups v1 hierarchy;
1655 .IP b)
1656 the controller is bound to the cgroups v2 single unified hierarchy; or
1657 .IP c)
1658 the controller is disabled (see below).
1659 .RE
1660 .IP 3.
1661 The number of control groups in this hierarchy using this controller.
1662 .IP 4.
1663 This field contains the value 1 if this controller is enabled,
1664 or 0 if it has been disabled (via the
1665 .IR cgroup_disable
1666 kernel command-line boot parameter).
1667 .RE
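.IP
As an illustration of the field layout,
the following sketch prints the names of the controllers whose
fourth field is 1.
The function name
.I enabled_controllers
is hypothetical.
.IP
.in +4n
.EX
# Print the names of the controllers marked as enabled in file $1
# (normally /proc/cgroups).  The header line starts with "#".
enabled_controllers() {
    while read -r name hierarchy num_cgroups enabled; do
        case "$name" in
            "#"*) continue ;;
        esac
        if [ "$enabled" = "1" ]; then
            echo "$name"
        fi
    done < "$1"
}

# Usage:
# enabled_controllers /proc/cgroups
.EE
.in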
1668 .TP
1669 .IR /proc/[pid]/cgroup " (since Linux 2.6.24)"
1670 This file describes control groups to which the process
1671 with the corresponding PID belongs.
1672 The displayed information differs for
1673 cgroups version 1 and version 2 hierarchies.
1674 .IP
1675 For each cgroup hierarchy of which the process is a member,
1676 there is one entry containing three colon-separated fields:
1677 .IP
1678 .in +4n
1679 .EX
1680 hierarchy-ID:controller-list:cgroup-path
1681 .EE
1682 .in
1683 .IP
1684 For example:
1685 .IP
1686 .in +4n
1687 .EX
1688 5:cpuacct,cpu,cpuset:/daemons
1689 .EE
1690 .in
1691 .IP
1692 The colon-separated fields are, from left to right:
1693 .RS
1694 .IP 1. 3
1695 For cgroups version 1 hierarchies,
1696 this field contains a unique hierarchy ID number
1697 that can be matched to a hierarchy ID in
1698 .IR /proc/cgroups .
1699 For the cgroups version 2 hierarchy, this field contains the value 0.
1700 .IP 2.
1701 For cgroups version 1 hierarchies,
1702 this field contains a comma-separated list of the controllers
1703 bound to the hierarchy.
1704 For the cgroups version 2 hierarchy, this field is empty.
1705 .IP 3.
1706 This field contains the pathname of the control group in the hierarchy
1707 to which the process belongs.
1708 This pathname is relative to the mount point of the hierarchy.
1709 .RE
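.IP
Since the cgroups version 2 entry has the hierarchy ID 0
and an empty controller list,
its line begins with the string "0::".
A minimal sketch that extracts the version 2 cgroup path on that basis
follows; the function name
.I cgroup_v2_path
is hypothetical.
.IP
.in +4n
.EX
# Print the cgroups v2 path recorded in file $1, which is expected
# to be in the format of /proc/[pid]/cgroup.
cgroup_v2_path() {
    sed -n "s/^0:://p" "$1"
}

# Usage, for the calling shell:
# cgroup_v2_path /proc/self/cgroup
.EE
.in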
1710 .\"
1711 .SS /sys/kernel/cgroup files
1712 .TP
1713 .IR /sys/kernel/cgroup/delegate " (since Linux 4.15)"
1714 .\" commit 01ee6cfb1483fe57c9cbd8e73817dfbf9bacffd3
1715 This file exports a list of the cgroups v2 files
1716 (one per line) that are delegatable
1717 (i.e., whose ownership should be changed to the user ID of the delegatee).
1718 In the future, the set of delegatable files may change or grow,
1719 and this file provides a way for the kernel to inform
1720 user-space applications of which files must be delegated.
1721 As at Linux 4.15, one sees the following when inspecting this file:
1722 .IP
1723 .EX
1724 .in +4n
1725 $ \fBcat /sys/kernel/cgroup/delegate\fP
1726 cgroup.procs
1727 cgroup.subtree_control
1728 cgroup.threads
1729 .in
1730 .EE
1731 .TP
1732 .IR /sys/kernel/cgroup/features " (since Linux 4.15)"
1733 .\" commit 5f2e673405b742be64e7c3604ed4ed3ac14f35ce
1734 Over time, the set of cgroups v2 features that are provided by the
1735 kernel may change or grow,
1736 or some features may not be enabled by default.
1737 This file provides a way for user-space applications to discover what
1738 features the running kernel supports and has enabled.
1739 Features are listed one per line:
1740 .IP
1741 .in +4n
1742 .EX
1743 $ \fBcat /sys/kernel/cgroup/features\fP
1744 nsdelegate
1745 .EE
1746 .in
1747 .IP
1748 The entries that can appear in this file are:
1749 .RS
1750 .TP
1751 .IR nsdelegate " (since Linux 4.15)"
1752 The kernel supports the
1753 .I nsdelegate
1754 mount option.
1755 .RE
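.IP
A script could test for a feature before relying on it,
for example as in the following sketch.
The function name
.I cgroup_has_feature
and its optional second argument are assumptions made for illustration.
.IP
.in +4n
.EX
# Succeed if feature $1 appears on a line of its own in file $2
# (default: /sys/kernel/cgroup/features).
cgroup_has_feature() {
    grep -q -x -F -- "$1" "${2:-/sys/kernel/cgroup/features}"
}

# Usage:
# if cgroup_has_feature nsdelegate; then
#     echo "nsdelegate mount option is supported"
# fi
.EE
.in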
1756 .SH SEE ALSO
1757 .BR prlimit (1),
1758 .BR systemd (1),
1759 .BR systemd-cgls (1),
1760 .BR systemd-cgtop (1),
1761 .BR clone (2),
1762 .BR ioprio_set (2),
1763 .BR perf_event_open (2),
1764 .BR setrlimit (2),
1765 .BR cgroup_namespaces (7),
1766 .BR cpuset (7),
1767 .BR namespaces (7),
1768 .BR sched (7),
1769 .BR user_namespaces (7)