git.ipfire.org Git - thirdparty/man-pages.git/blob - man7/cgroups.7
1 .\" Copyright (C) 2015 Serge Hallyn <serge@hallyn.com>
2 .\" and Copyright (C) 2016, 2017 Michael Kerrisk <mtk.manpages@gmail.com>
3 .\"
4 .\" SPDX-License-Identifier: Linux-man-pages-copyleft
5 .\"
6 .TH CGROUPS 7 2021-08-27 "Linux man-pages (unreleased)"
7 .SH NAME
8 cgroups \- Linux control groups
9 .SH DESCRIPTION
10 Control groups, usually referred to as cgroups,
11 are a Linux kernel feature that allows processes to
12 be organized into hierarchical groups whose usage of
13 various types of resources can then be limited and monitored.
14 The kernel's cgroup interface is provided through
15 a pseudo-filesystem called cgroupfs.
16 Grouping is implemented in the core cgroup kernel code,
17 while resource tracking and limits are implemented in
18 a set of per-resource-type subsystems (memory, CPU, and so on).
19 .\"
20 .SS Terminology
21 A
22 .I cgroup
23 is a collection of processes that are bound to a set of
24 limits or parameters defined via the cgroup filesystem.
25 .PP
26 A
27 .I subsystem
28 is a kernel component that modifies the behavior of
29 the processes in a cgroup.
30 Various subsystems have been implemented, making it possible to do things
31 such as limiting the amount of CPU time and memory available to a cgroup,
32 accounting for the CPU time used by a cgroup,
33 and freezing and resuming execution of the processes in a cgroup.
34 Subsystems are sometimes also known as
35 .I resource controllers
36 (or simply, controllers).
37 .PP
38 The cgroups for a controller are arranged in a
39 .IR hierarchy .
40 This hierarchy is defined by creating, removing, and
41 renaming subdirectories within the cgroup filesystem.
42 At each level of the hierarchy, attributes (e.g., limits) can be defined.
43 The limits, control, and accounting provided by cgroups generally have
44 effect throughout the subhierarchy underneath the cgroup where the
45 attributes are defined.
46 Thus, for example, the limits placed on
47 a cgroup at a higher level in the hierarchy cannot be exceeded
48 by descendant cgroups.
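.PP
As a rough illustration of how a limit set higher in the hierarchy bounds
the cgroups below it, the following sketch walks from a cgroup directory up
to the root of its hierarchy and reports the tightest \fIpids.max\fP value
it finds.
The function name is hypothetical, and the sketch assumes that the
\fIpids\fP controller is in use, that each \fIpids.max\fP file holds either
a number or the word "max", and that neither pathname has a trailing slash.

```shell
# Print the tightest pids.max value found between cgroup directory $1 and
# the hierarchy root directory $2 (inclusive). "max" means that no numeric
# limit was found anywhere on that path.
effective_pids_max() {
    dir=$1
    root=$2
    best=max
    while :; do
        if [ -r "$dir/pids.max" ]; then
            value=$(cat "$dir/pids.max")
            if [ "$value" != max ]; then
                if [ "$best" = max ] || [ "$value" -lt "$best" ]; then
                    best=$value
                fi
            fi
        fi
        [ "$dir" = "$root" ] && break
        parent=$(dirname "$dir")
        # Stop if we cannot climb any further ($2 was not an ancestor).
        [ "$parent" = "$dir" ] && break
        dir=$parent
    done
    echo "$best"
}
```

The kernel enforces the limits itself; the sketch only reports which
ancestor limit would be reached first.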
49 .\"
50 .SS Cgroups version 1 and version 2
51 The initial release of the cgroups implementation was in Linux 2.6.24.
52 Over time, various cgroup controllers have been added
53 to allow the management of various types of resources.
54 However, the development of these controllers was largely uncoordinated,
55 with the result that many inconsistencies arose between controllers
56 and management of the cgroup hierarchies became rather complex.
57 A longer description of these problems can be found in the kernel
58 source file
59 .I Documentation/admin\-guide/cgroup\-v2.rst
60 (or
61 .I Documentation/cgroup\-v2.txt
62 in Linux 4.17 and earlier).
63 .PP
64 Because of the problems with the initial cgroups implementation
65 (cgroups version 1),
66 starting in Linux 3.10, work began on a new,
67 orthogonal implementation to remedy these problems.
68 Initially marked experimental, and hidden behind the
69 .I "\-o\ __DEVEL__sane_behavior"
70 mount option, the new version (cgroups version 2)
71 was eventually made official with the release of Linux 4.5.
72 Differences between the two versions are described in the text below.
73 The file
74 .IR cgroup.sane_behavior ,
75 present in cgroups v1, is a relic of this mount option.
76 The file always reports "0" and is only retained for backward compatibility.
77 .PP
78 Although cgroups v2 is intended as a replacement for cgroups v1,
79 the older system continues to exist
80 (and for compatibility reasons is unlikely to be removed).
81 Currently, cgroups v2 implements only a subset of the controllers
82 available in cgroups v1.
83 The two systems are implemented so that both v1 controllers and
84 v2 controllers can be mounted on the same system.
85 Thus, for example, it is possible to use those controllers
86 that are supported under version 2,
87 while also using version 1 controllers
88 where version 2 does not yet support those controllers.
89 The only restriction here is that a controller can't be simultaneously
90 employed in both a cgroups v1 hierarchy and in the cgroups v2 hierarchy.
91 .\"
92 .SH CGROUPS VERSION 1
93 Under cgroups v1, each controller may be mounted against a separate
94 cgroup filesystem that provides its own hierarchical organization of the
95 processes on the system.
96 It is also possible to comount multiple (or even all) cgroups v1 controllers
97 against the same cgroup filesystem, meaning that the comounted controllers
98 manage the same hierarchical organization of processes.
99 .PP
100 For each mounted hierarchy,
101 the directory tree mirrors the control group hierarchy.
102 Each control group is represented by a directory, with each of its child
103 cgroups represented as a child directory.
104 For instance,
105 .I /user/joe/1.session
106 represents control group
107 .IR 1.session ,
108 which is a child of cgroup
109 .IR joe ,
110 which is a child of
111 .IR /user .
112 Under each cgroup directory is a set of files which can be read or
113 written to, reflecting resource limits and a few general cgroup
114 properties.
115 .\"
116 .SS Tasks (threads) versus processes
117 In cgroups v1, a distinction is drawn between
118 .I processes
119 and
120 .IR tasks .
121 In this view, a process can consist of multiple tasks
122 (more commonly called threads, from a user-space perspective,
123 and called such in the remainder of this man page).
124 In cgroups v1, it is possible to independently manipulate
125 the cgroup memberships of the threads in a process.
126 .PP
127 The cgroups v1 ability to split threads across different cgroups
128 caused problems in some cases.
129 For example, it made no sense for the
130 .I memory
131 controller,
132 since all of the threads of a process share a single address space.
133 Because of these problems,
134 the ability to independently manipulate the cgroup memberships
135 of the threads in a process was removed in the initial cgroups v2
136 implementation, and subsequently restored in a more limited form
137 (see the discussion of "thread mode" below).
138 .\"
139 .SS Mounting v1 controllers
140 The use of cgroups requires a kernel built with the
141 .B CONFIG_CGROUPS
142 option.
143 In addition, each of the v1 controllers has an associated
144 configuration option that must be set in order to employ that controller.
145 .PP
146 In order to use a v1 controller,
147 it must be mounted against a cgroup filesystem.
148 The usual place for such mounts is under a
149 .BR tmpfs (5)
150 filesystem mounted at
151 .IR /sys/fs/cgroup .
152 Thus, one might mount the
153 .I cpu
154 controller as follows:
155 .PP
156 .in +4n
157 .EX
158 mount \-t cgroup \-o cpu none /sys/fs/cgroup/cpu
159 .EE
160 .in
161 .PP
162 It is possible to comount multiple controllers against the same hierarchy.
163 For example, here the
164 .I cpu
165 and
166 .I cpuacct
167 controllers are comounted against a single hierarchy:
168 .PP
169 .in +4n
170 .EX
171 mount \-t cgroup \-o cpu,cpuacct none /sys/fs/cgroup/cpu,cpuacct
172 .EE
173 .in
174 .PP
175 Comounting controllers has the effect that a process is in the same cgroup for
176 all of the comounted controllers.
177 Separately mounting controllers allows a process to
178 be in cgroup
179 .I /foo1
180 for one controller while being in
181 .I /foo2/foo3
182 for another.
183 .PP
184 It is possible to comount all v1 controllers against the same hierarchy:
185 .PP
186 .in +4n
187 .EX
188 mount \-t cgroup \-o all cgroup /sys/fs/cgroup
189 .EE
190 .in
191 .PP
192 (One can achieve the same result by omitting
193 .IR "\-o all" ,
194 since it is the default if no controllers are explicitly specified.)
195 .PP
196 It is not possible to mount the same controller
197 against multiple cgroup hierarchies.
198 For example, it is not possible to mount both the
199 .I cpu
200 and
201 .I cpuacct
202 controllers against one hierarchy, and to mount the
203 .I cpu
204 controller alone against another hierarchy.
205 It is possible to create multiple mounts with exactly
206 the same set of comounted controllers.
207 However, in this case all that results is multiple mount points
208 providing a view of the same hierarchy.
209 .PP
210 Note that on many systems, the v1 controllers are automatically mounted under
211 .IR /sys/fs/cgroup ;
212 in particular,
213 .BR systemd (1)
214 automatically creates such mounts.
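.PP
To see which cgroup filesystems are already mounted, one can scan the
mount table.
The following is a minimal sketch that assumes the usual
\fI/proc/mounts\fP field layout (device, mount point, filesystem type,
options); the function name is hypothetical.

```shell
# Print "<fstype> <mount point>" for every cgroup v1 ("cgroup") or
# cgroup v2 ("cgroup2") entry in a file in /proc/mounts format.
# $1 names the file to scan; it defaults to /proc/mounts.
list_cgroup_mounts() {
    awk '$3 == "cgroup" || $3 == "cgroup2" { print $3, $2 }' \
        "${1:-/proc/mounts}"
}
```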
215 .\"
216 .SS Unmounting v1 controllers
217 A mounted cgroup filesystem can be unmounted using the
218 .BR umount (8)
219 command, as in the following example:
220 .PP
221 .in +4n
222 .EX
223 umount /sys/fs/cgroup/pids
224 .EE
225 .in
226 .PP
227 .IR "But note well" :
228 a cgroup filesystem is unmounted only if it is not busy,
229 that is, it has no child cgroups.
230 If this is not the case, then the only effect of the
231 .BR umount (8)
232 is to make the mount invisible.
233 Thus, to ensure that the mount is really removed,
234 one must first remove all child cgroups,
235 which in turn can be done only after all member processes
236 have been moved from those cgroups to the root cgroup.
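.PP
Before unmounting, it can be useful to check whether the hierarchy still
has child cgroups.
The sketch below does this by looking for subdirectories; the function
name is hypothetical.

```shell
# Succeed (exit status 0) if cgroup directory $1 contains at least one
# child cgroup, that is, at least one subdirectory.
# Names beginning with a dot are not matched by the glob.
has_child_cgroups() {
    for entry in "$1"/*/; do
        [ -d "$entry" ] && return 0
    done
    return 1
}
```

A caller might run umount(8) only when has_child_cgroups fails for the
mount point.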
237 .\"
238 .SS Cgroups version 1 controllers
239 Each of the cgroups version 1 controllers is governed
240 by a kernel configuration option (listed below).
241 Additionally, the availability of the cgroups feature is governed by the
242 .B CONFIG_CGROUPS
243 kernel configuration option.
244 .TP
245 .IR cpu " (since Linux 2.6.24; " \fBCONFIG_CGROUP_SCHED\fP )
246 Cgroups can be guaranteed a minimum number of "CPU shares"
247 when a system is busy.
248 This does not limit a cgroup's CPU usage if the CPUs are not busy.
249 For further information, see
250 .I Documentation/scheduler/sched\-design\-CFS.rst
251 (or
252 .I Documentation/scheduler/sched\-design\-CFS.txt
253 in Linux 5.2 and earlier).
254 .IP
255 In Linux 3.2,
256 this controller was extended to provide CPU "bandwidth" control.
257 If the kernel is configured with
258 .BR CONFIG_CFS_BANDWIDTH ,
259 then within each scheduling period
260 (defined via a file in the cgroup directory), it is possible to define
261 an upper limit on the CPU time allocated to the processes in a cgroup.
262 This upper limit applies even if there is no other competition for the CPU.
263 Further information can be found in the kernel source file
264 .I Documentation/scheduler/sched\-bwc.rst
265 (or
266 .I Documentation/scheduler/sched\-bwc.txt
267 in Linux 5.2 and earlier).
268 .TP
269 .IR cpuacct " (since Linux 2.6.24; " \fBCONFIG_CGROUP_CPUACCT\fP )
270 This provides accounting for CPU usage by groups of processes.
271 .IP
272 Further information can be found in the kernel source file
273 .I Documentation/admin\-guide/cgroup\-v1/cpuacct.rst
274 (or
275 .I Documentation/cgroup\-v1/cpuacct.txt
276 in Linux 5.2 and earlier).
277 .TP
278 .IR cpuset " (since Linux 2.6.24; " \fBCONFIG_CPUSETS\fP )
279 This cgroup can be used to bind the processes in a cgroup to
280 a specified set of CPUs and NUMA nodes.
281 .IP
282 Further information can be found in the kernel source file
283 .I Documentation/admin\-guide/cgroup\-v1/cpusets.rst
284 (or
285 .I Documentation/cgroup\-v1/cpusets.txt
286 in Linux 5.2 and earlier).
287 .
288 .TP
289 .IR memory " (since Linux 2.6.25; " \fBCONFIG_MEMCG\fP )
290 The memory controller supports reporting and limiting of process memory, kernel
291 memory, and swap used by cgroups.
292 .IP
293 Further information can be found in the kernel source file
294 .I Documentation/admin\-guide/cgroup\-v1/memory.rst
295 (or
296 .I Documentation/cgroup\-v1/memory.txt
297 in Linux 5.2 and earlier).
298 .TP
299 .IR devices " (since Linux 2.6.26; " \fBCONFIG_CGROUP_DEVICE\fP )
300 This supports controlling which processes may create (mknod) devices as
301 well as open them for reading or writing.
302 The policies may be specified as allow-lists and deny-lists.
303 Hierarchy is enforced, so new rules must not
304 violate existing rules for the target or ancestor cgroups.
305 .IP
306 Further information can be found in the kernel source file
307 .I Documentation/admin\-guide/cgroup\-v1/devices.rst
308 (or
309 .I Documentation/cgroup\-v1/devices.txt
310 in Linux 5.2 and earlier).
311 .TP
312 .IR freezer " (since Linux 2.6.28; " \fBCONFIG_CGROUP_FREEZER\fP )
313 The
314 .I freezer
315 cgroup can suspend and restore (resume) all processes in a cgroup.
316 Freezing a cgroup
317 .I /A
318 also causes its children, for example, processes in
319 .IR /A/B ,
320 to be frozen.
321 .IP
322 Further information can be found in the kernel source file
323 .I Documentation/admin\-guide/cgroup\-v1/freezer\-subsystem.rst
324 (or
325 .I Documentation/cgroup\-v1/freezer\-subsystem.txt
326 in Linux 5.2 and earlier).
327 .TP
328 .IR net_cls " (since Linux 2.6.29; " \fBCONFIG_CGROUP_NET_CLASSID\fP )
329 This places a classid, specified for the cgroup, on network packets
330 created by a cgroup.
331 These classids can then be used in firewall rules,
332 as well as used to shape traffic using
333 .BR tc (8).
334 This applies only to packets
335 leaving the cgroup, not to traffic arriving at the cgroup.
336 .IP
337 Further information can be found in the kernel source file
338 .I Documentation/admin\-guide/cgroup\-v1/net_cls.rst
339 (or
340 .I Documentation/cgroup\-v1/net_cls.txt
341 in Linux 5.2 and earlier).
342 .TP
343 .IR blkio " (since Linux 2.6.33; " \fBCONFIG_BLK_CGROUP\fP )
344 The
345 .I blkio
346 cgroup controls and limits access to specified block devices by
347 applying IO control in the form of throttling and upper limits against leaf
348 nodes and intermediate nodes in the storage hierarchy.
349 .IP
350 Two policies are available.
351 The first is a proportional-weight time-based division
352 of disk implemented with CFQ.
353 This is in effect for leaf nodes using CFQ.
354 The second is a throttling policy which specifies
355 upper I/O rate limits on a device.
356 .IP
357 Further information can be found in the kernel source file
358 .I Documentation/admin\-guide/cgroup\-v1/blkio\-controller.rst
359 (or
360 .I Documentation/cgroup\-v1/blkio\-controller.txt
361 in Linux 5.2 and earlier).
362 .TP
363 .IR perf_event " (since Linux 2.6.39; " \fBCONFIG_CGROUP_PERF\fP )
364 This controller allows
365 .I perf
366 monitoring of the set of processes grouped in a cgroup.
367 .IP
368 Further information can be found in the kernel source files
369 .TP
370 .IR net_prio " (since Linux 3.3; " \fBCONFIG_CGROUP_NET_PRIO\fP )
371 This allows priorities to be specified, per network interface, for cgroups.
372 .IP
373 Further information can be found in the kernel source file
374 .I Documentation/admin\-guide/cgroup\-v1/net_prio.rst
375 (or
376 .I Documentation/cgroup\-v1/net_prio.txt
377 in Linux 5.2 and earlier).
378 .TP
379 .IR hugetlb " (since Linux 3.5; " \fBCONFIG_CGROUP_HUGETLB\fP )
380 This supports limiting the use of huge pages by cgroups.
381 .IP
382 Further information can be found in the kernel source file
383 .I Documentation/admin\-guide/cgroup\-v1/hugetlb.rst
384 (or
385 .I Documentation/cgroup\-v1/hugetlb.txt
386 in Linux 5.2 and earlier).
387 .TP
388 .IR pids " (since Linux 4.3; " \fBCONFIG_CGROUP_PIDS\fP )
389 This controller permits limiting the number of processes that may be created
390 in a cgroup (and its descendants).
391 .IP
392 Further information can be found in the kernel source file
393 .I Documentation/admin\-guide/cgroup\-v1/pids.rst
394 (or
395 .I Documentation/cgroup\-v1/pids.txt
396 in Linux 5.2 and earlier).
397 .TP
398 .IR rdma " (since Linux 4.11; " \fBCONFIG_CGROUP_RDMA\fP )
399 The RDMA controller permits limiting the use of
400 RDMA/IB-specific resources per cgroup.
401 .IP
402 Further information can be found in the kernel source file
403 .I Documentation/admin\-guide/cgroup\-v1/rdma.rst
404 (or
405 .I Documentation/cgroup\-v1/rdma.txt
406 in Linux 5.2 and earlier).
407 .\"
408 .SS Creating cgroups and moving processes
409 A cgroup filesystem initially contains a single root cgroup, '/',
410 which all processes belong to.
411 A new cgroup is created by creating a directory in the cgroup filesystem:
412 .PP
413 .in +4n
414 .EX
415 mkdir /sys/fs/cgroup/cpu/cg1
416 .EE
417 .in
418 .PP
419 This creates a new empty cgroup.
420 .PP
421 A process may be moved to this cgroup by writing its PID into the cgroup's
422 .I cgroup.procs
423 file:
424 .PP
425 .in +4n
426 .EX
427 echo $$ > /sys/fs/cgroup/cpu/cg1/cgroup.procs
428 .EE
429 .in
430 .PP
431 Only one PID at a time should be written to this file.
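.PP
Moving several processes therefore takes one write per PID.
A minimal sketch, with a hypothetical function name, might look like this:

```shell
# Move each PID given after the first argument into the cgroup whose
# directory is $1, issuing a separate write for every PID.
# Fails if the directory has no cgroup.procs file or if a write fails.
move_pids_to_cgroup() {
    cgroup_dir=$1
    shift
    [ -e "$cgroup_dir/cgroup.procs" ] || return 1
    for pid in "$@"; do
        echo "$pid" > "$cgroup_dir/cgroup.procs" || return 1
    done
}
```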
432 .PP
433 Writing the value 0 to a
434 .I cgroup.procs
435 file causes the writing process to be moved to the corresponding cgroup.
436 .PP
437 When a PID is written into a
438 .I cgroup.procs
439 file, all threads in the process are moved into the new cgroup at once.
440 .PP
441 Within a hierarchy, a process can be a member of exactly one cgroup.
442 Writing a process's PID to a
443 .I cgroup.procs
444 file automatically removes it from the cgroup of
445 which it was previously a member.
446 .PP
447 The
448 .I cgroup.procs
449 file can be read to obtain a list of the processes that are
450 members of a cgroup.
451 The returned list of PIDs is not guaranteed to be in order.
452 Nor is it guaranteed to be free of duplicates.
453 (For example, a PID may be recycled while reading from the list.)
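.PP
A reader that needs an ordered, duplicate-free list can normalize the
output itself.
The following sketch (the function name is hypothetical) uses sort(1):

```shell
# Print the member PIDs of cgroup directory $1 in ascending numeric
# order, with duplicate entries removed.
list_cgroup_pids() {
    sort -n -u "$1/cgroup.procs"
}
```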
454 .PP
455 In cgroups v1, an individual thread can be moved to
456 another cgroup by writing its thread ID
457 (i.e., the kernel thread ID returned by
458 .BR clone (2)
459 and
460 .BR gettid (2))
461 to the
462 .I tasks
463 file in a cgroup directory.
464 This file can be read to discover the set of threads
465 that are members of the cgroup.
466 .\"
467 .SS Removing cgroups
468 To remove a cgroup,
469 it must first have no child cgroups and contain no (nonzombie) processes.
470 So long as that is the case, one can simply
471 remove the corresponding directory pathname.
472 Note that files in a cgroup directory cannot and need not be
473 removed.
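.PP
Because a cgroup can be removed only after its children are gone,
a subtree has to be removed bottom-up.
The sketch below, whose function name is hypothetical, relies on the
\fI-depth\fP option of find(1) to visit children before their parents;
it assumes that no cgroup in the subtree still contains processes.

```shell
# Remove the cgroup at directory $1 together with all of its descendant
# cgroups. With -depth, find lists each directory after its contents, so
# rmdir always receives child cgroups before their parents.
# The exit status is nonzero if any rmdir call fails.
remove_cgroup_tree() {
    find "$1" -depth -type d -exec rmdir {} +
}
```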
474 .\"
475 .SS Cgroups v1 release notification
476 Two files can be used to determine whether the kernel provides
477 notifications when a cgroup becomes empty.
478 A cgroup is considered to be empty when it contains no child
479 cgroups and no member processes.
480 .PP
481 A special file in the root directory of each cgroup hierarchy,
482 .IR release_agent ,
483 can be used to register the pathname of a program that may be invoked when
484 a cgroup in the hierarchy becomes empty.
485 The pathname of the newly empty cgroup (relative to the cgroup mount point)
486 is provided as the sole command-line argument when the
487 .I release_agent
488 program is invoked.
489 The
490 .I release_agent
491 program might remove the cgroup directory,
492 or perhaps repopulate it with a process.
493 .PP
494 The default value of the
495 .I release_agent
496 file is empty, meaning that no release agent is invoked.
497 .PP
498 The content of the
499 .I release_agent
500 file can also be specified via a mount option when the
501 cgroup filesystem is mounted:
502 .PP
503 .in +4n
504 .EX
505 mount \-o release_agent=pathname ...
506 .EE
507 .in
508 .PP
509 Whether or not the
510 .I release_agent
511 program is invoked when a particular cgroup becomes empty is determined
512 by the value in the
513 .I notify_on_release
514 file in the corresponding cgroup directory.
515 If this file contains the value 0, then the
516 .I release_agent
517 program is not invoked.
518 If it contains the value 1, the
519 .I release_agent
520 program is invoked.
521 The default value for this file in the root cgroup is 0.
522 At the time when a new cgroup is created,
523 the value in this file is inherited from the corresponding file
524 in the parent cgroup.
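.PP
As an illustration, a release agent that simply removes the newly empty
cgroup could be built around the following sketch.
The function name is hypothetical, and the mount point is something the
administrator supplies, since the kernel passes only the relative pathname.

```shell
# Hypothetical release agent body: remove the cgroup that just became empty.
# $1 is the mount point of the hierarchy (chosen by whoever installs the
# agent, not supplied by the kernel); $2 is the cgroup pathname relative
# to that mount point, as passed to the agent on its command line.
remove_released_cgroup() {
    rmdir "$1/$2"
}

# A script registered in release_agent could then consist of one call,
# where /sys/fs/cgroup/cpu is an assumed mount point:
#     remove_released_cgroup /sys/fs/cgroup/cpu "$1"
```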
525 .\"
526 .SS Cgroup v1 named hierarchies
527 In cgroups v1,
528 it is possible to mount a cgroup hierarchy that has no attached controllers:
529 .PP
530 .in +4n
531 .EX
532 mount \-t cgroup \-o none,name=somename none /some/mount/point
533 .EE
534 .in
535 .PP
536 Multiple instances of such hierarchies can be mounted;
537 each hierarchy must have a unique name.
538 The only purpose of such hierarchies is to track processes.
539 (See the discussion of release notification above.)
540 An example of this is the
541 .I name=systemd
542 cgroup hierarchy that is used by
543 .BR systemd (1)
544 to track services and user sessions.
545 .PP
546 Since Linux 5.0, the
547 .I cgroup_no_v1
548 kernel boot option (described below) can be used to disable cgroup v1
549 named hierarchies, by specifying
550 .IR cgroup_no_v1=named .
551 .\"
552 .SH CGROUPS VERSION 2
553 In cgroups v2,
554 all mounted controllers reside in a single unified hierarchy.
555 While (different) controllers may be simultaneously
556 mounted under the v1 and v2 hierarchies,
557 it is not possible to mount the same controller simultaneously
558 under both the v1 and the v2 hierarchies.
559 .PP
560 The new behaviors in cgroups v2 are summarized here,
561 and in some cases elaborated in the following subsections.
562 .IP 1. 3
563 Cgroups v2 provides a unified hierarchy against
564 which all controllers are mounted.
565 .IP 2.
566 "Internal" processes are not permitted.
567 With the exception of the root cgroup, processes may reside
568 only in leaf nodes (cgroups that do not themselves contain child cgroups).
569 The details are somewhat more subtle than this, and are described below.
570 .IP 3.
571 Active controllers must be specified via the files
572 .I cgroup.controllers
573 and
574 .IR cgroup.subtree_control .
575 .IP 4.
576 The
577 .I tasks
578 file has been removed.
579 In addition, the
580 .I cgroup.clone_children
581 file that is employed by the
582 .I cpuset
583 controller has been removed.
584 .IP 5.
585 An improved mechanism for notification of empty cgroups is provided by the
586 .I cgroup.events
587 file.
588 .PP
589 For more changes, see the
590 .I Documentation/admin\-guide/cgroup\-v2.rst
591 file in the kernel source
592 (or
593 .I Documentation/cgroup\-v2.txt
594 in Linux 4.17 and earlier).
595 .
596 .PP
597 Some of the new behaviors listed above saw subsequent modification with
598 the addition in Linux 4.14 of "thread mode" (described below).
599 .\"
600 .SS Cgroups v2 unified hierarchy
601 In cgroups v1, the ability to mount different controllers
602 against different hierarchies was intended to allow great flexibility
603 for application design.
604 In practice, though,
605 the flexibility turned out to be less useful than expected,
606 and in many cases added complexity.
607 Therefore, in cgroups v2,
608 all available controllers are mounted against a single hierarchy.
609 The available controllers are automatically mounted,
610 meaning that it is not necessary (or possible) to specify the controllers
611 when mounting the cgroup v2 filesystem using a command such as the following:
612 .PP
613 .in +4n
614 .EX
615 mount \-t cgroup2 none /mnt/cgroup2
616 .EE
617 .in
618 .PP
619 A cgroup v2 controller is available only if it is not currently in use
620 via a mount against a cgroup v1 hierarchy.
621 Or, to put things another way, it is not possible to employ
622 the same controller against both a v1 hierarchy and the unified v2 hierarchy.
623 This means that it may be necessary first to unmount a v1 controller
624 (as described above) before that controller is available in v2.
625 Since
626 .BR systemd (1)
627 makes heavy use of some v1 controllers by default,
628 it can in some cases be simpler to boot the system with
629 selected v1 controllers disabled.
630 To do this, specify the
631 .I cgroup_no_v1=list
632 option on the kernel boot command line;
633 .I list
634 is a comma-separated list of the names of the controllers to disable,
635 or the word
636 .I all
637 to disable all v1 controllers.
638 (This situation is correctly handled by
639 .BR systemd (1),
640 which falls back to operating without the specified controllers.)
641 .PP
642 Note that on many modern systems,
643 .BR systemd (1)
644 automatically mounts the
645 .I cgroup2
646 filesystem at
647 .I /sys/fs/cgroup/unified
648 during the boot process.
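.PP
Since the location of the v2 mount varies between systems, a script may
want to probe a few candidate directories.
The sketch below uses the presence of a \fIcgroup.controllers\fP file as
the test, on the assumption that only v2 cgroup directories contain one;
both function names are hypothetical.

```shell
# Succeed if directory $1 looks like a cgroup v2 directory, judged by the
# presence of a cgroup.controllers file.
is_cgroup2_dir() {
    [ -f "$1/cgroup.controllers" ]
}

# Print the first of the given directories that looks like a cgroup v2
# directory; fail if none of them does.
find_cgroup2_dir() {
    for candidate in "$@"; do
        if is_cgroup2_dir "$candidate"; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}
```

A typical call might pass /sys/fs/cgroup/unified and /sys/fs/cgroup as
the candidates.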
649 .\"
650 .SS Cgroups v2 mount options
651 The following options
652 .RI ( mount\~\-o )
653 can be specified when mounting the cgroup v2 filesystem:
654 .TP
655 .IR nsdelegate " (since Linux 4.15)"
656 Treat cgroup namespaces as delegation boundaries.
657 For details, see below.
658 .TP
659 .IR memory_localevents " (since Linux 5.2)"
660 .\" commit 9852ae3fe5293264f01c49f2571ef7688f7823ce
661 The
662 .I memory.events
663 file should show statistics only for the cgroup itself,
664 and not for any descendant cgroups.
665 This was the behavior before Linux 5.2.
666 Starting in Linux 5.2,
667 the default behavior is to include statistics for descendant cgroups in
668 .IR memory.events ,
669 and this mount option can be used to revert to the legacy behavior.
670 This option is system wide and can be set on mount or
671 modified through remount only from the initial mount namespace;
672 it is silently ignored in noninitial namespaces.
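.PP
Whether one of these options is in effect can be checked from the mount
table.
The following sketch assumes that the option appears in the options field
of the \fIcgroup2\fP entry in \fI/proc/mounts\fP; the function name is
hypothetical.

```shell
# Succeed if a cgroup2 entry in the /proc/mounts-format file $2
# (default: /proc/mounts) lists the mount option named by $1.
cgroup2_has_mount_option() {
    awk -v opt="$1" '
        $3 == "cgroup2" {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == opt)
                    found = 1
        }
        END { exit !found }
    ' "${2:-/proc/mounts}"
}
```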
673 .\"
674 .SS Cgroups v2 controllers
675 The following controllers, documented in the kernel source file
676 .I Documentation/admin\-guide/cgroup\-v2.rst
677 (or
678 .I Documentation/cgroup\-v2.txt
679 in Linux 4.17 and earlier),
680 are supported in cgroups version 2:
681 .TP
682 .IR cpu " (since Linux 4.15)"
683 This is the successor to the version 1
684 .I cpu
685 and
686 .I cpuacct
687 controllers.
688 .TP
689 .IR cpuset " (since Linux 5.0)"
690 This is the successor of the version 1
691 .I cpuset
692 controller.
693 .TP
694 .IR freezer " (since Linux 5.2)"
695 .\" commit 76f969e8948d82e78e1bc4beb6b9465908e74873
696 This is the successor of the version 1
697 .I freezer
698 controller.
699 .TP
700 .IR hugetlb " (since Linux 5.6)"
701 This is the successor of the version 1
702 .I hugetlb
703 controller.
704 .TP
705 .IR io " (since Linux 4.5)"
706 This is the successor of the version 1
707 .I blkio
708 controller.
709 .TP
710 .IR memory " (since Linux 4.5)"
711 This is the successor of the version 1
712 .I memory
713 controller.
714 .TP
715 .IR perf_event " (since Linux 4.11)"
716 This is the same as the version 1
717 .I perf_event
718 controller.
719 .TP
720 .IR pids " (since Linux 4.5)"
721 This is the same as the version 1
722 .I pids
723 controller.
724 .TP
725 .IR rdma " (since Linux 4.11)"
726 This is the same as the version 1
727 .I rdma
728 controller.
729 .PP
730 There is no direct equivalent of the
731 .I net_cls
732 and
733 .I net_prio
734 controllers from cgroups version 1.
735 Instead, support has been added to
736 .BR iptables (8)
737 to allow eBPF filters that hook on cgroup v2 pathnames to make decisions
738 about network traffic on a per-cgroup basis.
739 .PP
740 The v2
741 .I devices
742 controller provides no interface files;
743 instead, device control is gated by attaching an eBPF
744 .RB ( BPF_CGROUP_DEVICE )
745 program to a v2 cgroup.
746 .\"
747 .SS Cgroups v2 subtree control
748 Each cgroup in the v2 hierarchy contains the following two files:
749 .TP
750 .I cgroup.controllers
751 This read-only file exposes a list of the controllers that are
752 .I available
753 in this cgroup.
754 The contents of this file match the contents of the
755 .I cgroup.subtree_control
756 file in the parent cgroup.
757 .TP
758 .I cgroup.subtree_control
759 This is a list of controllers that are
760 .I active
761 .RI ( enabled )
762 in the cgroup.
763 The set of controllers in this file is a subset of the set in the
764 .I cgroup.controllers
765 of this cgroup.
766 The set of active controllers is modified by writing strings to this file
767 containing space-delimited controller names,
768 each preceded by '+' (to enable a controller)
769 or '\-' (to disable a controller), as in the following example:
770 .IP
771 .in +4n
772 .EX
773 echo \(aq+pids \-memory\(aq > x/y/cgroup.subtree_control
774 .EE
775 .in
776 .IP
777 An attempt to enable a controller
778 that is not present in
779 .I cgroup.controllers
780 leads to an
781 .B ENOENT
782 error when writing to the
783 .I cgroup.subtree_control
784 file.
785 .PP
786 Because the list of controllers in
787 .I cgroup.subtree_control
788 is a subset of those in
789 .IR cgroup.controllers ,
790 a controller that has been disabled in one cgroup in the hierarchy
791 can never be re-enabled in the subtree below that cgroup.
792 .PP
793 A cgroup's
794 .I cgroup.subtree_control
795 file determines the set of controllers that are exercised in the
796 .I child
797 cgroups.
798 When a controller (e.g.,
799 .IR pids )
800 is present in the
801 .I cgroup.subtree_control
802 file of a parent cgroup,
803 then the corresponding controller-interface files (e.g.,
804 .IR pids.max )
805 are automatically created in the children of that cgroup
806 and can be used to exert resource control in the child cgroups.
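.PP
The relationship between the two files can be sketched as follows:
check that a controller is listed in \fIcgroup.controllers\fP before
asking for it in \fIcgroup.subtree_control\fP.
Both function names are hypothetical.

```shell
# Succeed if controller name $2 appears in the space-delimited list
# stored in file $1 (for example, a cgroup.controllers file).
controller_listed() {
    for name in $(cat "$1"); do
        [ "$name" = "$2" ] && return 0
    done
    return 1
}

# Enable controller $2 for the children of cgroup directory $1, but only
# if that controller is available in the cgroup.
enable_controller() {
    controller_listed "$1/cgroup.controllers" "$2" || return 1
    echo "+$2" > "$1/cgroup.subtree_control"
}
```

Checking first avoids the failed write that would otherwise occur for
an unavailable controller.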
807 .\"
808 .SS Cgroups v2 """no internal processes""" rule
809 Cgroups v2 enforces a so-called "no internal processes" rule.
810 Roughly speaking, this rule means that,
811 with the exception of the root cgroup, processes may reside
812 only in leaf nodes (cgroups that do not themselves contain child cgroups).
813 This avoids the need to decide how to partition resources between
814 processes which are members of cgroup A and processes in child cgroups of A.
815 .PP
816 For instance, if cgroup
817 .I /cg1/cg2
818 exists, then a process may reside in
819 .IR /cg1/cg2 ,
820 but not in
821 .IR /cg1 .
822 This is to avoid an ambiguity in cgroups v1
823 with respect to the delegation of resources between processes in
824 .I /cg1
825 and its child cgroups.
826 The recommended approach in cgroups v2 is to create a subdirectory called
827 .I leaf
828 for any nonleaf cgroup which should contain processes, but no child cgroups.
829 Thus, processes which previously would have gone into
830 .I /cg1
831 would now go into
832 .IR /cg1/leaf .
833 This has the advantage of making explicit
834 the relationship between processes in
835 .I /cg1/leaf
836 and
837 .IR /cg1 's
838 other children.
839 .PP
840 The "no internal processes" rule is in fact more subtle than stated above.
841 More precisely, the rule is that a (nonroot) cgroup can't both
842 (1) have member processes, and
843 (2) distribute resources into child cgroups\(emthat is, have a nonempty
844 .I cgroup.subtree_control
845 file.
846 Thus, it
847 .I is
848 possible for a cgroup to have both member processes and child cgroups,
849 but before controllers can be enabled for that cgroup,
850 the member processes must be moved out of the cgroup
851 (e.g., perhaps into the child cgroups).
852 .PP
853 With the Linux 4.14 addition of "thread mode" (described below),
854 the "no internal processes" rule has been relaxed in some cases.
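.PP
The precise form of the rule suggests a simple check to run before
enabling controllers in a nonroot cgroup.
The sketch below uses hypothetical function names, ignores thread mode,
and does not treat the root cgroup specially.

```shell
# Succeed if the cgroup at directory $1 currently has member processes.
# The file is read rather than tested with "-s", on the assumption that
# cgroup interface files may report a size of zero.
has_member_processes() {
    [ -n "$(cat "$1/cgroup.procs")" ]
}

# Succeed if controllers could be enabled in $1/cgroup.subtree_control
# without the cgroup also having member processes.
may_enable_controllers() {
    ! has_member_processes "$1"
}
```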
855 .\"
856 .SS Cgroups v2 cgroup.events file
857 Each nonroot cgroup in the v2 hierarchy contains a read-only file,
858 .IR cgroup.events ,
859 whose contents are key-value pairs
860 (delimited by newline characters, with the key and value separated by spaces)
861 providing state information about the cgroup:
862 .PP
863 .in +4n
864 .EX
865 $ \fBcat mygrp/cgroup.events\fP
866 populated 1
867 frozen 0
868 .EE
869 .in
870 .PP
871 The following keys may appear in this file:
872 .TP
873 .I populated
874 The value of this key is either 1,
875 if this cgroup or any of its descendants has member processes,
876 or otherwise 0.
877 .TP
878 .IR frozen " (since Linux 5.2)"
879 .\" commit 76f969e8948d82e78e1bc4beb6b9465908e74873
880 The value of this key is 1 if this cgroup is currently frozen,
881 or 0 if it is not.
882 .PP
883 The
884 .I cgroup.events
885 file can be monitored, in order to receive notification when the value of
886 one of its keys changes.
887 Such monitoring can be done using
888 .BR inotify (7),
889 which notifies changes as
890 .B IN_MODIFY
891 events, or
892 .BR poll (2),
893 which notifies changes by returning the
894 .B POLLPRI
895 and
896 .B POLLERR
897 bits in the
898 .I revents
899 field.
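.PP
The monitoring technique described above can be sketched in Python as follows.
This is only an illustration under stated assumptions:
.IR parse_events ()
and
.IR wait_for_change ()
are hypothetical helper names rather than part of any cgroup API,
and the caller is assumed to pass the pathname of a
.I cgroup.events
file inside a mounted cgroup v2 filesystem.

```python
import os
import select


def parse_events(text):
    """Parse the contents of a cgroup.events file into a dict of ints."""
    state = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            state[fields[0]] = int(fields[1])
    return state


def wait_for_change(path, timeout_ms=None):
    """Wait until poll(2) reports an event on the file.

    Returns the freshly parsed state, or None if the timeout expired.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        os.read(fd, 4096)  # consume the current contents first
        poller = select.poll()
        poller.register(fd, select.POLLPRI)  # POLLERR is always reported
        if not poller.poll(timeout_ms):
            return None
        os.lseek(fd, 0, os.SEEK_SET)
        return parse_events(os.read(fd, 4096).decode())
    finally:
        os.close(fd)
```

.PP
A caller could, for example, invoke
.IR wait_for_change ()
in a loop until the
.I populated
key in the returned dictionary becomes 0.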
900 .\"
901 .SS Cgroup v2 release notification
902 Cgroups v2 provides a new mechanism for obtaining notification
903 when a cgroup becomes empty.
904 The cgroups v1
905 .I release_agent
906 and
907 .I notify_on_release
908 files are removed, and replaced by the
909 .I populated
910 key in the
911 .I cgroup.events
912 file.
913 This key either has the value 0,
914 meaning that the cgroup (and its descendants)
915 contain no (nonzombie) member processes,
916 or 1, meaning that the cgroup (or one of its descendants)
917 contains member processes.
918 .PP
919 The cgroups v2 release-notification mechanism
920 offers the following advantages over the cgroups v1
921 .I release_agent
922 mechanism:
923 .IP * 3
924 It allows for cheaper notification,
925 since a single process can monitor multiple
926 .I cgroup.events
927 files (using the techniques described earlier).
928 By contrast, the cgroups v1 mechanism requires the expense of creating
929 a process for each notification.
930 .IP *
931 Notification for different cgroup subhierarchies can be delegated
932 to different processes.
933 By contrast, the cgroups v1 mechanism allows only one release agent
934 for an entire hierarchy.
935 .\"
936 .SS Cgroups v2 cgroup.stat file
937 .\" commit ec39225cca42c05ac36853d11d28f877fde5c42e
938 Each cgroup in the v2 hierarchy contains a read-only
939 .I cgroup.stat
940 file (first introduced in Linux 4.14)
941 that consists of lines containing key-value pairs.
942 The following keys currently appear in this file:
943 .TP
944 .I nr_descendants
945 This is the total number of visible (i.e., living) descendant cgroups
946 underneath this cgroup.
947 .TP
948 .I nr_dying_descendants
949 This is the total number of dying descendant cgroups
950 underneath this cgroup.
951 A cgroup enters the dying state after being deleted.
952 It remains in that state for an undefined period
953 (which will depend on system load)
954 while resources are freed before the cgroup is destroyed.
955 Note that the presence of some cgroups in the dying state is normal,
956 and is not indicative of any problem.
957 .IP
958 A process can't be made a member of a dying cgroup,
959 and a dying cgroup can't be brought back to life.
960 .\"
961 .SS Limiting the number of descendant cgroups
962 Each cgroup in the v2 hierarchy contains the following files,
963 which can be used to view and set limits on the number
964 of descendant cgroups under that cgroup:
965 .TP
966 .IR cgroup.max.depth " (since Linux 4.14)"
967 .\" commit 1a926e0bbab83bae8207d05a533173425e0496d1
968 This file defines a limit on the depth of nesting of descendant cgroups.
969 A value of 0 in this file means that no descendant cgroups can be created.
970 An attempt to create a descendant whose nesting level exceeds
971 the limit fails
972 .RI ( mkdir (2)
973 fails with the error
974 .BR EAGAIN ).
975 .IP
976 Writing the string
977 .I """max"""
978 to this file means that no limit is imposed.
979 The default value in this file is
980 .IR """max""" .
981 .TP
982 .IR cgroup.max.descendants " (since Linux 4.14)"
983 .\" commit 1a926e0bbab83bae8207d05a533173425e0496d1
984 This file defines a limit on the number of live descendant cgroups that
985 this cgroup may have.
986 An attempt to create more descendants than allowed by the limit fails
987 .RI ( mkdir (2)
988 fails with the error
989 .BR EAGAIN ).
990 .IP
991 Writing the string
992 .I """max"""
993 to this file means that no limit is imposed.
994 The default value in this file is
995 .IR """max""" .
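.PP
As a rough sketch of how a program might work with these two files,
the following Python fragment interprets a limit value and
detects the
.B EAGAIN
failure from
.BR mkdir (2).
The names
.IR parse_limit ()
and
.IR try_create_child ()
are hypothetical helpers introduced only for this illustration.

```python
import errno
import os


def parse_limit(text):
    """Return None for "max" (no limit), else the limit as an int."""
    value = text.strip()
    return None if value == "max" else int(value)


def try_create_child(parent, name):
    """Create a child cgroup directory.

    Returns False if the kernel refused with EAGAIN, which is the error
    described above for exceeding a depth or descendant limit.
    """
    try:
        os.mkdir(os.path.join(parent, name))
        return True
    except OSError as e:
        if e.errno == errno.EAGAIN:
            return False
        raise
```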
996 .\"
997 .SH CGROUPS DELEGATION: DELEGATING A HIERARCHY TO A LESS PRIVILEGED USER
998 In the context of cgroups,
999 delegation means passing management of some subtree
1000 of the cgroup hierarchy to a nonprivileged user.
1001 Cgroups v1 provides support for delegation based on file permissions
1002 in the cgroup hierarchy but with less strict containment rules than v2
1003 (as noted below).
1004 Cgroups v2 supports delegation with containment by explicit design.
1005 The focus of the discussion in this section is on delegation in cgroups v2,
1006 with some differences for cgroups v1 noted along the way.
1007 .PP
1008 Some terminology is required in order to describe delegation.
1009 A
1010 .I delegater
1011 is a privileged user (i.e., root) who owns a parent cgroup.
1012 A
1013 .I delegatee
1014 is a nonprivileged user who will be granted the permissions needed
1015 to manage some subhierarchy under that parent cgroup,
1016 known as the
1017 .IR "delegated subtree" .
1018 .PP
1019 To perform delegation,
1020 the delegater makes certain directories and files writable by the delegatee,
1021 typically by changing the ownership of the objects to be the user ID
1022 of the delegatee.
1023 Assuming that we want to delegate the hierarchy rooted at (say)
1024 .I /dlgt_grp
1025 and that there are not yet any child cgroups under that cgroup,
1026 the ownership of the following is changed to the user ID of the delegatee:
1027 .TP
1028 .I /dlgt_grp
1029 Changing the ownership of the root of the subtree means that any new
1030 cgroups created under the subtree (and the files they contain)
1031 will also be owned by the delegatee.
1032 .TP
1033 .I /dlgt_grp/cgroup.procs
1034 Changing the ownership of this file means that the delegatee
1035 can move processes into the root of the delegated subtree.
1036 .TP
1037 .IR /dlgt_grp/cgroup.subtree_control " (cgroups v2 only)"
1038 Changing the ownership of this file means that the delegatee
1039 can enable controllers (that are present in
1040 .IR /dlgt_grp/cgroup.controllers )
1041 in order to further redistribute resources at lower levels in the subtree.
1042 (As an alternative to changing the ownership of this file,
1043 the delegater might instead add selected controllers to this file.)
1044 .TP
1045 .IR /dlgt_grp/cgroup.threads " (cgroups v2 only)"
1046 Changing the ownership of this file is necessary if a threaded subtree
1047 is being delegated (see the description of "thread mode", below).
1048 This permits the delegatee to write thread IDs to the file.
1049 (The ownership of this file can also be changed when delegating
1050 a domain subtree, but currently this serves no purpose,
1051 since, as described below, it is not possible to move a thread between
1052 domain cgroups by writing its thread ID to the
1053 .I cgroup.threads
1054 file.)
1055 .IP
1056 In cgroups v1, the corresponding file that should instead be delegated is the
1057 .I tasks
1058 file.
1059 .PP
1060 The delegater should
1061 .I not
1062 change the ownership of any of the controller interface files (e.g.,
1063 .IR pids.max ,
1064 .IR memory.high )
1065 in
1066 .IR /dlgt_grp .
1067 Those files are used from the next level above the delegated subtree
1068 in order to distribute resources into the subtree,
1069 and the delegatee should not have permission to change
1070 the resources that are distributed into the delegated subtree.
1071 .PP
1072 See also the discussion of the
1073 .I /sys/kernel/cgroup/delegate
1074 file in NOTES for information about further delegatable files in cgroups v2.
1075 .PP
1076 After the aforementioned steps have been performed,
1077 the delegatee can create child cgroups within the delegated subtree
1078 (the cgroup subdirectories and the files they contain
1079 will be owned by the delegatee)
1080 and move processes between cgroups in the subtree.
1081 If some controllers are present in
1082 .IR /dlgt_grp/cgroup.subtree_control ,
1083 or the ownership of that file was passed to the delegatee,
1084 the delegatee can also control the further redistribution
1085 of the corresponding resources into the delegated subtree.
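.PP
The ownership changes described in this section can be sketched as follows.
This is a minimal illustration for cgroups v2, not a definitive tool:
.IR delegation_targets ()
and
.IR delegate ()
are hypothetical names,
the list of files is copied from the text above
(a more careful program might read it from
.IR /sys/kernel/cgroup/delegate ),
and changing ownership to another user requires privilege.

```python
import os

# Files named in the text above as delegatable.  Controller interface
# files such as pids.max are deliberately left out.
DELEGATABLE = ("cgroup.procs", "cgroup.subtree_control", "cgroup.threads")


def delegation_targets(cgroup_dir):
    """Return the paths whose ownership would be changed."""
    targets = [cgroup_dir]
    for name in DELEGATABLE:
        path = os.path.join(cgroup_dir, name)
        if os.path.exists(path):
            targets.append(path)
    return targets


def delegate(cgroup_dir, uid):
    """Change ownership of the delegatable objects to uid."""
    for path in delegation_targets(cgroup_dir):
        os.chown(path, uid, -1)  # -1 leaves the group ID unchanged
```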
1086 .\"
1087 .SS Cgroups v2 delegation: nsdelegate and cgroup namespaces
1088 Starting with Linux 4.13,
1089 .\" commit 5136f6365ce3eace5a926e10f16ed2a233db5ba9
1090 there is a second way to perform cgroup delegation in the cgroups v2 hierarchy.
1091 This is done by mounting or remounting the cgroup v2 filesystem with the
1092 .I nsdelegate
1093 mount option.
1094 For example, if the cgroup v2 filesystem has already been mounted,
1095 we can remount it with the
1096 .I nsdelegate
1097 option as follows:
1098 .PP
1099 .in +4n
1100 .EX
1101 mount \-t cgroup2 \-o remount,nsdelegate \e
1102 none /sys/fs/cgroup/unified
1103 .EE
1104 .in
1105 .\"
1106 .\" Alternatively, we could boot the kernel with the options:
1107 .\"
1108 .\" cgroup_no_v1=all systemd.legacy_systemd_cgroup_controller
1109 .\"
1110 .\" The effect of the latter option is to prevent systemd from employing
1111 .\" its "hybrid" cgroup mode, where it tries to make use of cgroups v2.
1112 .PP
1113 The effect of this mount option is to cause cgroup namespaces
1114 to automatically become delegation boundaries.
1115 More specifically,
1116 the following restrictions apply for processes inside the cgroup namespace:
1117 .IP * 3
1118 Writes to controller interface files in the root directory of the namespace
1119 will fail with the error
1120 .BR EPERM .
1121 Processes inside the cgroup namespace can still write to delegatable
1122 files in the root directory of the cgroup namespace such as
1123 .I cgroup.procs
1124 and
1125 .IR cgroup.subtree_control ,
1126 and can create a subhierarchy underneath the root directory.
1127 .IP *
1128 Attempts to migrate processes across the namespace boundary are denied
1129 (with the error
1130 .BR ENOENT ).
1131 Processes inside the cgroup namespace can still
1132 (subject to the containment rules described below)
1133 move processes between cgroups
1134 .I within
1135 the subhierarchy under the namespace root.
1136 .PP
1137 The ability to define cgroup namespaces as delegation boundaries
1138 makes cgroup namespaces more useful.
1139 To understand why, suppose that we already have one cgroup hierarchy
1140 that has been delegated to a nonprivileged user,
1141 .IR cecilia ,
1142 using the older delegation technique described above.
1143 Suppose further that
1144 .I cecilia
1145 wanted to further delegate a subhierarchy
1146 under the existing delegated hierarchy.
1147 (For example, the delegated hierarchy might be associated with
1148 an unprivileged container run by
1149 .IR cecilia .)
1150 Even if a cgroup namespace was employed,
1151 because both hierarchies are owned by the unprivileged user
1152 .IR cecilia ,
1153 the following illegitimate actions could be performed:
1154 .IP * 3
1155 A process in the inferior hierarchy could change the
1156 resource controller settings in the root directory of that hierarchy.
1157 (These resource controller settings are intended to allow control to
1158 be exercised from the
1159 .I parent
1160 cgroup;
1161 a process inside the child cgroup should not be allowed to modify them.)
1162 .IP *
1163 A process inside the inferior hierarchy could move processes
1164 into and out of the inferior hierarchy if the cgroups in the
1165 superior hierarchy were somehow visible.
1166 .PP
1167 Employing the
1168 .I nsdelegate
1169 mount option prevents both of these possibilities.
1170 .PP
1171 The
1172 .I nsdelegate
1173 mount option only has an effect when performed in
1174 the initial mount namespace;
1175 in other mount namespaces, the option is silently ignored.
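.PP
Because the option can be silently ignored,
it may be useful to check whether it actually took effect.
The following Python sketch does so on the assumption that
.I nsdelegate
is reported among the superblock options in
.I /proc/self/mountinfo
(see
.BR proc (5));
.IR cgroup2_mount_options ()
is a hypothetical helper name.

```python
def cgroup2_mount_options(mountinfo_text):
    """Map each cgroup2 mount point to its set of superblock options."""
    result = {}
    for line in mountinfo_text.splitlines():
        # Fields after the " - " separator: fstype, source, super options.
        left, sep, right = line.partition(" - ")
        if not sep:
            continue
        fields = right.split()
        if len(fields) >= 3 and fields[0] == "cgroup2":
            mount_point = left.split()[4]
            result[mount_point] = set(fields[2].split(","))
    return result
```

.PP
A caller would pass the contents of
.I /proc/self/mountinfo
and test whether the string "nsdelegate" is in the set returned
for the mount point of interest.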
1176 .PP
1177 .IR Note :
1178 On some systems,
1179 .BR systemd (1)
1180 automatically mounts the cgroup v2 filesystem.
1181 In order to experiment with the
1182 .I nsdelegate
1183 operation, it may be useful to boot the kernel with
1184 the following command-line options:
1185 .PP
1186 .in +4n
1187 .EX
1188 cgroup_no_v1=all systemd.legacy_systemd_cgroup_controller
1189 .EE
1190 .in
1191 .PP
1192 These options cause the kernel to boot with the cgroups v1 controllers
1193 disabled (meaning that the controllers are available in the v2 hierarchy),
1194 and tell
1195 .BR systemd (1)
1196 not to mount and use the cgroup v2 hierarchy,
1197 so that the v2 hierarchy can be manually mounted
1198 with the desired options after boot-up.
1199 .\"
1200 .SS Cgroup delegation containment rules
1201 Some delegation
1202 .I containment rules
1203 ensure that the delegatee can move processes between cgroups within the
1204 delegated subtree,
1205 but can't move processes from outside the delegated subtree into
1206 the subtree or vice versa.
1207 A nonprivileged process (i.e., the delegatee) can write the PID of
1208 a "target" process into a
1209 .I cgroup.procs
1210 file only if all of the following are true:
1211 .IP * 3
1212 The writer has write permission on the
1213 .I cgroup.procs
1214 file in the destination cgroup.
1215 .IP *
1216 The writer has write permission on the
1217 .I cgroup.procs
1218 file in the nearest common ancestor of the source and destination cgroups.
1219 Note that in some cases,
1220 the nearest common ancestor may be the source or destination cgroup itself.
1221 This requirement is not enforced for cgroups v1 hierarchies,
1222 with the consequence that containment in v1 is less strict than in v2.
1223 (For example, in cgroups v1 the user that owns two distinct
1224 delegated subhierarchies can move a process between the hierarchies.)
1225 .IP *
1226 If the cgroup v2 filesystem was mounted with the
1227 .I nsdelegate
1228 option, the writer must be able to see the source and destination cgroups
1229 from its cgroup namespace.
1230 .IP *
1231 In cgroups v1:
1232 the effective UID of the writer (i.e., the delegatee) matches the
1233 real user ID or the saved set-user-ID of the target process.
1234 Before Linux 4.11,
1235 .\" commit 576dd464505fc53d501bb94569db76f220104d28
1236 this requirement also applied in cgroups v2.
1237 (This was a historical requirement inherited from cgroups v1
1238 that was later deemed unnecessary,
1239 since the other rules suffice for containment in cgroups v2.)
1240 .PP
1241 .IR Note :
1242 one consequence of these delegation containment rules is that the
1243 unprivileged delegatee can't place the first process into
1244 the delegated subtree;
1245 instead, the delegater must place the first process
1246 (a process owned by the delegatee) into the delegated subtree.
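.PP
The two write-permission rules for cgroups v2 can be approximated
in user space as shown in the following Python sketch.
It is only an approximation under stated assumptions:
.IR nearest_common_ancestor ()
and
.IR may_migrate ()
are hypothetical names,
the cgroups are identified by their directory pathnames,
.BR access (2)
checks are made with the caller's real user ID,
and the
.I nsdelegate
visibility rule is not checked at all.

```python
import os


def nearest_common_ancestor(src, dst):
    """Return the deepest cgroup directory that contains both paths."""
    return os.path.commonpath([src, dst])


def may_migrate(src, dst):
    """Check write permission on cgroup.procs in the destination cgroup
    and in the nearest common ancestor of source and destination."""
    ancestor = nearest_common_ancestor(src, dst)
    return all(
        os.access(os.path.join(d, "cgroup.procs"), os.W_OK)
        for d in (dst, ancestor)
    )
```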
1247 .\"
1248 .SH CGROUPS VERSION 2 THREAD MODE
1249 Among the restrictions imposed by cgroups v2 that were not present
1250 in cgroups v1 are the following:
1251 .IP * 3
1252 .IR "No thread-granularity control" :
1253 all of the threads of a process must be in the same cgroup.
1254 .IP *
1255 .IR "No internal processes" :
1256 a cgroup can't both have member processes and
1257 exercise controllers on child cgroups.
1258 .PP
1259 Both of these restrictions were added because
1260 the lack of these restrictions had caused problems
1261 in cgroups v1.
1262 In particular, the cgroups v1 ability to allow thread-level granularity
1263 for cgroup membership made no sense for some controllers.
1264 (A notable example was the
1265 .I memory
1266 controller: since threads share an address space,
1267 it made no sense to split threads across different
1268 .I memory
1269 cgroups.)
1270 .PP
1271 Notwithstanding the initial design decision in cgroups v2,
1272 there were use cases for certain controllers, notably the
1273 .I cpu
1274 controller,
1275 for which thread-level granularity of control was meaningful and useful.
1276 To accommodate such use cases, Linux 4.14 added
1277 .I "thread mode"
1278 for cgroups v2.
1279 .PP
1280 Thread mode allows the following:
1281 .IP * 3
1282 The creation of
1283 .I threaded subtrees
1284 in which the threads of a process may
1285 be spread across cgroups inside the tree.
1286 (A threaded subtree may contain multiple multithreaded processes.)
1287 .IP *
1288 The concept of
1289 .IR "threaded controllers" ,
1290 which can distribute resources across the cgroups in a threaded subtree.
1291 .IP *
1292 A relaxation of the "no internal processes" rule,
1293 so that, within a threaded subtree,
1294 a cgroup can both contain member threads and
1295 exercise resource control over child cgroups.
1296 .PP
1297 With the addition of thread mode,
1298 each nonroot cgroup now contains a new file,
1299 .IR cgroup.type ,
1300 that exposes, and in some circumstances can be used to change,
1301 the "type" of a cgroup.
1302 This file contains one of the following type values:
1303 .TP
1304 .I domain
1305 This is a normal v2 cgroup that provides process-granularity control.
1306 If a process is a member of this cgroup,
1307 then all threads of the process are (by definition) in the same cgroup.
1308 This is the default cgroup type,
1309 and provides the same behavior that was provided for
1310 cgroups in the initial cgroups v2 implementation.
1311 .TP
1312 .I threaded
1313 This cgroup is a member of a threaded subtree.
1314 Threads can be added to this cgroup,
1315 and controllers can be enabled for the cgroup.
1316 .TP
1317 .I domain threaded
1318 This is a domain cgroup that serves as the root of a threaded subtree.
1319 This cgroup type is also known as "threaded root".
1320 .TP
1321 .I domain invalid
1322 This is a cgroup inside a threaded subtree
1323 that is in an "invalid" state.
1324 Processes can't be added to the cgroup,
1325 and controllers can't be enabled for the cgroup.
1326 The only thing that can be done with this cgroup (other than deleting it)
1327 is to convert it to a
1328 .I threaded
1329 cgroup by writing the string
1330 .I """threaded"""
1331 to the
1332 .I cgroup.type
1333 file.
1334 .IP
1335 The rationale for the existence of this "interim" type
1336 during the creation of a threaded subtree
1337 (rather than the kernel simply immediately converting all cgroups
1338 under the threaded root to the type
1339 .IR threaded )
1340 is to allow for
1341 possible future extensions to the thread mode model.
1342 .\"
1343 .SS Threaded versus domain controllers
1344 With the addition of thread mode,
1345 cgroups v2 now distinguishes two types of resource controllers:
1346 .IP * 3
1347 .I Threaded
1348 .\" In the kernel source, look for ".threaded[ \t]*= true" in
1349 .\" initializations of struct cgroup_subsys
1350 controllers: these controllers support thread-granularity for
1351 resource control and can be enabled inside threaded subtrees,
1352 with the result that the corresponding controller-interface files
1353 appear inside the cgroups in the threaded subtree.
1354 As at Linux 4.19, the following controllers are threaded:
1355 .IR cpu ,
1356 .IR perf_event ,
1357 and
1358 .IR pids .
1359 .IP *
1360 .I Domain
1361 controllers: these controllers support only process granularity
1362 for resource control.
1363 From the perspective of a domain controller,
1364 all threads of a process are always in the same cgroup.
1365 Domain controllers can't be enabled inside a threaded subtree.
1366 .\"
1367 .SS Creating a threaded subtree
1368 There are two pathways that lead to the creation of a threaded subtree.
1369 The first pathway proceeds as follows:
1370 .IP 1. 3
1371 We write the string
1372 .I """threaded"""
1373 to the
1374 .I cgroup.type
1375 file of a cgroup
1376 .I y/z
1377 that currently has the type
1378 .IR domain .
1379 This has the following effects:
1380 .RS
1381 .IP * 3
1382 The type of the cgroup
1383 .I y/z
1384 becomes
1385 .IR threaded .
1386 .IP *
1387 The type of the parent cgroup,
1388 .IR y ,
1389 becomes
1390 .IR "domain threaded" .
1391 The parent cgroup is the root of a threaded subtree
1392 (also known as the "threaded root").
1393 .IP *
1394 All other cgroups under
1395 .I y
1396 that were not already of type
1397 .I threaded
1398 (because they were inside already existing threaded subtrees
1399 under the new threaded root)
1400 are converted to type
1401 .IR "domain invalid" .
1402 Any subsequently created cgroups under
1403 .I y
1404 will also have the type
1405 .IR "domain invalid" .
1406 .RE
1407 .IP 2.
1408 We write the string
1409 .I """threaded"""
1410 to each of the
1411 .I domain invalid
1412 cgroups under
1413 .IR y ,
1414 in order to convert them to the type
1415 .IR threaded .
1416 As a consequence of this step, all cgroups under the threaded root
1417 now have the type
1418 .I threaded
1419 and the threaded subtree is now fully usable.
1420 The requirement to write
1421 .I """threaded"""
1422 to each of these cgroups is somewhat cumbersome,
1423 but allows for possible future extensions to the thread-mode model.
1424 .PP
1425 The second way of creating a threaded subtree is as follows:
1426 .IP 1. 3
1427 In an existing cgroup,
1428 .IR z ,
1429 that currently has the type
1430 .IR domain ,
1431 we (1) enable one or more threaded controllers and
1432 (2) make a process a member of
1433 .IR z .
1434 (These two steps can be done in either order.)
1435 This has the following consequences:
1436 .RS
1437 .IP * 3
1438 The type of
1439 .I z
1440 becomes
1441 .IR "domain threaded" .
1442 .IP *
1443 All of the descendant cgroups of
1444 .I z
1445 that were not already of type
1446 .I threaded
1447 are converted to type
1448 .IR "domain invalid" .
1449 .RE
1450 .IP 2.
1451 As before, we make the threaded subtree usable by writing the string
1452 .I """threaded"""
1453 to each of the
1454 .I domain invalid
1455 cgroups under
1456 .IR z ,
1457 in order to convert them to the type
1458 .IR threaded .
1459 .PP
1460 One of the consequences of the above pathways to creating a threaded subtree
1461 is that the threaded root cgroup can be a parent only to
1462 .I threaded
1463 (and
1464 .IR "domain invalid" )
1465 cgroups.
1466 The threaded root cgroup can't be a parent of a
1467 .I domain
1468 cgroup, and a
1469 .I threaded
1470 cgroup
1471 can't have a sibling that is a
1472 .I domain
1473 cgroup.
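.PP
The second step of either pathway
(converting every
.I domain invalid
cgroup under the threaded root) can be sketched in Python as follows.
The name
.IR convert_invalid_cgroups ()
is a hypothetical helper,
and the sketch assumes that the caller has write permission on the
.I cgroup.type
files.
Directories are visited parent-first, so that a parent is
converted before its children.

```python
import os


def convert_invalid_cgroups(threaded_root):
    """Write "threaded" to each "domain invalid" cgroup below the root.

    os.walk() visits directories top-down by default, so a parent is
    always converted before its children.
    """
    converted = []
    for dirpath, _dirnames, _filenames in os.walk(threaded_root):
        if dirpath == threaded_root:
            continue  # the threaded root itself is not rewritten
        type_file = os.path.join(dirpath, "cgroup.type")
        with open(type_file) as f:
            current = f.read().strip()
        if current == "domain invalid":
            with open(type_file, "w") as f:
                f.write("threaded")
            converted.append(dirpath)
    return converted
```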
1474 .\"
1475 .SS Using a threaded subtree
1476 Within a threaded subtree, threaded controllers can be enabled
1477 in each subgroup whose type has been changed to
1478 .IR threaded ;
1479 upon doing so, the corresponding controller interface files
1480 appear in the children of that cgroup.
1481 .PP
1482 A process can be moved into a threaded subtree by writing its PID to the
1483 .I cgroup.procs
1484 file in one of the cgroups inside the tree.
1485 This has the effect of making all of the threads
1486 in the process members of the corresponding cgroup
1487 and makes the process a member of the threaded subtree.
1488 The threads of the process can then be spread across
1489 the threaded subtree by writing their thread IDs (see
1490 .BR gettid (2))
1491 to the
1492 .I cgroup.threads
1493 files in different cgroups inside the subtree.
1494 The threads of a process must all reside in the same threaded subtree.
1495 .PP
1496 As with writing to
1497 .IR cgroup.procs ,
1498 some containment rules apply when writing to the
1499 .I cgroup.threads
1500 file:
1501 .IP * 3
1502 The writer must have write permission on the
1503 .I cgroup.threads
1504 file in the destination cgroup.
1505 .IP *
1506 The writer must have write permission on the
1507 .I cgroup.procs
1508 file in the common ancestor of the source and destination cgroups.
1509 (In some cases,
1510 the common ancestor may be the source or destination cgroup itself.)
1511 .IP *
1512 The source and destination cgroups must be in the same threaded subtree.
1513 (Outside a threaded subtree, an attempt to move a thread by writing
1514 its thread ID to the
1515 .I cgroup.threads
1516 file in a different
1517 .I domain
1518 cgroup fails with the error
1519 .BR EOPNOTSUPP .)
1520 .PP
1521 The
1522 .I cgroup.threads
1523 file is present in each cgroup (including
1524 .I domain
1525 cgroups) and can be read in order to discover the set of threads
1526 that is present in the cgroup.
1527 The set of thread IDs obtained when reading this file
1528 is not guaranteed to be ordered or free of duplicates.
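.PP
Since the list is neither ordered nor free of duplicates,
a reader may want to normalize it.
A minimal Python sketch, in which
.IR parse_thread_ids ()
is a hypothetical helper name:

```python
def parse_thread_ids(text):
    """Return the thread IDs from a cgroup.threads file, sorted and unique."""
    return sorted({int(tid) for tid in text.split()})
```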
1529 .PP
1530 The
1531 .I cgroup.procs
1532 file in the threaded root shows the PIDs of all processes
1533 that are members of the threaded subtree.
1534 The
1535 .I cgroup.procs
1536 files in the other cgroups in the subtree are not readable.
1537 .PP
1538 Domain controllers can't be enabled in a threaded subtree;
1539 no controller-interface files appear inside the cgroups underneath the
1540 threaded root.
1541 From the point of view of a domain controller,
1542 threaded subtrees are invisible:
1543 a multithreaded process inside a threaded subtree appears to a domain
1544 controller as a process that resides in the threaded root cgroup.
1545 .PP
1546 Within a threaded subtree, the "no internal processes" rule does not apply:
1547 a cgroup can both contain member processes (or threads)
1548 and exercise controllers on child cgroups.
1549 .\"
1550 .SS Rules for writing to cgroup.type and creating threaded subtrees
1551 A number of rules apply when writing to the
1552 .I cgroup.type
1553 file:
1554 .IP * 3
1555 Only the string
1556 .I """threaded"""
1557 may be written.
1558 In other words, the only explicit transition that is possible is to convert a
1559 .I domain
1560 cgroup to type
1561 .IR threaded .
1562 .IP *
1563 The effect of writing
1564 .I """threaded"""
1565 depends on the current value in
1566 .IR cgroup.type ,
1567 as follows:
1568 .RS
1569 .IP \(bu 3
1570 .I domain
1571 or
1572 .IR "domain threaded" :
1573 start the creation of a threaded subtree
1574 (whose root is the parent of this cgroup) via
1575 the first of the pathways described above;
1576 .IP \(bu
1577 .IR "domain\ invalid" :
1578 convert this cgroup (which is inside a threaded subtree) to a usable (i.e.,
1579 .IR threaded )
1580 state;
1581 .IP \(bu
1582 .IR threaded :
1583 no effect (a "no-op").
1584 .RE
1585 .IP *
1586 We can't write to a
1587 .I cgroup.type
1588 file if the parent's type is
1589 .IR "domain invalid" .
1590 In other words, the cgroups of a threaded subtree must be converted to the
1591 .I threaded
1592 state in a top-down manner.
1593 .PP
1594 There are also some constraints that must be satisfied
1595 in order to create a threaded subtree rooted at the cgroup
1596 .IR x :
1597 .IP * 3
1598 There can be no member processes in the descendant cgroups of
1599 .IR x .
1600 (The cgroup
1601 .I x
1602 can itself have member processes.)
1603 .IP *
1604 No domain controllers may be enabled in
1605 .IR x 's
1606 .I cgroup.subtree_control
1607 file.
1608 .PP
1609 If any of the above constraints is violated, then an attempt to write
1610 .I """threaded"""
1611 to a
1612 .I cgroup.type
1613 file fails with the error
1614 .BR ENOTSUP .
1615 .\"
1616 .SS The """domain threaded""" cgroup type
1617 According to the pathways described above,
1618 the type of a cgroup can change to
1619 .I domain threaded
1620 in either of the following cases:
1621 .IP * 3
1622 The string
1623 .I """threaded"""
1624 is written to a child cgroup.
1625 .IP *
1626 A threaded controller is enabled inside the cgroup and
1627 a process is made a member of the cgroup.
1628 .PP
1629 A
1630 .I domain threaded
1631 cgroup,
1632 .IR x ,
1633 can revert to the type
1634 .I domain
1635 if the above conditions no longer hold true\(emthat is, if all
1636 .I threaded
1637 child cgroups of
1638 .I x
1639 are removed and either
1640 .I x
1641 no longer has threaded controllers enabled or
1642 no longer has member processes.
1643 .PP
1644 When a
1645 .I domain threaded
1646 cgroup
1647 .I x
1648 reverts to the type
1649 .IR domain :
1650 .IP * 3
1651 All
1652 .I domain invalid
1653 descendants of
1654 .I x
1655 that are not in lower-level threaded subtrees revert to the type
1656 .IR domain .
1657 .IP *
1658 The root cgroups in any lower-level threaded subtrees revert to the type
1659 .IR "domain threaded" .
1660 .\"
1661 .SS Exceptions for the root cgroup
1662 The root cgroup of the v2 hierarchy is treated exceptionally:
1663 it can be the parent of both
1664 .I domain
1665 and
1666 .I threaded
1667 cgroups.
1668 If the string
1669 .I """threaded"""
1670 is written to the
1671 .I cgroup.type
1672 file of one of the children of the root cgroup, then
1673 .IP * 3
1674 The type of that cgroup becomes
1675 .IR threaded .
1676 .IP *
1677 The type of any descendants of that cgroup that
1678 are not part of lower-level threaded subtrees changes to
1679 .IR "domain invalid" .
1680 .PP
1681 Note that in this case, there is no cgroup whose type becomes
1682 .IR "domain threaded" .
1683 (Notionally, the root cgroup can be considered as the threaded root
1684 for the cgroup whose type was changed to
1685 .IR threaded .)
1686 .PP
1687 The aim of this exceptional treatment for the root cgroup is to
1688 allow a threaded cgroup that employs the
1689 .I cpu
1690 controller to be placed as high as possible in the hierarchy,
1691 so as to minimize the (small) cost of traversing the cgroup hierarchy.
1692 .\"
1693 .SS The cgroups v2 """cpu""" controller and realtime threads
1694 As at Linux 4.19, the cgroups v2
1695 .I cpu
1696 controller does not support control of realtime threads
1697 (specifically threads scheduled under any of the policies
1698 .BR SCHED_FIFO ,
1699 .BR SCHED_RR ,
1700 and
1701 .BR SCHED_DEADLINE ;
1702 see
1703 .BR sched (7)).
1704 Therefore, the
1705 .I cpu
1706 controller can be enabled in the root cgroup only
1707 if all realtime threads are in the root cgroup.
1708 (If there are realtime threads in nonroot cgroups, then a
1709 .BR write (2)
1710 of the string
1711 .I """+cpu"""
1712 to the
1713 .I cgroup.subtree_control
1714 file fails with the error
1715 .BR EINVAL .)
1716 .PP
1717 On some systems,
1718 .BR systemd (1)
1719 places certain realtime threads in nonroot cgroups in the v2 hierarchy.
1720 On such systems,
1721 these threads must first be moved to the root cgroup before the
1722 .I cpu
1723 controller can be enabled.
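.PP
To find the threads that would need to be moved,
a program could scan
.I /proc
for threads running under a realtime policy.
The following Python sketch shows one way to do this.
The names
.IR is_realtime ()
and
.IR realtime_threads ()
are hypothetical helpers,
and the numeric value used for
.B SCHED_DEADLINE
is an assumption taken from the Linux UAPI headers,
since the Python
.I os
module does not export that constant.

```python
import os

# SCHED_DEADLINE is not exported by the os module; 6 is assumed here
# to be its value in the Linux UAPI headers.
SCHED_DEADLINE = 6
REALTIME_POLICIES = {os.SCHED_FIFO, os.SCHED_RR, SCHED_DEADLINE}


def is_realtime(policy):
    """True if the policy is one of those listed in the text above."""
    # Mask off SCHED_RESET_ON_FORK, which may be OR'ed into the policy.
    return (policy & ~os.SCHED_RESET_ON_FORK) in REALTIME_POLICIES


def realtime_threads():
    """Yield (pid, tid) for each realtime thread visible in /proc."""
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            tids = os.listdir(f"/proc/{pid}/task")
        except OSError:
            continue  # the process exited while we were scanning
        for tid in tids:
            try:
                if is_realtime(os.sched_getscheduler(int(tid))):
                    yield int(pid), int(tid)
            except OSError:
                continue  # the thread exited in the meantime
```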
1724 .\"
1725 .SH ERRORS
1726 The following errors can occur for
1727 .BR mount (2):
1728 .TP
1729 .B EBUSY
1730 An attempt to mount a cgroup version 1 filesystem specified neither the
1731 .I name=
1732 option (to mount a named hierarchy) nor a controller name (or
1733 .IR all ).
1734 .SH NOTES
1735 A child process created via
1736 .BR fork (2)
1737 inherits its parent's cgroup memberships.
1738 A process's cgroup memberships are preserved across
1739 .BR execve (2).
1740 .PP
1741 The
1742 .BR clone3 (2)
1743 .B CLONE_INTO_CGROUP
1744 flag can be used to create a child process that begins its life in
1745 a different version 2 cgroup from the parent process.
1746 .\"
1747 .SS /proc files
1748 .TP
1749 .IR /proc/cgroups " (since Linux 2.6.24)"
1750 This file contains information about the controllers
1751 that are compiled into the kernel.
1752 An example of the contents of this file (reformatted for readability)
1753 is the following:
1754 .IP
1755 .in +4n
1756 .EX
1757 #subsys_name    hierarchy      num_cgroups    enabled
1758 cpuset          4              1              1
1759 cpu             8              1              1
1760 cpuacct         8              1              1
1761 blkio           6              1              1
1762 memory          3              1              1
1763 devices         10             84             1
1764 freezer         7              1              1
1765 net_cls         9              1              1
1766 perf_event      5              1              1
1767 net_prio        9              1              1
1768 hugetlb         0              1              0
1769 pids            2              1              1
1770 .EE
1771 .in
1772 .IP
1773 The fields in this file are, from left to right:
1774 .RS
1775 .IP 1. 3
1776 The name of the controller.
1777 .IP 2.
1778 The unique ID of the cgroup hierarchy on which this controller is mounted.
1779 If multiple cgroups v1 controllers are bound to the same hierarchy,
1780 then each will show the same hierarchy ID in this field.
1781 The value in this field will be 0 if:
1782 .RS 5
1783 .IP a) 3
1784 the controller is not mounted on a cgroups v1 hierarchy;
1785 .IP b)
1786 the controller is bound to the cgroups v2 single unified hierarchy; or
1787 .IP c)
1788 the controller is disabled (see below).
1789 .RE
1790 .IP 3.
1791 The number of control groups in this hierarchy using this controller.
1792 .IP 4.
1793 This field contains the value 1 if this controller is enabled,
1794 or 0 if it has been disabled (via the
1795 .I cgroup_disable
1796 kernel command-line boot parameter).
1797 .RE
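.IP
The four fields described above can be read with a few lines of Python.
This is only a sketch:
.I Controller
and
.IR parse_proc_cgroups ()
are hypothetical names chosen for the illustration,
and the caller is assumed to pass the contents of
.IR /proc/cgroups .

```python
from collections import namedtuple

Controller = namedtuple("Controller", "name hierarchy num_cgroups enabled")


def parse_proc_cgroups(text):
    """Parse the contents of /proc/cgroups into Controller tuples."""
    controllers = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip the header line
        name, hierarchy, num_cgroups, enabled = line.split()
        controllers.append(
            Controller(name, int(hierarchy), int(num_cgroups), enabled == "1")
        )
    return controllers
```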
.TP
.IR /proc/[pid]/cgroup " (since Linux 2.6.24)"
This file describes the control groups to which the process
with the corresponding PID belongs.
The displayed information differs for
cgroups version 1 and version 2 hierarchies.
.IP
For each cgroup hierarchy of which the process is a member,
there is one entry containing three colon-separated fields:
.IP
.in +4n
.EX
hierarchy\-ID:controller\-list:cgroup\-path
.EE
.in
.IP
For example:
.IP
.in +4n
.EX
5:cpuacct,cpu,cpuset:/daemons
.EE
.in
.IP
The colon-separated fields are, from left to right:
.RS
.IP 1. 3
For cgroups version 1 hierarchies,
this field contains a unique hierarchy ID number
that can be matched to a hierarchy ID in
.IR /proc/cgroups .
For the cgroups version 2 hierarchy, this field contains the value 0.
.IP 2.
For cgroups version 1 hierarchies,
this field contains a comma-separated list of the controllers
bound to the hierarchy.
For the cgroups version 2 hierarchy, this field is empty.
.IP 3.
This field contains the pathname of the control group in the hierarchy
to which the process belongs.
This pathname is relative to the mount point of the hierarchy.
.RE
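.IP
As a rough illustration of the three-field format described above,
the following Python sketch splits each entry into its parts.
The function names
.IR parse_pid_cgroup ()
and
.IR v2_path ()
are hypothetical.
The sketch splits on the first two colons only,
on the assumption that the pathname in the third field
may itself contain colons.
.IP
.in +4n
.EX
```python
def parse_pid_cgroup(text):
    """Parse /proc/[pid]/cgroup text into (id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        # Split on the first two colons only, since the cgroup
        # pathname in the third field may itself contain colons.
        hierarchy_id, controller_list, path = line.split(":", 2)
        # An empty controller list denotes the cgroups v2 hierarchy.
        controllers = controller_list.split(",") if controller_list else []
        entries.append((int(hierarchy_id), controllers, path))
    return entries


def v2_path(entries):
    """Return the cgroup path in the v2 hierarchy, or None if absent."""
    for hierarchy_id, controllers, path in entries:
        if hierarchy_id == 0 and not controllers:
            return path
    return None
```
.EE
.in
.IP
The path returned by
.IR v2_path ()
is relative to the mount point of the hierarchy,
so a caller would still need to join it with
the location at which the cgroup2 filesystem is mounted.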
.\"
.SS /sys/kernel/cgroup files
.TP
.IR /sys/kernel/cgroup/delegate " (since Linux 4.15)"
.\" commit 01ee6cfb1483fe57c9cbd8e73817dfbf9bacffd3
This file exports a list of the cgroups v2 files
(one per line) that are delegatable
(i.e., whose ownership should be changed to the user ID of the delegatee).
In the future, the set of delegatable files may change or grow,
and this file provides a way for the kernel to inform
user-space applications of which files must be delegated.
As at Linux 4.15, one sees the following when inspecting this file:
.IP
.in +4n
.EX
$ \fBcat /sys/kernel/cgroup/delegate\fP
cgroup.procs
cgroup.subtree_control
cgroup.threads
.EE
.in
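.IP
A delegater could use this list rather than hard-coding filenames.
The following Python sketch shows one possible approach;
the function name
.IR delegatable_paths ()
and the example directory are hypothetical.
It only computes the pathnames whose ownership would be changed
and assumes that the cgroup directory itself is handed over
along with the listed files.
.IP
.in +4n
.EX
```python
import os


def delegatable_paths(delegate_text, cgroup_dir):
    """List the paths under cgroup_dir whose ownership would be changed.

    delegate_text is the text read from /sys/kernel/cgroup/delegate.
    """
    names = [n.strip() for n in delegate_text.splitlines() if n.strip()]
    # Assumption: the cgroup directory is delegated with the listed files.
    return [cgroup_dir] + [os.path.join(cgroup_dir, n) for n in names]
```
.EE
.in
.IP
A caller with sufficient privilege might then pass each returned
pathname to
.BR chown (2)
(for example, via Python's
.IR os.chown ()),
supplying the user ID of the delegatee.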
.TP
.IR /sys/kernel/cgroup/features " (since Linux 4.15)"
.\" commit 5f2e673405b742be64e7c3604ed4ed3ac14f35ce
Over time, the set of cgroups v2 features that are provided by the
kernel may change or grow,
or some features may not be enabled by default.
This file provides a way for user-space applications to discover what
features the running kernel supports and has enabled.
Features are listed one per line:
.IP
.in +4n
.EX
$ \fBcat /sys/kernel/cgroup/features\fP
nsdelegate
memory_localevents
.EE
.in
.IP
The entries that can appear in this file are:
.RS
.TP
.IR memory_localevents " (since Linux 5.2)"
The kernel supports the
.I memory_localevents
mount option.
.TP
.IR nsdelegate " (since Linux 4.15)"
The kernel supports the
.I nsdelegate
mount option.
.TP
.IR memory_recursiveprot " (since Linux 5.7)"
.\" commit 8a931f801340c2be10552c7b5622d5f4852f3a36
The kernel supports the
.I memory_recursiveprot
mount option.
.RE
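.IP
An application might consult this file before choosing mount options.
The following Python sketch is one way to do so;
the function names
.IR parse_features ()
and
.IR supported_mount_options ()
are hypothetical,
and the code assumes only the one-name-per-line format described above.
.IP
.in +4n
.EX
```python
def parse_features(text):
    """Return the set of feature names, which are listed one per line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def supported_mount_options(text, wanted):
    """Return the wanted mount options that the kernel reports, in order."""
    features = parse_features(text)
    return [opt for opt in wanted if opt in features]
```
.EE
.in
.IP
With the example output shown earlier,
asking for
.I nsdelegate
and
.I memory_recursiveprot
would yield only
.IR nsdelegate ,
since
.I memory_recursiveprot
does not appear in that listing.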
.SH SEE ALSO
.BR prlimit (1),
.BR systemd (1),
.BR systemd\-cgls (1),
.BR systemd\-cgtop (1),
.BR clone (2),
.BR ioprio_set (2),
.BR perf_event_open (2),
.BR setrlimit (2),
.BR cgroup_namespaces (7),
.BR cpuset (7),
.BR namespaces (7),
.BR sched (7),
.BR user_namespaces (7)
.PP
The kernel source file
.IR Documentation/admin\-guide/cgroup\-v2.rst .