.\" Copyright (C) 2015 Serge Hallyn <serge@hallyn.com>
.\" and Copyright (C) 2016, 2017 Michael Kerrisk <mtk.manpages@gmail.com>
.\"
.\" %%%LICENSE_START(VERBATIM)
.\" Permission is granted to make and distribute verbatim copies of this
.\" manual provided the copyright notice and this permission notice are
.\" preserved on all copies.
.\"
.\" Permission is granted to copy and distribute modified versions of this
.\" manual under the conditions for verbatim copying, provided that the
.\" entire resulting derived work is distributed under the terms of a
.\" permission notice identical to this one.
.\"
.\" Since the Linux kernel and libraries are constantly changing, this
.\" manual page may be incorrect or out-of-date. The author(s) assume no
.\" responsibility for errors or omissions, or for damages resulting from
.\" the use of the information contained herein. The author(s) may not
.\" have taken the same level of care in the production of this manual,
.\" which is licensed free of charge, as they might when working
.\" professionally.
.\"
.\" Formatted or processed versions of this manual, if unaccompanied by
.\" the source, must acknowledge the copyright and authors of this work.
.\" %%%LICENSE_END
.\"
.TH CGROUPS 7 2020-04-11 "Linux" "Linux Programmer's Manual"
.SH NAME
cgroups \- Linux control groups
.SH DESCRIPTION
Control groups, usually referred to as cgroups,
are a Linux kernel feature which allows processes to
be organized into hierarchical groups whose usage of
various types of resources can then be limited and monitored.
The kernel's cgroup interface is provided through
a pseudo-filesystem called cgroupfs.
Grouping is implemented in the core cgroup kernel code,
while resource tracking and limits are implemented in
a set of per-resource-type subsystems (memory, CPU, and so on).
.\"
.SS Terminology
A
.I cgroup
is a collection of processes that are bound to a set of
limits or parameters defined via the cgroup filesystem.
.PP
A
.I subsystem
is a kernel component that modifies the behavior of
the processes in a cgroup.
Various subsystems have been implemented, making it possible to do things
such as limiting the amount of CPU time and memory available to a cgroup,
accounting for the CPU time used by a cgroup,
and freezing and resuming execution of the processes in a cgroup.
Subsystems are sometimes also known as
.IR "resource controllers"
(or simply, controllers).
.PP
The cgroups for a controller are arranged in a
.IR hierarchy .
This hierarchy is defined by creating, removing, and
renaming subdirectories within the cgroup filesystem.
At each level of the hierarchy, attributes (e.g., limits) can be defined.
The limits, control, and accounting provided by cgroups generally have
effect throughout the subhierarchy underneath the cgroup where the
attributes are defined.
Thus, for example, the limits placed on
a cgroup at a higher level in the hierarchy cannot be exceeded
by descendant cgroups.
.\"
.SS Cgroups version 1 and version 2
The initial release of the cgroups implementation was in Linux 2.6.24.
Over time, various cgroup controllers have been added
to allow the management of various types of resources.
However, the development of these controllers was largely uncoordinated,
with the result that many inconsistencies arose between controllers
and management of the cgroup hierarchies became rather complex.
(A longer description of these problems can be found in
the kernel source file
.IR Documentation/cgroup\-v2.txt .)
.PP
Because of the problems with the initial cgroups implementation
(cgroups version 1),
starting in Linux 3.10, work began on a new,
orthogonal implementation to remedy these problems.
Initially marked experimental, and hidden behind the
.I "\-o\ __DEVEL__sane_behavior"
mount option, the new version (cgroups version 2)
was eventually made official with the release of Linux 4.5.
Differences between the two versions are described in the text below.
The file
.IR cgroup.sane_behavior ,
present in cgroups v1, is a relic of this mount option. The file
always reports "0" and is only retained for backward compatibility.
.PP
Although cgroups v2 is intended as a replacement for cgroups v1,
the older system continues to exist
(and for compatibility reasons is unlikely to be removed).
Currently, cgroups v2 implements only a subset of the controllers
available in cgroups v1.
The two systems are implemented so that both v1 controllers and
v2 controllers can be mounted on the same system.
Thus, for example, it is possible to use those controllers
that are supported under version 2,
while also using version 1 controllers
where version 2 does not yet support those controllers.
The only restriction here is that a controller can't be simultaneously
employed in both a cgroups v1 hierarchy and in the cgroups v2 hierarchy.
.\"
.SH CGROUPS VERSION 1
Under cgroups v1, each controller may be mounted against a separate
cgroup filesystem that provides its own hierarchical organization of the
processes on the system.
It is also possible to comount multiple (or even all) cgroups v1 controllers
against the same cgroup filesystem, meaning that the comounted controllers
manage the same hierarchical organization of processes.
.PP
For each mounted hierarchy,
the directory tree mirrors the control group hierarchy.
Each control group is represented by a directory, with each of its child
control groups represented as a child directory.
For instance,
.IR /user/joe/1.session
represents control group
.IR 1.session ,
which is a child of cgroup
.IR joe ,
which is a child of
.IR /user .
Under each cgroup directory is a set of files which can be read or
written to, reflecting resource limits and a few general cgroup
properties.
.\"
.SS Tasks (threads) versus processes
In cgroups v1, a distinction is drawn between
.I processes
and
.IR tasks .
In this view, a process can consist of multiple tasks
(more commonly called threads, from a user-space perspective,
and called such in the remainder of this man page).
In cgroups v1, it is possible to independently manipulate
the cgroup memberships of the threads in a process.
.PP
The cgroups v1 ability to split threads across different cgroups
caused problems in some cases.
For example, it made no sense for the
.I memory
controller,
since all of the threads of a process share a single address space.
Because of these problems,
the ability to independently manipulate the cgroup memberships
of the threads in a process was removed in the initial cgroups v2
implementation, and subsequently restored in a more limited form
(see the discussion of "thread mode" below).
.\"
.SS Mounting v1 controllers
The use of cgroups requires a kernel built with the
.BR CONFIG_CGROUPS
option.
In addition, each of the v1 controllers has an associated
configuration option that must be set in order to employ that controller.
.PP
In order to use a v1 controller,
it must be mounted against a cgroup filesystem.
The usual place for such mounts is under a
.BR tmpfs (5)
filesystem mounted at
.IR /sys/fs/cgroup .
Thus, one might mount the
.I cpu
controller as follows:
.PP
.in +4n
.EX
mount \-t cgroup \-o cpu none /sys/fs/cgroup/cpu
.EE
.in
.PP
It is possible to comount multiple controllers against the same hierarchy.
For example, here the
.IR cpu
and
.IR cpuacct
controllers are comounted against a single hierarchy:
.PP
.in +4n
.EX
mount \-t cgroup \-o cpu,cpuacct none /sys/fs/cgroup/cpu,cpuacct
.EE
.in
.PP
Comounting controllers has the effect that a process is in the same cgroup for
all of the comounted controllers.
Separately mounting controllers allows a process to
be in cgroup
.I /foo1
for one controller while being in
.I /foo2/foo3
for another.
.PP
It is possible to comount all v1 controllers against the same hierarchy:
.PP
.in +4n
.EX
mount \-t cgroup \-o all cgroup /sys/fs/cgroup
.EE
.in
.PP
(One can achieve the same result by omitting
.IR "\-o all" ,
since it is the default if no controllers are explicitly specified.)
.PP
It is not possible to mount the same controller
against multiple cgroup hierarchies.
For example, it is not possible to mount both the
.I cpu
and
.I cpuacct
controllers against one hierarchy, and to mount the
.I cpu
controller alone against another hierarchy.
It is possible to create multiple mount points with exactly
the same set of comounted controllers.
However, in this case all that results is multiple mount points
providing a view of the same hierarchy.
.PP
Note that on many systems, the v1 controllers are automatically mounted under
.IR /sys/fs/cgroup ;
in particular,
.BR systemd (1)
automatically creates such mount points.
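.PP
As a rough sketch (not part of the interface described above),
the cgroup filesystems that are currently mounted can be listed
by filtering a mount table such as
.IR /proc/mounts .
The function name
.I list_cgroup_mounts
is hypothetical; the sketch assumes the usual layout in which
the second field of each line is the mount point and
the third field is the filesystem type.
.PP
```shell
# List cgroup v1 ("cgroup") and v2 ("cgroup2") mounts found in a mount table.
# $1: mount table to read (defaults to /proc/mounts).
# Output: one line per mount, "<filesystem type> <mount point>".
list_cgroup_mounts() {
    awk '$3 == "cgroup" || $3 == "cgroup2" { print $3, $2 }' "${1:-/proc/mounts}"
}
```
.PP
Called without an argument, the function reads
.IR /proc/mounts ;
passing a different file is mainly useful for experimenting with
a saved copy of a mount table.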
.\"
.SS Unmounting v1 controllers
A mounted cgroup filesystem can be unmounted using the
.BR umount (8)
command, as in the following example:
.PP
.in +4n
.EX
umount /sys/fs/cgroup/pids
.EE
.in
.PP
.IR "But note well" :
a cgroup filesystem is unmounted only if it is not busy,
that is, it has no child cgroups.
If this is not the case, then the only effect of the
.BR umount (8)
is to make the mount invisible.
Thus, to ensure that the mount point is really removed,
one must first remove all child cgroups,
which in turn can be done only after all member processes
have been moved from those cgroups to the root cgroup.
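.PP
Before unmounting, it may be useful to check whether a hierarchy
still has child cgroups.
The following is a minimal sketch of such a check;
the function name
.I has_child_cgroups
is hypothetical, and the sketch assumes that every subdirectory
of the given directory is a child cgroup
(names beginning with a dot are not examined).
.PP
```shell
# Succeed (exit status 0) if cgroup directory $1 contains at least one
# subdirectory, that is, at least one child cgroup; fail otherwise.
has_child_cgroups() {
    for entry in "$1"/*/; do
        # If the pattern matched nothing it is left unexpanded,
        # and the -d test below fails.
        [ -d "$entry" ] && return 0
    done
    return 1
}
```
.PP
If the function succeeds for a mount point such as
.IR /sys/fs/cgroup/pids ,
then the child cgroups need to be removed before
.BR umount (8)
can really remove the mount.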
.\"
.SS Cgroups version 1 controllers
Each of the cgroups version 1 controllers is governed
by a kernel configuration option (listed below).
Additionally, the availability of the cgroups feature is governed by the
.BR CONFIG_CGROUPS
kernel configuration option.
.TP
.IR cpu " (since Linux 2.6.24; " \fBCONFIG_CGROUP_SCHED\fP )
Cgroups can be guaranteed a minimum number of "CPU shares"
when a system is busy.
This does not limit a cgroup's CPU usage if the CPUs are not busy.
For further information, see
.IR Documentation/scheduler/sched-design-CFS.txt .
.IP
In Linux 3.2,
this controller was extended to provide CPU "bandwidth" control.
If the kernel is configured with
.BR CONFIG_CFS_BANDWIDTH ,
then within each scheduling period
(defined via a file in the cgroup directory), it is possible to define
an upper limit on the CPU time allocated to the processes in a cgroup.
This upper limit applies even if there is no other competition for the CPU.
Further information can be found in the kernel source file
.IR Documentation/scheduler/sched\-bwc.txt .
.TP
.IR cpuacct " (since Linux 2.6.24; " \fBCONFIG_CGROUP_CPUACCT\fP )
This provides accounting for CPU usage by groups of processes.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup\-v1/cpuacct.txt .
.TP
.IR cpuset " (since Linux 2.6.24; " \fBCONFIG_CPUSETS\fP )
This cgroup can be used to bind the processes in a cgroup to
a specified set of CPUs and NUMA nodes.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup\-v1/cpusets.txt .
.TP
.IR memory " (since Linux 2.6.25; " \fBCONFIG_MEMCG\fP )
The memory controller supports reporting and limiting of process memory, kernel
memory, and swap used by cgroups.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup\-v1/memory.txt .
.TP
.IR devices " (since Linux 2.6.26; " \fBCONFIG_CGROUP_DEVICE\fP )
This supports controlling which processes may create (mknod) devices as
well as open them for reading or writing.
The policies may be specified as allow-lists and deny-lists.
Hierarchy is enforced, so new rules must not
violate existing rules for the target or ancestor cgroups.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup-v1/devices.txt .
.TP
.IR freezer " (since Linux 2.6.28; " \fBCONFIG_CGROUP_FREEZER\fP )
The
.IR freezer
cgroup can suspend and restore (resume) all processes in a cgroup.
Freezing a cgroup
.I /A
also causes its children, for example, processes in
.IR /A/B ,
to be frozen.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup-v1/freezer-subsystem.txt .
.TP
.IR net_cls " (since Linux 2.6.29; " \fBCONFIG_CGROUP_NET_CLASSID\fP )
This places a classid, specified for the cgroup, on network packets
created by a cgroup.
These classids can then be used in firewall rules,
as well as used to shape traffic using
.BR tc (8).
This applies only to packets
leaving the cgroup, not to traffic arriving at the cgroup.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup-v1/net_cls.txt .
.TP
.IR blkio " (since Linux 2.6.33; " \fBCONFIG_BLK_CGROUP\fP )
The
.I blkio
cgroup controls and limits access to specified block devices by
applying I/O control in the form of throttling and upper limits against leaf
nodes and intermediate nodes in the storage hierarchy.
.IP
Two policies are available.
The first is a proportional-weight time-based division
of disk implemented with CFQ.
This is in effect for leaf nodes using CFQ.
The second is a throttling policy which specifies
upper I/O rate limits on a device.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup-v1/blkio-controller.txt .
.TP
.IR perf_event " (since Linux 2.6.39; " \fBCONFIG_CGROUP_PERF\fP )
This controller allows
.I perf
monitoring of the set of processes grouped in a cgroup.
.IP
Further information can be found in the kernel source file
.IR tools/perf/Documentation/perf-record.txt .
.TP
.IR net_prio " (since Linux 3.3; " \fBCONFIG_CGROUP_NET_PRIO\fP )
This allows priorities to be specified, per network interface, for cgroups.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup-v1/net_prio.txt .
.TP
.IR hugetlb " (since Linux 3.5; " \fBCONFIG_CGROUP_HUGETLB\fP )
This supports limiting the use of huge pages by cgroups.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup-v1/hugetlb.txt .
.TP
.IR pids " (since Linux 4.3; " \fBCONFIG_CGROUP_PIDS\fP )
This controller permits limiting the number of processes that may be created
in a cgroup (and its descendants).
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup-v1/pids.txt .
.TP
.IR rdma " (since Linux 4.11; " \fBCONFIG_CGROUP_RDMA\fP )
The RDMA controller permits limiting the use of
RDMA/IB-specific resources per cgroup.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup-v1/rdma.txt .
.\"
.SS Creating cgroups and moving processes
A cgroup filesystem initially contains a single root cgroup, '/',
which all processes belong to.
A new cgroup is created by creating a directory in the cgroup filesystem:
.PP
.in +4n
.EX
mkdir /sys/fs/cgroup/cpu/cg1
.EE
.in
.PP
This creates a new empty cgroup.
.PP
A process may be moved to this cgroup by writing its PID into the cgroup's
.I cgroup.procs
file:
.PP
.in +4n
.EX
echo $$ > /sys/fs/cgroup/cpu/cg1/cgroup.procs
.EE
.in
.PP
Only one PID at a time should be written to this file.
.PP
Writing the value 0 to a
.IR cgroup.procs
file causes the writing process to be moved to the corresponding cgroup.
.PP
When writing a PID into the
.I cgroup.procs
file,
all threads in the process are moved into the new cgroup at once.
.PP
Within a hierarchy, a process can be a member of exactly one cgroup.
Writing a process's PID to a
.IR cgroup.procs
file automatically removes it from the cgroup of
which it was previously a member.
.PP
The
.I cgroup.procs
file can be read to obtain a list of the processes that are
members of a cgroup.
The returned list of PIDs is not guaranteed to be in order.
Nor is it guaranteed to be free of duplicates.
(For example, a PID may be recycled while reading from the list.)
.PP
In cgroups v1, an individual thread can be moved to
another cgroup by writing its thread ID
(i.e., the kernel thread ID returned by
.BR clone (2)
and
.BR gettid (2))
to the
.IR tasks
file in a cgroup directory.
This file can be read to discover the set of threads
that are members of the cgroup.
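.PP
Because the list read from
.I cgroup.procs
may be unordered and may contain duplicates,
a consumer may wish to normalize it.
The following is a minimal sketch of one way to do so;
the function name
.I cgroup_pids
is hypothetical.
.PP
```shell
# Print the PIDs listed in the cgroup.procs file of cgroup directory $1,
# sorted numerically and with duplicates removed.
cgroup_pids() {
    sort -n -u "$1/cgroup.procs"
}
```
.PP
For instance,
.I "cgroup_pids /sys/fs/cgroup/cpu/cg1"
would list the members of the cgroup created in the example above.
The result is still only a snapshot:
processes may join or leave the cgroup after the file has been read.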
.\"
.SS Removing cgroups
To remove a cgroup,
it must first have no child cgroups and contain no (nonzombie) processes.
So long as that is the case, one can simply
remove the corresponding directory pathname.
Note that files in a cgroup directory cannot and need not be
removed.
.\"
.SS Cgroups v1 release notification
Two files can be used to determine whether the kernel provides
notifications when a cgroup becomes empty.
A cgroup is considered to be empty when it contains no child
cgroups and no member processes.
.PP
A special file in the root directory of each cgroup hierarchy,
.IR release_agent ,
can be used to register the pathname of a program that may be invoked when
a cgroup in the hierarchy becomes empty.
The pathname of the newly empty cgroup (relative to the cgroup mount point)
is provided as the sole command-line argument when the
.IR release_agent
program is invoked.
The
.IR release_agent
program might remove the cgroup directory,
or perhaps repopulate it with a process.
.PP
The default value of the
.IR release_agent
file is empty, meaning that no release agent is invoked.
.PP
The content of the
.I release_agent
file can also be specified via a mount option when the
cgroup filesystem is mounted:
.PP
.in +4n
.EX
mount -o release_agent=pathname ...
.EE
.in
.PP
Whether or not the
.IR release_agent
program is invoked when a particular cgroup becomes empty is determined
by the value in the
.IR notify_on_release
file in the corresponding cgroup directory.
If this file contains the value 0, then the
.IR release_agent
program is not invoked.
If it contains the value 1, the
.IR release_agent
program is invoked.
The default value for this file in the root cgroup is 0.
At the time when a new cgroup is created,
the value in this file is inherited from the corresponding file
in the parent cgroup.
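.PP
The two files described above might be set up together as in the
following sketch.
The function name
.I enable_release_notification
and the agent pathname used in the usage note are hypothetical;
the sketch assumes that the caller has permission to write to both files.
.PP
```shell
# Register a release agent for a cgroups v1 hierarchy and enable
# release notification for one cgroup in that hierarchy.
# $1: mount point (root directory) of the hierarchy
# $2: pathname of the cgroup, relative to the mount point
# $3: pathname of the release agent program
enable_release_notification() {
    echo "$3" > "$1/release_agent" &&
    echo 1 > "$1/$2/notify_on_release"
}
```
.PP
For example,
.I "enable_release_notification /sys/fs/cgroup/cpu cg1 /usr/local/sbin/cg\-agent"
would ask the kernel to run the (hypothetical) agent program when
.I cg1
becomes empty.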
.\"
.SS Cgroup v1 named hierarchies
In cgroups v1,
it is possible to mount a cgroup hierarchy that has no attached controllers:
.PP
.in +4n
.EX
mount -t cgroup -o none,name=somename none /some/mount/point
.EE
.in
.PP
Multiple instances of such hierarchies can be mounted;
each hierarchy must have a unique name.
The only purpose of such hierarchies is to track processes.
(See the discussion of release notification above.)
An example of this is the
.I name=systemd
cgroup hierarchy that is used by
.BR systemd (1)
to track services and user sessions.
.PP
Since Linux 5.0, the
.I cgroup_no_v1
kernel boot option (described below) can be used to disable cgroup v1
named hierarchies, by specifying
.IR cgroup_no_v1=named .
.\"
.SH CGROUPS VERSION 2
In cgroups v2,
all mounted controllers reside in a single unified hierarchy.
While (different) controllers may be simultaneously
mounted under the v1 and v2 hierarchies,
it is not possible to mount the same controller simultaneously
under both the v1 and the v2 hierarchies.
.PP
The new behaviors in cgroups v2 are summarized here,
and in some cases elaborated in the following subsections.
.IP 1. 3
Cgroups v2 provides a unified hierarchy against
which all controllers are mounted.
.IP 2.
"Internal" processes are not permitted.
With the exception of the root cgroup, processes may reside
only in leaf nodes (cgroups that do not themselves contain child cgroups).
The details are somewhat more subtle than this, and are described below.
.IP 3.
Active controllers must be specified via the files
.IR cgroup.controllers
and
.IR cgroup.subtree_control .
.IP 4.
The
.I tasks
file has been removed.
In addition, the
.I cgroup.clone_children
file that is employed by the
.I cpuset
controller has been removed.
.IP 5.
An improved mechanism for notification of empty cgroups is provided by the
.IR cgroup.events
file.
.PP
For more changes, see the
.I Documentation/cgroup-v2.txt
file in the kernel source.
.PP
Some of the new behaviors listed above saw subsequent modification with
the addition in Linux 4.14 of "thread mode" (described below).
.\"
.SS Cgroups v2 unified hierarchy
In cgroups v1, the ability to mount different controllers
against different hierarchies was intended to allow great flexibility
for application design.
In practice, though,
the flexibility turned out to be less useful than expected,
and in many cases added complexity.
Therefore, in cgroups v2,
all available controllers are mounted against a single hierarchy.
The available controllers are automatically mounted,
meaning that it is not necessary (or possible) to specify the controllers
when mounting the cgroup v2 filesystem using a command such as the following:
.PP
.in +4n
.EX
mount -t cgroup2 none /mnt/cgroup2
.EE
.in
.PP
A cgroup v2 controller is available only if it is not currently in use
via a mount against a cgroup v1 hierarchy.
Or, to put things another way, it is not possible to employ
the same controller against both a v1 hierarchy and the unified v2 hierarchy.
This means that it may be necessary first to unmount a v1 controller
(as described above) before that controller is available in v2.
Since
.BR systemd (1)
makes heavy use of some v1 controllers by default,
it can in some cases be simpler to boot the system with
selected v1 controllers disabled.
To do this, specify the
.IR cgroup_no_v1=list
option on the kernel boot command line;
.I list
is a comma-separated list of the names of the controllers to disable,
or the word
.I all
to disable all v1 controllers.
(This situation is correctly handled by
.BR systemd (1),
which falls back to operating without the specified controllers.)
.PP
Note that on many modern systems,
.BR systemd (1)
automatically mounts the
.I cgroup2
filesystem at
.I /sys/fs/cgroup/unified
during the boot process.
.\"
.SS Cgroups v2 mount options
The following options
.RI ( "mount -o" )
can be specified when mounting the cgroup v2 filesystem:
.TP
.IR nsdelegate " (since Linux 4.15)"
Treat cgroup namespaces as delegation boundaries.
For details, see below.
.TP
.IR memory_localevents " (since Linux 5.2)"
.\" commit 9852ae3fe5293264f01c49f2571ef7688f7823ce
The
.I memory.events
file should show statistics only for the cgroup itself,
and not for any descendant cgroups.
This was the behavior before Linux 5.2.
Starting in Linux 5.2,
the default behavior is to include statistics for descendant cgroups in
.IR memory.events ,
and this mount option can be used to revert to the legacy behavior.
This option is system wide and can be set on mount or
modified through remount only from the initial mount namespace;
it is silently ignored in noninitial namespaces.
.\"
.SS Cgroups v2 controllers
The following controllers, documented in the kernel source file
.IR Documentation/cgroup-v2.txt ,
are supported in cgroups version 2:
.TP
.IR cpu " (since Linux 4.15)"
This is the successor to the version 1
.I cpu
and
.I cpuacct
controllers.
.TP
.IR cpuset " (since Linux 5.0)"
This is the successor of the version 1
.I cpuset
controller.
.TP
.IR freezer " (since Linux 5.2)"
.\" commit 76f969e8948d82e78e1bc4beb6b9465908e74873
This is the successor of the version 1
.I freezer
controller.
.TP
.IR hugetlb " (since Linux 5.6)"
This is the successor of the version 1
.I hugetlb
controller.
.TP
.IR io " (since Linux 4.5)"
This is the successor of the version 1
.I blkio
controller.
.TP
.IR memory " (since Linux 4.5)"
This is the successor of the version 1
.I memory
controller.
.TP
.IR perf_event " (since Linux 4.11)"
This is the same as the version 1
.I perf_event
controller.
.TP
.IR pids " (since Linux 4.5)"
This is the same as the version 1
.I pids
controller.
.TP
.IR rdma " (since Linux 4.11)"
This is the same as the version 1
.I rdma
controller.
.PP
There is no direct equivalent of the
.I net_cls
and
.I net_prio
controllers from cgroups version 1.
Instead, support has been added to
.BR iptables (8)
to allow eBPF filters that hook on cgroup v2 pathnames to make decisions
about network traffic on a per-cgroup basis.
.PP
The v2
.I devices
controller provides no interface files;
instead, device control is gated by attaching an eBPF
.RB ( BPF_CGROUP_DEVICE )
program to a v2 cgroup.
.\"
.SS Cgroups v2 subtree control
Each cgroup in the v2 hierarchy contains the following two files:
.TP
.IR cgroup.controllers
This read-only file exposes a list of the controllers that are
.I available
in this cgroup.
The contents of this file match the contents of the
.I cgroup.subtree_control
file in the parent cgroup.
.TP
.I cgroup.subtree_control
This is a list of controllers that are
.IR active
.RI ( enabled )
in the cgroup.
The set of controllers in this file is a subset of the set in the
.IR cgroup.controllers
file of this cgroup.
The set of active controllers is modified by writing strings to this file
containing space-delimited controller names,
each preceded by '+' (to enable a controller)
or '\-' (to disable a controller), as in the following example:
.IP
.in +4n
.EX
echo '+pids -memory' > x/y/cgroup.subtree_control
.EE
.in
.IP
An attempt to enable a controller
that is not present in
.I cgroup.controllers
leads to an
.B ENOENT
error when writing to the
.I cgroup.subtree_control
file.
.PP
Because the list of controllers in
.I cgroup.subtree_control
is a subset of those in
.IR cgroup.controllers ,
a controller that has been disabled in one cgroup in the hierarchy
can never be re-enabled in the subtree below that cgroup.
.PP
A cgroup's
.I cgroup.subtree_control
file determines the set of controllers that are exercised in the
.I child
cgroups.
When a controller (e.g.,
.IR pids )
is present in the
.I cgroup.subtree_control
file of a parent cgroup,
then the corresponding controller-interface files (e.g.,
.IR pids.max )
are automatically created in the children of that cgroup
and can be used to exert resource control in the child cgroups.
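.PP
One way to avoid the
.B ENOENT
error described above is to consult
.I cgroup.controllers
before writing to
.IR cgroup.subtree_control .
The following is a minimal sketch of that idea;
the function name
.I enable_available_controllers
is hypothetical, and the sketch only enables controllers
(it never writes a '\-' entry).
.PP
```shell
# Enable, for the children of cgroup directory $1, those of the requested
# controllers ($2, $3, ...) that are listed in $1/cgroup.controllers.
# Requested controllers that are not available are silently skipped.
# Fails (nonzero status) if none of the requested controllers is available
# or if the write to cgroup.subtree_control fails.
enable_available_controllers() {
    dir=$1
    shift
    request=
    for ctl in "$@"; do
        for avail in $(cat "$dir/cgroup.controllers"); do
            if [ "$ctl" = "$avail" ]; then
                # Build a string such as "+pids +memory".
                request="$request${request:+ }+$ctl"
            fi
        done
    done
    [ -n "$request" ] && echo "$request" > "$dir/cgroup.subtree_control"
}
```
.PP
For example,
.I "enable_available_controllers x/y pids memory"
would write only those of
.I +pids
and
.I +memory
that the cgroup
.I x/y
actually offers.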
21f0d132 779.\"
2468f14e
MK
780.SS Cgroups v2 """no internal processes""" rule
781Cgroups v2 enforces a so-called "no internal processes" rule.
782Roughly speaking, this rule means that,
783with the exception of the root cgroup, processes may reside
784only in leaf nodes (cgroups that do not themselves contain child cgroups).
785This avoids the need to decide how to partition resources between
786processes which are members of cgroup A and processes in child cgroups of A.
787.PP
788For instance, if cgroup
789.I /cg1/cg2
790exists, then a process may reside in
791.IR /cg1/cg2 ,
792but not in
793.IR /cg1 .
794This is to avoid an ambiguity in cgroups v1
795with respect to the delegation of resources between processes in
796.I /cg1
797and its child cgroups.
798The recommended approach in cgroups v2 is to create a subdirectory called
799.I leaf
800for any nonleaf cgroup which should contain processes, but no child cgroups.
801Thus, processes which previously would have gone into
802.I /cg1
803would now go into
804.IR /cg1/leaf .
805This has the advantage of making explicit
806the relationship between processes in
807.I /cg1/leaf
808and
809.IR /cg1 's
810other children.
811.PP
812The "no internal processes" rule is in fact more subtle than stated above.
813More precisely, the rule is that a (nonroot) cgroup can't both
814(1) have member processes, and
815(2) distribute resources into child cgroups\(emthat is, have a nonempty
816.I cgroup.subtree_control
817file.
818Thus, it
819.I is
820possible for a cgroup to have both member processes and child cgroups,
821but before controllers can be enabled for that cgroup,
822the member processes must be moved out of the cgroup
823(e.g., perhaps into the child cgroups).
824.PP
825With the Linux 4.14 addition of "thread mode" (described below),
826the "no internal processes" rule has been relaxed in some cases.
2468f14e 827.\"
754f4cf5 828.SS Cgroups v2 cgroup.events file
829Each nonroot cgroup in the v2 hierarchy contains a read-only file,
830.IR cgroup.events ,
831whose contents are key-value pairs
754f4cf5 832(delimited by newline characters, with the key and value separated by spaces)
providing state information about the cgroup:
835.PP
836.in +4n
837.EX
$ \fBcat mygrp/cgroup.events\fP
populated 1
frozen 0
841.EE
842.in
843.PP
844The following keys may appear in this file:
845.TP
846.IR populated
847The value of this key is either 1,
848if this cgroup or any of its descendants has member processes,
849or otherwise 0.
850.TP
851.IR frozen " (since Linux 5.2)"
852.\" commit 76f969e8948d82e78e1bc4beb6b9465908e7487
853The value of this key is 1 if this cgroup is currently frozen,
854or 0 if it is not.
a721e8b2 855.PP
856The
857.IR cgroup.events
858file can be monitored, in order to receive notification when the value of
859one of its keys changes.
860Such monitoring can be done using
754f4cf5 861.BR inotify (7),
71e2545e 862which notifies changes as
754f4cf5 863.BR IN_MODIFY
71e2545e 864events, or
754f4cf5 865.BR poll (2),
71e2545e 866which notifies changes by returning the
754f4cf5 867.B POLLPRI
868and
869.B POLLERR
71e2545e 870bits in the
871.IR revents
872field.
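.PP
Because the file is a simple list of key-value pairs,
a single key can be extracted with standard tools.
The following is a minimal sketch;
the function name is hypothetical,
and the path to a
.I cgroup.events
file is assumed to be passed as the first argument.

```shell
# Hypothetical helper: print the value of key $2 from the
# cgroup.events-style file $1; fail if the key is absent.
cgroup_event() {
    awk -v key="$2" '
        $1 == key { print $2; found = 1 }
        END { exit !found }
    ' "$1"
}
```

For example,
.I "cgroup_event mygrp/cgroup.events populated"
would print the current value of the
.I populated
key; a monitoring loop could call it each time a change is notified.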
873.\"
874.SS Cgroup v2 release notification
875Cgroups v2 provides a new mechanism for obtaining notification
876when a cgroup becomes empty.
877The cgroups v1
878.IR release_agent
879and
880.IR notify_on_release
881files are removed, and replaced by the
ccb1a262 882.I populated
883key in the
884.IR cgroup.events
885file.
886This key either has the value 0,
887meaning that the cgroup (and its descendants)
888contain no (nonzombie) member processes,
889or 1, meaning that the cgroup (or one of its descendants)
890contains member processes.
891.PP
892The cgroups v2 release-notification mechanism
daf57a6a 893offers the following advantages over the cgroups v1
754f4cf5 894.IR release_agent
895mechanism:
896.IP * 3
897It allows for cheaper notification,
754f4cf5
MK
898since a single process can monitor multiple
899.IR cgroup.events
71e2545e 900files (using the techniques described earlier).
daf57a6a
MK
901By contrast, the cgroups v1 mechanism requires the expense of creating
902a process for each notification.
903.IP *
904Notification for different cgroup subhierarchies can be delegated
905to different processes.
906By contrast, the cgroups v1 mechanism allows only one release agent
907for an entire hierarchy.
c91a9f8a 908.\"
909.SS Cgroups v2 cgroup.stat file
910.\" commit ec39225cca42c05ac36853d11d28f877fde5c42e
911Each cgroup in the v2 hierarchy contains a read-only
912.IR cgroup.stat
913file (first introduced in Linux 4.14)
914that consists of lines containing key-value pairs.
915The following keys currently appear in this file:
916.TP
917.I nr_descendants
918This is the total number of visible (i.e., living) descendant cgroups
919underneath this cgroup.
920.TP
921.I nr_dying_descendants
922This is the total number of dying descendant cgroups
923underneath this cgroup.
924A cgroup enters the dying state after being deleted.
925It remains in that state for an undefined period
926(which will depend on system load)
c7f63e74
MK
927while resources are freed before the cgroup is destroyed.
928Note that the presence of some cgroups in the dying state is normal,
929and is not indicative of any problem.
5e071499
MK
930.IP
931A process can't be made a member of a dying cgroup,
932and a dying cgroup can't be brought back to life.
933.\"
934.SS Limiting the number of descendant cgroups
935Each cgroup in the v2 hierarchy contains the following files,
936which can be used to view and set limits on the number
937of descendant cgroups under that cgroup:
938.TP
939.IR cgroup.max.depth " (since Linux 4.14)"
940.\" commit 1a926e0bbab83bae8207d05a533173425e0496d1
941This file defines a limit on the depth of nesting of descendant cgroups.
942A value of 0 in this file means that no descendant cgroups can be created.
943An attempt to create a descendant whose nesting level exceeds
944the limit fails
945.RI ( mkdir (2)
946fails with the error
947.BR EAGAIN ).
948.IP
949Writing the string
950.IR """max"""
951to this file means that no limit is imposed.
952The default value in this file is
953.IR """max""" .
954.TP
955.IR cgroup.max.descendants " (since Linux 4.14)"
956.\" commit 1a926e0bbab83bae8207d05a533173425e0496d1
957This file defines a limit on the number of live descendant cgroups that
958this cgroup may have.
959An attempt to create more descendants than allowed by the limit fails
960.RI ( mkdir (2)
961fails with the error
962.BR EAGAIN ).
963.IP
964Writing the string
965.IR """max"""
966to this file means that no limit is imposed.
967The default value in this file is
968.IR """max""" .
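.PP
Both files accept either a number or the string "max".
The following shell sketch shows one way to set such a limit
with basic validation of the value.
The function name and its argument order are assumptions
made for this example only.

```shell
# Hypothetical helper: write limit $3 to file $2
# (cgroup.max.depth or cgroup.max.descendants) in cgroup directory $1.
set_descendant_limit() {
    cg=$1
    file=$2
    value=$3
    case $value in
        max) ;;                          # "max" means no limit
        ''|*[!0-9]*)                     # reject anything but digits
            echo "invalid limit: $value" >&2
            return 1 ;;
    esac
    printf '%s\n' "$value" > "$cg/$file"
}
```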
969.\"
4b1c2041 970.SH CGROUPS DELEGATION: DELEGATING A HIERARCHY TO A LESS PRIVILEGED USER
4242dfbe
MK
971In the context of cgroups,
972delegation means passing management of some subtree
51629a30 973of the cgroup hierarchy to a nonprivileged user.
87b18a8b
MK
974Cgroups v1 provides support for delegation based on file permissions
975in the cgroup hierarchy but with less strict containment rules than v2
976(as noted below).
977Cgroups v2 supports delegation with containment by explicit design.
4b1c2041
MK
978The focus of the discussion in this section is on delegation in cgroups v2,
979with some differences for cgroups v1 noted along the way.
4242dfbe
MK
980.PP
981Some terminology is required in order to describe delegation.
982A
983.I delegater
984is a privileged user (i.e., root) who owns a parent cgroup.
985A
986.I delegatee
987is a nonprivileged user who will be granted the permissions needed
988to manage some subhierarchy under that parent cgroup,
989known as the
990.IR "delegated subtree" .
991.PP
992To perform delegation,
993the delegater makes certain directories and files writable by the delegatee,
994typically by changing the ownership of the objects to be the user ID
995of the delegatee.
0735069b
MK
996Assuming that we want to delegate the hierarchy rooted at (say)
997.I /dlgt_grp
4242dfbe
MK
998and that there are not yet any child cgroups under that cgroup,
999the ownership of the following is changed to the user ID of the delegatee:
1000.TP
0735069b 1001.IR /dlgt_grp
4242dfbe
MK
1002Changing the ownership of the root of the subtree means that any new
1003cgroups created under the subtree (and the files they contain)
1004will also be owned by the delegatee.
1005.TP
0735069b 1006.IR /dlgt_grp/cgroup.procs
f7286edc 1007Changing the ownership of this file means that the delegatee
4242dfbe
MK
1008can move processes into the root of the delegated subtree.
1009.TP
4b1c2041 1010.IR /dlgt_grp/cgroup.subtree_control " (cgroups v2 only)"
15f2303d 1011Changing the ownership of this file means that the delegatee
e5936eb6 1012can enable controllers (that are present in
0735069b 1013.IR /dlgt_grp/cgroup.controllers )
4242dfbe 1014in order to further redistribute resources at lower levels in the subtree.
e5936eb6
MK
1015(As an alternative to changing the ownership of this file,
1016the delegater might instead add selected controllers to this file.)
639b6c8c 1017.TP
4b1c2041 1018.IR /dlgt_grp/cgroup.threads " (cgroups v2 only)"
639b6c8c
MK
1019Changing the ownership of this file is necessary if a threaded subtree
1020is being delegated (see the description of "thread mode", below).
7b327dd5 1021This permits the delegatee to write thread IDs to the file.
cd7f4c49
MK
1022(The ownership of this file can also be changed when delegating
1023a domain subtree, but currently this serves no purpose,
1024since, as described below, it is not possible to move a thread between
1025domain cgroups by writing its thread ID to the
2b91ed4e 1026.IR cgroup.threads
cd7f4c49 1027file.)
4b1c2041
MK
1028.IP
1029In cgroups v1, the corresponding file that should instead be delegated is the
1030.I tasks
1031file.
4242dfbe
MK
1032.PP
1033The delegater should
1034.I not
change the ownership of any of the controller interface files (e.g.,
1036.IR pids.max ,
1037.IR memory.high )
1038in
.IR /dlgt_grp .
4242dfbe
MK
1040Those files are used from the next level above the delegated subtree
1041in order to distribute resources into the subtree,
1042and the delegatee should not have permission to change
1043the resources that are distributed into the delegated subtree.
1044.PP
668ef765
MK
1045See also the discussion of the
1046.IR /sys/kernel/cgroup/delegate
4b1c2041 1047file in NOTES for information about further delegatable files in cgroups v2.
668ef765 1048.PP
4242dfbe
MK
1049After the aforementioned steps have been performed,
1050the delegatee can create child cgroups within the delegated subtree
6dc513cd
MK
1051(the cgroup subdirectories and the files they contain
1052will be owned by the delegatee)
4242dfbe
MK
1053and move processes between cgroups in the subtree.
1054If some controllers are present in
.IR /dlgt_grp/cgroup.subtree_control ,
4242dfbe 1056or the ownership of that file was passed to the delegatee,
f7286edc 1057the delegatee can also control the further redistribution
4242dfbe 1058of the corresponding resources into the delegated subtree.
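.PP
The ownership changes described above can be sketched as a small shell function.
This is an illustration under stated assumptions rather than a definitive procedure:
the function name is hypothetical,
the first argument is assumed to be the cgroup directory to delegate,
and the second is the user name or user ID of the delegatee.
The controller interface files are deliberately left untouched.

```shell
# Hypothetical helper: delegate cgroup directory $1 to user $2.
delegate_cgroup() {
    cg=$1
    owner=$2
    # The root of the delegated subtree.
    chown "$owner" "$cg" || return 1
    for f in cgroup.procs cgroup.subtree_control cgroup.threads; do
        # cgroup.subtree_control and cgroup.threads exist only in
        # cgroups v2, so skip any file that is not present.
        if [ -e "$cg/$f" ]; then
            chown "$owner" "$cg/$f" || return 1
        fi
    done
}
```

On a real hierarchy this would normally be run by the delegater (i.e., as root).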
27b086e9 1059.\"
ed3f4f34 1060.SS Cgroups v2 delegation: nsdelegate and cgroup namespaces
ed3f4f34
MK
1061Starting with Linux 4.13,
1062.\" commit 5136f6365ce3eace5a926e10f16ed2a233db5ba9
4b1c2041 1063there is a second way to perform cgroup delegation in the cgroups v2 hierarchy.
07361828 1064This is done by mounting or remounting the cgroup v2 filesystem with the
ed3f4f34 1065.I nsdelegate
07361828
MK
1066mount option.
1067For example, if the cgroup v2 filesystem has already been mounted,
1068we can remount it with the
1069.I nsdelegate
1070option as follows:
ed3f4f34
MK
1071.PP
1072.in +4n
1073.EX
mount -t cgroup2 -o remount,nsdelegate \e
    none /sys/fs/cgroup/unified
1076.EE
1077.in
07361828
MK
1078.\"
.\" Alternatively, we could boot the kernel with the options:
1080.\"
1081.\" cgroup_no_v1=all systemd.legacy_systemd_cgroup_controller
1082.\"
1083.\" The effect of the latter option is to prevent systemd from employing
1084.\" its "hybrid" cgroup mode, where it tries to make use of cgroups v2.
ed3f4f34 1085.PP
dc581e07 1086The effect of this mount option is to cause cgroup namespaces
ed3f4f34
MK
1087to automatically become delegation boundaries.
1088More specifically,
1089the following restrictions apply for processes inside the cgroup namespace:
1090.IP * 3
446d1643 1091Writes to controller interface files in the root directory of the namespace
ed3f4f34
MK
1092will fail with the error
1093.BR EPERM .
1094Processes inside the cgroup namespace can still write to delegatable
446d1643 1095files in the root directory of the cgroup namespace such as
ed3f4f34
MK
1096.IR cgroup.procs
1097and
1098.IR cgroup.subtree_control ,
and can create a subhierarchy underneath the root directory.
ed3f4f34
MK
1100.IP *
1101Attempts to migrate processes across the namespace boundary are denied
1102(with the error
1103.BR ENOENT ).
1104Processes inside the cgroup namespace can still
1105(subject to the containment rules described below)
1106move processes between cgroups
1107.I within
1108the subhierarchy under the namespace root.
1109.PP
1110The ability to define cgroup namespaces as delegation boundaries
1111makes cgroup namespaces more useful.
1112To understand why, suppose that we already have one cgroup hierarchy
1113that has been delegated to a nonprivileged user,
1114.IR cecilia ,
1115using the older delegation technique described above.
1116Suppose further that
1117.I cecilia
1118wanted to further delegate a subhierarchy
1119under the existing delegated hierarchy.
1120(For example, the delegated hierarchy might be associated with
1121an unprivileged container run by
1122.IR cecilia .)
1123Even if a cgroup namespace was employed,
1124because both hierarchies are owned by the unprivileged user
1125.IR cecilia ,
1126the following illegitimate actions could be performed:
1127.IP * 3
1128A process in the inferior hierarchy could change the
619dbe1c 1129resource controller settings in the root directory of that hierarchy.
ed3f4f34
MK
1130(These resource controller settings are intended to allow control to
1131be exercised from the
1132.I parent
1133cgroup;
1134a process inside the child cgroup should not be allowed to modify them.)
1135.IP *
1136A process inside the inferior hierarchy could move processes
1137into and out of the inferior hierarchy if the cgroups in the
1138superior hierarchy were somehow visible.
1139.PP
1140Employing the
1141.I nsdelegate
1142mount option prevents both of these possibilities.
1143.PP
1144The
1145.I nsdelegate
1146mount option only has an effect when performed in
1147the initial mount namespace;
1148in other mount namespaces, the option is silently ignored.
07361828
MK
1149.PP
1150.IR Note :
1151On some systems,
1152.BR systemd (1)
1153automatically mounts the cgroup v2 filesystem.
1154In order to experiment with the
1155.I nsdelegate
44084d19
MK
1156operation, it may be useful to boot the kernel with
1157the following command-line options:
1158.PP
1159.in +4n
1160.EX
cgroup_no_v1=all systemd.legacy_systemd_cgroup_controller
1162.EE
1163.in
1164.PP
1165These options cause the kernel to boot with the cgroups v1 controllers
1166disabled (meaning that the controllers are available in the v2 hierarchy),
and tell
1168.BR systemd (1)
1169not to mount and use the cgroup v2 hierarchy,
1170so that the v2 hierarchy can be manually mounted
1171with the desired options after boot-up.
ed3f4f34 1172.\"
4b1c2041 1173.SS Cgroup delegation containment rules
4242dfbe
MK
1174Some delegation
1175.IR "containment rules"
1176ensure that the delegatee can move processes between cgroups within the
1177delegated subtree,
1178but can't move processes from outside the delegated subtree into
1179the subtree or vice versa.
1180A nonprivileged process (i.e., the delegatee) can write the PID of
1181a "target" process into a
1182.IR cgroup.procs
1183file only if all of the following are true:
1184.IP * 3
4242dfbe
MK
1185The writer has write permission on the
1186.I cgroup.procs
1187file in the destination cgroup.
1188.IP *
1189The writer has write permission on the
1190.I cgroup.procs
396761ee 1191file in the nearest common ancestor of the source and destination cgroups.
e366c4d4
MK
1192Note that in some cases,
1193the nearest common ancestor may be the source or destination cgroup itself.
4b1c2041
MK
1194This requirement is not enforced for cgroups v1 hierarchies,
1195with the consequence that containment in v1 is less strict than in v2.
1196(For example, in cgroups v1 the user that owns two distinct
1197delegated subhierarchies can move a process between the hierarchies.)
28f612ea 1198.IP *
ed3f4f34
MK
1199If the cgroup v2 filesystem was mounted with the
1200.I nsdelegate
7b574df5 1201option, the writer must be able to see the source and destination cgroups
ed3f4f34
MK
1202from its cgroup namespace.
1203.IP *
4b1c2041 1204In cgroups v1:
28f612ea
MK
1205the effective UID of the writer (i.e., the delegatee) matches the
1206real user ID or the saved set-user-ID of the target process.
4b1c2041
MK
1207Before Linux 4.11,
1208.\" commit 576dd464505fc53d501bb94569db76f220104d28
this requirement also applied in cgroups v2.
(This was a historical requirement inherited from cgroups v1
1211that was later deemed unnecessary,
1212since the other rules suffice for containment in cgroups v2.)
4242dfbe
MK
1213.PP
1214.IR Note :
1215one consequence of these delegation containment rules is that the
0735069b
MK
1216unprivileged delegatee can't place the first process into
1217the delegated subtree;
1218instead, the delegater must place the first process
1219(a process owned by the delegatee) into the delegated subtree.
4242dfbe 1220.\"
75e83bc2 1221.SH CGROUPS VERSION 2 THREAD MODE
c8902e25
MK
1222Among the restrictions imposed by cgroups v2 that were not present
1223in cgroups v1 are the following:
1224.IP * 3
1225.IR "No thread-granularity control" :
1226all of the threads of a process must be in the same cgroup.
1227.IP *
1228.IR "No internal processes" :
1229a cgroup can't both have member processes and
1230exercise controllers on child cgroups.
1231.PP
1232Both of these restrictions were added because
1233the lack of these restrictions had caused problems
1234in cgroups v1.
1235In particular, the cgroups v1 ability to allow thread-level granularity
1236for cgroup membership made no sense for some controllers.
1237(A notable example was the
1238.I memory
1239controller: since threads share an address space,
1240it made no sense to split threads across different
1241.I memory
1242cgroups.)
1243.PP
1244Notwithstanding the initial design decision in cgroups v2,
1245there were use cases for certain controllers, notably the
1246.IR cpu
1247controller,
1248for which thread-level granularity of control was meaningful and useful.
1249To accommodate such use cases, Linux 4.14 added
1250.I "thread mode"
1251for cgroups v2.
1252.PP
1253Thread mode allows the following:
1254.IP * 3
1255The creation of
1256.IR "threaded subtrees"
1257in which the threads of a process may
1258be spread across cgroups inside the tree.
1259(A threaded subtree may contain multiple multithreaded processes.)
1260.IP *
1261The concept of
.IR "threaded controllers" ,
1263which can distribute resources across the cgroups in a threaded subtree.
1264.IP *
A relaxation of the "no internal processes" rule,
1266so that, within a threaded subtree,
1267a cgroup can both contain member threads and
1268exercise resource control over child cgroups.
1269.PP
1270With the addition of thread mode,
1271each nonroot cgroup now contains a new file,
1272.IR cgroup.type ,
1273that exposes, and in some circumstances can be used to change,
1274the "type" of a cgroup.
1275This file contains one of the following type values:
1276.TP
1277.I "domain"
1278This is a normal v2 cgroup that provides process-granularity control.
1279If a process is a member of this cgroup,
1280then all threads of the process are (by definition) in the same cgroup.
1281This is the default cgroup type,
1282and provides the same behavior that was provided for
1283cgroups in the initial cgroups v2 implementation.
1284.TP
1285.I "threaded"
1286This cgroup is a member of a threaded subtree.
1287Threads can be added to this cgroup,
1288and controllers can be enabled for the cgroup.
1289.TP
1290.I "domain threaded"
1291This is a domain cgroup that serves as the root of a threaded subtree.
1292This cgroup type is also known as "threaded root".
1293.TP
1294.I "domain invalid"
1295This is a cgroup inside a threaded subtree
1296that is in an "invalid" state.
1297Processes can't be added to the cgroup,
1298and controllers can't be enabled for the cgroup.
1299The only thing that can be done with this cgroup (other than deleting it)
1300is to convert it to a
1301.IR threaded
1302cgroup by writing the string
1303.IR """threaded"""
1304to the
1305.I cgroup.type
1306file.
61254835
MK
1307.IP
1308The rationale for the existence of this "interim" type
1309during the creation of a threaded subtree
1310(rather than the kernel simply immediately converting all cgroups
1311under the threaded root to the type
1312.IR threaded )
1313is to allow for
possible future extensions to the thread mode model.
1315.\"
1316.SS Threaded versus domain controllers
With the addition of thread mode,
1318cgroups v2 now distinguishes two types of resource controllers:
1319.IP * 3
1320.I Threaded
2cd9bbfa 1321.\" In the kernel source, look for ".threaded[ \t]*= true" in
218eadf4 1322.\" initializations of struct cgroup_subsys
c8902e25
MK
1323controllers: these controllers support thread-granularity for
1324resource control and can be enabled inside threaded subtrees,
1325with the result that the corresponding controller-interface files
1326appear inside the cgroups in the threaded subtree.
aa2c3623 1327As at Linux 4.19, the following controllers are threaded:
c8902e25
MK
1328.IR cpu ,
1329.IR perf_event ,
1330and
1331.IR pids .
1332.IP *
1333.I Domain
1334controllers: these controllers support only process granularity
1335for resource control.
1336From the perspective of a domain controller,
1337all threads of a process are always in the same cgroup.
1338Domain controllers can't be enabled inside a threaded subtree.
1339.\"
1340.SS Creating a threaded subtree
1341There are two pathways that lead to the creation of a threaded subtree.
1342The first pathway proceeds as follows:
1343.IP 1. 3
1344We write the string
1345.IR """threaded"""
1346to the
1347.I cgroup.type
1348file of a cgroup
1349.IR y/z
1350that currently has the type
1351.IR domain .
1352This has the following effects:
1353.RS
1354.IP * 3
1355The type of the cgroup
1356.IR y/z
1357becomes
1358.IR threaded .
1359.IP *
1360The type of the parent cgroup,
1361.IR y ,
1362becomes
1363.IR "domain threaded" .
1364The parent cgroup is the root of a threaded subtree
1365(also known as the "threaded root").
1366.IP *
1367All other cgroups under
1368.IR y
1369that were not already of type
1370.IR threaded
1371(because they were inside already existing threaded subtrees
1372under the new threaded root)
1373are converted to type
1374.IR "domain invalid" .
1375Any subsequently created cgroups under
1376.I y
1377will also have the type
1378.IR "domain invalid" .
1379.RE
1380.IP 2.
1381We write the string
1382.IR """threaded"""
1383to each of the
1384.IR "domain invalid"
1385cgroups under
1386.IR y ,
1387in order to convert them to the type
1388.IR threaded .
As a consequence of this step, all cgroups under the threaded root
1390now have the type
1391.IR threaded
1392and the threaded subtree is now fully usable.
1393The requirement to write
1394.IR """threaded"""
1395to each of these cgroups is somewhat cumbersome,
1396but allows for possible future extensions to the thread-mode model.
1397.PP
1398The second way of creating a threaded subtree is as follows:
1399.IP 1. 3
1400In an existing cgroup,
1401.IR z ,
1402that currently has the type
1403.IR domain ,
1404we (1) enable one or more threaded controllers and
1405(2) make a process a member of
1406.IR z .
1407(These two steps can be done in either order.)
1408This has the following consequences:
1409.RS
1410.IP * 3
1411The type of
1412.I z
1413becomes
1414.IR "domain threaded" .
1415.IP *
1416All of the descendant cgroups of
.I z
7a1cddd2 1418that were not already of type
c8902e25
MK
1419.IR threaded
1420are converted to type
1421.IR "domain invalid" .
1422.RE
1423.IP 2.
1424As before, we make the threaded subtree usable by writing the string
1425.IR """threaded"""
1426to each of the
1427.IR "domain invalid"
1428cgroups under
.IR z ,
1430in order to convert them to the type
1431.IR threaded .
1432.PP
1433One of the consequences of the above pathways to creating a threaded subtree
1434is that the threaded root cgroup can be a parent only to
1435.I threaded
1436(and
1437.IR "domain invalid" )
1438cgroups.
1439The threaded root cgroup can't be a parent of a
1440.I domain
cgroup, and a
1442.I threaded
1443cgroup
1444can't have a sibling that is a
1445.I domain
1446cgroup.
1447.\"
1448.SS Using a threaded subtree
1449Within a threaded subtree, threaded controllers can be enabled
1450in each subgroup whose type has been changed to
1451.IR threaded ;
1452upon doing so, the corresponding controller interface files
1453appear in the children of that cgroup.
1454.PP
1455A process can be moved into a threaded subtree by writing its PID to the
1456.I cgroup.procs
1457file in one of the cgroups inside the tree.
1458This has the effect of making all of the threads
1459in the process members of the corresponding cgroup
1460and makes the process a member of the threaded subtree.
1461The threads of the process can then be spread across
1462the threaded subtree by writing their thread IDs (see
1463.BR gettid (2))
1464to the
b2c3e720 1465.I cgroup.threads
c8902e25
MK
1466files in different cgroups inside the subtree.
1467The threads of a process must all reside in the same threaded subtree.
1468.PP
d84e558e
MK
1469As with writing to
1470.IR cgroup.procs ,
1471some containment rules apply when writing to the
b2c3e720 1472.I cgroup.threads
d84e558e
MK
1473file:
1474.IP * 3
1475The writer must have write permission on the
.I cgroup.threads
1477file in the destination cgroup.
1478.IP *
1479The writer must have write permission on the
1480.I cgroup.procs
1481file in the common ancestor of the source and destination cgroups.
1482(In some cases,
1483the common ancestor may be the source or destination cgroup itself.)
1484.IP *
1485The source and destination cgroups must be in the same threaded subtree.
1486(Outside a threaded subtree, an attempt to move a thread by writing
1487its thread ID to the
1488.I cgroup.threads
1489file in a different
1490.I domain
1491cgroup fails with the error
1492.BR EOPNOTSUPP .)
4178f132
MK
1493.PP
1494The
1495.I cgroup.threads
c8902e25
MK
1496file is present in each cgroup (including
1497.I domain
1498cgroups) and can be read in order to discover the set of threads
1499that is present in the cgroup.
1500The set of thread IDs obtained when reading this file
1501is not guaranteed to be ordered or free of duplicates.
1502.PP
1503The
1504.I cgroup.procs
1505file in the threaded root shows the PIDs of all processes
1506that are members of the threaded subtree.
1507The
1508.I cgroup.procs
1509files in the other cgroups in the subtree are not readable.
1510.PP
1511Domain controllers can't be enabled in a threaded subtree;
1512no controller-interface files appear inside the cgroups underneath the
1513threaded root.
1514From the point of view of a domain controller,
1515threaded subtrees are invisible:
1516a multithreaded process inside a threaded subtree appears to a domain
1517controller as a process that resides in the threaded root cgroup.
1518.PP
1519Within a threaded subtree, the "no internal processes" rule does not apply:
a cgroup can both contain member processes (or threads)
1521and exercise controllers on child cgroups.
1522.\"
1523.SS Rules for writing to cgroup.type and creating threaded subtrees
1524A number of rules apply when writing to the
1525.I cgroup.type
1526file:
1527.IP * 3
1528Only the string
1529.IR """threaded"""
1530may be written.
1531In other words, the only explicit transition that is possible is to convert a
1532.I domain
1533cgroup to type
1534.IR threaded .
1535.IP *
6c9aa5ad 1536The effect of writing
c8902e25 1537.IR """threaded"""
6c9aa5ad
MK
1538depends on the current value in
1539.IR cgroup.type ,
1540as follows:
c8902e25
MK
1541.RS
1542.IP \(bu 3
6c9aa5ad
MK
1543.IR domain
1544or
1545.IR "domain threaded" :
1546start the creation of a threaded subtree
1547(whose root is the parent of this cgroup) via
c8902e25
MK
1548the first of the pathways described above;
1549.IP \(bu
6c9aa5ad 1550.IR "domain\ invalid" :
4644794c 1551convert this cgroup (which is inside a threaded subtree) to a usable (i.e.,
c8902e25
MK
1552.IR threaded )
1553state;
1554.IP \(bu
6c9aa5ad
MK
1555.IR threaded :
1556no effect (a "no-op").
c8902e25
MK
1557.RE
1558.IP *
1559We can't write to a
1560.I cgroup.type
1561file if the parent's type is
1562.IR "domain invalid" .
1563In other words, the cgroups of a threaded subtree must be converted to the
1564.I threaded
1565state in a top-down manner.
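.PP
The top-down conversion can be sketched in shell as follows.
This is only an illustration:
the function name is hypothetical,
and the first argument is assumed to be the directory of the threaded root.
The sketch relies on
.BR find (1)
listing each directory before its contents,
so that a parent is converted before its children.

```shell
# Hypothetical helper: convert every "domain invalid" cgroup below
# the threaded root $1 to type "threaded", parents first.
convert_invalid_to_threaded() {
    root=$1
    find "$root" -type d | while IFS= read -r cg; do
        # Leave the threaded root itself alone.
        [ "$cg" = "$root" ] && continue
        [ -f "$cg/cgroup.type" ] || continue
        if [ "$(cat "$cg/cgroup.type")" = "domain invalid" ]; then
            echo threaded > "$cg/cgroup.type"
        fi
    done
}
```

Cgroups that already have the type
.I threaded
are skipped, although writing to them would have no effect in any case.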
1566.PP
00c27092 1567There are also some constraints that must be satisfied
c8902e25
MK
1568in order to create a threaded subtree rooted at the cgroup
1569.IR x :
1570.IP * 3
1571There can be no member processes in the descendant cgroups of
1572.IR x .
1573(The cgroup
1574.I x
1575can itself have member processes.)
1576.IP *
1577No domain controllers may be enabled in
1578.IR x 's
1579.IR cgroup.subtree_control
1580file.
c8902e25
MK
1581.PP
1582If any of the above constraints is violated, then an attempt to write
1583.IR """threaded"""
1584to a
1585.IR cgroup.type
1586file fails with the error
1587.BR ENOTSUP .
1588.\"
1589.SS The """domain threaded""" cgroup type
1590According to the pathways described above,
1591the type of a cgroup can change to
1592.IR "domain threaded"
1593in either of the following cases:
1594.IP * 3
1595The string
1596.IR """threaded"""
1597is written to a child cgroup.
1598.IP *
1599A threaded controller is enabled inside the cgroup and
1600a process is made a member of the cgroup.
1601.PP
1602A
1603.IR "domain threaded"
1604cgroup,
1605.IR x ,
1606can revert to the type
1607.IR domain
1608if the above conditions no longer hold true\(emthat is, if all
1609.I threaded
1610child cgroups of
1611.I x
1612are removed and either
1613.I x
1614no longer has threaded controllers enabled or
1615no longer has member processes.
1616.PP
1617When a
1618.IR "domain threaded"
1619cgroup
1620.IR x
1621reverts to the type
1622.IR domain :
1623.IP * 3
1624All
1625.IR "domain invalid"
1626descendants of
1627.I x
1628that are not in lower-level threaded subtrees revert to the type
1629.IR domain .
1630.IP *
1631The root cgroups in any lower-level threaded subtrees revert to the type
1632.IR "domain threaded" .
1633.\"
1634.SS Exceptions for the root cgroup
1635The root cgroup of the v2 hierarchy is treated exceptionally:
1636it can be the parent of both
1637.I domain
1638and
1639.I threaded
1640cgroups.
1641If the string
1642.I """threaded"""
1643is written to the
1644.I cgroup.type
1645file of one of the children of the root cgroup, then
1646.IP * 3
1647The type of that cgroup becomes
1648.IR threaded .
1649.IP *
1650The type of any descendants of that cgroup that
1651are not part of lower-level threaded subtrees changes to
1652.IR "domain invalid" .
1653.PP
1654Note that in this case, there is no cgroup whose type becomes
1655.IR "domain threaded" .
1656(Notionally, the root cgroup can be considered as the threaded root
1657for the cgroup whose type was changed to
1658.IR threaded .)
1659.PP
1660The aim of this exceptional treatment for the root cgroup is to
1661allow a threaded cgroup that employs the
1662.I cpu
1663controller to be placed as high as possible in the hierarchy,
1664so as to minimize the (small) cost of traversing the cgroup hierarchy.
1665.\"
edc90967 1666.SS The cgroups v2 """cpu""" controller and realtime threads
aa2c3623 1667As at Linux 4.19, the cgroups v2
c8902e25 1668.I cpu
0bef253e
MK
1669controller does not support control of realtime threads
1670(specifically threads scheduled under any of the policies
1671.BR SCHED_FIFO ,
.BR SCHED_RR ,
or
.BR SCHED_DEADLINE ;
1675see
1676.BR sched (7)).
1677Therefore, the
1678.I cpu
1679controller can be enabled in the root cgroup only
c8902e25 1680if all realtime threads are in the root cgroup.
edc90967 1681(If there are realtime threads in nonroot cgroups, then a
c8902e25
MK
1682.BR write (2)
1683of the string
1684.IR """+cpu"""
1685to the
1686.I cgroup.subtree_control
1687file fails with the error
c2df7694 1688.BR EINVAL .)
17094a28
MK
1689.PP
1690On some systems,
c8902e25 1691.BR systemd (1)
edc90967 1692places certain realtime threads in nonroot cgroups in the v2 hierarchy.
c8902e25 1693On such systems,
edc90967 1694these threads must first be moved to the root cgroup before the
c8902e25
MK
1695.I cpu
1696controller can be enabled.
.\"
.SH ERRORS
The following errors can occur for
.BR mount (2):
.TP
.B EBUSY
An attempt to mount a cgroup version 1 filesystem specified neither the
.I name=
option (to mount a named hierarchy) nor a controller name (or
.IR all ).
.SH NOTES
A child process created via
.BR fork (2)
inherits its parent's cgroup memberships.
A process's cgroup memberships are preserved across
.BR execve (2).
.PP
The
.BR clone3 (2)
.B CLONE_INTO_CGROUP
flag can be used to create a child process that begins its life in
a different version 2 cgroup from the parent process.
.\"
.SS /proc files
.TP
.IR /proc/cgroups " (since Linux 2.6.24)"
This file contains information about the controllers
that are compiled into the kernel.
An example of the contents of this file (reformatted for readability)
is the following:
.IP
.in +4n
.EX
#subsys_name    hierarchy      num_cgroups    enabled
cpuset          4              1              1
cpu             8              1              1
cpuacct         8              1              1
blkio           6              1              1
memory          3              1              1
devices         10             84             1
freezer         7              1              1
net_cls         9              1              1
perf_event      5              1              1
net_prio        9              1              1
hugetlb         0              1              0
pids            2              1              1
.EE
.in
.IP
The fields in this file are, from left to right:
.RS
.IP 1. 3
The name of the controller.
.IP 2.
The unique ID of the cgroup hierarchy on which this controller is mounted.
If multiple cgroups v1 controllers are bound to the same hierarchy,
then each will show the same hierarchy ID in this field.
The value in this field will be 0 if:
.RS 5
.IP a) 3
the controller is not mounted on a cgroups v1 hierarchy;
.IP b)
the controller is bound to the cgroups v2 single unified hierarchy; or
.IP c)
the controller is disabled (see below).
.RE
.IP 3.
The number of control groups in this hierarchy using this controller.
.IP 4.
This field contains the value 1 if this controller is enabled,
or 0 if it has been disabled (via the
.IR cgroup_disable
kernel command-line boot parameter).
.RE
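.IP
The four fields described above can be read with a few lines of code.
The following Python sketch is illustrative only;
the names
.IR Controller ,
.IR parse_proc_cgroups (),
and
.IR shared_hierarchies ()
are hypothetical,
and the code assumes that fields are separated by white space
and that the header line begins with "#".
.IP
.in +4n
.EX
from collections import namedtuple

Controller = namedtuple("Controller", "name hierarchy num_cgroups enabled")


def parse_proc_cgroups(text):
    """Parse the text of /proc/cgroups into a list of Controller records."""
    controllers = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue    # skip blank lines and the header line
        name, hierarchy, num_cgroups, enabled = line.split()
        controllers.append(
            Controller(name, int(hierarchy), int(num_cgroups),
                       enabled == "1"))
    return controllers


def shared_hierarchies(controllers):
    """Map each nonzero hierarchy ID to the controllers bound to it."""
    groups = {}
    for c in controllers:
        if c.hierarchy != 0:
            groups.setdefault(c.hierarchy, []).append(c.name)
    return groups
.EE
.in
.IP
Applied to the example contents shown above,
.IR shared_hierarchies ()
would group
.I cpu
with
.I cpuacct
under hierarchy ID 8, and
.I net_cls
with
.I net_prio
under hierarchy ID 9.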
.TP
.IR /proc/[pid]/cgroup " (since Linux 2.6.24)"
This file describes control groups to which the process
with the corresponding PID belongs.
The displayed information differs for
cgroups version 1 and version 2 hierarchies.
.IP
For each cgroup hierarchy of which the process is a member,
there is one entry containing three colon-separated fields:
.IP
.in +4n
.EX
hierarchy-ID:controller-list:cgroup-path
.EE
.in
.IP
For example:
.IP
.in +4n
.EX
5:cpuacct,cpu,cpuset:/daemons
.EE
.in
.IP
The colon-separated fields are, from left to right:
.RS
.IP 1. 3
For cgroups version 1 hierarchies,
this field contains a unique hierarchy ID number
that can be matched to a hierarchy ID in
.IR /proc/cgroups .
For the cgroups version 2 hierarchy, this field contains the value 0.
.IP 2.
For cgroups version 1 hierarchies,
this field contains a comma-separated list of the controllers
bound to the hierarchy.
For the cgroups version 2 hierarchy, this field is empty.
.IP 3.
This field contains the pathname of the control group in the hierarchy
to which the process belongs.
This pathname is relative to the mount point of the hierarchy.
.RE
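.IP
The entry format above can be split into its fields as follows.
This Python sketch is illustrative only;
the function names are hypothetical,
and each entry is assumed to be passed without its trailing newline.
.IP
.in +4n
.EX
def parse_cgroup_line(line):
    """Split one /proc/[pid]/cgroup entry into its three fields."""
    # Split at the first two colons only, so that any colon in the
    # cgroup pathname is left in the third field.
    hierarchy_id, controller_list, path = line.split(":", 2)
    controllers = controller_list.split(",") if controller_list else []
    return int(hierarchy_id), controllers, path


def is_v2_entry(entry):
    """True if a parsed entry describes the cgroups version 2 hierarchy."""
    hierarchy_id, controllers, _ = entry
    return hierarchy_id == 0 and not controllers
.EE
.in
.IP
For the example entry shown above,
.IR parse_cgroup_line ()
yields the hierarchy ID 5,
the controllers
.IR cpuacct ,
.IR cpu ,
and
.IR cpuset ,
and the pathname
.IR /daemons .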
.\"
.SS /sys/kernel/cgroup files
.TP
.IR /sys/kernel/cgroup/delegate " (since Linux 4.15)"
.\" commit 01ee6cfb1483fe57c9cbd8e73817dfbf9bacffd3
This file exports a list of the cgroups v2 files
(one per line) that are delegatable
(i.e., whose ownership should be changed to the user ID of the delegatee).
In the future, the set of delegatable files may change or grow,
and this file provides a way for the kernel to inform
user-space applications of which files must be delegated.
As at Linux 4.15, one sees the following when inspecting this file:
.IP
.in +4n
.EX
$ \fBcat /sys/kernel/cgroup/delegate\fP
cgroup.procs
cgroup.subtree_control
cgroup.threads
.EE
.in
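.IP
A program that performs delegation might use this list
instead of a fixed set of filenames.
The following Python sketch is illustrative only;
the function names and the
.I cgroup_dir
argument are hypothetical,
and the caller is assumed to have the privilege needed for
.BR chown (2).
.IP
.in +4n
.EX
import os


def delegatable_paths(cgroup_dir, delegate_text):
    """Return the pathnames in cgroup_dir named by the delegate list."""
    return [os.path.join(cgroup_dir, name)
            for name in delegate_text.split()]


def delegate_files(cgroup_dir, uid, gid,
                   delegate_file="/sys/kernel/cgroup/delegate"):
    """Change ownership of the delegatable files in cgroup_dir."""
    with open(delegate_file) as f:
        paths = delegatable_paths(cgroup_dir, f.read())
    for path in paths:
        if os.path.exists(path):
            os.chown(path, uid, gid)
.EE
.in
.IP
Reading the list at run time means that the program
picks up any files that later kernels add to it.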
.TP
.IR /sys/kernel/cgroup/features " (since Linux 4.15)"
.\" commit 5f2e673405b742be64e7c3604ed4ed3ac14f35ce
Over time, the set of cgroups v2 features that are provided by the
kernel may change or grow,
or some features may not be enabled by default.
This file provides a way for user-space applications to discover what
features the running kernel supports and has enabled.
Features are listed one per line:
.IP
.in +4n
.EX
$ \fBcat /sys/kernel/cgroup/features\fP
nsdelegate
memory_localevents
.EE
.in
.IP
The entries that can appear in this file are:
.RS
.TP
.IR memory_localevents " (since Linux 5.2)"
The kernel supports the
.I memory_localevents
mount option.
.TP
.IR nsdelegate " (since Linux 4.15)"
The kernel supports the
.I nsdelegate
mount option.
.RE
.SH SEE ALSO
.BR prlimit (1),
.BR systemd (1),
.BR systemd-cgls (1),
.BR systemd-cgtop (1),
.BR clone (2),
.BR ioprio_set (2),
.BR perf_event_open (2),
.BR setrlimit (2),
.BR cgroup_namespaces (7),
.BR cpuset (7),
.BR namespaces (7),
.BR sched (7),
.BR user_namespaces (7)
.PP
The kernel source file
.IR Documentation/admin-guide/cgroup-v2.rst .