.\" Copyright (C) 2015 Serge Hallyn <serge@hallyn.com>
.\" and Copyright (C) 2016, 2017 Michael Kerrisk <mtk.manpages@gmail.com>
.\"
.\" %%%LICENSE_START(VERBATIM)
.\" Permission is granted to make and distribute verbatim copies of this
.\" manual provided the copyright notice and this permission notice are
.\" preserved on all copies.
.\"
.\" Permission is granted to copy and distribute modified versions of this
.\" manual under the conditions for verbatim copying, provided that the
.\" entire resulting derived work is distributed under the terms of a
.\" permission notice identical to this one.
.\"
.\" Since the Linux kernel and libraries are constantly changing, this
.\" manual page may be incorrect or out-of-date. The author(s) assume no
.\" responsibility for errors or omissions, or for damages resulting from
.\" the use of the information contained herein. The author(s) may not
.\" have taken the same level of care in the production of this manual,
.\" which is licensed free of charge, as they might when working
.\" professionally.
.\"
.\" Formatted or processed versions of this manual, if unaccompanied by
.\" the source, must acknowledge the copyright and authors of this work.
.\" %%%LICENSE_END
.\"
.TH CGROUPS 7 2018-02-02 "Linux" "Linux Programmer's Manual"
.SH NAME
cgroups \- Linux control groups
.SH DESCRIPTION
Control groups, usually referred to as cgroups,
are a Linux kernel feature which allows processes to
be organized into hierarchical groups whose usage of
various types of resources can then be limited and monitored.
The kernel's cgroup interface is provided through
a pseudo-filesystem called cgroupfs.
Grouping is implemented in the core cgroup kernel code,
while resource tracking and limits are implemented in
a set of per-resource-type subsystems (memory, CPU, and so on).
.\"
.SS Terminology
A
.I cgroup
is a collection of processes that are bound to a set of
limits or parameters defined via the cgroup filesystem.
.PP
A
.I subsystem
is a kernel component that modifies the behavior of
the processes in a cgroup.
Various subsystems have been implemented, making it possible to do things
such as limiting the amount of CPU time and memory available to a cgroup,
accounting for the CPU time used by a cgroup,
and freezing and resuming execution of the processes in a cgroup.
Subsystems are sometimes also known as
.IR "resource controllers"
(or simply, controllers).
.PP
The cgroups for a controller are arranged in a
.IR hierarchy .
This hierarchy is defined by creating, removing, and
renaming subdirectories within the cgroup filesystem.
At each level of the hierarchy, attributes (e.g., limits) can be defined.
The limits, control, and accounting provided by cgroups generally have
effect throughout the subhierarchy underneath the cgroup where the
attributes are defined.
Thus, for example, the limits placed on
a cgroup at a higher level in the hierarchy cannot be exceeded
by descendant cgroups.
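.PP
As an illustration of how a limit set higher in the hierarchy constrains
descendants, the following Python sketch computes the effective limit of a
cgroup as the minimum of the limits set on the cgroup and on its ancestors.
The function name and the dictionary layout are hypothetical;
they model the rule stated above and are not a kernel interface.

```python
from pathlib import PurePosixPath


def effective_limit(limits, cgroup):
    """Return the tightest limit that applies to `cgroup`.

    `limits` maps cgroup paths (for example "/user/joe") to numeric
    limits; a cgroup without an entry sets no limit of its own.
    Returns None if neither the cgroup nor any ancestor sets a limit.
    """
    path = PurePosixPath(cgroup)
    # The cgroup itself, followed by each ancestor up to the root "/".
    chain = [path, *path.parents]
    found = [limits[str(p)] for p in chain if str(p) in limits]
    return min(found) if found else None
```

.PP
With a limit of 100 on /user and 500 on /user/joe,
the sketch reports 100 for /user/joe/1.session,
because the looser limit lower down cannot override the ancestor's limit.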
.\"
.SS Cgroups version 1 and version 2
The initial release of the cgroups implementation was in Linux 2.6.24.
Over time, various cgroup controllers have been added
to allow the management of various types of resources.
However, the development of these controllers was largely uncoordinated,
with the result that many inconsistencies arose between controllers
and management of the cgroup hierarchies became rather complex.
(A longer description of these problems can be found in
the kernel source file
.IR Documentation/cgroup\-v2.txt .)
.PP
Because of the problems with the initial cgroups implementation
(cgroups version 1),
starting in Linux 3.10, work began on a new,
orthogonal implementation to remedy these problems.
Initially marked experimental, and hidden behind the
.I "\-o\ __DEVEL__sane_behavior"
mount option, the new version (cgroups version 2)
was eventually made official with the release of Linux 4.5.
Differences between the two versions are described in the text below.
.PP
Although cgroups v2 is intended as a replacement for cgroups v1,
the older system continues to exist
(and for compatibility reasons is unlikely to be removed).
Currently, cgroups v2 implements only a subset of the controllers
available in cgroups v1.
The two systems are implemented so that both v1 controllers and
v2 controllers can be mounted on the same system.
Thus, for example, it is possible to use those controllers
that are supported under version 2,
while also using version 1 controllers
where version 2 does not yet support those controllers.
The only restriction here is that a controller can't be simultaneously
employed in both a cgroups v1 hierarchy and in the cgroups v2 hierarchy.
.\"
.SH CGROUPS VERSION 1
Under cgroups v1, each controller may be mounted against a separate
cgroup filesystem that provides its own hierarchical organization of the
processes on the system.
It is also possible to comount multiple (or even all) cgroups v1 controllers
against the same cgroup filesystem, meaning that the comounted controllers
manage the same hierarchical organization of processes.
.PP
For each mounted hierarchy,
the directory tree mirrors the control group hierarchy.
Each control group is represented by a directory, with each of its child
control groups represented as a child directory.
For instance,
.IR /user/joe/1.session
represents control group
.IR 1.session ,
which is a child of cgroup
.IR joe ,
which is a child of
.IR /user .
Under each cgroup directory is a set of files which can be read or
written to, reflecting resource limits and a few general cgroup
properties.
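.PP
Because child cgroups are simply child directories,
they can be enumerated with ordinary directory operations.
The following Python sketch lists the child cgroups of a cgroup directory;
the function name is hypothetical, and the sketch assumes only that
subdirectories are child cgroups and that regular files are interface files.

```python
import os


def child_cgroups(cgroup_dir):
    """Return the sorted names of the child cgroups of `cgroup_dir`.

    Child cgroups appear as subdirectories; the regular files in the
    directory are the cgroup's interface files and are skipped.
    """
    with os.scandir(cgroup_dir) as entries:
        return sorted(e.name for e in entries
                      if e.is_dir(follow_symlinks=False))
```

.PP
Applied to the directory for
.IR /user
in the example above, the sketch would return a list containing
.IR joe .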
.\"
.SS Tasks (threads) versus processes
In cgroups v1, a distinction is drawn between
.I processes
and
.IR tasks .
In this view, a process can consist of multiple tasks
(more commonly called threads, from a user-space perspective,
and called such in the remainder of this man page).
In cgroups v1, it is possible to independently manipulate
the cgroup memberships of the threads in a process.
.PP
The cgroups v1 ability to split threads across different cgroups
caused problems in some cases.
For example, it made no sense for the
.I memory
controller,
since all of the threads of a process share a single address space.
Because of these problems,
the ability to independently manipulate the cgroup memberships
of the threads in a process was removed in the initial cgroups v2
implementation, and subsequently restored in a more limited form
(see the discussion of "thread mode" below).
.\"
.SS Mounting v1 controllers
The use of cgroups requires a kernel built with the
.BR CONFIG_CGROUPS
option.
In addition, each of the v1 controllers has an associated
configuration option that must be set in order to employ that controller.
.PP
In order to use a v1 controller,
it must be mounted against a cgroup filesystem.
The usual place for such mounts is under a
.BR tmpfs (5)
filesystem mounted at
.IR /sys/fs/cgroup .
Thus, one might mount the
.I cpu
controller as follows:
.PP
.in +4n
.EX
mount \-t cgroup \-o cpu none /sys/fs/cgroup/cpu
.EE
.in
.PP
It is possible to comount multiple controllers against the same hierarchy.
For example, here the
.IR cpu
and
.IR cpuacct
controllers are comounted against a single hierarchy:
.PP
.in +4n
.EX
mount \-t cgroup \-o cpu,cpuacct none /sys/fs/cgroup/cpu,cpuacct
.EE
.in
.PP
Comounting controllers has the effect that a process is in the same cgroup for
all of the comounted controllers.
Separately mounting controllers allows a process to
be in cgroup
.I /foo1
for one controller while being in
.I /foo2/foo3
for another.
.PP
It is possible to comount all v1 controllers against the same hierarchy:
.PP
.in +4n
.EX
mount \-t cgroup \-o all cgroup /sys/fs/cgroup
.EE
.in
.PP
(One can achieve the same result by omitting
.IR "\-o all" ,
since it is the default if no controllers are explicitly specified.)
.PP
It is not possible to mount the same controller
against multiple cgroup hierarchies.
For example, it is not possible to mount both the
.I cpu
and
.I cpuacct
controllers against one hierarchy, and to mount the
.I cpu
controller alone against another hierarchy.
It is possible to create multiple mount points with exactly
the same set of comounted controllers.
However, in this case all that results is multiple mount points
providing a view of the same hierarchy.
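.PP
The mounting rule above can be checked before any
.BR mount (8)
command is issued.
The following Python sketch is a hypothetical helper, not part of any
cgroup tooling: it takes the controller sets of the hierarchies one intends
to mount and reports every controller that appears in two different sets.
Identical sets are accepted, since they are only additional views of
one hierarchy.

```python
def check_v1_mounts(hierarchies):
    """Check proposed v1 hierarchies against the mounting rule.

    Each element of `hierarchies` is an iterable of controller names
    that would be comounted.  Returns the sorted names of controllers
    that occur in two different sets (an empty list means no conflict).
    """
    owner = {}          # controller name -> the set it was first seen in
    conflicts = set()
    for controllers in hierarchies:
        group = frozenset(controllers)
        for name in group:
            # setdefault() records the first set and returns it afterwards.
            if owner.setdefault(name, group) != group:
                conflicts.add(name)
    return sorted(conflicts)
```

.PP
For the example in the text
(cpu and cpuacct together, plus cpu alone),
the sketch reports cpu as the conflicting controller.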
.PP
Note that on many systems, the v1 controllers are automatically mounted under
.IR /sys/fs/cgroup ;
in particular,
.BR systemd (1)
automatically creates such mount points.
.\"
.SS Unmounting v1 controllers
A mounted cgroup filesystem can be unmounted using the
.BR umount (8)
command, as in the following example:
.PP
.in +4n
.EX
umount /sys/fs/cgroup/pids
.EE
.in
.PP
.IR "But note well" :
a cgroup filesystem is unmounted only if it is not busy,
that is, it has no child cgroups.
If this is not the case, then the only effect of the
.BR umount (8)
is to make the mount invisible.
Thus, to ensure that the mount point is really removed,
one must first remove all child cgroups,
which in turn can be done only after all member processes
have been moved from those cgroups to the root cgroup.
.\"
.SS Cgroups version 1 controllers
Each of the cgroups version 1 controllers is governed
by a kernel configuration option (listed below).
Additionally, the availability of the cgroups feature is governed by the
.BR CONFIG_CGROUPS
kernel configuration option.
.TP
.IR cpu " (since Linux 2.6.24; " \fBCONFIG_CGROUP_SCHED\fP )
Cgroups can be guaranteed a minimum number of "CPU shares"
when a system is busy.
This does not limit a cgroup's CPU usage if the CPUs are not busy.
For further information, see
.IR Documentation/scheduler/sched-design-CFS.txt .
.IP
In Linux 3.2,
this controller was extended to provide CPU "bandwidth" control.
If the kernel is configured with
.BR CONFIG_CFS_BANDWIDTH ,
then within each scheduling period
(defined via a file in the cgroup directory), it is possible to define
an upper limit on the CPU time allocated to the processes in a cgroup.
This upper limit applies even if there is no other competition for the CPU.
Further information can be found in the kernel source file
.IR Documentation/scheduler/sched\-bwc.txt .
.TP
.IR cpuacct " (since Linux 2.6.24; " \fBCONFIG_CGROUP_CPUACCT\fP )
This provides accounting for CPU usage by groups of processes.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup\-v1/cpuacct.txt .
.TP
.IR cpuset " (since Linux 2.6.24; " \fBCONFIG_CPUSETS\fP )
This controller can be used to bind the processes in a cgroup to
a specified set of CPUs and NUMA nodes.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup\-v1/cpusets.txt .
.TP
.IR memory " (since Linux 2.6.25; " \fBCONFIG_MEMCG\fP )
The memory controller supports reporting and limiting of process memory, kernel
memory, and swap used by cgroups.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup\-v1/memory.txt .
.TP
.IR devices " (since Linux 2.6.26; " \fBCONFIG_CGROUP_DEVICE\fP )
This supports controlling which processes may create (mknod) devices as
well as open them for reading or writing.
The policies may be specified as whitelists and blacklists.
Hierarchy is enforced, so new rules must not
violate existing rules for the target or ancestor cgroups.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup-v1/devices.txt .
.TP
.IR freezer " (since Linux 2.6.28; " \fBCONFIG_CGROUP_FREEZER\fP )
The
.IR freezer
cgroup can suspend and restore (resume) all processes in a cgroup.
Freezing a cgroup
.I /A
also causes its children, for example, processes in
.IR /A/B ,
to be frozen.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup-v1/freezer-subsystem.txt .
.TP
.IR net_cls " (since Linux 2.6.29; " \fBCONFIG_CGROUP_NET_CLASSID\fP )
This places a classid, specified for the cgroup, on network packets
created by a cgroup.
These classids can then be used in firewall rules,
as well as used to shape traffic using
.BR tc (8).
This applies only to packets
leaving the cgroup, not to traffic arriving at the cgroup.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup-v1/net_cls.txt .
.TP
.IR blkio " (since Linux 2.6.33; " \fBCONFIG_BLK_CGROUP\fP )
The
.I blkio
cgroup controls and limits access to specified block devices by
applying IO control in the form of throttling and upper limits against leaf
nodes and intermediate nodes in the storage hierarchy.
.IP
Two policies are available.
The first is a proportional-weight time-based division
of disk implemented with CFQ.
This is in effect for leaf nodes using CFQ.
The second is a throttling policy which specifies
upper I/O rate limits on a device.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup-v1/blkio-controller.txt .
.TP
.IR perf_event " (since Linux 2.6.39; " \fBCONFIG_CGROUP_PERF\fP )
This controller allows
.I perf
monitoring of the set of processes grouped in a cgroup.
.IP
Further information can be found in the kernel source file
.IR tools/perf/Documentation/perf-record.txt .
.TP
.IR net_prio " (since Linux 3.3; " \fBCONFIG_CGROUP_NET_PRIO\fP )
This allows priorities to be specified, per network interface, for cgroups.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup-v1/net_prio.txt .
.TP
.IR hugetlb " (since Linux 3.5; " \fBCONFIG_CGROUP_HUGETLB\fP )
This supports limiting the use of huge pages by cgroups.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup-v1/hugetlb.txt .
.TP
.IR pids " (since Linux 4.3; " \fBCONFIG_CGROUP_PIDS\fP )
This controller permits limiting the number of processes that may be created
in a cgroup (and its descendants).
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup-v1/pids.txt .
.TP
.IR rdma " (since Linux 4.11; " \fBCONFIG_CGROUP_RDMA\fP )
The RDMA controller permits limiting the use of
RDMA/IB-specific resources per cgroup.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup-v1/rdma.txt .
.\"
.SS Creating cgroups and moving processes
A cgroup filesystem initially contains a single root cgroup, '/',
which all processes belong to.
A new cgroup is created by creating a directory in the cgroup filesystem:
.PP
.in +4n
.EX
mkdir /sys/fs/cgroup/cpu/cg1
.EE
.in
.PP
This creates a new empty cgroup.
.PP
A process may be moved to this cgroup by writing its PID into the cgroup's
.I cgroup.procs
file:
.PP
.in +4n
.EX
echo $$ > /sys/fs/cgroup/cpu/cg1/cgroup.procs
.EE
.in
.PP
Only one PID at a time should be written to this file.
.PP
Writing the value 0 to a
.IR cgroup.procs
file causes the writing process to be moved to the corresponding cgroup.
.PP
When writing a PID into the
.I cgroup.procs
file,
all threads in the process are moved into the new cgroup at once.
.PP
Within a hierarchy, a process can be a member of exactly one cgroup.
Writing a process's PID to a
.IR cgroup.procs
file automatically removes it from the cgroup of
which it was previously a member.
.PP
The
.I cgroup.procs
file can be read to obtain a list of the processes that are
members of a cgroup.
The returned list of PIDs is not guaranteed to be in order.
Nor is it guaranteed to be free of duplicates.
(For example, a PID may be recycled while reading from the list.)
.PP
In cgroups v1, an individual thread can be moved to
another cgroup by writing its thread ID
(i.e., the kernel thread ID returned by
.BR clone (2)
and
.BR gettid (2))
to the
.IR tasks
file in a cgroup directory.
This file can be read to discover the set of threads
that are members of the cgroup.
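.PP
The same steps can be carried out from a program.
The following Python sketch mirrors the shell commands above;
the three function names are hypothetical, and the sketch assumes only the
.I cgroup.procs
semantics described in this section
(one PID per write, 0 meaning the writer, unordered output with
possible duplicates).

```python
import os


def create_cgroup(parent_dir, name):
    """Create a new, empty cgroup as a subdirectory of `parent_dir`."""
    path = os.path.join(parent_dir, name)
    os.mkdir(path)
    return path


def move_process(cgroup_dir, pid=0):
    """Move process `pid` into the cgroup; 0 means the calling process.

    Exactly one PID is written per call, as the text requires.
    """
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
        f.write(str(pid))


def member_pids(cgroup_dir):
    """Return the member PIDs, sorted and with duplicates removed,
    since the kernel guarantees neither order nor uniqueness."""
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
        return sorted({int(line) for line in f if line.strip()})
```

.PP
On a real cgroup filesystem these calls normally require appropriate
privileges; a failed move surfaces as an
.B OSError
from the write.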
.\"
.SS Removing cgroups
To remove a cgroup,
it must first have no child cgroups and contain no (nonzombie) processes.
So long as that is the case, one can simply
remove the corresponding directory pathname.
Note that files in a cgroup directory cannot and need not be
removed.
.\"
.SS Cgroups v1 release notification
Two files can be used to determine whether the kernel provides
notifications when a cgroup becomes empty.
A cgroup is considered to be empty when it contains no child
cgroups and no member processes.
.PP
A special file in the root directory of each cgroup hierarchy,
.IR release_agent ,
can be used to register the pathname of a program that may be invoked when
a cgroup in the hierarchy becomes empty.
The pathname of the newly empty cgroup (relative to the cgroup mount point)
is provided as the sole command-line argument when the
.IR release_agent
program is invoked.
The
.IR release_agent
program might remove the cgroup directory,
or perhaps repopulate it with a process.
.PP
The default value of the
.IR release_agent
file is empty, meaning that no release agent is invoked.
.PP
The content of the
.I release_agent
file can also be specified via a mount option when the
cgroup filesystem is mounted:
.PP
.in +4n
.EX
mount -o release_agent=pathname ...
.EE
.in
.PP
Whether or not the
.IR release_agent
program is invoked when a particular cgroup becomes empty is determined
by the value in the
.IR notify_on_release
file in the corresponding cgroup directory.
If this file contains the value 0, then the
.IR release_agent
program is not invoked.
If it contains the value 1, the
.IR release_agent
program is invoked.
The default value for this file in the root cgroup is 0.
At the time when a new cgroup is created,
the value in this file is inherited from the corresponding file
in the parent cgroup.
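.PP
A release agent that simply removes the newly empty cgroup could look
roughly like the following Python sketch.
The function name and the
.I mount_point
parameter are assumptions made for illustration:
the kernel supplies only the cgroup pathname relative to the mount point,
so the agent has to know where the hierarchy is mounted.

```python
import os


def release_agent(argv, mount_point):
    """Remove the newly empty cgroup named by the sole argument.

    `argv` is the agent's argument vector; argv[1] is the cgroup
    pathname relative to the cgroup mount point (e.g. "/cg1/cg2").
    Returns a process exit status: 0 on success, 1 if the directory
    could not be removed, 2 on a usage error.
    """
    if len(argv) != 2:
        return 2
    target = os.path.join(mount_point, argv[1].lstrip("/"))
    try:
        # rmdir fails if the cgroup has gained members or children
        # since the notification was generated.
        os.rmdir(target)
    except OSError:
        return 1
    return 0
```

.PP
A script built around this function would pass
.I sys.argv
and its own mount point, and use the result as its exit status.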
.\"
.SS Cgroup v1 named hierarchies
In cgroups v1,
it is possible to mount a cgroup hierarchy that has no attached controllers:
.PP
.in +4n
.EX
mount -t cgroup -o none,name=somename none /some/mount/point
.EE
.in
.PP
Multiple instances of such hierarchies can be mounted;
each hierarchy must have a unique name.
The only purpose of such hierarchies is to track processes.
(See the discussion of release notification above.)
An example of this is the
.I name=systemd
cgroup hierarchy that is used by
.BR systemd (1)
to track services and user sessions.
.PP
Since Linux 5.0, the
.I cgroup_no_v1
kernel boot option (described below) can be used to disable cgroup v1
named hierarchies, by specifying
.IR cgroup_no_v1=named .
.\"
.SH CGROUPS VERSION 2
In cgroups v2,
all mounted controllers reside in a single unified hierarchy.
While (different) controllers may be simultaneously
mounted under the v1 and v2 hierarchies,
it is not possible to mount the same controller simultaneously
under both the v1 and the v2 hierarchies.
.PP
The new behaviors in cgroups v2 are summarized here,
and in some cases elaborated in the following subsections.
.IP 1. 3
Cgroups v2 provides a unified hierarchy against
which all controllers are mounted.
.IP 2.
"Internal" processes are not permitted.
With the exception of the root cgroup, processes may reside
only in leaf nodes (cgroups that do not themselves contain child cgroups).
The details are somewhat more subtle than this, and are described below.
.IP 3.
Active cgroups must be specified via the files
.IR cgroup.controllers
and
.IR cgroup.subtree_control .
.IP 4.
The
.I tasks
file has been removed.
In addition, the
.I cgroup.clone_children
file that is employed by the
.I cpuset
controller has been removed.
.IP 5.
An improved mechanism for notification of empty cgroups is provided by the
.IR cgroup.events
file.
.PP
For more changes, see the
.I Documentation/cgroup-v2.txt
file in the kernel source.
.PP
Some of the new behaviors listed above saw subsequent modification with
the addition in Linux 4.14 of "thread mode" (described below).
.\"
.SS Cgroups v2 unified hierarchy
In cgroups v1, the ability to mount different controllers
against different hierarchies was intended to allow great flexibility
for application design.
In practice, though, the flexibility turned out to be less useful than expected,
and in many cases added complexity.
Therefore, in cgroups v2,
all available controllers are mounted against a single hierarchy.
The available controllers are automatically mounted,
meaning that it is not necessary (or possible) to specify the controllers
when mounting the cgroup v2 filesystem using a command such as the following:
.PP
.in +4n
.EX
mount -t cgroup2 none /mnt/cgroup2
.EE
.in
.PP
A cgroup v2 controller is available only if it is not currently in use
via a mount against a cgroup v1 hierarchy.
Or, to put things another way, it is not possible to employ
the same controller against both a v1 hierarchy and the unified v2 hierarchy.
This means that it may be necessary first to unmount a v1 controller
(as described above) before that controller is available in v2.
Since
.BR systemd (1)
makes heavy use of some v1 controllers by default,
it can in some cases be simpler to boot the system with
selected v1 controllers disabled.
To do this, specify the
.IR cgroup_no_v1=list
option on the kernel boot command line;
.I list
is a comma-separated list of the names of the controllers to disable,
or the word
.I all
to disable all v1 controllers.
(This situation is correctly handled by
.BR systemd (1),
which falls back to operating without the specified controllers.)
.PP
Note that on many modern systems,
.BR systemd (1)
automatically mounts the
.I cgroup2
filesystem at
.I /sys/fs/cgroup/unified
during the boot process.
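.PP
To see which v1 controllers were disabled at boot,
a program can inspect the kernel command line.
The following Python sketch is a hypothetical helper that extracts the
.I cgroup_no_v1
list from command-line text such as the contents of
.IR /proc/cmdline ;
it assumes only the comma-separated syntax described above.

```python
def parse_cgroup_no_v1(cmdline):
    """Return the names given to cgroup_no_v1= in a kernel command line.

    `cmdline` is the boot command line as a string.  The result is a
    list of controller names, which may be ["all"]; it is empty if
    the option is absent.
    """
    names = []
    for word in cmdline.split():
        key, sep, value = word.partition("=")
        if key == "cgroup_no_v1" and sep:
            names.extend(n for n in value.split(",") if n)
    return names
```

.PP
For a command line containing
.IR cgroup_no_v1=memory,pids ,
the sketch returns the two names memory and pids.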
.\"
.SS Cgroups v2 controllers
The following controllers, documented in the kernel source file
.IR Documentation/cgroup-v2.txt ,
are supported in cgroups version 2:
.TP
.IR io " (since Linux 4.5)"
This is the successor of the version 1
.I blkio
controller.
.TP
.IR memory " (since Linux 4.5)"
This is the successor of the version 1
.I memory
controller.
.TP
.IR pids " (since Linux 4.5)"
This is the same as the version 1
.I pids
controller.
.TP
.IR perf_event " (since Linux 4.11)"
This is the same as the version 1
.I perf_event
controller.
.TP
.IR rdma " (since Linux 4.11)"
This is the same as the version 1
.I rdma
controller.
.TP
.IR cpu " (since Linux 4.15)"
This is the successor to the version 1
.I cpu
and
.I cpuacct
controllers.
.\"
.SS Cgroups v2 subtree control
Each cgroup in the v2 hierarchy contains the following two files:
.TP
.IR cgroup.controllers
This read-only file exposes a list of the controllers that are
.I available
in this cgroup.
The contents of this file match the contents of the
.I cgroup.subtree_control
file in the parent cgroup.
.TP
.I cgroup.subtree_control
This is a list of controllers that are
.IR active
.RI ( enabled )
in the cgroup.
The set of controllers in this file is a subset of the set in the
.IR cgroup.controllers
of this cgroup.
The set of active controllers is modified by writing strings to this file
containing space-delimited controller names,
each preceded by '+' (to enable a controller)
or '\-' (to disable a controller), as in the following example:
.IP
.in +4n
.EX
echo '+pids -memory' > x/y/cgroup.subtree_control
.EE
.in
.IP
An attempt to enable a controller
that is not present in
.I cgroup.controllers
leads to an
.B ENOENT
error when writing to the
.I cgroup.subtree_control
file.
.PP
Because the list of controllers in
.I cgroup.subtree_control
is a subset of those in
.IR cgroup.controllers ,
a controller that has been disabled in one cgroup in the hierarchy
can never be re-enabled in the subtree below that cgroup.
.PP
A cgroup's
.I cgroup.subtree_control
file determines the set of controllers that are exercised in the
.I child
cgroups.
When a controller (e.g.,
.IR pids )
is present in the
.I cgroup.subtree_control
file of a parent cgroup,
then the corresponding controller-interface files (e.g.,
.IR pids.max )
are automatically created in the children of that cgroup
and can be used to exert resource control in the child cgroups.
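.PP
The string written to
.I cgroup.subtree_control
can be assembled, and checked against
.IR cgroup.controllers ,
before the write is attempted.
The following Python sketch shows one way to do so.
Both function names are hypothetical, and the sketch assumes that
.I cgroup.controllers
holds whitespace-separated controller names.

```python
def subtree_control_request(enable=(), disable=()):
    """Build the string to write to cgroup.subtree_control.

    Each name in `enable` is prefixed with '+', each name in
    `disable` with '-', and the items are joined with spaces.
    """
    return " ".join([f"+{c}" for c in enable] + [f"-{c}" for c in disable])


def unavailable(enable, controllers_text):
    """Return the requested controllers missing from cgroup.controllers.

    `controllers_text` is the contents of the cgroup's
    cgroup.controllers file; trying to enable any returned name
    would fail with ENOENT.
    """
    available = set(controllers_text.split())
    return [c for c in enable if c not in available]
```

.PP
Enabling pids and disabling memory yields the same string as in the
.BR echo (1)
example above.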
.\"
.SS Cgroups v2 """no internal processes""" rule
Cgroups v2 enforces a so-called "no internal processes" rule.
Roughly speaking, this rule means that,
with the exception of the root cgroup, processes may reside
only in leaf nodes (cgroups that do not themselves contain child cgroups).
This avoids the need to decide how to partition resources between
processes which are members of cgroup A and processes in child cgroups of A.
.PP
For instance, if cgroup
.I /cg1/cg2
exists, then a process may reside in
.IR /cg1/cg2 ,
but not in
.IR /cg1 .
This is to avoid an ambiguity in cgroups v1
with respect to the delegation of resources between processes in
.I /cg1
and its child cgroups.
The recommended approach in cgroups v2 is to create a subdirectory called
.I leaf
for any nonleaf cgroup which should contain processes, but no child cgroups.
Thus, processes which previously would have gone into
.I /cg1
would now go into
.IR /cg1/leaf .
This has the advantage of making explicit
the relationship between processes in
.I /cg1/leaf
and
.IR /cg1 's
other children.
.PP
The "no internal processes" rule is in fact more subtle than stated above.
More precisely, the rule is that a (nonroot) cgroup can't both
(1) have member processes, and
(2) distribute resources into child cgroups\(emthat is, have a nonempty
.I cgroup.subtree_control
file.
Thus, it
.I is
possible for a cgroup to have both member processes and child cgroups,
but before controllers can be enabled for that cgroup,
the member processes must be moved out of the cgroup
(e.g., perhaps into the child cgroups).
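.PP
The precise form of the rule can be expressed as a simple check.
The following Python sketch is a hypothetical helper that reports whether
controllers could be enabled in a cgroup's
.I cgroup.subtree_control
file, judging only by whether
.I cgroup.procs
lists any member processes.
It does not model thread mode.

```python
import os


def can_enable_controllers(cgroup_dir, is_root=False):
    """Report whether controllers may be enabled for child cgroups.

    Models the rule above: a nonroot cgroup may not both have member
    processes and have a nonempty cgroup.subtree_control file.
    """
    if is_root:
        return True     # the root cgroup is exempt from the rule
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
        has_members = any(line.strip() for line in f)
    return not has_members
```

.PP
If the check fails, the member processes would first have to be moved
elsewhere, for example into a child cgroup such as
.IR leaf .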
.PP
With the Linux 4.14 addition of "thread mode" (described below),
the "no internal processes" rule has been relaxed in some cases.
.\"
.SS Cgroups v2 cgroup.events file
With cgroups v2, a new mechanism is provided to obtain notification
about when a cgroup becomes empty.
The cgroups v1
.IR release_agent
and
.IR notify_on_release
files are removed, and replaced by a new, more general-purpose file,
.IR cgroup.events .
This read-only file contains key-value pairs
(delimited by newline characters, with the key and value separated by spaces)
that identify events or state for a cgroup.
Currently, only one key appears in this file,
.IR populated ,
which has either the value 0,
meaning that the cgroup (and its descendants)
contain no (nonzombie) processes,
or 1, meaning that the cgroup (or any of its descendants)
contains member processes.
.PP
The
.IR cgroup.events
file can be monitored, in order to receive notification when a cgroup
transitions between the populated and unpopulated states (or vice versa).
When monitoring this file using
.BR inotify (7),
transitions generate
.BR IN_MODIFY
events, and when monitoring the file using
.BR poll (2),
transitions cause the bits
.B POLLPRI
and
.B POLLERR
to be returned in the
.IR revents
field.
.PP
The cgroups v2 release-notification mechanism provided by the
.I populated
field of the
.I cgroup.events
file offers at least two advantages over the cgroups v1
.IR release_agent
mechanism.
First, it allows for cheaper notification,
since a single process can monitor multiple
.IR cgroup.events
files.
By contrast, the cgroups v1 mechanism requires the creation
of a process for each notification.
Second, notification can be delegated to a process that lives inside
a container associated with the newly empty cgroup.
c91a9f8a 818.\"
5e071499
MK
819.SS Cgroups v2 cgroup.stat file
820.\" commit ec39225cca42c05ac36853d11d28f877fde5c42e
821Each cgroup in the v2 hierarchy contains a read-only
822.IR cgroup.stat
823file (first introduced in Linux 4.14)
824that consists of lines containing key-value pairs.
825The following keys currently appear in this file:
826.TP
827.I nr_descendants
828This is the total number of visible (i.e., living) descendant cgroups
829underneath this cgroup.
830.TP
831.I nr_dying_descendants
832This is the total number of dying descendant cgroups
833underneath this cgroup.
834A cgroup enters the dying state after being deleted.
835It remains in that state for an undefined period
836(which will depend on system load)
837while resources are freed before the cgroup is destroyed.
838Note that the presence of some cgroups in the dying state is normal,
839and is not indicative of any problem.
840.IP
841A process can't be made a member of a dying cgroup,
842and a dying cgroup can't be brought back to life.
843.\"
844.SS Limiting the number of descendant cgroups
845Each cgroup in the v2 hierarchy contains the following files,
846which can be used to view and set limits on the number
847of descendant cgroups under that cgroup:
848.TP
849.IR cgroup.max.depth " (since Linux 4.14)"
850.\" commit 1a926e0bbab83bae8207d05a533173425e0496d1
851This file defines a limit on the depth of nesting of descendant cgroups.
852A value of 0 in this file means that no descendant cgroups can be created.
853An attempt to create a descendant whose nesting level exceeds
854the limit fails
855.RI ( mkdir (2)
856fails with the error
857.BR EAGAIN ).
858.IP
859Writing the string
860.IR """max"""
861to this file means that no limit is imposed.
862The default value in this file is
863.IR """max""" .
864.TP
865.IR cgroup.max.descendants " (since Linux 4.14)"
866.\" commit 1a926e0bbab83bae8207d05a533173425e0496d1
867This file defines a limit on the number of live descendant cgroups that
868this cgroup may have.
869An attempt to create more descendants than allowed by the limit fails
870.RI ( mkdir (2)
871fails with the error
872.BR EAGAIN ).
873.IP
874Writing the string
875.IR """max"""
876to this file means that no limit is imposed.
877The default value in this file is
878.IR """max""" .
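.PP
Both limit files hold either a number or the string "max".
A program that reads or writes them could convert between the two forms
along the following lines; the helper names are hypothetical,
and using
.I None
to stand for "no limit" is a choice made for this sketch only.

```python
def parse_limit(text):
    """Convert the contents of a limit file to an int, or None for "max"."""
    value = text.strip()
    return None if value == "max" else int(value)


def format_limit(limit):
    """Convert an int (or None for no limit) to the string to be written."""
    return "max" if limit is None else str(limit)
```

For example, the value returned by
.I format_limit(0)
could be written to
.I cgroup.max.depth
to prevent the creation of any descendant cgroups.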
879.\"
4b1c2041 880.SH CGROUPS DELEGATION: DELEGATING A HIERARCHY TO A LESS PRIVILEGED USER
881In the context of cgroups,
882delegation means passing management of some subtree
51629a30 883of the cgroup hierarchy to a nonprivileged user.
884Cgroups v1 provides support for delegation based on file permissions
885in the cgroup hierarchy but with less strict containment rules than v2
886(as noted below).
887Cgroups v2 supports delegation with containment by explicit design.
888The focus of the discussion in this section is on delegation in cgroups v2,
889with some differences for cgroups v1 noted along the way.
890.PP
891Some terminology is required in order to describe delegation.
892A
893.I delegater
894is a privileged user (i.e., root) who owns a parent cgroup.
895A
896.I delegatee
897is a nonprivileged user who will be granted the permissions needed
898to manage some subhierarchy under that parent cgroup,
899known as the
900.IR "delegated subtree" .
901.PP
902To perform delegation,
903the delegater makes certain directories and files writable by the delegatee,
904typically by changing the ownership of the objects to be the user ID
905of the delegatee.
906Assuming that we want to delegate the hierarchy rooted at (say)
907.I /dlgt_grp
908and that there are not yet any child cgroups under that cgroup,
909the ownership of the following is changed to the user ID of the delegatee:
910.TP
0735069b 911.IR /dlgt_grp
912Changing the ownership of the root of the subtree means that any new
913cgroups created under the subtree (and the files they contain)
914will also be owned by the delegatee.
915.TP
0735069b 916.IR /dlgt_grp/cgroup.procs
f7286edc 917Changing the ownership of this file means that the delegatee
918can move processes into the root of the delegated subtree.
919.TP
4b1c2041 920.IR /dlgt_grp/cgroup.subtree_control " (cgroups v2 only)"
Changing the ownership of this file means that the delegatee
922can enable controllers (that are present in
0735069b 923.IR /dlgt_grp/cgroup.controllers )
4242dfbe 924in order to further redistribute resources at lower levels in the subtree.
925(As an alternative to changing the ownership of this file,
926the delegater might instead add selected controllers to this file.)
639b6c8c 927.TP
4b1c2041 928.IR /dlgt_grp/cgroup.threads " (cgroups v2 only)"
929Changing the ownership of this file is necessary if a threaded subtree
930is being delegated (see the description of "thread mode", below).
7b327dd5 931This permits the delegatee to write thread IDs to the file.
932(The ownership of this file can also be changed when delegating
933a domain subtree, but currently this serves no purpose,
934since, as described below, it is not possible to move a thread between
935domain cgroups by writing its thread ID to the
2b91ed4e 936.IR cgroup.threads
cd7f4c49 937file.)
938.IP
939In cgroups v1, the corresponding file that should instead be delegated is the
940.I tasks
941file.
942.PP
943The delegater should
944.I not
change the ownership of any of the controller interface files (e.g.,
946.IR pids.max ,
947.IR memory.high )
948in
.IR /dlgt_grp .
950Those files are used from the next level above the delegated subtree
951in order to distribute resources into the subtree,
952and the delegatee should not have permission to change
953the resources that are distributed into the delegated subtree.
954.PP
955See also the discussion of the
956.IR /sys/kernel/cgroup/delegate
4b1c2041 957file in NOTES for information about further delegatable files in cgroups v2.
668ef765 958.PP
959After the aforementioned steps have been performed,
960the delegatee can create child cgroups within the delegated subtree
961(the cgroup subdirectories and the files they contain
962will be owned by the delegatee)
963and move processes between cgroups in the subtree.
964If some controllers are present in
.IR /dlgt_grp/cgroup.subtree_control ,
4242dfbe 966or the ownership of that file was passed to the delegatee,
f7286edc 967the delegatee can also control the further redistribution
4242dfbe 968of the corresponding resources into the delegated subtree.
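.PP
The ownership changes performed by the delegater
might be sketched as follows.
The function name and its defaults are hypothetical;
the list of files is the one given in this section,
and the sketch deliberately leaves controller interface files
such as
.I pids.max
untouched.

```python
import os

# Files named in this section; cgroup.subtree_control and
# cgroup.threads exist only in cgroups v2.
DELEGATABLE_FILES = ("cgroup.procs", "cgroup.subtree_control",
                     "cgroup.threads")


def delegate_subtree(cgroup_dir, uid, gid=-1, files=DELEGATABLE_FILES):
    """Change the ownership of cgroup_dir and its delegatable files.

    A gid of -1 leaves the group ownership unchanged.
    Returns the list of paths whose ownership was changed.
    """
    os.chown(cgroup_dir, uid, gid)
    changed = [cgroup_dir]
    for name in files:
        path = os.path.join(cgroup_dir, name)
        if os.path.exists(path):  # e.g., absent in a v1 hierarchy
            os.chown(path, uid, gid)
            changed.append(path)
    return changed
```

In a cgroups v1 hierarchy, the caller would instead pass a
.I files
argument that includes the
.I tasks
file.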
27b086e9 969.\"
ed3f4f34 970.SS Cgroups v2 delegation: nsdelegate and cgroup namespaces
971Starting with Linux 4.13,
972.\" commit 5136f6365ce3eace5a926e10f16ed2a233db5ba9
4b1c2041 973there is a second way to perform cgroup delegation in the cgroups v2 hierarchy.
07361828 974This is done by mounting or remounting the cgroup v2 filesystem with the
ed3f4f34 975.I nsdelegate
976mount option.
977For example, if the cgroup v2 filesystem has already been mounted,
978we can remount it with the
979.I nsdelegate
980option as follows:
981.PP
982.in +4n
983.EX
mount -t cgroup2 -o remount,nsdelegate \\
    none /sys/fs/cgroup/unified
986.EE
987.in
988.\"
.\" Alternatively, we could boot the kernel with the options:
990.\"
991.\" cgroup_no_v1=all systemd.legacy_systemd_cgroup_controller
992.\"
993.\" The effect of the latter option is to prevent systemd from employing
994.\" its "hybrid" cgroup mode, where it tries to make use of cgroups v2.
ed3f4f34 995.PP
dc581e07 996The effect of this mount option is to cause cgroup namespaces
997to automatically become delegation boundaries.
998More specifically,
999the following restrictions apply for processes inside the cgroup namespace:
1000.IP * 3
446d1643 1001Writes to controller interface files in the root directory of the namespace
1002will fail with the error
1003.BR EPERM .
1004Processes inside the cgroup namespace can still write to delegatable
446d1643 1005files in the root directory of the cgroup namespace such as
1006.IR cgroup.procs
1007and
1008.IR cgroup.subtree_control ,
and can create a subhierarchy underneath the root directory.
1010.IP *
1011Attempts to migrate processes across the namespace boundary are denied
1012(with the error
1013.BR ENOENT ).
1014Processes inside the cgroup namespace can still
1015(subject to the containment rules described below)
1016move processes between cgroups
1017.I within
1018the subhierarchy under the namespace root.
1019.PP
1020The ability to define cgroup namespaces as delegation boundaries
1021makes cgroup namespaces more useful.
1022To understand why, suppose that we already have one cgroup hierarchy
1023that has been delegated to a nonprivileged user,
1024.IR cecilia ,
1025using the older delegation technique described above.
1026Suppose further that
1027.I cecilia
1028wanted to further delegate a subhierarchy
1029under the existing delegated hierarchy.
1030(For example, the delegated hierarchy might be associated with
1031an unprivileged container run by
1032.IR cecilia .)
1033Even if a cgroup namespace was employed,
1034because both hierarchies are owned by the unprivileged user
1035.IR cecilia ,
1036the following illegitimate actions could be performed:
1037.IP * 3
1038A process in the inferior hierarchy could change the
619dbe1c 1039resource controller settings in the root directory of that hierarchy.
1040(These resource controller settings are intended to allow control to
1041be exercised from the
1042.I parent
1043cgroup;
1044a process inside the child cgroup should not be allowed to modify them.)
1045.IP *
1046A process inside the inferior hierarchy could move processes
1047into and out of the inferior hierarchy if the cgroups in the
1048superior hierarchy were somehow visible.
1049.PP
1050Employing the
1051.I nsdelegate
1052mount option prevents both of these possibilities.
1053.PP
1054The
1055.I nsdelegate
1056mount option only has an effect when performed in
1057the initial mount namespace;
1058in other mount namespaces, the option is silently ignored.
1059.PP
1060.IR Note :
1061On some systems,
1062.BR systemd (1)
1063automatically mounts the cgroup v2 filesystem.
1064In order to experiment with the
1065.I nsdelegate
1066operation, it may be useful to boot the kernel with
1067the following command-line options:
1068.PP
1069.in +4n
1070.EX
cgroup_no_v1=all systemd.legacy_systemd_cgroup_controller
1072.EE
1073.in
1074.PP
1075These options cause the kernel to boot with the cgroups v1 controllers
1076disabled (meaning that the controllers are available in the v2 hierarchy),
and tell
1078.BR systemd (1)
1079not to mount and use the cgroup v2 hierarchy,
1080so that the v2 hierarchy can be manually mounted
1081with the desired options after boot-up.
ed3f4f34 1082.\"
4b1c2041 1083.SS Cgroup delegation containment rules
1084Some delegation
1085.IR "containment rules"
1086ensure that the delegatee can move processes between cgroups within the
1087delegated subtree,
1088but can't move processes from outside the delegated subtree into
1089the subtree or vice versa.
1090A nonprivileged process (i.e., the delegatee) can write the PID of
1091a "target" process into a
1092.IR cgroup.procs
1093file only if all of the following are true:
1094.IP * 3
1095The writer has write permission on the
1096.I cgroup.procs
1097file in the destination cgroup.
1098.IP *
1099The writer has write permission on the
1100.I cgroup.procs
396761ee 1101file in the nearest common ancestor of the source and destination cgroups.
1102Note that in some cases,
1103the nearest common ancestor may be the source or destination cgroup itself.
1104This requirement is not enforced for cgroups v1 hierarchies,
1105with the consequence that containment in v1 is less strict than in v2.
1106(For example, in cgroups v1 the user that owns two distinct
1107delegated subhierarchies can move a process between the hierarchies.)
28f612ea 1108.IP *
1109If the cgroup v2 filesystem was mounted with the
1110.I nsdelegate
7b574df5 1111option, the writer must be able to see the source and destination cgroups
1112from its cgroup namespace.
1113.IP *
4b1c2041 1114In cgroups v1:
1115the effective UID of the writer (i.e., the delegatee) matches the
1116real user ID or the saved set-user-ID of the target process.
1117Before Linux 4.11,
1118.\" commit 576dd464505fc53d501bb94569db76f220104d28
this requirement also applied in cgroups v2.
1120(This was a historical requirement inherited from cgroups v1
1121that was later deemed unnecessary,
1122since the other rules suffice for containment in cgroups v2.)
1123.PP
1124.IR Note :
1125one consequence of these delegation containment rules is that the
1126unprivileged delegatee can't place the first process into
1127the delegated subtree;
1128instead, the delegater must place the first process
1129(a process owned by the delegatee) into the delegated subtree.
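.PP
The two write-permission rules listed above can be approximated from
user space as in the following sketch.
The function names and the
.I mount_point
parameter are hypothetical.
The sketch covers only the permission rules, not the
.I nsdelegate
or UID rules, and
.I os.access()
tests permissions using the real (not effective) user ID,
so the kernel's own check remains authoritative.

```python
import os
import posixpath


def nearest_common_ancestor(src, dst):
    """Return the nearest common ancestor of two cgroup paths.

    The result may be src or dst itself.
    """
    return posixpath.commonpath([src, dst])


def may_migrate(mount_point, src, dst):
    """Check write permission on cgroup.procs in the destination cgroup
    and in the nearest common ancestor of src and dst."""
    def procs_writable(cgroup):
        path = posixpath.join(mount_point, cgroup.lstrip("/"),
                              "cgroup.procs")
        return os.access(path, os.W_OK)

    ancestor = nearest_common_ancestor(src, dst)
    return procs_writable(dst) and procs_writable(ancestor)
```

Here,
.I src
and
.I dst
are cgroup pathnames relative to the mount point, such as
.I /dlgt_grp/a
and
.IR /dlgt_grp/b .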
4242dfbe 1130.\"
75e83bc2 1131.SH CGROUPS VERSION 2 THREAD MODE
1132Among the restrictions imposed by cgroups v2 that were not present
1133in cgroups v1 are the following:
1134.IP * 3
1135.IR "No thread-granularity control" :
1136all of the threads of a process must be in the same cgroup.
1137.IP *
1138.IR "No internal processes" :
1139a cgroup can't both have member processes and
1140exercise controllers on child cgroups.
1141.PP
1142Both of these restrictions were added because
1143the lack of these restrictions had caused problems
1144in cgroups v1.
1145In particular, the cgroups v1 ability to allow thread-level granularity
1146for cgroup membership made no sense for some controllers.
1147(A notable example was the
1148.I memory
1149controller: since threads share an address space,
1150it made no sense to split threads across different
1151.I memory
1152cgroups.)
1153.PP
1154Notwithstanding the initial design decision in cgroups v2,
1155there were use cases for certain controllers, notably the
1156.IR cpu
1157controller,
1158for which thread-level granularity of control was meaningful and useful.
1159To accommodate such use cases, Linux 4.14 added
1160.I "thread mode"
1161for cgroups v2.
1162.PP
1163Thread mode allows the following:
1164.IP * 3
1165The creation of
1166.IR "threaded subtrees"
1167in which the threads of a process may
1168be spread across cgroups inside the tree.
1169(A threaded subtree may contain multiple multithreaded processes.)
1170.IP *
1171The concept of
.IR "threaded controllers" ,
1173which can distribute resources across the cgroups in a threaded subtree.
1174.IP *
A relaxation of the "no internal processes" rule,
1176so that, within a threaded subtree,
1177a cgroup can both contain member threads and
1178exercise resource control over child cgroups.
1179.PP
1180With the addition of thread mode,
1181each nonroot cgroup now contains a new file,
1182.IR cgroup.type ,
1183that exposes, and in some circumstances can be used to change,
1184the "type" of a cgroup.
1185This file contains one of the following type values:
1186.TP
1187.I "domain"
1188This is a normal v2 cgroup that provides process-granularity control.
1189If a process is a member of this cgroup,
1190then all threads of the process are (by definition) in the same cgroup.
1191This is the default cgroup type,
1192and provides the same behavior that was provided for
1193cgroups in the initial cgroups v2 implementation.
1194.TP
1195.I "threaded"
1196This cgroup is a member of a threaded subtree.
1197Threads can be added to this cgroup,
1198and controllers can be enabled for the cgroup.
1199.TP
1200.I "domain threaded"
1201This is a domain cgroup that serves as the root of a threaded subtree.
1202This cgroup type is also known as "threaded root".
1203.TP
1204.I "domain invalid"
1205This is a cgroup inside a threaded subtree
1206that is in an "invalid" state.
1207Processes can't be added to the cgroup,
1208and controllers can't be enabled for the cgroup.
1209The only thing that can be done with this cgroup (other than deleting it)
1210is to convert it to a
1211.IR threaded
1212cgroup by writing the string
1213.IR """threaded"""
1214to the
1215.I cgroup.type
1216file.
1217.IP
1218The rationale for the existence of this "interim" type
1219during the creation of a threaded subtree
1220(rather than the kernel simply immediately converting all cgroups
1221under the threaded root to the type
1222.IR threaded )
1223is to allow for
possible future extensions to the thread mode model.
1225.\"
1226.SS Threaded versus domain controllers
With the addition of thread mode,
1228cgroups v2 now distinguishes two types of resource controllers:
1229.IP * 3
1230.I Threaded
2cd9bbfa 1231.\" In the kernel source, look for ".threaded[ \t]*= true" in
218eadf4 1232.\" initializations of struct cgroup_subsys
1233controllers: these controllers support thread-granularity for
1234resource control and can be enabled inside threaded subtrees,
1235with the result that the corresponding controller-interface files
1236appear inside the cgroups in the threaded subtree.
aa2c3623 1237As at Linux 4.19, the following controllers are threaded:
1238.IR cpu ,
1239.IR perf_event ,
1240and
1241.IR pids .
1242.IP *
1243.I Domain
1244controllers: these controllers support only process granularity
1245for resource control.
1246From the perspective of a domain controller,
1247all threads of a process are always in the same cgroup.
1248Domain controllers can't be enabled inside a threaded subtree.
1249.\"
1250.SS Creating a threaded subtree
1251There are two pathways that lead to the creation of a threaded subtree.
1252The first pathway proceeds as follows:
1253.IP 1. 3
1254We write the string
1255.IR """threaded"""
1256to the
1257.I cgroup.type
1258file of a cgroup
1259.IR y/z
1260that currently has the type
1261.IR domain .
1262This has the following effects:
1263.RS
1264.IP * 3
1265The type of the cgroup
1266.IR y/z
1267becomes
1268.IR threaded .
1269.IP *
1270The type of the parent cgroup,
1271.IR y ,
1272becomes
1273.IR "domain threaded" .
1274The parent cgroup is the root of a threaded subtree
1275(also known as the "threaded root").
1276.IP *
1277All other cgroups under
1278.IR y
1279that were not already of type
1280.IR threaded
1281(because they were inside already existing threaded subtrees
1282under the new threaded root)
1283are converted to type
1284.IR "domain invalid" .
1285Any subsequently created cgroups under
1286.I y
1287will also have the type
1288.IR "domain invalid" .
1289.RE
1290.IP 2.
1291We write the string
1292.IR """threaded"""
1293to each of the
1294.IR "domain invalid"
1295cgroups under
1296.IR y ,
1297in order to convert them to the type
1298.IR threaded .
As a consequence of this step, all cgroups under the threaded root
1300now have the type
1301.IR threaded
1302and the threaded subtree is now fully usable.
1303The requirement to write
1304.IR """threaded"""
1305to each of these cgroups is somewhat cumbersome,
1306but allows for possible future extensions to the thread-mode model.
1307.PP
1308The second way of creating a threaded subtree is as follows:
1309.IP 1. 3
1310In an existing cgroup,
1311.IR z ,
1312that currently has the type
1313.IR domain ,
1314we (1) enable one or more threaded controllers and
1315(2) make a process a member of
1316.IR z .
1317(These two steps can be done in either order.)
1318This has the following consequences:
1319.RS
1320.IP * 3
1321The type of
1322.I z
1323becomes
1324.IR "domain threaded" .
1325.IP *
1326All of the descendant cgroups of
.I z
that were not already of type
1329.IR threaded
1330are converted to type
1331.IR "domain invalid" .
1332.RE
1333.IP 2.
1334As before, we make the threaded subtree usable by writing the string
1335.IR """threaded"""
1336to each of the
1337.IR "domain invalid"
1338cgroups under
.IR z ,
1340in order to convert them to the type
1341.IR threaded .
1342.PP
1343One of the consequences of the above pathways to creating a threaded subtree
1344is that the threaded root cgroup can be a parent only to
1345.I threaded
1346(and
1347.IR "domain invalid" )
1348cgroups.
1349The threaded root cgroup can't be a parent of a
1350.I domain
cgroup, and a
1352.I threaded
1353cgroup
1354can't have a sibling that is a
1355.I domain
1356cgroup.
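.PP
The second step of either pathway
(converting each
.I "domain invalid"
cgroup under the threaded root) lends itself to automation.
The following is a minimal sketch with hypothetical function names.
It relies on
.I os.walk()
visiting a directory before its subdirectories, so that each cgroup is
converted before its children, because a
.I cgroup.type
file can't be written while the parent is still
.IR "domain invalid" .

```python
import os


def read_type(cgroup_dir):
    """Return the contents of cgroup.type without the trailing newline."""
    with open(os.path.join(cgroup_dir, "cgroup.type")) as f:
        return f.read().strip()


def finish_threaded_subtree(threaded_root):
    """Write "threaded" to every 'domain invalid' cgroup below the root.

    Returns the list of cgroup directories that were converted.
    """
    converted = []
    for dirpath, dirnames, _ in os.walk(threaded_root):
        dirnames.sort()  # deterministic order among siblings
        if dirpath == threaded_root:
            continue     # the root keeps the type "domain threaded"
        if read_type(dirpath) == "domain invalid":
            with open(os.path.join(dirpath, "cgroup.type"), "w") as f:
                f.write("threaded")
            converted.append(dirpath)
    return converted
```

Cgroups that already have the type
.I threaded
are left alone, so the function can safely be run more than once.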
1357.\"
1358.SS Using a threaded subtree
1359Within a threaded subtree, threaded controllers can be enabled
1360in each subgroup whose type has been changed to
1361.IR threaded ;
1362upon doing so, the corresponding controller interface files
1363appear in the children of that cgroup.
1364.PP
1365A process can be moved into a threaded subtree by writing its PID to the
1366.I cgroup.procs
1367file in one of the cgroups inside the tree.
1368This has the effect of making all of the threads
1369in the process members of the corresponding cgroup
1370and makes the process a member of the threaded subtree.
1371The threads of the process can then be spread across
1372the threaded subtree by writing their thread IDs (see
1373.BR gettid (2))
1374to the
b2c3e720 1375.I cgroup.threads
1376files in different cgroups inside the subtree.
1377The threads of a process must all reside in the same threaded subtree.
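.PP
Moving a single thread might look as follows.
The function name is hypothetical; the sketch uses
.IR threading.get_native_id() ,
which is available since Python 3.8
and on Linux returns the same value as
.BR gettid (2).

```python
import os
import threading


def move_thread(cgroup_dir, tid=None):
    """Write a thread ID to the cgroup.threads file in cgroup_dir.

    With tid=None, the calling thread is moved.
    Returns the thread ID that was written.
    """
    if tid is None:
        tid = threading.get_native_id()
    with open(os.path.join(cgroup_dir, "cgroup.threads"), "w") as f:
        f.write(str(tid))
    return tid
```

If the kernel rejects the move,
the error is reported to the caller as an
.I OSError
exception.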
1378.PP
1379As with writing to
1380.IR cgroup.procs ,
1381some containment rules apply when writing to the
b2c3e720 1382.I cgroup.threads
1383file:
1384.IP * 3
1385The writer must have write permission on the
.I cgroup.threads
1387file in the destination cgroup.
1388.IP *
1389The writer must have write permission on the
1390.I cgroup.procs
1391file in the common ancestor of the source and destination cgroups.
1392(In some cases,
1393the common ancestor may be the source or destination cgroup itself.)
1394.IP *
1395The source and destination cgroups must be in the same threaded subtree.
1396(Outside a threaded subtree, an attempt to move a thread by writing
1397its thread ID to the
1398.I cgroup.threads
1399file in a different
1400.I domain
1401cgroup fails with the error
1402.BR EOPNOTSUPP .)
1403.PP
1404The
1405.I cgroup.threads
1406file is present in each cgroup (including
1407.I domain
1408cgroups) and can be read in order to discover the set of threads
1409that is present in the cgroup.
1410The set of thread IDs obtained when reading this file
1411is not guaranteed to be ordered or free of duplicates.
1412.PP
1413The
1414.I cgroup.procs
1415file in the threaded root shows the PIDs of all processes
1416that are members of the threaded subtree.
1417The
1418.I cgroup.procs
1419files in the other cgroups in the subtree are not readable.
1420.PP
1421Domain controllers can't be enabled in a threaded subtree;
1422no controller-interface files appear inside the cgroups underneath the
1423threaded root.
1424From the point of view of a domain controller,
1425threaded subtrees are invisible:
1426a multithreaded process inside a threaded subtree appears to a domain
1427controller as a process that resides in the threaded root cgroup.
1428.PP
1429Within a threaded subtree, the "no internal processes" rule does not apply:
a cgroup can both contain member processes (or threads)
1431and exercise controllers on child cgroups.
1432.\"
1433.SS Rules for writing to cgroup.type and creating threaded subtrees
1434A number of rules apply when writing to the
1435.I cgroup.type
1436file:
1437.IP * 3
1438Only the string
1439.IR """threaded"""
1440may be written.
1441In other words, the only explicit transition that is possible is to convert a
1442.I domain
1443cgroup to type
1444.IR threaded .
1445.IP *
6c9aa5ad 1446The effect of writing
c8902e25 1447.IR """threaded"""
1448depends on the current value in
1449.IR cgroup.type ,
1450as follows:
1451.RS
1452.IP \(bu 3
1453.IR domain
1454or
1455.IR "domain threaded" :
1456start the creation of a threaded subtree
1457(whose root is the parent of this cgroup) via
1458the first of the pathways described above;
1459.IP \(bu
6c9aa5ad 1460.IR "domain\ invalid" :
4644794c 1461convert this cgroup (which is inside a threaded subtree) to a usable (i.e.,
1462.IR threaded )
1463state;
1464.IP \(bu
1465.IR threaded :
1466no effect (a "no-op").
1467.RE
1468.IP *
1469We can't write to a
1470.I cgroup.type
1471file if the parent's type is
1472.IR "domain invalid" .
1473In other words, the cgroups of a threaded subtree must be converted to the
1474.I threaded
1475state in a top-down manner.
1476.PP
00c27092 1477There are also some constraints that must be satisfied
1478in order to create a threaded subtree rooted at the cgroup
1479.IR x :
1480.IP * 3
1481There can be no member processes in the descendant cgroups of
1482.IR x .
1483(The cgroup
1484.I x
1485can itself have member processes.)
1486.IP *
1487No domain controllers may be enabled in
1488.IR x 's
1489.IR cgroup.subtree_control
1490file.
1491.PP
1492If any of the above constraints is violated, then an attempt to write
1493.IR """threaded"""
1494to a
1495.IR cgroup.type
1496file fails with the error
1497.BR ENOTSUP .
1498.\"
1499.SS The """domain threaded""" cgroup type
1500According to the pathways described above,
1501the type of a cgroup can change to
1502.IR "domain threaded"
1503in either of the following cases:
1504.IP * 3
1505The string
1506.IR """threaded"""
1507is written to a child cgroup.
1508.IP *
1509A threaded controller is enabled inside the cgroup and
1510a process is made a member of the cgroup.
1511.PP
1512A
1513.IR "domain threaded"
1514cgroup,
1515.IR x ,
1516can revert to the type
1517.IR domain
1518if the above conditions no longer hold true\(emthat is, if all
1519.I threaded
1520child cgroups of
1521.I x
1522are removed and either
1523.I x
1524no longer has threaded controllers enabled or
1525no longer has member processes.
1526.PP
1527When a
1528.IR "domain threaded"
1529cgroup
1530.IR x
1531reverts to the type
1532.IR domain :
1533.IP * 3
1534All
1535.IR "domain invalid"
1536descendants of
1537.I x
1538that are not in lower-level threaded subtrees revert to the type
1539.IR domain .
1540.IP *
1541The root cgroups in any lower-level threaded subtrees revert to the type
1542.IR "domain threaded" .
1543.\"
1544.SS Exceptions for the root cgroup
1545The root cgroup of the v2 hierarchy is treated exceptionally:
1546it can be the parent of both
1547.I domain
1548and
1549.I threaded
1550cgroups.
1551If the string
1552.I """threaded"""
1553is written to the
1554.I cgroup.type
1555file of one of the children of the root cgroup, then
1556.IP * 3
1557The type of that cgroup becomes
1558.IR threaded .
1559.IP *
1560The type of any descendants of that cgroup that
1561are not part of lower-level threaded subtrees changes to
1562.IR "domain invalid" .
1563.PP
1564Note that in this case, there is no cgroup whose type becomes
1565.IR "domain threaded" .
1566(Notionally, the root cgroup can be considered as the threaded root
1567for the cgroup whose type was changed to
1568.IR threaded .)
1569.PP
1570The aim of this exceptional treatment for the root cgroup is to
1571allow a threaded cgroup that employs the
1572.I cpu
1573controller to be placed as high as possible in the hierarchy,
1574so as to minimize the (small) cost of traversing the cgroup hierarchy.
1575.\"
edc90967 1576.SS The cgroups v2 """cpu""" controller and realtime threads
aa2c3623 1577As at Linux 4.19, the cgroups v2
c8902e25 1578.I cpu
1579controller does not support control of realtime threads
1580(specifically threads scheduled under any of the policies
1581.BR SCHED_FIFO ,
1582.BR SCHED_RR ,
and
1584.BR SCHED_DEADLINE ;
1585see
1586.BR sched (7)).
1587Therefore, the
1588.I cpu
1589controller can be enabled in the root cgroup only
c8902e25 1590if all realtime threads are in the root cgroup.
edc90967 1591(If there are realtime threads in nonroot cgroups, then a
1592.BR write (2)
1593of the string
1594.IR """+cpu"""
1595to the
1596.I cgroup.subtree_control
1597file fails with the error
c2df7694 1598.BR EINVAL .)
1599.PP
1600On some systems,
c8902e25 1601.BR systemd (1)
edc90967 1602places certain realtime threads in nonroot cgroups in the v2 hierarchy.
c8902e25 1603On such systems,
edc90967 1604these threads must first be moved to the root cgroup before the
1605.I cpu
1606controller can be enabled.
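.PP
To find the threads that would need to be moved,
a program can test each thread's scheduling policy.
In the following sketch the function names are hypothetical;
the numeric constants are the policy values from the Linux
.I <linux/sched.h>
header, written out because the Python
.I os
module does not define a constant for
.BR SCHED_DEADLINE .

```python
import os

# Policy numbers from the Linux UAPI header <linux/sched.h>.
SCHED_FIFO = 1
SCHED_RR = 2
SCHED_DEADLINE = 6
SCHED_RESET_ON_FORK = 0x40000000  # flag that may be ORed into the policy


def is_realtime_policy(policy):
    """Return True if policy is SCHED_FIFO, SCHED_RR, or SCHED_DEADLINE."""
    base = policy & ~SCHED_RESET_ON_FORK
    return base in (SCHED_FIFO, SCHED_RR, SCHED_DEADLINE)


def thread_is_realtime(tid):
    """Return True if the thread with the given ID has a realtime policy."""
    return is_realtime_policy(os.sched_getscheduler(tid))
```

The thread IDs to test can be obtained by reading the
.I cgroup.threads
files of the nonroot cgroups.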
1607.\"
1608.SH ERRORS
1609The following errors can occur for
1610.BR mount (2):
1611.TP
1612.B EBUSY
1613An attempt to mount a cgroup version 1 filesystem specified neither the
1614.I name=
1615option (to mount a named hierarchy) nor a controller name (or
1616.IR all ).
1617.SH NOTES
1618A child process created via
1619.BR fork (2)
1620inherits its parent's cgroup memberships.
1621A process's cgroup memberships are preserved across
1622.BR execve (2).
1623.\"
1624.SS /proc files
1625.TP
34eb3340 1626.IR /proc/cgroups " (since Linux 2.6.24)"
92bb6d36 1627This file contains information about the controllers
1a4f7d59 1628that are compiled into the kernel.
1629An example of the contents of this file (reformatted for readability)
1630is the following:
a721e8b2 1631.IP
34eb3340 1632.in +4n
b8302363 1633.EX
#subsys_name    hierarchy      num_cgroups    enabled
cpuset          4              1              1
cpu             8              1              1
cpuacct         8              1              1
blkio           6              1              1
memory          3              1              1
devices         10             84             1
freezer         7              1              1
net_cls         9              1              1
perf_event      5              1              1
net_prio        9              1              1
hugetlb         0              1              0
pids            2              1              1
b8302363 1647.EE
e646a1ba 1648.in
a721e8b2 1649.IP
1650The fields in this file are, from left to right:
1651.RS
1652.IP 1. 3
1653The name of the controller.
1654.IP 2.
92bb6d36 1655The unique ID of the cgroup hierarchy on which this controller is mounted.
11c0797f 1656If multiple cgroups v1 controllers are bound to the same hierarchy,
34eb3340 1657then each will show the same hierarchy ID in this field.
1658The value in this field will be 0 if:
1659.RS 5
1660.IP a) 3
1661the controller is not mounted on a cgroups v1 hierarchy;
1662.IP b)
1663the controller is bound to the cgroups v2 single unified hierarchy; or
1664.IP c)
1665the controller is disabled (see below).
1666.RE
1667.IP 3.
1668The number of control groups in this hierarchy using this controller.
1669.IP 4.
1670This field contains the value 1 if this controller is enabled,
1671or 0 if it has been disabled (via the
1672.IR cgroup_disable
1673kernel command-line boot parameter).
1674.RE
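.IP
The four fields can be extracted as in the following sketch.
The function name and the dictionary keys are hypothetical;
the sketch assumes that each non-header line contains exactly
four whitespace-separated fields, as in the example above.

```python
def parse_proc_cgroups(text):
    """Parse the contents of /proc/cgroups into a list of dicts."""
    entries = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        name, hierarchy, num_cgroups, enabled = line.split()
        entries.append({
            "name": name,
            "hierarchy": int(hierarchy),
            "num_cgroups": int(num_cgroups),
            "enabled": enabled == "1",
        })
    return entries
```

Controllers bound to the same cgroups v1 hierarchy can then be grouped
by comparing the
.I hierarchy
values.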
1675.TP
5c2181ad 1676.IR /proc/[pid]/cgroup " (since Linux 2.6.24)"
1677This file describes control groups to which the process
1678with the corresponding PID belongs.
5f8a7eb2 1679The displayed information differs for
2c4fbe35 1680cgroups version 1 and version 2 hierarchies.
a721e8b2 1681.IP
5f8a7eb2 1682For each cgroup hierarchy of which the process is a member,
2e33b59e 1683there is one entry containing three colon-separated fields:
a721e8b2 1684.IP
1685.in +4n
1686.EX
hierarchy-ID:controller-list:cgroup-path
1688.EE
1689.in
a721e8b2 1690.IP
5f8a7eb2 1691For example:
1692.IP
1693.in +4n
1694.EX
5:cpuacct,cpu,cpuset:/daemons
1696.EE
1697.in
1698.IP
1699The colon-separated fields are, from left to right:
5f8a7eb2 1700.RS
5c2181ad 1701.IP 1. 3
1702For cgroups version 1 hierarchies,
1703this field contains a unique hierarchy ID number
1704that can be matched to a hierarchy ID in
1705.IR /proc/cgroups .
1706For the cgroups version 2 hierarchy, this field contains the value 0.
5c2181ad 1707.IP 2.
5f8a7eb2 1708For cgroups version 1 hierarchies,
55f52de8 1709this field contains a comma-separated list of the controllers
5f8a7eb2
MK
1710bound to the hierarchy.
1711For the cgroups version 2 hierarchy, this field is empty.
5c2181ad 1712.IP 3.
5f8a7eb2
MK
1713This field contains the pathname of the control group in the hierarchy
1714to which the process belongs.
1715This pathname is relative to the mount point of the hierarchy.
5c2181ad 1716.RE
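.IP
The entry format above can be split into its three fields as follows.
This is a minimal sketch; the names CgroupEntry, parse_cgroup_line,
and is_v2_entry are hypothetical.
It assumes that only the first two colons act as separators,
so that a colon inside the cgroup pathname is preserved.

```python
from typing import List, NamedTuple


class CgroupEntry(NamedTuple):
    hierarchy_id: int       # field 1: 0 for the cgroups v2 hierarchy
    controllers: List[str]  # field 2: empty list for the cgroups v2 hierarchy
    path: str               # field 3: relative to the hierarchy mount point


def parse_cgroup_line(line: str) -> CgroupEntry:
    """Parse one hierarchy-ID:controller-list:cgroup-path entry."""
    # Split on the first two colons only; the path may contain colons.
    hierarchy_id, controller_list, path = line.strip().split(":", 2)
    controllers = controller_list.split(",") if controller_list else []
    return CgroupEntry(int(hierarchy_id), controllers, path)


def is_v2_entry(entry: CgroupEntry) -> bool:
    """True if the entry has hierarchy ID 0 and an empty controller list."""
    return entry.hierarchy_id == 0 and not entry.controllers


entry = parse_cgroup_line("5:cpuacct,cpu,cpuset:/daemons")
print(entry.hierarchy_id, entry.controllers, entry.path)
```

.IP
To inspect a live process, each line of
.I /proc/[pid]/cgroup
could be passed to the parsing function in turn.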
.\"
.SS /sys/kernel/cgroup files
.TP
.IR /sys/kernel/cgroup/delegate " (since Linux 4.15)"
.\" commit 01ee6cfb1483fe57c9cbd8e73817dfbf9bacffd3
This file exports a list of the cgroups v2 files
(one per line) that are delegatable
(i.e., whose ownership should be changed to the user ID of the delegatee).
In the future, the set of delegatable files may change or grow,
and this file provides a way for the kernel to inform
user-space applications of which files must be delegated.
As of Linux 4.15, one sees the following when inspecting this file:
.IP
.in +4n
.EX
$ \fBcat /sys/kernel/cgroup/delegate\fP
cgroup.procs
cgroup.subtree_control
cgroup.threads
.EE
.in
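.IP
A delegater could use this list rather than hard-coding file names.
The following is a rough sketch under stated assumptions, not a complete
delegation procedure: the names read_delegatable and chown_delegatable
are hypothetical, only files named in the list are touched,
and ownership of the cgroup directory itself is left to the caller.

```python
import os
from typing import List

DELEGATE_LIST = "/sys/kernel/cgroup/delegate"


def read_delegatable(list_path: str = DELEGATE_LIST) -> List[str]:
    """Return the file names listed, one per line, in list_path."""
    with open(list_path) as f:
        return [line.strip() for line in f if line.strip()]


def chown_delegatable(cgroup_dir: str, uid: int, names: List[str]) -> List[str]:
    """Change the owner of each listed file that exists in cgroup_dir.

    Returns the names of the files whose ownership was changed.
    """
    changed = []
    for name in names:
        target = os.path.join(cgroup_dir, name)
        if os.path.exists(target):
            os.chown(target, uid, -1)  # -1 leaves the group ID unchanged
            changed.append(name)
    return changed
```

.IP
Changing ownership to another user normally requires privilege;
the cgroup directory and user ID passed in are chosen by the caller.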
.TP
.IR /sys/kernel/cgroup/features " (since Linux 4.15)"
.\" commit 5f2e673405b742be64e7c3604ed4ed3ac14f35ce
Over time, the set of cgroups v2 features that are provided by the
kernel may change or grow,
or some features may not be enabled by default.
This file provides a way for user-space applications to discover what
features the running kernel supports and has enabled.
Features are listed one per line:
.IP
.in +4n
.EX
$ \fBcat /sys/kernel/cgroup/features\fP
nsdelegate
.EE
.in
.IP
The entries that can appear in this file are:
.RS
.TP
.IR nsdelegate " (since Linux 4.15)"
The kernel supports the
.I nsdelegate
mount option.
.RE
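.IP
An application might probe for a feature before relying on it.
The following is a minimal sketch; the names parse_features and
supports_feature are hypothetical, and treating a missing file as
"feature not supported" is an assumption made here for illustration.

```python
from typing import Set

FEATURES_FILE = "/sys/kernel/cgroup/features"


def parse_features(text: str) -> Set[str]:
    """Return the set of feature names, which are listed one per line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def supports_feature(name: str, path: str = FEATURES_FILE) -> bool:
    """Report whether the named feature appears in the features file."""
    try:
        with open(path) as f:
            return name in parse_features(f.read())
    except FileNotFoundError:
        # Assumption: no features file means the feature is unavailable.
        return False


print(supports_feature("nsdelegate"))
```

.IP
Reading the whole file into a set keeps the check independent of the
order in which the kernel lists its features.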
.SH SEE ALSO
.BR prlimit (1),
.BR systemd (1),
.BR systemd-cgls (1),
.BR systemd-cgtop (1),
.BR clone (2),
.BR ioprio_set (2),
.BR perf_event_open (2),
.BR setrlimit (2),
.BR cgroup_namespaces (7),
.BR cpuset (7),
.BR namespaces (7),
.BR sched (7),
.BR user_namespaces (7)