.\" Copyright (C) 2015 Serge Hallyn <serge@hallyn.com>
.\" and Copyright (C) 2016, 2017 Michael Kerrisk <mtk.manpages@gmail.com>
.\"
.\" %%%LICENSE_START(VERBATIM)
.\" Permission is granted to make and distribute verbatim copies of this
.\" manual provided the copyright notice and this permission notice are
.\" preserved on all copies.
.\"
.\" Permission is granted to copy and distribute modified versions of this
.\" manual under the conditions for verbatim copying, provided that the
.\" entire resulting derived work is distributed under the terms of a
.\" permission notice identical to this one.
.\"
.\" Since the Linux kernel and libraries are constantly changing, this
.\" manual page may be incorrect or out-of-date. The author(s) assume no
.\" responsibility for errors or omissions, or for damages resulting from
.\" the use of the information contained herein. The author(s) may not
.\" have taken the same level of care in the production of this manual,
.\" which is licensed free of charge, as they might when working
.\" professionally.
.\"
.\" Formatted or processed versions of this manual, if unaccompanied by
.\" the source, must acknowledge the copyright and authors of this work.
.\" %%%LICENSE_END
.\"
.TH CGROUPS 7 2019-03-06 "Linux" "Linux Programmer's Manual"
.SH NAME
cgroups \- Linux control groups
.SH DESCRIPTION
Control groups, usually referred to as cgroups,
are a Linux kernel feature which allows processes to
be organized into hierarchical groups whose usage of
various types of resources can then be limited and monitored.
The kernel's cgroup interface is provided through
a pseudo-filesystem called cgroupfs.
Grouping is implemented in the core cgroup kernel code,
while resource tracking and limits are implemented in
a set of per-resource-type subsystems (memory, CPU, and so on).
.\"
.SS Terminology
A
.I cgroup
is a collection of processes that are bound to a set of
limits or parameters defined via the cgroup filesystem.
.PP
A
.I subsystem
is a kernel component that modifies the behavior of
the processes in a cgroup.
Various subsystems have been implemented, making it possible to do things
such as limiting the amount of CPU time and memory available to a cgroup,
accounting for the CPU time used by a cgroup,
and freezing and resuming execution of the processes in a cgroup.
Subsystems are sometimes also known as
.IR "resource controllers"
(or simply, controllers).
.PP
The cgroups for a controller are arranged in a
.IR hierarchy .
This hierarchy is defined by creating, removing, and
renaming subdirectories within the cgroup filesystem.
At each level of the hierarchy, attributes (e.g., limits) can be defined.
The limits, control, and accounting provided by cgroups generally have
effect throughout the subhierarchy underneath the cgroup where the
attributes are defined.
Thus, for example, the limits placed on
a cgroup at a higher level in the hierarchy cannot be exceeded
by descendant cgroups.
.\"
.SS Cgroups version 1 and version 2
The initial release of the cgroups implementation was in Linux 2.6.24.
Over time, various cgroup controllers have been added
to allow the management of various types of resources.
However, the development of these controllers was largely uncoordinated,
with the result that many inconsistencies arose between controllers
and management of the cgroup hierarchies became rather complex.
(A longer description of these problems can be found in
the kernel source file
.IR Documentation/cgroup\-v2.txt .)
.PP
Because of the problems with the initial cgroups implementation
(cgroups version 1),
starting in Linux 3.10, work began on a new,
orthogonal implementation to remedy these problems.
Initially marked experimental, and hidden behind the
.I "\-o\ __DEVEL__sane_behavior"
mount option, the new version (cgroups version 2)
was eventually made official with the release of Linux 4.5.
Differences between the two versions are described in the text below.
.PP
Although cgroups v2 is intended as a replacement for cgroups v1,
the older system continues to exist
(and for compatibility reasons is unlikely to be removed).
Currently, cgroups v2 implements only a subset of the controllers
available in cgroups v1.
The two systems are implemented so that both v1 controllers and
v2 controllers can be mounted on the same system.
Thus, for example, it is possible to use those controllers
that are supported under version 2,
while also using version 1 controllers
where version 2 does not yet support those controllers.
The only restriction here is that a controller can't be simultaneously
employed in both a cgroups v1 hierarchy and in the cgroups v2 hierarchy.
.\"
.SH CGROUPS VERSION 1
Under cgroups v1, each controller may be mounted against a separate
cgroup filesystem that provides its own hierarchical organization of the
processes on the system.
It is also possible to comount multiple (or even all) cgroups v1 controllers
against the same cgroup filesystem, meaning that the comounted controllers
manage the same hierarchical organization of processes.
.PP
For each mounted hierarchy,
the directory tree mirrors the control group hierarchy.
Each control group is represented by a directory, with each of its child
cgroups represented as a child directory.
For instance,
.IR /user/joe/1.session
represents control group
.IR 1.session ,
which is a child of cgroup
.IR joe ,
which is a child of
.IR /user .
Under each cgroup directory is a set of files which can be read or
written to, reflecting resource limits and a few general cgroup
properties.
.\"
.SS Tasks (threads) versus processes
In cgroups v1, a distinction is drawn between
.I processes
and
.IR tasks .
In this view, a process can consist of multiple tasks
(more commonly called threads, from a user-space perspective,
and called such in the remainder of this man page).
In cgroups v1, it is possible to independently manipulate
the cgroup memberships of the threads in a process.
.PP
The cgroups v1 ability to split threads across different cgroups
caused problems in some cases.
For example, it made no sense for the
.I memory
controller,
since all of the threads of a process share a single address space.
Because of these problems,
the ability to independently manipulate the cgroup memberships
of the threads in a process was removed in the initial cgroups v2
implementation, and subsequently restored in a more limited form
(see the discussion of "thread mode" below).
.\"
.SS Mounting v1 controllers
The use of cgroups requires a kernel built with the
.BR CONFIG_CGROUPS
option.
In addition, each of the v1 controllers has an associated
configuration option that must be set in order to employ that controller.
.PP
In order to use a v1 controller,
it must be mounted against a cgroup filesystem.
The usual place for such mounts is under a
.BR tmpfs (5)
filesystem mounted at
.IR /sys/fs/cgroup .
Thus, one might mount the
.I cpu
controller as follows:
.PP
.in +4n
.EX
mount \-t cgroup \-o cpu none /sys/fs/cgroup/cpu
.EE
.in
.PP
It is possible to comount multiple controllers against the same hierarchy.
For example, here the
.IR cpu
and
.IR cpuacct
controllers are comounted against a single hierarchy:
.PP
.in +4n
.EX
mount \-t cgroup \-o cpu,cpuacct none /sys/fs/cgroup/cpu,cpuacct
.EE
.in
.PP
Comounting controllers has the effect that a process is in the same cgroup for
all of the comounted controllers.
Separately mounting controllers allows a process to
be in cgroup
.I /foo1
for one controller while being in
.I /foo2/foo3
for another.
.PP
It is possible to comount all v1 controllers against the same hierarchy:
.PP
.in +4n
.EX
mount \-t cgroup \-o all cgroup /sys/fs/cgroup
.EE
.in
.PP
(One can achieve the same result by omitting
.IR "\-o all" ,
since it is the default if no controllers are explicitly specified.)
.PP
It is not possible to mount the same controller
against multiple cgroup hierarchies.
For example, it is not possible to mount both the
.I cpu
and
.I cpuacct
controllers against one hierarchy, and to mount the
.I cpu
controller alone against another hierarchy.
It is possible to create multiple mount points with exactly
the same set of comounted controllers.
However, in this case all that results is multiple mount points
providing a view of the same hierarchy.
.PP
Note that on many systems, the v1 controllers are automatically mounted under
.IR /sys/fs/cgroup ;
in particular,
.BR systemd (1)
automatically creates such mount points.
.\"
.SS Unmounting v1 controllers
A mounted cgroup filesystem can be unmounted using the
.BR umount (8)
command, as in the following example:
.PP
.in +4n
.EX
umount /sys/fs/cgroup/pids
.EE
.in
.PP
.IR "But note well" :
a cgroup filesystem is unmounted only if it is not busy,
that is, it has no child cgroups.
If this is not the case, then the only effect of
.BR umount (8)
is to make the mount invisible.
Thus, to ensure that the mount point is really removed,
one must first remove all child cgroups,
which in turn can be done only after all member processes
have been moved from those cgroups to the root cgroup.
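.PP
The procedure described above can be sketched as a pair of shell functions.
This is only an illustration under stated assumptions:
the function names are hypothetical,
the \-mindepth option of find(1) is a common extension rather than a
POSIX feature, and no error handling is attempted.

```shell
# Print the child cgroups of the hierarchy mounted at $1, deepest first,
# so that every cgroup is listed before its parent.
list_child_cgroups() {
    find "$1" -mindepth 1 -depth -type d
}

# Move all member processes of the child cgroups to the root cgroup of
# the hierarchy mounted at $1, then remove the emptied child cgroups.
# (Hypothetical helper; must be run with sufficient privilege.)
empty_hierarchy() {
    root=$1
    list_child_cgroups "$root" | while IFS= read -r cg; do
        while IFS= read -r pid; do
            # One PID per write, as cgroup.procs requires.
            echo "$pid" > "$root/cgroup.procs"
        done < "$cg/cgroup.procs"
        rmdir "$cg"
    done
}
```

Once a call such as "empty_hierarchy /sys/fs/cgroup/pids" has succeeded
for every child cgroup, unmounting the filesystem with umount(8)
should actually remove the mount rather than merely hide it.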
.\"
.SS Cgroups version 1 controllers
Each of the cgroups version 1 controllers is governed
by a kernel configuration option (listed below).
Additionally, the availability of the cgroups feature is governed by the
.BR CONFIG_CGROUPS
kernel configuration option.
.TP
.IR cpu " (since Linux 2.6.24; " \fBCONFIG_CGROUP_SCHED\fP )
Cgroups can be guaranteed a minimum number of "CPU shares"
when a system is busy.
This does not limit a cgroup's CPU usage if the CPUs are not busy.
For further information, see the kernel source file
.IR Documentation/scheduler/sched\-design\-CFS.txt .
.IP
In Linux 3.2,
this controller was extended to provide CPU "bandwidth" control.
If the kernel is configured with
.BR CONFIG_CFS_BANDWIDTH ,
then within each scheduling period
(defined via a file in the cgroup directory), it is possible to define
an upper limit on the CPU time allocated to the processes in a cgroup.
This upper limit applies even if there is no other competition for the CPU.
Further information can be found in the kernel source file
.IR Documentation/scheduler/sched\-bwc.txt .
.TP
.IR cpuacct " (since Linux 2.6.24; " \fBCONFIG_CGROUP_CPUACCT\fP )
This provides accounting for CPU usage by groups of processes.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup\-v1/cpuacct.txt .
.TP
.IR cpuset " (since Linux 2.6.24; " \fBCONFIG_CPUSETS\fP )
This controller can be used to bind the processes in a cgroup to
a specified set of CPUs and NUMA nodes.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup\-v1/cpusets.txt .
.TP
.IR memory " (since Linux 2.6.25; " \fBCONFIG_MEMCG\fP )
The memory controller supports reporting and limiting of process memory, kernel
memory, and swap used by cgroups.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup\-v1/memory.txt .
.TP
.IR devices " (since Linux 2.6.26; " \fBCONFIG_CGROUP_DEVICE\fP )
This supports controlling which processes may create (mknod) devices as
well as open them for reading or writing.
The policies may be specified as whitelists and blacklists.
Hierarchy is enforced, so new rules must not
violate existing rules for the target or ancestor cgroups.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup\-v1/devices.txt .
.TP
.IR freezer " (since Linux 2.6.28; " \fBCONFIG_CGROUP_FREEZER\fP )
The
.IR freezer
cgroup can suspend and restore (resume) all processes in a cgroup.
Freezing a cgroup
.I /A
also causes its children, for example, processes in
.IR /A/B ,
to be frozen.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup\-v1/freezer\-subsystem.txt .
.TP
.IR net_cls " (since Linux 2.6.29; " \fBCONFIG_CGROUP_NET_CLASSID\fP )
This places a classid, specified for the cgroup, on network packets
created by processes in a cgroup.
These classids can then be used in firewall rules,
as well as used to shape traffic using
.BR tc (8).
This applies only to packets
leaving the cgroup, not to traffic arriving at the cgroup.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup\-v1/net_cls.txt .
.TP
.IR blkio " (since Linux 2.6.33; " \fBCONFIG_BLK_CGROUP\fP )
The
.I blkio
cgroup controls and limits access to specified block devices by
applying I/O control in the form of throttling and upper limits against leaf
nodes and intermediate nodes in the storage hierarchy.
.IP
Two policies are available.
The first is a proportional-weight time-based division
of disk implemented with CFQ.
This is in effect for leaf nodes using CFQ.
The second is a throttling policy which specifies
upper I/O rate limits on a device.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup\-v1/blkio\-controller.txt .
.TP
.IR perf_event " (since Linux 2.6.39; " \fBCONFIG_CGROUP_PERF\fP )
This controller allows
.I perf
monitoring of the set of processes grouped in a cgroup.
.IP
Further information can be found in the kernel source file
.IR tools/perf/Documentation/perf\-record.txt .
.TP
.IR net_prio " (since Linux 3.3; " \fBCONFIG_CGROUP_NET_PRIO\fP )
This allows priorities to be specified, per network interface, for cgroups.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup\-v1/net_prio.txt .
.TP
.IR hugetlb " (since Linux 3.5; " \fBCONFIG_CGROUP_HUGETLB\fP )
This supports limiting the use of huge pages by cgroups.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup\-v1/hugetlb.txt .
.TP
.IR pids " (since Linux 4.3; " \fBCONFIG_CGROUP_PIDS\fP )
This controller permits limiting the number of processes that may be created
in a cgroup (and its descendants).
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup\-v1/pids.txt .
.TP
.IR rdma " (since Linux 4.11; " \fBCONFIG_CGROUP_RDMA\fP )
The RDMA controller permits limiting the use of
RDMA/IB-specific resources per cgroup.
.IP
Further information can be found in the kernel source file
.IR Documentation/cgroup\-v1/rdma.txt .
.\"
.SS Creating cgroups and moving processes
A cgroup filesystem initially contains a single root cgroup, '/',
which all processes belong to.
A new cgroup is created by creating a directory in the cgroup filesystem:
.PP
.in +4n
.EX
mkdir /sys/fs/cgroup/cpu/cg1
.EE
.in
.PP
This creates a new empty cgroup.
.PP
A process may be moved to this cgroup by writing its PID into the cgroup's
.I cgroup.procs
file:
.PP
.in +4n
.EX
echo $$ > /sys/fs/cgroup/cpu/cg1/cgroup.procs
.EE
.in
.PP
Only one PID at a time should be written to this file.
.PP
Writing the value 0 to a
.IR cgroup.procs
file causes the writing process to be moved to the corresponding cgroup.
.PP
When a PID is written to the
.I cgroup.procs
file,
all threads in the process are moved into the new cgroup at once.
.PP
Within a hierarchy, a process can be a member of exactly one cgroup.
Writing a process's PID to a
.IR cgroup.procs
file automatically removes it from the cgroup of
which it was previously a member.
.PP
The
.I cgroup.procs
file can be read to obtain a list of the processes that are
members of a cgroup.
The returned list of PIDs is not guaranteed to be in order.
Nor is it guaranteed to be free of duplicates.
(For example, a PID may be recycled while reading from the list.)
.PP
In cgroups v1, an individual thread can be moved to
another cgroup by writing its thread ID
(i.e., the kernel thread ID returned by
.BR clone (2)
and
.BR gettid (2))
to the
.IR tasks
file in a cgroup directory.
This file can be read to discover the set of threads
that are members of the cgroup.
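.PP
Because the list read from cgroup.procs may be unordered and may contain
duplicates, a reader will often want to normalize it.
The following sketch shows one way of doing so,
along with a small wrapper for moving a process;
both function names are hypothetical.

```shell
# Move a process (with all of its threads) into a cgroup.
# $1: cgroup directory; $2: PID.  A PID of 0 means the process doing the
# write, which here is the shell itself, since echo is a shell built-in.
move_to_cgroup() {
    echo "$2" > "$1/cgroup.procs"
}

# Print the member PIDs of the cgroup in directory $1 in numerical
# order, with any duplicates removed.
list_cgroup_pids() {
    sort -n -u "$1/cgroup.procs"
}
```

For example, "move_to_cgroup /sys/fs/cgroup/cpu/cg1 0" would place
the calling shell in the cgroup created above.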
.\"
.SS Removing cgroups
A cgroup can be removed only if
it has no child cgroups and contains no (nonzombie) processes.
So long as that is the case, one can simply
remove the corresponding directory pathname.
Note that files in a cgroup directory cannot and need not be
removed.
.\"
.SS Cgroups v1 release notification
Two files can be used to determine whether the kernel provides
notifications when a cgroup becomes empty.
A cgroup is considered to be empty when it contains no child
cgroups and no member processes.
.PP
A special file in the root directory of each cgroup hierarchy,
.IR release_agent ,
can be used to register the pathname of a program that may be invoked when
a cgroup in the hierarchy becomes empty.
The pathname of the newly empty cgroup (relative to the cgroup mount point)
is provided as the sole command-line argument when the
.IR release_agent
program is invoked.
The
.IR release_agent
program might remove the cgroup directory,
or perhaps repopulate it with a process.
.PP
The default value of the
.IR release_agent
file is empty, meaning that no release agent is invoked.
.PP
The content of the
.I release_agent
file can also be specified via a mount option when the
cgroup filesystem is mounted:
.PP
.in +4n
.EX
mount \-o release_agent=pathname ...
.EE
.in
.PP
Whether or not the
.IR release_agent
program is invoked when a particular cgroup becomes empty is determined
by the value in the
.IR notify_on_release
file in the corresponding cgroup directory.
If this file contains the value 0, then the
.IR release_agent
program is not invoked.
If it contains the value 1, the
.IR release_agent
program is invoked.
The default value for this file in the root cgroup is 0.
At the time when a new cgroup is created,
the value in this file is inherited from the corresponding file
in the parent cgroup.
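.PP
The two files can be set up together, as in the following sketch.
The function names and the CG_MOUNT variable are hypothetical;
in particular, the agent program is told only the cgroup pathname,
so it is assumed here that it learns the mount point by some other means.

```shell
# Register a release agent for a v1 hierarchy and enable notification
# for one cgroup in that hierarchy.
#   $1: mount point of the hierarchy
#   $2: cgroup pathname relative to the mount point
#   $3: pathname of the release agent program
enable_release_notification() {
    echo "$3" > "$1/release_agent" &&
    echo 1 > "$1/$2/notify_on_release"
}

# Possible body of a release agent that removes the newly empty cgroup.
# $1 is the cgroup pathname relative to the mount point; CG_MOUNT
# (an assumption of this sketch) names the mount point.
release_agent_main() {
    rmdir "${CG_MOUNT:?mount point not set}/${1#/}"
}
```

Cgroups created under the chosen cgroup after this setup inherit
the notify_on_release value of 1 from their parent.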
.\"
.SS Cgroup v1 named hierarchies
In cgroups v1,
it is possible to mount a cgroup hierarchy that has no attached controllers:
.PP
.in +4n
.EX
mount \-t cgroup \-o none,name=somename none /some/mount/point
.EE
.in
.PP
Multiple instances of such hierarchies can be mounted;
each hierarchy must have a unique name.
The only purpose of such hierarchies is to track processes.
(See the discussion of release notification above.)
An example of this is the
.I name=systemd
cgroup hierarchy that is used by
.BR systemd (1)
to track services and user sessions.
.PP
Since Linux 5.0, the
.I cgroup_no_v1
kernel boot option (described below) can be used to disable cgroup v1
named hierarchies, by specifying
.IR cgroup_no_v1=named .
.\"
.SH CGROUPS VERSION 2
In cgroups v2,
all mounted controllers reside in a single unified hierarchy.
While (different) controllers may be simultaneously
mounted under the v1 and v2 hierarchies,
it is not possible to mount the same controller simultaneously
under both the v1 and the v2 hierarchies.
.PP
The new behaviors in cgroups v2 are summarized here,
and in some cases elaborated in the following subsections.
.IP 1. 3
Cgroups v2 provides a unified hierarchy against
which all controllers are mounted.
.IP 2.
"Internal" processes are not permitted.
With the exception of the root cgroup, processes may reside
only in leaf nodes (cgroups that do not themselves contain child cgroups).
The details are somewhat more subtle than this, and are described below.
.IP 3.
Active controllers must be specified via the files
.IR cgroup.controllers
and
.IR cgroup.subtree_control .
.IP 4.
The
.I tasks
file has been removed.
In addition, the
.I cgroup.clone_children
file that is employed by the
.I cpuset
controller has been removed.
.IP 5.
An improved mechanism for notification of empty cgroups is provided by the
.IR cgroup.events
file.
.PP
For more changes, see the
.I Documentation/cgroup\-v2.txt
file in the kernel source.
.PP
Some of the new behaviors listed above saw subsequent modification with
the addition in Linux 4.14 of "thread mode" (described below).
.\"
.SS Cgroups v2 unified hierarchy
In cgroups v1, the ability to mount different controllers
against different hierarchies was intended to allow great flexibility
for application design.
In practice, though,
the flexibility turned out to be less useful than expected,
and in many cases added complexity.
Therefore, in cgroups v2,
all available controllers are mounted against a single hierarchy.
The available controllers are automatically mounted,
meaning that it is not necessary (or possible) to specify the controllers
when mounting the cgroup v2 filesystem using a command such as the following:
.PP
.in +4n
.EX
mount \-t cgroup2 none /mnt/cgroup2
.EE
.in
.PP
A cgroup v2 controller is available only if it is not currently in use
via a mount against a cgroup v1 hierarchy.
Or, to put things another way, it is not possible to employ
the same controller against both a v1 hierarchy and the unified v2 hierarchy.
This means that it may be necessary first to unmount a v1 controller
(as described above) before that controller is available in v2.
Since
.BR systemd (1)
makes heavy use of some v1 controllers by default,
it can in some cases be simpler to boot the system with
selected v1 controllers disabled.
To do this, specify the
.IR cgroup_no_v1=list
option on the kernel boot command line;
.I list
is a comma-separated list of the names of the controllers to disable,
or the word
.I all
to disable all v1 controllers.
(This situation is correctly handled by
.BR systemd (1),
which falls back to operating without the specified controllers.)
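.PP
To check which v1 controllers were disabled at boot,
the option can be extracted from the kernel command line
(available in /proc/cmdline).
The following is a minimal sketch with a hypothetical function name;
it assumes that the options are separated by single spaces.

```shell
# Print the entries of the cgroup_no_v1= list, one per line.
# $1 is the kernel command line, for example "$(cat /proc/cmdline)".
cgroup_no_v1_list() {
    printf '%s\n' "$1" |
        tr ' ' '\n' |
        sed -n 's/^cgroup_no_v1=//p' |
        tr ',' '\n'
}
```

The function prints nothing if the option is absent,
and prints the single word "all" if all v1 controllers were disabled.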
.PP
Note that on many modern systems,
.BR systemd (1)
automatically mounts the
.I cgroup2
filesystem at
.I /sys/fs/cgroup/unified
during the boot process.
.\"
.SS Cgroups v2 controllers
The following controllers, documented in the kernel source file
.IR Documentation/cgroup\-v2.txt ,
are supported in cgroups version 2:
.TP
.IR io " (since Linux 4.5)"
This is the successor to the version 1
.I blkio
controller.
.TP
.IR memory " (since Linux 4.5)"
This is the successor to the version 1
.I memory
controller.
.TP
.IR pids " (since Linux 4.5)"
This is the same as the version 1
.I pids
controller.
.TP
.IR perf_event " (since Linux 4.11)"
This is the same as the version 1
.I perf_event
controller.
.TP
.IR rdma " (since Linux 4.11)"
This is the same as the version 1
.I rdma
controller.
.TP
.IR cpu " (since Linux 4.15)"
This is the successor to the version 1
.I cpu
and
.I cpuacct
controllers.
.\"
.SS Cgroups v2 subtree control
Each cgroup in the v2 hierarchy contains the following two files:
.TP
.IR cgroup.controllers
This read-only file exposes a list of the controllers that are
.I available
in this cgroup.
The contents of this file match the contents of the
.I cgroup.subtree_control
file in the parent cgroup.
.TP
.I cgroup.subtree_control
This is a list of controllers that are
.IR active
.RI ( enabled )
in the cgroup.
The set of controllers in this file is a subset of the set in the
.IR cgroup.controllers
file of this cgroup.
The set of active controllers is modified by writing strings to this file
containing space-delimited controller names,
each preceded by '+' (to enable a controller)
or '\-' (to disable a controller), as in the following example:
.IP
.in +4n
.EX
echo '+pids \-memory' > x/y/cgroup.subtree_control
.EE
.in
.IP
An attempt to enable a controller
that is not present in
.I cgroup.controllers
leads to an
.B ENOENT
error when writing to the
.I cgroup.subtree_control
file.
.PP
Because the list of controllers in
.I cgroup.subtree_control
is a subset of those in
.IR cgroup.controllers ,
a controller that has been disabled in one cgroup in the hierarchy
can never be re-enabled in the subtree below that cgroup.
.PP
A cgroup's
.I cgroup.subtree_control
file determines the set of controllers that are exercised in the
.I child
cgroups.
When a controller (e.g.,
.IR pids )
is present in the
.I cgroup.subtree_control
file of a parent cgroup,
then the corresponding controller-interface files (e.g.,
.IR pids.max )
are automatically created in the children of that cgroup
and can be used to exert resource control in the child cgroups.
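.PP
A script can check cgroup.controllers itself before writing,
rather than relying on the ENOENT error.
The following is a minimal sketch of that check;
the function name is hypothetical and the error message is illustrative.

```shell
# Enable controller $2 for the children of the cgroup in directory $1,
# after checking that the controller is listed as available there.
enable_controller() {
    cg=$1
    ctrl=$2
    # cgroup.controllers holds space-separated names on a single line.
    if ! tr ' ' '\n' < "$cg/cgroup.controllers" | grep -qxF "$ctrl"; then
        echo "controller '$ctrl' is not available in $cg" >&2
        return 1
    fi
    echo "+$ctrl" > "$cg/cgroup.subtree_control"
}
```

After a successful call such as "enable_controller x/y pids",
the children of x/y should contain the pids interface files.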
.\"
.SS Cgroups v2 """no internal processes""" rule
Cgroups v2 enforces a so-called "no internal processes" rule.
Roughly speaking, this rule means that,
with the exception of the root cgroup, processes may reside
only in leaf nodes (cgroups that do not themselves contain child cgroups).
This avoids the need to decide how to partition resources between
processes which are members of cgroup A and processes in child cgroups of A.
.PP
For instance, if cgroup
.I /cg1/cg2
exists, then a process may reside in
.IR /cg1/cg2 ,
but not in
.IR /cg1 .
This is to avoid an ambiguity in cgroups v1
with respect to the delegation of resources between processes in
.I /cg1
and its child cgroups.
The recommended approach in cgroups v2 is to create a subdirectory called
.I leaf
for any nonleaf cgroup which should contain processes, but no child cgroups.
Thus, processes which previously would have gone into
.I /cg1
would now go into
.IR /cg1/leaf .
This has the advantage of making explicit
the relationship between processes in
.I /cg1/leaf
and
.IR /cg1 's
other children.
.PP
The "no internal processes" rule is in fact more subtle than stated above.
More precisely, the rule is that a (nonroot) cgroup can't both
(1) have member processes, and
(2) distribute resources into child cgroups\(emthat is, have a nonempty
.I cgroup.subtree_control
file.
Thus, it
.I is
possible for a cgroup to have both member processes and child cgroups,
but before controllers can be enabled for that cgroup,
the member processes must be moved out of the cgroup
(e.g., perhaps into the child cgroups).
.PP
With the Linux 4.14 addition of "thread mode" (described below),
the "no internal processes" rule has been relaxed in some cases.
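.PP
The recommended "leaf" arrangement can be set up with a few lines of shell.
The following is a sketch only:
the function name is hypothetical,
and processes that are created in the cgroup while the loop runs
are not handled.

```shell
# Move the member processes of the cgroup in directory $1 into a child
# cgroup named "leaf", so that controllers can afterward be enabled in
# the parent's cgroup.subtree_control file.
move_procs_to_leaf() {
    cg=$1
    mkdir -p "$cg/leaf" || return 1
    while IFS= read -r pid; do
        # One PID per write.
        echo "$pid" > "$cg/leaf/cgroup.procs"
    done < "$cg/cgroup.procs"
}
```

Once the parent has no member processes,
a write to its cgroup.subtree_control file is no longer
blocked by the "no internal processes" rule.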
.\"
.SS Cgroups v2 cgroup.events file
With cgroups v2, a new mechanism is provided to obtain notification
about when a cgroup becomes empty.
The cgroups v1
.IR release_agent
and
.IR notify_on_release
files are removed, and replaced by a new, more general-purpose file,
.IR cgroup.events .
This read-only file contains key-value pairs
(delimited by newline characters, with the key and value separated by spaces)
that identify events or state for a cgroup.
Currently, only one key appears in this file,
.IR populated ,
which has either the value 0,
meaning that the cgroup (and its descendants)
contain no (nonzombie) processes,
or 1, meaning that the cgroup (or one of its descendants)
contains member processes.
.PP
The
.IR cgroup.events
file can be monitored, in order to receive notification when a cgroup
transitions between the populated and unpopulated states.
When monitoring this file using
.BR inotify (7),
transitions generate
.BR IN_MODIFY
events, and when monitoring the file using
.BR poll (2),
transitions cause the bits
.B POLLPRI
and
.B POLLERR
to be returned in the
.IR revents
field.
.PP
The cgroups v2 release-notification mechanism provided by the
.I populated
field of the
.I cgroup.events
file offers at least two advantages over the cgroups v1
.IR release_agent
mechanism.
First, it allows for cheaper notification,
since a single process can monitor multiple
.IR cgroup.events
files.
By contrast, the cgroups v1 mechanism requires the creation
of a process for each notification.
Second, notification can be delegated to a process that lives inside
a container associated with the newly empty cgroup.
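.PP
From a shell script, the populated key can be read with awk(1),
and the inotify events can be awaited with inotifywait(1).
The following sketch makes two assumptions:
the function names are hypothetical,
and inotifywait (from the inotify-tools package) is installed.
The one-second timeout bounds the delay if a transition occurs
between the check and the start of the wait.

```shell
# Print the value of the "populated" key (0 or 1) from the
# cgroup.events file of the cgroup in directory $1.
cgroup_populated() {
    awk '$1 == "populated" { print $2 }' "$1/cgroup.events"
}

# Return once the cgroup in directory $1 is reported as unpopulated.
wait_until_empty() {
    while [ "$(cgroup_populated "$1")" = 1 ]; do
        # Wait for an IN_MODIFY event, or for at most one second.
        inotifywait -qq -t 1 -e modify "$1/cgroup.events"
    done
}
```

A single script could run such a loop over several cgroup.events files,
which is the first of the advantages described above.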
.\"
.SS Cgroups v2 cgroup.stat file
.\" commit ec39225cca42c05ac36853d11d28f877fde5c42e
Each cgroup in the v2 hierarchy contains a read-only
.IR cgroup.stat
file (first introduced in Linux 4.14)
that consists of lines containing key-value pairs.
The following keys currently appear in this file:
.TP
.I nr_descendants
This is the total number of visible (i.e., living) descendant cgroups
underneath this cgroup.
.TP
.I nr_dying_descendants
This is the total number of dying descendant cgroups
underneath this cgroup.
A cgroup enters the dying state after being deleted.
It remains in that state for an undefined period
(which will depend on system load)
while resources are freed before the cgroup is destroyed.
Note that the presence of some cgroups in the dying state is normal,
and is not indicative of any problem.
.IP
A process can't be made a member of a dying cgroup,
and a dying cgroup can't be brought back to life.
844.\"
5845e10b
MK
845.SS Limiting the number of descendant cgroups
846Each cgroup in the v2 hierarchy contains the following files,
847which can be used to view and set limits on the number
848of descendant cgroups under that cgroup:
849.TP
850.IR cgroup.max.depth " (since Linux 4.14)"
851.\" commit 1a926e0bbab83bae8207d05a533173425e0496d1
852This file defines a limit on the depth of nesting of descendant cgroups.
853A value of 0 in this file means that no descendant cgroups can be created.
854An attempt to create a descendant whose nesting level exceeds
855the limit fails
856.RI ( mkdir (2)
857fails with the error
858.BR EAGAIN ).
859.IP
860Writing the string
861.IR """max"""
862to this file means that no limit is imposed.
863The default value in this file is
864.IR """max""" .
865.TP
866.IR cgroup.max.descendants " (since Linux 4.14)"
867.\" commit 1a926e0bbab83bae8207d05a533173425e0496d1
868This file defines a limit on the number of live descendant cgroups that
869this cgroup may have.
870An attempt to create more descendants than allowed by the limit fails
871.RI ( mkdir (2)
872fails with the error
873.BR EAGAIN ).
874.IP
875Writing the string
876.IR """max"""
877to this file means that no limit is imposed.
878The default value in this file is
879.IR """max""" .
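.PP
The following shell function is a rough sketch of how a program
might test in advance whether the
.I cgroup.max.descendants
limit of a cgroup leaves room for another descendant.
The function name
.I cg_descendants_room
is hypothetical,
the cgroup directory is assumed to be passed as the first argument,
and the sketch ignores
.I cgroup.max.depth
as well as any limits set in ancestor cgroups:
.PP
.in +4n
.EX
# Succeed if the descendant limit of the cgroup directory $1
# still leaves room for one more descendant cgroup.
cg_descendants_room() {
    limit=$(cat "$1/cgroup.max.descendants")
    [ "$limit" = "max" ] && return 0
    while read -r key value; do
        if [ "$key" = "nr_descendants" ]; then
            # The status of this comparison is the function result
            [ "$value" -lt "$limit" ]
            return
        fi
    done < "$1/cgroup.stat"
    return 1
}
.EE
.in
.PP
The result is only advisory:
other processes may create cgroups concurrently, so a subsequent
.BR mkdir (2)
can still fail with
.BR EAGAIN .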
880.\"
4b1c2041 881.SH CGROUPS DELEGATION: DELEGATING A HIERARCHY TO A LESS PRIVILEGED USER
4242dfbe
MK
882In the context of cgroups,
883delegation means passing management of some subtree
51629a30 884of the cgroup hierarchy to a nonprivileged user.
87b18a8b
MK
885Cgroups v1 provides support for delegation based on file permissions
886in the cgroup hierarchy but with less strict containment rules than v2
887(as noted below).
888Cgroups v2 supports delegation with containment by explicit design.
4b1c2041
MK
889The focus of the discussion in this section is on delegation in cgroups v2,
890with some differences for cgroups v1 noted along the way.
4242dfbe
MK
891.PP
892Some terminology is required in order to describe delegation.
893A
894.I delegater
895is a privileged user (i.e., root) who owns a parent cgroup.
896A
897.I delegatee
898is a nonprivileged user who will be granted the permissions needed
899to manage some subhierarchy under that parent cgroup,
900known as the
901.IR "delegated subtree" .
902.PP
903To perform delegation,
904the delegater makes certain directories and files writable by the delegatee,
905typically by changing the ownership of the objects to be the user ID
906of the delegatee.
0735069b
MK
907Assuming that we want to delegate the hierarchy rooted at (say)
908.I /dlgt_grp
4242dfbe
MK
909and that there are not yet any child cgroups under that cgroup,
910the ownership of the following is changed to the user ID of the delegatee:
911.TP
0735069b 912.IR /dlgt_grp
4242dfbe
MK
913Changing the ownership of the root of the subtree means that any new
914cgroups created under the subtree (and the files they contain)
915will also be owned by the delegatee.
916.TP
0735069b 917.IR /dlgt_grp/cgroup.procs
f7286edc 918Changing the ownership of this file means that the delegatee
4242dfbe
MK
919can move processes into the root of the delegated subtree.
920.TP
4b1c2041 921.IR /dlgt_grp/cgroup.subtree_control " (cgroups v2 only)"
e5936eb6
MK
Changing the ownership of this file means that the delegatee
923can enable controllers (that are present in
0735069b 924.IR /dlgt_grp/cgroup.controllers )
4242dfbe 925in order to further redistribute resources at lower levels in the subtree.
e5936eb6
MK
926(As an alternative to changing the ownership of this file,
927the delegater might instead add selected controllers to this file.)
639b6c8c 928.TP
4b1c2041 929.IR /dlgt_grp/cgroup.threads " (cgroups v2 only)"
639b6c8c
MK
930Changing the ownership of this file is necessary if a threaded subtree
931is being delegated (see the description of "thread mode", below).
7b327dd5 932This permits the delegatee to write thread IDs to the file.
cd7f4c49
MK
933(The ownership of this file can also be changed when delegating
934a domain subtree, but currently this serves no purpose,
935since, as described below, it is not possible to move a thread between
936domain cgroups by writing its thread ID to the
2b91ed4e 937.IR cgroup.threads
cd7f4c49 938file.)
4b1c2041
MK
939.IP
940In cgroups v1, the corresponding file that should instead be delegated is the
941.I tasks
942file.
4242dfbe
MK
943.PP
944The delegater should
945.I not
change the ownership of any of the controller interface files (e.g.,
947.IR pids.max ,
948.IR memory.high )
949in
.IR /dlgt_grp .
4242dfbe
MK
951Those files are used from the next level above the delegated subtree
952in order to distribute resources into the subtree,
953and the delegatee should not have permission to change
954the resources that are distributed into the delegated subtree.
955.PP
668ef765
MK
956See also the discussion of the
957.IR /sys/kernel/cgroup/delegate
4b1c2041 958file in NOTES for information about further delegatable files in cgroups v2.
668ef765 959.PP
4242dfbe
MK
960After the aforementioned steps have been performed,
961the delegatee can create child cgroups within the delegated subtree
6dc513cd
MK
962(the cgroup subdirectories and the files they contain
963will be owned by the delegatee)
4242dfbe
MK
964and move processes between cgroups in the subtree.
965If some controllers are present in
.IR /dlgt_grp/cgroup.subtree_control ,
4242dfbe 967or the ownership of that file was passed to the delegatee,
f7286edc 968the delegatee can also control the further redistribution
4242dfbe 969of the corresponding resources into the delegated subtree.
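.PP
The ownership changes described above can be sketched as a shell function
that merely prints the pathnames whose ownership the delegater would change.
The function name
.I cg_delegation_targets
is hypothetical;
the first argument is the cgroup directory to delegate, and
the optional second argument names a file listing the delegatable files
(by default,
.IR /sys/kernel/cgroup/delegate ,
which exists only on kernels that provide it):
.PP
.in +4n
.EX
# Print the cgroup directory $1 itself, followed by the pathname of
# each delegatable file listed (one name per line) in the file $2.
cg_delegation_targets() {
    echo "$1"
    while read -r name; do
        echo "$1/$name"
    done < "${2:-/sys/kernel/cgroup/delegate}"
}
.EE
.in
.PP
The delegater could then pass the resulting list to
.BR chown (1).
Controller interface files are never printed,
because they do not appear in the list of delegatable files.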
27b086e9 970.\"
ed3f4f34 971.SS Cgroups v2 delegation: nsdelegate and cgroup namespaces
ed3f4f34
MK
972Starting with Linux 4.13,
973.\" commit 5136f6365ce3eace5a926e10f16ed2a233db5ba9
4b1c2041 974there is a second way to perform cgroup delegation in the cgroups v2 hierarchy.
07361828 975This is done by mounting or remounting the cgroup v2 filesystem with the
ed3f4f34 976.I nsdelegate
07361828
MK
977mount option.
978For example, if the cgroup v2 filesystem has already been mounted,
979we can remount it with the
980.I nsdelegate
981option as follows:
ed3f4f34
MK
982.PP
.in +4n
.EX
mount -t cgroup2 -o remount,nsdelegate \e
    none /sys/fs/cgroup/unified
.EE
.in
07361828
MK
989.\"
.\" Alternatively, we could boot the kernel with the options:
991.\"
992.\" cgroup_no_v1=all systemd.legacy_systemd_cgroup_controller
993.\"
994.\" The effect of the latter option is to prevent systemd from employing
995.\" its "hybrid" cgroup mode, where it tries to make use of cgroups v2.
ed3f4f34 996.PP
dc581e07 997The effect of this mount option is to cause cgroup namespaces
ed3f4f34
MK
998to automatically become delegation boundaries.
999More specifically,
1000the following restrictions apply for processes inside the cgroup namespace:
1001.IP * 3
446d1643 1002Writes to controller interface files in the root directory of the namespace
ed3f4f34
MK
1003will fail with the error
1004.BR EPERM .
1005Processes inside the cgroup namespace can still write to delegatable
446d1643 1006files in the root directory of the cgroup namespace such as
ed3f4f34
MK
1007.IR cgroup.procs
1008and
1009.IR cgroup.subtree_control ,
and can create a subhierarchy underneath the root directory.
ed3f4f34
MK
1011.IP *
1012Attempts to migrate processes across the namespace boundary are denied
1013(with the error
1014.BR ENOENT ).
1015Processes inside the cgroup namespace can still
1016(subject to the containment rules described below)
1017move processes between cgroups
1018.I within
1019the subhierarchy under the namespace root.
1020.PP
1021The ability to define cgroup namespaces as delegation boundaries
1022makes cgroup namespaces more useful.
1023To understand why, suppose that we already have one cgroup hierarchy
1024that has been delegated to a nonprivileged user,
1025.IR cecilia ,
1026using the older delegation technique described above.
1027Suppose further that
1028.I cecilia
1029wanted to further delegate a subhierarchy
1030under the existing delegated hierarchy.
1031(For example, the delegated hierarchy might be associated with
1032an unprivileged container run by
1033.IR cecilia .)
1034Even if a cgroup namespace was employed,
1035because both hierarchies are owned by the unprivileged user
1036.IR cecilia ,
1037the following illegitimate actions could be performed:
1038.IP * 3
1039A process in the inferior hierarchy could change the
619dbe1c 1040resource controller settings in the root directory of that hierarchy.
ed3f4f34
MK
1041(These resource controller settings are intended to allow control to
1042be exercised from the
1043.I parent
1044cgroup;
1045a process inside the child cgroup should not be allowed to modify them.)
1046.IP *
1047A process inside the inferior hierarchy could move processes
1048into and out of the inferior hierarchy if the cgroups in the
1049superior hierarchy were somehow visible.
1050.PP
1051Employing the
1052.I nsdelegate
1053mount option prevents both of these possibilities.
1054.PP
1055The
1056.I nsdelegate
1057mount option only has an effect when performed in
1058the initial mount namespace;
1059in other mount namespaces, the option is silently ignored.
07361828
MK
1060.PP
1061.IR Note :
1062On some systems,
1063.BR systemd (1)
1064automatically mounts the cgroup v2 filesystem.
1065In order to experiment with the
1066.I nsdelegate
44084d19
MK
1067operation, it may be useful to boot the kernel with
1068the following command-line options:
1069.PP
.in +4n
.EX
cgroup_no_v1=all systemd.legacy_systemd_cgroup_controller
.EE
.in
1075.PP
1076These options cause the kernel to boot with the cgroups v1 controllers
1077disabled (meaning that the controllers are available in the v2 hierarchy),
and tell
1079.BR systemd (1)
1080not to mount and use the cgroup v2 hierarchy,
1081so that the v2 hierarchy can be manually mounted
1082with the desired options after boot-up.
ed3f4f34 1083.\"
4b1c2041 1084.SS Cgroup delegation containment rules
4242dfbe
MK
1085Some delegation
1086.IR "containment rules"
1087ensure that the delegatee can move processes between cgroups within the
1088delegated subtree,
1089but can't move processes from outside the delegated subtree into
1090the subtree or vice versa.
1091A nonprivileged process (i.e., the delegatee) can write the PID of
1092a "target" process into a
1093.IR cgroup.procs
1094file only if all of the following are true:
1095.IP * 3
4242dfbe
MK
1096The writer has write permission on the
1097.I cgroup.procs
1098file in the destination cgroup.
1099.IP *
1100The writer has write permission on the
1101.I cgroup.procs
396761ee 1102file in the nearest common ancestor of the source and destination cgroups.
e366c4d4
MK
1103Note that in some cases,
1104the nearest common ancestor may be the source or destination cgroup itself.
4b1c2041
MK
1105This requirement is not enforced for cgroups v1 hierarchies,
1106with the consequence that containment in v1 is less strict than in v2.
1107(For example, in cgroups v1 the user that owns two distinct
1108delegated subhierarchies can move a process between the hierarchies.)
28f612ea 1109.IP *
ed3f4f34
MK
1110If the cgroup v2 filesystem was mounted with the
1111.I nsdelegate
7b574df5 1112option, the writer must be able to see the source and destination cgroups
ed3f4f34
MK
1113from its cgroup namespace.
1114.IP *
4b1c2041 1115In cgroups v1:
28f612ea
MK
1116the effective UID of the writer (i.e., the delegatee) matches the
1117real user ID or the saved set-user-ID of the target process.
4b1c2041
MK
1118Before Linux 4.11,
1119.\" commit 576dd464505fc53d501bb94569db76f220104d28
this requirement also applied in cgroups v2.
1121(This was a historical requirement inherited from cgroups v1
1122that was later deemed unnecessary,
1123since the other rules suffice for containment in cgroups v2.)
4242dfbe
MK
1124.PP
1125.IR Note :
1126one consequence of these delegation containment rules is that the
0735069b
MK
1127unprivileged delegatee can't place the first process into
1128the delegated subtree;
1129instead, the delegater must place the first process
1130(a process owned by the delegatee) into the delegated subtree.
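.PP
The nearest common ancestor referred to in these rules
can be computed from the two cgroup pathnames alone.
The following shell function is one way to sketch that computation;
the function name
.I cg_common_ancestor
is hypothetical, and both arguments are assumed to be absolute,
normalized pathnames relative to the root of the cgroup hierarchy
(with no trailing slash, except for the root itself):
.PP
.in +4n
.EX
# Print the nearest common ancestor of the cgroup paths $1 and $2.
cg_common_ancestor() {
    a=$1
    # Walk upward from $1 until $2 is equal to, or lies below, $a
    while [ "$a" != "/" ]; do
        case "$2/" in
            "$a"/*) break ;;
        esac
        a=$(dirname "$a")
    done
    echo "$a"
}
.EE
.in
.PP
For the source
.I /dlgt_grp/a/b
and the destination
.IR /dlgt_grp/c ,
the function prints
.IR /dlgt_grp ,
which is the cgroup whose
.I cgroup.procs
file the writer must be able to write.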
4242dfbe 1131.\"
75e83bc2 1132.SH CGROUPS VERSION 2 THREAD MODE
c8902e25
MK
1133Among the restrictions imposed by cgroups v2 that were not present
1134in cgroups v1 are the following:
1135.IP * 3
1136.IR "No thread-granularity control" :
1137all of the threads of a process must be in the same cgroup.
1138.IP *
1139.IR "No internal processes" :
1140a cgroup can't both have member processes and
1141exercise controllers on child cgroups.
1142.PP
1143Both of these restrictions were added because
1144the lack of these restrictions had caused problems
1145in cgroups v1.
1146In particular, the cgroups v1 ability to allow thread-level granularity
1147for cgroup membership made no sense for some controllers.
1148(A notable example was the
1149.I memory
1150controller: since threads share an address space,
1151it made no sense to split threads across different
1152.I memory
1153cgroups.)
1154.PP
1155Notwithstanding the initial design decision in cgroups v2,
1156there were use cases for certain controllers, notably the
1157.IR cpu
1158controller,
1159for which thread-level granularity of control was meaningful and useful.
1160To accommodate such use cases, Linux 4.14 added
1161.I "thread mode"
1162for cgroups v2.
1163.PP
1164Thread mode allows the following:
1165.IP * 3
1166The creation of
1167.IR "threaded subtrees"
1168in which the threads of a process may
1169be spread across cgroups inside the tree.
1170(A threaded subtree may contain multiple multithreaded processes.)
1171.IP *
1172The concept of
.IR "threaded controllers" ,
1174which can distribute resources across the cgroups in a threaded subtree.
1175.IP *
A relaxation of the "no internal processes" rule,
1177so that, within a threaded subtree,
1178a cgroup can both contain member threads and
1179exercise resource control over child cgroups.
1180.PP
1181With the addition of thread mode,
1182each nonroot cgroup now contains a new file,
1183.IR cgroup.type ,
1184that exposes, and in some circumstances can be used to change,
1185the "type" of a cgroup.
1186This file contains one of the following type values:
1187.TP
1188.I "domain"
1189This is a normal v2 cgroup that provides process-granularity control.
1190If a process is a member of this cgroup,
1191then all threads of the process are (by definition) in the same cgroup.
1192This is the default cgroup type,
1193and provides the same behavior that was provided for
1194cgroups in the initial cgroups v2 implementation.
1195.TP
1196.I "threaded"
1197This cgroup is a member of a threaded subtree.
1198Threads can be added to this cgroup,
1199and controllers can be enabled for the cgroup.
1200.TP
1201.I "domain threaded"
1202This is a domain cgroup that serves as the root of a threaded subtree.
1203This cgroup type is also known as "threaded root".
1204.TP
1205.I "domain invalid"
1206This is a cgroup inside a threaded subtree
1207that is in an "invalid" state.
1208Processes can't be added to the cgroup,
1209and controllers can't be enabled for the cgroup.
1210The only thing that can be done with this cgroup (other than deleting it)
1211is to convert it to a
1212.IR threaded
1213cgroup by writing the string
1214.IR """threaded"""
1215to the
1216.I cgroup.type
1217file.
61254835
MK
1218.IP
1219The rationale for the existence of this "interim" type
1220during the creation of a threaded subtree
1221(rather than the kernel simply immediately converting all cgroups
1222under the threaded root to the type
1223.IR threaded )
1224is to allow for
possible future extensions to the thread mode model.
c8902e25
MK
1226.\"
1227.SS Threaded versus domain controllers
With the addition of thread mode,
1229cgroups v2 now distinguishes two types of resource controllers:
1230.IP * 3
1231.I Threaded
2cd9bbfa 1232.\" In the kernel source, look for ".threaded[ \t]*= true" in
218eadf4 1233.\" initializations of struct cgroup_subsys
c8902e25
MK
1234controllers: these controllers support thread-granularity for
1235resource control and can be enabled inside threaded subtrees,
1236with the result that the corresponding controller-interface files
1237appear inside the cgroups in the threaded subtree.
aa2c3623 1238As at Linux 4.19, the following controllers are threaded:
c8902e25
MK
1239.IR cpu ,
1240.IR perf_event ,
1241and
1242.IR pids .
1243.IP *
1244.I Domain
1245controllers: these controllers support only process granularity
1246for resource control.
1247From the perspective of a domain controller,
1248all threads of a process are always in the same cgroup.
1249Domain controllers can't be enabled inside a threaded subtree.
1250.\"
1251.SS Creating a threaded subtree
1252There are two pathways that lead to the creation of a threaded subtree.
1253The first pathway proceeds as follows:
1254.IP 1. 3
1255We write the string
1256.IR """threaded"""
1257to the
1258.I cgroup.type
1259file of a cgroup
1260.IR y/z
1261that currently has the type
1262.IR domain .
1263This has the following effects:
1264.RS
1265.IP * 3
1266The type of the cgroup
1267.IR y/z
1268becomes
1269.IR threaded .
1270.IP *
1271The type of the parent cgroup,
1272.IR y ,
1273becomes
1274.IR "domain threaded" .
1275The parent cgroup is the root of a threaded subtree
1276(also known as the "threaded root").
1277.IP *
1278All other cgroups under
1279.IR y
1280that were not already of type
1281.IR threaded
1282(because they were inside already existing threaded subtrees
1283under the new threaded root)
1284are converted to type
1285.IR "domain invalid" .
1286Any subsequently created cgroups under
1287.I y
1288will also have the type
1289.IR "domain invalid" .
1290.RE
1291.IP 2.
1292We write the string
1293.IR """threaded"""
1294to each of the
1295.IR "domain invalid"
1296cgroups under
1297.IR y ,
1298in order to convert them to the type
1299.IR threaded .
As a consequence of this step, all cgroups under the threaded root
1301now have the type
1302.IR threaded
1303and the threaded subtree is now fully usable.
1304The requirement to write
1305.IR """threaded"""
1306to each of these cgroups is somewhat cumbersome,
1307but allows for possible future extensions to the thread-mode model.
1308.PP
1309The second way of creating a threaded subtree is as follows:
1310.IP 1. 3
1311In an existing cgroup,
1312.IR z ,
1313that currently has the type
1314.IR domain ,
1315we (1) enable one or more threaded controllers and
1316(2) make a process a member of
1317.IR z .
1318(These two steps can be done in either order.)
1319This has the following consequences:
1320.RS
1321.IP * 3
1322The type of
1323.I z
1324becomes
1325.IR "domain threaded" .
1326.IP *
1327All of the descendant cgroups of
.I z
7a1cddd2 1329that were not already of type
c8902e25
MK
1330.IR threaded
1331are converted to type
1332.IR "domain invalid" .
1333.RE
1334.IP 2.
1335As before, we make the threaded subtree usable by writing the string
1336.IR """threaded"""
1337to each of the
1338.IR "domain invalid"
1339cgroups under
.IR z ,
1341in order to convert them to the type
1342.IR threaded .
1343.PP
1344One of the consequences of the above pathways to creating a threaded subtree
1345is that the threaded root cgroup can be a parent only to
1346.I threaded
1347(and
1348.IR "domain invalid" )
1349cgroups.
1350The threaded root cgroup can't be a parent of a
1351.I domain
cgroup, and a
1353.I threaded
1354cgroup
1355can't have a sibling that is a
1356.I domain
1357cgroup.
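.PP
Step 2 of either pathway can be sketched as a shell function
that walks the subtree below the threaded root.
The function name
.I cg_make_threaded
is hypothetical, and the directory of the threaded root
is assumed to be passed as the first argument:
.PP
.in +4n
.EX
# Convert every "domain invalid" cgroup below the directory $1
# to the type "threaded".
# find(1) lists a directory before its contents,
# so parents are converted before their children.
cg_make_threaded() {
    find "$1" -type d | while IFS= read -r dir; do
        [ "$dir" = "$1" ] && continue
        if [ "$(cat "$dir/cgroup.type")" = "domain invalid" ]; then
            echo threaded > "$dir/cgroup.type"
        fi
    done
}
.EE
.in
.PP
Cgroups that already have the type
.I threaded
are left untouched, so the function can be run again
after further child cgroups have been created.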
1358.\"
1359.SS Using a threaded subtree
1360Within a threaded subtree, threaded controllers can be enabled
1361in each subgroup whose type has been changed to
1362.IR threaded ;
1363upon doing so, the corresponding controller interface files
1364appear in the children of that cgroup.
1365.PP
1366A process can be moved into a threaded subtree by writing its PID to the
1367.I cgroup.procs
1368file in one of the cgroups inside the tree.
1369This has the effect of making all of the threads
1370in the process members of the corresponding cgroup
1371and makes the process a member of the threaded subtree.
1372The threads of the process can then be spread across
1373the threaded subtree by writing their thread IDs (see
1374.BR gettid (2))
1375to the
b2c3e720 1376.I cgroup.threads
c8902e25
MK
1377files in different cgroups inside the subtree.
1378The threads of a process must all reside in the same threaded subtree.
1379.PP
d84e558e
MK
1380As with writing to
1381.IR cgroup.procs ,
1382some containment rules apply when writing to the
b2c3e720 1383.I cgroup.threads
d84e558e
MK
1384file:
1385.IP * 3
1386The writer must have write permission on the
.I cgroup.threads
1388file in the destination cgroup.
1389.IP *
1390The writer must have write permission on the
1391.I cgroup.procs
1392file in the common ancestor of the source and destination cgroups.
1393(In some cases,
1394the common ancestor may be the source or destination cgroup itself.)
1395.IP *
1396The source and destination cgroups must be in the same threaded subtree.
1397(Outside a threaded subtree, an attempt to move a thread by writing
1398its thread ID to the
1399.I cgroup.threads
1400file in a different
1401.I domain
1402cgroup fails with the error
1403.BR EOPNOTSUPP .)
4178f132
MK
1404.PP
1405The
1406.I cgroup.threads
c8902e25
MK
1407file is present in each cgroup (including
1408.I domain
1409cgroups) and can be read in order to discover the set of threads
1410that is present in the cgroup.
1411The set of thread IDs obtained when reading this file
1412is not guaranteed to be ordered or free of duplicates.
1413.PP
1414The
1415.I cgroup.procs
1416file in the threaded root shows the PIDs of all processes
1417that are members of the threaded subtree.
1418The
1419.I cgroup.procs
1420files in the other cgroups in the subtree are not readable.
1421.PP
1422Domain controllers can't be enabled in a threaded subtree;
1423no controller-interface files appear inside the cgroups underneath the
1424threaded root.
1425From the point of view of a domain controller,
1426threaded subtrees are invisible:
1427a multithreaded process inside a threaded subtree appears to a domain
1428controller as a process that resides in the threaded root cgroup.
1429.PP
1430Within a threaded subtree, the "no internal processes" rule does not apply:
a cgroup can both contain member processes (or threads)
1432and exercise controllers on child cgroups.
1433.\"
1434.SS Rules for writing to cgroup.type and creating threaded subtrees
1435A number of rules apply when writing to the
1436.I cgroup.type
1437file:
1438.IP * 3
1439Only the string
1440.IR """threaded"""
1441may be written.
1442In other words, the only explicit transition that is possible is to convert a
1443.I domain
1444cgroup to type
1445.IR threaded .
1446.IP *
6c9aa5ad 1447The effect of writing
c8902e25 1448.IR """threaded"""
6c9aa5ad
MK
1449depends on the current value in
1450.IR cgroup.type ,
1451as follows:
c8902e25
MK
1452.RS
1453.IP \(bu 3
6c9aa5ad
MK
1454.IR domain
1455or
1456.IR "domain threaded" :
1457start the creation of a threaded subtree
1458(whose root is the parent of this cgroup) via
c8902e25
MK
1459the first of the pathways described above;
1460.IP \(bu
6c9aa5ad 1461.IR "domain\ invalid" :
4644794c 1462convert this cgroup (which is inside a threaded subtree) to a usable (i.e.,
c8902e25
MK
1463.IR threaded )
1464state;
1465.IP \(bu
6c9aa5ad
MK
1466.IR threaded :
1467no effect (a "no-op").
c8902e25
MK
1468.RE
1469.IP *
1470We can't write to a
1471.I cgroup.type
1472file if the parent's type is
1473.IR "domain invalid" .
1474In other words, the cgroups of a threaded subtree must be converted to the
1475.I threaded
1476state in a top-down manner.
1477.PP
00c27092 1478There are also some constraints that must be satisfied
c8902e25
MK
1479in order to create a threaded subtree rooted at the cgroup
1480.IR x :
1481.IP * 3
1482There can be no member processes in the descendant cgroups of
1483.IR x .
1484(The cgroup
1485.I x
1486can itself have member processes.)
1487.IP *
1488No domain controllers may be enabled in
1489.IR x 's
1490.IR cgroup.subtree_control
1491file.
c8902e25
MK
1492.PP
1493If any of the above constraints is violated, then an attempt to write
1494.IR """threaded"""
1495to a
1496.IR cgroup.type
1497file fails with the error
1498.BR ENOTSUP .
1499.\"
1500.SS The """domain threaded""" cgroup type
1501According to the pathways described above,
1502the type of a cgroup can change to
1503.IR "domain threaded"
1504in either of the following cases:
1505.IP * 3
1506The string
1507.IR """threaded"""
1508is written to a child cgroup.
1509.IP *
1510A threaded controller is enabled inside the cgroup and
1511a process is made a member of the cgroup.
1512.PP
1513A
1514.IR "domain threaded"
1515cgroup,
1516.IR x ,
1517can revert to the type
1518.IR domain
1519if the above conditions no longer hold true\(emthat is, if all
1520.I threaded
1521child cgroups of
1522.I x
1523are removed and either
1524.I x
1525no longer has threaded controllers enabled or
1526no longer has member processes.
1527.PP
1528When a
1529.IR "domain threaded"
1530cgroup
1531.IR x
1532reverts to the type
1533.IR domain :
1534.IP * 3
1535All
1536.IR "domain invalid"
1537descendants of
1538.I x
1539that are not in lower-level threaded subtrees revert to the type
1540.IR domain .
1541.IP *
1542The root cgroups in any lower-level threaded subtrees revert to the type
1543.IR "domain threaded" .
1544.\"
1545.SS Exceptions for the root cgroup
1546The root cgroup of the v2 hierarchy is treated exceptionally:
1547it can be the parent of both
1548.I domain
1549and
1550.I threaded
1551cgroups.
1552If the string
1553.I """threaded"""
1554is written to the
1555.I cgroup.type
1556file of one of the children of the root cgroup, then
1557.IP * 3
1558The type of that cgroup becomes
1559.IR threaded .
1560.IP *
1561The type of any descendants of that cgroup that
1562are not part of lower-level threaded subtrees changes to
1563.IR "domain invalid" .
1564.PP
1565Note that in this case, there is no cgroup whose type becomes
1566.IR "domain threaded" .
1567(Notionally, the root cgroup can be considered as the threaded root
1568for the cgroup whose type was changed to
1569.IR threaded .)
1570.PP
1571The aim of this exceptional treatment for the root cgroup is to
1572allow a threaded cgroup that employs the
1573.I cpu
1574controller to be placed as high as possible in the hierarchy,
1575so as to minimize the (small) cost of traversing the cgroup hierarchy.
1576.\"
edc90967 1577.SS The cgroups v2 """cpu""" controller and realtime threads
aa2c3623 1578As at Linux 4.19, the cgroups v2
c8902e25 1579.I cpu
0bef253e
MK
1580controller does not support control of realtime threads
1581(specifically threads scheduled under any of the policies
1582.BR SCHED_FIFO ,
1583.BR SCHED_RR ,
and
1585.BR SCHED_DEADLINE ;
1586see
1587.BR sched (7)).
1588Therefore, the
1589.I cpu
1590controller can be enabled in the root cgroup only
c8902e25 1591if all realtime threads are in the root cgroup.
edc90967 1592(If there are realtime threads in nonroot cgroups, then a
c8902e25
MK
1593.BR write (2)
1594of the string
1595.IR """+cpu"""
1596to the
1597.I cgroup.subtree_control
1598file fails with the error
c2df7694 1599.BR EINVAL .)
17094a28
MK
1600.PP
1601On some systems,
c8902e25 1602.BR systemd (1)
edc90967 1603places certain realtime threads in nonroot cgroups in the v2 hierarchy.
c8902e25 1604On such systems,
edc90967 1605these threads must first be moved to the root cgroup before the
c8902e25
MK
1606.I cpu
1607controller can be enabled.
1608.\"
1609.SH ERRORS
1610The following errors can occur for
1611.BR mount (2):
1612.TP
1613.B EBUSY
1614An attempt to mount a cgroup version 1 filesystem specified neither the
1615.I name=
1616option (to mount a named hierarchy) nor a controller name (or
1617.IR all ).
1618.SH NOTES
1619A child process created via
1620.BR fork (2)
1621inherits its parent's cgroup memberships.
1622A process's cgroup memberships are preserved across
1623.BR execve (2).
1624.\"
5c2181ad
MK
1625.SS /proc files
1626.TP
34eb3340 1627.IR /proc/cgroups " (since Linux 2.6.24)"
92bb6d36 1628This file contains information about the controllers
1a4f7d59 1629that are compiled into the kernel.
34eb3340
MK
1630An example of the contents of this file (reformatted for readability)
1631is the following:
a721e8b2 1632.IP
.in +4n
.EX
#subsys_name    hierarchy      num_cgroups    enabled
cpuset          4              1              1
cpu             8              1              1
cpuacct         8              1              1
blkio           6              1              1
memory          3              1              1
devices         10             84             1
freezer         7              1              1
net_cls         9              1              1
perf_event      5              1              1
net_prio        9              1              1
hugetlb         0              1              0
pids            2              1              1
.EE
.in
a721e8b2 1650.IP
34eb3340
MK
1651The fields in this file are, from left to right:
1652.RS
1653.IP 1. 3
1654The name of the controller.
1655.IP 2.
92bb6d36 1656The unique ID of the cgroup hierarchy on which this controller is mounted.
11c0797f 1657If multiple cgroups v1 controllers are bound to the same hierarchy,
34eb3340 1658then each will show the same hierarchy ID in this field.
92bb6d36
MK
1659The value in this field will be 0 if:
1660.RS 5
1661.IP a) 3
1662the controller is not mounted on a cgroups v1 hierarchy;
1663.IP b)
1664the controller is bound to the cgroups v2 single unified hierarchy; or
1665.IP c)
1666the controller is disabled (see below).
1667.RE
34eb3340
MK
1668.IP 3.
1669The number of control groups in this hierarchy using this controller.
1670.IP 4.
1671This field contains the value 1 if this controller is enabled,
1672or 0 if it has been disabled (via the
1673.IR cgroup_disable
1674kernel command-line boot parameter).
1675.RE
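.IP
As an illustration of how these fields might be consumed,
the following shell function prints the names of the controllers
whose fourth field is 0.
The function name
.I cg_disabled_controllers
is hypothetical, and the file to read (normally
.IR /proc/cgroups )
is assumed to be passed as the first argument:
.IP
.in +4n
.EX
# Print the name of each controller that the file $1
# marks as disabled (fourth field 0), skipping the header line.
cg_disabled_controllers() {
    while read -r name hierarchy num_cgroups enabled; do
        case "$name" in
            "#"*) continue ;;
        esac
        if [ "$enabled" = "0" ]; then
            echo "$name"
        fi
    done < "$1"
}
.EE
.in
.IP
Applied to the example contents shown above, the function prints
.IR hugetlb .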
1676.TP
5c2181ad 1677.IR /proc/[pid]/cgroup " (since Linux 2.6.24)"
f5faa016
MK
1678This file describes control groups to which the process
1679with the corresponding PID belongs.
5f8a7eb2 1680The displayed information differs for
2c4fbe35 1681cgroups version 1 and version 2 hierarchies.
a721e8b2 1682.IP
5f8a7eb2 1683For each cgroup hierarchy of which the process is a member,
2e33b59e 1684there is one entry containing three colon-separated fields:
a721e8b2 1685.IP
4769a778
MK
.in +4n
.EX
hierarchy-ID:controller-list:cgroup-path
.EE
.in
a721e8b2 1691.IP
5f8a7eb2 1692For example:
c1a022dc
MK
1693.IP
.in +4n
.EX
5:cpuacct,cpu,cpuset:/daemons
.EE
.in
5c2181ad
MK
1699.IP
1700The colon-separated fields are, from left to right:
5f8a7eb2 1701.RS
5c2181ad 1702.IP 1. 3
5f8a7eb2
MK
1703For cgroups version 1 hierarchies,
1704this field contains a unique hierarchy ID number
1705that can be matched to a hierarchy ID in
1706.IR /proc/cgroups .
1707For the cgroups version 2 hierarchy, this field contains the value 0.
5c2181ad 1708.IP 2.
5f8a7eb2 1709For cgroups version 1 hierarchies,
55f52de8 1710this field contains a comma-separated list of the controllers
5f8a7eb2
MK
1711bound to the hierarchy.
1712For the cgroups version 2 hierarchy, this field is empty.
5c2181ad 1713.IP 3.
5f8a7eb2
MK
1714This field contains the pathname of the control group in the hierarchy
1715to which the process belongs.
1716This pathname is relative to the mount point of the hierarchy.
5c2181ad 1717.RE
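.IP
The three-field format described above can be split as in
the following Python sketch.
The names
.I CgroupEntry
and
.I parse_pid_cgroup
are hypothetical.
The sketch splits on the first two colons only,
on the assumption that the pathname in the third field
might itself contain a colon.
.IP
.in +4n
.EX
```python
from typing import List, NamedTuple


class CgroupEntry(NamedTuple):
    """One entry of /proc/[pid]/cgroup (hypothetical helper type)."""
    hierarchy_id: int        # field 1: 0 for the cgroups v2 hierarchy
    controllers: List[str]   # field 2: empty list for cgroups v2
    path: str                # field 3: relative to the hierarchy mount point


def parse_pid_cgroup(text):
    """Parse the text of /proc/[pid]/cgroup into CgroupEntry records."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        # Split at most twice so that colons in the path are preserved.
        hierarchy, controller_list, path = line.split(":", 2)
        controllers = controller_list.split(",") if controller_list else []
        entries.append(CgroupEntry(int(hierarchy), controllers, path))
    return entries
```
.EE
.in
.IP
Applied to the example entry shown earlier,
the function would yield hierarchy ID 5,
the controllers
.IR cpuacct ,
.IR cpu ,
and
.IR cpuset ,
and the path
.IR /daemons .
An entry whose first field is 0 and whose second field is empty
describes membership in the cgroups v2 hierarchy.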
.\"
.SS /sys/kernel/cgroup files
.TP
.IR /sys/kernel/cgroup/delegate " (since Linux 4.15)"
.\" commit 01ee6cfb1483fe57c9cbd8e73817dfbf9bacffd3
This file exports a list of the cgroups v2 files
(one per line) that are delegatable
(i.e., whose ownership should be changed to the user ID of the delegatee).
In the future, the set of delegatable files may change or grow,
and this file provides a way for the kernel to inform
user-space applications of which files must be delegated.
As at Linux 4.15, one sees the following when inspecting this file:
.IP
.in +4n
.EX
$ \fBcat /sys/kernel/cgroup/delegate\fP
cgroup.procs
cgroup.subtree_control
cgroup.threads
.EE
.in
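.IP
A delegater might use this list as in the following Python sketch,
which is a minimal illustration rather than a complete delegation procedure.
The function names and the
.I cgroup_dir
argument are hypothetical,
and the sketch assumes that changing the ownership of
the cgroup directory itself is handled separately by the caller.
.IP
.in +4n
.EX
```python
import os


def delegatable_paths(cgroup_dir, delegate_text):
    """Map each file name listed by the kernel to a path in cgroup_dir."""
    names = [ln.strip() for ln in delegate_text.splitlines() if ln.strip()]
    return [os.path.join(cgroup_dir, name) for name in names]


def delegate_files(cgroup_dir, uid, gid,
                   delegate_file="/sys/kernel/cgroup/delegate"):
    """Change ownership of the delegatable files in cgroup_dir."""
    with open(delegate_file) as f:
        paths = delegatable_paths(cgroup_dir, f.read())
    for path in paths:
        # Assumption: a listed file may be absent from this cgroup,
        # so files that do not exist are skipped.
        if os.path.exists(path):
            os.chown(path, uid, gid)
```
.EE
.in
.IP
Reading the list from the kernel at run time,
rather than hard-coding the three names shown above,
allows the same code to keep working
if the set of delegatable files changes.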
.TP
.IR /sys/kernel/cgroup/features " (since Linux 4.15)"
.\" commit 5f2e673405b742be64e7c3604ed4ed3ac14f35ce
Over time, the set of cgroups v2 features that are provided by the
kernel may change or grow,
or some features may not be enabled by default.
This file provides a way for user-space applications to discover what
features the running kernel supports and has enabled.
Features are listed one per line:
.IP
.in +4n
.EX
$ \fBcat /sys/kernel/cgroup/features\fP
nsdelegate
.EE
.in
.IP
The entries that can appear in this file are:
.RS
.TP
.IR nsdelegate " (since Linux 4.15)"
The kernel supports the
.I nsdelegate
mount option.
.RE
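.IP
An application could test for a feature before relying on it,
as in the following Python sketch.
The function names are hypothetical,
and treating a missing file as "no features"
is an assumption made here for kernels that predate this file.
.IP
.in +4n
.EX
```python
def parse_features(text):
    """Return the set of feature names, given one name per line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def read_features(path="/sys/kernel/cgroup/features"):
    """Read the feature list; return an empty set if the file is absent."""
    try:
        with open(path) as f:
            return parse_features(f.read())
    except FileNotFoundError:
        # Assumption: no file means no discoverable features.
        return set()


def supports_nsdelegate(features):
    """Report whether the nsdelegate mount option is supported."""
    return "nsdelegate" in features
```
.EE
.in
.IP
A program that mounts the cgroups v2 filesystem might call
.I supports_nsdelegate(read_features())
and pass the
.I nsdelegate
mount option only if the result is true.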
.SH SEE ALSO
.BR prlimit (1),
.BR systemd (1),
.BR systemd-cgls (1),
.BR systemd-cgtop (1),
.BR clone (2),
.BR ioprio_set (2),
.BR perf_event_open (2),
.BR setrlimit (2),
.BR cgroup_namespaces (7),
.BR cpuset (7),
.BR namespaces (7),
.BR sched (7),
.BR user_namespaces (7)