Commit | Line | Data |
---|---|---|
014cb63b | 1 | .\" Copyright (C) 2015 Serge Hallyn <serge@hallyn.com> |
4242dfbe | 2 | .\" and Copyright (C) 2016, 2017 Michael Kerrisk <mtk.manpages@gmail.com> |
014cb63b | 3 | .\" |
5fbde956 | 4 | .\" SPDX-License-Identifier: Linux-man-pages-copyleft |
014cb63b | 5 | .\" |
4c1c5274 | 6 | .TH cgroups 7 (date) "Linux man-pages (unreleased)" |
21f0d132 MK |
7 | .SH NAME |
8 | cgroups \- Linux control groups | |
9 | .SH DESCRIPTION | |
77eefc59 | 10 | Control groups, usually referred to as cgroups, |
a15e0673 | 11 | are a Linux kernel feature which allows processes to |
8bff7140 MK |
12 | be organized into hierarchical groups whose usage of |
13 | various types of resources can then be limited and monitored. | |
14 | The kernel's cgroup interface is provided through | |
21f0d132 | 15 | a pseudo-filesystem called cgroupfs. |
6398ca15 | 16 | Grouping is implemented in the core cgroup kernel code, |
21f0d132 | 17 | while resource tracking and limits are implemented in |
8bff7140 | 18 | a set of per-resource-type subsystems (memory, CPU, and so on). |
21f0d132 | 19 | .\" |
176a4211 MK |
20 | .SS Terminology |
21 | A | |
22 | .I cgroup | |
23 | is a collection of processes that are bound to a set of | |
24 | limits or parameters defined via the cgroup filesystem. | |
a721e8b2 | 25 | .PP |
176a4211 MK |
26 | A |
27 | .I subsystem | |
28 | is a kernel component that modifies the behavior of | |
29 | the processes in a cgroup. | |
30 | Various subsystems have been implemented, making it possible to do things | |
31 | such as limiting the amount of CPU time and memory available to a cgroup, | |
32 | accounting for the CPU time used by a cgroup, | |
33 | and freezing and resuming execution of the processes in a cgroup. | |
34 | Subsystems are sometimes also known as | |
1ae6b2c7 | 35 | .I resource controllers |
176a4211 | 36 | (or simply, controllers). |
a721e8b2 | 37 | .PP |
55f52de8 | 38 | The cgroups for a controller are arranged in a |
176a4211 MK |
39 | .IR hierarchy . |
40 | This hierarchy is defined by creating, removing, and | |
41 | renaming subdirectories within the cgroup filesystem. | |
8fc9db1e MK |
42 | At each level of the hierarchy, attributes (e.g., limits) can be defined. |
43 | The limits, control, and accounting provided by cgroups generally have | |
44 | effect throughout the subhierarchy underneath the cgroup where the | |
45 | attributes are defined. | |
8bff7140 MK |
46 | Thus, for example, the limits placed on |
47 | a cgroup at a higher level in the hierarchy cannot be exceeded | |
48 | by descendant cgroups. | |
176a4211 | 49 | .\" |
43df1ab3 MK |
50 | .SS Cgroups version 1 and version 2 |
51 | The initial release of the cgroups implementation was in Linux 2.6.24. | |
55f52de8 | 52 | Over time, various cgroup controllers have been added |
43df1ab3 | 53 | to allow the management of various types of resources. |
55f52de8 MK |
54 | However, the development of these controllers was largely uncoordinated, |
55 | with the result that many inconsistencies arose between controllers | |
43df1ab3 | 56 | and management of the cgroup hierarchies became rather complex. |
069cbb60 SH |
57 | A longer description of these problems can be found in the kernel |
58 | source file | |
1ae6b2c7 | 59 | .I Documentation/admin\-guide/cgroup\-v2.rst |
069cbb60 | 60 | (or |
1ae6b2c7 | 61 | .I Documentation/cgroup\-v2.txt |
069cbb60 | 62 | in Linux 4.17 and earlier). |
a721e8b2 | 63 | .PP |
813d9220 MK |
64 | Because of the problems with the initial cgroups implementation |
65 | (cgroups version 1), | |
43df1ab3 MK |
66 | starting in Linux 3.10, work began on a new, |
67 | orthogonal implementation to remedy these problems. | |
68 | Initially marked experimental, and hidden behind the | |
69 | .I "\-o\ __DEVEL__sane_behavior" | |
70 | mount option, the new version (cgroups version 2) | |
71 | was eventually made official with the release of Linux 4.5. | |
72 | Differences between the two versions are described in the text below. | |
8f0b7d76 MG |
73 | The file |
74 | .IR cgroup.sane_behavior , | |
5e833e27 MK |
75 | present in cgroups v1, is a relic of this mount option. |
76 | The file always reports "0" and is only retained for backward compatibility. | |
a721e8b2 | 77 | .PP |
43df1ab3 MK |
78 | Although cgroups v2 is intended as a replacement for cgroups v1, |
79 | the older system continues to exist | |
80 | (and for compatibility reasons is unlikely to be removed). | |
81 | Currently, cgroups v2 implements only a subset of the controllers | |
82 | available in cgroups v1. | |
83 | The two systems are implemented so that both v1 controllers and | |
84 | v2 controllers can be mounted on the same system. | |
85 | Thus, for example, it is possible to use those controllers | |
86 | that are supported under version 2, | |
87 | while also using version 1 controllers | |
88 | where version 2 does not yet support those controllers. | |
1a90a85e MK |
89 | The only restriction here is that a controller can't be simultaneously |
90 | employed in both a cgroups v1 hierarchy and in the cgroups v2 hierarchy. | |
43df1ab3 | 91 | .\" |
5714ccee | 92 | .SH CGROUPS VERSION 1 |
8bff7140 MK |
93 | Under cgroups v1, each controller may be mounted against a separate |
94 | cgroup filesystem that provides its own hierarchical organization of the | |
95 | processes on the system. | |
980f1827 | 96 | It is also possible to comount multiple (or even all) cgroups v1 controllers |
8bff7140 MK |
97 | against the same cgroup filesystem, meaning that the comounted controllers |
98 | manage the same hierarchical organization of processes. | |
a721e8b2 | 99 | .PP |
8bff7140 MK |
100 | For each mounted hierarchy, |
101 | the directory tree mirrors the control group hierarchy. | |
102 | Each control group is represented by a directory, with each of its child | |
103 | control groups represented as a child directory. | |
104 | For instance, | |
1ae6b2c7 | 105 | .I /user/joe/1.session |
8bff7140 MK |
106 | represents control group |
107 | .IR 1.session , | |
108 | which is a child of cgroup | |
109 | .IR joe , | |
110 | which is a child of | |
111 | .IR /user . | |
112 | Under each cgroup directory is a set of files which can be read or | |
113 | written to, reflecting resource limits and a few general cgroup | |
114 | properties. | |
8bff7140 | 115 | .\" |
6398ca15 | 116 | .SS Tasks (threads) versus processes |
c775bca2 MK |
117 | In cgroups v1, a distinction is drawn between |
118 | .I processes | |
119 | and | |
120 | .IR tasks . | |
121 | In this view, a process can consist of multiple tasks | |
6398ca15 MK |
122 | (more commonly called threads, from a user-space perspective, |
123 | and called such in the remainder of this man page). | |
0ec74e08 | 124 | In cgroups v1, it is possible to independently manipulate |
6398ca15 | 125 | the cgroup memberships of the threads in a process. |
c56ec51b MK |
126 | .PP |
127 | The cgroups v1 ability to split threads across different cgroups | |
128 | caused problems in some cases. | |
129 | For example, it made no sense for the | |
130 | .I memory | |
131 | controller, | |
132 | since all of the threads of a process share a single address space. | |
133 | Because of these problems, | |
c775bca2 | 134 | the ability to independently manipulate the cgroup memberships |
56769384 MK |
135 | of the threads in a process was removed in the initial cgroups v2 |
136 | implementation, and subsequently restored in a more limited form | |
137 | (see the discussion of "thread mode" below). | |
c775bca2 | 138 | .\" |
77e0a626 MK |
139 | .SS Mounting v1 controllers |
140 | The use of cgroups requires a kernel built with the | |
1ae6b2c7 | 141 | .B CONFIG_CGROUPS |
8e6578f8 | 142 | option. |
77e0a626 MK |
143 | In addition, each of the v1 controllers has an associated |
144 | configuration option that must be set in order to employ that controller. | |
a721e8b2 | 145 | .PP |
77e0a626 MK |
146 | In order to use a v1 controller, |
147 | it must be mounted against a cgroup filesystem. | |
4e07c70f MK |
148 | The usual place for such mounts is under a |
149 | .BR tmpfs (5) | |
150 | filesystem mounted at | |
77e0a626 MK |
151 | .IR /sys/fs/cgroup . |
152 | Thus, one might mount the | |
153 | .I cpu | |
154 | controller as follows: | |
a721e8b2 | 155 | .PP |
77e0a626 | 156 | .in +4n |
b8302363 | 157 | .EX |
77e0a626 | 158 | mount \-t cgroup \-o cpu none /sys/fs/cgroup/cpu |
b8302363 | 159 | .EE |
e646a1ba | 160 | .in |
a721e8b2 | 161 | .PP |
77e0a626 MK |
162 | It is possible to comount multiple controllers against the same hierarchy. |
163 | For example, here the | |
1ae6b2c7 | 164 | .I cpu |
21f0d132 | 165 | and |
1ae6b2c7 | 166 | .I cpuacct |
77e0a626 | 167 | controllers are comounted against a single hierarchy: |
a721e8b2 | 168 | .PP |
21f0d132 | 169 | .in +4n |
b8302363 | 170 | .EX |
77e0a626 | 171 | mount \-t cgroup \-o cpu,cpuacct none /sys/fs/cgroup/cpu,cpuacct |
b8302363 | 172 | .EE |
e646a1ba | 173 | .in |
a721e8b2 | 174 | .PP |
55f52de8 | 175 | Comounting controllers has the effect that a process is in the same cgroup for |
77e0a626 | 176 | all of the comounted controllers. |
55f52de8 | 177 | Separately mounting controllers allows a process to |
21f0d132 MK |
178 | be in cgroup |
179 | .I /foo1 | |
55f52de8 | 180 | for one controller while being in |
21f0d132 MK |
181 | .I /foo2/foo3 |
182 | for another. | |
a721e8b2 | 183 | .PP |
77e0a626 | 184 | It is possible to comount all v1 controllers against the same hierarchy: |
a721e8b2 | 185 | .PP |
77e0a626 | 186 | .in +4n |
b8302363 | 187 | .EX |
77e0a626 | 188 | mount \-t cgroup \-o all cgroup /sys/fs/cgroup |
b8302363 | 189 | .EE |
e646a1ba | 190 | .in |
a721e8b2 | 191 | .PP |
77e0a626 MK |
192 | (One can achieve the same result by omitting |
193 | .IR "\-o all" , | |
194 | since it is the default if no controllers are explicitly specified.) | |
a721e8b2 | 195 | .PP |
31ec2a5c MK |
196 | It is not possible to mount the same controller |
197 | against multiple cgroup hierarchies. | |
198 | For example, it is not possible to mount both the | |
199 | .I cpu | |
200 | and | |
201 | .I cpuacct | |
202 | controllers against one hierarchy, and to mount the | |
203 | .I cpu | |
204 | controller alone against another hierarchy. | |
525a8b54 | 205 | It is possible to create multiple mounts with exactly |
31ec2a5c MK |
206 | the same set of comounted controllers. |
207 | However, in this case all that results is multiple mount points | |
208 | providing a view of the same hierarchy. | |
a721e8b2 | 209 | .PP |
77e0a626 MK |
210 | Note that on many systems, the v1 controllers are automatically mounted under |
211 | .IR /sys/fs/cgroup ; | |
212 | in particular, | |
213 | .BR systemd (1) | |
525a8b54 | 214 | automatically creates such mounts. |
21f0d132 | 215 | .\" |
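To find out where a given v1 controller is already mounted, one can scan the mount table. The following is a minimal sketch, not an established tool: the helper name cg1_mount_point is hypothetical, and it assumes the usual /proc/mounts layout (device, mount point, filesystem type, comma-separated options) with the controller names appearing among the mount options.

```shell
#!/bin/sh
# Sketch only: cg1_mount_point is a hypothetical helper name.
# Print the first mount point whose cgroup v1 mount options include
# controller $1. $2 names the mount table to read (default /proc/mounts).
cg1_mount_point() {
    awk -v ctrl="$1" '
        $3 == "cgroup" {
            # Field 4 holds the comma-separated mount options.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == ctrl) { print $2; exit }
        }
    ' "${2:-/proc/mounts}"
}
```

Because several mount points may provide a view of the same hierarchy, the sketch reports only the first match; it prints nothing if the controller is not mounted under v1.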
7409b54b MK |
216 | .SS Unmounting v1 controllers |
217 | A mounted cgroup filesystem can be unmounted using the | |
218 | .BR umount (8) | |
219 | command, as in the following example: | |
220 | .PP | |
221 | .in +4n | |
222 | .EX | |
223 | umount /sys/fs/cgroup/pids | |
224 | .EE | |
225 | .in | |
226 | .PP | |
227 | .IR "But note well" : | |
228 | a cgroup filesystem is unmounted only if it is not busy, | |
229 | that is, it has no child cgroups. | |
230 | If this is not the case, then the only effect of the | |
231 | .BR umount (8) | |
232 | is to make the mount invisible. | |
525a8b54 | 233 | Thus, to ensure that the mount is really removed, |
7409b54b MK |
234 | one must first remove all child cgroups, |
235 | which in turn can be done only after all member processes | |
236 | have been moved from those cgroups to the root cgroup. | |
237 | .\" | |
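The procedure described above (move member processes to the root cgroup, remove the child cgroups, then unmount) might be sketched as follows. The helper name cg_drain is hypothetical, and the sketch assumes GNU or BSD find (for \-mindepth) and sufficient privilege to move the processes; a write can still fail for processes that cannot be migrated.

```shell
#!/bin/sh
# Sketch only: cg_drain is a hypothetical helper name.
# Empty the hierarchy mounted at $1 so that a later umount really removes it.
cg_drain() {
    root=$1
    # -depth lists children before parents; -mindepth 1 skips the root cgroup.
    find "$root" -mindepth 1 -depth -type d | while IFS= read -r cg; do
        if [ -r "$cg/cgroup.procs" ]; then
            # Snapshot the member list, then move one PID per write.
            for pid in $(cat "$cg/cgroup.procs"); do
                echo "$pid" > "$root/cgroup.procs"
            done
        fi
        # Succeeds only once the cgroup has no children and no processes.
        rmdir "$cg"
    done
}
```

A possible use is cg_drain /sys/fs/cgroup/pids followed by umount /sys/fs/cgroup/pids.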
860573ad MK |
238 | .SS Cgroups version 1 controllers |
239 | Each of the cgroups version 1 controllers is governed | |
240 | by a kernel configuration option (listed below). | |
241 | Additionally, the availability of the cgroups feature is governed by the | |
1ae6b2c7 | 242 | .B CONFIG_CGROUPS |
860573ad MK |
243 | kernel configuration option. |
244 | .TP | |
245 | .IR cpu " (since Linux 2.6.24; " \fBCONFIG_CGROUP_SCHED\fP ) | |
246 | Cgroups can be guaranteed a minimum number of "CPU shares" | |
247 | when a system is busy. | |
248 | This does not limit a cgroup's CPU usage if the CPUs are not busy. | |
4ad9a706 | 249 | For further information, see |
1ae6b2c7 | 250 | .I Documentation/scheduler/sched\-design\-CFS.rst |
069cbb60 | 251 | (or |
1ae6b2c7 | 252 | .I Documentation/scheduler/sched\-design\-CFS.txt |
069cbb60 | 253 | in Linux 5.2 and earlier). |
a721e8b2 | 254 | .IP |
4ad9a706 MK |
255 | In Linux 3.2, |
256 | this controller was extended to provide CPU "bandwidth" control. | |
257 | If the kernel is configured with | |
81ff7360 | 258 | .BR CONFIG_CFS_BANDWIDTH , |
4ad9a706 MK |
259 | then within each scheduling period |
260 | (defined via a file in the cgroup directory), it is possible to define | |
261 | an upper limit on the CPU time allocated to the processes in a cgroup. | |
262 | This upper limit applies even if there is no other competition for the CPU. | |
860573ad | 263 | Further information can be found in the kernel source file |
1ae6b2c7 | 264 | .I Documentation/scheduler/sched\-bwc.rst |
069cbb60 | 265 | (or |
1ae6b2c7 | 266 | .I Documentation/scheduler/sched\-bwc.txt |
069cbb60 | 267 | in Linux 5.2 and earlier). |
860573ad MK |
268 | .TP |
269 | .IR cpuacct " (since Linux 2.6.24; " \fBCONFIG_CGROUP_CPUACCT\fP ) | |
270 | This provides accounting for CPU usage by groups of processes. | |
a721e8b2 | 271 | .IP |
860573ad | 272 | Further information can be found in the kernel source file |
1ae6b2c7 | 273 | .I Documentation/admin\-guide/cgroup\-v1/cpuacct.rst |
069cbb60 | 274 | (or |
1ae6b2c7 | 275 | .I Documentation/cgroup\-v1/cpuacct.txt |
069cbb60 | 276 | in Linux 5.2 and earlier). |
860573ad MK |
277 | .TP |
278 | .IR cpuset " (since Linux 2.6.24; " \fBCONFIG_CPUSETS\fP ) | |
279 | This controller can be used to bind the processes in a cgroup to | |
280 | a specified set of CPUs and NUMA nodes. | |
a721e8b2 | 281 | .IP |
860573ad | 282 | Further information can be found in the kernel source file |
1ae6b2c7 | 283 | .I Documentation/admin\-guide/cgroup\-v1/cpusets.rst |
069cbb60 | 284 | (or |
1ae6b2c7 | 285 | .I Documentation/cgroup\-v1/cpusets.txt |
069cbb60 SH |
286 | in Linux 5.2 and earlier). |
287 | . | |
860573ad MK |
288 | .TP |
289 | .IR memory " (since Linux 2.6.25; " \fBCONFIG_MEMCG\fP ) | |
290 | The memory controller supports reporting and limiting of process memory, kernel | |
291 | memory, and swap used by cgroups. | |
a721e8b2 | 292 | .IP |
860573ad | 293 | Further information can be found in the kernel source file |
1ae6b2c7 | 294 | .I Documentation/admin\-guide/cgroup\-v1/memory.rst |
069cbb60 | 295 | (or |
1ae6b2c7 | 296 | .I Documentation/cgroup\-v1/memory.txt |
069cbb60 | 297 | in Linux 5.2 and earlier). |
860573ad MK |
298 | .TP |
299 | .IR devices " (since Linux 2.6.26; " \fBCONFIG_CGROUP_DEVICE\fP ) | |
300 | This supports controlling which processes may create (mknod) devices as | |
301 | well as open them for reading or writing. | |
640453bb | 302 | The policies may be specified as allow-lists and deny-lists. |
860573ad MK |
303 | Hierarchy is enforced, so new rules must not |
304 | violate existing rules for the target or ancestor cgroups. | |
a721e8b2 | 305 | .IP |
860573ad | 306 | Further information can be found in the kernel source file |
1ae6b2c7 | 307 | .I Documentation/admin\-guide/cgroup\-v1/devices.rst |
069cbb60 | 308 | (or |
1ae6b2c7 | 309 | .I Documentation/cgroup\-v1/devices.txt |
069cbb60 | 310 | in Linux 5.2 and earlier). |
860573ad MK |
311 | .TP |
312 | .IR freezer " (since Linux 2.6.28; " \fBCONFIG_CGROUP_FREEZER\fP ) | |
313 | The | |
1ae6b2c7 | 314 | .I freezer |
860573ad MK |
315 | cgroup can suspend and restore (resume) all processes in a cgroup. |
316 | Freezing a cgroup | |
317 | .I /A | |
318 | also causes its children, for example, processes in | |
319 | .IR /A/B , | |
320 | to be frozen. | |
a721e8b2 | 321 | .IP |
860573ad | 322 | Further information can be found in the kernel source file |
1ae6b2c7 | 323 | .I Documentation/admin\-guide/cgroup\-v1/freezer\-subsystem.rst |
069cbb60 | 324 | (or |
1ae6b2c7 | 325 | .I Documentation/cgroup\-v1/freezer\-subsystem.txt |
069cbb60 | 326 | in Linux 5.2 and earlier). |
860573ad MK |
327 | .TP |
328 | .IR net_cls " (since Linux 2.6.29; " \fBCONFIG_CGROUP_NET_CLASSID\fP ) | |
329 | This places a classid, specified for the cgroup, on network packets | |
330 | created by a cgroup. | |
331 | These classids can then be used in firewall rules, | |
332 | as well as used to shape traffic using | |
333 | .BR tc (8). | |
334 | This applies only to packets | |
335 | leaving the cgroup, not to traffic arriving at the cgroup. | |
a721e8b2 | 336 | .IP |
860573ad | 337 | Further information can be found in the kernel source file |
1ae6b2c7 | 338 | .I Documentation/admin\-guide/cgroup\-v1/net_cls.rst |
069cbb60 | 339 | (or |
1ae6b2c7 | 340 | .I Documentation/cgroup\-v1/net_cls.txt |
069cbb60 | 341 | in Linux 5.2 and earlier). |
860573ad MK |
342 | .TP |
343 | .IR blkio " (since Linux 2.6.33; " \fBCONFIG_BLK_CGROUP\fP ) | |
344 | The | |
345 | .I blkio | |
346 | cgroup controls and limits access to specified block devices by | |
347 | applying IO control in the form of throttling and upper limits against leaf | |
348 | nodes and intermediate nodes in the storage hierarchy. | |
a721e8b2 | 349 | .IP |
860573ad MK |
350 | Two policies are available. |
351 | The first is a proportional-weight time-based division | |
352 | of disk implemented with CFQ. | |
353 | This is in effect for leaf nodes using CFQ. | |
354 | The second is a throttling policy which specifies | |
355 | upper I/O rate limits on a device. | |
a721e8b2 | 356 | .IP |
860573ad | 357 | Further information can be found in the kernel source file |
1ae6b2c7 | 358 | .I Documentation/admin\-guide/cgroup\-v1/blkio\-controller.rst |
069cbb60 | 359 | (or |
1ae6b2c7 | 360 | .I Documentation/cgroup\-v1/blkio\-controller.txt |
069cbb60 | 361 | in Linux 5.2 and earlier). |
860573ad MK |
362 | .TP |
363 | .IR perf_event " (since Linux 2.6.39; " \fBCONFIG_CGROUP_PERF\fP ) | |
364 | This controller allows | |
365 | .I perf | |
366 | monitoring of the set of processes grouped in a cgroup. | |
a721e8b2 | 367 | .IP |
069cbb60 | 368 | Further information can be found in the kernel source files |
860573ad MK |
369 | .TP |
370 | .IR net_prio " (since Linux 3.3; " \fBCONFIG_CGROUP_NET_PRIO\fP ) | |
371 | This allows priorities to be specified, per network interface, for cgroups. | |
a721e8b2 | 372 | .IP |
860573ad | 373 | Further information can be found in the kernel source file |
1ae6b2c7 | 374 | .I Documentation/admin\-guide/cgroup\-v1/net_prio.rst |
069cbb60 | 375 | (or |
1ae6b2c7 | 376 | .I Documentation/cgroup\-v1/net_prio.txt |
069cbb60 | 377 | in Linux 5.2 and earlier). |
860573ad MK |
378 | .TP |
379 | .IR hugetlb " (since Linux 3.5; " \fBCONFIG_CGROUP_HUGETLB\fP ) | |
380 | This supports limiting the use of huge pages by cgroups. | |
a721e8b2 | 381 | .IP |
860573ad | 382 | Further information can be found in the kernel source file |
1ae6b2c7 | 383 | .I Documentation/admin\-guide/cgroup\-v1/hugetlb.rst |
069cbb60 | 384 | (or |
1ae6b2c7 | 385 | .I Documentation/cgroup\-v1/hugetlb.txt |
069cbb60 | 386 | in Linux 5.2 and earlier). |
860573ad MK |
387 | .TP |
388 | .IR pids " (since Linux 4.3; " \fBCONFIG_CGROUP_PIDS\fP ) | |
389 | This controller permits limiting the number of processes that may be created | |
390 | in a cgroup (and its descendants). | |
a721e8b2 | 391 | .IP |
860573ad | 392 | Further information can be found in the kernel source file |
1ae6b2c7 | 393 | .I Documentation/admin\-guide/cgroup\-v1/pids.rst |
069cbb60 | 394 | (or |
1ae6b2c7 | 395 | .I Documentation/cgroup\-v1/pids.txt |
069cbb60 | 396 | in Linux 5.2 and earlier). |
cfec905e NB |
397 | .TP |
398 | .IR rdma " (since Linux 4.11; " \fBCONFIG_CGROUP_RDMA\fP ) | |
d145c025 MK |
399 | The RDMA controller permits limiting the use of |
400 | RDMA/IB-specific resources per cgroup. | |
cfec905e NB |
401 | .IP |
402 | Further information can be found in the kernel source file | |
1ae6b2c7 | 403 | .I Documentation/admin\-guide/cgroup\-v1/rdma.rst |
069cbb60 | 404 | (or |
1ae6b2c7 | 405 | .I Documentation/cgroup\-v1/rdma.txt |
069cbb60 | 406 | in Linux 5.2 and earlier). |
860573ad | 407 | .\" |
6398ca15 | 408 | .SS Creating cgroups and moving processes |
9ed582ac | 409 | A cgroup filesystem initially contains a single root cgroup, '/', |
6398ca15 | 410 | which all processes belong to. |
21f0d132 | 411 | A new cgroup is created by creating a directory in the cgroup filesystem: |
a721e8b2 | 412 | .PP |
4769a778 MK |
413 | .in +4n |
414 | .EX | |
415 | mkdir /sys/fs/cgroup/cpu/cg1 | |
416 | .EE | |
417 | .in | |
a721e8b2 | 418 | .PP |
21f0d132 | 419 | This creates a new empty cgroup. |
a721e8b2 | 420 | .PP |
f524e7f8 | 421 | A process may be moved to this cgroup by writing its PID into the cgroup's |
21f0d132 | 422 | .I cgroup.procs |
21f0d132 | 423 | file: |
a721e8b2 | 424 | .PP |
4769a778 MK |
425 | .in +4n |
426 | .EX | |
427 | echo $$ > /sys/fs/cgroup/cpu/cg1/cgroup.procs | |
428 | .EE | |
429 | .in | |
a721e8b2 | 430 | .PP |
f524e7f8 | 431 | Only one PID at a time should be written to this file. |
a721e8b2 | 432 | .PP |
f524e7f8 | 433 | Writing the value 0 to a |
1ae6b2c7 | 434 | .I cgroup.procs |
f524e7f8 | 435 | file causes the writing process to be moved to the corresponding cgroup. |
a721e8b2 | 436 | .PP |
6398ca15 MK |
437 | When a PID is written into
438 | .IR cgroup.procs , | |
87402a2e | 439 | all threads in the process are moved into the new cgroup at once. |
a721e8b2 | 440 | .PP |
f524e7f8 MK |
441 | Within a hierarchy, a process can be a member of exactly one cgroup. |
442 | Writing a process's PID to a | |
1ae6b2c7 | 443 | .I cgroup.procs |
f524e7f8 MK |
444 | file automatically removes it from the cgroup of |
445 | which it was previously a member. | |
a721e8b2 | 446 | .PP |
f524e7f8 MK |
447 | The |
448 | .I cgroup.procs | |
449 | file can be read to obtain a list of the processes that are | |
450 | members of a cgroup. | |
451 | The returned list of PIDs is not guaranteed to be in order. | |
452 | Nor is it guaranteed to be free of duplicates. | |
453 | (For example, a PID may be recycled while reading from the list.) | |
a721e8b2 | 454 | .PP |
56769384 | 455 | In cgroups v1, an individual thread can be moved to |
87402a2e MK |
456 | another cgroup by writing its thread ID |
457 | (i.e., the kernel thread ID returned by | |
458 | .BR clone (2) | |
459 | and | |
460 | .BR gettid (2)) | |
461 | to the | |
1ae6b2c7 | 462 | .I tasks |
87402a2e MK |
463 | file in a cgroup directory. |
464 | This file can be read to discover the set of threads | |
465 | that are members of the cgroup. | |
b43be47e MK |
466 | .\" |
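The operations described in this subsection can be wrapped in a few small shell functions. This is a sketch under stated assumptions: the variable CGROOT and all function names are hypothetical, and CGROOT defaults to the cpu hierarchy used in the examples above.

```shell
#!/bin/sh
# Sketch only: CGROOT and the helper names are assumptions for illustration.
CGROOT="${CGROOT:-/sys/fs/cgroup/cpu}"

# Create a new, empty cgroup named $1 under $CGROOT.
cg_create() {
    mkdir -p "$CGROOT/$1"
}

# Move process $2 (with all of its threads) into cgroup $1.
cg_move_pid() {
    printf '%s\n' "$2" > "$CGROOT/$1/cgroup.procs"
}

# Move several processes into cgroup $1, one PID per write.
cg_move_pids() {
    cg=$1; shift
    for pid in "$@"; do
        cg_move_pid "$cg" "$pid" || return 1
    done
}

# List member PIDs sorted and de-duplicated, since a read of cgroup.procs
# guarantees neither ordering nor uniqueness.
cg_list_pids() {
    sort -n -u "$CGROOT/$1/cgroup.procs"
}
```

For example, cg_create cg1 followed by cg_move_pid cg1 $$ places the current shell in the new cgroup.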
467 | .SS Removing cgroups | |
468 | To remove a cgroup, | |
469 | it must first have no child cgroups and contain no (nonzombie) processes. | |
470 | So long as that is the case, one can simply | |
471 | remove the corresponding directory pathname. | |
472 | Note that files in a cgroup directory cannot and need not be | |
473 | removed. | |
474 | .\" | |
88afe701 | 475 | .SS Cgroups v1 release notification |
23388d41 MK |
476 | Two files can be used to determine whether the kernel provides |
477 | notifications when a cgroup becomes empty. | |
478 | A cgroup is considered to be empty when it contains no child | |
479 | cgroups and no member processes. | |
a721e8b2 | 480 | .PP |
23388d41 | 481 | A special file in the root directory of each cgroup hierarchy, |
88afe701 | 482 | .IR release_agent , |
23388d41 MK |
483 | can be used to register the pathname of a program that may be invoked when |
484 | a cgroup in the hierarchy becomes empty. | |
485 | The pathname of the newly empty cgroup (relative to the cgroup mount point) | |
486 | is provided as the sole command-line argument when the | |
1ae6b2c7 | 487 | .I release_agent |
23388d41 MK |
488 | program is invoked. |
489 | The | |
1ae6b2c7 | 490 | .I release_agent |
23388d41 | 491 | program might remove the cgroup directory, |
980f1827 | 492 | or perhaps repopulate it with a process. |
a721e8b2 | 493 | .PP |
23388d41 | 494 | The default value of the |
1ae6b2c7 | 495 | .I release_agent |
23388d41 | 496 | file is empty, meaning that no release agent is invoked. |
a721e8b2 | 497 | .PP |
59af0514 MK |
498 | The content of the |
499 | .I release_agent | |
500 | file can also be specified via a mount option when the | |
501 | cgroup filesystem is mounted: | |
502 | .PP | |
503 | .in +4n | |
504 | .EX | |
fb6d2c09 | 505 | mount \-o release_agent=pathname ... |
59af0514 MK |
506 | .EE |
507 | .in | |
508 | .PP | |
23388d41 | 509 | Whether or not the |
1ae6b2c7 | 510 | .I release_agent |
23388d41 MK |
511 | program is invoked when a particular cgroup becomes empty is determined |
512 | by the value in the | |
1ae6b2c7 | 513 | .I notify_on_release |
23388d41 MK |
514 | file in the corresponding cgroup directory. |
515 | If this file contains the value 0, then the | |
1ae6b2c7 | 516 | .I release_agent |
23388d41 MK |
517 | program is not invoked. |
518 | If it contains the value 1, the | |
1ae6b2c7 | 519 | .I release_agent |
23388d41 MK |
520 | program is invoked. |
521 | The default value for this file in the root cgroup is 0. | |
522 | At the time when a new cgroup is created, | |
523 | the value in this file is inherited from the corresponding file | |
524 | in the parent cgroup. | |
88afe701 | 525 | .\" |
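As an illustration of how the two files work together, the following sketch registers an agent and enables notification for one cgroup. Both function names are hypothetical, and the agent body shown (removing the empty cgroup directory) is only one of the possible actions mentioned above.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical.
# Register agent program $3 for the hierarchy mounted at $1 and enable
# notification for the cgroup at relative path $2.
cg_enable_release() {
    printf '%s\n' "$3" > "$1/release_agent" &&
    echo 1 > "$1/$2/notify_on_release"
}

# Body of a minimal release agent. $1 is the mount point (an assumption,
# fixed by whoever installs the agent); $2 is the pathname of the newly
# empty cgroup relative to that mount point, as supplied on the command line.
cg_release_agent() {
    # Strip a leading slash from the relative pathname before joining.
    rmdir "$1/${2#/}"
}
```

Child cgroups created after notify_on_release is set inherit the value, so enabling it on a parent before creating children covers the whole subtree.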
d311c798 MK |
526 | .SS Cgroup v1 named hierarchies |
527 | In cgroups v1, | |
528 | it is possible to mount a cgroup hierarchy that has no attached controllers: | |
529 | .PP | |
530 | .in +4n | |
531 | .EX | |
fb6d2c09 | 532 | mount \-t cgroup \-o none,name=somename none /some/mount/point |
d311c798 MK |
533 | .EE |
534 | .in | |
535 | .PP | |
536 | Multiple instances of such hierarchies can be mounted; | |
537 | each hierarchy must have a unique name. | |
538 | The only purpose of such hierarchies is to track processes. | |
539 | (See the discussion of release notification above.) | |
540 | An example of this is the | |
541 | .I name=systemd | |
542 | cgroup hierarchy that is used by | |
543 | .BR systemd (1) | |
544 | to track services and user sessions. | |
29fa4cbc MK |
545 | .PP |
546 | Since Linux 5.0, the | |
547 | .I cgroup_no_v1 | |
548 | kernel boot option (described below) can be used to disable cgroup v1 | |
549 | named hierarchies, by specifying | |
550 | .IR cgroup_no_v1=named . | |
d311c798 | 551 | .\" |
5714ccee | 552 | .SH CGROUPS VERSION 2 |
b43be47e MK |
553 | In cgroups v2, |
554 | all mounted controllers reside in a single unified hierarchy. | |
555 | While (different) controllers may be simultaneously | |
556 | mounted under the v1 and v2 hierarchies, | |
557 | it is not possible to mount the same controller simultaneously | |
558 | under both the v1 and the v2 hierarchies. | |
a721e8b2 | 559 | .PP |
2befa495 MK |
560 | The new behaviors in cgroups v2 are summarized here, |
561 | and in some cases elaborated in the following subsections. | |
22356d97 | 562 | .IP \(bu 3 |
a15e0673 | 563 | Cgroups v2 provides a unified hierarchy against |
dddb7ea1 | 564 | which all controllers are mounted. |
22356d97 | 565 | .IP \(bu |
2befa495 MK |
566 | "Internal" processes are not permitted. |
567 | With the exception of the root cgroup, processes may reside | |
568 | only in leaf nodes (cgroups that do not themselves contain child cgroups). | |
4f017a68 | 569 | The details are somewhat more subtle than this, and are described below. |
22356d97 | 570 | .IP \(bu |
2befa495 | 571 | Active controllers must be specified via the files |
1ae6b2c7 | 572 | .I cgroup.controllers |
2befa495 MK |
573 | and |
574 | .IR cgroup.subtree_control . | |
22356d97 | 575 | .IP \(bu |
2befa495 MK |
576 | The |
577 | .I tasks | |
578 | file has been removed. | |
579 | In addition, the | |
580 | .I cgroup.clone_children | |
581 | file that is employed by the | |
582 | .I cpuset | |
583 | controller has been removed. | |
22356d97 | 584 | .IP \(bu |
2befa495 | 585 | An improved mechanism for notification of empty cgroups is provided by the |
1ae6b2c7 | 586 | .I cgroup.events |
2befa495 MK |
587 | file. |
588 | .PP | |
589 | For more changes, see the | |
1ae6b2c7 | 590 | .I Documentation/admin\-guide/cgroup\-v2.rst |
069cbb60 SH |
591 | file in the kernel source |
592 | (or | |
1ae6b2c7 | 593 | .I Documentation/cgroup\-v2.txt |
069cbb60 SH |
594 | in Linux 4.17 and earlier). |
595 | . | |
e91d4f9e MK |
596 | .PP |
597 | Some of the new behaviors listed above saw subsequent modification with | |
598 | the addition in Linux 4.14 of "thread mode" (described below). | |
2befa495 | 599 | .\" |
dddb7ea1 MK |
600 | .SS Cgroups v2 unified hierarchy |
601 | In cgroups v1, the ability to mount different controllers | |
602 | against different hierarchies was intended to allow great flexibility | |
603 | for application design. | |
e91fc446 MK |
604 | In practice, though, |
605 | the flexibility turned out to be less useful than expected, | |
dddb7ea1 MK |
606 | and in many cases added complexity. |
607 | Therefore, in cgroups v2, | |
608 | all available controllers are mounted against a single hierarchy. | |
609 | The available controllers are automatically mounted, | |
610 | meaning that it is not necessary (or possible) to specify the controllers | |
611 | when mounting the cgroup v2 filesystem using a command such as the following: | |
a721e8b2 | 612 | .PP |
4769a778 MK |
613 | .in +4n |
614 | .EX | |
fb6d2c09 | 615 | mount \-t cgroup2 none /mnt/cgroup2 |
4769a778 MK |
616 | .EE |
617 | .in | |
a721e8b2 | 618 | .PP |
dddb7ea1 MK |
619 | A cgroup v2 controller is available only if it is not currently in use |
620 | via a mount against a cgroup v1 hierarchy. | |
621 | Or, to put things another way, it is not possible to employ | |
622 | the same controller against both a v1 hierarchy and the unified v2 hierarchy. | |
57cbb0db MK |
623 | This means that it may be necessary first to unmount a v1 controller |
624 | (as described above) before that controller is available in v2. | |
625 | Since | |
626 | .BR systemd (1) | |
627 | makes heavy use of some v1 controllers by default, | |
628 | it can in some cases be simpler to boot the system with | |
629 | selected v1 controllers disabled. | |
630 | To do this, specify the | |
1ae6b2c7 | 631 | .I cgroup_no_v1=list |
57cbb0db MK |
632 | option on the kernel boot command line; |
633 | .I list | |
634 | is a comma-separated list of the names of the controllers to disable, | |
635 | or the word | |
636 | .I all | |
637 | to disable all v1 controllers. | |
638 | (This situation is correctly handled by | |
639 | .BR systemd (1), | |
640 | which falls back to operating without the specified controllers.) | |
03bb1264 MK |
641 | .PP |
642 | Note that on many modern systems, | |
643 | .BR systemd (1) | |
644 | automatically mounts the | |
645 | .I cgroup2 | |
646 | filesystem at | |
647 | .I /sys/fs/cgroup/unified | |
648 | during the boot process. | |
dddb7ea1 | 649 | .\" |
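Whether a controller is currently available in the v2 hierarchy can be checked by reading the cgroup.controllers file at the root of the cgroup2 mount, which holds a space-separated list of controller names. The helper below is a sketch; its name is hypothetical, and the mount point is passed in rather than assumed.

```shell
#!/bin/sh
# Sketch only: cg2_has_controller is a hypothetical helper name.
# Succeed if controller $2 is listed in cgroup.controllers under the
# cgroup2 mount point $1.
cg2_has_controller() {
    # One name per line, then match the whole line literally.
    tr ' ' '\n' < "$1/cgroup.controllers" | grep -qxF -- "$2"
}
```

A controller that is still mounted against a v1 hierarchy will not appear in the list, so a failing check can indicate that the v1 mount must be removed first (or the controller disabled for v1 at boot).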
efb95954 MK |
650 | .SS Cgroups v2 mount options |
651 | The following options | |
1ae6b2c7 | 652 | .RI ( mount\~\-o ) |
efb95954 MK |
653 | can be specified when mounting the cgroup v2 filesystem:
654 | .TP | |
655 | .IR nsdelegate " (since Linux 4.15)" | |
656 | Treat cgroup namespaces as delegation boundaries. | |
657 | For details, see below. | |
9e18674a MK |
658 | .TP |
659 | .IR memory_localevents " (since Linux 5.2)" | |
660 | .\" commit 9852ae3fe5293264f01c49f2571ef7688f7823ce | |
661 | The | |
662 | .I memory.events | |
663 | file should show statistics only for the cgroup itself, | |
664 | and not for any descendant cgroups. | |
665 | This was the behavior before Linux 5.2. | |
666 | Starting in Linux 5.2, | |
667 | the default behavior is to include statistics for descendant cgroups in | |
668 | .IR memory.events , | |
669 | and this mount option can be used to revert to the legacy behavior. | |
670 | This option is system wide and can be set on mount or | |
671 | modified through remount only from the initial mount namespace; | |
672 | it is silently ignored in noninitial namespaces. | |
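Since a mount option can be silently ignored, it can be useful to check whether it actually took effect. The following sketch tests whether a cgroup2 entry in a mounts table carries a given option. The function name cgroup2_has_option is hypothetical, and the sketch assumes that the kernel reports the option in the options field of /proc/mounts.

```shell
# Succeed if any cgroup2 mount lists the given option.
# $1: option name, for example nsdelegate or memory_localevents.
# $2: mounts table to read (assumed default: /proc/mounts).
cgroup2_has_option() {
    option=$1
    mounts_file=${2:-/proc/mounts}
    awk -v opt="$option" '
        $3 == "cgroup2" {
            # The fourth field is a comma-separated option list.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == opt)
                    found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$mounts_file"
}
```

For example, "cgroup2_has_option memory_localevents && echo enabled" prints a line only when the option is listed.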
efb95954 | 673 | .\" |
44c429ed MK |
674 | .SS Cgroups v2 controllers |
675 | The following controllers, documented in the kernel source file | |
1ae6b2c7 | 676 | .I Documentation/admin\-guide/cgroup\-v2.rst |
069cbb60 | 677 | (or |
1ae6b2c7 | 678 | .I Documentation/cgroup\-v2.txt |
069cbb60 | 679 | in Linux 4.17 and earlier), |
44c429ed MK |
680 | are supported in cgroups version 2: |
681 | .TP | |
cda7f4a3 MK |
682 | .IR cpu " (since Linux 4.15)" |
683 | This is the successor to the version 1 | |
684 | .I cpu | |
685 | and | |
686 | .I cpuacct | |
687 | controllers. | |
688 | .TP | |
38c287b8 MK |
689 | .IR cpuset " (since Linux 5.0)" |
690 | This is the successor of the version 1 | |
691 | .I cpuset | |
692 | controller. | |
693 | .TP | |
cda7f4a3 MK |
694 | .IR freezer " (since Linux 5.2)" |
695 | .\" commit 76f969e8948d82e78e1bc4beb6b9465908e74873 | |
696 | This is the successor of the version 1 | |
697 | .I freezer | |
698 | controller. | |
699 | .TP | |
38c287b8 MK |
700 | .IR hugetlb " (since Linux 5.6)" |
701 | This is the successor of the version 1 | |
702 | .I hugetlb | |
703 | controller. | |
704 | .TP | |
44c429ed MK |
705 | .IR io " (since Linux 4.5)" |
706 | This is the successor of the version 1 | |
707 | .I blkio | |
708 | controller. | |
709 | .TP | |
710 | .IR memory " (since Linux 4.5)" | |
711 | This is the successor of the version 1 | |
712 | .I memory | |
713 | controller. | |
714 | .TP | |
cda7f4a3 | 715 | .IR perf_event " (since Linux 4.11)" |
44c429ed | 716 | This is the same as the version 1 |
cda7f4a3 | 717 | .I perf_event |
44c429ed MK |
718 | controller. |
719 | .TP | |
cda7f4a3 | 720 | .IR pids " (since Linux 4.5)" |
f7286edc | 721 | This is the same as the version 1 |
cda7f4a3 | 722 | .I pids |
44c429ed MK |
723 | controller. |
724 | .TP | |
725 | .IR rdma " (since Linux 4.11)" | |
726 | This is the same as the version 1 | |
727 | .I rdma | |
728 | controller. | |
38c287b8 MK |
729 | .PP |
730 | There is no direct equivalent of the | |
731 | .I net_cls | |
732 | and | |
733 | .I net_prio | |
734 | controllers from cgroups version 1. | |
735 | Instead, support has been added to | |
736 | .BR iptables (8) | |
737 | to allow eBPF filters that hook on cgroup v2 pathnames to make decisions | |
738 | about network traffic on a per-cgroup basis. | |
739 | .PP | |
740 | The v2 | |
741 | .I devices | |
742 | controller provides no interface files; | |
743 | instead, device control is gated by attaching an eBPF | |
744 | .RB ( BPF_CGROUP_DEVICE ) | |
745 | program to a v2 cgroup. | |
44c429ed | 746 | .\" |
2befa495 | 747 | .SS Cgroups v2 subtree control |
8d5f42dc MK |
748 | Each cgroup in the v2 hierarchy contains the following two files: |
749 | .TP | |
1ae6b2c7 | 750 | .I cgroup.controllers |
277559a4 | 751 | This read-only file exposes a list of the controllers that are |
8d5f42dc MK |
752 | .I available |
753 | in this cgroup. | |
754 | The contents of this file match the contents of the | |
755 | .I cgroup.subtree_control | |
756 | file in the parent cgroup. | |
757 | .TP | |
758 | .I cgroup.subtree_control | |
759 | This is a list of controllers that are | |
1ae6b2c7 | 760 | .I active |
8d5f42dc MK |
761 | .RI ( enabled ) |
762 | in the cgroup. | |
763 | The set of controllers in this file is a subset of the set in the | |
1ae6b2c7 | 764 | .I cgroup.controllers |
8d5f42dc MK |
765 | of this cgroup. |
766 | The set of active controllers is modified by writing strings to this file | |
767 | containing space-delimited controller names, | |
768 | each preceded by '+' (to enable a controller) | |
769 | or '\-' (to disable a controller), as in the following example: | |
770 | .IP | |
771 | .in +4n | |
772 | .EX | |
861d36ba | 773 | echo \(aq+pids \-memory\(aq > x/y/cgroup.subtree_control |
8d5f42dc MK |
774 | .EE |
775 | .in | |
776 | .IP | |
c9b101d1 MK |
777 | An attempt to enable a controller |
778 | that is not present in | |
779 | .I cgroup.controllers | |
780 | leads to an | |
781 | .B ENOENT | |
782 | error when writing to the | |
783 | .I cgroup.subtree_control | |
784 | file. | |
785 | .PP | |
8d5f42dc MK |
786 | Because the list of controllers in |
787 | .I cgroup.subtree_control | |
788 | is a subset of those in | |
789 | .IR cgroup.controllers , | |
790 | a controller that has been disabled in one cgroup in the hierarchy | |
791 | can never be re-enabled in the subtree below that cgroup. | |
792 | .PP | |
793 | A cgroup's | |
794 | .I cgroup.subtree_control | |
795 | file determines the set of controllers that are exercised in the | |
796 | .I child | |
797 | cgroups. | |
798 | When a controller (e.g., | |
799 | .IR pids ) | |
800 | is present in the | |
801 | .I cgroup.subtree_control | |
802 | file of a parent cgroup, | |
803 | then the corresponding controller-interface files (e.g., | |
804 | .IR pids.max ) | |
805 | are automatically created in the children of that cgroup | |
806 | and can be used to exert resource control in the child cgroups. | |
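The availability check and the write to cgroup.subtree_control can be combined, so that a controller that is not available is reported before the write is attempted. This is only a sketch; the function name enable_controller is hypothetical, and the cgroup is identified by its directory pathname.

```shell
# Enable one controller for the children of a cgroup.
# $1: cgroup directory; $2: controller name (for example, pids).
enable_controller() {
    cgroup_dir=$1
    controller=$2
    # cgroup.controllers holds a space-separated list of the
    # controllers available in this cgroup.
    if ! tr ' ' '\n' < "$cgroup_dir/cgroup.controllers" |
            grep -qxF "$controller"; then
        echo "$controller: not available in $cgroup_dir" >&2
        return 1
    fi
    # A leading '+' enables the controller in the child cgroups.
    echo "+$controller" > "$cgroup_dir/cgroup.subtree_control"
}
```

After a successful call such as "enable_controller x/y pids", files such as pids.max should appear in each child of x/y.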
21f0d132 | 807 | .\" |
2468f14e MK |
808 | .SS Cgroups v2 """no internal processes""" rule |
809 | Cgroups v2 enforces a so-called "no internal processes" rule. | |
810 | Roughly speaking, this rule means that, | |
811 | with the exception of the root cgroup, processes may reside | |
812 | only in leaf nodes (cgroups that do not themselves contain child cgroups). | |
813 | This avoids the need to decide how to partition resources between | |
814 | processes which are members of cgroup A and processes in child cgroups of A. | |
815 | .PP | |
816 | For instance, if cgroup | |
817 | .I /cg1/cg2 | |
818 | exists, then a process may reside in | |
819 | .IR /cg1/cg2 , | |
820 | but not in | |
821 | .IR /cg1 . | |
822 | This is to avoid an ambiguity in cgroups v1 | |
823 | with respect to the delegation of resources between processes in | |
824 | .I /cg1 | |
825 | and its child cgroups. | |
826 | The recommended approach in cgroups v2 is to create a subdirectory called | |
827 | .I leaf | |
828 | for any nonleaf cgroup which should contain processes, but no child cgroups. | |
829 | Thus, processes which previously would have gone into | |
830 | .I /cg1 | |
831 | would now go into | |
832 | .IR /cg1/leaf . | |
833 | This has the advantage of making explicit | |
834 | the relationship between processes in | |
835 | .I /cg1/leaf | |
836 | and | |
837 | .IR /cg1 's | |
838 | other children. | |
839 | .PP | |
840 | The "no internal processes" rule is in fact more subtle than stated above. | |
841 | More precisely, the rule is that a (nonroot) cgroup can't both | |
842 | (1) have member processes, and | |
843 | (2) distribute resources into child cgroups\(emthat is, have a nonempty | |
844 | .I cgroup.subtree_control | |
845 | file. | |
846 | Thus, it | |
847 | .I is | |
848 | possible for a cgroup to have both member processes and child cgroups, | |
849 | but before controllers can be enabled for that cgroup, | |
850 | the member processes must be moved out of the cgroup | |
851 | (e.g., perhaps into the child cgroups). | |
e91d4f9e MK |
852 | .PP |
853 | With the Linux 4.14 addition of "thread mode" (described below), | |
854 | the "no internal processes" rule has been relaxed in some cases. | |
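The recommended "leaf" arrangement can be sketched as follows: the member processes of a cgroup are moved into a child named leaf, after which controllers can be enabled in the parent. The function name move_procs_to_leaf is hypothetical, and the sketch assumes that the caller has permission to create the child and to write to its cgroup.procs file.

```shell
# Move every member process of a cgroup into its "leaf" child.
# $1: directory of the cgroup that currently contains processes.
move_procs_to_leaf() {
    cgroup_dir=$1
    leaf_dir=$cgroup_dir/leaf
    mkdir -p "$leaf_dir" || return 1
    # Take a snapshot of the PID list first, since the list
    # changes as processes are moved.
    pids=$(cat "$cgroup_dir/cgroup.procs") || return 1
    for pid in $pids; do
        # One PID per write.  A process may exit before it is
        # moved, so a failed write is reported but is not fatal.
        echo "$pid" > "$leaf_dir/cgroup.procs" ||
            echo "could not move PID $pid" >&2
    done
}
```

Processes created while the loop is running may remain in the parent, so a caller might repeat the call until the parent's cgroup.procs file is empty.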
2468f14e | 855 | .\" |
754f4cf5 | 856 | .SS Cgroups v2 cgroup.events file |
71e2545e MK |
857 | Each nonroot cgroup in the v2 hierarchy contains a read-only file, |
858 | .IR cgroup.events , | |
859 | whose contents are key-value pairs | |
754f4cf5 | 860 | (delimited by newline characters, with the key and value separated by spaces) |
e00e18a2 | 861 | providing state information about the cgroup: |
71e2545e MK |
862 | .PP |
863 | .in +4n | |
864 | .EX | |
865 | $ \fBcat mygrp/cgroup.events\fP | |
866 | populated 1 | |
c309dee7 | 867 | frozen 0 |
71e2545e MK |
868 | .EE |
869 | .in | |
870 | .PP | |
871 | The following keys may appear in this file: | |
872 | .TP | |
1ae6b2c7 | 873 | .I populated |
71e2545e MK |
874 | The value of this key is either 1, |
875 | if this cgroup or any of its descendants has member processes, | |
876 | or otherwise 0. | |
c309dee7 MK |
877 | .TP |
878 | .IR frozen " (since Linux 5.2)" | |
879 | .\" commit 76f969e8948d82e78e1bc4beb6b9465908e7487 | |
880 | The value of this key is 1 if this cgroup is currently frozen, | |
881 | or 0 if it is not. | |
a721e8b2 | 882 | .PP |
754f4cf5 | 883 | The |
1ae6b2c7 | 884 | .I cgroup.events |
71e2545e MK |
885 | file can be monitored, in order to receive notification when the value of |
886 | one of its keys changes. | |
887 | Such monitoring can be done using | |
754f4cf5 | 888 | .BR inotify (7), |
71e2545e | 889 | which notifies changes as |
1ae6b2c7 | 890 | .B IN_MODIFY |
71e2545e | 891 | events, or |
754f4cf5 | 892 | .BR poll (2), |
71e2545e | 893 | which notifies changes by returning the |
754f4cf5 | 894 | .B POLLPRI |
7747ed97 MK |
895 | and |
896 | .B POLLERR | |
71e2545e | 897 | bits in the |
1ae6b2c7 | 898 | .I revents |
7747ed97 | 899 | field. |
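After a notification arrives, the file has to be reread to learn the new value. A minimal sketch of looking up one key follows; the function name cgroup_event is hypothetical.

```shell
# Print the value of one key from a cgroup's cgroup.events file.
# $1: cgroup directory; $2: key (for example, populated).
# Fails if the key is not present in the file.
cgroup_event() {
    cgroup_dir=$1
    key=$2
    # Each line has the form "key value".
    awk -v key="$key" '
        $1 == key { print $2; found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$cgroup_dir/cgroup.events"
}
```

For example, "cgroup_event mygrp populated" prints 1 for the file contents shown above.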
71e2545e MK |
900 | .\" |
901 | .SS Cgroup v2 release notification | |
902 | Cgroups v2 provides a new mechanism for obtaining notification | |
903 | when a cgroup becomes empty. | |
904 | The cgroups v1 | |
1ae6b2c7 | 905 | .I release_agent |
71e2545e | 906 | and |
1ae6b2c7 | 907 | .I notify_on_release |
71e2545e | 908 | files are removed, and replaced by the |
ccb1a262 | 909 | .I populated |
71e2545e | 910 | key in the |
1ae6b2c7 | 911 | .I cgroup.events |
71e2545e MK |
912 | file. |
913 | This key either has the value 0, | |
914 | meaning that the cgroup (and its descendants) | |
915 | contain no (nonzombie) member processes, | |
916 | or 1, meaning that the cgroup (or one of its descendants) | |
917 | contains member processes. | |
918 | .PP | |
919 | The cgroups v2 release-notification mechanism | |
daf57a6a | 920 | offers the following advantages over the cgroups v1 |
1ae6b2c7 | 921 | .I release_agent |
daf57a6a | 922 | mechanism: |
22356d97 | 923 | .IP \(bu 3 |
daf57a6a | 924 | It allows for cheaper notification, |
754f4cf5 | 925 | since a single process can monitor multiple |
1ae6b2c7 | 926 | .I cgroup.events |
71e2545e | 927 | files (using the techniques described earlier). |
daf57a6a MK |
928 | By contrast, the cgroups v1 mechanism requires the expense of creating |
929 | a process for each notification. | |
22356d97 | 930 | .IP \(bu |
daf57a6a MK |
931 | Notification for different cgroup subhierarchies can be delegated |
932 | to different processes. | |
933 | By contrast, the cgroups v1 mechanism allows only one release agent | |
934 | for an entire hierarchy. | |
c91a9f8a | 935 | .\" |
5e071499 MK |
936 | .SS Cgroups v2 cgroup.stat file |
937 | .\" commit ec39225cca42c05ac36853d11d28f877fde5c42e | |
938 | Each cgroup in the v2 hierarchy contains a read-only | |
1ae6b2c7 | 939 | .I cgroup.stat |
5e071499 MK |
940 | file (first introduced in Linux 4.14) |
941 | that consists of lines containing key-value pairs. | |
942 | The following keys currently appear in this file: | |
943 | .TP | |
944 | .I nr_descendants | |
945 | This is the total number of visible (i.e., living) descendant cgroups | |
946 | underneath this cgroup. | |
947 | .TP | |
948 | .I nr_dying_descendants | |
949 | This is the total number of dying descendant cgroups | |
950 | underneath this cgroup. | |
951 | A cgroup enters the dying state after being deleted. | |
952 | It remains in that state for an undefined period | |
953 | (which will depend on system load) | |
c7f63e74 MK |
954 | while resources are freed before the cgroup is destroyed. |
955 | Note that the presence of some cgroups in the dying state is normal, | |
956 | and is not indicative of any problem. | |
5e071499 MK |
957 | .IP |
958 | A process can't be made a member of a dying cgroup, | |
959 | and a dying cgroup can't be brought back to life. | |
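The two counters can be read together, as in the following sketch. The function name cgroup_descendant_counts and the output format are assumptions made for illustration.

```shell
# Report the live and dying descendant counts of a cgroup.
# $1: cgroup directory.
cgroup_descendant_counts() {
    cgroup_dir=$1
    awk '
        $1 == "nr_descendants"       { live = $2 }
        $1 == "nr_dying_descendants" { dying = $2 }
        END { printf "live=%d dying=%d\n", live, dying }
    ' "$cgroup_dir/cgroup.stat"
}
```

Keys other than these two are ignored, so the sketch should tolerate additional lines in the file.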
960 | .\" | |
5845e10b MK |
961 | .SS Limiting the number of descendant cgroups |
962 | Each cgroup in the v2 hierarchy contains the following files, | |
963 | which can be used to view and set limits on the number | |
964 | of descendant cgroups under that cgroup: | |
965 | .TP | |
966 | .IR cgroup.max.depth " (since Linux 4.14)" | |
967 | .\" commit 1a926e0bbab83bae8207d05a533173425e0496d1 | |
968 | This file defines a limit on the depth of nesting of descendant cgroups. | |
969 | A value of 0 in this file means that no descendant cgroups can be created. | |
970 | An attempt to create a descendant whose nesting level exceeds | |
971 | the limit fails | |
972 | .RI ( mkdir (2) | |
973 | fails with the error | |
974 | .BR EAGAIN ). | |
975 | .IP | |
976 | Writing the string | |
1ae6b2c7 | 977 | .I """max""" |
5845e10b MK |
978 | to this file means that no limit is imposed. |
979 | The default value in this file is | |
1ae6b2c7 | 980 | .I """max""" . |
5845e10b MK |
981 | .TP |
982 | .IR cgroup.max.descendants " (since Linux 4.14)" | |
983 | .\" commit 1a926e0bbab83bae8207d05a533173425e0496d1 | |
984 | This file defines a limit on the number of live descendant cgroups that | |
985 | this cgroup may have. | |
986 | An attempt to create more descendants than allowed by the limit fails | |
987 | .RI ( mkdir (2) | |
988 | fails with the error | |
989 | .BR EAGAIN ). | |
990 | .IP | |
991 | Writing the string | |
1ae6b2c7 | 992 | .I """max""" |
5845e10b MK |
993 | to this file means that no limit is imposed. |
994 | The default value in this file is | |
995 | .IR """max""" . | |
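Both files accept either a nonnegative integer or the string max. The following sketch validates a value before writing it; the function name set_cgroup_limit and its argument order are hypothetical.

```shell
# Set cgroup.max.depth or cgroup.max.descendants for a cgroup.
# $1: cgroup directory
# $2: "depth" or "descendants"
# $3: a nonnegative integer, or "max" for no limit
set_cgroup_limit() {
    cgroup_dir=$1
    which=$2
    value=$3
    case $which in
        depth|descendants) ;;
        *) echo "unknown limit: $which" >&2; return 2 ;;
    esac
    case $value in
        max) ;;
        # Reject the empty string and anything containing a nondigit.
        ''|*[!0-9]*) echo "invalid value: $value" >&2; return 2 ;;
    esac
    echo "$value" > "$cgroup_dir/cgroup.max.$which"
}
```

For example, "set_cgroup_limit /sys/fs/cgroup/mygrp depth 0" (the pathname is illustrative) would prevent the creation of any descendant cgroups.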
996 | .\" | |
4b1c2041 | 997 | .SH CGROUPS DELEGATION: DELEGATING A HIERARCHY TO A LESS PRIVILEGED USER |
4242dfbe MK |
998 | In the context of cgroups, |
999 | delegation means passing management of some subtree | |
51629a30 | 1000 | of the cgroup hierarchy to a nonprivileged user. |
87b18a8b MK |
1001 | Cgroups v1 provides support for delegation based on file permissions |
1002 | in the cgroup hierarchy but with less strict containment rules than v2 | |
1003 | (as noted below). | |
1004 | Cgroups v2 supports delegation with containment by explicit design. | |
4b1c2041 MK |
1005 | The focus of the discussion in this section is on delegation in cgroups v2, |
1006 | with some differences for cgroups v1 noted along the way. | |
4242dfbe MK |
1007 | .PP |
1008 | Some terminology is required in order to describe delegation. | |
1009 | A | |
1010 | .I delegater | |
1011 | is a privileged user (i.e., root) who owns a parent cgroup. | |
1012 | A | |
1013 | .I delegatee | |
1014 | is a nonprivileged user who will be granted the permissions needed | |
1015 | to manage some subhierarchy under that parent cgroup, | |
1016 | known as the | |
1017 | .IR "delegated subtree" . | |
1018 | .PP | |
1019 | To perform delegation, | |
1020 | the delegater makes certain directories and files writable by the delegatee, | |
1021 | typically by changing the ownership of the objects to be the user ID | |
1022 | of the delegatee. | |
0735069b MK |
1023 | Assuming that we want to delegate the hierarchy rooted at (say) |
1024 | .I /dlgt_grp | |
4242dfbe MK |
1025 | and that there are not yet any child cgroups under that cgroup, |
1026 | the ownership of the following is changed to the user ID of the delegatee: | |
1027 | .TP | |
1ae6b2c7 | 1028 | .I /dlgt_grp |
4242dfbe MK |
1029 | Changing the ownership of the root of the subtree means that any new |
1030 | cgroups created under the subtree (and the files they contain) | |
1031 | will also be owned by the delegatee. | |
1032 | .TP | |
1ae6b2c7 | 1033 | .I /dlgt_grp/cgroup.procs |
f7286edc | 1034 | Changing the ownership of this file means that the delegatee |
4242dfbe MK |
1035 | can move processes into the root of the delegated subtree. |
1036 | .TP | |
4b1c2041 | 1037 | .IR /dlgt_grp/cgroup.subtree_control " (cgroups v2 only)" |
15f2303d | 1038 | Changing the ownership of this file means that the delegatee |
e5936eb6 | 1039 | can enable controllers (that are present in |
0735069b | 1040 | .IR /dlgt_grp/cgroup.controllers ) |
4242dfbe | 1041 | in order to further redistribute resources at lower levels in the subtree. |
e5936eb6 MK |
1042 | (As an alternative to changing the ownership of this file, |
1043 | the delegater might instead add selected controllers to this file.) | |
639b6c8c | 1044 | .TP |
4b1c2041 | 1045 | .IR /dlgt_grp/cgroup.threads " (cgroups v2 only)" |
639b6c8c MK |
1046 | Changing the ownership of this file is necessary if a threaded subtree |
1047 | is being delegated (see the description of "thread mode", below). | |
7b327dd5 | 1048 | This permits the delegatee to write thread IDs to the file. |
cd7f4c49 MK |
1049 | (The ownership of this file can also be changed when delegating |
1050 | a domain subtree, but currently this serves no purpose, | |
1051 | since, as described below, it is not possible to move a thread between | |
1052 | domain cgroups by writing its thread ID to the | |
1ae6b2c7 | 1053 | .I cgroup.threads |
cd7f4c49 | 1054 | file.) |
4b1c2041 MK |
1055 | .IP |
1056 | In cgroups v1, the corresponding file that should instead be delegated is the | |
1057 | .I tasks | |
1058 | file. | |
4242dfbe MK |
1059 | .PP |
1060 | The delegater should | |
1061 | .I not | |
1062 | change the ownership of any of the controller interface files (e.g., | |
1063 | .IR pids.max , | |
1064 | .IR memory.high ) | |
1065 | in | |
0735069b | 1066 | .IR /dlgt_grp . |
4242dfbe MK |
1067 | Those files are used from the next level above the delegated subtree |
1068 | in order to distribute resources into the subtree, | |
1069 | and the delegatee should not have permission to change | |
1070 | the resources that are distributed into the delegated subtree. | |
1071 | .PP | |
668ef765 | 1072 | See also the discussion of the |
1ae6b2c7 | 1073 | .I /sys/kernel/cgroup/delegate |
4b1c2041 | 1074 | file in NOTES for information about further delegatable files in cgroups v2. |
668ef765 | 1075 | .PP |
4242dfbe MK |
1076 | After the aforementioned steps have been performed, |
1077 | the delegatee can create child cgroups within the delegated subtree | |
6dc513cd MK |
1078 | (the cgroup subdirectories and the files they contain |
1079 | will be owned by the delegatee) | |
4242dfbe MK |
1080 | and move processes between cgroups in the subtree. |
1081 | If some controllers are present in | |
0735069b | 1082 | .IR /dlgt_grp/cgroup.subtree_control , |
4242dfbe | 1083 | or the ownership of that file was passed to the delegatee, |
f7286edc | 1084 | the delegatee can also control the further redistribution |
4242dfbe | 1085 | of the corresponding resources into the delegated subtree. |
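The ownership changes described in this section can be sketched as a small helper that the delegater runs. The function name delegate_cgroup is hypothetical. The sketch deliberately leaves the controller interface files untouched, and it skips files that do not exist, as would be the case for the v2-only files in a v1 hierarchy.

```shell
# Delegate a cgroup subtree to a user by changing ownership.
# $1: directory of the cgroup at the root of the delegated subtree
# $2: user name or numeric user ID of the delegatee
delegate_cgroup() {
    cgroup_dir=$1
    user=$2
    # The directory itself and cgroup.procs are always delegated.
    chown "$user" "$cgroup_dir" "$cgroup_dir/cgroup.procs" || return 1
    # These files exist only in cgroups v2.
    for f in cgroup.subtree_control cgroup.threads; do
        if [ -e "$cgroup_dir/$f" ]; then
            chown "$user" "$cgroup_dir/$f" || return 1
        fi
    done
}
```

In a v1 hierarchy, the tasks file would be delegated as well; that step is omitted from this sketch.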
27b086e9 | 1086 | .\" |
ed3f4f34 | 1087 | .SS Cgroups v2 delegation: nsdelegate and cgroup namespaces |
ed3f4f34 MK |
1088 | Starting with Linux 4.13, |
1089 | .\" commit 5136f6365ce3eace5a926e10f16ed2a233db5ba9 | |
4b1c2041 | 1090 | there is a second way to perform cgroup delegation in the cgroups v2 hierarchy. |
07361828 | 1091 | This is done by mounting or remounting the cgroup v2 filesystem with the |
ed3f4f34 | 1092 | .I nsdelegate |
07361828 MK |
1093 | mount option. |
1094 | For example, if the cgroup v2 filesystem has already been mounted, | |
1095 | we can remount it with the | |
1096 | .I nsdelegate | |
1097 | option as follows: | |
ed3f4f34 MK |
1098 | .PP |
1099 | .in +4n | |
1100 | .EX | |
fb6d2c09 | 1101 | mount \-t cgroup2 \-o remount,nsdelegate \e |
07361828 | 1102 | none /sys/fs/cgroup/unified |
ed3f4f34 MK |
1103 | .EE |
1104 | .in | |
07361828 | 1105 | .\" |
6a0aa2ec | 1106 | .\" Alternatively, we could boot the kernel with the options: |
07361828 MK |
1107 | .\" |
1108 | .\" cgroup_no_v1=all systemd.legacy_systemd_cgroup_controller | |
1109 | .\" | |
1110 | .\" The effect of the latter option is to prevent systemd from employing | |
1111 | .\" its "hybrid" cgroup mode, where it tries to make use of cgroups v2. | |
ed3f4f34 | 1112 | .PP |
dc581e07 | 1113 | The effect of this mount option is to cause cgroup namespaces |
ed3f4f34 MK |
1114 | to automatically become delegation boundaries. |
1115 | More specifically, | |
1116 | the following restrictions apply for processes inside the cgroup namespace: | |
22356d97 | 1117 | .IP \(bu 3 |
446d1643 | 1118 | Writes to controller interface files in the root directory of the namespace |
ed3f4f34 MK |
1119 | will fail with the error |
1120 | .BR EPERM . | |
1121 | Processes inside the cgroup namespace can still write to delegatable | |
446d1643 | 1122 | files in the root directory of the cgroup namespace such as |
1ae6b2c7 | 1123 | .I cgroup.procs |
ed3f4f34 MK |
1124 | and |
1125 | .IR cgroup.subtree_control , | |
446d1643 | 1126 | and can create a subhierarchy underneath the root directory. |
22356d97 | 1127 | .IP \(bu |
ed3f4f34 MK |
1128 | Attempts to migrate processes across the namespace boundary are denied |
1129 | (with the error | |
1130 | .BR ENOENT ). | |
1131 | Processes inside the cgroup namespace can still | |
1132 | (subject to the containment rules described below) | |
1133 | move processes between cgroups | |
1134 | .I within | |
1135 | the subhierarchy under the namespace root. | |
1136 | .PP | |
1137 | The ability to define cgroup namespaces as delegation boundaries | |
1138 | makes cgroup namespaces more useful. | |
1139 | To understand why, suppose that we already have one cgroup hierarchy | |
1140 | that has been delegated to a nonprivileged user, | |
1141 | .IR cecilia , | |
1142 | using the older delegation technique described above. | |
1143 | Suppose further that | |
1144 | .I cecilia | |
1145 | wanted to further delegate a subhierarchy | |
1146 | under the existing delegated hierarchy. | |
1147 | (For example, the delegated hierarchy might be associated with | |
1148 | an unprivileged container run by | |
1149 | .IR cecilia .) | |
1150 | Even if a cgroup namespace was employed, | |
1151 | because both hierarchies are owned by the unprivileged user | |
1152 | .IR cecilia , | |
1153 | the following illegitimate actions could be performed: | |
22356d97 | 1154 | .IP \(bu 3 |
ed3f4f34 | 1155 | A process in the inferior hierarchy could change the |
619dbe1c | 1156 | resource controller settings in the root directory of that hierarchy. |
ed3f4f34 MK |
1157 | (These resource controller settings are intended to allow control to |
1158 | be exercised from the | |
1159 | .I parent | |
1160 | cgroup; | |
1161 | a process inside the child cgroup should not be allowed to modify them.) | |
22356d97 | 1162 | .IP \(bu |
ed3f4f34 MK |
1163 | A process inside the inferior hierarchy could move processes |
1164 | into and out of the inferior hierarchy if the cgroups in the | |
1165 | superior hierarchy were somehow visible. | |
1166 | .PP | |
1167 | Employing the | |
1168 | .I nsdelegate | |
1169 | mount option prevents both of these possibilities. | |
1170 | .PP | |
1171 | The | |
1172 | .I nsdelegate | |
1173 | mount option only has an effect when performed in | |
1174 | the initial mount namespace; | |
1175 | in other mount namespaces, the option is silently ignored. | |
07361828 MK |
1176 | .PP |
1177 | .IR Note : | |
1178 | On some systems, | |
1179 | .BR systemd (1) | |
1180 | automatically mounts the cgroup v2 filesystem. | |
1181 | In order to experiment with the | |
1182 | .I nsdelegate | |
44084d19 MK |
1183 | operation, it may be useful to boot the kernel with |
1184 | the following command-line options: | |
1185 | .PP | |
1186 | .in +4n | |
1187 | .EX | |
1188 | cgroup_no_v1=all systemd.legacy_systemd_cgroup_controller | |
1189 | .EE | |
1190 | .in | |
1191 | .PP | |
1192 | These options cause the kernel to boot with the cgroups v1 controllers | |
1193 | disabled (meaning that the controllers are available in the v2 hierarchy), | |
1194 | and tell | |
1195 | .BR systemd (1) | |
1196 | not to mount and use the cgroup v2 hierarchy, | |
1197 | so that the v2 hierarchy can be manually mounted | |
1198 | with the desired options after boot-up. | |
ed3f4f34 | 1199 | .\" |
4b1c2041 | 1200 | .SS Cgroup delegation containment rules |
4242dfbe | 1201 | Some delegation |
1ae6b2c7 | 1202 | .I containment rules |
4242dfbe MK |
1203 | ensure that the delegatee can move processes between cgroups within the |
1204 | delegated subtree, | |
1205 | but can't move processes from outside the delegated subtree into | |
1206 | the subtree or vice versa. | |
1207 | A nonprivileged process (i.e., the delegatee) can write the PID of | |
1208 | a "target" process into a | |
1ae6b2c7 | 1209 | .I cgroup.procs |
4242dfbe | 1210 | file only if all of the following are true: |
22356d97 | 1211 | .IP \(bu 3 |
4242dfbe MK |
1212 | The writer has write permission on the |
1213 | .I cgroup.procs | |
1214 | file in the destination cgroup. | |
22356d97 | 1215 | .IP \(bu |
4242dfbe MK |
1216 | The writer has write permission on the |
1217 | .I cgroup.procs | |
396761ee | 1218 | file in the nearest common ancestor of the source and destination cgroups. |
e366c4d4 MK |
1219 | Note that in some cases, |
1220 | the nearest common ancestor may be the source or destination cgroup itself. | |
4b1c2041 MK |
1221 | This requirement is not enforced for cgroups v1 hierarchies, |
1222 | with the consequence that containment in v1 is less strict than in v2. | |
1223 | (For example, in cgroups v1 the user that owns two distinct | |
1224 | delegated subhierarchies can move a process between the hierarchies.) | |
22356d97 | 1225 | .IP \(bu |
ed3f4f34 MK |
1226 | If the cgroup v2 filesystem was mounted with the |
1227 | .I nsdelegate | |
7b574df5 | 1228 | option, the writer must be able to see the source and destination cgroups |
ed3f4f34 | 1229 | from its cgroup namespace. |
22356d97 | 1230 | .IP \(bu |
4b1c2041 | 1231 | In cgroups v1: |
28f612ea MK |
1232 | the effective UID of the writer (i.e., the delegatee) matches the |
1233 | real user ID or the saved set-user-ID of the target process. | |
4b1c2041 MK |
1234 | Before Linux 4.11, |
1235 | .\" commit 576dd464505fc53d501bb94569db76f220104d28 | |
1236 | this requirement also applied in cgroups v2. | |
28f612ea MK |
1237 | (This was a historical requirement inherited from cgroups v1 |
1238 | that was later deemed unnecessary, | |
1239 | since the other rules suffice for containment in cgroups v2.) | |
4242dfbe MK |
1240 | .PP |
1241 | .IR Note : | |
1242 | one consequence of these delegation containment rules is that the | |
0735069b MK |
1243 | unprivileged delegatee can't place the first process into |
1244 | the delegated subtree; | |
1245 | instead, the delegater must place the first process | |
1246 | (a process owned by the delegatee) into the delegated subtree. | |
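The rule concerning the nearest common ancestor can be easier to apply once that ancestor has been computed. The following sketch derives it from two cgroup pathnames, expressed relative to the root of the hierarchy. The function name common_ancestor is hypothetical, and the sketch works purely on pathname strings, without consulting the filesystem.

```shell
# Print the nearest common ancestor of two cgroup pathnames.
# $1, $2: pathnames relative to the hierarchy root, e.g. /a/b.
common_ancestor() {
    a=$1
    b=$2
    result=""
    while :; do
        # Drop the leading slash, then compare first components.
        a=${a#/}
        b=${b#/}
        [ -n "$a" ] && [ -n "$b" ] || break
        ca=${a%%/*}
        cb=${b%%/*}
        [ "$ca" = "$cb" ] || break
        result="$result/$ca"
        # Remove the component that was just matched.
        case $a in */*) a=${a#*/} ;; *) a="" ;; esac
        case $b in */*) b=${b#*/} ;; *) b="" ;; esac
    done
    # An empty result means only the root cgroup is shared.
    printf '%s\n' "${result:-/}"
}
```

The writer then needs write permission on the cgroup.procs file in the printed cgroup, as well as on the one in the destination cgroup.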
4242dfbe | 1247 | .\" |
75e83bc2 | 1248 | .SH CGROUPS VERSION 2 THREAD MODE |
c8902e25 MK |
1249 | Among the restrictions imposed by cgroups v2 that were not present |
1250 | in cgroups v1 are the following: | |
22356d97 | 1251 | .IP \(bu 3 |
c8902e25 MK |
1252 | .IR "No thread-granularity control" : |
1253 | all of the threads of a process must be in the same cgroup. | |
22356d97 | 1254 | .IP \(bu |
c8902e25 MK |
1255 | .IR "No internal processes" : |
1256 | a cgroup can't both have member processes and | |
1257 | exercise controllers on child cgroups. | |
1258 | .PP | |
1259 | Both of these restrictions were added because | |
1260 | the lack of these restrictions had caused problems | |
1261 | in cgroups v1. | |
1262 | In particular, the cgroups v1 ability to allow thread-level granularity | |
1263 | for cgroup membership made no sense for some controllers. | |
1264 | (A notable example was the | |
1265 | .I memory | |
1266 | controller: since threads share an address space, | |
1267 | it made no sense to split threads across different | |
1268 | .I memory | |
1269 | cgroups.) | |
1270 | .PP | |
1271 | Notwithstanding the initial design decision in cgroups v2, | |
1272 | there were use cases for certain controllers, notably the | |
1ae6b2c7 | 1273 | .I cpu |
c8902e25 MK |
1274 | controller, |
1275 | for which thread-level granularity of control was meaningful and useful. | |
1276 | To accommodate such use cases, Linux 4.14 added | |
1277 | .I "thread mode" | |
1278 | for cgroups v2. | |
1279 | .PP | |
1280 | Thread mode allows the following: | |
22356d97 | 1281 | .IP \(bu 3 |
c8902e25 | 1282 | The creation of |
1ae6b2c7 | 1283 | .I threaded subtrees |
c8902e25 MK |
1284 | in which the threads of a process may |
1285 | be spread across cgroups inside the tree. | |
1286 | (A threaded subtree may contain multiple multithreaded processes.) | |
22356d97 | 1287 | .IP \(bu |
c8902e25 | 1288 | The concept of |
1ae6b2c7 | 1289 | .IR "threaded controllers" , |
c8902e25 | 1290 | which can distribute resources across the cgroups in a threaded subtree. |
22356d97 | 1291 | .IP \(bu |
c8902e25 MK |
1292 | A relaxation of the "no internal processes" rule, | |
1293 | so that, within a threaded subtree, | |
1294 | a cgroup can both contain member threads and | |
1295 | exercise resource control over child cgroups. | |
1296 | .PP | |
1297 | With the addition of thread mode, | |
1298 | each nonroot cgroup now contains a new file, | |
1299 | .IR cgroup.type , | |
1300 | that exposes, and in some circumstances can be used to change, | |
1301 | the "type" of a cgroup. | |
1302 | This file contains one of the following type values: | |
1303 | .TP | |
1ae6b2c7 | 1304 | .I domain |
c8902e25 MK |
1305 | This is a normal v2 cgroup that provides process-granularity control. |
1306 | If a process is a member of this cgroup, | |
1307 | then all threads of the process are (by definition) in the same cgroup. | |
1308 | This is the default cgroup type, | |
1309 | and provides the same behavior that was provided for | |
1310 | cgroups in the initial cgroups v2 implementation. | |
1311 | .TP | |
1ae6b2c7 | 1312 | .I threaded |
c8902e25 MK |
1313 | This cgroup is a member of a threaded subtree. |
1314 | Threads can be added to this cgroup, | |
1315 | and controllers can be enabled for the cgroup. | |
1316 | .TP | |
1ae6b2c7 | 1317 | .I domain threaded |
c8902e25 MK |
1318 | This is a domain cgroup that serves as the root of a threaded subtree. |
1319 | This cgroup type is also known as "threaded root". | |
1320 | .TP | |
1ae6b2c7 | 1321 | .I domain invalid |
c8902e25 MK |
1322 | This is a cgroup inside a threaded subtree |
1323 | that is in an "invalid" state. | |
1324 | Processes can't be added to the cgroup, | |
1325 | and controllers can't be enabled for the cgroup. | |
1326 | The only thing that can be done with this cgroup (other than deleting it) | |
1327 | is to convert it to a | |
1ae6b2c7 | 1328 | .I threaded |
c8902e25 | 1329 | cgroup by writing the string |
1ae6b2c7 | 1330 | .I """threaded""" |
c8902e25 MK |
1331 | to the |
1332 | .I cgroup.type | |
1333 | file. | |
61254835 MK |
1334 | .IP |
1335 | The rationale for the existence of this "interim" type | |
1336 | during the creation of a threaded subtree | |
1337 | (rather than the kernel simply immediately converting all cgroups | |
1338 | under the threaded root to the type | |
1339 | .IR threaded ) | |
1340 | is to allow for | |
1341 | possible future extensions to the thread mode model. | |
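A script that manages threaded subtrees might first read the type of a cgroup and check it against the four values listed above. A minimal sketch follows; the function name cgroup_type is hypothetical.

```shell
# Print the type of a cgroup, or fail if the value is unexpected.
# $1: cgroup directory (nonroot, since the root has no cgroup.type).
cgroup_type() {
    cgroup_dir=$1
    ctype=$(cat "$cgroup_dir/cgroup.type") || return 1
    case $ctype in
        domain|threaded|"domain threaded"|"domain invalid")
            printf '%s\n' "$ctype"
            ;;
        *)
            echo "unexpected cgroup.type value: $ctype" >&2
            return 1
            ;;
    esac
}
```

A caller could, for instance, write the string threaded to cgroup.type only when this function reports "domain invalid".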
c8902e25 MK |
1342 | .\" |
1343 | .SS Threaded versus domain controllers | |
1344 | With the addition of thread mode, | |
1345 | cgroups v2 now distinguishes two types of resource controllers: | |
22356d97 | 1346 | .IP \(bu 3 |
c8902e25 | 1347 | .I Threaded |
2cd9bbfa | 1348 | .\" In the kernel source, look for ".threaded[ \t]*= true" in |
218eadf4 | 1349 | .\" initializations of struct cgroup_subsys |
c8902e25 MK |
1350 | controllers: these controllers support thread-granularity for |
1351 | resource control and can be enabled inside threaded subtrees, | |
1352 | with the result that the corresponding controller-interface files | |
1353 | appear inside the cgroups in the threaded subtree. | |
aa2c3623 | 1354 | As at Linux 4.19, the following controllers are threaded: |
c8902e25 MK |
1355 | .IR cpu , |
1356 | .IR perf_event , | |
1357 | and | |
1358 | .IR pids . | |
22356d97 | 1359 | .IP \(bu |
c8902e25 MK |
1360 | .I Domain |
1361 | controllers: these controllers support only process granularity | |
1362 | for resource control. | |
1363 | From the perspective of a domain controller, | |
1364 | all threads of a process are always in the same cgroup. | |
1365 | Domain controllers can't be enabled inside a threaded subtree. | |
1366 | .\" | |
1367 | .SS Creating a threaded subtree | |
1368 | There are two pathways that lead to the creation of a threaded subtree. | |
1369 | The first pathway proceeds as follows: | |
22356d97 | 1370 | .IP (1) 5 |
c8902e25 | 1371 | We write the string |
1ae6b2c7 | 1372 | .I """threaded""" |
c8902e25 MK |
1373 | to the |
1374 | .I cgroup.type | |
1375 | file of a cgroup | |
1ae6b2c7 | 1376 | .I y/z |
c8902e25 MK |
1377 | that currently has the type |
1378 | .IR domain . | |
1379 | This has the following effects: | |
1380 | .RS | |
22356d97 | 1381 | .IP \(bu 3 |
c8902e25 | 1382 | The type of the cgroup |
1ae6b2c7 | 1383 | .I y/z |
c8902e25 MK |
1384 | becomes |
1385 | .IR threaded . | |
22356d97 | 1386 | .IP \(bu |
c8902e25 MK |
1387 | The type of the parent cgroup, |
1388 | .IR y , | |
1389 | becomes | |
1390 | .IR "domain threaded" . | |
1391 | The parent cgroup is the root of a threaded subtree | |
1392 | (also known as the "threaded root"). | |
22356d97 | 1393 | .IP \(bu |
c8902e25 | 1394 | All other cgroups under |
1ae6b2c7 | 1395 | .I y |
c8902e25 | 1396 | that were not already of type |
1ae6b2c7 | 1397 | .I threaded |
c8902e25 MK |
1398 | (because they were inside already existing threaded subtrees |
1399 | under the new threaded root) | |
1400 | are converted to type | |
1401 | .IR "domain invalid" . | |
1402 | Any subsequently created cgroups under | |
1403 | .I y | |
1404 | will also have the type | |
1405 | .IR "domain invalid" . | |
1406 | .RE | |
22356d97 | 1407 | .IP (2) |
c8902e25 | 1408 | We write the string |
1ae6b2c7 | 1409 | .I """threaded""" |
c8902e25 | 1410 | to each of the |
1ae6b2c7 | 1411 | .I domain invalid |
c8902e25 MK |
1412 | cgroups under |
1413 | .IR y , | |
1414 | in order to convert them to the type | |
1415 | .IR threaded . | |
1416 | As a consequence of this step, all cgroups under the threaded root | |
1417 | now have the type | |
1ae6b2c7 | 1418 | .I threaded |
c8902e25 MK |
1419 | and the threaded subtree is now fully usable. |
1420 | The requirement to write | |
1ae6b2c7 | 1421 | .I """threaded""" |
c8902e25 MK |
1422 | to each of these cgroups is somewhat cumbersome, |
1423 | but allows for possible future extensions to the thread-mode model. | |
1424 | .PP | |
1425 | The second way of creating a threaded subtree is as follows: | |
22356d97 | 1426 | .IP (1) 5 |
c8902e25 MK |
1427 | In an existing cgroup, |
1428 | .IR z , | |
1429 | that currently has the type | |
1430 | .IR domain , | |
22356d97 AC |
1431 | we (1.1) enable one or more threaded controllers and |
1432 | (1.2) make a process a member of | |
c8902e25 MK |
1433 | .IR z . |
1434 | (These two steps can be done in either order.) | |
1435 | This has the following consequences: | |
1436 | .RS | |
22356d97 | 1437 | .IP \(bu 3 |
c8902e25 MK |
1438 | The type of |
1439 | .I z | |
1440 | becomes | |
1441 | .IR "domain threaded" . | |
22356d97 | 1442 | .IP \(bu |
c8902e25 MK |
1443 | All of the descendant cgroups of |
1444 | .I z | |
7a1cddd2 | 1445 | that were not already of type |
1ae6b2c7 | 1446 | .I threaded |
c8902e25 MK |
1447 | are converted to type |
1448 | .IR "domain invalid" . | |
1449 | .RE | |
22356d97 | 1450 | .IP (2) |
c8902e25 | 1451 | As before, we make the threaded subtree usable by writing the string |
1ae6b2c7 | 1452 | .I """threaded""" |
c8902e25 | 1453 | to each of the |
1ae6b2c7 | 1454 | .I domain invalid |
c8902e25 MK |
1455 | cgroups under |
1456 | .IR z , | |
1457 | in order to convert them to the type | |
1458 | .IR threaded . | |
1459 | .PP | |
1460 | One of the consequences of the above pathways to creating a threaded subtree | |
1461 | is that the threaded root cgroup can be a parent only to | |
1462 | .I threaded | |
1463 | (and | |
1464 | .IR "domain invalid" ) | |
1465 | cgroups. | |
1466 | The threaded root cgroup can't be a parent of a | |
1467 | .I domain | |
1468 | cgroup, and a | |
1469 | .I threaded | |
1470 | cgroup | |
1471 | can't have a sibling that is a | |
1472 | .I domain | |
1473 | cgroup. | |
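.PP
As a minimal sketch of step (2) in either pathway,
the following Python helper writes the string "threaded" to the
.I cgroup.type
file of every
.I domain invalid
cgroup below a threaded root.
The function name
.I convert_invalid_cgroups
and its argument are hypothetical, introduced only for illustration.
The walk visits parents before their children,
so that no cgroup is converted while its parent is still
.IR "domain invalid" .
.PP
.in +4n
.EX
```python
import os


def convert_invalid_cgroups(threaded_root):
    """Convert each "domain invalid" cgroup below threaded_root.

    os.walk() yields a directory before its subdirectories, so
    parents are converted before children (top-down order).
    Returns the list of cgroup directories that were converted.
    """
    converted = []
    for dirpath, _dirnames, _filenames in os.walk(threaded_root):
        if dirpath == threaded_root:
            continue  # the threaded root itself is left untouched
        type_file = os.path.join(dirpath, "cgroup.type")
        with open(type_file) as f:
            current = f.read().strip()
        if current == "domain invalid":
            with open(type_file, "w") as f:
                f.write("threaded")
            converted.append(dirpath)
    return converted
```
.EE
.in
.PP
On a real system, the argument would be the pathname of the threaded root
inside the cgroup v2 filesystem,
and the caller needs write permission on each
.I cgroup.type
file.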
1474 | .\" | |
1475 | .SS Using a threaded subtree | |
1476 | Within a threaded subtree, threaded controllers can be enabled | |
1477 | in each subgroup whose type has been changed to | |
1478 | .IR threaded ; | |
1479 | upon doing so, the corresponding controller interface files | |
1480 | appear in the children of that cgroup. | |
1481 | .PP | |
1482 | A process can be moved into a threaded subtree by writing its PID to the | |
1483 | .I cgroup.procs | |
1484 | file in one of the cgroups inside the tree. | |
1485 | This has the effect of making all of the threads | |
1486 | in the process members of the corresponding cgroup | |
1487 | and makes the process a member of the threaded subtree. | |
1488 | The threads of the process can then be spread across | |
1489 | the threaded subtree by writing their thread IDs (see | |
1490 | .BR gettid (2)) | |
1491 | to the | |
b2c3e720 | 1492 | .I cgroup.threads |
c8902e25 MK |
1493 | files in different cgroups inside the subtree. |
1494 | The threads of a process must all reside in the same threaded subtree. | |
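.PP
The placement of a single thread can be sketched in Python as follows.
The helper name
.I move_thread
is hypothetical;
it simply writes a thread ID to the
.I cgroup.threads
file of the destination cgroup and leaves all rule checking to the kernel.
.PP
.in +4n
.EX
```python
import os


def move_thread(tid, dest_cgroup):
    """Write thread ID tid to the cgroup.threads file in dest_cgroup."""
    threads_file = os.path.join(dest_cgroup, "cgroup.threads")
    with open(threads_file, "w") as f:
        f.write(str(tid))
```
.EE
.in
.PP
A thread could place itself by passing the value returned by
.I threading.get_native_id()
(available since Python 3.8) together with the pathname of a
.I threaded
cgroup in its own threaded subtree.
A move that the kernel rejects is reported to the caller as an
.I OSError
exception.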
1495 | .PP | |
d84e558e MK |
1496 | As with writing to |
1497 | .IR cgroup.procs , | |
1498 | some containment rules apply when writing to the | |
b2c3e720 | 1499 | .I cgroup.threads |
d84e558e | 1500 | file: |
22356d97 | 1501 | .IP \(bu 3 |
d84e558e MK |
1502 | The writer must have write permission on the |
1503 | .I cgroup.threads | |
1504 | file in the destination cgroup. | |
22356d97 | 1505 | .IP \(bu |
d84e558e MK |
1506 | The writer must have write permission on the |
1507 | .I cgroup.procs | |
1508 | file in the common ancestor of the source and destination cgroups. | |
1509 | (In some cases, | |
1510 | the common ancestor may be the source or destination cgroup itself.) | |
22356d97 | 1511 | .IP \(bu |
d84e558e MK |
1512 | The source and destination cgroups must be in the same threaded subtree. |
1513 | (Outside a threaded subtree, an attempt to move a thread by writing | |
1514 | its thread ID to the | |
1515 | .I cgroup.threads | |
1516 | file in a different | |
1517 | .I domain | |
1518 | cgroup fails with the error | |
1519 | .BR EOPNOTSUPP .) | |
4178f132 MK |
1520 | .PP |
1521 | The | |
1522 | .I cgroup.threads | |
c8902e25 MK |
1523 | file is present in each cgroup (including |
1524 | .I domain | |
1525 | cgroups) and can be read in order to discover the set of threads | |
1526 | that is present in the cgroup. | |
1527 | The set of thread IDs obtained when reading this file | |
1528 | is not guaranteed to be ordered or free of duplicates. | |
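.PP
Because the list may be unordered and may contain duplicates,
a reader will usually want to normalize it.
The following Python sketch shows one way to do so;
the function name
.I parse_thread_ids
is hypothetical, and the input is assumed to be the text read from a
.I cgroup.threads
file (one thread ID per line).
.PP
.in +4n
.EX
```python
def parse_thread_ids(text):
    """Return the sorted, de-duplicated thread IDs found in text."""
    # A set drops repeated IDs; sorted() then imposes a stable order.
    return sorted({int(token) for token in text.split()})
```
.EE
.in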
1529 | .PP | |
1530 | The | |
1531 | .I cgroup.procs | |
1532 | file in the threaded root shows the PIDs of all processes | |
1533 | that are members of the threaded subtree. | |
1534 | The | |
1535 | .I cgroup.procs | |
1536 | files in the other cgroups in the subtree are not readable. | |
1537 | .PP | |
1538 | Domain controllers can't be enabled in a threaded subtree; | |
1539 | no controller-interface files appear inside the cgroups underneath the | |
1540 | threaded root. | |
1541 | From the point of view of a domain controller, | |
1542 | threaded subtrees are invisible: | |
1543 | a multithreaded process inside a threaded subtree appears to a domain | |
1544 | controller as a process that resides in the threaded root cgroup. | |
1545 | .PP | |
1546 | Within a threaded subtree, the "no internal processes" rule does not apply: | |
1547 | a cgroup can both contain member processes (or threads) | |
1548 | and exercise controllers on child cgroups. | |
1549 | .\" | |
1550 | .SS Rules for writing to cgroup.type and creating threaded subtrees | |
1551 | A number of rules apply when writing to the | |
1552 | .I cgroup.type | |
1553 | file: | |
22356d97 | 1554 | .IP \(bu 3 |
c8902e25 | 1555 | Only the string |
1ae6b2c7 | 1556 | .I """threaded""" |
c8902e25 MK |
1557 | may be written. |
1558 | In other words, the only explicit transition that is possible is to convert a | |
1559 | .I domain | |
1560 | cgroup to type | |
1561 | .IR threaded . | |
22356d97 | 1562 | .IP \(bu |
6c9aa5ad | 1563 | The effect of writing |
1ae6b2c7 | 1564 | .I """threaded""" |
6c9aa5ad MK |
1565 | depends on the current value in |
1566 | .IR cgroup.type , | |
1567 | as follows: | |
c8902e25 MK |
1568 | .RS |
1569 | .IP \(bu 3 | |
1ae6b2c7 | 1570 | .I domain |
6c9aa5ad MK |
1571 | or |
1572 | .IR "domain threaded" : | |
1573 | start the creation of a threaded subtree | |
1574 | (whose root is the parent of this cgroup) via | |
c8902e25 MK |
1575 | the first of the pathways described above; |
1576 | .IP \(bu | |
6c9aa5ad | 1577 | .IR "domain\ invalid" : |
4644794c | 1578 | convert this cgroup (which is inside a threaded subtree) to a usable (i.e., |
c8902e25 MK |
1579 | .IR threaded ) |
1580 | state; | |
1581 | .IP \(bu | |
6c9aa5ad MK |
1582 | .IR threaded : |
1583 | no effect (a "no-op"). | |
c8902e25 | 1584 | .RE |
22356d97 | 1585 | .IP \(bu |
c8902e25 MK |
1586 | We can't write to a |
1587 | .I cgroup.type | |
1588 | file if the parent's type is | |
1589 | .IR "domain invalid" . | |
1590 | In other words, the cgroups of a threaded subtree must be converted to the | |
1591 | .I threaded | |
1592 | state in a top-down manner. | |
1593 | .PP | |
00c27092 | 1594 | There are also some constraints that must be satisfied |
c8902e25 MK |
1595 | in order to create a threaded subtree rooted at the cgroup |
1596 | .IR x : | |
22356d97 | 1597 | .IP \(bu 3 |
c8902e25 MK |
1598 | There can be no member processes in the descendant cgroups of |
1599 | .IR x . | |
1600 | (The cgroup | |
1601 | .I x | |
1602 | can itself have member processes.) | |
22356d97 | 1603 | .IP \(bu |
c8902e25 MK |
1604 | No domain controllers may be enabled in |
1605 | .IR x 's | |
1ae6b2c7 | 1606 | .I cgroup.subtree_control |
c8902e25 | 1607 | file. |
c8902e25 MK |
1608 | .PP |
1609 | If any of the above constraints is violated, then an attempt to write | |
1ae6b2c7 | 1610 | .I """threaded""" |
c8902e25 | 1611 | to a |
1ae6b2c7 | 1612 | .I cgroup.type |
c8902e25 MK |
1613 | file fails with the error |
1614 | .BR EOPNOTSUPP . | |
1615 | .\" | |
1616 | .SS The """domain threaded""" cgroup type | |
1617 | According to the pathways described above, | |
1618 | the type of a cgroup can change to | |
1ae6b2c7 | 1619 | .I domain threaded |
c8902e25 | 1620 | in either of the following cases: |
22356d97 | 1621 | .IP \(bu 3 |
c8902e25 | 1622 | The string |
1ae6b2c7 | 1623 | .I """threaded""" |
c8902e25 | 1624 | is written to a child cgroup. |
22356d97 | 1625 | .IP \(bu |
c8902e25 MK |
1626 | A threaded controller is enabled inside the cgroup and |
1627 | a process is made a member of the cgroup. | |
1628 | .PP | |
1629 | A | |
1ae6b2c7 | 1630 | .I domain threaded |
c8902e25 MK |
1631 | cgroup, |
1632 | .IR x , | |
1633 | can revert to the type | |
1ae6b2c7 | 1634 | .I domain |
c8902e25 MK |
1635 | if the above conditions no longer hold true\(emthat is, if all |
1636 | .I threaded | |
1637 | child cgroups of | |
1638 | .I x | |
1639 | are removed and either | |
1640 | .I x | |
1641 | no longer has threaded controllers enabled or | |
1642 | no longer has member processes. | |
1643 | .PP | |
1644 | When a | |
1ae6b2c7 | 1645 | .I domain threaded |
c8902e25 | 1646 | cgroup |
1ae6b2c7 | 1647 | .I x |
c8902e25 MK |
1648 | reverts to the type |
1649 | .IR domain : | |
22356d97 | 1650 | .IP \(bu 3 |
c8902e25 | 1651 | All |
1ae6b2c7 | 1652 | .I domain invalid |
c8902e25 MK |
1653 | descendants of |
1654 | .I x | |
1655 | that are not in lower-level threaded subtrees revert to the type | |
1656 | .IR domain . | |
22356d97 | 1657 | .IP \(bu |
c8902e25 MK |
1658 | The root cgroups in any lower-level threaded subtrees revert to the type |
1659 | .IR "domain threaded" . | |
1660 | .\" | |
1661 | .SS Exceptions for the root cgroup | |
1662 | The root cgroup of the v2 hierarchy is treated exceptionally: | |
1663 | it can be the parent of both | |
1664 | .I domain | |
1665 | and | |
1666 | .I threaded | |
1667 | cgroups. | |
1668 | If the string | |
1669 | .I """threaded""" | |
1670 | is written to the | |
1671 | .I cgroup.type | |
1672 | file of one of the children of the root cgroup, then | |
22356d97 | 1673 | .IP \(bu 3 |
c8902e25 MK |
1674 | The type of that cgroup becomes |
1675 | .IR threaded . | |
22356d97 | 1676 | .IP \(bu |
c8902e25 MK |
1677 | The type of any descendants of that cgroup that |
1678 | are not part of lower-level threaded subtrees changes to | |
1679 | .IR "domain invalid" . | |
1680 | .PP | |
1681 | Note that in this case, there is no cgroup whose type becomes | |
1682 | .IR "domain threaded" . | |
1683 | (Notionally, the root cgroup can be considered as the threaded root | |
1684 | for the cgroup whose type was changed to | |
1685 | .IR threaded .) | |
1686 | .PP | |
1687 | The aim of this exceptional treatment for the root cgroup is to | |
1688 | allow a threaded cgroup that employs the | |
1689 | .I cpu | |
1690 | controller to be placed as high as possible in the hierarchy, | |
1691 | so as to minimize the (small) cost of traversing the cgroup hierarchy. | |
1692 | .\" | |
edc90967 | 1693 | .SS The cgroups v2 """cpu""" controller and realtime threads |
aa2c3623 | 1694 | As at Linux 4.19, the cgroups v2 |
c8902e25 | 1695 | .I cpu |
0bef253e MK |
1696 | controller does not support control of realtime threads |
1697 | (specifically threads scheduled under any of the policies | |
1698 | .BR SCHED_FIFO , | |
1699 | .BR SCHED_RR , | |
1700 | and | |
1701 | .BR SCHED_DEADLINE ; | |
1702 | see | |
1703 | .BR sched (7)). | |
1704 | Therefore, the | |
1705 | .I cpu | |
1706 | controller can be enabled in the root cgroup only | |
c8902e25 | 1707 | if all realtime threads are in the root cgroup. |
edc90967 | 1708 | (If there are realtime threads in nonroot cgroups, then a |
c8902e25 MK |
1709 | .BR write (2) |
1710 | of the string | |
1ae6b2c7 | 1711 | .I """+cpu""" |
c8902e25 MK |
1712 | to the |
1713 | .I cgroup.subtree_control | |
1714 | file fails with the error | |
c2df7694 | 1715 | .BR EINVAL .) |
17094a28 MK |
1716 | .PP |
1717 | On some systems, | |
c8902e25 | 1718 | .BR systemd (1) |
edc90967 | 1719 | places certain realtime threads in nonroot cgroups in the v2 hierarchy. |
c8902e25 | 1720 | On such systems, |
edc90967 | 1721 | these threads must first be moved to the root cgroup before the |
c8902e25 MK |
1722 | .I cpu |
1723 | controller can be enabled. | |
1724 | .\" | |
1725 | .SH ERRORS | |
1726 | The following errors can occur for | |
1727 | .BR mount (2): | |
1728 | .TP | |
1729 | .B EBUSY | |
1730 | An attempt to mount a cgroup version 1 filesystem specified neither the | |
1731 | .I name= | |
1732 | option (to mount a named hierarchy) nor a controller name (or | |
1733 | .IR all ). | |
1734 | .SH NOTES | |
1735 | A child process created via | |
1736 | .BR fork (2) | |
1737 | inherits its parent's cgroup memberships. | |
1738 | A process's cgroup memberships are preserved across | |
1739 | .BR execve (2). | |
c0e4ab63 MK |
1740 | .PP |
1741 | The | |
1742 | .BR clone3 (2) | |
1743 | .B CLONE_INTO_CGROUP | |
1744 | flag can be used to create a child process that begins its life in | |
1745 | a different version 2 cgroup from the parent process. | |
c8902e25 | 1746 | .\" |
5c2181ad MK |
1747 | .SS /proc files |
1748 | .TP | |
34eb3340 | 1749 | .IR /proc/cgroups " (since Linux 2.6.24)" |
92bb6d36 | 1750 | This file contains information about the controllers |
1a4f7d59 | 1751 | that are compiled into the kernel. |
34eb3340 MK |
1752 | An example of the contents of this file (reformatted for readability) |
1753 | is the following: | |
a721e8b2 | 1754 | .IP |
34eb3340 | 1755 | .in +4n |
b8302363 | 1756 | .EX |
4580c2f6 MK |
1757 | #subsys_name hierarchy num_cgroups enabled |
1758 | cpuset 4 1 1 | |
1759 | cpu 8 1 1 | |
1760 | cpuacct 8 1 1 | |
1761 | blkio 6 1 1 | |
1762 | memory 3 1 1 | |
1763 | devices 10 84 1 | |
1764 | freezer 7 1 1 | |
1765 | net_cls 9 1 1 | |
1766 | perf_event 5 1 1 | |
1767 | net_prio 9 1 1 | |
1768 | hugetlb 0 1 0 | |
1769 | pids 2 1 1 | |
b8302363 | 1770 | .EE |
e646a1ba | 1771 | .in |
a721e8b2 | 1772 | .IP |
34eb3340 MK |
1773 | The fields in this file are, from left to right: |
1774 | .RS | |
22356d97 | 1775 | .IP [1] 5 |
34eb3340 | 1776 | The name of the controller. |
22356d97 | 1777 | .IP [2] |
92bb6d36 | 1778 | The unique ID of the cgroup hierarchy on which this controller is mounted. |
11c0797f | 1779 | If multiple cgroups v1 controllers are bound to the same hierarchy, |
34eb3340 | 1780 | then each will show the same hierarchy ID in this field. |
92bb6d36 | 1781 | The value in this field will be 0 if: |
22356d97 AC |
1782 | .RS |
1783 | .IP \(bu 3 | |
92bb6d36 | 1784 | the controller is not mounted on a cgroups v1 hierarchy; |
22356d97 | 1785 | .IP \(bu |
92bb6d36 | 1786 | the controller is bound to the cgroups v2 single unified hierarchy; or |
22356d97 | 1787 | .IP \(bu |
92bb6d36 MK |
1788 | the controller is disabled (see below). |
1789 | .RE | |
22356d97 | 1790 | .IP [3] |
34eb3340 | 1791 | The number of control groups in this hierarchy using this controller. |
22356d97 | 1792 | .IP [4] |
34eb3340 MK |
1793 | This field contains the value 1 if this controller is enabled, |
1794 | or 0 if it has been disabled (via the | |
1ae6b2c7 | 1795 | .I cgroup_disable |
34eb3340 MK |
1796 | kernel command-line boot parameter). |
1797 | .RE | |
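.IP
As an illustration of the field layout described above,
the following Python sketch turns the text of this file into records.
The names
.I Controller
and
.I parse_proc_cgroups
are hypothetical;
the sketch assumes four whitespace-separated fields per line
and skips the header line that starts with "#".
.IP
.in +4n
.EX
```python
from collections import namedtuple

Controller = namedtuple(
    "Controller", "name hierarchy num_cgroups enabled")


def parse_proc_cgroups(text):
    """Parse the contents of /proc/cgroups into Controller records."""
    controllers = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header line
        name, hierarchy, num_cgroups, enabled = line.split()
        controllers.append(Controller(
            name, int(hierarchy), int(num_cgroups), enabled == "1"))
    return controllers
```
.EE
.in
.IP
For the example contents shown above, the record for
.I hugetlb
would have a hierarchy ID of 0 and an
.I enabled
field that is false.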
1798 | .TP | |
5c2181ad | 1799 | .IR /proc/[pid]/cgroup " (since Linux 2.6.24)" |
f5faa016 MK |
1800 | This file describes control groups to which the process |
1801 | with the corresponding PID belongs. | |
5f8a7eb2 | 1802 | The displayed information differs for |
2c4fbe35 | 1803 | cgroups version 1 and version 2 hierarchies. |
a721e8b2 | 1804 | .IP |
5f8a7eb2 | 1805 | For each cgroup hierarchy of which the process is a member, |
2e33b59e | 1806 | there is one entry containing three colon-separated fields: |
a721e8b2 | 1807 | .IP |
4769a778 MK |
1808 | .in +4n |
1809 | .EX | |
d064d41a | 1810 | hierarchy\-ID:controller\-list:cgroup\-path |
4769a778 MK |
1811 | .EE |
1812 | .in | |
a721e8b2 | 1813 | .IP |
5f8a7eb2 | 1814 | For example: |
c1a022dc MK |
1815 | .IP |
1816 | .in +4n | |
1817 | .EX | |
1818 | 5:cpuacct,cpu,cpuset:/daemons | |
1819 | .EE | |
1820 | .in | |
5c2181ad MK |
1821 | .IP |
1822 | The colon-separated fields are, from left to right: | |
5f8a7eb2 | 1823 | .RS |
22356d97 | 1824 | .IP [1] 5 |
5f8a7eb2 MK |
1825 | For cgroups version 1 hierarchies, |
1826 | this field contains a unique hierarchy ID number | |
1827 | that can be matched to a hierarchy ID in | |
1828 | .IR /proc/cgroups . | |
1829 | For the cgroups version 2 hierarchy, this field contains the value 0. | |
22356d97 | 1830 | .IP [2] |
5f8a7eb2 | 1831 | For cgroups version 1 hierarchies, |
55f52de8 | 1832 | this field contains a comma-separated list of the controllers |
5f8a7eb2 MK |
1833 | bound to the hierarchy. |
1834 | For the cgroups version 2 hierarchy, this field is empty. | |
22356d97 | 1835 | .IP [3] |
5f8a7eb2 MK |
1836 | This field contains the pathname of the control group in the hierarchy |
1837 | to which the process belongs. | |
1838 | This pathname is relative to the mount point of the hierarchy. | |
5c2181ad | 1839 | .RE |
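.IP
The three fields can be separated as in the following Python sketch.
The function name
.I parse_pid_cgroup
is hypothetical.
The sketch splits each line on the first two colons only,
on the assumption that a cgroup pathname may itself contain a colon.
.IP
.in +4n
.EX
```python
def parse_pid_cgroup(text):
    """Parse /proc/[pid]/cgroup text into a list of tuples.

    Each tuple is (hierarchy_id, controllers, cgroup_path); the
    controller list is empty for the cgroups version 2 hierarchy.
    """
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        hierarchy_id, controller_list, path = line.split(":", 2)
        controllers = controller_list.split(",") if controller_list else []
        entries.append((int(hierarchy_id), controllers, path))
    return entries
```
.EE
.in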
668ef765 MK |
1840 | .\" |
1841 | .SS /sys/kernel/cgroup files | |
1842 | .TP | |
1843 | .IR /sys/kernel/cgroup/delegate " (since Linux 4.15)" | |
1844 | .\" commit 01ee6cfb1483fe57c9cbd8e73817dfbf9bacffd3 | |
1845 | This file exports a list of the cgroups v2 files | |
1846 | (one per line) that are delegatable | |
1847 | (i.e., whose ownership should be changed to the user ID of the delegatee). | |
1848 | In the future, the set of delegatable files may change or grow, | |
1849 | and this file provides a way for the kernel to inform | |
1850 | user-space applications of which files must be delegated. | |
1851 | As at Linux 4.15, one sees the following when inspecting this file: | |
1852 | .IP | |
668ef765 | 1853 | .in +4n |
4f237029 | 1854 | .EX |
668ef765 MK |
1855 | $ \fBcat /sys/kernel/cgroup/delegate\fP |
1856 | cgroup.procs | |
1857 | cgroup.subtree_control | |
c7913617 | 1858 | cgroup.threads |
668ef765 | 1859 | .EE |
4f237029 | 1860 | .in |
6413d784 MK |
1861 | .TP |
1862 | .IR /sys/kernel/cgroup/features " (since Linux 4.15)" | |
1863 | .\" commit 5f2e673405b742be64e7c3604ed4ed3ac14f35ce | |
1864 | Over time, the set of cgroups v2 features that are provided by the | |
1865 | kernel may change or grow, | |
1866 | or some features may not be enabled by default. | |
1867 | This file provides a way for user-space applications to discover what | |
fcf115f5 | 1868 | features the running kernel supports and has enabled. |
6413d784 MK |
1869 | Features are listed one per line: |
1870 | .IP | |
1871 | .in +4n | |
1872 | .EX | |
6413d784 MK |
1873 | $ \fBcat /sys/kernel/cgroup/features\fP |
1874 | nsdelegate | |
9e18674a | 1875 | memory_localevents |
2e69ff53 | 1876 | .EE |
6413d784 MK |
1877 | .in |
1878 | .IP | |
1879 | The entries that can appear in this file are: | |
1880 | .RS | |
1881 | .TP | |
9e18674a MK |
1882 | .IR memory_localevents " (since Linux 5.2)" |
1883 | The kernel supports the | |
1884 | .I memory_localevents | |
1885 | mount option. | |
1886 | .TP | |
6413d784 MK |
1887 | .IR nsdelegate " (since Linux 4.15)" |
1888 | The kernel supports the | |
1889 | .I nsdelegate | |
1890 | mount option. | |
e571991e BH |
1891 | .TP |
1892 | .IR memory_recursiveprot " (since Linux 5.7)" | |
1893 | .\" commit 8a931f801340c2be10552c7b5622d5f4852f3a36 | |
1894 | The kernel supports the | |
1895 | .I memory_recursiveprot | |
1896 | mount option. | |
6413d784 | 1897 | .RE |
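.IP
An application might test for a feature as in the following Python sketch.
The function name
.I parse_features
is hypothetical, and the input is assumed to be the text read from
.IR /sys/kernel/cgroup/features .
.IP
.in +4n
.EX
```python
def parse_features(text):
    """Return the set of feature names listed one per line in text."""
    return {line.strip() for line in text.splitlines() if line.strip()}
```
.EE
.in
.IP
A caller would then check membership, for example by testing whether
"nsdelegate" is in the returned set before passing the
.I nsdelegate
mount option.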
bbfdf727 | 1898 | .SH SEE ALSO |
ebbc83be | 1899 | .BR prlimit (1), |
f60a5da2 | 1900 | .BR systemd (1), |
28a4c58c MK |
1901 | .BR systemd\-cgls (1), |
1902 | .BR systemd\-cgtop (1), | |
325b7eb0 | 1903 | .BR clone (2), |
ebbc83be MK |
1904 | .BR ioprio_set (2), |
1905 | .BR perf_event_open (2), | |
1906 | .BR setrlimit (2), | |
cff6de30 | 1907 | .BR cgroup_namespaces (7), |
69c47536 | 1908 | .BR cpuset (7), |
ebbc83be MK |
1909 | .BR namespaces (7), |
1910 | .BR sched (7), | |
1911 | .BR user_namespaces (7) | |
d4c9a848 MK |
1912 | .PP |
1913 | The kernel source file | |
069cbb60 | 1914 | .IR Documentation/admin\-guide/cgroup\-v2.rst . |