   1 .\" Copyright (C) 2015 Serge Hallyn <serge@hallyn.com>
   2 .\" and Copyright (C) 2016, 2017 Michael Kerrisk <mtk.manpages@gmail.com>
   3 .\"
   4 .\" %%%LICENSE_START(VERBATIM)
   5 .\" Permission is granted to make and distribute verbatim copies of this
   6 .\" manual provided the copyright notice and this permission notice are
   7 .\" preserved on all copies.
   8 .\"
   9 .\" Permission is granted to copy and distribute modified versions of this
  10 .\" manual under the conditions for verbatim copying, provided that the
  11 .\" entire resulting derived work is distributed under the terms of a
  12 .\" permission notice identical to this one.
  13 .\"
  14 .\" Since the Linux kernel and libraries are constantly changing, this
  15 .\" manual page may be incorrect or out-of-date.  The author(s) assume no
  16 .\" responsibility for errors or omissions, or for damages resulting from
  17 .\" the use of the information contained herein.  The author(s) may not
  18 .\" have taken the same level of care in the production of this manual,
  19 .\" which is licensed free of charge, as they might when working
  20 .\" professionally.
  21 .\"
  22 .\" Formatted or processed versions of this manual, if unaccompanied by
  23 .\" the source, must acknowledge the copyright and authors of this work.
  24 .\" %%%LICENSE_END
  25 .\"
  26 .TH CGROUPS 7 2020-04-11 "Linux" "Linux Programmer's Manual"
  27 .SH NAME
  28 cgroups \- Linux control groups
  29 .SH DESCRIPTION
  30 Control groups, usually referred to as cgroups,
  31 are a Linux kernel feature that allows processes to
  32 be organized into hierarchical groups whose usage of
  33 various types of resources can then be limited and monitored.
  34 The kernel's cgroup interface is provided through
  35 a pseudo-filesystem called cgroupfs.
  36 Grouping is implemented in the core cgroup kernel code,
  37 while resource tracking and limits are implemented in
  38 a set of per-resource-type subsystems (memory, CPU, and so on).
  39 .\"
  40 .SS Terminology
  41 A
  42 .I cgroup
  43 is a collection of processes that are bound to a set of
  44 limits or parameters defined via the cgroup filesystem.
  45 .PP
  46 A
  47 .I subsystem
  48 is a kernel component that modifies the behavior of
  49 the processes in a cgroup.
  50 Various subsystems have been implemented, making it possible to do things
  51 such as limiting the amount of CPU time and memory available to a cgroup,
  52 accounting for the CPU time used by a cgroup,
  53 and freezing and resuming execution of the processes in a cgroup.
  54 Subsystems are sometimes also known as
  55 .IR "resource controllers"
  56 (or simply, controllers).
  57 .PP
  58 The cgroups for a controller are arranged in a
  59 .IR hierarchy .
  60 This hierarchy is defined by creating, removing, and
  61 renaming subdirectories within the cgroup filesystem.
  62 At each level of the hierarchy, attributes (e.g., limits) can be defined.
  63 The limits, control, and accounting provided by cgroups generally have
  64 effect throughout the subhierarchy underneath the cgroup where the
  65 attributes are defined.
  66 Thus, for example, the limits placed on
  67 a cgroup at a higher level in the hierarchy cannot be exceeded
  68 by descendant cgroups.
  69 .\"
  70 .SS Cgroups version 1 and version 2
  71 The initial release of the cgroups implementation was in Linux 2.6.24.
  72 Over time, various cgroup controllers have been added
  73 to allow the management of various types of resources.
  74 However, the development of these controllers was largely uncoordinated,
  75 with the result that many inconsistencies arose between controllers
  76 and management of the cgroup hierarchies became rather complex.
  77 (A longer description of these problems can be found in
  78 the kernel source file
  79 .IR Documentation/cgroup\-v2.txt .)
  80 .PP
  81 Because of the problems with the initial cgroups implementation
  82 (cgroups version 1),
  83 starting in Linux 3.10, work began on a new,
  84 orthogonal implementation to remedy these problems.
  85 Initially marked experimental, and hidden behind the
  86 .I "\-o\ __DEVEL__sane_behavior"
  87 mount option, the new version (cgroups version 2)
  88 was eventually made official with the release of Linux 4.5.
  89 Differences between the two versions are described in the text below.
  90 The file
  91 .IR cgroup.sane_behavior ,
  92 present in cgroups v1, is a relic of this mount option. The file
  93 always reports "0" and is only retained for backward compatibility.
  94 .PP
  95 Although cgroups v2 is intended as a replacement for cgroups v1,
  96 the older system continues to exist
  97 (and for compatibility reasons is unlikely to be removed).
  98 Currently, cgroups v2 implements only a subset of the controllers
  99 available in cgroups v1.
 100 The two systems are implemented so that both v1 controllers and
 101 v2 controllers can be mounted on the same system.
 102 Thus, for example, it is possible to use those controllers
 103 that are supported under version 2,
 104 while also using version 1 controllers
 105 where version 2 does not yet support those controllers.
 106 The only restriction here is that a controller can't be simultaneously
 107 employed in both a cgroups v1 hierarchy and in the cgroups v2 hierarchy.
 108 .\"
 109 .SH CGROUPS VERSION 1
 110 Under cgroups v1, each controller may be mounted against a separate
 111 cgroup filesystem that provides its own hierarchical organization of the
 112 processes on the system.
 113 It is also possible to comount multiple (or even all) cgroups v1 controllers
 114 against the same cgroup filesystem, meaning that the comounted controllers
 115 manage the same hierarchical organization of processes.
 116 .PP
 117 For each mounted hierarchy,
 118 the directory tree mirrors the control group hierarchy.
 119 Each control group is represented by a directory, with each of its child
 120 cgroups represented as a child directory.
 121 For instance,
 122 .IR /user/joe/1.session
 123 represents control group
 124 .IR 1.session ,
 125 which is a child of cgroup
 126 .IR joe ,
 127 which is a child of
 128 .IR /user .
 129 Under each cgroup directory is a set of files which can be read or
 130 written to, reflecting resource limits and a few general cgroup
 131 properties.
 132 .\"
 133 .SS Tasks (threads) versus processes
 134 In cgroups v1, a distinction is drawn between
 135 .I processes
 136 and
 137 .IR tasks .
 138 In this view, a process can consist of multiple tasks
 139 (more commonly called threads, from a user-space perspective,
 140 and called such in the remainder of this man page).
 141 In cgroups v1, it is possible to independently manipulate
 142 the cgroup memberships of the threads in a process.
 143 .PP
 144 The cgroups v1 ability to split threads across different cgroups
 145 caused problems in some cases.
 146 For example, it made no sense for the
 147 .I memory
 148 controller,
 149 since all of the threads of a process share a single address space.
 150 Because of these problems,
 151 the ability to independently manipulate the cgroup memberships
 152 of the threads in a process was removed in the initial cgroups v2
 153 implementation, and subsequently restored in a more limited form
 154 (see the discussion of "thread mode" below).
 155 .\"
 156 .SS Mounting v1 controllers
 157 The use of cgroups requires a kernel built with the
 158 .BR CONFIG_CGROUPS
 159 option.
 160 In addition, each of the v1 controllers has an associated
 161 configuration option that must be set in order to employ that controller.
 162 .PP
 163 In order to use a v1 controller,
 164 it must be mounted against a cgroup filesystem.
 165 The usual place for such mounts is under a
 166 .BR tmpfs (5)
 167 filesystem mounted at
 168 .IR /sys/fs/cgroup .
 169 Thus, one might mount the
 170 .I cpu
 171 controller as follows:
 172 .PP
 173 .in +4n
 174 .EX
 175 mount \-t cgroup \-o cpu none /sys/fs/cgroup/cpu
 176 .EE
 177 .in
 178 .PP
 179 It is possible to comount multiple controllers against the same hierarchy.
 180 For example, here the
 181 .IR cpu
 182 and
 183 .IR cpuacct
 184 controllers are comounted against a single hierarchy:
 185 .PP
 186 .in +4n
 187 .EX
 188 mount \-t cgroup \-o cpu,cpuacct none /sys/fs/cgroup/cpu,cpuacct
 189 .EE
 190 .in
 191 .PP
 192 Comounting controllers has the effect that a process is in the same cgroup for
 193 all of the comounted controllers.
 194 Separately mounting controllers allows a process to
 195 be in cgroup
 196 .I /foo1
 197 for one controller while being in
 198 .I /foo2/foo3
 199 for another.
 200 .PP
 201 It is possible to comount all v1 controllers against the same hierarchy:
 202 .PP
 203 .in +4n
 204 .EX
 205 mount \-t cgroup \-o all cgroup /sys/fs/cgroup
 206 .EE
 207 .in
 208 .PP
 209 (One can achieve the same result by omitting
 210 .IR "\-o all" ,
 211 since it is the default if no controllers are explicitly specified.)
 212 .PP
 213 It is not possible to mount the same controller
 214 against multiple cgroup hierarchies.
 215 For example, it is not possible to mount both the
 216 .I cpu
 217 and
 218 .I cpuacct
 219 controllers against one hierarchy, and to mount the
 220 .I cpu
 221 controller alone against another hierarchy.
 222 It is possible to create multiple mount points with exactly
 223 the same set of comounted controllers.
 224 However, in this case all that results is multiple mount points
 225 providing a view of the same hierarchy.
 226 .PP
 227 Note that on many systems, the v1 controllers are automatically mounted under
 228 .IR /sys/fs/cgroup ;
 229 in particular,
 230 .BR systemd (1)
 231 automatically creates such mount points.
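To check which cgroup hierarchies are currently mounted, and with which controllers, the mount table can be inspected. The following is a minimal sketch, not a definitive tool. It assumes the usual /proc/mounts field layout (device, mount point, filesystem type, options), and the function name list_cgroup_mounts is hypothetical.

```shell
#!/bin/sh
# list_cgroup_mounts [FILE]
# Print "fstype mountpoint options" for every cgroup (v1) and cgroup2 (v2)
# entry in a mounts table. FILE defaults to /proc/mounts (an assumption
# about where the mount table is exposed on the system).
list_cgroup_mounts() {
    awk '$3 == "cgroup" || $3 == "cgroup2" { print $3, $2, $4 }' \
        "${1:-/proc/mounts}"
}
```

For a v1 hierarchy, the comounted controllers (for example cpu,cpuacct) appear among the mount options in the third output column.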
 232 .\"
 233 .SS Unmounting v1 controllers
 234 A mounted cgroup filesystem can be unmounted using the
 235 .BR umount (8)
 236 command, as in the following example:
 237 .PP
 238 .in +4n
 239 .EX
 240 umount /sys/fs/cgroup/pids
 241 .EE
 242 .in
 243 .PP
 244 .IR "But note well" :
 245 a cgroup filesystem is unmounted only if it is not busy,
 246 that is, it has no child cgroups.
 247 If this is not the case, then the only effect of the
 248 .BR umount (8)
 249 is to make the mount invisible.
 250 Thus, to ensure that the mount point is really removed,
 251 one must first remove all child cgroups,
 252 which in turn can be done only after all member processes
 253 have been moved from those cgroups to the root cgroup.
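Because an unmount of a busy hierarchy only hides the mount, it can help to check for child cgroups first. The sketch below illustrates one way to do so; the function names cgroup_hierarchy_busy and safe_umount are hypothetical, and the check relies on the -mindepth and -maxdepth options of find, which are common but not part of POSIX.

```shell
#!/bin/sh
# cgroup_hierarchy_busy MOUNTPOINT
# Succeed (exit status 0) if the hierarchy at MOUNTPOINT still has child
# cgroups, that is, at least one subdirectory.
cgroup_hierarchy_busy() {
    [ -n "$(find "$1" -mindepth 1 -maxdepth 1 -type d | head -n 1)" ]
}

# safe_umount MOUNTPOINT
# Unmount only when no child cgroups remain. The umount itself normally
# requires root privileges.
safe_umount() {
    if cgroup_hierarchy_busy "$1"; then
        echo "$1: child cgroups still exist; not unmounting" >&2
        return 1
    fi
    umount "$1"
}
```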
 254 .\"
 255 .SS Cgroups version 1 controllers
 256 Each of the cgroups version 1 controllers is governed
 257 by a kernel configuration option (listed below).
 258 Additionally, the availability of the cgroups feature is governed by the
 259 .BR CONFIG_CGROUPS
 260 kernel configuration option.
 261 .TP
 262 .IR cpu " (since Linux 2.6.24; " \fBCONFIG_CGROUP_SCHED\fP )
 263 Cgroups can be guaranteed a minimum number of "CPU shares"
 264 when a system is busy.
 265 This does not limit a cgroup's CPU usage if the CPUs are not busy.
 266 For further information, see
 267 .IR Documentation/scheduler/sched-design-CFS.txt .
 268 .IP
 269 In Linux 3.2,
 270 this controller was extended to provide CPU "bandwidth" control.
 271 If the kernel is configured with
 272 .BR CONFIG_CFS_BANDWIDTH ,
 273 then within each scheduling period
 274 (defined via a file in the cgroup directory), it is possible to define
 275 an upper limit on the CPU time allocated to the processes in a cgroup.
 276 This upper limit applies even if there is no other competition for the CPU.
 277 Further information can be found in the kernel source file
 278 .IR Documentation/scheduler/sched\-bwc.txt .
 279 .TP
 280 .IR cpuacct " (since Linux 2.6.24; " \fBCONFIG_CGROUP_CPUACCT\fP )
 281 This provides accounting for CPU usage by groups of processes.
 282 .IP
 283 Further information can be found in the kernel source file
 284 .IR Documentation/cgroup\-v1/cpuacct.txt .
 285 .TP
 286 .IR cpuset " (since Linux 2.6.24; " \fBCONFIG_CPUSETS\fP )
 287 This controller can be used to bind the processes in a cgroup to
 288 a specified set of CPUs and NUMA nodes.
 289 .IP
 290 Further information can be found in the kernel source file
 291 .IR Documentation/cgroup\-v1/cpusets.txt .
 292 .TP
 293 .IR memory " (since Linux 2.6.25; " \fBCONFIG_MEMCG\fP )
 294 The memory controller supports reporting and limiting of process memory, kernel
 295 memory, and swap used by cgroups.
 296 .IP
 297 Further information can be found in the kernel source file
 298 .IR Documentation/cgroup\-v1/memory.txt .
 299 .TP
 300 .IR devices " (since Linux 2.6.26; " \fBCONFIG_CGROUP_DEVICE\fP )
 301 This supports controlling which processes may create (mknod) devices as
 302 well as open them for reading or writing.
 303 The policies may be specified as allow-lists and deny-lists.
 304 Hierarchy is enforced, so new rules must not
 305 violate existing rules for the target or ancestor cgroups.
 306 .IP
 307 Further information can be found in the kernel source file
 308 .IR Documentation/cgroup-v1/devices.txt .
 309 .TP
 310 .IR freezer " (since Linux 2.6.28; " \fBCONFIG_CGROUP_FREEZER\fP )
 311 The
 312 .IR freezer
 313 controller can suspend and restore (resume) all processes in a cgroup.
 314 Freezing a cgroup
 315 .I /A
 316 also causes its children, for example, processes in
 317 .IR /A/B ,
 318 to be frozen.
 319 .IP
 320 Further information can be found in the kernel source file
 321 .IR Documentation/cgroup-v1/freezer-subsystem.txt .
 322 .TP
 323 .IR net_cls " (since Linux 2.6.29; " \fBCONFIG_CGROUP_NET_CLASSID\fP )
 324 This places a classid, specified for the cgroup, on network packets
 325 created by a cgroup.
 326 These classids can then be used in firewall rules,
 327 as well as used to shape traffic using
 328 .BR tc (8).
 329 This applies only to packets
 330 leaving the cgroup, not to traffic arriving at the cgroup.
 331 .IP
 332 Further information can be found in the kernel source file
 333 .IR Documentation/cgroup-v1/net_cls.txt .
 334 .TP
 335 .IR blkio " (since Linux 2.6.33; " \fBCONFIG_BLK_CGROUP\fP )
 336 The
 337 .I blkio
 338 cgroup controls and limits access to specified block devices by
 339 applying I/O control in the form of throttling and upper limits against leaf
 340 nodes and intermediate nodes in the storage hierarchy.
 341 .IP
 342 Two policies are available.
 343 The first is a proportional-weight time-based division
 344 of disk implemented with CFQ.
 345 This is in effect for leaf nodes using CFQ.
 346 The second is a throttling policy which specifies
 347 upper I/O rate limits on a device.
 348 .IP
 349 Further information can be found in the kernel source file
 350 .IR Documentation/cgroup-v1/blkio-controller.txt .
 351 .TP
 352 .IR perf_event " (since Linux 2.6.39; " \fBCONFIG_CGROUP_PERF\fP )
 353 This controller allows
 354 .I perf
 355 monitoring of the set of processes grouped in a cgroup.
 356 .IP
 357 Further information can be found in the kernel source file
 358 .IR tools/perf/Documentation/perf-record.txt .
 359 .TP
 360 .IR net_prio " (since Linux 3.3; " \fBCONFIG_CGROUP_NET_PRIO\fP )
 361 This allows priorities to be specified, per network interface, for cgroups.
 362 .IP
 363 Further information can be found in the kernel source file
 364 .IR Documentation/cgroup-v1/net_prio.txt .
 365 .TP
 366 .IR hugetlb " (since Linux 3.5; " \fBCONFIG_CGROUP_HUGETLB\fP )
 367 This supports limiting the use of huge pages by cgroups.
 368 .IP
 369 Further information can be found in the kernel source file
 370 .IR Documentation/cgroup-v1/hugetlb.txt .
 371 .TP
 372 .IR pids " (since Linux 4.3; " \fBCONFIG_CGROUP_PIDS\fP )
 373 This controller permits limiting the number of processes that may be created
 374 in a cgroup (and its descendants).
 375 .IP
 376 Further information can be found in the kernel source file
 377 .IR Documentation/cgroup-v1/pids.txt .
 378 .TP
 379 .IR rdma " (since Linux 4.11; " \fBCONFIG_CGROUP_RDMA\fP )
 380 The RDMA controller permits limiting the use of
 381 RDMA/IB-specific resources per cgroup.
 382 .IP
 383 Further information can be found in the kernel source file
 384 .IR Documentation/cgroup-v1/rdma.txt .
 385 .\"
 386 .SS Creating cgroups and moving processes
 387 A cgroup filesystem initially contains a single root cgroup, '/',
 388 which all processes belong to.
 389 A new cgroup is created by creating a directory in the cgroup filesystem:
 390 .PP
 391 .in +4n
 392 .EX
 393 mkdir /sys/fs/cgroup/cpu/cg1
 394 .EE
 395 .in
 396 .PP
 397 This creates a new empty cgroup.
 398 .PP
 399 A process may be moved to this cgroup by writing its PID into the cgroup's
 400 .I cgroup.procs
 401 file:
 402 .PP
 403 .in +4n
 404 .EX
 405 echo $$ > /sys/fs/cgroup/cpu/cg1/cgroup.procs
 406 .EE
 407 .in
 408 .PP
 409 Only one PID at a time should be written to this file.
 410 .PP
 411 Writing the value 0 to a
 412 .IR cgroup.procs
 413 file causes the writing process to be moved to the corresponding cgroup.
 414 .PP
 415 When a PID is written to
 416 .IR cgroup.procs ,
 417 all threads in the process are moved into the new cgroup at once.
 418 .PP
 419 Within a hierarchy, a process can be a member of exactly one cgroup.
 420 Writing a process's PID to a
 421 .IR cgroup.procs
 422 file automatically removes it from the cgroup of
 423 which it was previously a member.
 424 .PP
 425 The
 426 .I cgroup.procs
 427 file can be read to obtain a list of the processes that are
 428 members of a cgroup.
 429 The returned list of PIDs is not guaranteed to be in order.
 430 Nor is it guaranteed to be free of duplicates.
 431 (For example, a PID may be recycled while reading from the list.)
 432 .PP
 433 In cgroups v1, an individual thread can be moved to
 434 another cgroup by writing its thread ID
 435 (i.e., the kernel thread ID returned by
 436 .BR clone (2)
 437 and
 438 .BR gettid (2))
 439 to the
 440 .IR tasks
 441 file in a cgroup directory.
 442 This file can be read to discover the set of threads
 443 that are members of the cgroup.
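The steps above (create a directory, write one PID to cgroup.procs, read the membership back) can be combined into two small helpers. This is a sketch under stated assumptions: the function names are hypothetical, the caller is assumed to have write permission on the hierarchy, and mkdir -p is used so that an existing cgroup is not treated as an error.

```shell
#!/bin/sh
# cgroup_create_and_join CGROUP_DIR [PID]
# Create the cgroup directory if needed, then move PID (default: the
# current shell) into it by writing a single PID to cgroup.procs.
cgroup_create_and_join() {
    mkdir -p "$1" || return 1
    echo "${2:-$$}" > "$1/cgroup.procs"
}

# cgroup_list_pids CGROUP_DIR
# Print the member PIDs sorted numerically with duplicates removed, since
# cgroup.procs guarantees neither ordering nor uniqueness.
cgroup_list_pids() {
    sort -n -u "$1/cgroup.procs"
}
```

A call such as cgroup_create_and_join /sys/fs/cgroup/cpu/cg1 would place the calling shell in the cg1 cgroup from the earlier example.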
 444 .\"
 445 .SS Removing cgroups
 446 To remove a cgroup,
 447 it must first have no child cgroups and contain no (nonzombie) processes.
 448 So long as that is the case, one can simply
 449 remove the corresponding directory pathname.
 450 Note that files in a cgroup directory cannot and need not be
 451 removed.
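Since a cgroup can be removed only after its children are gone, removing a whole subtree means working from the deepest cgroups upward. The following sketch shows one way to do that with find; the function name cgroup_remove_tree is hypothetical, and the sketch does not move processes out of the cgroups first.

```shell
#!/bin/sh
# cgroup_remove_tree CGROUP_DIR
# Remove CGROUP_DIR and all of its descendant cgroups, deepest first.
# On a cgroup filesystem the interface files need not be removed; each
# rmdir fails if the cgroup still has member processes.
cgroup_remove_tree() {
    find "$1" -depth -type d -exec rmdir {} \;
    # Report success only if the top-level directory is really gone.
    [ ! -e "$1" ]
}
```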
 452 .\"
 453 .SS Cgroups v1 release notification
 454 Two files can be used to determine whether the kernel provides
 455 notifications when a cgroup becomes empty.
 456 A cgroup is considered to be empty when it contains no child
 457 cgroups and no member processes.
 458 .PP
 459 A special file in the root directory of each cgroup hierarchy,
 460 .IR release_agent ,
 461 can be used to register the pathname of a program that may be invoked when
 462 a cgroup in the hierarchy becomes empty.
 463 The pathname of the newly empty cgroup (relative to the cgroup mount point)
 464 is provided as the sole command-line argument when the
 465 .IR release_agent
 466 program is invoked.
 467 The
 468 .IR release_agent
 469 program might remove the cgroup directory,
 470 or perhaps repopulate it with a process.
 471 .PP
 472 The default value of the
 473 .IR release_agent
 474 file is empty, meaning that no release agent is invoked.
 475 .PP
 476 The content of the
 477 .I release_agent
 478 file can also be specified via a mount option when the
 479 cgroup filesystem is mounted:
 480 .PP
 481 .in +4n
 482 .EX
 483 mount -o release_agent=pathname ...
 484 .EE
 485 .in
 486 .PP
 487 Whether or not the
 488 .IR release_agent
 489 program is invoked when a particular cgroup becomes empty is determined
 490 by the value in the
 491 .IR notify_on_release
 492 file in the corresponding cgroup directory.
 493 If this file contains the value 0, then the
 494 .IR release_agent
 495 program is not invoked.
 496 If it contains the value 1, the
 497 .IR release_agent
 498 program is invoked.
 499 The default value for this file in the root cgroup is 0.
 500 At the time when a new cgroup is created,
 501 the value in this file is inherited from the corresponding file
 502 in the parent cgroup.
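The two files described above can be set up as follows. This is an illustrative sketch only: the function names and the CGROUP_MOUNT variable are hypothetical, and the agent body assumes that the relative cgroup pathname it receives begins with a slash.

```shell
#!/bin/sh
# cgroup_enable_release_agent HIERARCHY_ROOT AGENT_PATH CGROUP_DIR
# Register AGENT_PATH in the release_agent file at the root of the
# hierarchy, then turn on notification for one cgroup.
cgroup_enable_release_agent() {
    echo "$2" > "$1/release_agent" || return 1
    echo 1 > "$3/notify_on_release"
}

# release_agent_main RELATIVE_CGROUP_PATH
# Possible body of a release agent: remove the newly empty cgroup.
# CGROUP_MOUNT (hypothetical) must name the hierarchy's mount point.
release_agent_main() {
    rmdir "${CGROUP_MOUNT:?}$1"
}
```

Cgroups created under CGROUP_DIR afterward inherit the notify_on_release setting, so the agent would also be invoked for them.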
 503 .\"
 504 .SS Cgroup v1 named hierarchies
 505 In cgroups v1,
 506 it is possible to mount a cgroup hierarchy that has no attached controllers:
 507 .PP
 508 .in +4n
 509 .EX
 510 mount -t cgroup -o none,name=somename none /some/mount/point
 511 .EE
 512 .in
 513 .PP
 514 Multiple instances of such hierarchies can be mounted;
 515 each hierarchy must have a unique name.
 516 The only purpose of such hierarchies is to track processes.
 517 (See the discussion of release notification above.)
 518 An example of this is the
 519 .I name=systemd
 520 cgroup hierarchy that is used by
 521 .BR systemd (1)
 522 to track services and user sessions.
 523 .PP
 524 Since Linux 5.0, the
 525 .I cgroup_no_v1
 526 kernel boot option (described below) can be used to disable cgroup v1
 527 named hierarchies, by specifying
 528 .IR cgroup_no_v1=named .
 529
 530 .\"
 531 .SH CGROUPS VERSION 2
 532 In cgroups v2,
 533 all mounted controllers reside in a single unified hierarchy.
 534 While (different) controllers may be simultaneously
 535 mounted under the v1 and v2 hierarchies,
 536 it is not possible to mount the same controller simultaneously
 537 under both the v1 and the v2 hierarchies.
 538 .PP
 539 The new behaviors in cgroups v2 are summarized here,
 540 and in some cases elaborated in the following subsections.
 541 .IP 1. 3
 542 Cgroups v2 provides a unified hierarchy against
 543 which all controllers are mounted.
 544 .IP 2.
 545 "Internal" processes are not permitted.
 546 With the exception of the root cgroup, processes may reside
 547 only in leaf nodes (cgroups that do not themselves contain child cgroups).
 548 The details are somewhat more subtle than this, and are described below.
 549 .IP 3.
 550 Active controllers must be specified via the files
 551 .IR cgroup.controllers
 552 and
 553 .IR cgroup.subtree_control .
 554 .IP 4.
 555 The
 556 .I tasks
 557 file has been removed.
 558 In addition, the
 559 .I cgroup.clone_children
 560 file that is employed by the
 561 .I cpuset
 562 controller has been removed.
 563 .IP 5.
 564 An improved mechanism for notification of empty cgroups is provided by the
 565 .IR cgroup.events
 566 file.
 567 .PP
 568 For more changes, see the
 569 .I Documentation/cgroup-v2.txt
 570 file in the kernel source.
 571 .PP
 572 Some of the new behaviors listed above saw subsequent modification with
 573 the addition in Linux 4.14 of "thread mode" (described below).
 574 .\"
 575 .SS Cgroups v2 unified hierarchy
 576 In cgroups v1, the ability to mount different controllers
 577 against different hierarchies was intended to allow great flexibility
 578 for application design.
 579 In practice, though,
 580 the flexibility turned out to be less useful than expected,
 581 and in many cases added complexity.
 582 Therefore, in cgroups v2,
 583 all available controllers are mounted against a single hierarchy.
 584 The available controllers are automatically mounted,
 585 meaning that it is not necessary (or possible) to specify the controllers
 586 when mounting the cgroup v2 filesystem using a command such as the following:
 587 .PP
 588 .in +4n
 589 .EX
 590 mount -t cgroup2 none /mnt/cgroup2
 591 .EE
 592 .in
 593 .PP
 594 A cgroup v2 controller is available only if it is not currently in use
 595 via a mount against a cgroup v1 hierarchy.
 596 Or, to put things another way, it is not possible to employ
 597 the same controller against both a v1 hierarchy and the unified v2 hierarchy.
 598 This means that it may be necessary first to unmount a v1 controller
 599 (as described above) before that controller is available in v2.
 600 Since
 601 .BR systemd (1)
 602 makes heavy use of some v1 controllers by default,
 603 it can in some cases be simpler to boot the system with
 604 selected v1 controllers disabled.
 605 To do this, specify the
 606 .IR cgroup_no_v1=list
 607 option on the kernel boot command line;
 608 .I list
 609 is a comma-separated list of the names of the controllers to disable,
 610 or the word
 611 .I all
 612 to disable all v1 controllers.
 613 (This situation is correctly handled by
 614 .BR systemd (1),
 615 which falls back to operating without the specified controllers.)
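To see which controllers were disabled for v1 at boot, the kernel command line can be examined. The sketch below assumes that the command line is readable from /proc/cmdline and that its options are separated by single spaces; the function name cgroup_no_v1_list is hypothetical.

```shell
#!/bin/sh
# cgroup_no_v1_list [CMDLINE_FILE]
# Print the entries given in cgroup_no_v1=list, one per line.
# CMDLINE_FILE defaults to /proc/cmdline (an assumption, see above).
cgroup_no_v1_list() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" |
        sed -n 's/^cgroup_no_v1=//p' |
        tr ',' '\n'
}
```

The output is empty when the option is absent, and consists of the single word all when every v1 controller was disabled.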
 616 .PP
 617 Note that on many modern systems,
 618 .BR systemd (1)
 619 automatically mounts the
 620 .I cgroup2
 621 filesystem at
 622 .I /sys/fs/cgroup/unified
 623 during the boot process.
 624 .\"
 625 .SS Cgroups v2 mount options
 626 The following options
 627 .RI ( "mount -o" )
 628 can be specified when mounting the cgroup v2 filesystem:
 629 .TP
 630 .IR nsdelegate " (since Linux 4.15)"
 631 Treat cgroup namespaces as delegation boundaries.
 632 For details, see below.
 633 .TP
 634 .IR memory_localevents " (since Linux 5.2)"
 635 .\" commit 9852ae3fe5293264f01c49f2571ef7688f7823ce
 636 The
 637 .I memory.events
 638 file should show statistics only for the cgroup itself,
 639 and not for any descendant cgroups.
 640 This was the behavior before Linux 5.2.
 641 Starting in Linux 5.2,
 642 the default behavior is to include statistics for descendant cgroups in
 643 .IR memory.events ,
 644 and this mount option can be used to revert to the legacy behavior.
 645 This option is system wide and can be set on mount or
 646 modified through remount only from the initial mount namespace;
 647 it is silently ignored in noninitial namespaces.
 648 .\"
 649 .SS Cgroups v2 controllers
 650 The following controllers, documented in the kernel source file
 651 .IR Documentation/cgroup-v2.txt ,
 652 are supported in cgroups version 2:
 653 .TP
 654 .IR cpu " (since Linux 4.15)"
 655 This is the successor to the version 1
 656 .I cpu
 657 and
 658 .I cpuacct
 659 controllers.
 660 .TP
 661 .IR cpuset " (since Linux 5.0)"
 662 This is the successor of the version 1
 663 .I cpuset
 664 controller.
 665 .TP
 666 .IR freezer " (since Linux 5.2)"
 667 .\" commit 76f969e8948d82e78e1bc4beb6b9465908e74873
 668 This is the successor of the version 1
 669 .I freezer
 670 controller.
 671 .TP
 672 .IR hugetlb " (since Linux 5.6)"
 673 This is the successor of the version 1
 674 .I hugetlb
 675 controller.
 676 .TP
 677 .IR io " (since Linux 4.5)"
 678 This is the successor of the version 1
 679 .I blkio
 680 controller.
 681 .TP
 682 .IR memory " (since Linux 4.5)"
 683 This is the successor of the version 1
 684 .I memory
 685 controller.
 686 .TP
 687 .IR perf_event " (since Linux 4.11)"
 688 This is the same as the version 1
 689 .I perf_event
 690 controller.
 691 .TP
 692 .IR pids " (since Linux 4.5)"
 693 This is the same as the version 1
 694 .I pids
 695 controller.
 696 .TP
 697 .IR rdma " (since Linux 4.11)"
 698 This is the same as the version 1
 699 .I rdma
 700 controller.
 701 .PP
 702 There is no direct equivalent of the
 703 .I net_cls
 704 and
 705 .I net_prio
 706 controllers from cgroups version 1.
 707 Instead, support has been added to
 708 .BR iptables (8)
 709 to allow eBPF filters that hook on cgroup v2 pathnames to make decisions
 710 about network traffic on a per-cgroup basis.
 711 .PP
 712 The v2
 713 .I devices
 714 controller provides no interface files;
 715 instead, device control is gated by attaching an eBPF
 716 .RB ( BPF_CGROUP_DEVICE )
 717 program to a v2 cgroup.
 718 .\"
 719 .SS Cgroups v2 subtree control
 720 Each cgroup in the v2 hierarchy contains the following two files:
 721 .TP
 722 .IR cgroup.controllers
 723 This read-only file exposes a list of the controllers that are
 724 .I available
 725 in this cgroup.
 726 The contents of this file match the contents of the
 727 .I cgroup.subtree_control
 728 file in the parent cgroup.
 729 .TP
 730 .I cgroup.subtree_control
 731 This is a list of controllers that are
 732 .IR active
 733 .RI ( enabled )
 734 in the cgroup.
 735 The set of controllers in this file is a subset of the set in the
 736 .IR cgroup.controllers
 737 of this cgroup.
 738 The set of active controllers is modified by writing strings to this file
 739 containing space-delimited controller names,
 740 each preceded by '+' (to enable a controller)
 741 or '\-' (to disable a controller), as in the following example:
 742 .IP
 743 .in +4n
 744 .EX
 745 echo '+pids -memory' > x/y/cgroup.subtree_control
 746 .EE
 747 .in
 748 .IP
 749 An attempt to enable a controller
 750 that is not present in
 751 .I cgroup.controllers
 752 leads to an
 753 .B ENOENT
 754 error when writing to the
 755 .I cgroup.subtree_control
 756 file.
 757 .PP
 758 Because the list of controllers in
 759 .I cgroup.subtree_control
 760 is a subset of those in
 761 .IR cgroup.controllers ,
 762 a controller that has been disabled in one cgroup in the hierarchy
 763 can never be re-enabled in the subtree below that cgroup.
 764 .PP
 765 A cgroup's
 766 .I cgroup.subtree_control
 767 file determines the set of controllers that are exercised in the
 768 .I child
 769 cgroups.
 770 When a controller (e.g.,
 771 .IR pids )
 772 is present in the
 773 .I cgroup.subtree_control
 774 file of a parent cgroup,
 775 then the corresponding controller-interface files (e.g.,
 776 .IR pids.max )
 777 are automatically created in the children of that cgroup
 778 and can be used to exert resource control in the child cgroups.
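The relationship between the two files can be sketched as a helper that checks availability before enabling controllers. The function name cgroup_enable_controllers is hypothetical; the availability check merely anticipates the ENOENT error that the kernel would report, and is not required.

```shell
#!/bin/sh
# cgroup_enable_controllers CGROUP_DIR CONTROLLER...
# Enable each named controller for the children of CGROUP_DIR, after
# checking that it is listed in CGROUP_DIR/cgroup.controllers.
cgroup_enable_controllers() {
    dir=$1
    shift
    request=
    for ctrl in "$@"; do
        # Whole-word match against the space-separated controller list.
        if ! tr ' ' '\n' < "$dir/cgroup.controllers" |
                grep -qxF -- "$ctrl"; then
            echo "$ctrl: not available in $dir" >&2
            return 1
        fi
        request="${request:+$request }+$ctrl"
    done
    # A single write such as "+pids +memory" enables all of them.
    echo "$request" > "$dir/cgroup.subtree_control"
}
```

After a successful call, the interface files for the enabled controllers (for example pids.max) should appear in each child of CGROUP_DIR.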
 779 .\"
 780 .SS Cgroups v2 """no internal processes""" rule
 781 Cgroups v2 enforces a so-called "no internal processes" rule.
 782 Roughly speaking, this rule means that,
 783 with the exception of the root cgroup, processes may reside
 784 only in leaf nodes (cgroups that do not themselves contain child cgroups).
 785 This avoids the need to decide how to partition resources between
 786 processes which are members of cgroup A and processes in child cgroups of A.
 787 .PP
 788 For instance, if cgroup
 789 .I /cg1/cg2
 790 exists, then a process may reside in
 791 .IR /cg1/cg2 ,
 792 but not in
 793 .IR /cg1 .
 794 This is to avoid an ambiguity in cgroups v1
 795 with respect to the delegation of resources between processes in
 796 .I /cg1
 797 and its child cgroups.
 798 The recommended approach in cgroups v2 is to create a subdirectory called
 799 .I leaf
 800 for any nonleaf cgroup which should contain processes, but no child cgroups.
 801 Thus, processes which previously would have gone into
 802 .I /cg1
 803 would now go into
 804 .IR /cg1/leaf .
 805 This has the advantage of making explicit
 806 the relationship between processes in
 807 .I /cg1/leaf
 808 and
 809 .IR /cg1 's
 810 other children.
 811 .PP
 812 The "no internal processes" rule is in fact more subtle than stated above.
 813 More precisely, the rule is that a (nonroot) cgroup can't both
 814 (1) have member processes, and
 815 (2) distribute resources into child cgroups\(emthat is, have a nonempty
 816 .I cgroup.subtree_control
 817 file.
 818 Thus, it
 819 .I is
 820 possible for a cgroup to have both member processes and child cgroups,
 821 but before controllers can be enabled for that cgroup,
 822 the member processes must be moved out of the cgroup
 823 (e.g., perhaps into the child cgroups).
 824 .PP
 825 With the Linux 4.14 addition of "thread mode" (described below),
 826 the "no internal processes" rule has been relaxed in some cases.
 827 .\"
 828 .SS Cgroups v2 cgroup.events file
 829 Each nonroot cgroup in the v2 hierarchy contains a read-only file,
 830 .IR cgroup.events ,
 831 whose contents are key-value pairs
 832 (delimited by newline characters, with the key and value separated by spaces)
 833 providing state information about
 834 the cgroup:
 835 .PP
 836 .in +4n
 837 .EX
 838 $ \fBcat mygrp/cgroup.events\fP
 839 populated 1
 840 frozen 0
 841 .EE
 842 .in
 843 .PP
 844 The following keys may appear in this file:
 845 .TP
 846 .IR populated
 847 The value of this key is either 1,
 848 if this cgroup or any of its descendants has member processes,
 849 or otherwise 0.
 850 .TP
 851 .IR frozen " (since Linux 5.2)"
 852 .\" commit 76f969e8948d82e78e1bc4beb6b9465908e7487
 853 The value of this key is 1 if this cgroup is currently frozen,
 854 or 0 if it is not.
 855 .PP
 856 The
 857 .IR cgroup.events
 858 file can be monitored, in order to receive notification when the value of
 859 one of its keys changes.
 860 Such monitoring can be done using
 861 .BR inotify (7),
 862 which notifies changes as
 863 .BR IN_MODIFY
 864 events, or
 865 .BR poll (2),
 866 which notifies changes by returning the
 867 .B POLLPRI
 868 and
 869 .B POLLERR
 870 bits in the
 871 .IR revents
 872 field.
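.PP
As a rough illustration of consulting this file from a shell script,
the following sketch prints the value of a single key.
The function name
.I cg_event_value
and its arguments are hypothetical;
the only assumption about the file is the "key value" line format shown above.
.PP
.in +4n
.EX
```shell
# Hypothetical helper: print the value of KEY from DIR/cgroup.events.
# Fails (nonzero status) if the key is not present.
# Usage: cg_event_value DIR KEY
cg_event_value() {
    awk -v key="$2" '$1 == key { print $2; found = 1 }
        END { exit !found }' "$1/cgroup.events"
}

# Example (the path is an assumption):
#   cg_event_value /sys/fs/cgroup/unified/mygrp populated
```
.EE
.in
.PP
A monitoring program would typically reread the file in this way
each time it is notified of a change.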
 873 .\"
 874 .SS Cgroup v2 release notification
 875 Cgroups v2 provides a new mechanism for obtaining notification
 876 when a cgroup becomes empty.
 877 The cgroups v1
 878 .IR release_agent
 879 and
 880 .IR notify_on_release
 881 files are removed, and replaced by the
 882 .I populated
 883 key in the
 884 .IR cgroup.events
 885 file.
 886 This key either has the value 0,
 887 meaning that the cgroup (and its descendants)
 888 contain no (nonzombie) member processes,
 889 or 1, meaning that the cgroup (or one of its descendants)
 890 contains member processes.
 891 .PP
 892 The cgroups v2 release-notification mechanism
 893 offers the following advantages over the cgroups v1
 894 .IR release_agent
 895 mechanism:
 896 .IP * 3
 897 It allows for cheaper notification,
 898 since a single process can monitor multiple
 899 .IR cgroup.events
 900 files (using the techniques described earlier).
 901 By contrast, the cgroups v1 mechanism requires the expense of creating
 902 a process for each notification.
 903 .IP *
 904 Notification for different cgroup subhierarchies can be delegated
 905 to different processes.
 906 By contrast, the cgroups v1 mechanism allows only one release agent
 907 for an entire hierarchy.
 908 .\"
 909 .SS Cgroups v2 cgroup.stat file
 910 .\" commit ec39225cca42c05ac36853d11d28f877fde5c42e
 911 Each cgroup in the v2 hierarchy contains a read-only
 912 .IR cgroup.stat
 913 file (first introduced in Linux 4.14)
 914 that consists of lines containing key-value pairs.
 915 The following keys currently appear in this file:
 916 .TP
 917 .I nr_descendants
 918 This is the total number of visible (i.e., living) descendant cgroups
 919 underneath this cgroup.
 920 .TP
 921 .I nr_dying_descendants
 922 This is the total number of dying descendant cgroups
 923 underneath this cgroup.
 924 A cgroup enters the dying state after being deleted.
 925 It remains in that state for an undefined period
 926 (which will depend on system load)
 927 while resources are freed before the cgroup is destroyed.
 928 Note that the presence of some cgroups in the dying state is normal,
 929 and is not indicative of any problem.
 930 .IP
 931 A process can't be made a member of a dying cgroup,
 932 and a dying cgroup can't be brought back to life.
 933 .\"
 934 .SS Limiting the number of descendant cgroups
 935 Each cgroup in the v2 hierarchy contains the following files,
 936 which can be used to view and set limits on the number
 937 of descendant cgroups under that cgroup:
 938 .TP
 939 .IR cgroup.max.depth " (since Linux 4.14)"
 940 .\" commit 1a926e0bbab83bae8207d05a533173425e0496d1
 941 This file defines a limit on the depth of nesting of descendant cgroups.
 942 A value of 0 in this file means that no descendant cgroups can be created.
 943 An attempt to create a descendant whose nesting level exceeds
 944 the limit fails
 945 .RI ( mkdir (2)
 946 fails with the error
 947 .BR EAGAIN ).
 948 .IP
 949 Writing the string
 950 .IR """max"""
 951 to this file means that no limit is imposed.
 952 The default value in this file is
 953 .IR """max""" .
 954 .TP
 955 .IR cgroup.max.descendants " (since Linux 4.14)"
 956 .\" commit 1a926e0bbab83bae8207d05a533173425e0496d1
 957 This file defines a limit on the number of live descendant cgroups that
 958 this cgroup may have.
 959 An attempt to create more descendants than allowed by the limit fails
 960 .RI ( mkdir (2)
 961 fails with the error
 962 .BR EAGAIN ).
 963 .IP
 964 Writing the string
 965 .IR """max"""
 966 to this file means that no limit is imposed.
 967 The default value in this file is
 968 .IR """max""" .
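.PP
The following sketch shows one way a script might check,
before calling
.BR mkdir (1),
whether a cgroup's own
.I cgroup.max.descendants
limit leaves room for another descendant.
The function name
.I cg_has_room
is hypothetical,
and the sketch deliberately ignores limits set in ancestor cgroups and the
.I cgroup.max.depth
limit.
.PP
.in +4n
.EX
```shell
# Hypothetical helper: succeed if DIR's own cgroup.max.descendants
# limit leaves room for one more descendant cgroup.
# Limits set on ancestors of DIR are not consulted in this sketch.
# Usage: cg_has_room DIR
cg_has_room() {
    limit=$(cat "$1/cgroup.max.descendants") || return 2
    [ "$limit" = "max" ] && return 0
    # The live descendant count is reported in cgroup.stat.
    live=$(awk '$1 == "nr_descendants" { print $2 }' "$1/cgroup.stat")
    [ -n "$live" ] || return 2
    [ "$live" -lt "$limit" ]
}
```
.EE
.in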
 969 .\"
 970 .SH CGROUPS DELEGATION: DELEGATING A HIERARCHY TO A LESS PRIVILEGED USER
 971 In the context of cgroups,
 972 delegation means passing management of some subtree
 973 of the cgroup hierarchy to a nonprivileged user.
 974 Cgroups v1 provides support for delegation based on file permissions
 975 in the cgroup hierarchy but with less strict containment rules than v2
 976 (as noted below).
 977 Cgroups v2 supports delegation with containment by explicit design.
 978 The focus of the discussion in this section is on delegation in cgroups v2,
 979 with some differences for cgroups v1 noted along the way.
 980 .PP
 981 Some terminology is required in order to describe delegation.
 982 A
 983 .I delegater
 984 is a privileged user (i.e., root) who owns a parent cgroup.
 985 A
 986 .I delegatee
 987 is a nonprivileged user who will be granted the permissions needed
 988 to manage some subhierarchy under that parent cgroup,
 989 known as the
 990 .IR "delegated subtree" .
 991 .PP
 992 To perform delegation,
 993 the delegater makes certain directories and files writable by the delegatee,
 994 typically by changing the ownership of the objects to be the user ID
 995 of the delegatee.
 996 Assuming that we want to delegate the hierarchy rooted at (say)
 997 .I /dlgt_grp
 998 and that there are not yet any child cgroups under that cgroup,
 999 the ownership of the following is changed to the user ID of the delegatee:
1000 .TP
1001 .IR /dlgt_grp
1002 Changing the ownership of the root of the subtree means that any new
1003 cgroups created under the subtree (and the files they contain)
1004 will also be owned by the delegatee.
1005 .TP
1006 .IR /dlgt_grp/cgroup.procs
1007 Changing the ownership of this file means that the delegatee
1008 can move processes into the root of the delegated subtree.
1009 .TP
1010 .IR /dlgt_grp/cgroup.subtree_control " (cgroups v2 only)"
1011 Changing the ownership of this file means that the delegatee
1012 can enable controllers (that are present in
1013 .IR /dlgt_grp/cgroup.controllers )
1014 in order to further redistribute resources at lower levels in the subtree.
1015 (As an alternative to changing the ownership of this file,
1016 the delegater might instead add selected controllers to this file.)
1017 .TP
1018 .IR /dlgt_grp/cgroup.threads " (cgroups v2 only)"
1019 Changing the ownership of this file is necessary if a threaded subtree
1020 is being delegated (see the description of "thread mode", below).
1021 This permits the delegatee to write thread IDs to the file.
1022 (The ownership of this file can also be changed when delegating
1023 a domain subtree, but currently this serves no purpose,
1024 since, as described below, it is not possible to move a thread between
1025 domain cgroups by writing its thread ID to the
1026 .IR cgroup.threads
1027 file.)
1028 .IP
1029 In cgroups v1, the corresponding file that should instead be delegated is the
1030 .I tasks
1031 file.
1032 .PP
1033 The delegater should
1034 .I not
1035 change the ownership of any of the controller interface files (e.g.,
1036 .IR pids.max ,
1037 .IR memory.high )
1038 in
1039 .IR /dlgt_grp .
1040 Those files are used from the next level above the delegated subtree
1041 in order to distribute resources into the subtree,
1042 and the delegatee should not have permission to change
1043 the resources that are distributed into the delegated subtree.
1044 .PP
1045 See also the discussion of the
1046 .IR /sys/kernel/cgroup/delegate
1047 file in NOTES for information about further delegatable files in cgroups v2.
1048 .PP
1049 After the aforementioned steps have been performed,
1050 the delegatee can create child cgroups within the delegated subtree
1051 (the cgroup subdirectories and the files they contain
1052 will be owned by the delegatee)
1053 and move processes between cgroups in the subtree.
1054 If some controllers are present in
1055 .IR /dlgt_grp/cgroup.subtree_control ,
1056 or the ownership of that file was passed to the delegatee,
1057 the delegatee can also control the further redistribution
1058 of the corresponding resources into the delegated subtree.
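.PP
The ownership changes described in this section could be scripted
along the following lines.
This is only a sketch:
the function name
.I cg_delegate
is hypothetical,
and it handles just the files listed above
(not any further files named in
.IR /sys/kernel/cgroup/delegate ).
.PP
.in +4n
.EX
```shell
# Hypothetical helper: delegate the cgroup v2 subtree rooted at DIR
# to USER by changing ownership of the directory and of the
# delegatable files only; controller interface files are left alone.
# Usage (as a privileged user): cg_delegate DIR USER
cg_delegate() {
    chown "$2" "$1" || return 1
    for f in cgroup.procs cgroup.subtree_control cgroup.threads; do
        # Skip files that are not present in this cgroup.
        [ -e "$1/$f" ] || continue
        chown "$2" "$1/$f" || return 1
    done
}
```
.EE
.in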
1059 .\"
1060 .SS Cgroups v2 delegation: nsdelegate and cgroup namespaces
1061 Starting with Linux 4.13,
1062 .\" commit 5136f6365ce3eace5a926e10f16ed2a233db5ba9
1063 there is a second way to perform cgroup delegation in the cgroups v2 hierarchy.
1064 This is done by mounting or remounting the cgroup v2 filesystem with the
1065 .I nsdelegate
1066 mount option.
1067 For example, if the cgroup v2 filesystem has already been mounted,
1068 we can remount it with the
1069 .I nsdelegate
1070 option as follows:
1071 .PP
1072 .in +4n
1073 .EX
1074 mount -t cgroup2 -o remount,nsdelegate \e
1075                  none /sys/fs/cgroup/unified
1076 .EE
1077 .in
1078 .\"
1079 .\" Alternatively, we could boot the kernel with the options:
1080 .\"
1081 .\"    cgroup_no_v1=all systemd.legacy_systemd_cgroup_controller
1082 .\"
1083 .\" The effect of the latter option is to prevent systemd from employing
1084 .\" its "hybrid" cgroup mode, where it tries to make use of cgroups v2.
1085 .PP
1086 The effect of this mount option is to cause cgroup namespaces
1087 to automatically become delegation boundaries.
1088 More specifically,
1089 the following restrictions apply for processes inside the cgroup namespace:
1090 .IP * 3
1091 Writes to controller interface files in the root directory of the namespace
1092 will fail with the error
1093 .BR EPERM .
1094 Processes inside the cgroup namespace can still write to delegatable
1095 files in the root directory of the cgroup namespace such as
1096 .IR cgroup.procs
1097 and
1098 .IR cgroup.subtree_control ,
1099 and can create a subhierarchy underneath the root directory.
1100 .IP *
1101 Attempts to migrate processes across the namespace boundary are denied
1102 (with the error
1103 .BR ENOENT ).
1104 Processes inside the cgroup namespace can still
1105 (subject to the containment rules described below)
1106 move processes between cgroups
1107 .I within
1108 the subhierarchy under the namespace root.
1109 .PP
1110 The ability to define cgroup namespaces as delegation boundaries
1111 makes cgroup namespaces more useful.
1112 To understand why, suppose that we already have one cgroup hierarchy
1113 that has been delegated to a nonprivileged user,
1114 .IR cecilia ,
1115 using the older delegation technique described above.
1116 Suppose further that
1117 .I cecilia
1118 wanted to further delegate a subhierarchy
1119 under the existing delegated hierarchy.
1120 (For example, the delegated hierarchy might be associated with
1121 an unprivileged container run by
1122 .IR cecilia .)
1123 Even if a cgroup namespace was employed,
1124 because both hierarchies are owned by the unprivileged user
1125 .IR cecilia ,
1126 the following illegitimate actions could be performed:
1127 .IP * 3
1128 A process in the inferior hierarchy could change the
1129 resource controller settings in the root directory of that hierarchy.
1130 (These resource controller settings are intended to allow control to
1131 be exercised from the
1132 .I parent
1133 cgroup;
1134 a process inside the child cgroup should not be allowed to modify them.)
1135 .IP *
1136 A process inside the inferior hierarchy could move processes
1137 into and out of the inferior hierarchy if the cgroups in the
1138 superior hierarchy were somehow visible.
1139 .PP
1140 Employing the
1141 .I nsdelegate
1142 mount option prevents both of these possibilities.
1143 .PP
1144 The
1145 .I nsdelegate
1146 mount option only has an effect when performed in
1147 the initial mount namespace;
1148 in other mount namespaces, the option is silently ignored.
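.PP
One way to check whether the option took effect is to inspect
the mount options of the cgroup v2 filesystem.
The sketch below assumes that the option is reported as
.I nsdelegate
in the fourth field of
.IR /proc/mounts ;
the function name
.I cg_has_nsdelegate
is hypothetical.
.PP
.in +4n
.EX
```shell
# Hypothetical helper: succeed if MOUNTS (default /proc/mounts) lists
# a cgroup2 filesystem whose options include nsdelegate.
# Usage: cg_has_nsdelegate [MOUNTS]
cg_has_nsdelegate() {
    awk '$3 == "cgroup2" {
            n = split($4, opt, ",")
            for (i = 1; i <= n; i++)
                if (opt[i] == "nsdelegate") ok = 1
         }
         END { exit !ok }' "${1:-/proc/mounts}"
}
```
.EE
.in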
1149 .PP
1150 .IR Note :
1151 On some systems,
1152 .BR systemd (1)
1153 automatically mounts the cgroup v2 filesystem.
1154 In order to experiment with the
1155 .I nsdelegate
1156 operation, it may be useful to boot the kernel with
1157 the following command-line options:
1158 .PP
1159 .in +4n
1160 .EX
1161 cgroup_no_v1=all systemd.legacy_systemd_cgroup_controller
1162 .EE
1163 .in
1164 .PP
1165 These options cause the kernel to boot with the cgroups v1 controllers
1166 disabled (meaning that the controllers are available in the v2 hierarchy),
1167 and tell
1168 .BR systemd (1)
1169 not to mount and use the cgroup v2 hierarchy,
1170 so that the v2 hierarchy can be manually mounted
1171 with the desired options after boot-up.
1172 .\"
1173 .SS Cgroup delegation containment rules
1174 Some delegation
1175 .IR "containment rules"
1176 ensure that the delegatee can move processes between cgroups within the
1177 delegated subtree,
1178 but can't move processes from outside the delegated subtree into
1179 the subtree or vice versa.
1180 A nonprivileged process (i.e., the delegatee) can write the PID of
1181 a "target" process into a
1182 .IR cgroup.procs
1183 file only if all of the following are true:
1184 .IP * 3
1185 The writer has write permission on the
1186 .I cgroup.procs
1187 file in the destination cgroup.
1188 .IP *
1189 The writer has write permission on the
1190 .I cgroup.procs
1191 file in the nearest common ancestor of the source and destination cgroups.
1192 Note that in some cases,
1193 the nearest common ancestor may be the source or destination cgroup itself.
1194 This requirement is not enforced for cgroups v1 hierarchies,
1195 with the consequence that containment in v1 is less strict than in v2.
1196 (For example, in cgroups v1 the user that owns two distinct
1197 delegated subhierarchies can move a process between the hierarchies.)
1198 .IP *
1199 If the cgroup v2 filesystem was mounted with the
1200 .I nsdelegate
1201 option, the writer must be able to see the source and destination cgroups
1202 from its cgroup namespace.
1203 .IP *
1204 In cgroups v1:
1205 the effective UID of the writer (i.e., the delegatee) matches the
1206 real user ID or the saved set-user-ID of the target process.
1207 Before Linux 4.11,
1208 .\" commit 576dd464505fc53d501bb94569db76f220104d28
1209 this requirement also applied in cgroups v2.
1210 (This was a historical requirement inherited from cgroups v1
1211 that was later deemed unnecessary,
1212 since the other rules suffice for containment in cgroups v2.)
1213 .PP
1214 .IR Note :
1215 one consequence of these delegation containment rules is that the
1216 unprivileged delegatee can't place the first process into
1217 the delegated subtree;
1218 instead, the delegater must place the first process
1219 (a process owned by the delegatee) into the delegated subtree.
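.PP
The "nearest common ancestor" used in the second rule above
is a purely path-based notion.
The following sketch computes it for two cgroup paths
expressed relative to the root of the hierarchy;
the function name
.I cg_common_ancestor
is hypothetical,
and the paths are assumed to be absolute and without trailing slashes.
.PP
.in +4n
.EX
```shell
# Hypothetical helper: print the nearest common ancestor of two
# absolute cgroup paths (relative to the cgroup root).
# Usage: cg_common_ancestor SRC DST
cg_common_ancestor() {
    a=$1
    # Shorten the first path until it equals the second path
    # or is a directory prefix of it.
    while [ -n "$a" ]; do
        case "$2/" in
            "$a"/*) break ;;
        esac
        a=${a%/*}
    done
    echo "${a:-/}"
}
```
.EE
.in
.PP
For the paths
.I /a/b/c
and
.IR /a/b/d ,
the function prints
.IR /a/b ;
the writer would therefore need write permission on
.IR /a/b/cgroup.procs .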
1220 .\"
1221 .SH CGROUPS VERSION 2 THREAD MODE
1222 Among the restrictions imposed by cgroups v2 that were not present
1223 in cgroups v1 are the following:
1224 .IP * 3
1225 .IR "No thread-granularity control" :
1226 all of the threads of a process must be in the same cgroup.
1227 .IP *
1228 .IR "No internal processes" :
1229 a cgroup can't both have member processes and
1230 exercise controllers on child cgroups.
1231 .PP
1232 Both of these restrictions were added because
1233 the lack of these restrictions had caused problems
1234 in cgroups v1.
1235 In particular, the cgroups v1 ability to allow thread-level granularity
1236 for cgroup membership made no sense for some controllers.
1237 (A notable example was the
1238 .I memory
1239 controller: since threads share an address space,
1240 it made no sense to split threads across different
1241 .I memory
1242 cgroups.)
1243 .PP
1244 Notwithstanding the initial design decision in cgroups v2,
1245 there were use cases for certain controllers, notably the
1246 .IR cpu
1247 controller,
1248 for which thread-level granularity of control was meaningful and useful.
1249 To accommodate such use cases, Linux 4.14 added
1250 .I "thread mode"
1251 for cgroups v2.
1252 .PP
1253 Thread mode allows the following:
1254 .IP * 3
1255 The creation of
1256 .IR "threaded subtrees"
1257 in which the threads of a process may
1258 be spread across cgroups inside the tree.
1259 (A threaded subtree may contain multiple multithreaded processes.)
1260 .IP *
1261 The concept of
1262 .IR "threaded controllers" ,
1263 which can distribute resources across the cgroups in a threaded subtree.
1264 .IP *
1265 A relaxation of the "no internal processes" rule,
1266 so that, within a threaded subtree,
1267 a cgroup can both contain member threads and
1268 exercise resource control over child cgroups.
1269 .PP
1270 With the addition of thread mode,
1271 each nonroot cgroup now contains a new file,
1272 .IR cgroup.type ,
1273 that exposes, and in some circumstances can be used to change,
1274 the "type" of a cgroup.
1275 This file contains one of the following type values:
1276 .TP
1277 .I "domain"
1278 This is a normal v2 cgroup that provides process-granularity control.
1279 If a process is a member of this cgroup,
1280 then all threads of the process are (by definition) in the same cgroup.
1281 This is the default cgroup type,
1282 and provides the same behavior that was provided for
1283 cgroups in the initial cgroups v2 implementation.
1284 .TP
1285 .I "threaded"
1286 This cgroup is a member of a threaded subtree.
1287 Threads can be added to this cgroup,
1288 and controllers can be enabled for the cgroup.
1289 .TP
1290 .I "domain threaded"
1291 This is a domain cgroup that serves as the root of a threaded subtree.
1292 This cgroup type is also known as "threaded root".
1293 .TP
1294 .I "domain invalid"
1295 This is a cgroup inside a threaded subtree
1296 that is in an "invalid" state.
1297 Processes can't be added to the cgroup,
1298 and controllers can't be enabled for the cgroup.
1299 The only thing that can be done with this cgroup (other than deleting it)
1300 is to convert it to a
1301 .IR threaded
1302 cgroup by writing the string
1303 .IR """threaded"""
1304 to the
1305 .I cgroup.type
1306 file.
1307 .IP
1308 The rationale for the existence of this "interim" type
1309 during the creation of a threaded subtree
1310 (rather than the kernel simply immediately converting all cgroups
1311 under the threaded root to the type
1312 .IR threaded )
1313 is to allow for
1314 possible future extensions to the thread mode model.
1315 .\"
1316 .SS Threaded versus domain controllers
1317 With the addition of thread mode,
1318 cgroups v2 now distinguishes two types of resource controllers:
1319 .IP * 3
1320 .I Threaded
1321 .\" In the kernel source, look for ".threaded[ \t]*= true" in
1322 .\" initializations of struct cgroup_subsys
1323 controllers: these controllers support thread-granularity for
1324 resource control and can be enabled inside threaded subtrees,
1325 with the result that the corresponding controller-interface files
1326 appear inside the cgroups in the threaded subtree.
1327 As at Linux 4.19, the following controllers are threaded:
1328 .IR cpu ,
1329 .IR perf_event ,
1330 and
1331 .IR pids .
1332 .IP *
1333 .I Domain
1334 controllers: these controllers support only process granularity
1335 for resource control.
1336 From the perspective of a domain controller,
1337 all threads of a process are always in the same cgroup.
1338 Domain controllers can't be enabled inside a threaded subtree.
1339 .\"
1340 .SS Creating a threaded subtree
1341 There are two pathways that lead to the creation of a threaded subtree.
1342 The first pathway proceeds as follows:
1343 .IP 1. 3
1344 We write the string
1345 .IR """threaded"""
1346 to the
1347 .I cgroup.type
1348 file of a cgroup
1349 .IR y/z
1350 that currently has the type
1351 .IR domain .
1352 This has the following effects:
1353 .RS
1354 .IP * 3
1355 The type of the cgroup
1356 .IR y/z
1357 becomes
1358 .IR threaded .
1359 .IP *
1360 The type of the parent cgroup,
1361 .IR y ,
1362 becomes
1363 .IR "domain threaded" .
1364 The parent cgroup is the root of a threaded subtree
1365 (also known as the "threaded root").
1366 .IP *
1367 All other cgroups under
1368 .IR y
1369 that were not already of type
1370 .IR threaded
1371 (because they were inside already existing threaded subtrees
1372 under the new threaded root)
1373 are converted to type
1374 .IR "domain invalid" .
1375 Any subsequently created cgroups under
1376 .I y
1377 will also have the type
1378 .IR "domain invalid" .
1379 .RE
1380 .IP 2.
1381 We write the string
1382 .IR """threaded"""
1383 to each of the
1384 .IR "domain invalid"
1385 cgroups under
1386 .IR y ,
1387 in order to convert them to the type
1388 .IR threaded .
1389 As a consequence of this step, all cgroups under the threaded root
1390 now have the type
1391 .IR threaded
1392 and the threaded subtree is now fully usable.
1393 The requirement to write
1394 .IR """threaded"""
1395 to each of these cgroups is somewhat cumbersome,
1396 but allows for possible future extensions to the thread-mode model.
1397 .PP
1398 The second way of creating a threaded subtree is as follows:
1399 .IP 1. 3
1400 In an existing cgroup,
1401 .IR z ,
1402 that currently has the type
1403 .IR domain ,
1404 we (1) enable one or more threaded controllers and
1405 (2) make a process a member of
1406 .IR z .
1407 (These two steps can be done in either order.)
1408 This has the following consequences:
1409 .RS
1410 .IP * 3
1411 The type of
1412 .I z
1413 becomes
1414 .IR "domain threaded" .
1415 .IP *
1416 All of the descendant cgroups of
1417 .I z
1418 that were not already of type
1419 .IR threaded
1420 are converted to type
1421 .IR "domain invalid" .
1422 .RE
1423 .IP 2.
1424 As before, we make the threaded subtree usable by writing the string
1425 .IR """threaded"""
1426 to each of the
1427 .IR "domain invalid"
1428 cgroups under
1429 .IR z ,
1430 in order to convert them to the type
1431 .IR threaded .
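.PP
Step 2 of either pathway can be scripted.
The following is a minimal sketch under the assumption that the
threaded root is given as a directory path;
the function name
.I cg_make_threaded
is hypothetical.
It relies on
.BR find (1)
listing each directory before that directory's contents,
so that parent cgroups are converted before their children.
.PP
.in +4n
.EX
```shell
# Hypothetical helper: write "threaded" to every "domain invalid"
# cgroup below ROOT, visiting parents before their children.
# Usage: cg_make_threaded ROOT
cg_make_threaded() {
    find "$1" -mindepth 1 -type d | while IFS= read -r dir; do
        if [ "$(cat "$dir/cgroup.type")" = "domain invalid" ]; then
            echo threaded > "$dir/cgroup.type" || exit 1
        fi
    done
}
```
.EE
.in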
1432 .PP
1433 One of the consequences of the above pathways to creating a threaded subtree
1434 is that the threaded root cgroup can be a parent only to
1435 .I threaded
1436 (and
1437 .IR "domain invalid" )
1438 cgroups.
1439 The threaded root cgroup can't be a parent of a
1440 .I domain
1441 cgroup, and a
1442 .I threaded
1443 cgroup
1444 can't have a sibling that is a
1445 .I domain
1446 cgroup.
1447 .\"
1448 .SS Using a threaded subtree
1449 Within a threaded subtree, threaded controllers can be enabled
1450 in each subgroup whose type has been changed to
1451 .IR threaded ;
1452 upon doing so, the corresponding controller interface files
1453 appear in the children of that cgroup.
1454 .PP
1455 A process can be moved into a threaded subtree by writing its PID to the
1456 .I cgroup.procs
1457 file in one of the cgroups inside the tree.
1458 This has the effect of making all of the threads
1459 in the process members of the corresponding cgroup
1460 and makes the process a member of the threaded subtree.
1461 The threads of the process can then be spread across
1462 the threaded subtree by writing their thread IDs (see
1463 .BR gettid (2))
1464 to the
1465 .I cgroup.threads
1466 files in different cgroups inside the subtree.
1467 The threads of a process must all reside in the same threaded subtree.
1468 .PP
1469 As with writing to
1470 .IR cgroup.procs ,
1471 some containment rules apply when writing to the
1472 .I cgroup.threads
1473 file:
1474 .IP * 3
1475 The writer must have write permission on the
1476 .I cgroup.threads
1477 file in the destination cgroup.
1478 .IP *
1479 The writer must have write permission on the
1480 .I cgroup.procs
1481 file in the common ancestor of the source and destination cgroups.
1482 (In some cases,
1483 the common ancestor may be the source or destination cgroup itself.)
1484 .IP *
1485 The source and destination cgroups must be in the same threaded subtree.
1486 (Outside a threaded subtree, an attempt to move a thread by writing
1487 its thread ID to the
1488 .I cgroup.threads
1489 file in a different
1490 .I domain
1491 cgroup fails with the error
1492 .BR EOPNOTSUPP .)
1493 .PP
1494 The
1495 .I cgroup.threads
1496 file is present in each cgroup (including
1497 .I domain
1498 cgroups) and can be read in order to discover the set of threads
1499 that is present in the cgroup.
1500 The set of thread IDs obtained when reading this file
1501 is not guaranteed to be ordered or free of duplicates.
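.PP
A script that needs a clean list of thread IDs
can therefore normalize the output itself.
A minimal sketch, using a hypothetical function name
.IR cg_threads :
.PP
.in +4n
.EX
```shell
# Hypothetical helper: print the thread IDs listed in
# DIR/cgroup.threads in numeric order with duplicates removed.
# Usage: cg_threads DIR
cg_threads() {
    sort -n -u "$1/cgroup.threads"
}
```
.EE
.in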
1502 .PP
1503 The
1504 .I cgroup.procs
1505 file in the threaded root shows the PIDs of all processes
1506 that are members of the threaded subtree.
1507 The
1508 .I cgroup.procs
1509 files in the other cgroups in the subtree are not readable.
1510 .PP
1511 Domain controllers can't be enabled in a threaded subtree;
1512 no domain controller-interface files appear inside the cgroups underneath the
1513 threaded root.
1514 From the point of view of a domain controller,
1515 threaded subtrees are invisible:
1516 a multithreaded process inside a threaded subtree appears to a domain
1517 controller as a process that resides in the threaded root cgroup.
1518 .PP
1519 Within a threaded subtree, the "no internal processes" rule does not apply:
1520 a cgroup can both contain member processes (or threads)
1521 and exercise controllers on child cgroups.
1522 .\"
1523 .SS Rules for writing to cgroup.type and creating threaded subtrees
1524 A number of rules apply when writing to the
1525 .I cgroup.type
1526 file:
1527 .IP * 3
1528 Only the string
1529 .IR """threaded"""
1530 may be written.
1531 In other words, the only explicit transition that is possible is to convert a
1532 .I domain
1533 cgroup to type
1534 .IR threaded .
1535 .IP *
1536 The effect of writing
1537 .IR """threaded"""
1538 depends on the current value in
1539 .IR cgroup.type ,
1540 as follows:
1541 .RS
1542 .IP \(bu 3
1543 .IR domain
1544 or
1545 .IR "domain threaded" :
1546 start the creation of a threaded subtree
1547 (whose root is the parent of this cgroup) via
1548 the first of the pathways described above;
1549 .IP \(bu
1550 .IR "domain\ invalid" :
1551 convert this cgroup (which is inside a threaded subtree) to a usable (i.e.,
1552 .IR threaded )
1553 state;
1554 .IP \(bu
1555 .IR threaded :
1556 no effect (a "no-op").
1557 .RE
1558 .IP *
1559 We can't write to a
1560 .I cgroup.type
1561 file if the parent's type is
1562 .IR "domain invalid" .
1563 In other words, the cgroups of a threaded subtree must be converted to the
1564 .I threaded
1565 state in a top-down manner.
1566 .PP
1567 There are also some constraints that must be satisfied
1568 in order to create a threaded subtree rooted at the cgroup
1569 .IR x :
1570 .IP * 3
1571 There can be no member processes in the descendant cgroups of
1572 .IR x .
1573 (The cgroup
1574 .I x
1575 can itself have member processes.)
1576 .IP *
1577 No domain controllers may be enabled in
1578 .IR x 's
1579 .IR cgroup.subtree_control
1580 file.
1581 .PP
1582 If any of the above constraints is violated, then an attempt to write
1583 .IR """threaded"""
1584 to a
1585 .IR cgroup.type
1586 file fails with the error
1587 .BR ENOTSUP .
1588 .\"
1589 .SS The """domain threaded""" cgroup type
1590 According to the pathways described above,
1591 the type of a cgroup can change to
1592 .IR "domain threaded"
1593 in either of the following cases:
1594 .IP * 3
1595 The string
1596 .IR """threaded"""
1597 is written to a child cgroup.
1598 .IP *
1599 A threaded controller is enabled inside the cgroup and
1600 a process is made a member of the cgroup.
1601 .PP
1602 A
1603 .IR "domain threaded"
1604 cgroup,
1605 .IR x ,
1606 can revert to the type
1607 .IR domain
1608 if the above conditions no longer hold true\(emthat is, if all
1609 .I threaded
1610 child cgroups of
1611 .I x
1612 are removed and either
1613 .I x
1614 no longer has threaded controllers enabled or
1615 no longer has member processes.
1616 .PP
1617 When a
1618 .IR "domain threaded"
1619 cgroup
1620 .IR x
1621 reverts to the type
1622 .IR domain :
1623 .IP * 3
1624 All
1625 .IR "domain invalid"
1626 descendants of
1627 .I x
1628 that are not in lower-level threaded subtrees revert to the type
1629 .IR domain .
1630 .IP *
1631 The root cgroups in any lower-level threaded subtrees revert to the type
1632 .IR "domain threaded" .
1633 .\"
1634 .SS Exceptions for the root cgroup
1635 The root cgroup of the v2 hierarchy is treated exceptionally:
1636 it can be the parent of both
1637 .I domain
1638 and
1639 .I threaded
1640 cgroups.
1641 If the string
1642 .I """threaded"""
1643 is written to the
1644 .I cgroup.type
1645 file of one of the children of the root cgroup, then
1646 .IP * 3
1647 The type of that cgroup becomes
1648 .IR threaded .
1649 .IP *
1650 The type of any descendants of that cgroup that
1651 are not part of lower-level threaded subtrees changes to
1652 .IR "domain invalid" .
1653 .PP
1654 Note that in this case, there is no cgroup whose type becomes
1655 .IR "domain threaded" .
1656 (Notionally, the root cgroup can be considered as the threaded root
1657 for the cgroup whose type was changed to
1658 .IR threaded .)
1659 .PP
1660 The aim of this exceptional treatment for the root cgroup is to
1661 allow a threaded cgroup that employs the
1662 .I cpu
1663 controller to be placed as high as possible in the hierarchy,
1664 so as to minimize the (small) cost of traversing the cgroup hierarchy.
1665 .\"
1666 .SS The cgroups v2 """cpu""" controller and realtime threads
1667 As at Linux 4.19, the cgroups v2
1668 .I cpu
1669 controller does not support control of realtime threads
1670 (specifically threads scheduled under any of the policies
1671 .BR SCHED_FIFO ,
1672 .BR SCHED_RR ,
1673 and
1674 .BR SCHED_DEADLINE ;
1675 see
1676 .BR sched (7)).
1677 Therefore, the
1678 .I cpu
1679 controller can be enabled in the root cgroup only
1680 if all realtime threads are in the root cgroup.
1681 (If there are realtime threads in nonroot cgroups, then a
1682 .BR write (2)
1683 of the string
1684 .IR """+cpu"""
1685 to the
1686 .I cgroup.subtree_control
1687 file fails with the error
1688 .BR EINVAL .)
1689 .PP
1690 On some systems,
1691 .BR systemd (1)
1692 places certain realtime threads in nonroot cgroups in the v2 hierarchy.
1693 On such systems,
1694 these threads must first be moved to the root cgroup before the
1695 .I cpu
1696 controller can be enabled.
1697 .\"
1698 .SH ERRORS
1699 The following errors can occur for
1700 .BR mount (2):
1701 .TP
1702 .B EBUSY
1703 An attempt to mount a cgroup version 1 filesystem specified neither the
1704 .I name=
1705 option (to mount a named hierarchy) nor a controller name (or
1706 .IR all ).
1707 .SH NOTES
1708 A child process created via
1709 .BR fork (2)
1710 inherits its parent's cgroup memberships.
1711 A process's cgroup memberships are preserved across
1712 .BR execve (2).
1713 .PP
1714 The
1715 .BR clone3 (2)
1716 .B CLONE_INTO_CGROUP
1717 flag can be used to create a child process that begins its life in
1718 a different version 2 cgroup from the parent process.
1719 .\"
1720 .SS /proc files
1721 .TP
1722 .IR /proc/cgroups " (since Linux 2.6.24)"
1723 This file contains information about the controllers
1724 that are compiled into the kernel.
1725 An example of the contents of this file (reformatted for readability)
1726 is the following:
1727 .IP
1728 .in +4n
1729 .EX
1730 #subsys_name    hierarchy      num_cgroups    enabled
1731 cpuset          4              1              1
1732 cpu             8              1              1
1733 cpuacct         8              1              1
1734 blkio           6              1              1
1735 memory          3              1              1
1736 devices         10             84             1
1737 freezer         7              1              1
1738 net_cls         9              1              1
1739 perf_event      5              1              1
1740 net_prio        9              1              1
1741 hugetlb         0              1              0
1742 pids            2              1              1
1743 .EE
1744 .in
1745 .IP
1746 The fields in this file are, from left to right:
1747 .RS
1748 .IP 1. 3
1749 The name of the controller.
1750 .IP 2.
1751 The unique ID of the cgroup hierarchy on which this controller is mounted.
1752 If multiple cgroups v1 controllers are bound to the same hierarchy,
1753 then each will show the same hierarchy ID in this field.
1754 The value in this field will be 0 if:
1755 .RS 5
1756 .IP a) 3
1757 the controller is not mounted on a cgroups v1 hierarchy;
1758 .IP b)
1759 the controller is bound to the cgroups v2 single unified hierarchy; or
1760 .IP c)
1761 the controller is disabled (see below).
1762 .RE
1763 .IP 3.
1764 The number of control groups in this hierarchy using this controller.
1765 .IP 4.
1766 This field contains the value 1 if this controller is enabled,
1767 or 0 if it has been disabled (via the
1768 .I cgroup_disable
1769 kernel command-line boot parameter).
1770 .RE
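.IP
The four fields described above lend themselves to mechanical parsing.
The following Python fragment is a minimal sketch rather than a
definitive implementation;
the names
.IR Controller ,
.IR parse_proc_cgroups (),
and
.IR co_mounted ()
are hypothetical and were chosen for this illustration only.
It assumes whitespace-separated fields and a header line beginning with "#",
as in the example shown above.
.IP
.in +4n
.EX
```python
from typing import NamedTuple


class Controller(NamedTuple):
    name: str          # field 1: controller name
    hierarchy: int     # field 2: hierarchy ID (0 has several meanings)
    num_cgroups: int   # field 3: cgroups in the hierarchy using it
    enabled: bool      # field 4: False if disabled via cgroup_disable


def parse_proc_cgroups(text):
    """Parse the contents of /proc/cgroups into Controller tuples."""
    controllers = []
    for line in text.splitlines():
        # Skip blank lines and the "#subsys_name ..." header line.
        if not line.strip() or line.startswith("#"):
            continue
        name, hierarchy, num_cgroups, enabled = line.split()
        controllers.append(
            Controller(name, int(hierarchy), int(num_cgroups), enabled == "1")
        )
    return controllers


def co_mounted(controllers):
    """Map each nonzero hierarchy ID shared by several controllers
    to the names of those controllers."""
    groups = {}
    for c in controllers:
        if c.hierarchy != 0:
            groups.setdefault(c.hierarchy, []).append(c.name)
    return {hid: names for hid, names in groups.items() if len(names) > 1}
```
.EE
.in
.IP
Applied to the example contents shown above,
.IR co_mounted ()
would report hierarchy 8
.RI ( cpu ", " cpuacct )
and hierarchy 9
.RI ( net_cls ", " net_prio ).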
1771 .TP
1772 .IR /proc/[pid]/cgroup " (since Linux 2.6.24)"
1773 This file describes the control groups to which the process
1774 with the corresponding PID belongs.
1775 The displayed information differs for
1776 cgroups version 1 and version 2 hierarchies.
1777 .IP
1778 For each cgroup hierarchy of which the process is a member,
1779 there is one entry containing three colon-separated fields:
1780 .IP
1781 .in +4n
1782 .EX
1783 hierarchy-ID:controller-list:cgroup-path
1784 .EE
1785 .in
1786 .IP
1787 For example:
1788 .IP
1789 .in +4n
1790 .EX
1791 5:cpuacct,cpu,cpuset:/daemons
1792 .EE
1793 .in
1794 .IP
1795 The colon-separated fields are, from left to right:
1796 .RS
1797 .IP 1. 3
1798 For cgroups version 1 hierarchies,
1799 this field contains a unique hierarchy ID number
1800 that can be matched to a hierarchy ID in
1801 .IR /proc/cgroups .
1802 For the cgroups version 2 hierarchy, this field contains the value 0.
1803 .IP 2.
1804 For cgroups version 1 hierarchies,
1805 this field contains a comma-separated list of the controllers
1806 bound to the hierarchy.
1807 For the cgroups version 2 hierarchy, this field is empty.
1808 .IP 3.
1809 This field contains the pathname of the control group in the hierarchy
1810 to which the process belongs.
1811 This pathname is relative to the mount point of the hierarchy.
1812 .RE
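.IP
The three-field format described above can be split as follows.
This Python fragment is a minimal sketch under stated assumptions,
not a definitive implementation;
the names
.IR Membership ,
.IR parse_pid_cgroup (),
and
.IR v2_path ()
are hypothetical.
It splits on the first two colons only,
on the assumption that a cgroup pathname may itself contain a colon.
.IP
.in +4n
.EX
```python
from typing import NamedTuple


class Membership(NamedTuple):
    hierarchy_id: int   # 0 for the cgroups version 2 hierarchy
    controllers: tuple  # empty for the cgroups version 2 hierarchy
    path: str           # relative to the hierarchy's mount point


def parse_pid_cgroup(text):
    """Parse the contents of a /proc/[pid]/cgroup file."""
    memberships = []
    for line in text.splitlines():
        if not line:
            continue
        # Split on the first two colons only, keeping the path whole.
        hierarchy_id, controller_list, path = line.split(":", 2)
        if controller_list:
            controllers = tuple(controller_list.split(","))
        else:
            controllers = ()
        memberships.append(Membership(int(hierarchy_id), controllers, path))
    return memberships


def v2_path(memberships):
    """Return the cgroups v2 path, or None if there is no v2 entry."""
    for m in memberships:
        if m.hierarchy_id == 0 and not m.controllers:
            return m.path
    return None
```
.EE
.in
.IP
For the example line shown above,
.IR parse_pid_cgroup ()
yields hierarchy ID 5, the controllers
.IR cpuacct ,
.IR cpu ,
and
.IR cpuset ,
and the path
.IR /daemons .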
1813 .\"
1814 .SS /sys/kernel/cgroup files
1815 .TP
1816 .IR /sys/kernel/cgroup/delegate " (since Linux 4.15)"
1817 .\" commit 01ee6cfb1483fe57c9cbd8e73817dfbf9bacffd3
1818 This file exports a list of the cgroups v2 files
1819 (one per line) that are delegatable
1820 (i.e., whose ownership should be changed to the user ID of the delegatee).
1821 In the future, the set of delegatable files may change or grow,
1822 and this file provides a way for the kernel to inform
1823 user-space applications of which files must be delegated.
1824 As of Linux 4.15, one sees the following when inspecting this file:
1825 .IP
1826 .in +4n
1827 .EX
1828 $ \fBcat /sys/kernel/cgroup/delegate\fP
1829 cgroup.procs
1830 cgroup.subtree_control
1831 cgroup.threads
1832 .EE
1833 .in
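.IP
A delegater could use this list instead of hard-coding file names.
The following Python fragment is a minimal sketch, not a definitive
implementation;
the names
.IR delegatable_files ()
and
.IR delegate_cgroup ()
are hypothetical.
Changing the ownership of the cgroup directory itself,
and skipping listed files that are absent from the directory,
are assumptions of this sketch.
.IP
.in +4n
.EX
```python
import os


def delegatable_files(delegate_text):
    """Return the file names listed, one per line, in the delegate file."""
    return [ln.strip() for ln in delegate_text.splitlines() if ln.strip()]


def delegate_cgroup(cgroup_dir, uid, names):
    """Give uid ownership of cgroup_dir and of each listed file in it.

    Returns the paths whose ownership was set.
    """
    # Assumption: the directory itself is handed over as well.
    # A group ID of -1 leaves the group unchanged.
    os.chown(cgroup_dir, uid, -1)
    changed = [cgroup_dir]
    for name in names:
        path = os.path.join(cgroup_dir, name)
        # Skip names that are not present in this directory.
        if os.path.exists(path):
            os.chown(path, uid, -1)
            changed.append(path)
    return changed
```
.EE
.in
.IP
A caller would read
.I /sys/kernel/cgroup/delegate
once, pass its contents to
.IR delegatable_files (),
and then call
.IR delegate_cgroup ()
for the cgroup being delegated.
Changing ownership to another user normally requires privilege.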
1834 .TP
1835 .IR /sys/kernel/cgroup/features " (since Linux 4.15)"
1836 .\" commit 5f2e673405b742be64e7c3604ed4ed3ac14f35ce
1837 Over time, the set of cgroups v2 features that are provided by the
1838 kernel may change or grow,
1839 or some features may not be enabled by default.
1840 This file provides a way for user-space applications to discover what
1841 features the running kernel supports and has enabled.
1842 Features are listed one per line:
1843 .IP
1844 .in +4n
1845 .EX
1846 $ \fBcat /sys/kernel/cgroup/features\fP
1847 nsdelegate
1848 memory_localevents
1849 .EE
1850 .in
1851 .IP
1852 The entries that can appear in this file are:
1853 .RS
1854 .TP
1855 .IR memory_localevents " (since Linux 5.2)"
1856 The kernel supports the
1857 .I memory_localevents
1858 mount option.
1859 .TP
1860 .IR nsdelegate " (since Linux 4.15)"
1861 The kernel supports the
1862 .I nsdelegate
1863 mount option.
1864 .RE
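.IP
An application can consult this file before choosing mount options.
The following Python fragment is a minimal sketch, not a definitive
implementation;
the names
.IR read_features ()
and
.IR mount_options ()
are hypothetical.
Treating a missing file as "no features" is an assumption of this sketch,
intended to cover kernels that predate the file.
.IP
.in +4n
.EX
```python
def read_features(path="/sys/kernel/cgroup/features"):
    """Return the set of cgroups v2 features listed in the file at path.

    An empty set is returned if the file does not exist.
    """
    try:
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        return set()


def mount_options(features, wanted=("nsdelegate", "memory_localevents")):
    """Join the wanted mount options that appear in features."""
    return ",".join(opt for opt in wanted if opt in features)
```
.EE
.in
.IP
With the example contents shown above,
.IR mount_options ()
returns the string "nsdelegate,memory_localevents".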
1865 .SH SEE ALSO
1866 .BR prlimit (1),
1867 .BR systemd (1),
1868 .BR systemd-cgls (1),
1869 .BR systemd-cgtop (1),
1870 .BR clone (2),
1871 .BR ioprio_set (2),
1872 .BR perf_event_open (2),
1873 .BR setrlimit (2),
1874 .BR cgroup_namespaces (7),
1875 .BR cpuset (7),
1876 .BR namespaces (7),
1877 .BR sched (7),
1878 .BR user_namespaces (7)
1879 .PP
1880 The kernel source file
1881 .IR Documentation/admin-guide/cgroup-v2.rst .