cgroup: Defer task cgroup unlink until after the task is done switching out
author Tejun Heo <tj@kernel.org>
Wed, 29 Oct 2025 06:19:17 +0000 (20:19 -1000)
committer Tejun Heo <tj@kernel.org>
Mon, 3 Nov 2025 21:46:18 +0000 (11:46 -1000)
commit d245698d727ab8f5420b3e28d1243f96a5234851
tree df5b6a4cdf01bd423c9c42b871b323a2b0105884
parent 260fbcb92bbeacfcd050410fdc2d24ab15044400

When a task exits, css_set_move_task(tsk, cset, NULL, false) unlinks the task
from its cgroup. From the cgroup's perspective, the task is now gone. If this
makes the cgroup empty, it can be removed, triggering ->css_offline() callbacks
that notify controllers the cgroup is going offline resource-wise.

However, the exiting task can still run, perform memory operations, and schedule
until the final context switch in finish_task_switch(). This creates a confusing
situation where controllers are told a cgroup is offline while resource
activities are still happening in it. While this hasn't broken existing
controllers, it has caused direct confusion for sched_ext schedulers.

Split cgroup_task_exit() into two functions. cgroup_task_exit() now only calls
the subsystem exit callbacks and continues to be called from do_exit(). The
css_set cleanup is moved to the new cgroup_task_dead() which is called from
finish_task_switch() after the final context switch, so that the cgroup only
appears empty after the task is truly done running.

This also reorders operations so that subsys->exit() is now called before
unlinking from the cgroup, which shouldn't break anything.
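
The resulting ordering can be illustrated with a small userspace model. This is only a sketch and not the kernel code: every `model_*` name and both structs are hypothetical stand-ins invented for illustration, and the real functions operate on `task_struct` and `css_set` under the appropriate locks. It assumes a cgroup goes offline only when removal is attempted on an empty cgroup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the kernel's data structures. */
struct model_cgroup {
	int nr_tasks;   /* tasks still linked to this cgroup */
	bool offline;   /* set once an empty cgroup has been removed */
};

struct model_task {
	struct model_cgroup *cgrp;  /* NULL once unlinked */
	bool exit_callbacks_done;   /* subsystem exit callbacks have run */
};

/*
 * Phase 1, modelled on the new cgroup_task_exit(): reached from the
 * do_exit() path. Only the subsystem exit callbacks run; the task
 * stays linked, so the cgroup still counts as populated.
 */
static void model_cgroup_task_exit(struct model_task *t)
{
	t->exit_callbacks_done = true;
}

/*
 * Phase 2, modelled on cgroup_task_dead(): reached after the final
 * context switch. Only now is the task unlinked from its cgroup.
 */
static void model_cgroup_task_dead(struct model_task *t)
{
	/* The exit callbacks run before the unlink in the new ordering. */
	assert(t->exit_callbacks_done);
	t->cgrp->nr_tasks--;
	t->cgrp = NULL;
}

/* Removal succeeds, and takes the cgroup offline, only when it is empty. */
static bool model_cgroup_try_remove(struct model_cgroup *cg)
{
	if (cg->nr_tasks > 0)
		return false;
	cg->offline = true;
	return true;
}

/* Walks one task through both phases; returns 0 if the ordering holds. */
static int model_demo(void)
{
	struct model_cgroup cg = { .nr_tasks = 1, .offline = false };
	struct model_task t = { .cgrp = &cg, .exit_callbacks_done = false };

	model_cgroup_task_exit(&t);
	/* The task may still run here, so removal must be refused. */
	if (model_cgroup_try_remove(&cg))
		return -1;

	model_cgroup_task_dead(&t);
	/* The task is done running; the cgroup may now go offline. */
	if (!model_cgroup_try_remove(&cg))
		return -2;
	return 0;
}
```

In this model, a removal attempt between the two phases fails because the exiting task is still counted, which mirrors the commit's goal that controllers are not told a cgroup is offline while the task can still generate activity in it.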

Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
include/linux/cgroup.h
kernel/cgroup/cgroup.c
kernel/sched/core.c