From: Christian Brauner <brauner@kernel.org>
Date: Fri, 22 May 2026 10:06:42 +0000 (+0200)
Subject: Merge patch series "writeback: fix race between cgroup_writeback_umount() and inode_s... 
X-Git-Url: http://git.ipfire.org/gitweb/?a=commitdiff_plain;h=de5fefadeff3cba9d4df1b0d6fe0518281bbccec;p=thirdparty%2Fkernel%2Flinux.git

Merge patch series "writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs()"

Baokun Li <libaokun@linux.alibaba.com> says:

When a container exits, a race between cgroup_writeback_umount() and
inode_switch_wbs() / cleanup_offline_cgwb() can trigger
"VFS: Busy inodes after unmount" followed by a use-after-free on
percpu counters.

There is a window between inode_prepare_wbs_switch() returning true
(having passed the SB_ACTIVE check and grabbed the inode) and the
subsequent wb_queue_isw() call.  If cgroup_writeback_umount() observes
the global isw_nr_in_flight counter as non-zero but flush_workqueue()
finds nothing queued, it returns early while the switcher still holds
its inode reference.  That reference blocks evict_inodes(), and the
later iput() then hits percpu counters that have already been freed.

Patch 1 closes the race by extending the RCU read-side critical
section to cover the window from inode_prepare_wbs_switch() through
wb_queue_isw(), and adding synchronize_rcu() in the umount path so
that all in-flight switchers complete queueing before
flush_workqueue() runs.  rcu_barrier() is intentionally retained so
the same hunk applies cleanly to stable trees that still queue
switches via queue_rcu_work().

Patch 2 removes the now-dead rcu_barrier() that was left over from
the queue_rcu_work() era (replaced by plain queue_work() in commit
e1b849cfa6b6 "writeback: Avoid contention on wb->list_lock when
switching inodes").  This is mainline-only.

Patch 3 replaces the global synchronize_rcu()/flush_workqueue() pair
with a per-sb counter (s_isw_nr_in_flight) plus three small helpers
(cgroup_writeback_pin / cgroup_writeback_unpin /
cgroup_writeback_drain), eliminating the global serialization
penalty.  This also reverts the RCU extension from patch 1 since the
per-sb counter makes it unnecessary.
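
The pin / unpin / drain pattern described for patch 3 can be sketched
in userspace as follows.  This is only an analogy, not the kernel
code: the SuperBlock class, its "active" flag (standing in for
SB_ACTIVE) and the condition variable (standing in for
wait_var_event()) are assumptions made for illustration, and the model
clears the flag inside drain() for brevity.

```python
import threading


class SuperBlock:
    """Userspace stand-in for a superblock with a per-sb in-flight counter."""

    def __init__(self):
        self.active = True           # stands in for SB_ACTIVE (assumption)
        self.isw_nr_in_flight = 0    # stands in for s_isw_nr_in_flight
        self._cond = threading.Condition()

    def pin(self):
        """Register one in-flight switch; refuse once the sb is shutting down."""
        with self._cond:
            if not self.active:
                return False
            self.isw_nr_in_flight += 1
            return True

    def unpin(self):
        """Drop one in-flight switch and wake any waiting drain()."""
        with self._cond:
            self.isw_nr_in_flight -= 1
            if self.isw_nr_in_flight == 0:
                self._cond.notify_all()

    def drain(self):
        """Umount side: stop new pins, then wait for this sb's switches only."""
        with self._cond:
            self.active = False
            self._cond.wait_for(lambda: self.isw_nr_in_flight == 0)


def demo():
    busy, idle = SuperBlock(), SuperBlock()
    assert busy.pin()                 # one switch is in flight on `busy`
    idle.drain()                      # returns at once: nothing pinned on `idle`
    worker = threading.Timer(0.05, busy.unpin)
    worker.start()
    busy.drain()                      # blocks until the worker unpins
    worker.join()
    return busy.isw_nr_in_flight, busy.pin()
```

In demo(), draining the idle sb never waits on the busy one, which is
the property the per-sb counter is meant to provide; after the drain,
further pin() calls on the busy sb are refused.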

Performance
-----------

Measured on a 16 vCPU QEMU guest; all kernels share the same .config.
Background load: 4 ext4 superblocks, each running:

  while :; do
      mkdir /sys/fs/cgroup/<tag>-tmp$N
      ( echo $BASHPID > <tag>-tmp$N/cgroup.procs
        dd if=/dev/zero of=$mp/burner bs=4k count=256 conv=notrunc \
           oflag=sync )
      rmdir /sys/fs/cgroup/<tag>-tmp$N
  done

This drives both inode_switch_wbs() (different cgroups writing the
same inode) and cleanup_offline_cgwb() (dying memcgs), keeping the
global isw_nr_in_flight non-zero throughout the run.  Latencies are
wall-clock around umount(8) on a separate target sb; only the target
sb's umount is measured.

Four kernels are compared, one for each step of the series:

  base       pre-fix mainline
  +race      base + patch 1 (race fix, keeps rcu_barrier)
  +rmbarrier +race + patch 2 (drop rcu_barrier)
  +persb     +rmbarrier + patch 3 (per-sb counter)

Target sb runs its own cgwb churn:

                p50      p95      p99      max
  base         99.7 ms 112.9 ms 112.9 ms 127.2 ms
  +race       110.2 ms 153.8 ms 153.8 ms 160.4 ms
  +rmbarrier   67.6 ms  88.3 ms  88.3 ms  96.8 ms
  +persb        7.9 ms  10.0 ms  10.0 ms  10.1 ms

Idle target umount under cross-sb cgwb-switch pressure:

                p50      p95      p99      max
  base         92.0 ms 123.5 ms 136.5 ms 141.3 ms
  +race       118.8 ms 154.6 ms 164.7 ms 165.3 ms
  +rmbarrier   62.7 ms  95.4 ms 108.1 ms 108.6 ms
  +persb        5.3 ms   6.9 ms   7.4 ms   7.4 ms

8 concurrent umounts of idle sbs under the same pressure:

                p50      p95      p99      max
  base        137.5 ms 166.9 ms 166.9 ms 171.3 ms
  +race       162.2 ms 183.9 ms 183.9 ms 217.0 ms
  +rmbarrier   61.3 ms  99.5 ms  99.5 ms 113.7 ms
  +persb        8.1 ms   9.1 ms   9.1 ms   9.5 ms

A no-pressure baseline run (no background load) measures ~5 ms p50
on all four kernels, which indicates that the measurement setup itself
does not favour any of them.

In-kernel cgroup_writeback_umount() cumulative cost across the same
run (bpftrace, ~340 calls covering all four scenarios):

                                cgroup_writeback_umount() time
  base                          21240 ms total  (~62 ms / call)
  +race      (rcu_barrier+sync) 24966 ms total  (~73 ms / call)
  +rmbarrier (synchronize_rcu)  12371 ms total  (~36 ms / call)
  +persb     (per-sb counter)    1.37 ms total  ( ~4 us / call)

Under +persb the wait_var_event() condition is true on entry
whenever the target sb has nothing in flight, so synchronize_rcu()
and flush_workqueue() are never called on this path.
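
The per-call figures in the bpftrace table can be re-derived from the
totals.  The sketch below assumes exactly 340 calls, whereas the text
only says "~340", so the results are rounded estimates rather than
measured values.

```python
# Totals copied from the bpftrace table; the call count is the approximate
# "~340" quoted in the text, so the results are rounded estimates.
CALLS = 340

total_ms = {
    "base": 21240.0,
    "+race": 24966.0,
    "+rmbarrier": 12371.0,
    "+persb": 1.37,
}


def per_call_us(kernel):
    """Average microseconds per cgroup_writeback_umount() call."""
    return total_ms[kernel] * 1000.0 / CALLS


for kernel in total_ms:
    print(f"{kernel:<11} {per_call_us(kernel):10.1f} us / call")
```

Rounded, this reproduces the ~62 ms, ~73 ms, ~36 ms and ~4 us per-call
values quoted in the table.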

Notes:

  - Patch 1 adds ~10-27 ms p50 over base by introducing
    synchronize_rcu().  This is the cost of closing the race
    correctly and is paid by stable backports as well.
  - Patch 2 ("drop rcu_barrier()") was expected to be a pure cleanup
    on mainline, but actually removes a real wait: rcu_barrier()
    drains call_rcu() callbacks from *all* subsystems, and the
    cgroup teardown path keeps that pipeline busy under this
    workload.  Removing it cuts ~43-101 ms p50 on top of patch 1.
  - Patch 3 (per-sb counter) replaces the global wait entirely; the
    target sb no longer waits for activity on unrelated sbs,
    recovering near-baseline latency in all three scenarios.
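
The p50 deltas quoted in the notes can be checked against the three
latency tables.  In the sketch below the scenario names are shorthand
introduced here for illustration; the numbers are copied from the
tables.

```python
# p50 latencies in ms, copied from the three tables above.
# Scenario names are shorthand introduced here, not taken from the series.
p50 = {
    "own churn":    {"base": 99.7,  "+race": 110.2, "+rmbarrier": 67.6, "+persb": 7.9},
    "idle target":  {"base": 92.0,  "+race": 118.8, "+rmbarrier": 62.7, "+persb": 5.3},
    "8 concurrent": {"base": 137.5, "+race": 162.2, "+rmbarrier": 61.3, "+persb": 8.1},
}


def diff(scenario, a, b):
    """p50 of kernel `a` minus p50 of kernel `b`, in ms, to one decimal."""
    row = p50[scenario]
    return round(row[a] - row[b], 1)


# Extra latency introduced by patch 1 (synchronize_rcu added).
patch1_cost = [diff(s, "+race", "base") for s in p50]

# Latency removed by patch 2 (rcu_barrier dropped) on top of patch 1.
patch2_saving = [diff(s, "+race", "+rmbarrier") for s in p50]

print("patch 1 adds (ms):   ", patch1_cost)
print("patch 2 removes (ms):", patch2_saving)
```

The resulting ranges match the "~10-27 ms" and "~43-101 ms" figures in
the notes.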

* patches from https://patch.msgid.link/20260521095016.2791354-1-libaokun@linux.alibaba.com:
  writeback: use a per-sb counter to drain inode wb switches at umount
  writeback: drop now-unnecessary rcu_barrier() in cgroup_writeback_umount()
  writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs()

Link: https://patch.msgid.link/20260521095016.2791354-1-libaokun@linux.alibaba.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
---

de5fefadeff3cba9d4df1b0d6fe0518281bbccec