From: Christian Brauner
Date: Thu, 28 May 2026 12:33:32 +0000 (+0200)
Subject: Merge patch series "fs/pipe: reduce pipe->mutex contention by pre-allocating outside...
X-Git-Url: http://git.ipfire.org/gitweb/?a=commitdiff_plain;h=99f414273beda01a86c2fb66c2155da61335fa59;p=thirdparty%2Fkernel%2Flinux.git

Merge patch series "fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock"

Breno Leitao says:

While profiling Meta's caching code[1], I found pipe->mutex contention
on the hot path. anon_pipe_write() currently calls alloc_page() once per
page while holding pipe->mutex. The allocation can sleep doing direct
reclaim and runs memcg charging, which extends the critical section and
stalls any concurrent reader on the same mutex.

This series pre-allocates pages outside pipe->mutex in
anon_pipe_write(): for writes that span more than one full page, up to
PIPE_PREALLOC_MAX (8) pages are allocated via a per-page alloc_page()
loop before the mutex is taken. anon_pipe_get_page() then drains the
prealloc array first, falls back to the per-pipe tmp_page[] cache, and
only enters the allocator under the mutex for the leftover pages (writes
larger than PIPE_PREALLOC_MAX, single-page writes that skip prealloc, or
shortfalls when the prealloc loop fails). Leftover prealloc pages are
recycled into tmp_page[] before unlock and any remainder is put_page()'d
after unlock, keeping the allocator out of the critical section on both
sides.

alloc_pages_bulk_mempolicy() looked tempting but the bulk allocator
refuses __GFP_ACCOUNT under memcg -- it returns at most one page when
memcg_kmem_online() && (gfp & __GFP_ACCOUNT), see commit 8dcb3060d81d
("memcg: page_alloc: skip bulk allocator for __GFP_ACCOUNT"). A per-page
loop keeps memcg accounting and the task NUMA mempolicy honoured
uniformly without open-coding the charge.

I also vibe-coded a microbenchmark to validate the change.
It sweeps writers x readers over {1,2,5} x {1,5,10} with 64KB writes
against a 1 MB pipe and prints throughput + latency percentiles per
config.

Measured on arm64 and also on x86 using virtme-ng (16 vCPUs, 64KB
writes, 1 MB pipe). The numbers below were collected on v1
(alloc_pages_bulk()); v2's per-page loop preserves the dominant
"allocation outside the mutex" win and is expected to land in the same
range.

== No memory pressure (10s per config) ==

Throughput in MB/s (baseline -> patched, delta):

  writers  readers=1              readers=5              readers=10
  1        1119 -> 1354 (+21%)    1132 -> 1195  (+6%)    1060 -> 1240 (+17%)
  2        1162 -> 1487 (+28%)    1034 -> 1285 (+24%)    1069 -> 1213 (+14%)
  5        1152 -> 1357 (+18%)    1021 -> 1164 (+14%)     997 -> 1239 (+24%)

Avg write latency in ns (baseline -> patched, delta):

  writers  readers=1                 readers=5                 readers=10
  1         55786 ->  46103 (-17%)    55164 ->  52260  (-5%)    58906 ->  50370 (-14%)
  2        107546 ->  84011 (-22%)   120837 ->  97206 (-20%)   116860 -> 103036 (-12%)
  5        271293 -> 230170 (-15%)   306089 -> 268429 (-12%)   313300 -> 252232 (-19%)

Throughput improves +6% to +28% and average write latency drops 5% to
22% across every configuration.

== Under memory pressure (--memory-pressure, 6s per config) ==

stress-ng --vm 2 --vm-bytes 50% --vm-keep is forked alongside the sweep
so the alloc_page() calls inside anon_pipe_write() routinely hit direct
reclaim -- exactly the regime the patch targets.

Throughput in MB/s (baseline -> patched, delta):

  writers  readers=1              readers=5              readers=10
  1        1088 -> 1438 (+32%)     996 -> 1477 (+48%)     989 -> 1194 (+21%)
  2        1076 -> 1378 (+28%)    1007 -> 1269 (+26%)    1018 -> 1234 (+21%)
  5        1052 -> 1311 (+25%)     986 -> 1225 (+24%)     972 -> 1249 (+29%)

Avg write latency in ns (baseline -> patched, delta):

  writers  readers=1                 readers=5                 readers=10
  1         57397 ->  43406 (-24%)    62690 ->  42272 (-33%)    63136 ->  52272 (-17%)
  2        116121 ->  90700 (-22%)   124098 ->  98481 (-21%)   122754 -> 101217 (-18%)
  5        297122 -> 238322 (-20%)   316836 -> 255095 (-19%)   321496 -> 250189 (-22%)

Throughput improves +21% to +48% and average write latency drops 17% to
33% -- a noticeably bigger win than the no-pressure run. That tracks:
when alloc_page() has to dip into reclaim, the cost of holding
pipe->mutex across it is highest, and pulling the allocation out of the
critical section pays the most.

* patches from https://patch.msgid.link/20260524-fix_pipe-v3-0-bb4a75d23a90@debian.org:
  selftests/pipe: add pipe_bench microbenchmark
  fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write

Link: https://www.usenix.org/system/files/conference/atc13/atc13-bronson.pdf [1]
Link: https://patch.msgid.link/20260524-fix_pipe-v3-0-bb4a75d23a90@debian.org
Signed-off-by: Christian Brauner
---

99f414273beda01a86c2fb66c2155da61335fa59
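
The locking pattern the cover letter describes (allocate before taking
the mutex, drain the prealloc array under it, recycle leftovers into the
per-pipe cache, free the rest after unlock) can be illustrated with a
small userspace analogue. This is only a sketch and not the kernel code:
struct pipe_sim, pipe_sim_init(), pipe_sim_destroy(), get_page(),
write_pages(), TMP_SLOTS, PAGE_BYTES and the room counter are all
hypothetical names invented for illustration, malloc()/free() stand in
for alloc_page()/put_page(), and a pthread mutex stands in for
pipe->mutex. Only the prealloc cap of 8 is taken from the text above.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define PREALLOC_MAX 8    /* value quoted for PIPE_PREALLOC_MAX above */
#define TMP_SLOTS    2    /* assumption: size of the per-pipe spare cache */
#define PAGE_BYTES   4096 /* assumption: page size */

/* Hypothetical userspace stand-in for a pipe; not the kernel struct. */
struct pipe_sim {
    pthread_mutex_t mutex;
    void *tmp_page[TMP_SLOTS];  /* spare pages cached on the pipe */
    size_t room;                /* free slots left in the pipe */
    size_t allocs_under_lock;   /* allocations made while holding mutex */
};

static void pipe_sim_init(struct pipe_sim *p, size_t room)
{
    pthread_mutex_init(&p->mutex, NULL);
    for (size_t i = 0; i < TMP_SLOTS; i++)
        p->tmp_page[i] = NULL;
    p->room = room;
    p->allocs_under_lock = 0;
}

static void pipe_sim_destroy(struct pipe_sim *p)
{
    for (size_t i = 0; i < TMP_SLOTS; i++)
        free(p->tmp_page[i]);
    pthread_mutex_destroy(&p->mutex);
}

/* Called with the mutex held: prealloc array first, then the per-pipe
 * cache, and only as a last resort the allocator itself. */
static void *get_page(struct pipe_sim *p, void **pre, size_t *npre)
{
    if (*npre > 0)
        return pre[--(*npre)];
    for (size_t i = 0; i < TMP_SLOTS; i++) {
        if (p->tmp_page[i]) {
            void *pg = p->tmp_page[i];
            p->tmp_page[i] = NULL;
            return pg;
        }
    }
    p->allocs_under_lock++;
    return malloc(PAGE_BYTES);
}

/* Try to write 'want' pages; returns how many were accepted. */
static size_t write_pages(struct pipe_sim *p, size_t want)
{
    void *pre[PREALLOC_MAX];
    size_t npre = 0, done = 0;

    assert(p != NULL);

    /* Multi-page writes only: allocate before taking the mutex. */
    if (want > 1) {
        size_t n = want < PREALLOC_MAX ? want : PREALLOC_MAX;
        while (npre < n) {
            void *pg = malloc(PAGE_BYTES);
            if (!pg)
                break;          /* shortfall is made up under the lock */
            pre[npre++] = pg;
        }
    }

    pthread_mutex_lock(&p->mutex);
    while (done < want && p->room > 0) {
        void *pg = get_page(p, pre, &npre);
        if (!pg)
            break;
        free(pg);               /* stand-in for handing the page to a reader */
        p->room--;
        done++;
    }
    /* Recycle unused prealloc pages into the cache before unlocking. */
    for (size_t i = 0; i < TMP_SLOTS && npre > 0; i++)
        if (!p->tmp_page[i])
            p->tmp_page[i] = pre[--npre];
    pthread_mutex_unlock(&p->mutex);

    /* Whatever is still left is released outside the critical section. */
    while (npre > 0)
        free(pre[--npre]);
    return done;
}
```

Built with something like "cc -pthread", a caller would run
pipe_sim_init(&p, 16) and then write_pages(&p, 4); in this sketch a
4-page write never touches the allocator under the mutex, while a
10-page write falls back to it for the two pages beyond the prealloc
cap.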