process_huge_pages(), used to clear hugepages, is optimized for cache
locality. In particular, it processes a hugepage in 4KB page units and in
a difficult-to-predict order: clearing pages in the periphery in a
backwards or forwards direction, then converging inwards on the faulting
page (or the page specified via base_addr).
This helps maximize temporal locality at the time of access. However,
while stores within each 4KB page remain sequential, the pages themselves
are ordered semi-randomly in a way that is not easy for the processor to
predict. This limits the clearing bandwidth to what can be extracted
within a single 4KB page.
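(For illustration only, and not the actual mm/memory.c code: a minimal
sketch of the kind of converging clearing order described above, with the
target page cleared last so that its cachelines stay hot. The helper name
and details below are hypothetical.)

  /*
   * Illustrative sketch only: clear 4KB pages from both ends of the
   * hugepage, alternating sides and converging on the target page,
   * which is cleared last.
   */
  static void clear_converging_sketch(struct page *page, unsigned long base_addr,
                                      unsigned int npages, unsigned int target)
  {
          unsigned int l = 0, r = npages - 1;

          while (l < target || r > target) {
                  if (l < target) {       /* left periphery, moving right */
                          clear_user_highpage(page + l, base_addr + l * PAGE_SIZE);
                          l++;
                  }
                  if (r > target) {       /* right periphery, moving left */
                          clear_user_highpage(page + r, base_addr + r * PAGE_SIZE);
                          r--;
                  }
          }
          /* The faulting/base page goes last. */
          clear_user_highpage(page + target, base_addr + target * PAGE_SIZE);
  }
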
Consider the baseline bandwidth:
$ perf bench mem mmap -p 2MB -f populate -s 64GB -l 3
# Running 'mem/mmap' benchmark:
# function 'populate' (Eagerly populated mmap())
# Copying 64GB bytes ...
11.791097 GB/sec
(Unless otherwise noted, all numbers are on AMD Genoa (EPYC 9J13);
region-size=64GB, local node; 2.56 GHz, boost=0.)
11.79 GBps amounts to around 323ns per 4KB page (4096 bytes /
(11.79 * 2^30 bytes/sec) ~= 323ns). With a memory access latency of
~100ns, that doesn't leave much room for help from, say, hardware
prefetchers.
(Note that since this is a purely write workload, it's reasonable
to assume that the processor does not need to prefetch any cachelines.
However, for a processor to skip the prefetch, it would need to look
at the access pattern and see that full cachelines were being written.
This might be easily visible if clear_page() were using, say, x86 string
instructions; less so if it were using a store loop. In any case, the
existence of these kinds of predictors, or of appropriately helpful
threshold values, is implementation specific.
Additionally, even when the processor can skip the prefetch, coherence
protocols still need to establish exclusive ownership, which necessitates
communication with remote caches.)
With that, the change is quite straightforward. Instead of clearing
pages discontiguously, clear them contiguously: switch to a loop around
clear_user_highpage().
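(Again purely illustrative, not the patch itself: the contiguous path
boils down to something like the hypothetical loop below; the actual
clear_contig_highpages() in this series may differ in naming, batching
and details such as cond_resched() handling.)

  static void clear_contig_highpages_sketch(struct page *page,
                                            unsigned long addr,
                                            unsigned int npages)
  {
          unsigned int i;

          /* Clear 4KB pages in address order, one after the other. */
          for (i = 0; i < npages; i++)
                  clear_user_highpage(page + i, addr + i * PAGE_SIZE);
  }
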
Performance
==
Testing a demand fault workload shows a decent improvement in bandwidth
with pg-sz=2MB. Performance of pg-sz=1GB does not change because it has
always used straight clearing.
$ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
                 discontiguous-pages     contiguous-pages         change
                     (baseline)
                 (GBps  +-  %stdev)      (GBps  +-  %stdev)

   pg-sz=2MB      11.76  +-  1.10%        23.58  +-  1.95%      +100.51%
   pg-sz=1GB      24.85  +-  2.41%        25.40  +-  1.33%           -
Analysis (pg-sz=2MB)
==
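The counters below come from perf stat; the exact invocation isn't
reproduced in this changelog, but a command along these lines (with the
same events as quoted below) produces comparable output on this system:

  $ perf stat -r 5 \
        -e L1-dcache-loads,L1-dcache-load-misses,l2_pf_hit_l2.all \
        -e cache-references,cache-misses,branches,branch-misses -- \
        perf bench mem mmap -p 2MB -f demand -s 64GB -l 5
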
At the L1 data cache level, nothing changes. The processor continues to
access the same number of cachelines, allocating and missing them as it
writes to them.
  discontiguous-pages  7,394,341,051   L1-dcache-loads        #  445.172 M/sec                      ( +- 0.04% )  (35.73%)
                       3,292,247,227   L1-dcache-load-misses  #   44.52% of all L1-dcache accesses  ( +- 0.01% )  (35.73%)

  contiguous-pages     7,205,105,282   L1-dcache-loads        #  861.895 M/sec                      ( +- 0.02% )  (35.75%)
                       3,241,584,535   L1-dcache-load-misses  #   44.99% of all L1-dcache accesses  ( +- 0.00% )  (35.74%)
The L2 prefetcher, however, is now able to prefetch ~22% more cachelines
(the L2 prefetch miss rate also goes up significantly, showing that we
are backend limited):
  discontiguous-pages  2,835,860,245   l2_pf_hit_l2.all       #  170.242 M/sec                      ( +- 0.12% )  (15.65%)

  contiguous-pages     3,472,055,269   l2_pf_hit_l2.all       #  411.319 M/sec                      ( +- 0.62% )  (15.67%)
That still leaves a large gap between the ~22% improvement in prefetches
and the ~100% improvement in bandwidth, but better prefetching seems to
streamline the traffic well enough that most of the data now comes from
the L2, leading to substantially fewer cache-misses at the LLC:
  discontiguous-pages  8,493,499,137   cache-references       #  511.416 M/sec                      ( +- 0.15% )  (50.01%)
                         930,501,344   cache-misses           #   10.96% of all cache refs          ( +- 0.52% )  (50.01%)

  contiguous-pages     9,421,926,416   cache-references       #    1.120 G/sec                      ( +- 0.09% )  (50.02%)
                          68,787,247   cache-misses           #    0.73% of all cache refs          ( +- 0.15% )  (50.03%)
In addition, there are a few minor frontend optimizations: clear_pages()
on x86 is now fully inlined, so we don't have a CALL/RET pair (which
isn't free when using the RETHUNK speculative-execution mitigation, as we
do on my test system). The loop in clear_contig_highpages() is also
easier to predict (especially when handling faults) than the one in
process_huge_pages().
  discontiguous-pages    980,014,411   branches               #   59.005 M/sec                      (31.26%)
                         180,897,177   branch-misses          #   18.46% of all branches            (31.26%)

  contiguous-pages       515,630,550   branches               #   62.654 M/sec                      (31.27%)
                          78,039,496   branch-misses          #   15.13% of all branches            (31.28%)
Note that although clearing contiguously is easier for the processor to
optimize, it does not, sadly, mean that the processor will necessarily
take advantage of it. For instance, this change does not result in any
improvement in my tests on Intel Icelakex (Oracle X9), or on ARM64
Neoverse-N1 (Ampere Altra).
Link: https://lkml.kernel.org/r/20260107072009.1615991-7-ankur.a.arora@oracle.com
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>