git.ipfire.org Git - thirdparty/kernel/linux.git/commit

mm, swap: rename __read_swap_cache_async to swap_cache_alloc_folio

Patch series "mm, swap: swap table phase II: unify swapin use", v5.

This series removes the SWP_SYNCHRONOUS_IO swap cache bypass swapin code
and special swap flag bits including SWAP_HAS_CACHE, along with many
historical issues.  The performance is about ~20% better for some
workloads, like Redis with persistence.  This also cleans up the code to
prepare for later phases, some patches are from a previously posted
series.

Swap cache bypassing and swap synchronization in general had many issues.
Some are solved as workarounds, and some are still there [1].  To resolve
them in a clean way, one good solution is to always use swap cache as the
synchronization layer [2].  So we have to remove the swap cache bypass
swap-in path first.  It wasn't very doable due to performance issues, but
now combined with the swap table, removing the swap cache bypass path will
instead improve the performance, there is no reason to keep it.

Now we can rework the swap entry and cache synchronization following the
new design.  Swap cache synchronization was heavily relying on
SWAP_HAS_CACHE, which is the cause of many issues.  By dropping the usage
of special swap map bits and related workarounds, we get a cleaner code
base and prepare for merging the swap count into the swap table in the
next step.

And swap_map is now only used for swap count, so in the next phase,
swap_map can be merged into the swap table, which will clean up more
things and start to reduce the static memory usage.  Removal of
swap_cgroup_ctrl is also doable, but needs to be done after we also
simplify the allocation of swapin folios: always use the new
swap_cache_alloc_folio helper so the accounting will also be managed by
the swap layer by then.

Test results:

Redis / Valkey bench:
=====================

Testing on a ARM64 VM 1.5G memory:
Server: valkey-server --maxmemory 2560M
Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get

        no persistence              with BGSAVE
Before: 460475.84 RPS               311591.19 RPS
After:  451943.34 RPS (-1.9%)       371379.06 RPS (+19.2%)

Testing on a x86_64 VM with 4G memory (system components takes about 2G):
Server:
Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get

        no persistence              with BGSAVE
Before: 306044.38 RPS               102745.88 RPS
After:  309645.44 RPS (+1.2%)       125313.28 RPS (+22.0%)

The performance is a lot better when persistence is applied.  This should
apply to many other workloads that involve sharing memory and COW.  A
slight performance drop was observed for the ARM64 Redis test: We are
still using swap_map to track the swap count, which is causing redundant
cache and CPU overhead and is not very performance-friendly for some
arches.  This will be improved once we merge the swap map into the swap
table (as already demonstrated previously [3]).

vm-scabiity
===========
usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure,
simulated PMEM as swap), average result of 6 test run:

                           Before:         After:
System time:               282.22s         283.47s
Sum Throughput:            5677.35 MB/s    5688.78 MB/s
Single process Throughput: 176.41 MB/s     176.23 MB/s
Free latency:              518477.96 us    521488.06 us

Which is almost identical.

Build kernel test:
==================
Test using ZRAM as SWAP, make -j48, defconfig, on a x86_64 VM
with 4G RAM, under global pressure, avg of 32 test run:

                Before            After:
System time:    1379.91s          1364.22s (-0.11%)

Test using ZSWAP with NVME SWAP, make -j48, defconfig, on a x86_64 VM
with 4G RAM, under global pressure, avg of 32 test run:

                Before            After:
System time:    1822.52s          1803.33s (-0.11%)

Which is almost identical.

MySQL:
======
sysbench /usr/share/sysbench/oltp_read_only.lua --tables=16
--table-size=1000000 --threads=96 --time=600 (using ZRAM as SWAP, in a
512M memory cgroup, buffer pool set to 3G, 3 test run and 180s warm up).

Before: 318162.18 qps
After:  318512.01 qps (+0.01%)

In conclusion, the result is looking better or identical for most cases,
and it's especially better for workloads with swap count > 1 on SYNC_IO
devices, about ~20% gain in above test.  Next phases will start to merge
swap count into swap table and reduce memory usage.

One more gain here is that we now have better support for THP swapin.
Previously, the THP swapin was bound with swap cache bypassing, which only
works for single-mapped folios.  Removing the bypassing path also enabled
THP swapin for all folios.  The THP swapin is still limited to SYNC_IO
devices, the limitation can be removed later.

This may cause more serious THP thrashing for certain workloads, but
that's not an issue caused by this series, it's a common THP issue we
should resolve separately.

This patch (of 19):

__read_swap_cache_async is widely used to allocate and ensure a folio is
in swapcache, or get the folio if a folio is already there.

It's not async, and it's not doing any read.  Rename it to better present
its usage, and prepare to be reworked as part of new swap cache APIs.

Also, add some comments for the function.  Worth noting that the
skip_if_exists argument is an long existing workaround that will be
dropped soon.

Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-0-8862a265a033@tencent.com
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-1-8862a265a033@tencent.com
Link: https://lore.kernel.org/linux-mm/CAMgjq7D5qoFEK9Omvd5_Zqs6M+TEoG03+2i_mhuP5CQPSOPrmQ@mail.gmail.com/
Link: https://lore.kernel.org/linux-mm/20240326185032.72159-1-ryncsn@gmail.com/
Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Acked-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

author	Kairui Song <kasong@tencent.com>
	Fri, 19 Dec 2025 19:43:30 +0000 (03:43 +0800)
committer	Andrew Morton <akpm@linux-foundation.org>
	Sat, 31 Jan 2026 22:22:53 +0000 (14:22 -0800)
commit	d7cf0d54f21087dea8ec5901ffb482b0ab38a42d
tree	ae8d97415930fe91c61533c0ae5a102523b739f2	tree
parent	6ce964c02f1cb49b4dbb76507948c004d5a0b4fe	commit \| diff

mm/swap.h		diff \| blob \| blame \| history
mm/swap_state.c		diff \| blob \| blame \| history
mm/swapfile.c		diff \| blob \| blame \| history
mm/zswap.c		diff \| blob \| blame \| history