Patch series "mm, swap: swap table phase II: unify swapin use", v5.
This series removes the SWP_SYNCHRONOUS_IO swap cache bypass swapin code
and the special swap flag bits, including SWAP_HAS_CACHE, along with many
historical issues.  Performance is about 20% better for some workloads,
like Redis with persistence.  This also cleans up the code to prepare for
later phases; some patches are from a previously posted series.
Swap cache bypassing and swap synchronization in general had many issues.
Some are solved with workarounds, and some are still there [1].  To
resolve them in a clean way, one good solution is to always use the swap
cache as the synchronization layer [2], so we have to remove the swap
cache bypass swap-in path first.  That wasn't really doable before due to
performance issues, but now, combined with the swap table, removing the
swap cache bypass path instead improves performance, so there is no
reason to keep it.
Now we can rework the swap entry and cache synchronization following the
new design.  Swap cache synchronization relied heavily on SWAP_HAS_CACHE,
which is the cause of many issues.  By dropping the usage of special swap
map bits and related workarounds, we get a cleaner code base and prepare
for merging the swap count into the swap table in the next step.
swap_map is now only used for the swap count, so in the next phase
swap_map can be merged into the swap table, which will clean up more
things and start to reduce the static memory usage.  Removal of
swap_cgroup_ctrl is also doable, but it needs to be done after we also
simplify the allocation of swapin folios: by then swapin will always use
the new swap_cache_alloc_folio helper, so the accounting will also be
managed by the swap layer.
Test results:
Redis / Valkey bench:
=====================
Testing on an ARM64 VM with 1.5G memory:
Server: valkey-server --maxmemory 2560M
Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
              no persistence           with BGSAVE
Before:       460475.84 RPS            311591.19 RPS
After:        451943.34 RPS (-1.9%)    371379.06 RPS (+19.2%)
Testing on an x86_64 VM with 4G memory (system components take about 2G):
Server:
Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
              no persistence           with BGSAVE
Before:       306044.38 RPS            102745.88 RPS
After:        309645.44 RPS (+1.2%)    125313.28 RPS (+22.0%)
The performance is a lot better when persistence is applied.  This should
also apply to many other workloads that involve shared memory and COW.  A
slight performance drop was observed for the ARM64 Redis test: we are
still using swap_map to track the swap count, which causes redundant
cache and CPU overhead and is not very performance-friendly on some
arches.  This will be improved once we merge the swap map into the swap
table (as already demonstrated previously [3]).
vm-scalability
==============
usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure,
simulated PMEM as swap), average result of 6 test runs:
                              Before:          After:
System time:                  282.22s          283.47s
Sum Throughput:               5677.35 MB/s     5688.78 MB/s
Single process Throughput:    176.41 MB/s      176.23 MB/s
Free latency:                 518477.96 us     521488.06 us
The results are almost identical.
Build kernel test:
==================
Test using ZRAM as SWAP, make -j48, defconfig, on an x86_64 VM
with 4G RAM, under global pressure, avg of 32 test runs:
                Before:          After:
System time:    1379.91s         1364.22s (-1.14%)

Test using ZSWAP with NVME SWAP, make -j48, defconfig, on an x86_64 VM
with 4G RAM, under global pressure, avg of 32 test runs:
                Before:          After:
System time:    1822.52s         1803.33s (-1.05%)

The results are almost identical.
MySQL:
======
sysbench /usr/share/sysbench/oltp_read_only.lua --tables=16
--table-size=1000000 --threads=96 --time=600 (using ZRAM as SWAP, in a
512M memory cgroup, buffer pool set to 3G, 3 test runs and 180s warm up).
Before: 318162.18 qps
After:  318512.01 qps (+0.11%)
In conclusion, the results look better or identical in most cases, and
they are especially better for workloads with swap count > 1 on SYNC_IO
devices, with about a 20% gain in the tests above.  The next phases will
start to merge the swap count into the swap table and reduce memory
usage.
One more gain here is that we now have better support for THP swapin.
Previously, THP swapin was bound to swap cache bypassing, which only
works for single-mapped folios.  Removing the bypass path also enables
THP swapin for all folios.  THP swapin is still limited to SYNC_IO
devices; that limitation can be removed later.

This may cause more serious THP thrashing for certain workloads, but that
is not an issue introduced by this series; it's a common THP issue that
we should resolve separately.
This patch (of 19):
__read_swap_cache_async is widely used to allocate a folio and ensure it
is in the swap cache, or to get the folio if one is already there.

It's not async, and it's not doing any read.  Rename it to better reflect
its usage, and prepare for it to be reworked as part of the new swap
cache APIs.

Also, add some comments for the function.  Worth noting that the
skip_if_exists argument is a long-existing workaround that will be
dropped soon.
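
For context, the call pattern of the renamed helper is unchanged.  Below
is a minimal, simplified sketch of how a swapin path uses it, modeled on
the swapin_readahead-style call sites touched by this patch; locking,
readahead and error handling are omitted, and the local variables are
illustrative only:

	struct mempolicy *mpol;
	struct folio *folio;
	bool page_allocated;
	pgoff_t ilx;

	/* Resolve the NUMA policy for the faulting VMA/address. */
	mpol = get_vma_policy(vma, addr, 0, &ilx);
	/* Get the cached folio, or a newly allocated one bound to entry. */
	folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
				       &page_allocated, false);
	mpol_cond_put(mpol);
	if (folio && page_allocated)
		/* Only the caller that allocated the folio issues the read. */
		swap_read_folio(folio, NULL);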
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-0-8862a265a033@tencent.com
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-1-8862a265a033@tencent.com
Link: https://lore.kernel.org/linux-mm/CAMgjq7D5qoFEK9Omvd5_Zqs6M+TEoG03+2i_mhuP5CQPSOPrmQ@mail.gmail.com/
Link: https://lore.kernel.org/linux-mm/20240326185032.72159-1-ryncsn@gmail.com/
Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Acked-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
void *swap_cache_get_shadow(swp_entry_t entry);
void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow);
void swap_cache_del_folio(struct folio *folio);
+struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
+ struct mempolicy *mpol, pgoff_t ilx,
+ bool *alloced, bool skip_if_exists);
/* Below helpers require the caller to lock and pass in the swap cluster. */
void __swap_cache_del_folio(struct swap_cluster_info *ci,
struct folio *folio, swp_entry_t entry, void *shadow);
struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr,
struct swap_iocb **plug);
-struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags,
- struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
- bool skip_if_exists);
struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
struct mempolicy *mpol, pgoff_t ilx);
struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
}
}
-struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
- struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
- bool skip_if_exists)
+/**
+ * swap_cache_alloc_folio - Allocate folio for swapped out slot in swap cache.
+ * @entry: the swapped out swap entry to be bound to the folio.
+ * @gfp_mask: memory allocation flags
+ * @mpol: NUMA memory allocation policy to be applied
+ * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
+ * @new_page_allocated: set to true if a new folio was allocated, false otherwise
+ * @skip_if_exists: if the slot is in a partially cached state, return NULL.
+ * This is a workaround that will be removed shortly.
+ *
+ * Allocate a folio in the swap cache for one swap slot, typically before
+ * doing IO (e.g. swap in or zswap writeback). The swap slot indicated by
+ * @entry must have a non-zero swap count (swapped out).
+ * Currently only supports order 0.
+ *
+ * Context: Caller must protect the swap device with reference count or locks.
+ * Return: Returns the existing folio if @entry is cached already. Returns
+ * NULL if failed due to -ENOMEM or @entry has a swap count < 1.
+ */
+struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
+ struct mempolicy *mpol, pgoff_t ilx,
+ bool *new_page_allocated,
+ bool skip_if_exists)
{
struct swap_info_struct *si = __swap_entry_to_info(entry);
struct folio *folio;
goto put_and_return;
/*
- * Protect against a recursive call to __read_swap_cache_async()
+ * Protect against a recursive call to swap_cache_alloc_folio()
* on the same entry waiting forever here because SWAP_HAS_CACHE
* is set but the folio is not the swap cache yet. This can
* happen today if mem_cgroup_swapin_charge_folio() below
* triggers reclaim through zswap, which may call
- * __read_swap_cache_async() in the writeback path.
+ * swap_cache_alloc_folio() in the writeback path.
*/
if (skip_if_exists)
goto put_and_return;
* We might race against __swap_cache_del_folio(), and
* stumble across a swap_map entry whose SWAP_HAS_CACHE
* has not yet been cleared. Or race against another
- * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
+ * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE
* in swap_map, but not yet added its folio to swap cache.
*/
schedule_timeout_uninterruptible(1);
return NULL;
mpol = get_vma_policy(vma, addr, 0, &ilx);
- folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+ folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
&page_allocated, false);
mpol_cond_put(mpol);
blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
/* Ok, do the async read-ahead now */
- folio = __read_swap_cache_async(
- swp_entry(swp_type(entry), offset),
- gfp_mask, mpol, ilx, &page_allocated, false);
+ folio = swap_cache_alloc_folio(
+ swp_entry(swp_type(entry), offset), gfp_mask, mpol, ilx,
+ &page_allocated, false);
if (!folio)
continue;
if (page_allocated) {
lru_add_drain(); /* Push any new pages onto the LRU now */
skip:
/* The page was likely read above, so no need for plugging here */
- folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+ folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
&page_allocated, false);
if (unlikely(page_allocated))
swap_read_folio(folio, NULL);
if (!si)
continue;
}
- folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+ folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
&page_allocated, false);
if (si)
put_swap_device(si);
lru_add_drain();
skip:
/* The folio was likely read above, so no need for plugging here */
- folio = __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx,
+ folio = swap_cache_alloc_folio(targ_entry, gfp_mask, mpol, targ_ilx,
&page_allocated, false);
if (unlikely(page_allocated))
swap_read_folio(folio, NULL);
* CPU1 CPU2
* do_swap_page()
* ... swapoff+swapon
- * __read_swap_cache_async()
+ * swap_cache_alloc_folio()
* swapcache_prepare()
* __swap_duplicate()
* // check swap_map
return -EEXIST;
mpol = get_task_policy(current);
- folio = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
- NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
+ folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, mpol,
+ NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
put_swap_device(si);
if (!folio)
return -ENOMEM;