The mm_slot_entry() won't return a valid value if slot is NULL generally.
But currently it works since slot is the first element of struct
ksm_mm_slot.
To reduce the ambiguity and make it robust, access mm_slot_entry() when
slot is !NULL.
Li Zhe [Fri, 19 Sep 2025 09:23:53 +0000 (17:23 +0800)]
hugetlb: increase number of reserving hugepages via cmdline
Commit 79359d6d24df ("hugetlb: perform vmemmap optimization on a list of
pages") batches the submission of HugeTLB vmemmap optimization (HVO)
during hugepage reservation. With HVO enabled, hugepages obtained from
the buddy allocator are not submitted for optimization and their
struct-page memory is therefore not released—until the entire
reservation request has been satisfied. As a result, any struct-page
memory freed in the course of the allocation cannot be reused for the
ongoing reservation, artificially limiting the number of huge pages that
can ultimately be provided.
As commit b1222550fbf7 ("mm/hugetlb: do pre-HVO for bootmem allocated
pages") already applies early HVO to bootmem-allocated huge pages, this
patch extends the same benefit to non-bootmem pages by incrementally
submitting them for HVO as they are allocated, thereby returning
struct-page memory to the buddy allocator in real time. The change raises
the maximum 2 MiB hugepage reservation from just under 376 GB to more than
381 GB on a 384 GB x86 VM.
Link: https://lkml.kernel.org/r/20250919092353.41671-1-lizhe.67@bytedance.com Signed-off-by: Li Zhe <lizhe.67@bytedance.com> Cc: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Donet Tom [Tue, 23 Sep 2025 18:47:00 +0000 (00:17 +0530)]
selftests/mm: add fork inheritance test for ksm_merging_pages counter
Add a new selftest to verify whether the `ksm_merging_pages` counter in
`mm_struct` is not inherited by a child process after fork. This helps
ensure correctness of KSM accounting across process creation.
Link: https://lkml.kernel.org/r/e7bb17d374133bd31a3e423aa9e46e1122e74971.1758648700.git.donettom@linux.ibm.com Signed-off-by: Donet Tom <donettom@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Aboorva Devarajan <aboorvad@linux.ibm.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Donet Tom [Tue, 23 Sep 2025 18:46:59 +0000 (00:16 +0530)]
mm/ksm: fix incorrect KSM counter handling in mm_struct during fork
Patch series "mm/ksm: Fix incorrect accounting of KSM counters during
fork", v3.
The first patch in this series fixes the incorrect accounting of KSM
counters such as ksm_merging_pages, ksm_rmap_items, and the global
ksm_zero_pages during fork.
The following patch add a selftest to verify the ksm_merging_pages counter
was updated correctly during fork.
Test Results
============
Without the first patch
-----------------------
# [RUN] test_fork_ksm_merging_page_count
not ok 10 ksm_merging_page in child: 32
With the first patch
--------------------
# [RUN] test_fork_ksm_merging_page_count
ok 10 ksm_merging_pages is not inherited after fork
This patch (of 2):
Currently, the KSM-related counters in `mm_struct`, such as
`ksm_merging_pages`, `ksm_rmap_items`, and `ksm_zero_pages`, are inherited
by the child process during fork. This results in inconsistent
accounting.
When a process uses KSM, identical pages are merged and an rmap item is
created for each merged page. The `ksm_merging_pages` and
`ksm_rmap_items` counters are updated accordingly. However, after a fork,
these counters are copied to the child while the corresponding rmap items
are not. As a result, when the child later triggers an unmerge, there are
no rmap items present in the child, so the counters remain stale, leading
to incorrect accounting.
A similar issue exists with `ksm_zero_pages`, which maintains both a
global counter and a per-process counter. During fork, the per-process
counter is inherited by the child, but the global counter is not
incremented. Since the child also references zero pages, the global
counter should be updated as well. Otherwise, during zero-page unmerge,
both the global and per-process counters are decremented, causing the
global counter to become inconsistent.
To fix this, ksm_merging_pages and ksm_rmap_items are reset to 0 during
fork, and the global ksm_zero_pages counter is updated with the
per-process ksm_zero_pages value inherited by the child. This ensures
that KSM statistics remain accurate and reflect the activity of each
process correctly.
Link: https://lkml.kernel.org/r/cover.1758648700.git.donettom@linux.ibm.com Link: https://lkml.kernel.org/r/7b9870eb67ccc0d79593940d9dbd4a0b39b5d396.1758648700.git.donettom@linux.ibm.com Fixes: 7609385337a4 ("ksm: count ksm merging pages for each process") Fixes: cb4df4cae4f2 ("ksm: count allocated ksm rmap_items for each process") Fixes: e2942062e01d ("ksm: count all zero pages placed by KSM") Signed-off-by: Donet Tom <donettom@linux.ibm.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Aboorva Devarajan <aboorvad@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: xu xin <xu.xin16@zte.com.cn> Cc: <stable@vger.kernel.org> [6.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Donet Tom [Thu, 18 Sep 2025 05:41:44 +0000 (11:11 +0530)]
drivers/base/node: fix double free in register_one_node()
When device_register() fails in register_node(), it calls
put_device(&node->dev). This triggers node_device_release(), which calls
kfree(to_node(dev)), thereby freeing the entire node structure.
As a result, when register_node() returns an error, the node memory has
already been freed. Calling kfree(node) again in register_one_node()
leads to a double free.
This patch removes the redundant kfree(node) from register_one_node() to
prevent the double free.
Link: https://lkml.kernel.org/r/20250918054144.58980-1-donettom@linux.ibm.com Fixes: 786eb990cfb7 ("drivers/base/node: handle error properly in register_one_node()") Signed-off-by: Donet Tom <donettom@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Chris Mason <clm@meta.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dev Jain [Thu, 18 Sep 2025 09:34:53 +0000 (15:04 +0530)]
mm: remove PMD alignment constraint in execmem_vmalloc()
When using vmalloc with VM_ALLOW_HUGE_VMAP flag, it will set the alignment
to PMD_SIZE internally, if it deems huge mappings to be eligible.
Therefore, setting the alignment in execmem_vmalloc is redundant. Apart
from this, it also reduces the probability of allocation in case vmalloc
fails to allocate hugepages - in the fallback case, vmalloc tries to use
the original alignment and allocate basepages, which unfortunately will
again be PMD_SIZE passed over from execmem_vmalloc, thus constraining the
search for a free space in vmalloc region.
Therefore, remove this constraint.
Link: https://lkml.kernel.org/r/20250918093453.75676-1-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The kernel currently does not mlock large folios when adding them to rmap,
stating that it is difficult to confirm that the folio is fully mapped and
safe to mlock it.
This leads to a significant undercount of Mlocked in /proc/meminfo,
causing problems in production where the stat was used to estimate system
utilization and determine if load shedding is required.
However, nowadays the caller passes a number of pages of the folio that
are getting mapped, making it easy to check if the entire folio is mapped
to the VMA.
mlock the folio on rmap if it is fully mapped to the VMA.
Mlocked in /proc/meminfo can still undercount, but the value is closer the
truth and is useful for userspace.
Link: https://lkml.kernel.org/r/20250923110711.690639-7-kirill@shutemov.name Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/fault: try to map the entire file folio in finish_fault()
finish_fault() uses per-page fault for file folios. This only occurs for
file folios smaller than PMD_SIZE.
The comment suggests that this approach prevents RSS inflation. However,
it only prevents RSS accounting. The folio is still mapped to the
process, and the fact that it is mapped by a single PTE does not affect
memory pressure. Additionally, the kernel's ability to map large folios
as PMD if they are large enough does not support this argument.
When possible, map large folios in one shot. This reduces the number of
minor page faults and allows for TLB coalescing.
Mapping large folios at once will allow the rmap code to mlock it on add,
as it will recognize that it is fully mapped and mlocking is safe.
Link: https://lkml.kernel.org/r/20250923110711.690639-5-kirill@shutemov.name Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, try_to_unmap_once() only tries to mlock small folios.
Use logic similar to folio_referenced_one() to mlock large folios: only do
this for fully mapped folios and under page table lock that protects all
page table entries.
[akpm@linux-foundation.org: s/CROSSSED/CROSSED/] Link: https://lkml.kernel.org/r/20250923110711.690639-4-kirill@shutemov.name Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/rmap: fix a mlock race condition in folio_referenced_one()
The mlock_vma_folio() function requires the page table lock to be held in
order to safely mlock the folio. However, folio_referenced_one() mlocks a
large folios outside of the page_vma_mapped_walk() loop where the page
table lock has already been dropped.
Rework the mlock logic to use the same code path inside the loop for both
large and small folios.
Use PVMW_PGTABLE_CROSSED to detect when the folio is mapped across a page
table boundary.
[akpm@linux-foundation.org: s/CROSSSED/CROSSED/] Link: https://lkml.kernel.org/r/20250923110711.690639-3-kirill@shutemov.name Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/page_vma_mapped: track if the page is mapped across page table boundary
Patch series "mm: Improve mlock tracking for large folios", v3.
The patchset includes several fixes and improvements related to mlock
tracking of large folios.
The main objective is to reduce the undercount of Mlocked memory in
/proc/meminfo and improve the accuracy of the statistics.
Patches 1-2:
These patches address a minor race condition in folio_referenced_one()
related to mlock_vma_folio().
Currently, mlock_vma_folio() is called on large folio without the page
table lock, which can result in a race condition with unmap (i.e.
MADV_DONTNEED). This can lead to partially mapped folios on the
unevictable LRU list.
While not a significant issue, I do not believe backporting is necessary.
Patch 3:
This patch adds mlocking logic similar to folio_referenced_one() to
try_to_unmap_one(), allowing for mlocking of large folios where possible.
Patch 4-5:
These patches modifies finish_fault() and faultaround to map in the entire
folio when possible, enabling efficient mlocking upon addition to the
rmap.
Patch 6:
This patch makes rmap mlock large folios if they are fully mapped,
addressing the primary source of mlock undercount for large folios.
This patch (of 6):
Add a PVMW_PGTABLE_CROSSSED flag that page_vma_mapped_walk() will set if
the page is mapped across page table boundary. Unlike other PVMW_* flags,
this one is result of page_vma_mapped_walk() and not set by the caller.
folio_referenced_one() will use it to detect if it safe to mlock the
folio.
Wei Yang [Wed, 10 Sep 2025 09:22:40 +0000 (09:22 +0000)]
mm/compaction: fix low_pfn advance on isolating hugetlb
Commit 56ae0bb349b4 ("mm: compaction: convert to use a folio in
isolate_migratepages_block()") converts api from page to folio. But the
low_pfn advance for hugetlb page seems wrong when low_pfn doesn't point to
head page.
Originally, if page is a hugetlb tail page, compound_nr() return 1, which
means low_pfn only advance one in next iteration. After the change,
low_pfn would advance more than the hugetlb range, since folio_nr_pages()
always return total number of the large page. This results in skipping
some range to isolate and then to migrate.
The worst case for alloc_contig is it does all the isolation and
migration, but finally find some range is still not isolated. And then
undo all the work and try a new range.
Advance low_pfn to the end of hugetlb.
Link: https://lkml.kernel.org/r/20250910092240.3981-1-richard.weiyang@gmail.com Fixes: 56ae0bb349b4 ("mm: compaction: convert to use a folio in isolate_migratepages_block()") Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kho: make sure page being restored is actually from KHO
When restoring a page, no sanity checks are done to make sure the page
actually came from a kexec handover. The caller is trusted to pass in the
right address. If the caller has a bug and passes in a wrong address, an
in-use page might be "restored" and returned, causing all sorts of memory
corruption.
Harden the page restore logic by stashing in a magic number in
page->private along with the order. If the magic number does not match,
the page won't be touched. page->private is an unsigned long. The union
kho_page_info splits it into two parts, with one holding the order and the
other holding the magic number.
Link: https://lkml.kernel.org/r/20250917125725.665-2-pratyush@kernel.org Signed-off-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
While KHO exposes folio as the primitive externally, internally its
restoration machinery operates on pages. This can be seen with
kho_restore_folio() for example. It performs some sanity checks and hands
it over to kho_restore_page() to do the heavy lifting of page restoration.
After the work done by kho_restore_page(), kho_restore_folio() only
converts the head page to folio and returns it. Similarly,
deserialize_bitmap() operates on the head page directly to store the
order.
Move the sanity checks for valid phys and order from the public-facing
kho_restore_folio() to the private-facing kho_restore_page(). This makes
the boundary between page and folio clearer from KHO's perspective.
While at it, drop the comment above kho_restore_page(). The comment is
misleading now. The function stopped looking like free_reserved_page()
since 12b9a2c05d1b4 ("kho: initialize tail pages for higher order folios
properly"), and now looks even more different.
Link: https://lkml.kernel.org/r/20250917125725.665-1-pratyush@kernel.org Signed-off-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Wed, 17 Sep 2025 15:31:54 +0000 (08:31 -0700)]
mm/damon/sysfs: set damon_ctx->min_sz_region only for paddr use case
damon_ctx->addr_unit is respected only for physical address space
monitoring use case. Meanwhile, damon_ctx->min_sz_region is used by the
core layer for aligning regions, regardless of whether it is set for
physical address space monitoring or virtual address spaces monitoring.
And it is set as 'DAMON_MIN_REGION / damon_ctx->addr_unit'. Hence, if
user sets ->addr_unit on virtual address spaces monitoring mode, regions
can be unexpectedly aligned in <PAGE_SIZE granularity. It shouldn't cause
crash-like issues but make monitoring and DAMOS behavior difficult to
understand.
Fix the unexpected behavior by setting ->min_sz_region only when it is
configured for physical address space monitoring.
The issue was found from a result of Chris' experiments that thankfully
shared with me off-list.
Link: https://lkml.kernel.org/r/20250917160041.53187-1-sj@kernel.org Fixes: d8f867fa0825 ("mm/damon: add damon_ctx->min_sz_region") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: ze zuo <zuoze1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/vmalloc: move resched point into alloc_vmap_area()
Currently vm_area_alloc_pages() contains two cond_resched() points.
However, the page allocator already has its own in slow path so an extra
resched is not optimal because it delays the loops.
The place where CPU time can be consumed is in the VA-space search in
alloc_vmap_area(), especially if the space is really fragmented using
synthetic stress tests, after a fast path falls back to a slow one.
Move a single cond_resched() there, after dropping free_vmap_area_lock in
a slow path. This keeps fairness where it matters while removing
redundant yields from the page-allocation path.
[akpm@linux-foundation.org: tweak comment grammar] Link: https://lkml.kernel.org/r/20250917185906.1595454-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This removes the last call to page_stable_node(), so delete the wrapper.
It also removes a call to trylock_page() and saves a call to
compound_head(), as well as removing a reference to folio->page.
Link: https://lkml.kernel.org/r/20250916181219.2400258-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Longlong Xia <xialonglong@kylinos.cn> Cc: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Fri, 19 Sep 2025 16:21:34 +0000 (12:21 -0400)]
mm: page_alloc: avoid kswapd thrashing due to NUMA restrictions
On NUMA systems without bindings, allocations check all nodes for free
space, then wake up the kswapds on all nodes and retry. This ensures
all available space is evenly used before reclaim begins. However,
when one process or certain allocations have node restrictions, they
can cause kswapds on only a subset of nodes to be woken up.
Since kswapd hysteresis targets watermarks that are *higher* than
needed for allocation, even *unrestricted* allocations can now get
suckered onto such nodes that are already pressured. This ends up
concentrating all allocations on them, even when there are idle nodes
available for the unrestricted requests.
This was observed with two numa nodes, where node0 is normal and node1
is ZONE_MOVABLE to facilitate hotplugging: a kernel allocation wakes
kswapd on node0 only (since node1 is not eligible); once kswapd0 is
active, the watermarks hover between low and high, and then even the
movable allocations end up on node0, only to be kicked out again;
meanwhile node1 is empty and idle.
Similar behavior is possible when a process with NUMA bindings is
causing selective kswapd wakeups.
To fix this, on NUMA systems augment the (misleading) watermark test
with a check for whether kswapd is already active during the first
iteration through the zonelist. If this fails to place the request,
kswapd must be running everywhere already, and the watermark test is
good enough to decide placement.
With this patch, unrestricted requests successfully make use of node1,
even while kswapd is reclaiming node0 for restricted allocations.
[gourry@gourry.net: don't retry if no kswapds were active] Link: https://lkml.kernel.org/r/20250919162134.1098208-1-hannes@cmpxchg.org Signed-off-by: Gregory Price <gourry@gourry.net> Tested-by: Joshua Hahn <joshua.hahnjy@gmail.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Wed, 17 Sep 2025 05:16:37 +0000 (06:16 +0100)]
mm/oom_kill.c: fix inverted check
Fix an incorrect logic conversion in process_mrelease().
Link: https://lkml.kernel.org/r/3b7f0faf-4dbc-4d67-8a71-752fbcdf0906@lucifer.local Fixes: 12e423ba4eae ("mm: convert core mm to mm_flags_*() accessors") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Chris Mason <clm@meta.com> Closes: https://lkml.kernel.org/r/c2e28e27-d84b-4671-8784-de5fe0d14f41@lucifer.local Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The result depends on the address that the kernel returns on mmap(). If
it is located in an existing PMD table, the madvise() will succeed.
However, if the table does not exist, it will fail with -EINVAL.
This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when a
page table is missing, which causes collapse_pte_mapped_thp() to fail.
SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
tables as needed.
Link: https://lkml.kernel.org/r/v5ivpub6z2n2uyemlnxgbilzs52ep4lrary7lm7o6axxoneb75@yfacfl5rkzeh Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Barry Song <baohua@kernel.org> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Wed, 3 Sep 2025 17:48:42 +0000 (18:48 +0100)]
mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()
In commit bb666b7c2707 ("mm: add mmap_prepare() compatibility layer for
nested file systems") we introduced the ability for stacked drivers and
file systems to correctly invoke the f_op->mmap_prepare() handler from an
f_op->mmap() handler via a compatibility layer implemented in
compat_vma_mmap_prepare().
This populates vm_area_desc fields according to those found in the (not
yet fully initialised) VMA passed to f_op->mmap().
However this function implicitly assumes that the struct file which we are
operating upon is equal to vma->vm_file. This is not a safe assumption in
all cases.
The only really sane situation in which this matters would be something
like e.g. i915_gem_dmabuf_mmap() which invokes vfs_mmap() against
obj->base.filp:
ret = vfs_mmap(obj->base.filp, vma);
if (ret)
return ret;
And then sets the VMA's file to this, should the mmap operation succeed:
vma_set_file(vma, obj->base.filp);
That is - it is the file that is intended to back the VMA mapping.
This is not an issue currently, as so far we have only implemented
f_op->mmap_prepare() handlers for some file systems and internal mm uses,
and the only stacked f_op->mmap() operations that can be performed upon
these are those in backing_file_mmap() and coda_file_mmap(), both of which
use vma->vm_file.
However, moving forward, as we convert drivers to using
f_op->mmap_prepare(), this will become a problem.
Resolve this issue by explicitly setting desc->file to the provided file
parameter and update callers accordingly.
Callers are expected to read desc->file and update desc->vm_file - the
former will be the file provided by the caller (if stacked, this may
differ from vma->vm_file).
If the caller needs to differentiate between the two they therefore now
can.
While we are here, also provide a variant of compat_vma_mmap_prepare()
that operates against a pointer to any file_operations struct and does not
assume that the file_operations struct we are interested in is file->f_op.
This function is __compat_vma_mmap_prepare() and we invoke it from
compat_vma_mmap_prepare() so that we share code between the two functions.
This is important, because some drivers provide hooks in a separate
struct, for instance struct drm_device provides an fops field for this
purpose.
Also update the VMA selftests accordingly.
Link: https://lkml.kernel.org/r/dd0c72df8a33e8ffaa243eeb9b01010b670610e9.1756920635.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: David Hildenbrand <david@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Wed, 3 Sep 2025 17:48:41 +0000 (18:48 +0100)]
mm: specify separate file and vm_file params in vm_area_desc
Patch series "mm: do not assume file == vma->vm_file in
compat_vma_mmap_prepare()", v2.
As part of the efforts to eliminate the problematic f_op->mmap callback, a
new callback - f_op->mmap_prepare was provided.
While we are converting these callbacks, we must deal with 'stacked'
filesystems and drivers - those which in their own f_op->mmap callback
invoke an inner f_op->mmap callback.
To accomodate for this, a compatibility layer is provided that, via
vfs_mmap(), detects if f_op->mmap_prepare is provided and if so, generates
a vm_area_desc containing the VMA's metadata and invokes the call.
So far, we have provided desc->file equal to vma->vm_file. However this
is not necessarily valid, especially in the case of stacked drivers which
wish to assign a new file after the inner hook is invoked.
To account for this, we adjust vm_area_desc to have both file and vm_file
fields. The .vm_file field is strictly set to vma->vm_file (or in the
case of a new mapping, what will become vma->vm_file).
However, .file is set to whichever file vfs_mmap() is invoked with when
using the compatibilty layer.
Therefore, if the VMA's file needs to be updated in .mmap_prepare,
desc->vm_file should be assigned, whilst desc->file should be read.
No current f_op->mmap_prepare users assign desc->file so this is safe to
do.
This makes the .mmap_prepare callback in the context of a stacked
filesystem or driver completely consistent with the existing .mmap
implementations.
While we're here, we do a few small cleanups, and ensure that we const-ify
things correctly in the vm_area_desc struct to avoid hooks accidentally
trying to assign fields they should not.
This patch (of 2):
Stacked filesystems and drivers may invoke mmap hooks with a struct file
pointer that differs from the overlying file. We will make this
functionality possible in a subsequent patch.
In order to prepare for this, let's update vm_area_struct to separately
provide desc->file and desc->vm_file parameters.
The desc->file parameter is the file that the hook is expected to operate
upon, and is not assignable (though the hok may wish to e.g. update the
file's accessed time for instance).
The desc->vm_file defaults to what will become vma->vm_file and is what
the hook must reassign should it wish to change the VMA"s vma->vm_file.
For now we keep desc->file, vm_file the same to remain consistent.
No f_op->mmap_prepare() callback sets a new vma->vm_file currently, so
this is safe to change.
While we're here, make the mm_struct desc->mm pointers at immutable as
well as the desc->mm field itself.
As part of this change, also update the single hook which this would
otherwise break - mlock_future_ok(), invoked by secretmem_mmap_prepare()).
We additionally update set_vma_from_desc() to compare fields in a more
logical fashion, checking the (possibly) user-modified fields as the first
operand against the existing value as the second one.
Additionally, update VMA tests to accommodate changes.
Link: https://lkml.kernel.org/r/cover.1756920635.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/3fa15a861bb7419f033d22970598aa61850ea267.1756920635.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Pedro Falcato <pfalcato@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dev Jain [Mon, 8 Sep 2025 07:50:27 +0000 (13:20 +0530)]
mm: enable khugepaged anonymous collapse on non-writable regions
Patch series "Expand scope of khugepaged anonymous collapse", v2.
Currently khugepaged does not collapse an anonymous region which does not
have a single writable pte. This is wasteful since a region mapped with
non-writable ptes, for example, non-writable VMAs mapped by the
application, won't benefit from THP collapse.
An additional consequence of this constraint is that MADV_COLLAPSE does
not perform a collapse on a non-writable VMA, and this restriction is
nowhere to be found on the manpage - the restriction itself sounds wrong
to me since the user knows the protection of the memory it has mapped, so
collapsing read-only memory via madvise() should be a choice of the user
which shouldn't be overridden by the kernel.
Therefore, remove this constraint.
On an arm64 bare metal machine, comparing with vanilla 6.17-rc2, an
average of 5% improvement is seen on some mmtests benchmarks, particularly
hackbench, with a maximum improvement of 12%. In the following table, (I)
denotes statistically significant improvement, (R) denotes statistically
significant regression.
Currently khugepaged does not collapse an anonymous region which does not
have a single writable pte. This is wasteful since a region mapped with
non-writable ptes, for example, non-writable VMAs mapped by the
application, won't benefit from THP collapse.
An additional consequence of this constraint is that MADV_COLLAPSE does
not perform a collapse on a non-writable VMA, and this restriction is
nowhere to be found on the manpage - the restriction itself sounds wrong
to me since the user knows the protection of the memory it has mapped, so
collapsing read-only memory via madvise() should be a choice of the user
which shouldn't be overridden by the kernel.
Therefore, remove this restriction by not honouring SCAN_PAGE_RO.
Link: https://lkml.kernel.org/r/20250908075028.38431-1-dev.jain@arm.com Link: https://lkml.kernel.org/r/20250908075028.38431-2-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Tue, 16 Sep 2025 18:31:27 +0000 (11:31 -0700)]
mm/damon/stat: expose negative idle time
DAMON_STAT calculates the idle time of a region using the region's age if
the region's nr_accesses is zero. If the nr_accesses value is non-zero
(positive), the idle time of the region becomes zero.
This means the users cannot know how warm and hot data is distributed,
using DAMON_STAT's memory_idle_ms_percentiles output. The other stat,
namely estimated_memory_bandwidth, can help understanding how the overall
access temperature of the system is, but it is still very rough
information. On production systems, actually, a significant portion of
the system memory is observed with zero idle time, and we cannot break it
down based on its internal hotness distribution.
Define the idle time of the region using its age, similar to those having
zero nr_accesses, but multiples '-1' to distinguish it. And expose that
using the same parameter interface, memory_idle_ms_percentiles.
SeongJae Park [Tue, 16 Sep 2025 18:31:26 +0000 (11:31 -0700)]
mm/damon/stat: expose the current tuned aggregation interval
Patch series "mm/damon/stat: expose auto-tuned intervals and non-idle
ages".
DAMON_STAT is intentionally providing limited information for easy
consumption of the information. From production fleet level usages, below
limitations are found, though.
The aggregation interval of DAMON_STAT represents the granularity of the
memory_idle_ms_percentiles. But the interval is auto-tuned and not
exposed to users, so users cannot know the granularity.
All memory regions of non-zero (positive) nr_accesses are treated as
having zero idle time. A significant portion of production systems have
such zero idle time. Hence breakdown of warm and hot data is nearly
impossible.
Make following changes to overcome the limitations. Expose the auto-tuned
aggregation interval with a new parameter named aggr_interval_us. Expose
the age of non-zero nr_accesses (how long >0 access frequency the region
retained) regions as a negative idle time.
This patch (of 2):
DAMON_STAT calculates the idle time for a region as the region's age
multiplied by the aggregation interval. That is, the aggregation interval
is the granularity of the idle time. Since the aggregation interval is
auto-tuned and not exposed to users, however, users cannot easily know in
what granularity the stat is made. Expose the tuned aggregation interval
in microseconds via a new parameter, aggr_interval_us.
SeongJae Park [Tue, 16 Sep 2025 03:35:11 +0000 (20:35 -0700)]
samples/damon/mtier: use damon_initialized()
damon_sample_mtier is assuming DAMON is ready to use in module_init time,
and uses its own hack to see if it is the time. Use damon_initialized(),
which is a way for seeing if DAMON is ready to be used that is more
reliable and better to maintain instead of the hack.
SeongJae Park [Tue, 16 Sep 2025 03:35:10 +0000 (20:35 -0700)]
samples/damon/prcl: use damon_initialized()
damon_sample_prcl is assuming DAMON is ready to use in module_init time,
and uses its own hack to see if it is the time. Use damon_initialized(),
which is a way for seeing if DAMON is ready to be used that is more
reliable and better to maintain instead of the hack.
SeongJae Park [Tue, 16 Sep 2025 03:35:09 +0000 (20:35 -0700)]
samples/damon/wsse: use damon_initialized()
damon_sample_wsse is assuming DAMON is ready to use in module_init time,
and uses its own hack to see if it is the time. Use damon_initialized(),
which is a way for seeing if DAMON is ready to be used that is more
reliable and better to maintain instead of the hack.
SeongJae Park [Tue, 16 Sep 2025 03:35:08 +0000 (20:35 -0700)]
mm/damon/lru_sort: use damon_initialized()
DAMON_LRU_SORT is assuming DAMON is ready to use in module_init time, and
uses its own hack to see if it is the time. Use damon_initialized(),
which is a way for seeing if DAMON is ready to be used that is more
reliable and better to maintain instead of the hack.
SeongJae Park [Tue, 16 Sep 2025 03:35:07 +0000 (20:35 -0700)]
mm/damon/reclaim: use damon_initialized()
DAMON_RECLAIM is assuming DAMON is ready to use in module_init time, and
uses its own hack to see if it is the time. Use damon_initialized(),
which is a way for seeing if DAMON is ready to be used that is more
reliable and better to maintain instead of the hack.
SeongJae Park [Tue, 16 Sep 2025 03:35:06 +0000 (20:35 -0700)]
mm/damon/stat: use damon_initialized()
DAMON_STAT is assuming DAMON is ready to use in module_init time, and uses
its own hack to see if it is the time. Use damon_initialized(), which is
a way for seeing if DAMON is ready to be used that is more reliable and
better to maintain instead of the hack.
SeongJae Park [Tue, 16 Sep 2025 03:35:05 +0000 (20:35 -0700)]
mm/damon/core: implement damon_initialized() function
Patch series "mm/damon: define and use DAMON initialization check
function".
DAMON is initialized in subsystem initialization time, by damon_init().
If DAMON API functions are called before the initialization, the
system could crash. Actually such issues happened and were fixed [1]
in the past. For the fix, DAMON API callers have updated to check if
DAMON is initialized or not, using their own hacks. The hacks are
unnecessarily duplicated on every DAMON API callers and therefore it
would be difficult to reliably maintain in the long term.
Make it reliable and easy to maintain. For this, implement a new DAMON
core layer API function that returns if DAMON is successfully
initialized. If it returns true, it means DAMON API functions are safe
to be used. After the introduction of the new API, update DAMON API
callers to use the new function instead of their own hacks.
This patch (of 7):
If DAMON is tried to be used when it is not yet successfully initialized,
the caller could be crashed. DAMON core layer is not providing a reliable
way to see if it is successfully initialized and therefore ready to be
used, though. As a result, DAMON API callers are implementing their own
hacks to see it. The hacks simply assume DAMON should be ready on module
init time. It is not reliable as DAMON initialization can indeed fail if
KMEM_CACHE() fails, and difficult to maintain as those are duplicates.
Implement a core layer API function for better reliability and
maintainability to replace the hacks with followup commits.
SeongJae Park [Tue, 16 Sep 2025 03:23:39 +0000 (20:23 -0700)]
MAINTAINERS: rename DAMON section
DAMON section name is 'DATA ACCESS MONITOR', which implies it is only for
data access monitoring. But DAMON is now evolved for not only access
monitoring but also access-aware system operations (DAMOS). Rename the
section to simply DAMON. It might make it difficult to understand what it
does at a glance, but at least not spreading more confusion. Readers can
further refer to the documentation to better understand what really DAMON
does.
Link: https://lkml.kernel.org/r/20250916032339.115817-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Tue, 16 Sep 2025 03:23:38 +0000 (20:23 -0700)]
Docs/admin-guide/mm/damon/start: add --target_pid to DAMOS example command
The example command doesn't work [1] on the latest DAMON user-space tool,
since --damos_action option is updated to receive multiple arguments, and
hence cannot know if the final argument is for deductible monitoring
target or an argument for --damos_action option. Add --target_pid option
to let damo understand it is for target pid.
Link: https://lkml.kernel.org/r/20250916032339.115817-5-sj@kernel.org Link: https://github.com/damonitor/damo/pull/32 Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Tue, 16 Sep 2025 03:23:37 +0000 (20:23 -0700)]
Docs/mm/damon/maintainer-profile: update community meetup for reservation requirements
DAMON community meetup was having two different kinds of meetups:
reservation required ones and unrequired ones. Now the reservation
unrequested one is gone, but the documentation on the maintainer-profile
is not updated. Update.
Link: https://lkml.kernel.org/r/20250916032339.115817-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Tue, 16 Sep 2025 03:23:36 +0000 (20:23 -0700)]
mm/damon/core: set effective quota on first charge window
The effective quota of a scheme is initialized zero, which means there is
no quota. It is set based on user-specified time/quota/quota goals. But
the later value set is done only from the second charge window. As a
result, a scheme having a user-specified quota can work as not having the
quota (unexpectedly fast) for the first charge window. In practical and
common use cases the quota interval is not too long, and the scheme's
target access pattern is restrictive. Hence the issue should be modest.
That said, it is apparently an unintended misbehavior. Fix the problem by
setting esz on the first charge window.
Link: https://lkml.kernel.org/r/20250916032339.115817-3-sj@kernel.org Fixes: 1cd243030059 ("mm/damon/schemes: implement time quota") # 5.16.x Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Tue, 16 Sep 2025 03:23:35 +0000 (20:23 -0700)]
mm/damon/core: reset age if nr_accesses changes between non-zero and zero
Patch series "mm/damon: misc fixups and improvements for 6.18", v2.
Misc fixes and improvements for DAMON that are not critical and therefore
aims to be merged into Linux 6.18-rc1.
The first patch improves DAMON's age counting for nr_accesses zero to/from
non-zero changes.
The second patch fixes an initial DAMOS apply interval delay issue that is
not realistic but still could happen on an odd setup.
The third and the fourth patches update DAMON community meetup description
and DAMON user-space tool example command for DAMOS usage, respectively.
Finally, the fifth patch updates MAINTAINERS section name for DAMON to
just DAMON.
This patch (of 5):
DAMON resets the age of a region if its nr_accesses value has
significantly changed. Specifically, the threshold is calculated as 20%
of largest nr_accesses of the current snapshot. This means that regions
changing the nr_accesses from zero to small non-zero value or from a small
non-zero value to zero will keep the age. Since many users treat zero
nr_accesses regions special, this can be confusing. Kernel code including
DAMOS' regions priority calculation and DAMON_STAT's idle time calculation
also treat zero nr_accesses regions special. Make it unconfusing by
resetting the age when the nr_accesses changes between zero and a non-zero
value.
Link: https://lkml.kernel.org/r/20250916032339.115817-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250916032339.115817-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
alloc_tag: mark inaccurate allocation counters in /proc/allocinfo output
While rare, memory allocation profiling can contain inaccurate counters if
slab object extension vector allocation fails. That allocation might
succeed later but prior to that, slab allocations that would have used
that object extension vector will not be accounted for. To indicate
incorrect counters, "accurate:no" marker is appended to the call site line
in the /proc/allocinfo output. Bump up /proc/allocinfo version to reflect
the change in the file format and update documentation.
mm/oom_kill: the OOM reaper traverses the VMA maple tree in reverse order
Although the oom_reaper is delayed and it gives the oom victim chance to
clean up its address space this might take a while especially for
processes with a large address space footprint. In those cases oom_reaper
might start racing with the dying task and compete for shared resources -
e.g. page table lock contention has been observed.
Reduce those races by reaping the oom victim from the other end of the
address space.
It is also a significant improvement for process_mrelease(). When a
process is killed, process_mrelease is used to reap the killed process and
often runs concurrently with the dying task. The test data shows that
after applying the patch, lock contention is greatly reduced during the
procedure of reaping the killed process.
The test is conducted on arm64. The following basic perf numbers show
that applying this patch significantly reduces pte spin lock contention.
Detailed breakdowns for both scenarios are provided below. The cumulative
time for oom_reaper plus exit_mmap(victim) in both cases is also
summarized, making the performance improvements clear.
1. The reduction in total time comes mainly from the decrease in time
spent on pte spinlock and other locks.
2. oom_reaper performs more work in some areas, but at the same time,
exit_mmap also handles certain tasks more efficiently, such as
folio_remove_rmap_ptes.
Here is a more detailed perf report. [1]
Link: https://lkml.kernel.org/r/20250915162946.5515-3-zhongjinji@honor.com Link: https://lore.kernel.org/all/20250915162619.5133-1-zhongjinji@honor.com/ Signed-off-by: zhongjinji <zhongjinji@honor.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Len Brown <lenb@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Improvements to Victim Process Thawing and OOM Reaper
Traversal Order", v10.
This patch series focuses on optimizing victim process thawing and
refining the traversal order of the OOM reaper. Since __thaw_task() is
used to thaw a single thread of the victim, thawing only one thread cannot
guarantee the exit of the OOM victim when it is frozen. Patch 1 thaw the
entire process of the OOM victim to ensure that OOM victims are able to
terminate themselves. Even if the oom_reaper is delayed, patch 2 is still
beneficial for reaping processes with a large address space footprint, and
it also greatly improves process_mrelease.
This patch (of 10):
OOM killer is a mechanism that selects and kills processes when the system
runs out of memory to reclaim resources and keep the system stable. But
the oom victim cannot terminate on its own when it is frozen, even if the
OOM victim task is thawed through __thaw_task(). This is because
__thaw_task() can only thaw a single OOM victim thread, and cannot thaw
the entire OOM victim process.
In addition, freezing_slow_path() determines whether a task is an OOM
victim by checking the task's TIF_MEMDIE flag. When a task is identified
as an OOM victim, the freezer bypasses both PM freezing and cgroup
freezing states to thaw it.
Historically, TIF_MEMDIE was a "this is the oom victim & it has access to
memory reserves" flag in the past. It has that thread vs. process
problems and tsk_is_oom_victim was introduced later to get rid of them and
other issues as well as the guarantee that we can identify the oom
victim's mm reliably for other oom_reaper.
Therefore, thaw_process() is introduced to unfreeze all threads within the
OOM victim process, ensuring that every thread is properly thawed. The
freezer now uses tsk_is_oom_victim() to determine OOM victim status,
allowing all victim threads to be unfrozen as necessary.
With this change, the entire OOM victim process will be thawed when an OOM
event occurs, ensuring that the victim can terminate on its own.
Link: https://lkml.kernel.org/r/20250915162946.5515-1-zhongjinji@honor.com Link: https://lkml.kernel.org/r/20250915162946.5515-2-zhongjinji@honor.com Signed-off-by: zhongjinji <zhongjinji@honor.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Len Brown <lenb@kernel.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Tue, 16 Sep 2025 03:15:49 +0000 (20:15 -0700)]
mm/damon/lru_sort: use param_ctx for damon_attrs staging
damon_lru_sort_apply_parameters() allocates a new DAMON context, stages
user-specified DAMON parameters on it, and commits to running DAMON
context at once, using damon_commit_ctx(). The code is, however, directly
updating the monitoring attributes of the running context. And the
attributes are over-written by later damon_commit_ctx() call. This means
that the monitoring attributes parameters are not really working. Fix the
wrong use of the parameter context.
Link: https://lkml.kernel.org/r/20250916031549.115326-1-sj@kernel.org Fixes: a30969436428 ("mm/damon/lru_sort: use damon_commit_ctx()") Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: <stable@vger.kernel.org> [6.11+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The while loop doesn't execute and following warning gets generated:
protection_keys.c:561:15: warning: code will never be executed
[-Wunreachable-code]
int rpkey = alloc_random_pkey();
Let's enable the while loop such that it gets executed nr_iterations
times. Simplify the code a bit as well.
Link: https://lkml.kernel.org/r/20250912123025.1271051-3-usama.anjum@collabora.com Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
selftests/mm: add -Wunreachable-code and fix warnings
Patch series "selftests/mm: Add -Wunreachable-code and fix warnings".
Add -Wunreachable-code to selftests and remove dead code from generated
warnings.
This patch (of 2):
Enable -Wunreachable-code flag to catch dead code and fix them.
1. Remove the dead code and write a comment instead:
hmm-tests.c:2033:3: warning: code will never be executed
[-Wunreachable-code]
perror("Should not reach this\n");
^~~~~~
2. ksft_exit_fail_msg() calls exit(). So cleanup isn't done. Replace it
with ksft_print_msg().
split_huge_page_test.c:301:3: warning: code will never be executed
[-Wunreachable-code]
goto cleanup;
^~~~~~~~~~~~
Link: https://lkml.kernel.org/r/20250912123025.1271051-1-usama.anjum@collabora.com Link: https://lkml.kernel.org/r/20250912123025.1271051-2-usama.anjum@collabora.com Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Lance Yang <lance.yang@linux.dev> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
resource: improve child resource handling in release_mem_region_adjustable()
When memory block is removed via try_remove_memory(), it eventually
reaches release_mem_region_adjustable(). The current implementation
assumes that when a busy memory resource is split into two, all child
resources remain in the lower address range.
This simplification causes problems when child resources actually belong
to the upper split. For example:
* Initial memory layout:
lsmem
RANGE SIZE STATE REMOVABLE BLOCK
0x0000000000000000-0x00000002ffffffff 12G online yes 0-95
* After offlining and removing range
0x150000000-0x157ffffff
lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED
(output according to upcoming lsmem changes with the configured column:
s390)
RANGE SIZE STATE BLOCK CONFIGURED
0x0000000000000000-0x000000014fffffff 5.3G online 0-41 yes
0x0000000150000000-0x0000000157ffffff 128M offline 42 no
0x0000000158000000-0x00000002ffffffff 6.6G online 43-95 yes
The iomem resource gets split into two entries, but kernel code, kernel
data, and kernel bss remain attached to the lower resource [0–5376M]
instead of the correct upper resource [5504M–12288M].
Also, resource adjustment doesn't happen and stale resources still cover
[0-12288M]. Later, memory re-add fails in register_memory_resource() with
-EBUSY.
Enhance release_mem_region_adjustable() to reassign child resources to the
correct parent after a split. Children are now assigned based on their
actual range: If they fall within the lower split, keep them in the lower
parent. If they fall within the upper split, move them to the upper
parent.
Kernel code/data/bss regions are not offlined, so they will always reside
entirely within one parent and never span across both.
Quanmin Yan [Wed, 10 Sep 2025 11:32:21 +0000 (19:32 +0800)]
mm/damon/reclaim: support addr_unit for DAMON_RECLAIM
Implement a sysfs file to expose addr_unit for DAMON_RECLAIM users.
During parameter application, use the configured addr_unit parameter to
perform the necessary initialization. Similar to the core layer, prevent
setting addr_unit to zero.
It is worth noting that when monitor_region_start and monitor_region_end
are unset (i.e., 0), their values will later be set to biggest_system_ram.
At that point, addr_unit may not be the default value 1. Although we
could divide the biggest_system_ram value by addr_unit, changing addr_unit
without setting monitor_region_start/end should be considered a user
misoperation. And biggest_system_ram is only within the 0~ULONG_MAX
range, system can clearly work correctly with addr_unit=1. Therefore, if
monitor_region_start/end are unset, always silently reset addr_unit to 1.
Link: https://lkml.kernel.org/r/20250910113221.1065764-3-yanquanmin1@huawei.com Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: ze zuo <zuoze1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Quanmin Yan [Wed, 10 Sep 2025 11:32:20 +0000 (19:32 +0800)]
mm/damon/lru_sort: support addr_unit for DAMON_LRU_SORT
Patch series "mm/damon: add addr_unit for DAMON_LRU_SORT and
DAMON_RECLAIM".
In DAMON_LRU_SORT and DAMON_RECLAIM, damon_ctx is independent of the core.
Add addr_unit to these modules to support systems like ARM32 with LPAE.
This patch (of 2):
Implement a sysfs file to expose addr_unit for DAMON_LRU_SORT users.
During parameter application, use the configured addr_unit parameter to
perform the necessary initialization. Similar to the core layer, prevent
setting addr_unit to zero.
It is worth noting that when monitor_region_start and monitor_region_end
are unset (i.e., 0), their values will later be set to biggest_system_ram.
At that point, addr_unit may not be the default value 1. Although we
could divide the biggest_system_ram value by addr_unit, changing addr_unit
without setting monitor_region_start/end should be considered a user
misoperation. And biggest_system_ram is only within the 0~ULONG_MAX
range, system can clearly work correctly with addr_unit=1. Therefore, if
monitor_region_start/end are unset, always silently reset addr_unit to 1.
selftests/mm: gup_tests: option to GUP all pages in a single call
We recently missed detecting an issue during early testing because the
default (!all) tests would not trigger it and even when running "all"
tests it only would happen sometimes because of races.
So let's allow for an easy way to specify "GUP all pages in a single
call", extend the test matrix and extend our default (!all) tests.
By GUP'ing all pages in a single call, with the default size of 128MiB
we'll cover multiple leaf page tables / PMDs on architectures with sane
THP sizes.
Link: https://lkml.kernel.org/r/20250910093051.1693097-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We already use page->private for storing the order of a page while it's in
the buddy allocator system; extend that to also storing the order while
it's in the pcp_llist.
Link: https://lkml.kernel.org/r/20250910142923.2465470-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: remove redundant test in validate_page_before_insert()
The page_has_type() call would have included slab since commit 46df8e73a4a3 and now we don't even get that far because slab pages have a
zero refcount since commit 9aec2fb0fd5e.
Link: https://lkml.kernel.org/r/20250910142923.2465470-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
These small cleanups can be applied now to reduce conflicts during the
next merge window. They're all from various efforts to split struct page
from other memdescs. Thanks to Vlastimil for the suggestion.
This patch (of 3):
These functions do not modify their arguments. Telling the compiler this
may improve code generation, and allows us to pass const arguments from
other functions.
mm: lru_add_drain_all() do local lru_add_drain() first
No numbers to back this up, but it seemed obvious to me, that if there are
competing lru_add_drain_all()ers, the work will be minimized if each
flushes its own local queues before locking and doing cross-CPU drains.
Link: https://lkml.kernel.org/r/33389bf8-f79d-d4dd-b7a4-680c4aa21b23@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Keir Fraser <keirf@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Li Zhe <lizhe.67@bytedance.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Cc: yangge <yangge1116@126.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Aristeu Rozanski [Tue, 26 Aug 2025 15:37:21 +0000 (11:37 -0400)]
mm: make folio page count functions return unsigned
As raised by Andrew [1], a folio/compound page never spans a negative
number of pages. Consequently, let's use "unsigned long" instead of
"long" consistently for folio_nr_pages(), folio_large_nr_pages() and
compound_nr().
Using "unsigned long" as return value is fine, because even
"(long)-folio_nr_pages()" will keep on working as expected. Using
"unsigned int" instead would actually break these use cases.
This patch takes the first step changing these to return unsigned long
(and making drm_gem_get_pages() use the new types instead of replacing
min()).
In the future, we might want to make more callers of these functions to
consistently use "unsigned long".
selftests/mm: remove PROT_EXEC req from file-collapse tests
As of v6.8 commit 7fbb5e188248 ("mm: remove VM_EXEC requirement for THP
eligibility") thp collapse no longer requires file-backed mappings be
created with PROT_EXEC.
Remove the overly-strict dependency from thp collapse tests so we test the
least-strict requirement for success.
Link: https://lkml.kernel.org/r/20250909190534.512801-1-zokeefe@google.com Signed-off-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Brian Norris [Tue, 9 Sep 2025 20:13:57 +0000 (13:13 -0700)]
mm: vm_event_item: explicit #include for THREAD_SIZE
This header uses THREAD_SIZE, which is provided by the thread_info.h
header but is not included in this header. Depending on the #include
ordering in other files, this can produce preprocessor errors.
Link: https://lkml.kernel.org/r/20250909201419.827638-1-briannorris@chromium.org Signed-off-by: Brian Norris <briannorris@chromium.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
alloc_tag: avoid warnings when freeing non-compound "tail" pages
When freeing "tail" pages of a non-compount high-order page, we properly
subtract the allocation tag counters, however later when these pages are
released, alloc_tag_sub() will issue warnings because tags for these pages
are NULL.
This issue was originally anticipated by Vlastimil in his review [1] and
then recently reported by David. Prevent warnings by marking the tags
empty.
Link: https://lkml.kernel.org/r/20250915212756.3998938-4-surenb@google.com Link: https://lore.kernel.org/all/6db0f0c8-81cb-4d04-9560-ba73d63db4b8@suse.cz/ Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: David Wang <00107082@163.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Usama Arif <usamaarif642@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Sourav Panda <souravpanda@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
alloc_tag: prevent enabling memory profiling if it was shut down
Memory profiling can be shut down due to reasons like a failure during
initialization. When this happens, the user should not be able to
re-enable it. Current sysctrl interface does not handle this properly and
will allow re-enabling memory profiling. Fix this by checking for this
condition during sysctrl write operation.
Link: https://lkml.kernel.org/r/20250915212756.3998938-3-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Usama Arif <usamaarif642@gmail.com> Cc: David Wang <00107082@163.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Sourav Panda <souravpanda@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jackie Liu [Mon, 8 Sep 2025 06:26:14 +0000 (14:26 +0800)]
mm/shmem: remove unused entry_order after large swapin rework
After commit 93c0476e7057 ("mm/shmem, swap: rework swap entry and index
calculation for large swapin"), xas_get_order() will never return a
non-zero value for `entry_order` in shmem_split_large_entry(). As a
result, the local variable `entry_order` is effectively unused.
Clean up the code by removing `entry_order` and directly using
`cur_order`. This change is purely a refactor and has no functional
impact.
No functional change intended.
Link: https://lkml.kernel.org/r/20250908062614.89880-1-liu.yun@linux.dev Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lance Yang [Mon, 8 Sep 2025 09:07:41 +0000 (17:07 +0800)]
mm: skip mlocked THPs that are underused early in deferred_split_scan()
When we stumble over a fully-mapped mlocked THP in the deferred shrinker,
it does not make sense to try to detect whether it is underused, because
try_to_map_unused_to_zeropage(), called while splitting the folio, will
not actually replace any zeroed pages by the shared zeropage.
Splitting the folio in that case does not make any sense, so let's not
even scan to check if the folio is underused.
Link: https://lkml.kernel.org/r/20250908090741.61519-1-lance.yang@linux.dev Signed-off-by: Lance Yang <lance.yang@linux.dev> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Kiryl Shutsemau <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Once support for THP migration of zone device pages is enabled, device
private swap entries will be found during the walk not only for PTEs but
also for PMDs.
Therefore, it is necessary to extend to PMDs the special handling which is
already in place for PTEs when device private pages are owned by the
caller: instead of faulting or skipping the range, the correct behavior is
to use the swap entry to populate HMM PFNs.
This change is a prerequisite to make use of device-private THP in drivers
using drivers/gpu/drm/drm_pagemap, such as xe.
Even though subsequent PFNs can be inferred when handling large order
PFNs, the PFN list is still fully populated because this is currently
expected by HMM users. In case this changes in the future, that is all
HMM users support a sparsely populated PFN list, the for() loop can be
made to skip remaining PFNs for the current order. A quick test shows the
loop takes about 10 ns, roughly 20 times faster than without this
optimization.
Link: https://lkml.kernel.org/r/20250908091052.612303-1-francois.dugast@intel.com Signed-off-by: Francois Dugast <francois.dugast@intel.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Leon Romanovsky <leonro@nvidia.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: David Airlie <airlied@gmail.com> Cc: Christian König <christian.koenig@amd.com> Cc: Mika Penttilä <mpenttil@redhat.com> Cc: Thomas Hellstrom <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/gup: fix handling of errors from arch_make_folio_accessible() in follow_page_pte()
In case we call arch_make_folio_accessible() and it fails, we would
incorrectly return a value that is "!= 0" to the caller, indicating that
we pinned all requested pages and that the caller can keep going.
follow_page_pte() is not supposed to return error values, but instead "0"
on failure and "1" on success -- we'll clean that up separately.
In case we return "!= 0", the caller will just keep going pinning more
pages. If we happen to pin a page afterwards, we're in trouble, because
we essentially skipped some pages in the requested range.
Staring at the arch_make_folio_accessible() implementation on s390x, I
assume it should actually never really fail unless something unexpected
happens (BUG?). So let's not CC stable and just fix common code to do the
right thing.
Clean up the code a bit now that there is no reason to store the return
value of arch_make_folio_accessible().
Link: https://lkml.kernel.org/r/20250908094517.303409-1-david@redhat.com Fixes: f28d43636d6f ("mm/gup/writeback: add callbacks for inaccessible pages") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Chanwon Park [Mon, 8 Sep 2025 10:04:10 +0000 (19:04 +0900)]
mm: re-enable kswapd when memory pressure subsides or demotion is toggled
If kswapd fails to reclaim pages from a node MAX_RECLAIM_RETRIES in a
row, kswapd on that node gets disabled. That is, the system won't wakeup
kswapd for that node until page reclamation is observed at least once.
That reclamation is mostly done by direct reclaim, which in turn enables
kswapd back.
However, on systems with CXL memory nodes, workloads with high anon page
usage can disable kswapd indefinitely, without triggering direct
reclaim. This can be reproduced with following steps:
numa node 0 (32GB memory, 48 CPUs)
numa node 2~5 (512GB CXL memory, 128GB each)
(numa node 1 is disabled)
swap space 8GB
1) Set /sys/kernel/mm/demotion_enabled to 0.
2) Set /proc/sys/kernel/numa_balancing to 0.
3) Run a process that allocates and random accesses 500GB of anon
pages.
4) Let the process exit normally.
During 3), free memory on node 0 gets lower than low watermark, and
kswapd runs and depletes swap space. Then, kswapd fails consecutively
and gets disabled. Allocation afterwards happens on CXL memory, so node
0 never gains more memory pressure to trigger direct reclaim.
After 4), kswapd on node 0 remains disabled, and tasks running on that
node are unable to swap. If you turn on NUMA_BALANCING_MEMORY_TIERING
and demotion now, it won't work properly since kswapd is disabled.
To mitigate this problem, reset kswapd_failures to 0 on following
conditions:
a) ZONE_BELOW_HIGH bit of a zone in hopeless node with a fallback
memory node gets cleared.
b) demotion_enabled is changed from false to true.
Rationale for a):
ZONE_BELOW_HIGH bit being cleared might be a sign that the node may
be reclaimable afterwards. This won't help much if the memory-hungry
process keeps running without freeing anything, but at least the node
will go back to reclaimable state when the process exits.
Rationale for b):
When demotion_enabled is false, kswapd can only reclaim anon pages by
swapping them out to swap space. If demotion_enabled is turned on,
kswapd can demote anon pages to another node for reclaiming. So, the
original failure count for determining reclaimability is no longer
valid.
Since kswapd_failures resets may be missed by ++ operation, it is
changed from int to atomic_t.
[akpm@linux-foundation.org: tweak whitespace] Link: https://lkml.kernel.org/r/aL6qGi69jWXfPc4D@pcw-MS-7D22 Signed-off-by: Chanwon Park <flyinrm@gmail.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Chunyu Hu [Fri, 12 Sep 2025 01:37:11 +0000 (09:37 +0800)]
selftests/mm: fix va_high_addr_switch.sh failure on x86_64
The test will fail as below on x86_64 with cpu la57 support (will skip if
no la57 support). Note, the test requries nr_hugepages to be set first.
# running bash ./va_high_addr_switch.sh
# -------------------------------------
# mmap(addr_switch_hint - pagesize, pagesize): 0x7f55b60fa000 - OK
# mmap(addr_switch_hint - pagesize, (2 * pagesize)): 0x7f55b60f9000 - OK
# mmap(addr_switch_hint, pagesize): 0x800000000000 - OK
# mmap(addr_switch_hint, 2 * pagesize, MAP_FIXED): 0x800000000000 - OK
# mmap(NULL): 0x7f55b60f9000 - OK
# mmap(low_addr): 0x40000000 - OK
# mmap(high_addr): 0x1000000000000 - OK
# mmap(high_addr) again: 0xffff55b6136000 - OK
# mmap(high_addr, MAP_FIXED): 0x1000000000000 - OK
# mmap(-1): 0xffff55b6134000 - OK
# mmap(-1) again: 0xffff55b6132000 - OK
# mmap(addr_switch_hint - pagesize, pagesize): 0x7f55b60fa000 - OK
# mmap(addr_switch_hint - pagesize, 2 * pagesize): 0x7f55b60f9000 - OK
# mmap(addr_switch_hint - pagesize/2 , 2 * pagesize): 0x7f55b60f7000 - OK
# mmap(addr_switch_hint, pagesize): 0x800000000000 - OK
# mmap(addr_switch_hint, 2 * pagesize, MAP_FIXED): 0x800000000000 - OK
# mmap(NULL, MAP_HUGETLB): 0x7f55b5c00000 - OK
# mmap(low_addr, MAP_HUGETLB): 0x40000000 - OK
# mmap(high_addr, MAP_HUGETLB): 0x1000000000000 - OK
# mmap(high_addr, MAP_HUGETLB) again: 0xffff55b5e00000 - OK
# mmap(high_addr, MAP_FIXED | MAP_HUGETLB): 0x1000000000000 - OK
# mmap(-1, MAP_HUGETLB): 0x7f55b5c00000 - OK
# mmap(-1, MAP_HUGETLB) again: 0x7f55b5a00000 - OK
# mmap(addr_switch_hint - pagesize, 2*hugepagesize, MAP_HUGETLB): 0x800000000000 - FAILED
# mmap(addr_switch_hint , 2*hugepagesize, MAP_FIXED | MAP_HUGETLB): 0x800000000000 - OK
# [FAIL]
addr_switch_hint is defined as DFEFAULT_MAP_WINDOW in the failed test (for
x86_64, DFEFAULT_MAP_WINDOW is defined as (1UL<<47) - pagesize) in 64 bit.
Before commit cc92882ee218 ("mm: drop hugetlb_get_unmapped_area{_*}
functions"), for x86_64 hugetlb_get_unmapped_area() is handled in arch
code arch/x86/mm/hugetlbpage.c and addr is checked with
map_address_hint_valid() after align with 'addr &= huge_page_mask(h)'
which is a round down way, and it will fail the check because the addr is
within the DEFAULT_MAP_WINDOW but (addr + len) is above the
DFEFAULT_MAP_WINDOW. So it wil go through the
hugetlb_get_unmmaped_area_top_down() to find an area within the
DFEFAULT_MAP_WINDOW.
After commit cc92882ee218 ("mm: drop hugetlb_get_unmapped_area{_*}
functions"). The addr hint for hugetlb_get_unmmaped_area() will be
rounded up and aligned to hugepage size with ALIGN() for all arches. And
after the align, the addr will be above the default MAP_DEFAULT_WINDOW,
and the map_addresshint_valid() check will pass because both aligned addr
(addr0) and (addr + len) are above the DEFAULT_MAP_WINDOW, and the aligned
hint address (0x800000000000) is returned as an suitable gap is found
there, in arch_get_unmapped_area_topdown().
To still cover the case that addr is within the DEFAULT_MAP_WINDOW, and
addr + len is above the DFEFAULT_MAP_WINDOW, change to choose the last
hugepage aligned address within the DEFAULT_MAP_WINDOW as the hint addr,
and the addr + len (2 hugepages) will be one hugepage above the
DEFAULT_MAP_WINDOW. An aligned address won't be affected by the page
round up or round down from kernel, so it's determistic.
Link: https://lkml.kernel.org/r/20250912013711.3002969-4-chuhu@redhat.com Fixes: cc92882ee218 ("mm: drop hugetlb_get_unmapped_area{_*} functions") Signed-off-by: Chunyu Hu <chuhu@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Chunyu Hu [Fri, 12 Sep 2025 01:37:10 +0000 (09:37 +0800)]
selftests/mm: alloc hugepages in va_high_addr_switch test
Alloc hugepages in the test internally, so we don't fully rely on the
run_vmtests.sh. If run_vmtests.sh does that great, free hugepages is
enough for being used to run the test, leave it as it is, otherwise setup
the hugepages in the test.
Save the original nr_hugepages value and restore it after test finish, so
leave a stable test envronment.
Chunyu Hu [Fri, 12 Sep 2025 01:37:09 +0000 (09:37 +0800)]
selftests/mm: fix hugepages cleanup too early
Patch series "Fix va_high_addr_switch.sh test failure", v3.
These three patches fix the va_high_addr_switch.sh test failure on x86_64.
Patch 1 fixes the hugepage setup issue that nr_hugepages is reset too
early in run_vmtests.sh and break the later va_high_addr_switch testing.
Patch 2 adds hugepage setup in va_high_addr_switch test, so that it can
still work if vm_runtests.sh changes the hugepage setup someday.
Patch 3 fixes the test failure caused by the hint addr align method change
in hugetlb_get_unmapped_area().
This patch (of 3):
The nr_hugepgs variable is used to keep the original nr_hugepages at the
hugepage setup step at test beginning. After userfaultfd test, a cleaup
is executed, both /sys/kernel/mm/hugepages/hugepages-*/nr_hugepages and
/proc/sys//vm/nr_hugepages are reset to 'original' value before
userfaultfd test starts.
Issue here is the value used to restore /proc/sys/vm/nr_hugepages is
nr_hugepgs which is the initial value before the vm_runtests.sh runs, not
the value before userfaultfd test starts. 'va_high_addr_swith.sh' tests
runs after that will possibly see no hugepages available for test, and got
EINVAL when mmap(HUGETLB), making the result invalid.
And before pkey tests, nr_hugepgs is changed to be used as a temp variable
to save nr_hugepages before pkey test, and restore it after pkey tests
finish. The original nr_hugepages value is not tracked anymore, so no way
to restore it after all tests finish.
Add a new variable orig_nr_hugepgs to save the original nr_hugepages, and
and restore it to nr_hugepages after all tests finish. And change to use
the nr_hugepgs variable to save the /proc/sys/vm/nr_hugeages after
hugepage setup, it's also the value before userfaultfd test starts, and
the correct value to be restored after userfaultfd finishes. The
va_high_addr_switch.sh broken will be resolved.
That's because scripts/decodecode doesn't preserve the alignment. No need
to modify it, it is enough to give only the "Code: (...)" part to this
script, and print the prefix without modifications.
With lines having a symbol to decode, the script was only trying to
preserve the alignment for the timestamps, but not the rest, nor when the
caller was set (CONFIG_PRINTK_CALLER=y).
A few patches slightly improving the output generated by
decode_stacktrace.sh.
This patch (of 3):
Lines having a symbol to decode might not always have info after this
symbol. It means ${info_str} might not be set, but it will always be
printed after a space, causing trailing whitespaces.
That's a detail, but when the output is opened with an editor marking
these trailing whitespaces, that's a bit disturbing. It is easy to remove
them by printing this variable with a space only if it is set.
While at it, do the same with ${module} and print everything in one line.
ptdesc: remove references to folios from __pagetable_ctor() and pagetable_dtor()
In preparation for splitting struct ptdesc from struct page and struct
folio, remove mentions of struct folio from these functions. Introduce
ptdesc_nr_pages() to avoid using lruvec_stat_add/sub_folio()
Link: https://lkml.kernel.org/r/20250908171104.2409217-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The first two patches here are preparation for splitting struct ptdesc
from struct page and struct folio. I think their only dependency is on
the memdesc_flags_t patches from August which is in mm-new. The third
patch is just something I noticed while working on the code.
This patch (of 3):
Use the new memdesc_flags_t type to show that these are the same bits as
page/folio/slab and thesefore have the zone/node/section information in
them. Remove a use of ptdesc_folio() by converting
pagetable_is_reserved() to use test_bit() directly.
Alice Ryhl [Tue, 2 Sep 2025 08:36:11 +0000 (08:36 +0000)]
maple_tree: remove lockdep_map_p typedef
Having the ma_external_lock field exist when CONFIG_LOCKDEP=n isn't used
anywhere, so just get rid of it. This also avoids generating a typedef
called lockdep_map_p that could overlap with typedefs in other header
files.
Stanislav Fort [Fri, 5 Sep 2025 09:38:51 +0000 (12:38 +0300)]
mm/memcg: v1: account event registrations and drop world-writable cgroup.event_control
In cgroup v1, the legacy cgroup.event_control file is world-writable and
allows unprivileged users to register unbounded events and thresholds.
Each registration allocates kernel memory without capping or memcg
charging, which can be abused to exhaust kernel memory in affected
configurations.
Make the following minimal changes:
- Account allocations with __GFP_ACCOUNT in event and threshold registration.
- Remove CFTYPE_WORLD_WRITABLE from cgroup.event_control to make it
owner-writable.
This does not affect cgroup v2. Allocations are still subject to kmem
accounting being enabled, but this reduces unbounded global growth.
Link: https://lkml.kernel.org/r/20250905093851.80596-1-disclosure@aisle.com Signed-off-by: Stanislav Fort <disclosure@aisle.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Tue, 16 Sep 2025 16:01:00 +0000 (00:01 +0800)]
mm, swap: use a single page for swap table when the size fits
We have a cluster size of 512 slots. Each slot consumes 8 bytes in swap
table so the swap table size of each cluster is exactly one page (4K).
If that condition is true, allocate one page direct and disable the slab
cache to reduce the memory usage of swap table and avoid fragmentation.
Link: https://lkml.kernel.org/r/20250916160100.31545-16-ryncsn@gmail.com Co-developed-by: Chris Li <chrisl@kernel.org> Signed-off-by: Chris Li <chrisl@kernel.org> Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Suggested-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Tue, 16 Sep 2025 16:00:59 +0000 (00:00 +0800)]
mm, swap: implement dynamic allocation of swap table
Now swap table is cluster based, which means free clusters can free its
table since no one should modify it.
There could be speculative readers, like swap cache look up, protect them
by making them RCU protected. All swap table should be filled with null
entries before free, so such readers will either see a NULL pointer or a
null filled table being lazy freed.
On allocation, allocate the table when a cluster is used by any order.
This way, we can reduce the memory usage of large swap device
significantly.
This idea to dynamically release unused swap cluster data was initially
suggested by Chris Li while proposing the cluster swap allocator and it
suits the swap table idea very well.
Link: https://lkml.kernel.org/r/20250916160100.31545-15-ryncsn@gmail.com Co-developed-by: Chris Li <chrisl@kernel.org> Signed-off-by: Chris Li <chrisl@kernel.org> Signed-off-by: Kairui Song <kasong@tencent.com> Suggested-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Tue, 16 Sep 2025 16:00:58 +0000 (00:00 +0800)]
mm, swap: remove contention workaround for swap cache
Swap cluster setup will try to shuffle the clusters on initialization. It
was helpful to avoid contention for the swap cache space. The cluster
size (2M) was much smaller than each swap cache space (64M), so shuffling
the cluster means the allocator will try to allocate swap slots that are
in different swap cache spaces for each CPU, reducing the chance of two
CPUs using the same swap cache space, and hence reducing the contention.
Now, swap cache is managed by swap clusters, this shuffle is pointless.
Just remove it, and clean up related macros.
This also improves the HDD swap performance as shuffling IO is a bad idea
for HDD, and now the shuffling is gone. Test have shown a ~40%
performance gain for HDD [1]:
Doing sequential swap in of 8G data using 8 processes with usemem, average
of 3 test runs:
Before: 1270.91 KB/s per process
After: 1849.54 KB/s per process
Link: https://lore.kernel.org/linux-mm/CAMgjq7AdauQ8=X0zeih2r21QoV=-WWj1hyBxLWRzq74n-C=-Ng@mail.gmail.com/ Link: https://lkml.kernel.org/r/20250916160100.31545-14-ryncsn@gmail.com Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202504241621.f27743ec-lkp@intel.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Tue, 16 Sep 2025 16:00:57 +0000 (00:00 +0800)]
mm, swap: mark swap address space ro and add context debug check
Swap cache is now backed by swap table, and the address space is not
holding any mutable data anymore. And swap cache is now protected by the
swap cluster lock, instead of the XArray lock. All access to swap cache
are wrapped by swap cache helpers. Locking is mostly handled internally
by swap cache helpers, only a few __swap_cache_* helpers require the
caller to lock the cluster by themselves.
Worth noting that, unlike XArray, the cluster lock is not IRQ safe. The
swap cache was very different compared to filemap, and now it's completely
separated from filemap. Nothing wants to mark or change anything or do a
writeback callback in IRQ.
So explicitly document this and add a debug check to avoid further
potential misuse. And mark the swap cache space as read-only to avoid any
user wrongly mixing unexpected filemap helpers with swap cache.
Link: https://lkml.kernel.org/r/20250916160100.31545-13-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Tue, 16 Sep 2025 16:00:56 +0000 (00:00 +0800)]
mm, swap: use the swap table for the swap cache and switch API
Introduce basic swap table infrastructures, which are now just a
fixed-sized flat array inside each swap cluster, with access wrappers.
Each cluster contains a swap table of 512 entries. Each table entry is an
opaque atomic long. It could be in 3 types: a shadow type (XA_VALUE), a
folio type (pointer), or NULL.
In this first step, it only supports storing a folio or shadow, and it is
a drop-in replacement for the current swap cache. Convert all swap cache
users to use the new sets of APIs. Chris Li has been suggesting using a
new infrastructure for swap cache for better performance, and that idea
combined well with the swap table as the new backing structure. Now the
lock contention range is reduced to 2M clusters, which is much smaller
than the 64M address_space. And we can also drop the multiple
address_space design.
All the internal works are done with swap_cache_get_* helpers. Swap cache
lookup is still lock-less like before, and the helper's contexts are same
with original swap cache helpers. They still require a pin on the swap
device to prevent the backing data from being freed.
Swap cache updates are now protected by the swap cluster lock instead of
the XArray lock. This is mostly handled internally, but new
__swap_cache_* helpers require the caller to lock the cluster. So, a few
new cluster access and locking helpers are also introduced.
A fully cluster-based unified swap table can be implemented on top of this
to take care of all count tracking and synchronization work, with dynamic
allocation. It should reduce the memory usage while making the
performance even better.
Link: https://lkml.kernel.org/r/20250916160100.31545-12-ryncsn@gmail.com Co-developed-by: Chris Li <chrisl@kernel.org> Signed-off-by: Chris Li <chrisl@kernel.org> Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Tue, 16 Sep 2025 16:00:55 +0000 (00:00 +0800)]
mm, swap: wrap swap cache replacement with a helper
There are currently three swap cache users that are trying to replace an
existing folio with a new one: huge memory splitting, migration, and shmem
replacement. What they are doing is quite similar.
Introduce a common helper for this. In later commits, this can be easily
switched to use the swap table by updating this helper.
The newly added helper also makes the swap cache API better defined, and
make debugging easier by adding a few more debug checks.
Migration and shmem replace are meant to clone the folio, including
content, swap entry value, and flags. And splitting will adjust each sub
folio's swap entry according to order, which could be non-uniform in the
future. So document it clearly that it's the caller's responsibility to
set up the new folio's swap entries and flags before calling the helper.
The helper will just follow the new folio's entry value.
This also prepares for replacing high-order folios in the swap cache.
Currently, only splitting to order 0 is allowed for swap cache folios.
Using the new helper, we can handle high-order folio splitting better.
Link: https://lkml.kernel.org/r/20250916160100.31545-11-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Chris Li <chrisl@kernel.org> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Tue, 16 Sep 2025 16:00:54 +0000 (00:00 +0800)]
mm/shmem, swap: remove redundant error handling for replacing folio
Shmem may replace a folio in the swap cache if the cached one doesn't fit
the swapin's GFP zone. When doing so, shmem has already double checked
that the swap cache folio is locked, still has the swap cache flag set,
and contains the wanted swap entry. So it is impossible to fail due to an
XArray mismatch. There is even a comment for that.
Delete the defensive error handling path, and add a WARN_ON instead: if
that happened, something has broken the basic principle of how the swap
cache works, we should catch and fix that.
Link: https://lkml.kernel.org/r/20250916160100.31545-10-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Tue, 16 Sep 2025 16:00:53 +0000 (00:00 +0800)]
mm, swap: cleanup swap cache API and add kerneldoc
In preparation for replacing the swap cache backend with the swap table,
clean up and add proper kernel doc for all swap cache APIs. Now all swap
cache APIs are well-defined with consistent names.
No feature change, only renaming and documenting.
Link: https://lkml.kernel.org/r/20250916160100.31545-9-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Tue, 16 Sep 2025 16:00:52 +0000 (00:00 +0800)]
mm, swap: tidy up swap device and cluster info helpers
swp_swap_info is the most commonly used helper for retrieving swap info.
It has an internal check that may lead to a NULL return value, but almost
none of its caller checks the return value, making the internal check
pointless. In fact, most of these callers already ensured the entry is
valid and never expect a NULL value.
Tidy this up and improve the function names. If the caller can make sure
the swap entry/type is valid and the device is pinned, use the new
introduced __swap_entry_to_info/__swap_type_to_info instead. They have
more debug sanity checks and lower overhead as they are inlined.
Callers that may expect a NULL value should use
swap_entry_to_info/swap_type_to_info instead.
No feature change. The rearranged codes should have had no effect, or
they should have been hitting NULL de-ref bugs already. Only some new
sanity checks are added so potential issues may show up in debug build.
The new helpers will be frequently used with swap table later when working
with swap cache folios. A locked swap cache folio ensures the entries are
valid and stable so these helpers are very helpful.
Link: https://lkml.kernel.org/r/20250916160100.31545-8-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Tue, 16 Sep 2025 16:00:51 +0000 (00:00 +0800)]
mm, swap: rename and move some swap cluster definition and helpers
No feature change, move cluster related definitions and helpers to
mm/swap.h, also tidy up and add a "swap_" prefix for cluster lock/unlock
helpers, so they can be used outside of swap files. And while at it, add
kerneldoc.
Link: https://lkml.kernel.org/r/20250916160100.31545-7-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chris Li <chrisl@kernel.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Tue, 16 Sep 2025 16:00:50 +0000 (00:00 +0800)]
mm, swap: always lock and check the swap cache folio before use
Swap cache lookup only increases the reference count of the returned
folio. That's not enough to ensure a folio is stable in the swap cache,
so the folio could be removed from the swap cache at any time. The caller
should always lock and check the folio before using it.
We have just documented this in kerneldoc, now introduce a helper for swap
cache folio verification with proper sanity checks.
Also, sanitize a few current users to use this convention and the new
helper for easier debugging. They were not having observable problems
yet, only trivial issues like wasted CPU cycles on swapoff or reclaiming.
They would fail in some other way, but it is still better to always follow
this convention to make things robust and make later commits easier to do.
Link: https://lkml.kernel.org/r/20250916160100.31545-6-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Tue, 16 Sep 2025 16:00:49 +0000 (00:00 +0800)]
mm, swap: check page poison flag after locking it
Instead of checking the poison flag only in the fast swap cache lookup
path, always check the poison flags after locking a swap cache folio.
There are two reasons to do so.
The folio is unstable and could be removed from the swap cache anytime, so
it's totally possible that the folio is no longer the backing folio of a
swap entry, and could be an irrelevant poisoned folio. We might
mistakenly kill a faulting process.
And it's totally possible or even common for the slow swap in path
(swapin_readahead) to bring in a cached folio. The cache folio could be
poisoned, too. Only checking the poison flag in the fast path will miss
such folios.
The race window is tiny, so it's very unlikely to happen, though. While
at it, also add a unlikely prefix.
Link: https://lkml.kernel.org/r/20250916160100.31545-5-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Tue, 16 Sep 2025 16:00:48 +0000 (00:00 +0800)]
mm, swap: fix swap cache index error when retrying reclaim
The allocator will reclaim cached slots while scanning. Currently, it
will try again if reclaim found a folio that is already removed from the
swap cache due to a race. But the following lookup will be using the
wrong index. It won't cause any OOB issue since the swap cache index is
truncated upon lookup, but it may lead to reclaiming of an irrelevant
folio.
This should not cause a measurable issue, but we should fix it.
Link: https://lkml.kernel.org/r/20250916160100.31545-4-ryncsn@gmail.com Fixes: fae859550531 ("mm, swap: avoid reclaiming irrelevant swap cache") Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Tue, 16 Sep 2025 16:00:47 +0000 (00:00 +0800)]
mm, swap: use unified helper for swap cache look up
The swap cache lookup helper swap_cache_get_folio currently does readahead
updates as well, so callers that are not doing swapin from any VMA or
mapping are forced to reuse filemap helpers instead, and have to access
the swap cache space directly.
So decouple readahead update with swap cache lookup. Move the readahead
update part into a standalone helper. Let the caller call the readahead
update helper if they do readahead. And convert all swap cache lookups to
use swap_cache_get_folio.
After this commit, there are only three special cases for accessing swap
cache space now: huge memory splitting, migration, and shmem replacing,
because they need to lock the XArray. The following commits will wrap
their accesses to the swap cache too, with special helpers.
And worth noting, currently dropbehind is not supported for anon folio,
and we will never see a dropbehind folio in swap cache. The unified
helper can be updated later to handle that.
While at it, add proper kernedoc for touched helpers.
No functional change.
Link: https://lkml.kernel.org/r/20250916160100.31545-3-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Chris Li [Tue, 16 Sep 2025 16:00:46 +0000 (00:00 +0800)]
docs/mm: add document for swap table
Patch series "mm, swap: introduce swap table as swap cache (phase I)", v4.
This is the first phase of the bigger series implementing basic
infrastructures for the Swap Table idea proposed at the LSF/MM/BPF topic
"Integrate swap cache, swap maps with swap allocator" [1]. To give credit
where it is due, this is based on Chris Li's idea and a prototype of using
cluster size atomic arrays to implement swap cache.
This phase I contains 15 patches, introduces the swap table infrastructure
and uses it as the swap cache backend. By doing so, we have up to ~5-20%
performance gain in throughput, RPS or build time for benchmark and
workload tests. The speed up is due to less contention on the swap cache
access and shallower swap cache lookup path. The cluster size is much
finer-grained than the 64M address space split, which is removed in this
phase I. It also unifies and cleans up the swap code base.
Each swap cluster will dynamically allocate the swap table, which is an
atomic array to cover every swap slot in the cluster. It replaces the
swap cache backed by XArray. In phase I, the static allocated swap_map
still co-exists with the swap table. The memory usage is about the same
as the original on average. A few exception test cases show about 1%
higher in memory usage. In the following phases of the series, swap_map
will merge into the swap table without additional memory allocation. It
will result in net memory reduction compared to the original swap cache.
Testing has shown that phase I has a significant performance improvement
from 8c/1G ARM machine to 48c96t/128G x86_64 servers in many practical
workloads.
The full picture with a summary can be found at [2]. An older bigger
series of 28 patches is posted at [3].
vm-scability test:
==================
Test with:
usemem --init-time -O -y -x -n 31 1G (4G memcg, PMEM as swap)
Before: After:
System time: 219.12s 158.16s (-27.82%)
Sum Throughput: 4767.13 MB/s 6128.59 MB/s (+28.55%)
Single process Throughput: 150.21 MB/s 196.52 MB/s (+30.83%)
Free latency: 175047.58 us 131411.87 us (-24.92%)
usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure,
PMEM as swap)
Before: After:
System time: 356.16s 284.68s (-20.06%)
Sum Throughput: 4648.35 MB/s 5453.52 MB/s (+17.32%)
Single process Throughput: 141.63 MB/s 168.35 MB/s (+18.86%)
Free latency: 499907.71 us 484977.03 us (-2.99%)
This shows an improvement of more than 20% improvement in most readings.
Build kernel test:
==================
The following result matrix is from building kernel with defconfig on
tmpfs with ZSWAP / ZRAM, using different memory pressure and setups.
Measuring sys and real time in seconds, less is better (user time is
almost identical as expected):
-j<NR> / Mem | Sys before / after | Real before / after
Using 16G ZRAM with memcg limit:
6 / 192M | 9686 / 9472 -2.21% | 2130 / 2096 -1.59%
12 / 256M | 6610 / 6451 -2.41% | 827 / 812 -1.81%
24 / 384M | 5938 / 5701 -3.37% | 414 / 405 -2.17%
48 / 768M | 4696 / 4409 -6.11% | 188 / 182 -3.19%
With 64k folio:
24 / 512M | 4222 / 4162 -1.42% | 326 / 321 -1.53%
48 / 1G | 3688 / 3622 -1.79% | 151 / 149 -1.32%
With ZSWAP with 3G memcg (using higher limit due to kmem account):
48 / 3G | 603 / 581 -3.65% | 81 / 80 -1.23%
Testing extremely high global memory and schedule pressure: Using ZSWAP
with 32G NVMEs in a 48c VM that has 4G memory, no memcg limit, system
components take up about 1.5G already, using make -j48 to build defconfig:
Before: sys time: 2069.53s real time: 135.76s
After: sys time: 2021.13s (-2.34%) real time: 134.23s (-1.12%)
On another 48c 4G memory VM, using 16G ZRAM as swap, testing make
-j48 with same config:
Before: sys time: 1756.96s real time: 111.01s
After: sys time: 1715.90s (-2.34%) real time: 109.51s (-1.35%)
All cases are more or less faster, and no regression even under extremely
heavy global memory pressure.
Redis / Valkey bench:
=====================
The test machine is a ARM64 VM with 1536M memory 12 cores, Redis is set to
use 2500M memory, and ZRAM swap size is set to 5G:
With BGSAVE enabled, most Redis memory will have a swap count > 1 so swap
cache is heavily in use. We can see a about 6% performance gain. No
BGSAVE is very slightly slower (<0.1%) due to the higher memory pressure
of the co-existence of swap_map and swap table. This will be optimzed
into a net gain and up to 20% gain in BGSAVE case in the following phases.
HDD swap is also ~40% faster with usemem because we removed an old
contention workaround.
GUP no longer uses get_dev_pagemap(). As it was the only user of the
get_dev_pagemap() pgmap caching feature it can be removed.
Link: https://lkml.kernel.org/r/20250903225926.34702-2-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Prior to commit aed877c2b425 ("device/dax: properly refcount device dax
pages when mapping") ZONE_DEVICE pages were not fully reference counted
when mapped into user page tables. Instead GUP would take a reference on
the associated pgmap to ensure the results of pfn_to_page() remained
valid.
This is no longer required and most of the code was removed by commit fd2825b0760a ("mm/gup: remove pXX_devmap usage from get_user_pages()").
Finish cleaning this up by removing the dead calls to put_dev_pagemap()
and the temporary context struct.
Link: https://lkml.kernel.org/r/20250903225926.34702-1-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wei Yang [Fri, 5 Sep 2025 14:03:58 +0000 (14:03 +0000)]
mm/page_alloc: check the correct buddy if it is a starting block
find_large_buddy() search buddy based on start_pfn, which maybe different
from page's pfn, e.g. when page is not pageblock aligned, because
prep_move_freepages_block() always align start_pfn to pageblock.
This means when we found a starting block at start_pfn, it may check on
the wrong page theoretically. And not split the free page as it is
supposed to, causing a freelist migratetype mismatch.
The good news is the page passed to __move_freepages_block_isolate() has
only two possible cases:
* page is pageblock aligned
* page is __first_valid_page() of this block
So it is safe for the first case, and it won't get a buddy larger than
pageblock for the second case.
To fix the issue, check the returned pfn of find_large_buddy() to decide
whether to split the free page:
1. if it is not a PageBuddy pfn, no split;
2. if it is a PageBuddy pfn but order <= pageblock_order, no split;
3. if it is a PageBuddy pfn with order > pageblock_order, start_pfn is
either in the starting block or tail block, split the PageBuddy at
pageblock_order level.
Link: https://lkml.kernel.org/r/20250905140358.28849-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Miaohe Lin [Thu, 4 Sep 2025 06:22:58 +0000 (14:22 +0800)]
mm/hwpoison: decouple hwpoison_filter from mm/memory-failure.c
mm/memory-failure.c defines and uses hwpoison_filter_* parameters but the
values of those parameters can only be modified via mm/hwpoison-inject.c
from userspace. They have a potentially different life time. Decouple
those parameters from mm/memory-failure.c to fix this broken layering.
Link: https://lkml.kernel.org/r/20250904062258.3336092-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>