Johannes Weiner [Mon, 2 Mar 2026 19:50:14 +0000 (14:50 -0500)]
mm: memcg: factor out trylock_stock() and unlock_stock()
Patch series "memcg: obj stock and slab stat caching cleanups".
This is a follow-up to `[PATCH] memcg: fix slab accounting in
refill_obj_stock() trylock path`. The way the slab stat cache and the
objcg charge cache interact appears a bit too fragile. This series
factors those paths apart as much as practical.
This patch (of 5):
Consolidate the local lock acquisition and the local stock lookup. This
allows subsequent patches to use !!stock as an easy way to disambiguate
the locked vs. contended cases through the callstack.
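
The pattern described above can be sketched in user space as follows. This is only an illustrative model, not the kernel code: `struct stock_model`, `the_stock`, `refill_model()` and `stock_demo()` are hypothetical names, an `atomic_flag` stands in for the local lock, and letting `unlock_stock()` accept NULL is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-CPU stock; not the kernel's struct. */
struct stock_model {
	atomic_flag lock;	/* models the local lock */
	unsigned int nr_cached;	/* models the cached charge */
};

static struct stock_model the_stock = { .lock = ATOMIC_FLAG_INIT };

/* Acquire the lock and look up the stock in one step; NULL means contended. */
static inline struct stock_model *trylock_stock(void)
{
	if (atomic_flag_test_and_set(&the_stock.lock))
		return NULL;
	return &the_stock;
}

/* Assumption for this sketch: unlocking a NULL stock is a no-op. */
static inline void unlock_stock(struct stock_model *stock)
{
	if (stock)
		atomic_flag_clear(&stock->lock);
}

/* A callee can tell locked from contended by testing the pointer alone. */
static inline int refill_model(struct stock_model *stock, unsigned int nr)
{
	if (!stock)
		return 0;	/* contended: the caller needs a fallback path */
	stock->nr_cached += nr;
	return 1;
}

/* Usage: the second trylock fails while the first holder has the lock. */
static inline void stock_demo(void)
{
	struct stock_model *first = trylock_stock();
	struct stock_model *second = trylock_stock();

	assert(first && !second);
	assert(refill_model(first, 4) == 1);
	assert(refill_model(second, 4) == 0);
	unlock_stock(second);
	unlock_stock(first);
}
```

Passing the pointer down the call stack means no separate "locked" flag has to travel alongside it.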
Link: https://lkml.kernel.org/r/20260302195305.620713-1-hannes@cmpxchg.org
Link: https://lkml.kernel.org/r/20260302195305.620713-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Reviewed-by: Hao Li <hao.li@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Fri, 6 Mar 2026 06:43:42 +0000 (14:43 +0800)]
arm64: mm: implement the architecture-specific test_and_clear_young_ptes()
Implement the Arm64 architecture-specific test_and_clear_young_ptes() to
enable batched checking of young flags, improving performance during large
folio reclamation when MGLRU is enabled.
While we're at it, simplify ptep_test_and_clear_young() by calling
test_and_clear_young_ptes(). Since callers guarantee that PTEs are
present before calling these functions, we can use pte_cont() to check the
CONT_PTE flag instead of pte_valid_cont().
Performance testing:
Enable MGLRU, then allocate 10G clean file-backed folios by mmap() in a
memory cgroup, and try to reclaim 8G file-backed folios via the
memory.reclaim interface. I observed a 60%+ performance improvement on
my Arm64 32-core server (and about 15% improvement on my X86 machine).
W/o patchset:
real 0m0.470s
user 0m0.000s
sys 0m0.470s
W/ patchset:
real 0m0.180s
user 0m0.001s
sys 0m0.179s
Link: https://lkml.kernel.org/r/7f891d42a720cc2e57862f3b79e4f774404f313c.1772778858.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Fri, 6 Mar 2026 06:43:41 +0000 (14:43 +0800)]
mm: support batched checking of the young flag for MGLRU
Use the batched helper test_and_clear_young_ptes_notify() to check and
clear the young flag to improve the performance during large folio
reclamation when MGLRU is enabled.
Meanwhile, we can also support batched checking of the young and dirty flags
when MGLRU walks the mm's pagetable to update the folios' generation
counter. Since MGLRU also checks the PTE dirty bit, use
folio_pte_batch_flags() with FPB_MERGE_YOUNG_DIRTY set to detect batches
of PTEs for a large folio.
Then we can remove the ptep_test_and_clear_young_notify() since it has no
users now.
Note that we also update the 'young' counter and 'mm_stats[MM_LEAF_YOUNG]'
counter with the batched count in the lru_gen_look_around() and
walk_pte_range(). However, the batched operations may inflate these two
counters, because in a large folio not all PTEs may have been accessed.
(Additionally, tracking how many PTEs have been accessed within a large
folio is not very meaningful, since the mm core actually tracks
access/dirty on a per-folio basis, not per page). The impact analysis is
as follows:
1. The 'mm_stats[MM_LEAF_YOUNG]' counter has no functional impact and
is mainly for debugging.
2. The 'young' counter is used to decide whether to place the current
PMD entry into the bloom filters by suitable_to_scan() (so that next
time we can check whether it has been accessed again), which may set
the hash bit in the bloom filters for a PMD entry that hasn't seen much
access. However, bloom filters inherently allow some error, so this
effect appears negligible.
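
The inflation discussed above can be illustrated with a small user-space model. It is a sketch under assumptions: the PTE is modelled as a plain `uint64_t`, `PTE_MODEL_YOUNG` is an arbitrary bit chosen for the example, and both counting functions are hypothetical rather than the MGLRU code.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_MODEL_YOUNG ((uint64_t)1 << 10)	/* arbitrary bit for the model */

/* Per-PTE accounting: count only the entries that were actually accessed. */
static inline unsigned int count_young_per_pte(const uint64_t *ptes,
					       unsigned int nr)
{
	unsigned int young = 0;

	for (unsigned int i = 0; i < nr; i++)
		if (ptes[i] & PTE_MODEL_YOUNG)
			young++;
	return young;
}

/* Batched accounting: if any entry was accessed, credit the whole batch. */
static inline unsigned int count_young_batched(const uint64_t *ptes,
					       unsigned int nr)
{
	for (unsigned int i = 0; i < nr; i++)
		if (ptes[i] & PTE_MODEL_YOUNG)
			return nr;
	return 0;
}

/* Usage: one accessed PTE out of four is reported as four by the batch. */
static inline void inflation_demo(void)
{
	uint64_t ptes[4] = { 0, PTE_MODEL_YOUNG, 0, 0 };

	assert(count_young_per_pte(ptes, 4) == 1);
	assert(count_young_batched(ptes, 4) == 4);
}
```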
Link: https://lkml.kernel.org/r/378f4acf7d07410aa7c2e4b49d56bb165918eb34.1772778858.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Fri, 6 Mar 2026 06:43:40 +0000 (14:43 +0800)]
mm: add a batched helper to clear the young flag for large folios
Currently, MGLRU calls ptep_test_and_clear_young_notify() to check and
clear the young flag for each PTE sequentially, which is inefficient for
large folio reclamation.
Moreover, on the Arm64 architecture, which supports contiguous PTEs, the
Arm64-specific ptep_test_and_clear_young() already implements an
optimization to clear the young flags for PTEs within a contiguous range.
However, this is not sufficient. Similar to the Arm64 specific
clear_flush_young_ptes(), we can extend this to perform batched operations
for the entire large folio (which might exceed the contiguous range:
CONT_PTE_SIZE).
Thus, we can introduce a new batched helper: test_and_clear_young_ptes()
and its wrapper test_and_clear_young_ptes_notify() which are consistent
with the existing functions, to perform batched checking of the young
flags for large folios, which can help improve performance during large
folio reclamation when MGLRU is enabled. And it will be overridden by the
architecture that implements a more efficient batch operation in the
following patches.
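
A generic fallback of the kind described above might look roughly like the following user-space model. It is a sketch, not the kernel implementation: `pte_model_t`, `PTE_MODEL_YOUNG` and both function names are assumptions, and the real helper takes a VMA and an address as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pte_model_t;			/* hypothetical PTE representation */
#define PTE_MODEL_YOUNG ((pte_model_t)1 << 10)	/* arbitrary bit for the model */

/* Single-entry operation: report the young bit and clear it. */
static inline bool test_and_clear_young_one(pte_model_t *ptep)
{
	bool young = (*ptep & PTE_MODEL_YOUNG) != 0;

	*ptep &= ~PTE_MODEL_YOUNG;
	return young;
}

/*
 * Batched operation: visit nr consecutive entries, clear every young bit,
 * and return true if at least one entry was young.
 */
static inline bool test_and_clear_young_batch(pte_model_t *ptep,
					      unsigned int nr)
{
	bool young = false;

	for (unsigned int i = 0; i < nr; i++)
		young |= test_and_clear_young_one(&ptep[i]);
	return young;
}

/* Usage: one young entry makes the batch young; a second pass finds none. */
static inline void batch_demo(void)
{
	pte_model_t ptes[4] = { 0, 0, PTE_MODEL_YOUNG, 0 };

	assert(test_and_clear_young_batch(ptes, 4));
	assert(!test_and_clear_young_batch(ptes, 4));
}
```

An architecture with contiguous PTEs can replace the loop with something cheaper, which is why the generic version is meant to be overridable.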
Link: https://lkml.kernel.org/r/23ec671bfcc06cd24ee0fbff8e329402742274a0.1772778858.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Fri, 6 Mar 2026 06:43:39 +0000 (14:43 +0800)]
mm: rmap: add a ZONE_DEVICE folio warning in folio_referenced()
folio_referenced() is used to test whether a folio was referenced
during reclaim. Moreover, ZONE_DEVICE folios are controlled by their
device driver, have a lifetime tied to that driver, and are never placed
on the LRU list. That means we should never try to reclaim ZONE_DEVICE
folios, so add a warning to catch this unexpected behavior in
folio_referenced() to avoid confusion, as discussed in the previous
thread[1].
[1] https://lore.kernel.org/all/16fb7985-ec0f-4b56-91e7-404c5114f899@kernel.org/
Link: https://lkml.kernel.org/r/64d6fb2a33f7101e1d4aca2c9052e0758b76d492.1772778858.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Xu <weixugc@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Fri, 6 Mar 2026 06:43:37 +0000 (14:43 +0800)]
mm: use inline helper functions instead of ugly macros
Patch series "support batched checking of the young flag for MGLRU", v3.
This is a follow-up to the previous work [1], to support batched checking
of the young flag for MGLRU.
Similarly, batched checking of young flag for large folios can improve
performance during large-folio reclamation when MGLRU is enabled. I
observed noticeable performance improvements (see patch 5) on an Arm64
machine that supports contiguous PTEs. All mm-selftests pass.
Patch 1 - 3: cleanup patches.
Patch 4: add a new generic batched PTE helper: test_and_clear_young_ptes().
Patch 5: support batched young flag checking for MGLRU.
Patch 6: implement the Arm64 arch-specific test_and_clear_young_ptes().
This patch (of 6):
People have already complained that these *_clear_young_notify() related
macros are very ugly, so let's use inline helpers to make them more
readable.
In addition, we cannot implement these inline helper functions in the
mmu_notifier.h file, because some arch-specific files will include the
mmu_notifier.h, which introduces header compilation dependencies and
causes build errors (e.g., arch/arm64/include/asm/tlbflush.h). Moreover,
since these functions are only used in the mm, implementing these inline
helpers in the mm/internal.h header seems reasonable.
Link: https://lkml.kernel.org/r/cover.1772778858.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/ea14af84e7967ccebb25082c28a8669d6da8fe57.1772778858.git.baolin.wang@linux.alibaba.com Link: https://lore.kernel.org/all/cover.1770645603.git.baolin.wang@linux.alibaba.com/ Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Alistair Popple <apopple@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: rename zap_vma_ptes() to zap_special_vma_range()
zap_vma_ptes() is the only zapping function we export to modules.
It's essentially a wrapper around zap_vma_range(), however, with some
safety checks:
* That the passed range fits fully into the VMA
* That it's only used for VM_PFNMAP
We will add support for VM_MIXEDMAP next, so use the more-generic term
"special vma", although "special" is a bit overloaded. Maybe we'll later
just support any VM_SPECIAL flag.
While at it, improve the kerneldoc.
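
The two safety checks listed above can be modelled in user space as below. This is a hedged sketch only: `struct vma_model`, `VM_PFNMAP_MODEL` and `special_range_ok()` are hypothetical names, and the flag value is arbitrary rather than the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

#define VM_PFNMAP_MODEL 0x1UL	/* arbitrary flag value for the model */

/* Hypothetical stand-in for the relevant vm_area_struct fields. */
struct vma_model {
	unsigned long vm_start;	/* inclusive */
	unsigned long vm_end;	/* exclusive */
	unsigned long vm_flags;
};

/* Return true only if [addr, addr + size) may be zapped on behalf of a module. */
static inline bool special_range_ok(const struct vma_model *vma,
				    unsigned long addr, unsigned long size)
{
	/* Check 1: the range must fit fully into the VMA (no wraparound). */
	if (addr < vma->vm_start || addr > vma->vm_end)
		return false;
	if (size > vma->vm_end - addr)
		return false;
	/* Check 2: only VM_PFNMAP mappings are accepted in this model. */
	return (vma->vm_flags & VM_PFNMAP_MODEL) != 0;
}

/* Usage: an in-range request passes, an overlong one is refused. */
static inline void special_range_demo(void)
{
	struct vma_model vma = { 0x1000, 0x5000, VM_PFNMAP_MODEL };

	assert(special_range_ok(&vma, 0x2000, 0x1000));
	assert(!special_range_ok(&vma, 0x4000, 0x2000));
}
```

Comparing `size` against `vm_end - addr` instead of computing `addr + size` keeps the check safe against arithmetic overflow.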
Link: https://lkml.kernel.org/r/20260227200848.114019-16-david@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Leon Romanovsky <leon@kernel.org> [drivers/infiniband] Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Arve <arve@android.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dave Airlie <airlied@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hartley Sweeten <hsweeten@visionengravers.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ian Abbott <abbotti@mev.co.uk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: 
Paolo Abeni <pabeni@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/memory: convert details->even_cows into details->skip_cows
The current semantics are confusing: merely specifying an empty
zap_details struct suddenly makes should_zap_cows() behave differently.
The default should be to also zap CoW'ed anonymous pages.
Really only unmap_mapping_pages() and friends want to skip zapping of
these anon folios.
So let's invert the meaning; turn the confusing "reclaim_pt" check that
overrides other properties in should_zap_cows() into a safety check.
Note that the only caller that sets reclaim_pt=true is
madvise_dontneed_single_vma(), which wants to zap any pages.
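
The inverted semantics can be sketched as follows. This is a user-space model under assumptions, not the patch itself: `struct zap_details_model` carries only the two fields mentioned above, and `assert()` stands in for whatever warning the kernel uses as the safety check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of the details struct discussed above. */
struct zap_details_model {
	bool skip_cows;		/* caller asks to leave CoW'ed anon pages alone */
	bool reclaim_pt;	/* caller wants empty page tables reclaimed */
};

static inline bool should_zap_cows_model(const struct zap_details_model *details)
{
	/* No details at all: zap everything. */
	if (!details)
		return true;
	/* Safety check (modelled): reclaim_pt callers must not skip CoW pages. */
	assert(!(details->reclaim_pt && details->skip_cows));
	/* An empty struct now behaves exactly like NULL. */
	return !details->skip_cows;
}

/* Usage: only an explicit skip_cows request changes the default. */
static inline void skip_cows_demo(void)
{
	struct zap_details_model empty = { false, false };
	struct zap_details_model skip = { .skip_cows = true };

	assert(should_zap_cows_model(NULL));
	assert(should_zap_cows_model(&empty));
	assert(!should_zap_cows_model(&skip));
}
```

With the flag inverted, a zero-initialised struct and a NULL pointer give the same answer, which is the property the commit message asks for.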
Link: https://lkml.kernel.org/r/20260227200848.114019-10-david@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Arve <arve@android.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dave Airlie <airlied@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hartley Sweeten <hsweeten@visionengravers.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ian Abbott <abbotti@mev.co.uk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni 
<pabeni@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/memory: move adjusting of address range to unmap_vmas()
__zap_vma_range() has two callers, whereby zap_page_range_single_batched()
documents that the range must fit into the VMA range.
So move adjusting the range to unmap_vmas() where it is actually required
and add a safety check in __zap_vma_range() instead. In unmap_vmas(),
we'd never expect to have empty ranges (otherwise, why have the vma in
there in the first place).
__zap_vma_range() will no longer be called with start == end, so clean up
the function a bit. While at it, simplify the overly long comment to its
core message.
We will no longer call uprobe_munmap() for start == end, which actually
seems to be the right thing to do.
Note that hugetlb_zap_begin()->...->adjust_range_if_pmd_sharing_possible()
cannot result in the range exceeding the vma range.
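
The division of labour described above, with clamping in the caller and a safety check in the callee, can be modelled like this. It is a sketch under assumptions: `struct vma_model`, `zap_range_checked()` and `unmap_one_vma()` are hypothetical, `assert()` stands in for the kernel's warning, and the functions return a byte count only so the result can be inspected.

```c
#include <assert.h>

/* Hypothetical stand-in for the VMA bounds. */
struct vma_model {
	unsigned long vm_start;	/* inclusive */
	unsigned long vm_end;	/* exclusive */
};

/* Callee: trusts the caller, but verifies a non-empty range inside the VMA. */
static inline unsigned long zap_range_checked(const struct vma_model *vma,
					      unsigned long start,
					      unsigned long end)
{
	assert(start < end);
	assert(start >= vma->vm_start && end <= vma->vm_end);
	return end - start;	/* number of bytes that would be zapped */
}

/* Caller: the only place that still clamps the request to the VMA. */
static inline unsigned long unmap_one_vma(const struct vma_model *vma,
					  unsigned long start,
					  unsigned long end)
{
	unsigned long s = start > vma->vm_start ? start : vma->vm_start;
	unsigned long e = end < vma->vm_end ? end : vma->vm_end;

	return zap_range_checked(vma, s, e);
}

/* Usage: a request overlapping the VMA on both sides is clamped to it. */
static inline void clamp_demo(void)
{
	struct vma_model vma = { 0x1000, 0x5000 };

	assert(unmap_one_vma(&vma, 0x0, 0x9000) == 0x4000);
}
```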
Link: https://lkml.kernel.org/r/20260227200848.114019-9-david@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Arve <arve@android.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dave Airlie <airlied@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hartley Sweeten <hsweeten@visionengravers.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ian Abbott <abbotti@mev.co.uk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni 
<pabeni@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/oom_kill: use MMU_NOTIFY_CLEAR in __oom_reap_task_mm()
In commit 7269f999934b ("mm/mmu_notifier: use correct mmu_notifier events
for each invalidation") we converted all MMU_NOTIFY_UNMAP to
MMU_NOTIFY_CLEAR, except the ones that actually perform munmap() or
mremap() as documented.
__oom_reap_task_mm() behaves much more like MADV_DONTNEED. So use
MMU_NOTIFY_CLEAR as well.
This is a preparation for further changes.
Link: https://lkml.kernel.org/r/20260227200848.114019-6-david@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Arve <arve@android.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dave Airlie <airlied@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hartley Sweeten <hsweeten@visionengravers.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ian Abbott <abbotti@mev.co.uk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni 
<pabeni@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/memory: inline unmap_mapping_range_vma() into unmap_mapping_range_tree()
Let's reduce the number of unmap-related functions that cause confusion by
inlining unmap_mapping_range_vma() into its single caller. The end result
looks pretty readable.
Link: https://lkml.kernel.org/r/20260227200848.114019-4-david@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Arve <arve@android.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dave Airlie <airlied@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hartley Sweeten <hsweeten@visionengravers.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ian Abbott <abbotti@mev.co.uk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni 
<pabeni@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/memory: remove "zap_details" parameter from zap_page_range_single()
Nobody except memory.c should really set that parameter to non-NULL. So
let's just drop it and make unmap_mapping_range_vma() use
zap_page_range_single_batched() instead.
[david@kernel.org: format on a single line] Link: https://lkml.kernel.org/r/8a27e9ac-2025-4724-a46d-0a7c90894ba7@kernel.org Link: https://lkml.kernel.org/r/20260227200848.114019-3-david@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: Puranjay Mohan <puranjay@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Arve <arve@android.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dave Airlie <airlied@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. 
Miller <davem@davemloft.net> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hartley Sweeten <hsweeten@visionengravers.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ian Abbott <abbotti@mev.co.uk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/madvise: drop range checks in madvise_free_single_vma()
Patch series "mm: cleanups around unmapping / zapping".
A bunch of cleanups around unmapping and zapping. Mostly simplifications,
code movements, documentation and renaming of zapping functions.
With this series, we'll have the following high-level zap/unmap functions
(excluding high-level folio zapping):
* unmap_vmas() for actual unmapping (vmas will go away)
* zap_vma(): zap all page table entries in a vma
* zap_vma_for_reaping(): zap_vma() that must not block
* zap_vma_range(): zap a range of page table entries
* zap_vma_range_batched(): zap_vma_range() with more options and batching
* zap_special_vma_range(): limited zap_vma_range() for modules
* __zap_vma_range(): internal helper
Patch #1 is not about unmapping/zapping, but I stumbled over it while
verifying MADV_DONTNEED range handling.
Patch #16 is related to [1], but makes sense even independent of that.
This patch (of 16):
madvise_vma_behavior()->madvise_dontneed_free()->madvise_free_single_vma()
is only called from madvise_walk_vmas():
(a) After try_vma_read_lock() confirmed that the whole range falls into
a single VMA (see is_vma_lock_sufficient()).
(b) After adjusting the range to the VMA in the loop afterwards.
madvise_dontneed_free() might drop the MM lock when handling userfaultfd,
but it properly looks up the VMA again to adjust the range.
So in madvise_free_single_vma(), the given range should always fall into a
single VMA and should also span at least one page.
Let's drop the error checks.
The code now matches what we do in madvise_dontneed_single_vma(), where we
call zap_vma_range_batched() that documents: "The range must fit into one
VMA.". Although that function still adjusts that range, we'll change that
soon.
Link: https://lkml.kernel.org/r/20260227200848.114019-1-david@kernel.org Link: https://lkml.kernel.org/r/20260227200848.114019-2-david@kernel.org Link: https://lore.kernel.org/r/aYSKyr7StGpGKNqW@google.com Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Arve <arve@android.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dave Airlie <airlied@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. 
Miller <davem@davemloft.net> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hartley Sweeten <hsweeten@visionengravers.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ian Abbott <abbotti@mev.co.uk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kasan: docs: SLUB is the only remaining slab implementation
We have only the SLUB implementation left in the kernel (referred to as
"slab"). Therefore, there is nothing special regarding KASAN modes when
it comes to the slab allocator anymore.
Drop the stale comment regarding differing SLUB vs. SLAB support.
Michal Hocko [Mon, 2 Mar 2026 11:47:40 +0000 (12:47 +0100)]
vmalloc: support __GFP_RETRY_MAYFAIL and __GFP_NORETRY
__GFP_RETRY_MAYFAIL and __GFP_NORETRY haven't been supported so far
because their semantics (i.e. not triggering the OOM killer) cannot be
provided by the existing vmalloc page table allocation, which allows the
OOM killer to be invoked.
There are use cases for these modifiers when a large allocation request
should rather fail than trigger the OOM killer, which wouldn't be able to
handle the situation anyway [1].
While we cannot easily change the existing page table allocation code, we
can piggyback on the scoped NOWAIT allocation that we already have in
place. The rationale is that the bulk of the consumed memory sits in the
pages backing the vmalloc allocation. Page tables account for only a tiny
fraction. Moreover, page tables for virtually allocated areas are never
reclaimed, so the longer the system runs, the less likely new page table
allocations are. It makes sense to allow an approximation of
__GFP_RETRY_MAYFAIL and __GFP_NORETRY even if the page table allocation
part is much weaker. This doesn't break the failure mode while it allows
for the no-OOM semantics.
mm/vmalloc: fix incorrect size reporting on allocation failure
When __vmalloc_area_node() fails to allocate pages, the failure message
may report an incorrect allocation size, for example:
vmalloc error: size 0, failed to allocate pages, ...
This happens because the warning prints area->nr_pages * PAGE_SIZE. At
this point, area->nr_pages may be zero or only partly populated, so it is
not a valid measure of the request.
Report the originally requested allocation size instead by using
nr_small_pages * PAGE_SIZE, which reflects the number of pages actually
requested by the user.
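The difference between the two size computations can be sketched in userspace C as follows. This is only an illustration: struct vm_area_sketch, PAGE_SIZE_SKETCH and both helper names are hypothetical and are not the kernel's definitions.

```c
/* Hypothetical 4K page size, for illustration only. */
#define PAGE_SIZE_SKETCH 4096UL

/* Hypothetical stand-in for the area descriptor; only the field we need. */
struct vm_area_sketch {
    unsigned long nr_pages; /* pages populated so far, may still be 0 */
};

/* Old report: depends on how far the allocation got before failing. */
static unsigned long reported_size_old(const struct vm_area_sketch *area)
{
    return area->nr_pages * PAGE_SIZE_SKETCH;
}

/* New report: depends only on what the caller asked for. */
static unsigned long reported_size_new(unsigned long nr_small_pages)
{
    return nr_small_pages * PAGE_SIZE_SKETCH;
}
```

With an area that failed before any page was populated, the old helper yields 0 while the new one still yields the requested size.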
Link: https://lkml.kernel.org/r/20260302114740.2668450-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jane Chu [Mon, 2 Mar 2026 20:10:15 +0000 (13:10 -0700)]
Documentation: fix a hugetlbfs reservation statement
Documentation/mm/hugetlbfs_reserv.rst has
    if (resv_needed <= (resv_huge_pages - free_huge_pages))
        resv_huge_pages += resv_needed;
which describes this code in gather_surplus_pages()
    needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
    if (needed <= 0) {
        h->resv_huge_pages += delta;
        return 0;
    }
which means that if there are enough free hugepages to account for the new
reservation, the global reservation count is simply updated without
further action.
But the description is backwards; it should be
    if (resv_needed <= (free_huge_pages - resv_huge_pages))
instead.
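A small userspace check makes the equivalence concrete: needed <= 0 rearranges to delta <= free_huge_pages - resv_huge_pages. The three helpers below are hypothetical names written for this illustration; they only mirror the arithmetic quoted above.

```c
#include <stdbool.h>

/* Mirrors the kernel arithmetic: needed = (resv + delta) - free. */
static bool enough_free_code(long resv, long free_pages, long delta)
{
    long needed = (resv + delta) - free_pages;

    return needed <= 0;
}

/* The old documentation wording, with the operands swapped. */
static bool enough_free_doc_old(long resv, long free_pages, long delta)
{
    return delta <= (resv - free_pages);
}

/* The corrected documentation wording. */
static bool enough_free_doc_fixed(long resv, long free_pages, long delta)
{
    return delta <= (free_pages - resv);
}
```

For resv = 2, free = 10 and delta = 3, the code and the corrected wording both report that enough pages are free, while the old wording does not.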
Link: https://lkml.kernel.org/r/20260302201015.1824798-1-jane.chu@oracle.com Fixes: 70bc0dc578b3 ("Documentation: vm, add hugetlbfs reservation overview") Signed-off-by: Jane Chu <jane.chu@oracle.com> Cc: David Hildenbrand <david@kernel.org> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Gladyshev Ilya [Sun, 1 Mar 2026 13:19:39 +0000 (13:19 +0000)]
mm: make ref_unless functions unless_zero only
There are no users of (folio/page)_ref_add_unless(page, nr, u) with u != 0
[1] and all current users are internal to the page refcounting API. This
allows us to safely drop this parameter and reduce the function semantics
to the "unless zero" case only.
If needed, these functions for the u!=0 cases can be trivially
reintroduced later using the same atomic_add_unless operations as before.
[1]: The last user was dropped in v5.18 kernel, commit 27674ef6c73f ("mm:
remove the extra ZONE_DEVICE struct page refcount"). There is no trace of
discussion as to why this cleanup wasn't done earlier.
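For readers unfamiliar with the pattern, the "unless zero" semantics can be sketched with C11 atomics. This is a userspace illustration under the assumption of a plain atomic_int counter; ref_add_unless_zero_sketch() is a hypothetical name, not the kernel helper, which is built on atomic_add_unless().

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Add nr to *ref unless *ref is currently zero.
 * Returns true if the addition happened, false if the count was zero.
 */
static bool ref_add_unless_zero_sketch(atomic_int *ref, int nr)
{
    int old = atomic_load(ref);

    do {
        if (old == 0)
            return false; /* object is already on its way out */
        /* On failure, old is reloaded with the current value. */
    } while (!atomic_compare_exchange_weak(ref, &old, old + nr));

    return true;
}
```

A counter at 1 becomes 3 after adding 2, while a counter at 0 is left untouched and the call reports failure.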
Link: https://lkml.kernel.org/r/a0c89b49d38c671a0bdd35069d15ee13e08314d2.1772370066.git.gladyshev.ilya1@h-partners.com Co-developed-by: Gorbunov Ivan <gorbunov.ivan@h-partners.com> Signed-off-by: Gorbunov Ivan <gorbunov.ivan@h-partners.com> Signed-off-by: Gladyshev Ilya <gladyshev.ilya1@h-partners.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Kiryl Shutsemau <kas@kernel.org> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Fri, 27 Feb 2026 17:07:59 +0000 (18:07 +0100)]
mm/page_alloc: remove IRQ saving/restoring from pcp locking
Effectively revert commit 038a102535eb ("mm/page_alloc: prevent pcp
corruption with SMP=n"). The original problem is now avoided by
pcp_spin_trylock() always failing on CONFIG_SMP=n, so we do not need to
disable IRQs anymore.
It's not a complete revert, because keeping the pcp_spin_(un)lock()
wrappers is useful. Rename them from _maybe_irqsave/restore to _nopin.
The difference from pcp_spin_trylock()/pcp_spin_unlock() is that the
_nopin variants don't perform pcpu_task_pin/unpin().
Link: https://lkml.kernel.org/r/20260227-b4-pcp-locking-cleanup-v1-2-f7e22e603447@kernel.org Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Fri, 27 Feb 2026 17:07:58 +0000 (18:07 +0100)]
mm/page_alloc: effectively disable pcp with CONFIG_SMP=n
Patch series "mm/page_alloc: pcp locking cleanup".
This is a followup to the hotfix 038a102535eb ("mm/page_alloc: prevent pcp
corruption with SMP=n"), to simplify the code and deal with the original
issue properly. The previous RFC attempt [1] argued for changing the UP
spinlock implementation, which was discouraged, but thanks to David's
off-list suggestion, we can achieve the goal without changing the spinlock
implementation.
The main change in Patch 1 relies on the fact that on UP we don't need the
pcp lists for scalability, so just make them always bypassed during
alloc/free by making the pcp trylock an unconditional failure.
The various drain paths that use pcp_spin_lock_maybe_irqsave() continue to
exist but will never do any work in practice. In Patch 2 we can again
remove the irq saving from them that commit 038a102535eb added.
Besides simpler code with all the ugly UP_flags removed, we get less bloat
with CONFIG_SMP=n for mm/page_alloc.o as a result:
The page allocator has been using a locking scheme for its percpu page
caches (pcp) based on spin_trylock() with no _irqsave() part. The trick
is that if we interrupt the locked section, we fail the trylock and just
fall back to the slowpath taking the zone lock. That's more expensive, but
rare, so we don't need to pay the irqsave/restore cost all the time in the
fastpaths.
It's similar to but not exactly local_trylock_t (which is also newer
anyway) because in some cases we do lock the pcp of a non-local cpu to
drain it, in a way that's cheaper than using IPI or queue_work_on().
The complication of this scheme has been the UP non-debug spinlock
implementation, which assumes spin_trylock() can't fail on UP and has no
state to track whether it's locked. It just doesn't anticipate this usage
scenario. So to work around that we disable IRQs only on UP, complicating
the implementation. Also, we recently found a years-old bug where we
didn't disable IRQs in related paths - see 038a102535eb ("mm/page_alloc:
prevent pcp corruption with SMP=n").
We can avoid this UP complication by realizing that we do not need the pcp
caching for scalability on UP in the first place. Removing it completely
with #ifdefs is not worth the trouble either. Just make
pcp_spin_trylock() return NULL unconditionally on CONFIG_SMP=n. This
makes the slowpaths unconditional, and we can remove the IRQ save/restore
handling in pcp_spin_trylock()/unlock() completely.
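The idea can be sketched in userspace C. Everything here is hypothetical and simplified: struct pcp_sketch, the smp flag (standing in for CONFIG_SMP) and both function names are assumptions for illustration, not the kernel implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-cpu page cache. */
struct pcp_sketch {
    bool locked;
    unsigned int count; /* pages cached on the pcp list */
};

/*
 * Trylock that fails unconditionally when smp is false, mirroring the
 * "return NULL on CONFIG_SMP=n" idea described above.
 */
static struct pcp_sketch *pcp_trylock_sketch(struct pcp_sketch *pcp, bool smp)
{
    if (!smp)
        return NULL; /* UP: always take the slowpath */
    if (pcp->locked)
        return NULL; /* contended or interrupted a locked section */
    pcp->locked = true;
    return pcp;
}

/* Returns true if the page went to the pcp cache, false for the slowpath. */
static bool free_page_sketch(struct pcp_sketch *pcp, bool smp)
{
    struct pcp_sketch *locked = pcp_trylock_sketch(pcp, smp);

    if (!locked)
        return false; /* caller would take the zone lock instead */
    locked->count++;
    locked->locked = false;
    return true;
}
```

Because the UP case never enters the locked section, there is no locked state for an interrupt to corrupt, which is why the IRQ handling becomes unnecessary.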
SeongJae Park [Sat, 28 Feb 2026 22:28:26 +0000 (14:28 -0800)]
mm/damon/vaddr: do not split regions for min_nr_regions
The previous commit made DAMON core split regions at the beginning for
min_nr_regions. The virtual address space operation set (vaddr) does
similar work on its own, for the case where the user delegates the entire
initial monitoring regions setup to vaddr. That is unnecessary now, as
DAMON core does similar work in every case. Remove the duplicated work in
vaddr. Also, remove a helper function that was used only for that work,
and the test code of the helper function.
Link: https://lkml.kernel.org/r/20260228222831.7232-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sat, 28 Feb 2026 22:28:25 +0000 (14:28 -0800)]
mm/damon/core: split regions for min_nr_regions
Patch series "mm/damon: strictly respect min_nr_regions".
DAMON core respects min_nr_regions only at merge operation. DAMON API
callers are therefore responsible for respecting or ignoring it. Only the
vaddr ops respects it, and only at the initial start time. The DAMON sysfs
interface allows users to set up the initial regions, which DAMON core also
respects. But, again, that works only at the initial time. Setting up the
regions for min_nr_regions can be difficult and inefficient for users when
the min_nr_regions value is high. There was actually a report [1] from a
user. The use case was page granular access monitoring with a large
aggregation interval.
Make the following three changes to resolve the issue. First (patch
1), make DAMON core split regions at the beginning and every aggregation
interval, to respect the min_nr_regions. Second (patch 2), drop the
vaddr's split operations and related code that are no longer needed. Third
(patch 3), add a kunit test for the newly introduced function.
This patch (of 3):
DAMON core layer respects the min_nr_regions parameter by setting the
maximum size of each region as total monitoring region size divided by the
parameter value. And the limit is applied by preventing merge of regions
that result in a region larger than the maximum size. The limit is
updated per ops update interval, because vaddr updates the monitoring
regions on the ops update callback.
It does nothing for the beginning state. That's because the users can set
the initial monitoring regions as they want. That is, if the users really
care about the min_nr_regions, they are supposed to set the initial
monitoring regions to have more than min_nr_regions regions. The virtual
address space operation set, vaddr, has an exceptional case. Users can
ask the ops set to configure the initial regions on its own. For the
case, vaddr sets up the initial regions to meet the min_nr_regions. So,
vaddr has exceptional support, but basically users are required to set the
regions on their own if they want min_nr_regions to be respected.
When 'min_nr_regions' is high, such initial setup is difficult. If DAMON
sysfs interface is used for that, the memory for saving the initial setup
is also a waste.
Even if the user forgoes the setup, DAMON will eventually make more than
min_nr_regions regions by splitting operations. But it will take time.
If the aggregation interval is long, the delay could be problematic.
There was actually a report [1] of the case. The reporter wanted to do
page granular monitoring with a large aggregation interval.
Also, DAMON is doing nothing for online changes on monitoring regions and
min_nr_regions. For example, the user can remove a monitoring region or
increase min_nr_regions while DAMON is running.
Split regions larger than the size at the beginning of the kdamond main
loop, to fix the initial setup issue. Also do the split every aggregation
interval, for online changes. This means the behavior is slightly
changed. It is difficult to imagine a use case that actually depends on
the old behavior, though. So this change is arguably fine.
Note that the size limit is aligned by damon_ctx->min_region_sz and cannot
be zero. That is, if min_nr_regions is larger than the total size of
monitoring regions divided by ->min_region_sz, it cannot be respected.
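The size-limit arithmetic can be sketched as below. This is a simplified userspace illustration, not the DAMON code: both function names are hypothetical, min_nr_regions and min_region_sz are assumed to be non-zero, and aligning down is an assumption since the text only says the limit is aligned.

```c
/*
 * Maximum region size that keeps at least min_nr_regions regions:
 * total size divided by min_nr_regions, aligned down to min_region_sz
 * (assumed direction), and never zero.
 */
static unsigned long max_region_sz_sketch(unsigned long total_sz,
                                          unsigned long min_nr_regions,
                                          unsigned long min_region_sz)
{
    unsigned long sz = total_sz / min_nr_regions;

    sz -= sz % min_region_sz; /* align down */
    if (sz == 0)
        sz = min_region_sz;   /* the limit cannot be zero */
    return sz;
}

/* Number of pieces a region is split into so that none exceeds max_sz. */
static unsigned long nr_pieces_sketch(unsigned long region_sz,
                                      unsigned long max_sz)
{
    return (region_sz + max_sz - 1) / max_sz;
}
```

For a 1 MiB region with min_nr_regions of 10 and a 4 KiB minimum region size, the limit is 102400 bytes and the region is split into 11 pieces. For an 8 KiB region with min_nr_regions of 100, the limit is clamped to 4 KiB, so only 2 pieces are possible and the requested minimum cannot be respected.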
kasan_free_pxd() assumes the page table is always struct page aligned.
But that's not the case on all architectures. E.g., in the case of
powerpc with 64K page size, the PUD table (of size 4096) comes from the
slab cache named pgtable-2^9. Hence, instead of page_to_virt(pxd_page()),
let's just directly pass the start of the pxd table, which is passed as
the 1st argument.
This fixes the below double free kasan issue seen with PMEM:
radix-mmu: Mapped 0x0000047d10000000-0x0000047f90000000 with 2.00 MiB pages
==================================================================
BUG: KASAN: double-free in kasan_remove_zero_shadow+0x9c4/0xa20
Free of addr c0000003c38e0000 by task ndctl/2164
The buggy address belongs to the object at c0000003c38e0000
which belongs to the cache pgtable-2^9 of size 4096
The buggy address is located 0 bytes inside of
4096-byte region [c0000003c38e0000, c0000003c38e1000)
[ 138.953636] [ T2164] Memory state around the buggy address:
[ 138.953643] [ T2164] c0000003c38dff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 138.953652] [ T2164] c0000003c38dff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 138.953661] [ T2164] >c0000003c38e0000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 138.953669] [ T2164] ^
[ 138.953675] [ T2164] c0000003c38e0080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 138.953684] [ T2164] c0000003c38e0100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 138.953692] [ T2164] ==================================================================
[ 138.953701] [ T2164] Disabling lock debugging due to kernel taint
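Why rounding to the page start is a problem can be illustrated with plain address arithmetic. The sketch below assumes a 64K page and 4096-byte tables as in the report; the constants and page_start_sketch() are hypothetical and only imitate the effect of going through the struct page.

```c
#include <stdint.h>

/* Assumed sizes from the report: 64K pages, 4096-byte PUD tables. */
#define PAGE_SIZE_SKETCH      0x10000UL
#define PUD_TABLE_SIZE_SKETCH 0x1000UL

/*
 * Imitates deriving the address from the struct page: the result is the
 * start of the containing page, not the start of the table.
 */
static uintptr_t page_start_sketch(uintptr_t table)
{
    return table & ~(uintptr_t)(PAGE_SIZE_SKETCH - 1);
}
```

Two distinct tables carved out of the same 64K page round to the same address, so freeing "the page" for each of them would free the same address twice, while passing the table start keeps them distinct.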
Link: https://lkml.kernel.org/r/2f9135c7866c6e0d06e960993b8a5674a9ebc7ec.1771938394.git.ritesh.list@gmail.com Fixes: 0207df4fa1a8 ("kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN") Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Replace READ_ONCE() with the existing standard page table accessor for PUD
aka pudp_get() in pud_trans_unstable(). This does not create any
functional change for platforms that do not override pudp_get(), which
still defaults to READ_ONCE().
Link: https://lkml.kernel.org/r/20260227040300.2091901-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/debug_vm_pgtable: replace WRITE_ONCE() with pxd_clear()
Replace WRITE_ONCE() with the generic pxd_clear() to clear out the page
table entries as required. This does not cause any functional change
either.
Link: https://lkml.kernel.org/r/20260227061204.2215395-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Suggested-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: SeongJae Park <sj@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Proof: Both loops in hpage_collapse_scan_file and collapse_file, which
iterate on the xarray, have the invariant that start <= folio->index <
start + HPAGE_PMD_NR ... (i)
A folio is always naturally aligned in the pagecache, therefore
folio_order == HPAGE_PMD_ORDER => IS_ALIGNED(folio->index, HPAGE_PMD_NR) == true ... (ii)
thp_vma_allowable_order -> thp_vma_suitable_order requires that the virtual
offsets in the VMA are aligned to the order,
=> IS_ALIGNED(start, HPAGE_PMD_NR) == true ... (iii)
Combining (i), (ii) and (iii), the claim is proven.
Therefore, remove this check.
While at it, simplify the comments.
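The three facts (i), (ii) and (iii) can also be checked mechanically: an aligned window of HPAGE_PMD_NR indices contains exactly one aligned index, which is its start. The brute-force sketch below assumes HPAGE_PMD_NR is 512 (a common value, not stated above) and uses hypothetical helper names.

```c
#include <stdbool.h>

/* Assumed value for illustration; the real constant is arch-dependent. */
#define HPAGE_PMD_NR_SKETCH 512UL

static bool is_aligned_sketch(unsigned long x, unsigned long a)
{
    return (x & (a - 1)) == 0; /* a must be a power of two */
}

/*
 * Counts the aligned indices in [start, start + HPAGE_PMD_NR_SKETCH)
 * and stores the last one found in *found.
 */
static unsigned long aligned_in_window_sketch(unsigned long start,
                                              unsigned long *found)
{
    unsigned long n = 0, index;

    for (index = start; index < start + HPAGE_PMD_NR_SKETCH; index++) {
        if (is_aligned_sketch(index, HPAGE_PMD_NR_SKETCH)) {
            *found = index;
            n++;
        }
    }
    return n;
}
```

For an aligned start such as 1024, the count is 1 and the only aligned index is 1024 itself.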
Link: https://lkml.kernel.org/r/20260227143501.1488110-1-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Fri, 27 Feb 2026 17:06:21 +0000 (09:06 -0800)]
mm/damon/core: do non-safe region walk on kdamond_apply_schemes()
kdamond_apply_schemes() is using damon_for_each_region_safe(), which is
safe for deallocation of the region inside the loop. However, the loop
internal logic does not deallocate regions. Hence it is only wasting the
next pointer. Also, it causes a problem.
When an address filter is applied, and there is a region that intersects
with the filter, the filter splits the region on the filter boundary. The
intention is to let DAMOS apply action to only filtered-in address ranges.
However, it is using damon_for_each_region_safe(), which sets the next
region before the execution of the iteration. Hence, the region that was
newly created by the split, and is now next to the current region, is
simply skipped. As a result, DAMOS applies the action to target regions a
bit slower than expected when the address filter is used. This shouldn't
be a big problem, but it is definitely better fixed.
damos_skip_charged_region() was working around the issue using a double
pointer hack.
Use damon_for_each_region(), which is safe for this use case. And drop
the workaround in damos_skip_charged_region().
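The skipped-region effect can be reproduced with a minimal singly linked list. This is a userspace sketch with hypothetical names (struct region_sketch, walk_safe_sketch(), walk_plain_sketch()); it does not use the DAMON macros, it only imitates how they pick the next element.

```c
#include <stddef.h>

struct region_sketch {
    int id;
    struct region_sketch *next;
};

/* "Split" the head once by inserting extra right after it. */
static void split_after(struct region_sketch *r, struct region_sketch *extra)
{
    extra->next = r->next;
    r->next = extra;
}

/* "Safe" walk: next is cached before the loop body runs. */
static int walk_safe_sketch(struct region_sketch *head,
                            struct region_sketch *extra)
{
    struct region_sketch *r, *next;
    int visited = 0;

    for (r = head; r; r = next) {
        next = r->next;
        if (r == head)
            split_after(r, extra); /* new node is never visited */
        visited++;
    }
    return visited;
}

/* Plain walk: next is read after the loop body runs. */
static int walk_plain_sketch(struct region_sketch *head,
                             struct region_sketch *extra)
{
    struct region_sketch *r;
    int visited = 0;

    for (r = head; r; r = r->next) {
        if (r == head)
            split_after(r, extra); /* new node is visited next */
        visited++;
    }
    return visited;
}
```

Starting from a two-node list, the safe walk visits two nodes and misses the one inserted by the split, while the plain walk visits all three.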
SeongJae Park [Fri, 27 Feb 2026 17:06:20 +0000 (09:06 -0800)]
mm/damon/core: set quota-score histogram with core filters
Patch series "mm/damon/core: improve DAMOS quota efficiency for core layer
filters".
Improve the two problematic DAMOS behaviors below, which make it less
efficient when core layer filters are used.
DAMOS generates the under-quota regions prioritization-purpose access
temperature histogram [1] with only the scheme target access pattern. The
DAMOS filters are ignored on the histogram, and this can result in the
scheme not being applied to eligible regions. For working around this,
users had to use separate DAMON contexts. The memory tiering approaches
are such examples.
DAMOS splits regions that intersect with address filters, so that only
the filtered-out part of the region is skipped. But the implementation
also skips the other part of the region, which is not filtered out. As a
result, DAMOS can work slower than expected.
Improve the two inefficient behaviors with two patches, respectively.
Read the patches for more details about the problem and how those are
fixed.
This patch (of 2):
The histogram for under-quota region prioritization [1] is made for all
regions that are eligible for the DAMOS target access pattern. When there
are DAMOS filters, the prioritization-threshold access temperature that
is generated from the histogram could be inaccurate.
For example, suppose there are three regions. Each region is 1 GiB. The
access temperatures of the regions are 100, 50, and 0. And a DAMOS scheme
that targets _any_ access temperature with quota 2 GiB is being used. The
histogram will look like below:
    temperature     size of regions having >=temperature temperature
    0               3 GiB
    50              2 GiB
    100             1 GiB
Based on the histogram and the quota (2 GiB), DAMOS applies the action to
only the regions having >=50 temperature. This is all good.
Let's suppose the region of temperature 50 is excluded by a DAMOS filter.
Regardless of the filter, DAMOS will try to apply the action on only
regions having >=50 temperature. Because the region of temperature 50 is
filtered out, the action is applied to only the region of temperature 100.
Worse yet, suppose the filter is excluding regions of temperature 50 and
100. Then no action is really applied to any region, while the region of
temperature 0 is there.
People used to work around this by utilizing multiple contexts, instead of
the core layer DAMOS filters. For example, DAMON-based memory tiering
approaches including the quota auto-tuning based one [2] are using a DAMON
context per NUMA node. If the above explained issue is effectively
alleviated, those can be configured again to run with single context and
DAMOS filters for applying the promotion and demotion to only specific
NUMA nodes.
Alleviate the problem by checking core DAMOS filters when generating the
histogram. The reason to check only core filters is the overhead. While
core filters are usually for coarse-grained filtering (e.g.,
target/address filters for process, NUMA, zone level filtering), operation
layer filters are usually for fine-grained filtering (e.g., for anon
page). Doing this for operation layer filters would cause significant
overhead. There is no known use case that is affected by the operation
layer filters-distorted histogram problem, though. Do this for only core
filters for now. We will revisit this for operation layer filters in
the future. We might be able to apply a sort of sampling based operation
layer filtering.
After this fix is applied, for the first case that there is a DAMOS filter
excluding the region of temperature 50, the histogram will be like below:
    temperature     size of regions having >=temperature temperature
    0               2 GiB
    100             1 GiB
And DAMOS will set the temperature threshold as 0, allowing the action to
be applied to both regions of temperatures 0 and 100.
For the second case that there is a DAMOS filter excluding the regions of
temperature 50 and 100, the histogram will be like below:
    temperature     size of regions having >=temperature temperature
    0               1 GiB
And DAMOS will set the temperature threshold as 0, allowing the action to
be applied to the region of temperature 0.
[1] 'Prioritization' section of Documentation/mm/damon/design.rst
[2] commit 0e1c773b501f ("mm/damon/core: introduce damos quota goal
metrics for memory node utilization")
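The worked example above can be reproduced with a small userspace sketch. It is a simplification under stated assumptions, not the DAMOS implementation: temperatures are assumed to lie in 0..100, sizes are counted in whole GiB, and struct region_sketch and threshold_sketch() are hypothetical names.

```c
#define MAX_TEMP_SKETCH 100

struct region_sketch {
    unsigned int temperature;  /* assumed to be <= MAX_TEMP_SKETCH */
    unsigned long sz_gib;
    int filtered_out;          /* excluded by a core layer filter */
};

/*
 * Build the temperature histogram and walk it from hot to cold until
 * the accumulated size reaches the quota. The temperature at which the
 * walk stops is the prioritization threshold.
 */
static unsigned int threshold_sketch(const struct region_sketch *regions,
                                     int nr, unsigned long quota_gib,
                                     int honor_filters)
{
    unsigned long hist[MAX_TEMP_SKETCH + 1] = { 0 };
    unsigned long cumulated = 0;
    unsigned int t;
    int i;

    for (i = 0; i < nr; i++) {
        if (honor_filters && regions[i].filtered_out)
            continue; /* the fix: filtered regions do not count */
        hist[regions[i].temperature] += regions[i].sz_gib;
    }
    for (t = MAX_TEMP_SKETCH; ; t--) {
        cumulated += hist[t];
        if (cumulated >= quota_gib || t == 0)
            break;
    }
    return t;
}
```

With the three 1 GiB regions of temperatures 100, 50 and 0, a 2 GiB quota, and the temperature-50 region filtered out, ignoring the filter gives a threshold of 50, while honoring it gives 0.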
Kiryl Shutsemau [Fri, 27 Feb 2026 19:42:56 +0000 (19:42 +0000)]
mm/slab: use compound_head() in page_slab()
page_slab() contained an open-coded implementation of compound_head().
Replace the duplicated code with a direct call to compound_head().
Link: https://lkml.kernel.org/r/20260227194302.274384-19-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kiryl Shutsemau [Fri, 27 Feb 2026 19:42:55 +0000 (19:42 +0000)]
hugetlb: update vmemmap_dedup.rst
Update the documentation regarding vmemmap optimization for hugetlb to
reflect the changes in how the kernel maps the tail pages.
Fake heads no longer exist. Remove their description.
[kas@kernel.org: update vmemmap_dedup.rst] Link: https://lkml.kernel.org/r/20260302105630.303492-1-kas@kernel.org Link: https://lkml.kernel.org/r/20260227194302.274384-18-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kiryl Shutsemau [Fri, 27 Feb 2026 19:42:54 +0000 (19:42 +0000)]
mm: remove the branch from compound_head()
The compound_head() function is a hot path. For example, the zap path
calls it for every leaf page table entry.
Rewrite the helper function in a branchless manner to eliminate the risk
of CPU branch misprediction.
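As a generic illustration of what "branchless" means here, the sketch below selects between two pointers with a mask instead of an if. It is not the kernel's new compound_head(): struct page_sketch, the low-bit tail encoding and compound_head_sketch() are assumptions made for this example, and a compiler is still free to emit a branch or a conditional move.

```c
#include <stdint.h>

/* Hypothetical page descriptor: bit 0 set marks a tail page. */
struct page_sketch {
    uintptr_t compound_head; /* head address | 1 for tails, else 0 */
};

static struct page_sketch *compound_head_sketch(struct page_sketch *page)
{
    uintptr_t head = page->compound_head;
    uintptr_t self = (uintptr_t)page;
    /* All ones if this is a tail page, all zeroes otherwise. */
    uintptr_t mask = (uintptr_t)0 - (head & 1);

    /* Picks (head - 1) for tails and self otherwise, without an if. */
    return (struct page_sketch *)(self ^ ((self ^ (head - 1)) & mask));
}
```

A tail page resolves to its head, and a non-tail page resolves to itself.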
Link: https://lkml.kernel.org/r/20260227194302.274384-17-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The hugetlb_optimize_vmemmap_key static key was used to guard fake head
detection in compound_head() and related functions. It allowed skipping
the fake head checks entirely when HVO was not in use.
With fake heads eliminated and the detection code removed, the static key
serves no purpose. Remove its definition and all increment/decrement
calls.
Link: https://lkml.kernel.org/r/20260227194302.274384-16-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kiryl Shutsemau [Fri, 27 Feb 2026 19:42:52 +0000 (19:42 +0000)]
hugetlb: remove VMEMMAP_SYNCHRONIZE_RCU
The VMEMMAP_SYNCHRONIZE_RCU flag triggered synchronize_rcu() calls to
prevent a race between HVO remapping and page_ref_add_unless(). The race
could occur when a speculative PFN walker tried to modify the refcount on
a struct page that was in the process of being remapped to a fake head.
With fake heads eliminated, page_ref_add_unless() no longer needs RCU
protection.
Remove the flag and synchronize_rcu() calls.
Link: https://lkml.kernel.org/r/20260227194302.274384-15-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kiryl Shutsemau [Fri, 27 Feb 2026 19:42:51 +0000 (19:42 +0000)]
mm: drop fake head checks
With fake head pages eliminated in the previous commit, remove the
supporting infrastructure:
- page_fixed_fake_head(): no longer needed to detect fake heads;
- page_is_fake_head(): no longer needed;
- page_count_writable(): no longer needed for RCU protection;
- RCU read_lock in page_ref_add_unless(): no longer needed.
This substantially simplifies compound_head() and page_ref_add_unless(),
removing both branches and RCU overhead from these hot paths.
RCU was required to serialize the allocation of hugetlb pages against
get_page_unless_zero() and to prevent writes to the read-only fake head.
It is redundant without fake heads.
See bd225530a4c7 ("mm/hugetlb_vmemmap: fix race with speculative PFN
walkers") for more details.
synchronize_rcu() in mm/hugetlb_vmemmap.c will be removed by a separate
patch.
Link: https://lkml.kernel.org/r/20260227194302.274384-14-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kiryl Shutsemau [Fri, 27 Feb 2026 19:42:50 +0000 (19:42 +0000)]
mm/hugetlb: remove fake head pages
HugeTLB Vmemmap Optimization (HVO) reduces memory usage by freeing most
vmemmap pages for huge pages and remapping the freed range to a single
page containing the struct page metadata.
With the new mask-based compound_info encoding (for power-of-2 struct page
sizes), all tail pages of the same order are now identical regardless of
which compound page they belong to. This means the tail pages can be
truly shared without fake heads.
Allocate a single page of initialized tail struct pages per zone per order
in the vmemmap_tails[] array in struct zone. All huge pages of that order
in the zone share this tail page, mapped read-only into their vmemmap.
The head page remains unique per huge page.
Redefine MAX_FOLIO_ORDER using ilog2(). The define has to produce a
compile-time constant because it is used to specify the size of the
vmemmap_tails array. For some reason, the compiler is not able to
evaluate get_order() at compile time, but ilog2() works.
Avoid using PUD_ORDER to define MAX_FOLIO_ORDER, as it adds a dependency
on <linux/pgtable.h>, which creates a hard-to-break include loop.
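As a rough illustration of why the define must be a compile-time constant,
the user-space sketch below sizes an array member from a log2 computed
purely with constant expressions. The SKETCH_* macros, the 4 KiB page size,
the 1 GiB maximum folio size and struct sketch_zone are all assumptions made
for the example; this is a stand-in, not the kernel's ilog2() or its
definition of MAX_FOLIO_ORDER.

```c
#include <stddef.h>

/*
 * Hypothetical constant-expression log2 for values below 2^32.
 * Each level halves the bit range it has to look at.
 */
#define SKETCH_ILOG2_2(n)  ((n) >= 2UL ? 1 : 0)
#define SKETCH_ILOG2_4(n)  ((n) >= 4UL ? 2 + SKETCH_ILOG2_2((n) >> 2) : SKETCH_ILOG2_2(n))
#define SKETCH_ILOG2_8(n)  ((n) >= 16UL ? 4 + SKETCH_ILOG2_4((n) >> 4) : SKETCH_ILOG2_4(n))
#define SKETCH_ILOG2_16(n) ((n) >= 256UL ? 8 + SKETCH_ILOG2_8((n) >> 8) : SKETCH_ILOG2_8(n))
#define SKETCH_ILOG2_32(n) ((n) >= 65536UL ? 16 + SKETCH_ILOG2_16((n) >> 16) : SKETCH_ILOG2_16(n))

/* Assumed values, chosen only for the example. */
#define SKETCH_PAGE_SHIFT	12
#define SKETCH_MAX_FOLIO_BYTES	(1UL << 30)
#define SKETCH_MAX_FOLIO_ORDER \
	SKETCH_ILOG2_32(SKETCH_MAX_FOLIO_BYTES >> SKETCH_PAGE_SHIFT)

/* Hypothetical zone: one shared tail page slot per order. */
struct sketch_zone {
	/* An array member needs an integer constant expression as its size. */
	void *tails[SKETCH_MAX_FOLIO_ORDER + 1];
};

/* Number of slots in the array, derived from its type. */
size_t sketch_tails_slots(void)
{
	return sizeof(((struct sketch_zone *)0)->tails) / sizeof(void *);
}
```

A function call such as an out-of-line get_order() could not appear in the
array bound; a chain of conditional operators over constants can.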
This eliminates fake heads while maintaining the same memory savings, and
simplifies compound_head() by removing fake head detection.
Link: https://lkml.kernel.org/r/20260227194302.274384-13-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
x86/vdso: undefine CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP for vdso32
The 32-bit VDSO build on x86_64 uses fake_32bit_build.h to undefine
various kernel configuration options that are not suitable for the VDSO
context or may cause build issues when including kernel headers.
Undefine CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP in fake_32bit_build.h to
prepare for an upcoming change to the HugeTLB Vmemmap Optimization.
Link: https://lkml.kernel.org/r/20260227194302.274384-12-kas@kernel.org Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kiryl Shutsemau [Fri, 27 Feb 2026 19:42:48 +0000 (19:42 +0000)]
mm/hugetlb: refactor code around vmemmap_walk
To prepare for removing fake head pages, the vmemmap_walk code is being
reworked.
The reuse_page and reuse_addr variables are being eliminated. There will
no longer be an expectation regarding the reuse address in relation to the
operated range. Instead, the caller will provide head and tail vmemmap
pages.
Currently, vmemmap_head and vmemmap_tail are set to the same page, but
this will change in the future.
The only functional change is that __hugetlb_vmemmap_optimize_folio() will
abandon optimization if memory allocation fails.
Link: https://lkml.kernel.org/r/20260227194302.274384-11-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Hildenbrand (arm) <david@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/hugetlb: defer vmemmap population for bootmem hugepages
Currently, the vmemmap for bootmem-allocated gigantic pages is populated
early in hugetlb_vmemmap_init_early(). However, the zone information is
only available after zones are initialized. If it is later discovered
that a page spans multiple zones, the HVO mapping must be undone and
replaced with a normal mapping using vmemmap_undo_hvo().
Defer the actual vmemmap population to hugetlb_vmemmap_init_late(). At
this stage, zones are already initialized, so it can be checked if the
page is valid for HVO before deciding how to populate the vmemmap.
This allows us to remove vmemmap_undo_hvo() and the complex logic required
to rollback HVO mappings.
In hugetlb_vmemmap_init_late(), if HVO population fails or if the zones
are invalid, fall back to a normal vmemmap population.
Postponing population until hugetlb_vmemmap_init_late() also makes zone
information available from within vmemmap_populate_hvo().
Link: https://lkml.kernel.org/r/20260227194302.274384-10-kas@kernel.org Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kiryl Shutsemau [Fri, 27 Feb 2026 19:42:46 +0000 (19:42 +0000)]
mm/sparse: check memmap alignment for compound_info_has_mask()
If page->compound_info encodes a mask, the vmemmap is expected to be
naturally aligned to the maximum folio size.
Add a VM_WARN_ON_ONCE() to check the alignment.
Link: https://lkml.kernel.org/r/20260227194302.274384-9-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Hildenbrand (arm) <david@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kiryl Shutsemau [Fri, 27 Feb 2026 19:42:45 +0000 (19:42 +0000)]
mm: rework compound_head() for power-of-2 sizeof(struct page)
For tail pages, the kernel uses the 'compound_info' field to get to the
head page. Bit 0 of the field indicates whether the page is a tail
page, and if set, the remaining bits represent a pointer to the head page.
When the size of struct page is a power of 2, change the encoding of
compound_info to store a mask that can be applied to the virtual address
of the tail page in order to access the head page. This is possible
because the struct page of the head page is naturally aligned with regard
to the order of the page.
The significant impact of this modification is that all tail pages of the
same order will now have identical 'compound_info', regardless of the
compound page they are associated with. This paves the way for
eliminating fake heads.
The HugeTLB Vmemmap Optimization (HVO) creates fake heads and it is only
applied when the sizeof(struct page) is power-of-2. Having identical tail
pages allows the same page to be mapped into the vmemmap of all pages,
maintaining memory savings without fake heads.
If sizeof(struct page) is not power-of-2, there are no functional changes.
Limit mask usage to HugeTLB vmemmap optimization (HVO) where it makes a
difference. The mask-based approach would work under a wider set of
conditions, but it requires validating that struct pages are naturally
aligned for all orders up to MAX_FOLIO_ORDER, which can be tricky.
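The mask encoding described above can be sketched in user-space C roughly as
follows. This is an illustration only: the 64-byte struct page size, the
sketch_* helper names and the exact bit layout (mask with bit 0 set as the
tail marker) are assumptions and not necessarily what the patch implements.

```c
#include <stdint.h>

/* Assumption for illustration: sizeof(struct page) is 64, a power of 2. */
#define SKETCH_STRUCT_PAGE_SIZE ((uintptr_t)64)

/*
 * Hypothetical encoder: the value stored in every tail page of a given
 * order. It depends only on the order, not on the compound page.
 */
uintptr_t sketch_tail_info(unsigned int order)
{
	/* Bytes of struct page that describe one compound page. */
	uintptr_t span = SKETCH_STRUCT_PAGE_SIZE << order;

	/* Mask off the offset within the span; bit 0 marks a tail page. */
	return ~(span - 1) | (uintptr_t)1;
}

/*
 * Hypothetical decoder: compute the head's struct page address from the
 * address of any struct page and its compound_info value.
 */
uintptr_t sketch_head_addr(uintptr_t page_addr, uintptr_t info)
{
	if (!(info & 1))
		return page_addr;	/* not a tail page */

	/* Works because the head's struct page is naturally aligned. */
	return page_addr & (info & ~(uintptr_t)1);
}
```

With these assumptions, an order-9 compound page spans 0x8000 bytes of
struct page, so tails of two different compound pages carry the same value
yet resolve to their own heads, because the head is derived from each
tail's own address.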
Link: https://lkml.kernel.org/r/20260227194302.274384-8-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kiryl Shutsemau [Fri, 27 Feb 2026 19:42:44 +0000 (19:42 +0000)]
LoongArch/mm: align vmemmap to maximal folio size
The upcoming change to the HugeTLB vmemmap optimization (HVO) requires
struct pages of the head page to be naturally aligned with regard to the
folio size.
Align vmemmap to MAX_FOLIO_VMEMMAP_ALIGN.
Link: https://lkml.kernel.org/r/20260227194302.274384-7-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Hildenbrand (arm) <david@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kiryl Shutsemau [Fri, 27 Feb 2026 19:42:43 +0000 (19:42 +0000)]
riscv/mm: align vmemmap to maximal folio size
The upcoming change to the HugeTLB vmemmap optimization (HVO) requires
struct pages of the head page to be naturally aligned with regard to the
folio size.
Align vmemmap to the newly introduced MAX_FOLIO_VMEMMAP_ALIGN.
Link: https://lkml.kernel.org/r/20260227194302.274384-6-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Hildenbrand (arm) <david@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kiryl Shutsemau [Fri, 27 Feb 2026 19:42:42 +0000 (19:42 +0000)]
mm: move set/clear_compound_head() next to compound_head()
Move set_compound_head() and clear_compound_head() to be adjacent to the
compound_head() function in page-flags.h.
These functions encode and decode the same compound_info field, so keeping
them together makes it easier to verify their logic is consistent,
especially when the encoding changes.
Link: https://lkml.kernel.org/r/20260227194302.274384-5-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (arm) <david@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kiryl Shutsemau [Fri, 27 Feb 2026 19:42:41 +0000 (19:42 +0000)]
mm: rename the 'compound_head' field in the 'struct page' to 'compound_info'
The 'compound_head' field in the 'struct page' encodes whether the page is
a tail and where to locate the head page. Bit 0 is set if the page is a
tail, and the remaining bits in the field point to the head page.
As preparation for changing how the field encodes information about the
head page, rename the field to 'compound_info'.
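For reference, the pointer-based encoding that the field uses before the
later patches change it can be sketched as below. struct sketch_page and
the sketch_* helpers are made-up user-space stand-ins, not the kernel's
struct page or its set_compound_head()/compound_head() implementations.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct page; only the relevant field. */
struct sketch_page {
	uintptr_t compound_info;	/* previously named compound_head */
};

/* Mark 'tail' as a tail page: head pointer with bit 0 set. */
void sketch_set_tail(struct sketch_page *tail, struct sketch_page *head)
{
	tail->compound_info = (uintptr_t)head | (uintptr_t)1;
}

/* Return the head page for a tail page, or the page itself otherwise. */
struct sketch_page *sketch_compound_head(struct sketch_page *page)
{
	uintptr_t info = page->compound_info;

	if (info & 1)
		return (struct sketch_page *)(info - 1);
	return page;
}
```

Under this encoding each tail page stores a value specific to its own
compound page, since the head pointer differs from one compound page to
the next.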
Link: https://lkml.kernel.org/r/20260227194302.274384-4-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (arm) <david@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* tag 'x86-urgent-2026-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/platform/geode: Fix on-stack property data use-after-return bug
x86/kexec: Disable KCOV instrumentation after load_segments()
Kiryl Shutsemau [Fri, 27 Feb 2026 19:42:40 +0000 (19:42 +0000)]
mm: change the interface of prep_compound_tail()
Instead of passing down the head page and tail page index, pass the tail
and head pages directly, as well as the order of the compound page.
This is a preparation for changing how the head position is encoded in the
tail page.
Link: https://lkml.kernel.org/r/20260227194302.274384-3-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (arm) <david@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kiryl Shutsemau [Fri, 27 Feb 2026 19:42:39 +0000 (19:42 +0000)]
mm: move MAX_FOLIO_ORDER definition to mmzone.h
Patch series "mm: Eliminate fake head pages from vmemmap optimization",
v7.
This series removes "fake head pages" from the HugeTLB vmemmap
optimization (HVO) by changing how tail pages encode their relationship to
the head page.
It simplifies compound_head() and page_ref_add_unless(). Both are in the
hot path.
Background
==========
HVO reduces memory overhead by freeing vmemmap pages for HugeTLB pages and
remapping the freed virtual addresses to a single physical page.
Previously, all tail page vmemmap entries were remapped to the first
vmemmap page (containing the head struct page), creating "fake heads" -
tail pages that appear to have PG_head set when accessed through the
deduplicated vmemmap.
This required special handling in compound_head() to detect and work
around fake heads, adding complexity and overhead to a very hot path.
New Approach
============
For architectures/configs where sizeof(struct page) is a power of 2 (the
common case), this series changes how the position of the head page is
encoded in the tail pages.
Instead of storing a pointer to the head page, the ->compound_info
(renamed from ->compound_head) now stores a mask.
The mask can be applied to any tail page's virtual address to compute the
head page address. Critically, all tail pages of the same order now have
identical compound_info values, regardless of which compound page they
belong to.
In v7, these shared tail pages are allocated per-zone. This ensures that
zone information (stored in page->flags) is correct even for shared tail
pages, removing the need for the special-casing in page_zonenum() proposed
in earlier versions.
To support per-zone shared pages for boot-allocated gigantic pages, the
vmemmap population is deferred until zones are initialized. This
simplifies the logic significantly and allows the removal of
vmemmap_undo_hvo().
Benefits
========
1. Simplified compound_head(): No fake head detection needed, can be
implemented in a branchless manner.
2. Simplified page_ref_add_unless(): RCU protection removed since there's
no race with fake head remapping.
3. Cleaner architecture: The shared tail pages are truly read-only and
contain valid tail page metadata.
If sizeof(struct page) is not power-of-2, there are no functional changes.
HVO is not supported in this configuration.
I had hoped to see performance improvement, but my testing thus far has
shown either no change or only a slight improvement within the noise.
Series Organization
===================
Patch 1: Move MAX_FOLIO_ORDER definition to mmzone.h.
Patches 2-4: Refactoring of field names and interfaces.
Patches 5-6: Architecture alignment for LoongArch and RISC-V.
Patch 7: Mask-based compound_head() implementation.
Patch 8: Add memmap alignment checks.
Patch 9: Branchless compound_head() optimization.
Patch 10: Defer vmemmap population for bootmem hugepages.
Patch 11: Refactor vmemmap_walk.
Patch 12: x86 vDSO build fix.
Patch 13: Eliminate fake heads with per-zone shared tail pages.
Patches 14-16: Cleanup of fake head infrastructure.
Patch 17: Documentation update.
Patch 18: Use compound_head() in page_slab().
This patch (of 17):
Move MAX_FOLIO_ORDER definition from mm.h to mmzone.h.
This is preparation for adding the vmemmap_tails array to struct zone,
which requires MAX_FOLIO_ORDER to be available in mmzone.h.
Link: https://lkml.kernel.org/r/20260227194302.274384-1-kas@kernel.org Link: https://lkml.kernel.org/r/20260227194302.274384-2-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Muchun Song <muchun.song@linux.dev> Acked-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
gao xu [Thu, 26 Feb 2026 12:37:22 +0000 (12:37 +0000)]
zram: use statically allocated compression algorithm names
Currently, zram dynamically allocates memory for compressor algorithm
names when they are set by the user. This requires careful memory
management, including explicit `kfree` calls and special handling to avoid
freeing statically defined default compressor names.
This patch refactors the way zram handles compression algorithm names.
Instead of storing dynamically allocated copies, `zram->comp_algs` will
now store pointers directly to the static name strings defined within the
`zcomp_ops` backend structures, thereby removing the need for conditional
`kfree` calls.
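The idea can be sketched in user-space C as follows. The backend table,
struct sketch_backend and sketch_find_alg_name() are hypothetical names
invented for the example; they only illustrate keeping a pointer to a
static name instead of a heap copy, and are not zram's actual code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a backend descriptor with a static name. */
struct sketch_backend {
	const char *name;
};

/* Example table; the entries are string literals with static storage. */
static const struct sketch_backend sketch_backends[] = {
	{ "lzo" },
	{ "zstd" },
};

/*
 * Look up the user-supplied name and return the backend's own static
 * string, or NULL if no backend matches. The caller keeps the returned
 * pointer and never has to free it.
 */
const char *sketch_find_alg_name(const char *requested)
{
	size_t i;

	for (i = 0; i < sizeof(sketch_backends) / sizeof(sketch_backends[0]); i++) {
		if (strcmp(requested, sketch_backends[i].name) == 0)
			return sketch_backends[i].name;
	}
	return NULL;
}
```

Because the stored pointer never refers to the user's buffer or to a heap
copy, there is no case in which it must be freed, and therefore no need to
distinguish default names from user-set ones.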
Tal Zussman [Wed, 25 Feb 2026 23:44:28 +0000 (18:44 -0500)]
folio_batch: rename PAGEVEC_SIZE to FOLIO_BATCH_SIZE
struct pagevec no longer exists. Rename the macro appropriately.
Link: https://lkml.kernel.org/r/20260225-pagevec_cleanup-v2-4-716868cc2d11@columbia.edu Signed-off-by: Tal Zussman <tz2294@columbia.edu> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tal Zussman [Wed, 25 Feb 2026 23:44:27 +0000 (18:44 -0500)]
folio_batch: rename pagevec.h to folio_batch.h
struct pagevec was removed in commit 1e0877d58b1e ("mm: remove struct
pagevec"). Rename include/linux/pagevec.h to reflect reality and update
includes tree-wide. Add the new filename to MAINTAINERS explicitly, as it
no longer matches the "include/linux/page[-_]*" pattern in MEMORY
MANAGEMENT - CORE.
Link: https://lkml.kernel.org/r/20260225-pagevec_cleanup-v2-3-716868cc2d11@columbia.edu Signed-off-by: Tal Zussman <tz2294@columbia.edu> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tal Zussman [Wed, 25 Feb 2026 23:44:26 +0000 (18:44 -0500)]
fs: remove unnecessary pagevec.h includes
Remove unused pagevec.h includes from .c files. These were found with
the following command:
grep -rl '#include.*pagevec\.h' --include='*.c' | while read f; do
grep -qE 'PAGEVEC_SIZE|folio_batch' "$f" || echo "$f"
done
There are probably more removal candidates in .h files, but those are
more complex to analyze.
Link: https://lkml.kernel.org/r/20260225-pagevec_cleanup-v2-2-716868cc2d11@columbia.edu Signed-off-by: Tal Zussman <tz2294@columbia.edu> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tal Zussman [Wed, 25 Feb 2026 23:44:25 +0000 (18:44 -0500)]
mm: remove stray references to struct pagevec
Patch series "mm: Remove stray references to pagevec", v2.
struct pagevec was removed in commit 1e0877d58b1e ("mm: remove struct
pagevec"). Remove any stray references to it and rename relevant files
and macros accordingly.
While at it, remove unnecessary #includes of pagevec.h (now folio_batch.h)
in .c files. There are probably more of these that could be removed in .h
files, but those are more complex to verify.
This patch (of 4):
struct pagevec was removed in commit 1e0877d58b1e ("mm: remove struct
pagevec"). Remove remaining forward declarations and change
__folio_batch_release()'s declaration to match its definition.
Pasha Tatashin [Wed, 25 Feb 2026 22:38:57 +0000 (17:38 -0500)]
kho: fix KASAN support for restored vmalloc regions
Restored vmalloc regions are currently not properly marked for KASAN,
causing KASAN to treat accesses to these regions as out-of-bounds.
Fix this by properly unpoisoning the restored vmalloc area using
kasan_unpoison_vmalloc(). This requires setting the VM_UNINITIALIZED flag
during the initial area allocation and clearing it after the pages have
been mapped and unpoisoned, using the clear_vm_uninitialized_flag()
helper.
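The ordering described above can be sketched as a toy userspace model. This is not kernel code: every toy_* name is hypothetical, and the struct only tracks the three pieces of state the fix cares about (the uninitialized flag, whether pages are mapped, and whether the area is still poisoned).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's VM_UNINITIALIZED flag. */
#define TOY_VM_UNINITIALIZED 0x1u

struct toy_area {
	unsigned int flags;
	bool mapped;    /* pages have been mapped into the area */
	bool poisoned;  /* area is still marked inaccessible */
};

/* Step 1: allocate the area with the uninitialized flag set. */
void toy_area_alloc(struct toy_area *a)
{
	a->flags = TOY_VM_UNINITIALIZED;
	a->mapped = false;
	a->poisoned = true;
}

/* Step 2: map the preserved pages. */
void toy_map_pages(struct toy_area *a)
{
	a->mapped = true;
}

/* Step 3: unpoison, which only makes sense once the pages are mapped. */
void toy_unpoison(struct toy_area *a)
{
	assert(a->mapped);
	a->poisoned = false;
}

/* Step 4: clear the flag after mapping and unpoisoning are done. */
void toy_clear_uninitialized(struct toy_area *a)
{
	assert(a->mapped && !a->poisoned);
	a->flags &= ~TOY_VM_UNINITIALIZED;
}

/* In this model an access is fine once the area is mapped and unpoisoned. */
bool toy_access_ok(const struct toy_area *a)
{
	return a->mapped && !a->poisoned;
}

/* Run the four steps in the order the commit message describes. */
bool toy_restore(struct toy_area *a)
{
	toy_area_alloc(a);
	toy_map_pages(a);
	toy_unpoison(a);
	toy_clear_uninitialized(a);
	return toy_access_ok(a);
}
```

The asserts only encode the ordering constraint from the description; they say nothing about how the real helpers are implemented.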
Link: https://lkml.kernel.org/r/20260225223857.1714801-3-pasha.tatashin@soleen.com Fixes: a667300bd53f ("kho: add support for preserving vmalloc allocations") Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reported-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Tested-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pasha Tatashin [Wed, 25 Feb 2026 22:38:56 +0000 (17:38 -0500)]
mm/vmalloc: export clear_vm_uninitialized_flag()
Patch series "Fix KASAN support for KHO restored vmalloc regions".
When KHO restores a vmalloc area, it maps existing physical pages into a
newly allocated virtual memory area. However, because these areas were
not properly unpoisoned, KASAN would treat any access to the restored
region as out-of-bounds, as seen in the following trace:
BUG: KASAN: vmalloc-out-of-bounds in kho_test_restore_data.isra.0+0x17b/0x2cd
Read of size 8 at addr ffffc90000025000 by task swapper/0/1
[...]
Call Trace:
[...]
kasan_report+0xe8/0x120
kho_test_restore_data.isra.0+0x17b/0x2cd
kho_test_init+0x15a/0x1f0
do_one_initcall+0xd5/0x4b0
The fix involves deferring KASAN's default poisoning by using the
VM_UNINITIALIZED flag during allocation, manually unpoisoning the memory
once it is correctly mapped, and then clearing the uninitialized flag
using a newly exported helper.
This patch (of 2):
Make clear_vm_uninitialized_flag() available to other parts of the kernel
that need to manage vmalloc areas manually, such as KHO for restoring
vmallocs.
Link: https://lkml.kernel.org/r/20260225220223.1695350-1-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20260225223857.1714801-2-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Catalin Marinas [Wed, 25 Feb 2026 16:14:02 +0000 (16:14 +0000)]
mm: do not map the shadow stack as THP
The default shadow stack size allocated on first prctl() for the main
thread or subsequently on clone() is either half of RLIMIT_STACK or half
of a thread's stack size (for arm64). Both of these are likely to be
suitable for a THP allocation and the kernel is more aggressive in
creating such mappings. However, it does not make much sense to use a
huge page. It didn't make sense for the normal stacks either, see commit c4608d1bf7c6 ("mm: mmap: map MAP_STACK to VM_NOHUGEPAGE").
Force VM_NOHUGEPAGE when allocating/mapping the shadow stack. As per
commit 7190b3c8bd2b ("mm: mmap: map MAP_STACK to VM_NOHUGEPAGE only if THP
is enabled"), only pass this flag if TRANSPARENT_HUGEPAGE is enabled so as
not to confuse CRIU tools.
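A minimal sketch of the flag selection, assuming a runtime boolean in place of the kernel's config check. The TOY_* bit values and the function name are illustrative only and do not match the real VM_* definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; the kernel's real VM_* values differ. */
#define TOY_VM_SHADOW_STACK 0x1UL
#define TOY_VM_NOHUGEPAGE   0x2UL

/*
 * Add the no-hugepage flag to a shadow stack mapping's flags, but only when
 * THP is enabled. thp_enabled stands in for the kernel's config check.
 */
unsigned long toy_shadow_stack_vm_flags(unsigned long vm_flags, bool thp_enabled)
{
	/* This sketch only handles shadow stack mappings. */
	assert(vm_flags & TOY_VM_SHADOW_STACK);

	if (thp_enabled)
		vm_flags |= TOY_VM_NOHUGEPAGE;
	return vm_flags;
}
```

With THP disabled the flags pass through unchanged, which is the behaviour the CRIU note above asks for.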
Link: https://lkml.kernel.org/r/20260225161404.3157851-6-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Deepak Gupta <debug@rivosinc.com> Reviewed-by: Mark Brown <broonie@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <pjw@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Catalin Marinas [Wed, 25 Feb 2026 16:13:58 +0000 (16:13 +0000)]
mm: introduce vm_mmap_shadow_stack() as a helper for VM_SHADOW_STACK mappings
Patch series "mm: arch/shstk: Common shadow stack mapping helper and
VM_NOHUGEPAGE", v2.
A series to extract the common shadow stack mmap into a separate helper
for arm64, riscv and x86.
This patch (of 5):
arm64, riscv and x86 use a similar pattern for mapping the user shadow
stack (cloned from x86). Extract this into a helper to facilitate code
reuse.
The call to do_mmap() from the new helper uses PROT_READ|PROT_WRITE prot
bits instead of the PROT_READ with an explicit VM_WRITE vm_flag. The x86
intent was to avoid PROT_WRITE implying normal write since the shadow
stack is not writable by normal stores. However, from a kernel
perspective, the vma is writeable. Functionally there is no difference.
Link: https://lkml.kernel.org/r/20260225161404.3157851-1-catalin.marinas@arm.com Link: https://lkml.kernel.org/r/20260225161404.3157851-2-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Deepak Gupta <debug@rivosinc.com> Reviewed-by: Mark Brown <broonie@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Paul Walmsley <pjw@kernel.org> Cc: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Koutný [Wed, 25 Feb 2026 18:38:44 +0000 (19:38 +0100)]
mm: do not allocate shrinker info with cgroup.memory=nokmem
There is no work for memcg-aware shrinkers when kernel memory is not
accounted per cgroup, so we can skip allocating per-memcg shrinker data.
This saves some memory, avoids holding shrinker_mutex for O(nr_memcgs)
work and saves work in shrink_slab_memcg().
Then there are SHRINKER_NONSLAB shrinkers which handle non-kernel memory
so nokmem should not disable their per-memcg behavior. Such shrinkers
(e.g. deferred_split_shrinker) still need access to per-memcg data (see
also commit 0a432dcbeb32e ("mm: shrinker: make shrinker not depend on
memcg kmem")).
The savings with this patch come on container hosts that create many
superblocks (each with its own shrinker), where tracking and processing
per-memcg data is pointless with nokmem (shrink_slab_memcg() is already
partially guarded with !memcg_kmem_online).
The patch uses the boot-time predicate mem_cgroup_kmem_disabled() (not
memcg_kmem_online()) to avoid mistakenly un-MEMCG_AWARE-ing shrinkers
registered before the first non-root memcg is mkdir'd.
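The decision described above can be sketched as a small predicate. This is an illustration under assumptions, not the patch itself: the TOY_* bits stand in for the kernel's SHRINKER_* flags, and kmem_disabled stands in for the boot-time mem_cgroup_kmem_disabled() result.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values; the kernel defines the real SHRINKER_* flags. */
#define TOY_SHRINKER_MEMCG_AWARE 0x1u
#define TOY_SHRINKER_NONSLAB     0x2u

/*
 * Decide whether a shrinker should keep per-memcg data.
 * kmem_disabled models booting with cgroup.memory=nokmem.
 */
bool toy_shrinker_wants_memcg_info(unsigned int flags, bool kmem_disabled)
{
	if (!(flags & TOY_SHRINKER_MEMCG_AWARE))
		return false;
	/* With nokmem, only shrinkers for non-kernel memory still need it. */
	if (kmem_disabled && !(flags & TOY_SHRINKER_NONSLAB))
		return false;
	return true;
}

/* Expected results for the three cases the commit message discusses. */
void toy_shrinker_selfcheck(void)
{
	unsigned int slab = TOY_SHRINKER_MEMCG_AWARE;
	unsigned int nonslab = TOY_SHRINKER_MEMCG_AWARE | TOY_SHRINKER_NONSLAB;

	assert(toy_shrinker_wants_memcg_info(slab, false));
	assert(!toy_shrinker_wants_memcg_info(slab, true));
	assert(toy_shrinker_wants_memcg_info(nonslab, true));
}
```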
Youngjun Park [Thu, 26 Feb 2026 01:07:39 +0000 (10:07 +0900)]
MAINTAINERS: add Youngjun Park as reviewer for SWAP
Recently, I have been actively contributing to the swap subsystem through
work such as the swap-tier patches and the flash-friendly swap proposal.
During this process, I have consistently reviewed swap table code and
other patches, and fixed several bugs.
As I am already CC'd on many patches and maintain an active interest in
ongoing developments, I would like to officially add myself as a reviewer.
I am committed to contributing to the kernel community with greater
responsibility.
Link: https://lkml.kernel.org/r/20260226010739.3773838-1-youngjun.park@lge.com Signed-off-by: Youngjun Park <youngjun.park@lge.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: Baoquan He <bhe@redhat.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lance Yang [Tue, 24 Feb 2026 14:21:01 +0000 (22:21 +0800)]
mm/mmu_gather: replace IPI with synchronize_rcu() when batch allocation fails
When freeing page tables, we try to batch them. If batch allocation fails
(GFP_NOWAIT), __tlb_remove_table_one() immediately frees the single table
without batching.
On !CONFIG_PT_RECLAIM, the fallback sends an IPI to all CPUs via
tlb_remove_table_sync_one(). It disrupts all CPUs even when only a single
process is unmapping memory. IPI broadcast was reported to hurt RT
workloads[1].
tlb_remove_table_sync_one() synchronizes with lockless page-table walkers
(e.g. GUP-fast) that rely on IRQ disabling. These walkers use
local_irq_disable(), which is also an RCU read-side critical section.
This patch introduces tlb_remove_table_sync_rcu(), which waits for an RCU
grace period (synchronize_rcu()) instead of broadcasting an IPI. This
provides the same guarantee as the IPI without disrupting all CPUs. Since
batch allocation has already failed, we are in a slow path where sleeping
is acceptable: we are in process context (unmap_region, exit_mmap) with
only mmap_lock held.
tlb_remove_table_sync_one() is retained for other callers (e.g.,
khugepaged after pmdp_collapse_flush(), tlb_finish_mmu() when
tlb->fully_unshared_tables) that are not slow paths. Converting those may
require different approaches such as targeted IPIs.
Thomas Ballasi [Mon, 16 Mar 2026 16:09:07 +0000 (09:09 -0700)]
mm: vmscan: add cgroup IDs to vmscan tracepoints
Memory reclaim events are currently difficult to attribute to specific
cgroups, making debugging memory pressure issues challenging. This patch
adds memory cgroup ID (memcg_id) to key vmscan tracepoints to enable
better correlation and analysis.
For operations not associated with a specific cgroup, the field is
defaulted to 0.
Link: https://lkml.kernel.org/r/20260316160908.42727-3-tballasi@linux.microsoft.com Signed-off-by: Thomas Ballasi <tballasi@linux.microsoft.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Steven Rostedt [Mon, 16 Mar 2026 16:09:06 +0000 (09:09 -0700)]
tracing: add __event_in_*irq() helpers
Patch series "mm: vmscan: add PID and cgroup ID to vmscan tracepoints", v8.
This patch (of 3):
Some trace events want to show in their output whether they were triggered
in interrupt or softirq context. This information is already stored in the
flags portion of the event header, so instead of recording it in the event
structure itself, add helper macros that can be used in the print format:
Johannes Weiner [Mon, 23 Feb 2026 16:01:06 +0000 (11:01 -0500)]
mm: vmalloc: streamline vmalloc memory accounting
Use a vmstat counter instead of a custom, open-coded atomic. This has
the added benefit of making the data available per-node, and prepares
for cleaning up the memcg accounting as well.
Link: https://lkml.kernel.org/r/20260223160147.3792777-1-hannes@cmpxchg.org Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jason Miu [Fri, 6 Feb 2026 02:14:28 +0000 (18:14 -0800)]
kho: remove finalize state and clients
Eliminate the `kho_finalize()` function and its associated state from the
KHO subsystem. The transition to a radix tree for memory tracking makes
the explicit "finalize" state and its serialization step obsolete.
Remove the `kho_finalize()` and `kho_finalized()` APIs and their stub
implementations. Update KHO client code and the debugfs interface to no
longer call or depend on the `kho_finalize()` mechanism.
Complete the move towards a stateless KHO, simplifying the overall design
by removing unnecessary state management.
Link: https://lkml.kernel.org/r/20260206021428.3386442-3-jasonmiu@google.com Signed-off-by: Jason Miu <jasonmiu@google.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: David Matlack <dmatlack@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Pratyush Yadav <pratyush@kernel.org> Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jason Miu [Fri, 6 Feb 2026 02:14:27 +0000 (18:14 -0800)]
kho: adopt radix tree for preserved memory tracking
Patch series "Make KHO Stateless", v9.
This series transitions KHO from an xarray-based metadata tracking system
with serialization to a radix tree data structure that can be passed
directly to the next kernel.
The key motivations for this change are to:
- Eliminate the need for data serialization before kexec.
- Remove the KHO finalize state.
- Pass preservation metadata more directly to the next kernel via the FDT.
The new approach uses a radix tree to mark preserved pages. A page's
physical address and its order are encoded into a single value. The tree
is composed of multiple levels of page-sized tables, with leaf nodes being
bitmaps where each set bit represents a preserved page. The physical
address of the radix tree's root is passed in the FDT, allowing the next
kernel to reconstruct the preserved memory map.
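To illustrate how an address and an order can share a single value, here is a sketch of one possible encoding. It is an assumption made for illustration and is not claimed to be the encoding the KHO ABI uses: the address is shifted down by the block size, and a marker bit placed just above the remaining address bits records the order. The toy_* names and the 4 KiB page size are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12 /* assumes 4 KiB base pages */

/* Pack a block-aligned physical address and its order into one key. */
uint64_t toy_encode_key(uint64_t phys, unsigned int order)
{
	unsigned int shift = TOY_PAGE_SHIFT + order;

	assert(shift < 64);
	/* The address must be aligned to the block it describes. */
	assert((phys & ((UINT64_C(1) << shift) - 1)) == 0);
	/* Marker bit sits above every possible shifted address bit. */
	return (UINT64_C(1) << (64 - shift)) | (phys >> shift);
}

/* Recover the physical address; the order is returned through *order. */
uint64_t toy_decode_key(uint64_t key, unsigned int *order)
{
	unsigned int top = 63;
	unsigned int shift;

	assert(key != 0);
	/* The highest set bit is the order marker. */
	while (!(key & (UINT64_C(1) << top)))
		top--;
	assert(top >= 1 && top <= 64 - TOY_PAGE_SHIFT);

	shift = 64 - top;
	*order = shift - TOY_PAGE_SHIFT;
	return (key & ~(UINT64_C(1) << top)) << shift;
}
```

In this scheme the same address at two different orders yields two different keys, so a single tree can track blocks of every order.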
This series is broken down into the following patches:
1. kho: Adopt radix tree for preserved memory tracking:
Replaces the xarray-based tracker with the new radix tree
implementation and increments the ABI version.
2. kho: Remove finalize state and clients:
Removes the now-obsolete kho_finalize() function and its usage
from client code and debugfs.
This patch (of 2):
Introduce a radix tree implementation for tracking preserved memory pages
and switch the KHO memory tracking mechanism to use it. This lays the
groundwork for a stateless KHO implementation that eliminates the need for
serialization and the associated "finalize" state.
This patch introduces the core radix tree data structures and constants to
the KHO ABI. It adds the radix tree node and leaf structures, along with
documentation for the radix tree key encoding scheme that combines a
page's physical address and order.
To support broader use by other kernel subsystems, such as hugetlb
preservation, the core radix tree manipulation functions are exported as a
public API.
The xarray-based memory tracking is replaced with this new radix tree
implementation. The core KHO preservation and unpreservation functions
are wired up to use the radix tree helpers. On boot, the second kernel
restores the preserved memory map by walking the radix tree whose root
physical address is passed via the FDT.
The ABI `compatible` version is bumped to "kho-v2" to reflect the
structural changes in the preserved memory map and sub-FDT property names.
This includes renaming "fdt" to "preserved-data" to better reflect that
preserved state may use formats other than FDT.
[ran.xiaokai@zte.com.cn: fix child node parsing for debugfs in/sub_fdts] Link: https://lkml.kernel.org/r/20260309033530.244508-1-ranxiaokai627@163.com Link: https://lkml.kernel.org/r/20260206021428.3386442-1-jasonmiu@google.com Link: https://lkml.kernel.org/r/20260206021428.3386442-2-jasonmiu@google.com Signed-off-by: Jason Miu <jasonmiu@google.com> Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: David Matlack <dmatlack@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Pratyush Yadav <pratyush@kernel.org> Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kho: move alloc tag init to kho_init_{folio,pages}()
Commit 8f1081892d62 ("kho: simplify page initialization in
kho_restore_page()") cleaned up the page initialization logic by moving
the folio and 0-order-page paths into separate functions. It missed
moving the alloc tag initialization.
Do it now to keep the two paths cleanly separated. While at it, touch up
the comments to be a tiny bit shorter (mainly so it doesn't end up
splitting into a multiline comment). This is purely a cosmetic change and
there should be no change in behaviour.
Link: https://lkml.kernel.org/r/20260213085914.2778107-1-pratyush@kernel.org Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: centralize+fix comments about compound_mapcount() in new sync_with_folio_pmd_zap()
We still mention compound_mapcount() in two comments.
Instead of simply referring to the folio mapcount in both places, let's
factor out the odd-looking PTL sync into sync_with_folio_pmd_zap(), and
add centralized documentation why this is required.
[akpm@linux-foundation.org: update comment per Matthew and David] Link: https://lkml.kernel.org/r/20260223163920.287720-1-david@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vernon Yang [Sat, 21 Feb 2026 09:39:18 +0000 (17:39 +0800)]
mm: khugepaged: skip lazy-free folios
For example, create three tasks: hot1 -> cold -> hot2. After all three
tasks are created, each allocates 128 MB of memory. The hot1 and hot2
tasks continuously access their 128 MB, while the cold task only accesses
its memory briefly and then calls madvise(MADV_FREE). However, khugepaged
still prioritizes scanning the cold task and only scans the hot2 task
after completing the scan of the cold task.
All folios in a VM_DROPPABLE mapping are lazy-free. Collapsing maintains
that property, so we can just collapse and future memory pressure will
free the memory. In contrast, collapsing in a !VM_DROPPABLE mapping does
not maintain that property: the collapsed folio will not be lazy-free and
future memory pressure will not be able to free it.
So if the user has explicitly informed us via MADV_FREE that this memory
will be freed, and the vma does not have the VM_DROPPABLE flag, it is
appropriate for khugepaged to simply skip it, thereby avoiding unnecessary
scan and collapse operations and reducing wasted CPU time.
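The skip rule can be sketched as a predicate over a toy folio. This is a userspace model under assumptions: lazy-free is modelled as "anonymous and no longer swap-backed", and the toy_* names and struct are hypothetical rather than the kernel's folio API.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_folio {
	bool anon;       /* anonymous memory */
	bool swapbacked; /* modelled as cleared once MADV_FREE is applied */
};

/* Lazy-free is modelled here as "anonymous and no longer swap-backed". */
bool toy_folio_test_lazyfree(const struct toy_folio *folio)
{
	return folio->anon && !folio->swapbacked;
}

/*
 * Skip lazy-free folios unless the VMA is droppable, since collapsing in a
 * droppable VMA keeps the memory reclaimable.
 */
bool toy_khugepaged_should_skip(const struct toy_folio *folio, bool vma_droppable)
{
	return !vma_droppable && toy_folio_test_lazyfree(folio);
}

/* Expected results for the cases discussed above. */
void toy_lazyfree_selfcheck(void)
{
	struct toy_folio lazy = { .anon = true, .swapbacked = false };
	struct toy_folio normal = { .anon = true, .swapbacked = true };

	assert(toy_khugepaged_should_skip(&lazy, false));
	assert(!toy_khugepaged_should_skip(&lazy, true));
	assert(!toy_khugepaged_should_skip(&normal, false));
}
```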
Here are the performance test results:
(For throughput, higher is better; for the other metrics, lower is better)
[vernon2gm@gmail.com: add comment about VM_DROPPABLE in code, make it clearer] Link: https://lkml.kernel.org/r/i4uowkt4h2ev47obm5h2vtd4zbk6fyw5g364up7kkjn2vmcikq@auepvqethj5r Link: https://lkml.kernel.org/r/20260221093918.1456187-5-vernon2gm@gmail.com Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn> Acked-by: David Hildenbrand (arm) <david@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vernon Yang [Sat, 21 Feb 2026 09:39:17 +0000 (17:39 +0800)]
mm: add folio_test_lazyfree helper
Add folio_test_lazyfree() function to identify lazy-free folios to improve
code readability.
Link: https://lkml.kernel.org/r/20260221093918.1456187-4-vernon2gm@gmail.com Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vernon Yang [Thu, 26 Feb 2026 14:31:34 +0000 (22:31 +0800)]
mm-khugepaged-refine-scan-progress-number-fix
Based on previous discussions [1], here is v2. Testing shows the same
performance benefits. This only makes the code cleaner, with no functional
changes.
Link: https://lkml.kernel.org/r/hbftflvdmnranprul4zkq3d2iymqm7ta2a7fwiphggsmt36gt7@bihvv5jg2ko5 Link: https://lore.kernel.org/linux-mm/zdvzmoop5xswqcyiwmvvrdfianm4ccs3gryfecwbm4bhuh7ebo@7an4huwgbuwo Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand (arm) <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vernon Yang [Sat, 21 Feb 2026 09:39:16 +0000 (17:39 +0800)]
mm: khugepaged: refine scan progress number
Currently, each scan always increases "progress" by HPAGE_PMD_NR,
even if only a single PTE/PMD entry is scanned.
- When only a single PTE entry is scanned, let me provide a detailed
example:
    static int hpage_collapse_scan_pmd()
    {
        for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
             _pte++, addr += PAGE_SIZE) {
            pte_t pteval = ptep_get(_pte);
            ...
            if (pte_uffd_wp(pteval)) { <-- first scan hit
                result = SCAN_PTE_UFFD_WP;
                goto out_unmap;
            }
        }
    }
During the first scan, if pte_uffd_wp(pteval) is true, the loop exits
directly. In practice, only one PTE is scanned before termination. Here,
"progress += 1" reflects the actual number of PTEs scanned, whereas
previously it was always "progress += HPAGE_PMD_NR".
- When the memory has been collapsed to PMD, let me provide a detailed
example:
The following data is traced by bpftrace on a desktop system. After the
system has been left idle for 10 minutes upon booting, a lot of
SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE are observed during a full scan by
khugepaged.
From trace_mm_khugepaged_scan_pmd and trace_mm_khugepaged_scan_file, the
following statuses were observed, with frequency mentioned next to them:
SCAN_SUCCEED : 1
SCAN_EXCEED_SHARED_PTE: 2
SCAN_PMD_MAPPED : 142
SCAN_NO_PTE_TABLE : 178
total progress size : 674 MB
Total time : 419 seconds, include khugepaged_scan_sleep_millisecs
The khugepaged_scan list holds all tasks that support collapsing into
hugepages. As long as a task is not destroyed, khugepaged will not remove
it from the khugepaged_scan list. As a result, a task may have already
collapsed all of its memory regions into hugepages while khugepaged
continues to scan it, which wastes CPU time to no effect. Because of
khugepaged_scan_sleep_millisecs (default 10s), scanning a large number of
such tasks causes a long wait, so tasks that can actually benefit are
scanned later.
After applying this patch, when the memory is either SCAN_PMD_MAPPED or
SCAN_NO_PTE_TABLE, it is simply skipped, as follows:
SCAN_EXCEED_SHARED_PTE: 2
SCAN_PMD_MAPPED : 147
SCAN_NO_PTE_TABLE : 173
total progress size : 45 MB
Total time : 20 seconds
SCAN_PTE_MAPPED_HUGEPAGE is handled the same way. For detailed data, refer to
https://lore.kernel.org/linux-mm/4qdu7owpmxfh3ugsue775fxarw5g2gcggbxdf5psj75nnu7z2u@cv2uu2yocaxq
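For the single-PTE case above, the accounting change can be sketched in userspace as a scan that reports how many entries it really examined. This is a toy model: the blocked[] array stands in for checks such as pte_uffd_wp(), and toy_scan_ptes() is a hypothetical name, not the kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Scan up to nr entries and stop at the first one that blocks a collapse.
 * Returns the number of entries actually examined, which is the amount
 * "progress" should grow by.
 */
size_t toy_scan_ptes(const bool *blocked, size_t nr)
{
	size_t scanned = 0;

	assert(blocked != NULL || nr == 0);
	for (size_t i = 0; i < nr; i++) {
		scanned++;
		if (blocked[i])
			break;
	}
	return scanned;
}
```

A caller would then add the return value to its progress counter instead of a fixed HPAGE_PMD_NR, so an early exit on the first entry counts as 1 rather than as a full PMD's worth of entries.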
Link: https://lkml.kernel.org/r/20260221093918.1456187-3-vernon2gm@gmail.com Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand (arm) <david@kernel.org> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This series improves the khugepaged scan logic and reduces CPU consumption
by prioritizing scanning tasks that access memory frequently.
The following data is traced by bpftrace[1] on a desktop system. After
the system has been left idle for 10 minutes upon booting, a lot of
SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE are observed during a full scan by
khugepaged.
@scan_pmd_status[1]: 1 ## SCAN_SUCCEED
@scan_pmd_status[6]: 2 ## SCAN_EXCEED_SHARED_PTE
@scan_pmd_status[3]: 142 ## SCAN_PMD_MAPPED
@scan_pmd_status[2]: 178 ## SCAN_NO_PTE_TABLE
total progress size: 674 MB
Total time : 419 seconds ## include khugepaged_scan_sleep_millisecs
khugepaged shows the following behavior. The khugepaged list is scanned in
FIFO order, and as long as a task is not destroyed:
1. a task that no longer has memory that can be collapsed into hugepages
is still scanned every time.
2. a task at the front of the khugepaged scan list is still scanned first,
even if it is cold.
3. every task is scanned at intervals of khugepaged_scan_sleep_millisecs
(default 10s). If the two cases above are always scanned first, useful
scans have to wait a long time.
For the first case, when the memory is either SCAN_PMD_MAPPED or
SCAN_NO_PTE_TABLE or SCAN_PTE_MAPPED_HUGEPAGE [5], just skip it.
For the second case, if the user has explicitly informed us via
MADV_FREE that these folios will be freed, simply skip them.
Create three tasks[2]: hot1 -> cold -> hot2. After all three tasks are
created, each allocates 128 MB of memory. The hot1 and hot2 tasks
continuously access their 128 MB, while the cold task only accesses its
memory briefly and then calls madvise(MADV_FREE). Here are the performance
test results:
(For throughput, higher is better; for the other metrics, lower is better)
Add mm_khugepaged_scan event to track the total time for full scan and the
total number of pages scanned of khugepaged.
Link: https://lkml.kernel.org/r/20260221093918.1456187-2-vernon2gm@gmail.com Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zhongqiu Han [Fri, 30 Jan 2026 09:37:29 +0000 (17:37 +0800)]
mm/kmemleak: use PF_KTHREAD flag to detect kernel threads
Replace the current->mm check with PF_KTHREAD flag for more reliable
kernel thread detection in scan_should_stop(). The PF_KTHREAD flag is the
standard way to identify kernel threads and is not affected by temporary
mm borrowing via kthread_use_mm() (although kmemleak does not currently
encounter such cases, this makes the code more robust).
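The difference between the two checks can be shown with a toy task structure. This is a sketch under assumptions: TOY_PF_KTHREAD is an illustrative bit standing in for PF_KTHREAD, and the struct and helpers are hypothetical rather than the kernel's task_struct API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PF_KTHREAD 0x1u /* illustrative bit, stands in for PF_KTHREAD */

struct toy_task {
	unsigned int flags;
	const void *mm; /* non-NULL for user tasks, or a kthread borrowing an mm */
};

/* Old-style check: a task without an mm is assumed to be a kernel thread. */
bool toy_is_kthread_by_mm(const struct toy_task *t)
{
	return t->mm == NULL;
}

/* Flag-based check: unaffected by a temporarily borrowed mm. */
bool toy_is_kthread_by_flag(const struct toy_task *t)
{
	return (t->flags & TOY_PF_KTHREAD) != 0;
}

/* A kernel thread that has borrowed an mm fools only the mm-based check. */
void toy_kthread_selfcheck(void)
{
	int borrowed_mm = 0; /* stands in for an mm adopted via kthread_use_mm() */
	struct toy_task kt = { .flags = TOY_PF_KTHREAD, .mm = &borrowed_mm };

	assert(!toy_is_kthread_by_mm(&kt));
	assert(toy_is_kthread_by_flag(&kt));
}
```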
Since commit f1879e8a0c60 ("mm, swap: never bypass the swap cache even for
SWP_SYNCHRONOUS_IO"), all swap-in operations go through the swap cache,
including those from SWP_SYNCHRONOUS_IO devices like zram. This means
the workaround for swap cache bypassing introduced by commit 25cd241408a2
("mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices") is no longer
needed. Remove it, but keep the comments that are still helpful.
Link: https://lkml.kernel.org/r/20260202-zswap-syncio-cleanup-v1-1-86bb24a64521@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev> Acked-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Cc: Baoquan He <bhe@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
qinyu [Tue, 3 Feb 2026 10:26:49 +0000 (18:26 +0800)]
mm/page_idle.c: remove redundant mmu notifier in aging code
Currently, mmu_notifier_clear_young() immediately follows
pmdp_clear_young_notify(), which internally calls
mmu_notifier_clear_young(), so the notifier is invoked twice, which is
redundant. Switch to the non-notify variant to stay consistent with the
ptep aging code.
Link: https://lkml.kernel.org/r/20260203102649.2486836-1-qin.yuA@h3c.com Signed-off-by: qinyu <qin.yuA@h3c.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: SeongJae Park <sj@kernel.org> Acked-by: David Hildenbrand (arm) <david@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Li RongQing [Wed, 4 Feb 2026 08:09:37 +0000 (03:09 -0500)]
mm/mmu_notifiers: use hlist_for_each_entry_srcu() for SRCU list traversal
The mmu_notifier_subscriptions list is protected by SRCU. While the
current code uses hlist_for_each_entry_rcu() with an explicit SRCU lockdep
check, it is more appropriate to use the dedicated
hlist_for_each_entry_srcu() macro.
This change aligns the code with the preferred kernel API for
SRCU-protected lists, improving code clarity and ensuring that the
synchronization method is explicitly documented by the iterator name
itself.
Link: https://lkml.kernel.org/r/20260204080937.2472-1-lirongqing@baidu.com Signed-off-by: Li RongQing <lirongqing@baidu.com> Acked-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vernon Yang [Sat, 7 Feb 2026 08:16:13 +0000 (16:16 +0800)]
mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY
When an mm with the MMF_DISABLE_THP_COMPLETELY flag is detected during
scanning, directly set khugepaged_scan.mm_slot to the next mm_slot to
avoid a redundant operation.
Without this patch, khugepaged_scan.mm_slot is only set to the next
mm_slot the next time khugepaged_scan_mm_slot() is entered.
With this patch, khugepaged_scan.mm_slot is set to the next mm_slot
directly.
Link: https://lkml.kernel.org/r/20260207081613.588598-6-vernon2gm@gmail.com Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Chen Ni [Wed, 11 Feb 2026 06:43:11 +0000 (14:43 +0800)]
selftests/mm: remove duplicate include of unistd.h
Remove duplicate inclusion of unistd.h in memory-failure.c to clean up
redundant code.
Link: https://lkml.kernel.org/r/20260211064311.2981726-1-nichen@iscas.ac.cn Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Dev Jain <dev.jain@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>