Baolin Wang [Mon, 9 Feb 2026 14:07:28 +0000 (22:07 +0800)]
mm: rmap: support batched unmapping for file large folios
Similar to folio_referenced_one(), we can apply batched unmapping for file
large folios to optimize the performance of file folio reclamation.
Barry previously implemented batched unmapping for lazyfree anonymous
large folios[1] and did not further optimize anonymous large folios or
file-backed large folios at that stage. As for file-backed large folios,
the batched unmapping support is relatively straightforward, as we only
need to clear the consecutive (present) PTE entries for file-backed large
folios.
Note that batched unmapping is not yet supported for the uffd case, so
we still fall back to per-page unmapping there.
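As a rough illustration of the idea above, the following is a user-space toy model and not the code in mm/rmap.c. The toy_pte type and both helper names are hypothetical, and the policy shown (batch only file-backed, non-uffd mappings, across present entries that map consecutive page frames) is an assumption drawn from this description. The lazyfree anonymous case handled by Barry's earlier work is deliberately not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page-table entry: NOT the kernel's pte_t. */
struct toy_pte {
	bool present;
	unsigned long pfn;
};

/*
 * Return how many entries starting at ptes[0] could be unmapped as one
 * batch. Hypothetical policy: only file-backed mappings without uffd are
 * batched, and only across present entries mapping consecutive frames.
 */
static size_t toy_unmap_batch_size(const struct toy_pte *ptes, size_t max_nr,
				   bool file_backed, bool uffd)
{
	size_t nr = 1;

	assert(max_nr >= 1 && ptes[0].present);

	if (!file_backed || uffd)
		return 1;	/* fall back to per-page unmapping */

	while (nr < max_nr && ptes[nr].present &&
	       ptes[nr].pfn == ptes[0].pfn + nr)
		nr++;
	return nr;
}

/* Clear nr entries in one pass and return the number cleared. */
static size_t toy_unmap_batch(struct toy_pte *ptes, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		ptes[i].present = false;
		ptes[i].pfn = 0;
	}
	return nr;
}
```

With entries mapping frames 100, 101 and 102 followed by a non-present entry, the batch size is 3 for a plain file mapping and 1 once uffd is involved.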
Performance testing:
Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and
try to reclaim 8G file-backed folios via the memory.reclaim interface. I
can observe 75% performance improvement on my Arm64 32-core server (and
50%+ improvement on my X86 machine) with this patch.
W/o patch:
real 0m1.018s
user 0m0.000s
sys 0m1.018s
W/ patch:
real 0m0.249s
user 0m0.000s
sys 0m0.249s
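One way to read the percentages quoted in this series is as the relative reduction in elapsed time between the two runs. The helper below is only a sketch of that arithmetic; its name is hypothetical and it is not part of the patch.

```c
#include <assert.h>

/* Percentage reduction in elapsed time going from `before` to `after`. */
static double time_reduction_pct(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

Under that reading, 1.018s down to 0.249s is a reduction of roughly 75.5%.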
[1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u
Link: https://lkml.kernel.org/r/b53a16f67c93a3fe65e78092069ad135edf00eff.1770645603.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: Barry Song <baohua@kernel.org> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Mon, 9 Feb 2026 14:07:27 +0000 (22:07 +0800)]
arm64: mm: implement the architecture-specific clear_flush_young_ptes()
Implement the Arm64 architecture-specific clear_flush_young_ptes() to
enable batched checking of young flags and TLB flushing, improving
performance during large folio reclamation.
Performance testing:
Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and
try to reclaim 8G file-backed folios via the memory.reclaim interface. I
can observe 33% performance improvement on my Arm64 32-core server (and
10%+ improvement on my X86 machine). Meanwhile, the hotspot
folio_check_references() dropped from approximately 35% to around 5%.
W/o patchset:
real 0m1.518s
user 0m0.000s
sys 0m1.518s
W/ patchset:
real 0m1.018s
user 0m0.000s
sys 0m1.018s
Link: https://lkml.kernel.org/r/ce749fbae3e900e733fa104a16fcb3ca9fe4f9bd.1770645603.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Mon, 9 Feb 2026 14:07:26 +0000 (22:07 +0800)]
arm64: mm: support batch clearing of the young flag for large folios
Currently, contpte_ptep_test_and_clear_young() and
contpte_ptep_clear_flush_young() only clear the young flag and flush TLBs
for PTEs within the contiguous range. To support batch PTE operations for
other sized large folios in the following patches, add a new parameter
to specify the number of PTEs that map consecutive pages of the same large
folio in a single VMA and a single page table.
While we are at it, rename the functions to maintain consistency with
other contpte_*() functions.
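To make the "number of PTEs" parameter more concrete, here is a small user-space sketch of how a range of nr entries could be widened to whole contiguous blocks before operating on it. The constant of 16 entries per block and the helper name are assumptions made for illustration; whether the real contpte helpers align the range exactly like this is not stated here.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CONT_PTES 16	/* assumed number of entries per contiguous block */

/*
 * Expand the index range [idx, idx + nr) outwards so that it covers only
 * whole contiguous blocks: start is rounded down, end is rounded up.
 */
static void toy_cont_align_range(size_t idx, size_t nr,
				 size_t *start, size_t *end)
{
	assert(nr >= 1);

	*start = idx - (idx % TOY_CONT_PTES);
	*end = ((idx + nr + TOY_CONT_PTES - 1) / TOY_CONT_PTES) * TOY_CONT_PTES;
}
```

For example, 30 entries starting at index 20 touch blocks two to four, so the aligned range is [16, 64).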
Link: https://lkml.kernel.org/r/5644250dcc0417278c266ad37118d27f541fd052.1770645603.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Mon, 9 Feb 2026 14:07:24 +0000 (22:07 +0800)]
mm: rmap: support batched checks of the references for large folios
Patch series "support batch checking of references and unmapping for large
folios", v6.
Currently, folio_referenced_one() always checks the young flag for each
PTE sequentially, which is inefficient for large folios. This
inefficiency is especially noticeable when reclaiming clean file-backed
large folios, where folio_referenced() is observed as a significant
performance hotspot.
Moreover, on the Arm64 architecture, which supports contiguous PTEs, there is
already an optimization to clear the young flags for PTEs within a
contiguous range. However, this is not sufficient. We can extend this to
perform batched operations for the entire large folio (which might exceed
the contiguous range: CONT_PTE_SIZE).
Similar to folio_referenced_one(), we can also apply batched unmapping for
large file folios to optimize the performance of file folio reclamation.
By supporting batched checking of the young flags, flushing TLB entries,
and unmapping, I observed significant performance improvements in my
performance tests for file folio reclamation. Please check the
performance data in the commit message of each patch.
This patch (of 5):
Currently, folio_referenced_one() always checks the young flag for each
PTE sequentially, which is inefficient for large folios. This
inefficiency is especially noticeable when reclaiming clean file-backed
large folios, where folio_referenced() is observed as a significant
performance hotspot.
Moreover, on the Arm64 architecture, which supports contiguous PTEs, there is
already an optimization to clear the young flags for PTEs within a
contiguous range. However, this is not sufficient. We can extend this to
perform batched operations for the entire large folio (which might exceed
the contiguous range: CONT_PTE_SIZE).
Introduce a new API: clear_flush_young_ptes() to facilitate batched
checking of the young flags and flushing TLB entries, thereby improving
performance during large folio reclamation. It will be overridden by
architectures that implement a more efficient batch operation in the
following patches.
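The shape of such a batched helper can be sketched in user space as below. This is a toy model only: toy_pte, both function names and the flush counter are hypothetical, and the single flush per batch reflects the intent of batching rather than what any particular generic fallback does (a plain per-entry loop may well flush once per entry).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page-table entry carrying only an accessed ("young") flag. */
struct toy_pte {
	bool young;
};

/* Test and clear the young flag of one entry; returns the old value. */
static bool toy_test_and_clear_young(struct toy_pte *pte)
{
	bool young = pte->young;

	pte->young = false;
	return young;
}

/*
 * Batched variant: clear the young flag on nr consecutive entries and
 * report whether any of them was young. *flushes counts simulated TLB
 * flushes, one per batch that contained a young entry.
 */
static bool toy_clear_flush_young_batch(struct toy_pte *ptes, size_t nr,
					size_t *flushes)
{
	bool young = false;

	assert(nr >= 1);
	for (size_t i = 0; i < nr; i++)
		young |= toy_test_and_clear_young(&ptes[i]);

	if (young)
		(*flushes)++;
	return young;
}
```

A caller checking references for a large folio would then make one call per batch of consecutive entries instead of one call per entry.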
While we are at it, rename ptep_clear_flush_young_notify() to
clear_flush_young_ptes_notify() to indicate that this is a batch
operation.
Link: https://lkml.kernel.org/r/cover.1770645603.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/12132694536834262062d1fb304f8f8a064b6750.1770645603.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Barry Song <baohua@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Thu, 22 Jan 2026 16:06:22 +0000 (16:06 +0000)]
tools/testing/vma: add VMA userland tests for VMA flag functions
Now we have the capability to test the new helpers for the bitmap VMA
flags in userland, do so.
We also update the Makefile such that both VMA and (while we're here)
mm_struct flag sizes can be customised on build. We default to 128-bit to
enable testing of flags above word size even on 64-bit systems.
We add userland tests to ensure that we do not regress VMA flag behaviour
with the introduction of bitmap VMA flags, nor accidentally introduce
unexpected results due to, for instance, higher bit values not being
correctly cleared/set.
As part of this change, make __mk_vma_flags() a custom function so we can
handle specifying invalid VMA bits. This is purposeful so we can have the
VMA tests work at lower and higher number of VMA flags without having to
duplicate code too much.
Link: https://lkml.kernel.org/r/7fe6afe9c8c61e4d3cfc9a2d50a5d24da8528e68.1769097829.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Yury Norov <ynorov@nvidia.com> Cc: Chris Mason <clm@fb.com> Cc: Pedro Falcato <pfalcato@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Thu, 22 Jan 2026 16:06:21 +0000 (16:06 +0000)]
tools/testing/vma: separate out vma_internal.h into logical headers
The vma_internal.h file is becoming entirely unmanageable. It combines
duplicated kernel implementation logic that needs to be kept in-sync with
the kernel, stubbed out declarations that we simply ignore for testing
purposes and custom logic added to aid testing.
If we separate each of the three things into separate headers it makes
things far more manageable, so do so:
* include/stubs.h contains the stubbed declarations,
* include/dup.h contains the duplicated kernel declarations, and
* include/custom.h contains declarations customised for testing.
[lorenzo.stoakes@oracle.com: avoid a duplicate struct define] Link: https://lkml.kernel.org/r/1e032732-61c3-485c-9aa7-6a09016fefc1@lucifer.local Link: https://lkml.kernel.org/r/dd57baf5b5986cb96a167150ac712cbe804b63ee.1769097829.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Yury Norov <ynorov@nvidia.com> Cc: Chris Mason <clm@fb.com> Cc: Pedro Falcato <pfalcato@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Thu, 22 Jan 2026 16:06:20 +0000 (16:06 +0000)]
tools/testing/vma: separate VMA userland tests into separate files
So far the userland VMA tests have been established as a rough expression
of what's been possible.
Adapt it into a more usable form by separating out tests and shared
helper functions.
Since we test functions that are declared statically in mm/vma.c, we make
use of the trick of #include'ing kernel C files directly.
In order for the tests to continue to function, we must therefore also
#include the test files in the tests/ directory this way.
We try to keep as much shared logic actually modularised into a separate
compilation unit in shared.c, however the merge_existing() and
attach_vma() helpers rely on statically declared mm/vma.c functions so
these must be declared in main.c.
Link: https://lkml.kernel.org/r/a0455ccfe4fdcd1c962c64f76304f612e5662a4e.1769097829.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Yury Norov <ynorov@nvidia.com> Cc: Chris Mason <clm@fb.com> Cc: Pedro Falcato <pfalcato@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Thu, 22 Jan 2026 16:06:19 +0000 (16:06 +0000)]
mm: make vm_area_desc utilise vma_flags_t only
Now we have eliminated all uses of vm_area_desc->vm_flags, eliminate this
field, and have mmap_prepare users utilise the vma_flags_t
vm_area_desc->vma_flags field only.
As part of this change we alter is_shared_maywrite() to accept a
vma_flags_t parameter, and introduce is_shared_maywrite_vm_flags() for use
with legacy vm_flags_t flags.
We also update struct mmap_state to add a union between vma_flags and
vm_flags temporarily until the mmap logic is also converted to using
vma_flags_t.
Also update the VMA userland tests to reflect this change.
Link: https://lkml.kernel.org/r/fd2a2938b246b4505321954062b1caba7acfc77a.1769097829.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Yury Norov <ynorov@nvidia.com> Cc: Chris Mason <clm@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Thu, 22 Jan 2026 16:06:18 +0000 (16:06 +0000)]
mm: update all remaining mmap_prepare users to use vma_flags_t
We will shortly be removing the vm_flags_t field from vm_area_desc so we
need to update all mmap_prepare users to only use the desc->vma_flags
field.
This patch achieves that and makes all ancillary changes required to make
this possible.
This lays the groundwork for future work to eliminate the use of
vm_flags_t in vm_area_desc altogether and more broadly throughout the
kernel.
While we're here, we take the opportunity to replace VM_REMAP_FLAGS with
VMA_REMAP_FLAGS, the vma_flags_t equivalent.
No functional changes intended.
Link: https://lkml.kernel.org/r/fb1f55323799f09fe6a36865b31550c9ec67c225.1769097829.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Damien Le Moal <dlemoal@kernel.org> [zonefs] Acked-by: "Darrick J. Wong" <djwong@kernel.org> Acked-by: Pedro Falcato <pfalcato@suse.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Yury Norov <ynorov@nvidia.com> Cc: Chris Mason <clm@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Thu, 22 Jan 2026 16:06:17 +0000 (16:06 +0000)]
mm: update shmem_[kernel]_file_*() functions to use vma_flags_t
In order to be able to use only vma_flags_t in vm_area_desc we must adjust
shmem file setup functions to operate in terms of vma_flags_t rather than
vm_flags_t.
This patch makes this change and updates all callers to use the new
functions.
No functional changes intended.
[akpm@linux-foundation.org: comment fixes, per Baolin] Link: https://lkml.kernel.org/r/736febd280eb484d79cef5cf55b8a6f79ad832d2.1769097829.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Yury Norov <ynorov@nvidia.com> Cc: Chris Mason <clm@fb.com> Cc: Pedro Falcato <pfalcato@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Thu, 22 Jan 2026 16:06:16 +0000 (16:06 +0000)]
mm: update secretmem to use VMA flags on mmap_prepare
This patch updates secretmem to use the new vma_flags_t type which will
soon supersede vm_flags_t altogether.
In order to make this change we also have to update mlock_future_ok(): we
replace the vm_flags_t parameter with a simple boolean is_vma_locked one,
which also simplifies the invocation here.
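The effect of taking a boolean rather than a flags value can be sketched with a user-space toy. The function name, its parameter list and the 4 KiB page size are assumptions for illustration, and the real function's capability and rlimit handling is not modelled; the point is only that the check no longer depends on which flag representation the caller holds.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SHIFT 12	/* assumed 4 KiB pages */

/*
 * Toy limit check: would locking `bytes` more keep us within the limit?
 * The caller decides is_vma_locked from whatever flag type it has.
 */
static bool toy_mlock_future_ok(unsigned long locked_pages,
				unsigned long limit_bytes,
				bool is_vma_locked, unsigned long bytes)
{
	assert(bytes > 0);

	if (!is_vma_locked)
		return true;	/* nothing gets locked, so no limit applies */

	return locked_pages + (bytes >> TOY_PAGE_SHIFT) <=
	       (limit_bytes >> TOY_PAGE_SHIFT);
}
```

A vm_flags_t caller and a vma_flags_t caller can both use this form by evaluating their own "is locked" test at the call site.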
This is laying the groundwork for eliminating the vm_flags_t in
vm_area_desc and more broadly throughout the kernel.
No functional changes intended.
[lorenzo.stoakes@oracle.com: fix check_brk_limits(), per Chris] Link: https://lkml.kernel.org/r/3aab9ab1-74b4-405e-9efb-08fc2500c06e@lucifer.local Link: https://lkml.kernel.org/r/a243a09b0a5d0581e963d696de1735f61f5b2075.1769097829.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Yury Norov <ynorov@nvidia.com> Cc: Chris Mason <clm@fb.com> Cc: Pedro Falcato <pfalcato@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Thu, 22 Jan 2026 16:06:15 +0000 (16:06 +0000)]
mm: update hugetlbfs to use VMA flags on mmap_prepare
In order to update all mmap_prepare users to utilise the new VMA flags
type vma_flags_t and associated helper functions, we start by updating
hugetlbfs which has a lot of additional logic that requires updating to
make this change.
This is laying the groundwork for eliminating the vm_flags_t from struct
vm_area_desc and using vma_flags_t only, which further lays the ground for
removing the deprecated vm_flags_t type altogether.
No functional changes intended.
Link: https://lkml.kernel.org/r/9226bec80c9aa3447cc2b83354f733841dba8a50.1769097829.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Yury Norov <ynorov@nvidia.com> Cc: Chris Mason <clm@fb.com> Cc: Pedro Falcato <pfalcato@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Thu, 22 Jan 2026 16:06:14 +0000 (16:06 +0000)]
mm: add basic VMA flag operation helper functions
Now we have the mk_vma_flags() macro helper which permits easy
specification of any number of VMA flags, add helper functions which
operate with vma_flags_t parameters.
This patch provides vma_flags_test[_mask](), vma_flags_set[_mask]() and
vma_flags_clear[_mask](), respectively testing, setting and clearing flags,
with the _mask variants accepting vma_flags_t parameters and the non-mask
variants implemented as macros which accept a list of flags.
This allows us to trivially test/set/clear aggregate VMA flag values as
necessary, for instance:
    if (vma_flags_test(&flags, VMA_READ_BIT, VMA_WRITE_BIT))
        goto readwrite;
We also add a function for testing that ALL flags are set for convenience,
e.g.:
    if (vma_flags_test_all(&flags, VMA_READ_BIT, VMA_MAYREAD_BIT)) {
        /* Both READ and MAYREAD flags set */
        ...
    }
The compiler generates optimal assembly for each such that they behave as
if the caller were setting the bitmap flags manually.
This is important for e.g. drivers which manipulate flag values rather
than a VMA's specific flag values.
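The difference between the "any" and "all" tests can be illustrated with a self-contained user-space model. Everything here is hypothetical (the toy_flags type, the function names and the bit numbers used); it mirrors the semantics described above but is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_FLAG_WORDS 2	/* 128 bits modelled as two 64-bit words */

struct toy_flags {
	unsigned long long bits[TOY_FLAG_WORDS];
};

/* Build a mask with a single bit set; bit numbers are illustrative only. */
static struct toy_flags toy_flag(unsigned int bit)
{
	struct toy_flags f = { { 0 } };

	assert(bit < 64 * TOY_FLAG_WORDS);
	f.bits[bit / 64] |= 1ULL << (bit % 64);
	return f;
}

/* Combine two masks. */
static struct toy_flags toy_flags_or(struct toy_flags a, struct toy_flags b)
{
	for (size_t i = 0; i < TOY_FLAG_WORDS; i++)
		a.bits[i] |= b.bits[i];
	return a;
}

/* True if ANY bit of mask is set in flags. */
static bool toy_flags_test_mask(const struct toy_flags *flags,
				struct toy_flags mask)
{
	for (size_t i = 0; i < TOY_FLAG_WORDS; i++)
		if (flags->bits[i] & mask.bits[i])
			return true;
	return false;
}

/* True only if ALL bits of mask are set in flags. */
static bool toy_flags_test_all_mask(const struct toy_flags *flags,
				    struct toy_flags mask)
{
	for (size_t i = 0; i < TOY_FLAG_WORDS; i++)
		if ((flags->bits[i] & mask.bits[i]) != mask.bits[i])
			return false;
	return true;
}
```

With bits 0 and 100 set, testing the mask {0, 1} succeeds for "any" but fails for "all".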
We also add helpers for testing, setting and clearing flags for VMAs and
VMA descriptors to reduce boilerplate.
Also add the EMPTY_VMA_FLAGS define to aid initialisation of empty flags.
Finally, update the userland VMA tests to add the helpers there so they
can be utilised as part of userland testing.
Link: https://lkml.kernel.org/r/885d4897d67a6a57c0b07fa182a7055ad752df11.1769097829.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Yury Norov <ynorov@nvidia.com> Cc: Chris Mason <clm@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The bitmap_subset() and bitmap_andnot() functions are not present in the
tools version of include/linux/bitmap.h, so add them as subsequent patches
implement test code that requires them.
We also add the missing __bitmap_subset() to tools/lib/bitmap.c.
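For readers unfamiliar with these two operations, the sketch below shows what they compute over an array of words: a subset test, and an "and not" that reports whether any bit survives. The toy_ names are hypothetical stand-ins written for this illustration and are not the tools/ implementation.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define TOY_BITS_TO_LONGS(n) \
	(((n) + TOY_BITS_PER_LONG - 1) / TOY_BITS_PER_LONG)

/* Mask selecting the valid bits of the last word of an nbits-wide bitmap. */
static unsigned long toy_last_word_mask(size_t nbits)
{
	size_t rem = nbits % TOY_BITS_PER_LONG;

	return rem ? (1UL << rem) - 1 : ~0UL;
}

/* True if every bit set in a is also set in b. */
static bool toy_bitmap_subset(const unsigned long *a, const unsigned long *b,
			      size_t nbits)
{
	size_t words = TOY_BITS_TO_LONGS(nbits);

	assert(nbits > 0);
	for (size_t i = 0; i < words; i++) {
		unsigned long extra = a[i] & ~b[i];

		if (i == words - 1)
			extra &= toy_last_word_mask(nbits);
		if (extra)
			return false;
	}
	return true;
}

/* dst = a & ~b; returns true if any bit remains set in dst. */
static bool toy_bitmap_andnot(unsigned long *dst, const unsigned long *a,
			      const unsigned long *b, size_t nbits)
{
	size_t words = TOY_BITS_TO_LONGS(nbits);
	unsigned long any = 0;

	assert(nbits > 0);
	for (size_t i = 0; i < words; i++) {
		dst[i] = a[i] & ~b[i];
		if (i == words - 1)
			dst[i] &= toy_last_word_mask(nbits);
		any |= dst[i];
	}
	return any != 0;
}
```

These are the building blocks one would expect behind an "all flags set" test (subset) and a "clear these flags" operation (and-not) on a bitmap-backed flags type.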
Link: https://lkml.kernel.org/r/0fd0d4ec868297f522003cb4b5898b53b498805b.1769097829.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Yury Norov <ynorov@nvidia.com> Cc: Chris Mason <clm@fb.com> Cc: Pedro Falcato <pfalcato@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Thu, 22 Jan 2026 16:06:12 +0000 (16:06 +0000)]
mm: add mk_vma_flags() bitmap flag macro helper
This patch introduces the mk_vma_flags() macro helper to allow easy
manipulation of VMA flags utilising the new bitmap representation of VMA
flags defined by the vma_flags_t type.
It is a variadic macro which provides a bitwise-or'd representation of all
the individual VMA flags specified.
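A variadic macro of this kind can be sketched in user space as follows. The toy_ names, the 128-bit size and the use of plain int bit numbers are assumptions for illustration; the sketch shows one possible way to turn an argument list into a bitmap value and should not be read as the kernel's definition.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NUM_FLAG_BITS 128
#define TOY_FLAG_WORDS (TOY_NUM_FLAG_BITS / 64)

struct toy_flags {
	unsigned long long bits[TOY_FLAG_WORDS];
};

/* Fold an array of bit numbers into a bitmap value. */
static struct toy_flags toy_mk_flags_from(size_t count, const int *bits)
{
	struct toy_flags flags = { { 0 } };

	for (size_t i = 0; i < count; i++) {
		assert(bits[i] >= 0 && bits[i] < TOY_NUM_FLAG_BITS);
		flags.bits[bits[i] / 64] |= 1ULL << (bits[i] % 64);
	}
	return flags;
}

/*
 * Variadic wrapper: the arguments become a compound-literal array, and
 * sizeof yields the argument count. At least one argument is required.
 */
#define toy_mk_flags(...)						\
	toy_mk_flags_from(sizeof((const int[]){ __VA_ARGS__ }) /	\
				  sizeof(int),				\
			  (const int[]){ __VA_ARGS__ })
```

Calling toy_mk_flags(0, 1, 100) yields a value with the low word equal to 3 and bit 36 of the high word set, since bit 100 lives at position 100 - 64 in the second word.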
Note that, while we maintain VM_xxx flags for backwards compatibility
until the conversion is complete, we define VMA flags of type vma_flag_t
using VMA_xxx_BIT to avoid confusing the two.
Testing has demonstrated that the compiler optimises this code such that
it generates the same assembly utilising this macro as it does if the
flags were specified manually, for instance:
Link: https://lkml.kernel.org/r/fde00df6ff7fb8c4b42cc0defa5a4924c7a1943a.1769097829.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Yury Norov <ynorov@nvidia.com> Cc: Chris Mason <clm@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Thu, 22 Jan 2026 16:06:11 +0000 (16:06 +0000)]
mm: rename vma_flag_test/set_atomic() to vma_test/set_atomic_flag()
In order to stay consistent between functions which manipulate a
vma_flags_t argument, of the form vma_flags_...(), and those which
manipulate a VMA (in this case the flags field of a VMA), rename
vma_flag_[test/set]_atomic() to vma_[test/set]_atomic_flag().
This lays the groundwork for adding VMA flag manipulation functions in a
subsequent commit.
Link: https://lkml.kernel.org/r/033dcf12e819dee5064582bced9b12ea346d1607.1769097829.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Yury Norov <ynorov@nvidia.com> Cc: Chris Mason <clm@fb.com> Cc: Pedro Falcato <pfalcato@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Thu, 22 Jan 2026 16:06:10 +0000 (16:06 +0000)]
mm/vma: remove __private sparse decoration from vma_flags_t
Patch series "mm: add bitmap VMA flag helpers and convert all mmap_prepare
to use them", v2.
We introduced the bitmap VMA type vma_flags_t in the aptly named commit 9ea35a25d51b ("mm: introduce VMA flags bitmap type") in order to permit
future growth in VMA flags and to prevent the asinine requirement that VMA
flags be available to 64-bit kernels only if they happened to use a bit
number above 32.
This is a long-term project as there are very many users of VMA flags
within the kernel that need to be updated in order to utilise this new
type.
In order to further this aim, this series adds a number of helper
functions to enable ordinary interactions with VMA flags - that is
testing, setting and clearing them.
In order to make working with VMA bit numbers less cumbersome this series
introduces the mk_vma_flags() helper macro which generates a vma_flags_t
from a variadic parameter list, e.g.:
Providing means of testing any flag, testing all flags, setting, and
clearing a specific vma_flags_t mask.
For convenience, helper macros are provided - vma_flags_test(),
vma_flags_set() and vma_flags_clear(), each of which utilise
mk_vma_flags() to make these operations easier, as well as an
EMPTY_VMA_FLAGS macro to make initialisation of an empty vma_flags_t value
easier, e.g.:
Since callers are often dealing with a vm_area_struct (VMA) or
vm_area_desc (VMA descriptor as used in .mmap_prepare) object, this series
further provides helpers for these - firstly vma_set_flags_mask() and
vma_set_flags() for a VMA:
Note that these do NOT ensure appropriate locks are taken and assume the
caller takes care of this.
For VMA descriptors this series adds vma_desc_[test, set,
clear]_flags_mask() and vma_desc_[test, set, clear]_flags() for a VMA
descriptor, e.g.:
    static int foo_mmap_prepare(struct vm_area_desc *desc)
    {
        ...
        vma_desc_set_flags(desc, VMA_SEQ_READ_BIT);
        vma_desc_clear_flags(desc, VMA_RAND_READ_BIT);
        ...
        if (vma_desc_test_flags(desc, VMA_SHARED_BIT)) {
            ...
        }
        ...
    }
With these helpers introduced, this series then updates all mmap_prepare
users to make use of the vma_flags_t vm_area_desc->vma_flags field rather
than the legacy vm_flags_t vm_area_desc->vm_flags field.
In order to do so, several other related functions need to be updated,
with separate patches for larger changes in hugetlbfs, secretmem and shmem
before finally removing vm_area_desc->vm_flags altogether.
This lays the foundations for future elimination of vm_flags_t and
associated defines and functionality altogether in the long run, and
elimination of the use of vm_flags_t in f_op->mmap() hooks in the near
term as mmap_prepare replaces these.
There is a useful synergy between the VMA flags and mmap_prepare work here
as with this change in place, converting f_op->mmap() to
f_op->mmap_prepare naturally also converts use of vm_flags_t to
vma_flags_t in all drivers which declare mmap handlers.
This accounts for the majority of the users of the legacy vm_flags_*()
helpers and thus a large number of drivers which need to interact with VMA
flags in general.
This series also updates the userland VMA tests to account for the change,
and adds unit tests for these helper functions to assert that they behave
as expected.
In order to facilitate this change in a sensible way, the series also
separates out the VMA unit tests into - code that is duplicated from the
kernel that should be kept in sync, code that is customised for test
purposes and code that is stubbed out.
We also separate out the VMA userland tests into separate files to make it
easier to manage and to provide a sensible baseline for adding the
userland tests for these helpers.
This patch (of 13):
We need to pass around these values and access them in a way that sparse
does not allow, as __private implies noderef, i.e. disallowing
dereference of the value, which manifests as sparse warnings even when
passed around benignly.
Link: https://lkml.kernel.org/r/cover.1769097829.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/64fa89f416f22a60ae74cfff8fd565e7677be192.1769097829.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Yury Norov <ynorov@nvidia.com> Cc: Chris Mason <clm@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Wed, 21 Jan 2026 16:49:46 +0000 (11:49 -0500)]
mm: use unmap_desc struct for freeing page tables
Pass through the unmap_desc to free_pgtables() because it has almost
everything necessary and is already on the stack.
Updates testing code as necessary.
No functional changes intended.
[Liam.Howlett@oracle.com: fix up unmap desc use on exit_mmap()] Link: https://lkml.kernel.org/r/20260210214214.364856-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20260121164946.2093480-12-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: SeongJae Park <sj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Wed, 21 Jan 2026 16:49:45 +0000 (11:49 -0500)]
mm/vma: use unmap_region() in vms_clear_ptes()
There is no need to open code vms_clear_ptes() now that the unmap_desc
struct is used.
Link: https://lkml.kernel.org/r/20260121164946.2093480-11-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: SeongJae Park <sj@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Wed, 21 Jan 2026 16:49:44 +0000 (11:49 -0500)]
mm/vma: use unmap_desc in exit_mmap() and vms_clear_ptes()
Convert vms_clear_ptes() to use unmap_desc to call unmap_vmas() instead of
the large argument list. The UNMAP_STATE() cannot be used because the vma
iterator in the vms does not point to the correct maple state
(mas_detach), and the tree_end will be set incorrectly. Setting up the
arguments manually avoids setting the struct up incorrectly and doing
extra work to get the correct pagetable range.
exit_mmap() also calls unmap_vmas() with many arguments. Using the
unmap_all_init() function to set the unmap descriptor for all vmas makes
this a bit easier to read.
Update to the vma test code is necessary to ensure testing continues to
function.
No functional changes intended.
Link: https://lkml.kernel.org/r/20260121164946.2093480-10-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: SeongJae Park <sj@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Wed, 21 Jan 2026 16:49:43 +0000 (11:49 -0500)]
mm: introduce unmap_desc struct to reduce function arguments
The unmap_region() code takes a number of arguments that could use better
documentation. With the addition of a descriptor for unmap (called
unmap_desc), the arguments become more self-documenting, and the struct
declaration gives a place to describe each of them.
No functional change intended.
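As a rough illustration of the descriptor idea, the sketch below bundles a set of range limits into one struct so that helpers take a single pointer instead of a long list of unsigned longs. The struct name, its fields and both helper functions are hypothetical and chosen for illustration only; they are not the kernel's actual unmap_desc layout.

```c
#include <assert.h>

/* Hypothetical descriptor: field names are illustrative, not the kernel's. */
struct unmap_desc_sketch {
	unsigned long vma_start;	/* first address to unmap */
	unsigned long vma_end;		/* end of the range to unmap (exclusive) */
	unsigned long pg_start;		/* lowest address whose page tables may be freed */
	unsigned long pg_end;		/* end of the page table range (exclusive) */
	unsigned long tree_end;		/* limit of the vma tree search */
};

/* Number of bytes covered by the vma range. */
static unsigned long unmap_len(const struct unmap_desc_sketch *d)
{
	assert(d->vma_start <= d->vma_end);
	return d->vma_end - d->vma_start;
}

/* Bytes of page table range that lie outside the vma range. */
static unsigned long pgtable_slack(const struct unmap_desc_sketch *d)
{
	assert(d->pg_start <= d->vma_start && d->vma_end <= d->pg_end);
	return (d->vma_start - d->pg_start) + (d->pg_end - d->vma_end);
}
```

With designated initializers (for example `.vma_start = 0x2000, .vma_end = 0x5000`), each value is named at the call site, which is the self-documenting property the message refers to.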
Link: https://lkml.kernel.org/r/20260121164946.2093480-9-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: SeongJae Park <sj@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Wed, 21 Jan 2026 16:49:42 +0000 (11:49 -0500)]
mm: change dup_mmap() recovery
When dup_mmap() fails during vma duplication or setup, don't write
the XA_ZERO entry in the vma tree. Instead, destroy the tree and free the
new resources, leaving an empty vma tree.
Using XA_ZERO introduced races where the vma could be found between
dup_mmap() dropping all locks and exit_mmap() taking the locks. The race
can occur because the mm can be reached through the other trees via
successfully copied vmas and other methods such as the swapoff code.
XA_ZERO was marking the location to stop vma removal and pagetable
freeing. The newly created arguments to the unmap_vmas() and
free_pgtables() serve this function.
Replacing the XA_ZERO entry use with the new argument list also means the
checks for xa_is_zero() are no longer necessary so these are also removed.
Note that dup_mmap() now also cleans up when ALL vmas are successfully
copied but dup_mmap() fails to completely set up some other aspect of the
duplication.
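The recovery strategy can be sketched in userspace terms: when a copy fails partway through, free what was built and leave the destination empty, rather than leaving a marker at the failure point. This is only a toy model under stated assumptions. The toy_tree type, the fixed slot array and the fail_at parameter are all hypothetical, and the locking that the real change depends on is not modelled.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_SLOTS 4

/* Toy stand-in for a vma tree: a fixed array of owned pointers. */
struct toy_tree {
	int *slot[NR_SLOTS];	/* NULL means "no entry" */
	size_t nr;		/* number of populated slots */
};

/* Free every entry and leave the tree empty. */
static void toy_tree_destroy(struct toy_tree *t)
{
	for (size_t i = 0; i < NR_SLOTS; i++) {
		free(t->slot[i]);
		t->slot[i] = NULL;
	}
	t->nr = 0;
}

/*
 * Copy src into dst, which must be empty. fail_at simulates an allocation
 * failure at that index (pass NR_SLOTS for no simulated failure). On
 * failure dst is emptied before returning, so no partial state remains.
 */
static int toy_tree_dup(const struct toy_tree *src, struct toy_tree *dst,
			size_t fail_at)
{
	assert(dst->nr == 0);
	for (size_t i = 0; i < src->nr; i++) {
		int *copy = (i == fail_at) ? NULL : malloc(sizeof(*copy));

		if (!copy) {
			toy_tree_destroy(dst);
			return -1;
		}
		*copy = *src->slot[i];
		dst->slot[i] = copy;
		dst->nr++;
	}
	return 0;
}
```

A caller that sees -1 can rely on dst being empty, so there is no failure marker for later code to check for.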
Link: https://lkml.kernel.org/r/20260121164946.2093480-8-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: SeongJae Park <sj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Wed, 21 Jan 2026 16:49:41 +0000 (11:49 -0500)]
mm/vma: add page table limit to unmap_region()
The unmap_region() calls need to pass through the page table limit for a
future patch.
No functional changes intended.
Link: https://lkml.kernel.org/r/20260121164946.2093480-7-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: SeongJae Park <sj@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Wed, 21 Jan 2026 16:49:40 +0000 (11:49 -0500)]
mm/memory: add tree limit to free_pgtables()
The ceiling and tree search limit need to be different arguments for the
future change in the failed fork attempt. The ceiling and floor variables
are not very descriptive, so change them to pg_start/pg_end.
Add a new argument for the vma_end to the function, as it will differ
from the pg_end in the later patches in the series.
Add a kernel doc about the free_pgtables() function.
Test code also updated.
No functional changes intended.
Link: https://lkml.kernel.org/r/20260121164946.2093480-6-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: SeongJae Park <sj@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Wed, 21 Jan 2026 16:49:39 +0000 (11:49 -0500)]
mm/vma: add limits to unmap_region() for vmas
Add a limit to the vma search instead of using the start and end of the
one passed in.
No functional changes intended.
Link: https://lkml.kernel.org/r/20260121164946.2093480-5-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: SeongJae Park <sj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Wed, 21 Jan 2026 16:49:38 +0000 (11:49 -0500)]
mm/mmap: abstract vma clean up from exit_mmap()
Create the new function tear_down_vmas() to remove a range of vmas.
exit_mmap() will be removing all the vmas.
This is necessary for future patches.
No functional changes intended.
Link: https://lkml.kernel.org/r/20260121164946.2093480-4-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: SeongJae Park <sj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Wed, 21 Jan 2026 16:49:37 +0000 (11:49 -0500)]
mm/mmap: move exit_mmap() trace point
Move the trace point later in the function so that it is not skipped in
the event of a failed fork.
Link: https://lkml.kernel.org/r/20260121164946.2093480-3-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: SeongJae Park <sj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Wed, 21 Jan 2026 16:49:36 +0000 (11:49 -0500)]
mm: relocate the page table ceiling and floor definitions
Patch series "Remove XA_ZERO from error recovery of dup_mmap()", v3.
It is possible that the dup_mmap() call fails on allocating or setting up
a vma after the maple tree of the oldmm is copied. Today, that failure
point is marked by inserting an XA_ZERO entry over the failure point so
that the exact location does not need to be communicated through to
exit_mmap().
However, a race exists in the tear down process because the dup_mmap()
drops the mmap lock before exit_mmap() can remove the partially set up vma
tree. This means that other tasks may get to the mm tree and find the
invalid vma pointer (since it's an XA_ZERO entry), even though the mm is
marked as MMF_OOM_SKIP and MMF_UNSTABLE.
To remove the race fully, the tree must be cleaned up before dropping the
lock. This is accomplished by extracting the vma cleanup in exit_mmap()
and changing the required functions to pass through the vma search limit.
Any other tree modifications would require extra cycles which should be
spent on freeing memory.
This does run the risk of increasing the possibility of finding no vmas
(which is already possible!) in code that isn't careful.
The final four patches are to address the excessive argument lists being
passed between the functions. Using the struct unmap_desc also allows
some special-case code to be removed in favour of the struct setup
differences.
This patch (of 11):
pgtables.h defines a fallback for ceiling and floor of the page tables
within the CONFIG_MMU section. Moving the definitions to outside the
CONFIG_MMU allows for using them in generic code.
[akpm@linux-foundation.org: remove stray newline, per SeongJae] Link: https://lkml.kernel.org/r/20260121164946.2093480-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20260121164946.2093480-2-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: SeongJae Park <sj@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bing Jiao [Wed, 14 Jan 2026 20:53:03 +0000 (20:53 +0000)]
mm/vmscan: select the closest preferred node in demote_folio_list()
The preferred demotion node (migration_target_control.nid) should be the
one closest to the source node to minimize migration latency. Currently,
a discrepancy exists where demote_folio_list() randomly selects an allowed
node if the preferred node from next_demotion_node() is not set in
mems_effective.
To address it, update next_demotion_node() to select the preferred target
from among the allowed nodes, and to return the closest demotion target if
none of the preferred nodes is in mems_effective.
It ensures that the preferred demotion target is consistently the closest
available node to the source node.
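The selection rule described above can be sketched as follows: intersect the demotion targets with the allowed nodes, then pick the candidate with the smallest distance from the source node. This is a minimal sketch under stated assumptions. The bitmap representation, the dist table, NR_NODES, NO_NODE and the function name are all hypothetical and do not reflect the kernel's nodemask or demotion APIs.

```c
#include <assert.h>
#include <limits.h>

#define NR_NODES 6
#define NO_NODE  (-1)

/*
 * Return the node in (targets & allowed) closest to src, or NO_NODE if the
 * intersection is empty. Masks have bit n set for node n; dist[src][n] is a
 * hypothetical distance table.
 */
static int closest_allowed_target(int src, unsigned int targets,
				  unsigned int allowed,
				  const int dist[NR_NODES][NR_NODES])
{
	unsigned int candidates = targets & allowed;
	int best = NO_NODE, best_dist = INT_MAX;

	assert(src >= 0 && src < NR_NODES);
	for (int n = 0; n < NR_NODES; n++) {
		if (!(candidates & (1u << n)))
			continue;
		if (dist[src][n] < best_dist) {
			best_dist = dist[src][n];
			best = n;
		}
	}
	return best;
}
```

Because the intersection is taken before the distance comparison, a nearer node that is not allowed can never be returned.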
[akpm@linux-foundation.org: fix comment typo, per Shakeel] Link: https://lkml.kernel.org/r/20260114205305.2869796-3-bingjiao@google.com Signed-off-by: Bing Jiao <bingjiao@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Gregory Price <gourry@gourry.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bing Jiao [Wed, 14 Jan 2026 20:53:02 +0000 (20:53 +0000)]
mm/vmscan: fix demotion targets checks in reclaim/demotion
Patch series "mm/vmscan: fix demotion targets checks in reclaim/demotion",
v9.
This patch series addresses two issues in demote_folio_list(),
can_demote(), and next_demotion_node() in reclaim/demotion.
1. demote_folio_list() and can_demote() do not correctly check
demotion target against cpuset.mems_effective, which will cause (a)
pages to be demoted to not-allowed nodes and (b) pages to fail demotion
even if the system still has allowed demotion nodes.
Patch 1 fixes this bug by updating cpuset_node_allowed() and
mem_cgroup_node_allowed() to return effective_mems, allowing a direct
logical AND against demotion targets.
2. next_demotion_node() returns a preferred demotion target, but it
does not check the node against allowed nodes.
Patch 2 ensures that next_demotion_node() filters against the allowed
node mask and selects the closest demotion target to the source node.
This patch (of 2):
Fix two bugs in demote_folio_list() and can_demote() due to incorrect
demotion target checks against cpuset.mems_effective in reclaim/demotion.
Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
introduces the cpuset.mems_effective check and applies it to can_demote().
However:
1. It does not apply this check in demote_folio_list(), which leads
to situations where pages are demoted to nodes that are
explicitly excluded from the task's cpuset.mems.
2. It checks only the nodes in the immediate next demotion hierarchy
and does not check all allowed demotion targets in can_demote().
This can cause pages to never be demoted if the nodes in the next
demotion hierarchy are not set in mems_effective.
These bugs break resource isolation provided by cpuset.mems. This is
visible from userspace because pages can either fail to be demoted
entirely or are demoted to nodes that are not allowed in multi-tier memory
systems.
To address these bugs, update cpuset_node_allowed() and
mem_cgroup_node_allowed() to return effective_mems, allowing a direct
logical AND against demotion targets. Also update can_demote()
and demote_folio_list() accordingly.
Bug 1 reproduction:
Assume a system with 4 nodes, where nodes 0-1 are top-tier and
nodes 2-3 are far-tier memory. All nodes have equal capacity.
Bug 2 reproduction:
Assume a system with 6 nodes, where nodes 0-2 are top-tier,
node 3 is a far-tier node, and nodes 4-5 are the farthest-tier nodes.
All nodes have equal capacity.
Test script:
echo 1 > /sys/kernel/mm/numa/demotion_enabled
mkdir /sys/fs/cgroup/test
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems
echo $$ > /sys/fs/cgroup/test/cgroup.procs
swapoff -a
# Expectation: Pages are demoted to Nodes 4-5
# Observation: No pages are demoted before oom.
stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2
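The Bug 2 topology above can be modelled with plain bitmasks to show why checking only the next tier is not enough. This is a sketch under stated assumptions: the mask type and both helper names are hypothetical, and a 32-bit integer stands in for the kernel's nodemask.

```c
#include <assert.h>

/* Hypothetical stand-in for a node mask: bit n set means node n. */
typedef unsigned int nodemask_sketch_t;

static nodemask_sketch_t node_bit(int nid)
{
	assert(nid >= 0 && nid < 32);
	return 1u << nid;
}

/* Demotion targets that are also permitted by mems_effective. */
static nodemask_sketch_t allowed_demotion_targets(nodemask_sketch_t targets,
						  nodemask_sketch_t mems_effective)
{
	return targets & mems_effective;
}
```

With mems_effective covering nodes 0-2 and 4-5, passing only the next tier (node 3) yields an empty mask, while passing all lower tiers (nodes 3-5) yields nodes 4-5, which matches the expectation stated in the test script.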
Link: https://lkml.kernel.org/r/20260114205305.2869796-1-bingjiao@google.com Link: https://lkml.kernel.org/r/20260114205305.2869796-2-bingjiao@google.com Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim") Signed-off-by: Bing Jiao <bingjiao@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Gregory Price <gourry@gourry.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/memory: handle non-split locks correctly in zap_empty_pte_table()
While we handle pte_lockptr() == pmd_lockptr() correctly in
zap_pte_table_if_empty(), we don't handle it in zap_empty_pte_table(),
making the spin_trylock() always fail and forcing us onto the slow path.
So let's handle the scenario where pte_lockptr() == pmd_lockptr() better,
which can only happen if CONFIG_SPLIT_PTE_PTLOCKS is not set.
This is only relevant once we unlock CONFIG_PT_RECLAIM on architectures
that are not x86-64.
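The problem can be illustrated with a toy non-recursive lock: a trylock on a lock the caller already holds can never succeed, so the case where both lock pointers are the same has to be recognised up front. This is only a sketch. The toy_lock type and all function names below are hypothetical, and the real code uses kernel spinlocks.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy non-recursive lock standing in for a spinlock. */
struct toy_lock {
	bool held;
};

static bool toy_trylock(struct toy_lock *l)
{
	if (l->held)
		return false;
	l->held = true;
	return true;
}

static void toy_unlock(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

/*
 * Acquire the PTE-table lock while already holding the PMD lock. If both
 * pointers refer to the same lock it is already held, so no trylock is
 * attempted. Returns true if the PTE table is now protected.
 */
static bool lock_pte_under_pmd(struct toy_lock *pml, struct toy_lock *ptl)
{
	assert(pml->held);
	if (ptl == pml)
		return true;
	return toy_trylock(ptl);
}

/* Release only what lock_pte_under_pmd() actually acquired. */
static void unlock_pte_under_pmd(struct toy_lock *pml, struct toy_lock *ptl)
{
	if (ptl != pml)
		toy_unlock(ptl);
}
```

Without the pointer comparison, the shared-lock case would always take the failure branch of the trylock, which corresponds to the slow path mentioned above.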
Link: https://lkml.kernel.org/r/20260119220708.3438514-3-david@kernel.org Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Some cleanups for the PTE table reclaim code, triggered by a false-positive
warning we might start to see once pt-reclaim is unlocked on
architectures besides x86-64.
This patch (of 2):
The pte-table reclaim code is only called from memory.c, while zapping
pages, and it better also stays that way in the long run. If we ever have
to call it from other files, we should expose proper high-level helpers
for zapping if the existing helpers are not good enough.
So, let's move the code over (it's not a lot) and clean it up a bit by:
- Renaming the functions.
- Dropping the "Check if it is empty PTE page" comment, which is now
self-explaining given the function name.
- Making zap_pte_table_if_empty() return whether zapping worked so the
caller can free it.
- Adding a comment in pte_table_reclaim_possible().
- Inlining free_pte() in the last remaining user.
- In zap_empty_pte_table(), switch from pmdp_get_lockless() to
pmdp_get(), as we are holding the PMD PT lock.
By moving the code over, compilers can also easily figure out when
zap_empty_pte_table() does not initialize the pmdval variable, avoiding
false-positive warnings about the variable possibly not being initialized.
Link: https://lkml.kernel.org/r/20260119220708.3438514-1-david@kernel.org Link: https://lkml.kernel.org/r/20260119220708.3438514-2-david@kernel.org Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Qi Zheng [Tue, 27 Jan 2026 12:13:01 +0000 (20:13 +0800)]
mm: make PT_RECLAIM depends on MMU_GATHER_RCU_TABLE_FREE
PT_RECLAIM can work on all architectures that support
MMU_GATHER_RCU_TABLE_FREE, except for those that have selected
HAVE_ARCH_TLB_REMOVE_TABLE, so make PT_RECLAIM depend on
MMU_GATHER_RCU_TABLE_FREE && !HAVE_ARCH_TLB_REMOVE_TABLE.
BTW, change PT_RECLAIM to be enabled by default, since nobody should want
to turn it off.
Link: https://lkml.kernel.org/r/83b034810935a9ff18e425b085e065bb0acb28f3.1769515122.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Richard Weinberger <richard@nod.at> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Qi Zheng [Tue, 27 Jan 2026 12:13:00 +0000 (20:13 +0800)]
mm: convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config
For architectures that define __HAVE_ARCH_TLB_REMOVE_TABLE, the page
tables at the pmd/pud level are generally not of struct ptdesc type and
do not have a pt_rcu_head member, so these architectures cannot support
PT_RECLAIM.
In preparation for enabling PT_RECLAIM on more architectures, convert
__HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config,
so that we can make conditional judgments in Kconfig.
Link: https://lkml.kernel.org/r/5ebfa3d4b56e63c6906bda5eccaa9f7194d3a86b.1769515122.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Tested-by: Andreas Larsson <andreas@gaisler.com> [sparc, UP&SMP] Acked-by: Andreas Larsson <andreas@gaisler.com> [sparc] Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Richard Weinberger <richard@nod.at> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Qi Zheng [Tue, 27 Jan 2026 12:12:59 +0000 (20:12 +0800)]
um: mm: enable MMU_GATHER_RCU_TABLE_FREE
On a 64-bit system, madvise(MADV_DONTNEED) may leave behind a large number
of empty PTE page table pages (such as 100GB+). To resolve this problem,
first enable MMU_GATHER_RCU_TABLE_FREE in preparation for enabling the
PT_RECLAIM feature.
Link: https://lkml.kernel.org/r/e2217546504668b8a87a39eb0e378839339a1bb4.1769515122.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Richard Weinberger <richard@nod.at> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Qi Zheng [Tue, 27 Jan 2026 12:12:58 +0000 (20:12 +0800)]
parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE
On a 64-bit system, madvise(MADV_DONTNEED) may leave behind a large number
of empty PTE page table pages (such as 100GB+). To resolve this problem,
first enable MMU_GATHER_RCU_TABLE_FREE in preparation for enabling the
PT_RECLAIM feature.
Link: https://lkml.kernel.org/r/b827939046dbc94bc7c585cdbed8522baab75b15.1769515122.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Helge Deller <deller@gmx.de> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Richard Weinberger <richard@nod.at> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Qi Zheng [Tue, 27 Jan 2026 12:12:57 +0000 (20:12 +0800)]
mips: mm: enable MMU_GATHER_RCU_TABLE_FREE
On a 64-bit system, madvise(MADV_DONTNEED) may leave behind a large number
of empty PTE page table pages (such as 100GB+). As a first step, enable
MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the PT_RECLAIM feature,
which resolves this problem.
Link: https://lkml.kernel.org/r/0d17f00a724f77aaca2da7c847acd490c3a47571.1769515122.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Richard Weinberger <richard@nod.at> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Qi Zheng [Tue, 27 Jan 2026 12:12:56 +0000 (20:12 +0800)]
LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE
On a 64-bit system, madvise(MADV_DONTNEED) may leave behind a large number
of empty PTE page table pages (such as 100GB+). As a first step, enable
MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the PT_RECLAIM feature,
which resolves this problem.
Link: https://lkml.kernel.org/r/bd1b11bc1a13686aeba81a40194f87b369d62661.1769515122.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Richard Weinberger <richard@nod.at> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Qi Zheng [Tue, 27 Jan 2026 12:12:55 +0000 (20:12 +0800)]
alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE
On a 64-bit system, madvise(MADV_DONTNEED) may leave behind a large number
of empty PTE page table pages (such as 100GB+). As a first step, enable
MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the PT_RECLAIM feature,
which resolves this problem.
Link: https://lkml.kernel.org/r/3380f40a89b73c488202c85f9a8abf99fb08543b.1769515122.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Magnus Lindholm <linmag7@gmail.com> [alpha] Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Qi Zheng [Tue, 27 Jan 2026 12:12:54 +0000 (20:12 +0800)]
mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h
Patch series "enable PT_RECLAIM on more 64-bit architectures", v4.
This series aims to enable PT_RECLAIM on more 64-bit architectures.
On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of
empty PTE page table pages (such as 100GB+). To resolve this problem, we
need to enable PT_RECLAIM, which depends on MMU_GATHER_RCU_TABLE_FREE.
Architectures that define their own __tlb_remove_table() cannot support
PT_RECLAIM, because their page tables are not of type struct ptdesc.
Therefore, this series first enables MMU_GATHER_RCU_TABLE_FREE on all
64-bit architectures, then converts __HAVE_ARCH_TLB_REMOVE_TABLE to
CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config, and finally makes PT_RECLAIM
depend on MMU_GATHER_RCU_TABLE_FREE && !HAVE_ARCH_TLB_REMOVE_TABLE. This
way, PT_RECLAIM can be enabled by default on most 64-bit architectures.
Of course, this will also be enabled on some 32-bit architectures that
already support MMU_GATHER_RCU_TABLE_FREE. That's fine, PT_RECLAIM works
well on all 32-bit architectures as well. Although the benefit isn't
significant, there's still memory that can be reclaimed. Perhaps
PT_RECLAIM can be enabled on all 32-bit architectures in the future.
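The dependency rule described above can be sketched as a small truth-table check. This is only an illustrative Python model: the function name and the configuration labels are hypothetical, and the real rule is expressed in Kconfig rather than in code like this.

```python
def pt_reclaim_selectable(mmu_gather_rcu_table_free: bool,
                          have_arch_tlb_remove_table: bool) -> bool:
    """Model of: MMU_GATHER_RCU_TABLE_FREE && !HAVE_ARCH_TLB_REMOVE_TABLE."""
    return mmu_gather_rcu_table_free and not have_arch_tlb_remove_table


# Hypothetical configurations, not statements about any real architecture.
configs = {
    "generic_table_free": (True, False),  # uses the common __tlb_remove_table()
    "own_table_free": (True, True),       # defines its own __tlb_remove_table()
    "no_rcu_table_free": (False, False),  # MMU_GATHER_RCU_TABLE_FREE not enabled
}

for name, (rcu_free, arch_remove) in configs.items():
    print(name, pt_reclaim_selectable(rcu_free, arch_remove))
```

Only the first configuration passes, which matches the intent that PT_RECLAIM is available wherever the generic table-freeing path is in use.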
This patch (of 8):
Generally, asm/tlb.h includes asm-generic/tlb.h, so change
mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h. This is a
preparation for enabling CONFIG_PT_RECLAIM on other architectures, such as
alpha.
Link: https://lkml.kernel.org/r/cover.1769515122.git.zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/befca537d10c6bf8d531b1ee0a8af1e3b31352b0.1769515122.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dev Jain <dev.jain@arm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Richard Weinberger <richard@nod.at> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Li RongQing [Fri, 30 Jan 2026 08:56:03 +0000 (03:56 -0500)]
mm/damon/stat: remove __read_mostly from memory_idle_ms_percentiles
The 'memory_idle_ms_percentiles' array in DAMON_STAT is updated frequently
by the kernel to reflect the latest idle time statistics. Marking it as
'__read_mostly' is inappropriate for data that is regularly written to, as
it can lead to cache pollution in the read-mostly section.
Remove the '__read_mostly' annotation to accurately reflect the
variable's usage pattern.
Currently, zsmalloc creates kmem_cache of handles and zspages for each
pool, which may be suboptimal from the memory usage point of view (extra
internal fragmentation per pool). Systems that create multiple zsmalloc
pools may benefit from shared common zsmalloc caches.
Make handles and zspages kmem caches global. The memory savings depend on
particular setup and data patterns and can be found via slabinfo.
Link: https://lkml.kernel.org/r/20260117025406.799428-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Brian Geffon <bgeffon@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tim Bird [Wed, 4 Feb 2026 21:31:01 +0000 (14:31 -0700)]
mm: add SPDX id lines to some mm source files
Some of the memory management source files are missing
SPDX-License-Identifier lines. Add appropriate IDs
to these files (mostly GPL-2.0, but one LGPL-2.1).
Sahil Chandna [Wed, 4 Feb 2026 18:54:08 +0000 (00:24 +0530)]
mm/zswap: use %pe to print error pointers
Use the %pe printk format specifier to report error pointers directly
instead of printing PTR_ERR() as a long value. This improves clarity and
produces more readable error messages.
This instance was flagged by the Coccinelle script
(misc/ptr_err_to_pe.cocci) as an opportunity to adopt %pe.
Found by: make coccicheck MODE=report M=mm/
No functional change intended.
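To illustrate the difference in the resulting messages, here is a minimal userspace sketch in Python. It is not kernel code: the two helper names and the "init failed" text are hypothetical, and Python's errno table stands in for the kernel's symbolic error names. The fallback to a plain number for unknown values mirrors the documented behaviour of %pe.

```python
import errno


def report_numeric(err: int) -> str:
    """Old style: the negative errno is printed as a plain number."""
    return f"init failed: {err}"


def report_symbolic(err: int) -> str:
    """%pe-like style: print the symbolic name, or the number if unknown."""
    name = errno.errorcode.get(-err)
    return f"init failed: -{name}" if name else f"init failed: {err}"


err = -errno.ENOMEM
print(report_numeric(err))
print(report_symbolic(err))
```

The symbolic form saves the reader from looking up the errno value by hand.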
Sahil Chandna [Wed, 4 Feb 2026 18:54:07 +0000 (00:24 +0530)]
mm/vmscan: use %pe to print error pointers
Use the %pe printk format specifier to report error pointers directly
instead of printing PTR_ERR() as a long value. This improves clarity and
produces more readable error messages.
This instance was flagged by the Coccinelle script
(misc/ptr_err_to_pe.cocci) as an opportunity to adopt %pe.
Found by: make coccicheck MODE=report M=mm/
No functional change intended.
Shakeel Butt [Fri, 30 Jan 2026 04:29:25 +0000 (20:29 -0800)]
mm: khugepaged: fix NR_FILE_PAGES and NR_SHMEM in collapse_file()
In META's fleet, we observed high-level cgroups showing zero file memcg
stats while their descendants had non-zero values. Investigation using
drgn revealed that these parent cgroups actually had negative file stats,
aggregated from their children.
This issue became more frequent after deploying thp-always more widely,
pointing to a correlation with THP file collapsing. The root cause is
that collapse_file() assumes old folios and the new THP belong to the same
node and memcg. When this assumption breaks, stats become skewed. The
bug affects not just memcg stats but also per-numa stats, and not just
NR_FILE_PAGES but also NR_SHMEM.
The assumption breaks in scenarios such as:
1. Small folios allocated on one node while the THP gets allocated on a
different node.
2. A package downloader running in one cgroup populates the page cache,
while a job in a different cgroup executes the downloaded binary.
3. A file shared between processes in different cgroups, where one
process faults in the pages and khugepaged (or madvise(COLLAPSE))
collapses them on behalf of the other.
Fix the accounting by explicitly incrementing stats for the new THP and
decrementing stats for the old folios being replaced.
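The effect of the broken assumption can be reproduced with a toy counter model. This is a sketch under stated assumptions, not the kernel's accounting code: the cgroup names, the constant NR_SMALL and the function run_scenario() are hypothetical, and plain per-cgroup counters stand in for the real per-memcg and per-node stats.

```python
from collections import Counter

NR_SMALL = 512  # assumed number of small folios replaced by one THP


def run_scenario(explicit_accounting: bool) -> Counter:
    """Per-cgroup file page counters after fault-in, collapse and eviction."""
    stats = Counter()

    # A downloader cgroup populates the page cache with small folios.
    stats["downloader"] += NR_SMALL

    # The collapse produces a THP that is charged to the job cgroup.
    if explicit_accounting:
        stats["job"] += NR_SMALL         # account the new THP
        stats["downloader"] -= NR_SMALL  # unaccount the replaced folios
    # Otherwise nothing is moved, as if both sides had the same owner.

    # Later the THP leaves the page cache and is unaccounted from its owner.
    stats["job"] -= NR_SMALL
    return stats


print(dict(run_scenario(explicit_accounting=False)))
print(dict(run_scenario(explicit_accounting=True)))
```

Without the explicit transfer, the job cgroup ends up negative while the downloader keeps a stale positive count. With it, both return to zero.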
Link: https://lkml.kernel.org/r/20260130042925.2797946-1-shakeel.butt@linux.dev Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Kiryl Shutsemau <kas@kernel.org> Acked-by: David Hildenbrand (arm) <david@kernel.org> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nico Pache <npache@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Song Liu <songliubraving@fb.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Justin Green [Wed, 28 Jan 2026 22:56:47 +0000 (17:56 -0500)]
mm: refactor vma_map_pages to use vm_insert_pages
vma_map_pages currently calls vm_insert_page on each individual page in
the mapping, which creates significant overhead because the spinlock is
taken and released for every page. Instead, batch insert the pages using
vm_insert_pages, which amortizes the cost of the spinlock.
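The amortization argument can be made concrete with a toy model that only counts lock acquisitions. Everything here is hypothetical (CountingLock, both insert helpers, and the batch size of 8). It does not reflect how vm_insert_pages actually chooses its batches; it only shows why fewer lock round trips are needed.

```python
class CountingLock:
    """Context manager that only counts how often it is taken."""

    def __init__(self):
        self.acquisitions = 0

    def __enter__(self):
        self.acquisitions += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        return False


def insert_per_page(table, lock, start, pages):
    for i, page in enumerate(pages):
        with lock:  # one lock round trip per page
            table[start + i] = page


def insert_batched(table, lock, start, pages, batch=8):
    for off in range(0, len(pages), batch):
        with lock:  # one lock round trip per batch
            for i, page in enumerate(pages[off:off + batch]):
                table[start + off + i] = page


pages = [f"page{i}" for i in range(32)]
slow_lock, fast_lock = CountingLock(), CountingLock()
slow_table, fast_table = {}, {}
insert_per_page(slow_table, slow_lock, 0, pages)
insert_batched(fast_table, fast_lock, 0, pages)
print(slow_lock.acquisitions, fast_lock.acquisitions)
```

Both variants build the same mapping; only the number of lock acquisitions differs.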
Tested by watching hardware accelerated video on an MTK ChromeOS
device. This particular path maps both a V4L2 buffer and a GEM allocated
buffer into userspace and converts the contents from one pixel format to
another. Both vb2_mmap() and mtk_gem_object_mmap() exercise this pathway.
Link: https://lkml.kernel.org/r/20260128225648.2938636-1-greenjustin@chromium.org Signed-off-by: Justin Green <greenjustin@chromium.org> Acked-by: Brian Geffon <bgeffon@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Arjun Roy <arjunroy@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Enze Li [Thu, 29 Jan 2026 10:08:45 +0000 (18:08 +0800)]
mm/damon: unify address range representation with damon_addr_range
Currently, DAMON defines two identical structures for representing address
ranges: damon_system_ram_region and damon_addr_range. Both structures
share the same semantic interpretation of a half-open interval [start,
end), where the start address is inclusive and the end address is
exclusive.
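As a quick illustration of the half-open convention, here is a minimal Python sketch. The class AddrRange and its methods are hypothetical stand-ins, not DAMON's actual structure definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AddrRange:
    start: int  # inclusive
    end: int    # exclusive

    def size(self) -> int:
        return self.end - self.start

    def contains(self, addr: int) -> bool:
        return self.start <= addr < self.end


low = AddrRange(0x1000, 0x3000)
high = AddrRange(0x3000, 0x4000)
print(hex(low.size()), low.contains(0x3000), high.contains(0x3000))
```

With half-open intervals, adjacent ranges share a boundary value without overlapping, and the size is simply end minus start.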
This duplication adds unnecessary redundancy and increases maintenance
overhead. This patch replaces all uses of damon_system_ram_region with
the more generic damon_addr_range structure, ensuring a unified type
representation for address ranges within the DAMON subsystem. The change
simplifies the codebase, improves readability, and avoids potential
inconsistencies in future modifications.
Yosry Ahmed [Wed, 21 Jan 2026 01:36:15 +0000 (01:36 +0000)]
mm: zswap: use SG list decompression APIs from zsmalloc
Use the new zs_obj_read_sg_*() APIs in zswap_decompress(), instead of
zs_obj_read_*() APIs returning a linear address. The SG list is passed
directly to the crypto API, simplifying the logic and dropping the
workaround that copies highmem addresses to a buffer. The crypto API
should internally linearize the SG list if needed.
This avoids the memcpy() in zsmalloc for objects spanning multiple pages,
although an equivalent operation will be done internally by acomp/scomp.
However, in the future compression algorithms could support handling
discontiguous SG lists, completely eliminating the copying for spanning
objects.
Zsmalloc fills an SG list up to 2 entries in size, so change the input SG
list to fit 2 entries.
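Why two entries suffice can be seen from a small sketch that splits an object into per-page segments. This is an assumption-laden Python model: sg_entries() is a hypothetical helper, offsets are taken relative to a contiguous run of pages, and a 4096-byte page size is assumed. An object no larger than one page can straddle at most one page boundary.

```python
PAGE_SIZE = 4096  # assumed page size for the illustration


def sg_entries(obj_offset, obj_len, page_size=PAGE_SIZE):
    """Split an object into (page_index, offset_in_page, length) segments."""
    if not 0 < obj_len <= page_size:
        raise ValueError("object must be between 1 byte and one page long")
    entries = []
    pos, remaining = obj_offset, obj_len
    while remaining:
        page, off = divmod(pos, page_size)
        chunk = min(remaining, page_size - off)
        entries.append((page, off, chunk))
        pos += chunk
        remaining -= chunk
    return entries


print(sg_entries(100, 200))   # fits inside one page
print(sg_entries(4000, 200))  # straddles a page boundary
```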
Update the incompressible entries path to use memcpy_from_sglist() to copy
the data to the folio. Opportunistically set dlen to PAGE_SIZE in the
same code path (rather than at the top of the function) to make it
clearer.
Drop the goto in zswap_compress() as the code is now simple enough for
an if-else statement instead. Rename 'decomp_ret' to 'ret' and reuse it
to hold the intermediate return value of crypto_acomp_decompress(), which
keeps line lengths manageable.
No functional change intended.
Link: https://lkml.kernel.org/r/20260121013615.2906368-1-yosry.ahmed@linux.dev Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Fri, 19 Dec 2025 19:43:48 +0000 (03:43 +0800)]
mm, swap: remove no longer needed _swap_info_get
There are now only two users of _swap_info_get after consolidating these
callers, folio_free_swap and swp_swapcount.
folio_free_swap already holds the folio lock, and the folio must be in the
swap cache, so _swap_info_get is redundant there.
For swp_swapcount, it should use get_swap_device instead. get_swap_device
increases the device ref count, which is actually a bit safer. The only
current use is smap walking, and the performance change here is tiny.
And after these changes, _swap_info_get is no longer used, so we can
safely remove it.
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-19-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Fri, 19 Dec 2025 19:43:47 +0000 (03:43 +0800)]
mm, swap: drop the SWAP_HAS_CACHE flag
Now, the swap cache is managed by the swap table. All swap cache users
are checking the swap table directly to check the swap cache state.
SWAP_HAS_CACHE is now just a temporary pin before the first increase from
0 to 1 of a slot's swap count (swap_dup_entries) after swap allocation
(folio_alloc_swap), or before the final free of slots pinned by folio in
swap cache (put_swap_folio).
Drop these two usages. For the first dup, SWAP_HAS_CACHE pinning was hard
to kill because it used to have multiple meanings, more than just "a slot
is cached". We have just simplified that and defined that the first dup
is always done with folio locked in swap cache (folio_dup_swap), so stop
checking the SWAP_HAS_CACHE bit and just check the swap cache (swap table)
directly, and add a WARN if a swap entry's count is being increased for
the first time while the folio is not in swap cache.
As for freeing, just let the swap cache free all swap entries of a folio
that have a swap count of zero directly upon folio removal. We have also
just cleaned up batch freeing to check the swap cache usage using the swap
table: a slot with swap cache in the swap table will not be freed until
its cache is gone, and no SWAP_HAS_CACHE bit is involved anymore. And
besides, the removal of a folio and freeing of the slots are being done in
the same critical section now, which should improve the performance.
After these two changes, SWAP_HAS_CACHE no longer has any users. Swap
cache synchronization is also done by the swap table directly, so using
SWAP_HAS_CACHE to pin a slot before adding the cache is also no longer
needed. Remove all related logic and helpers. swap_map is now only used
for tracking the count, so all swap_map users can just read it directly,
ignoring the swap_count helper, which was previously used to filter out
the SWAP_HAS_CACHE bit.
Dropping SWAP_HAS_CACHE and using the swap table directly originally came
from Chris's idea of merging all the metadata usage of all swaps into one
place.
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-18-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Suggested-by: Chris Li <chrisl@kernel.org> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Fri, 19 Dec 2025 19:43:46 +0000 (03:43 +0800)]
mm, swap: clean up and improve swap entries freeing
There are a few problems with the current freeing of swap entries.
When freeing a set of swap entries directly (swap_put_entries_direct,
typically from zapping the page table), it scans the whole swap region
multiple times. First, it scans the whole region to check if it can be
batch freed and if there is any cached folio. It then does a batch free only
if the whole region's swap count equals 1. And if any entry is cached,
even if only one, it will have to walk the whole region again to clean up
the cache.
And if any entry is not in a consistent status with other entries, it will
fall back to order 0 freeing. For example, if only one of them is cached,
the batch free will fall back.
And the current batch freeing workflow relies on the swap map's
SWAP_HAS_CACHE bit for both continuous checking and batch freeing, which
isn't compatible with the swap table design.
Tidy this up by introducing a new cluster-scoped helper for all swap entry
freeing jobs. It batch frees all contiguous entries, and just starts a
new batch if any inconsistent entry is found. This may improve the batch
size when the clusters are fragmented. This should also be more robust
with more sanity checks, and make it clear that a slot pinned by swap
cache will be cleared upon cache reclaim.
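The difference between the two batching strategies can be sketched with a toy model. The state labels and both function names are hypothetical, and each entry's state is reduced to a single label. The real helper inspects the swap count and the swap table, which this sketch does not attempt to reproduce.

```python
from itertools import groupby


def whole_region_or_single(states):
    """Earlier model: one batch if every entry matches, else order 0."""
    if len(set(states)) <= 1:
        return [(0, len(states))] if states else []
    return [(i, 1) for i in range(len(states))]


def contiguous_batches(states):
    """New-helper model: start a new batch whenever the state changes."""
    batches, start = [], 0
    for _, run in groupby(states):
        length = len(list(run))
        batches.append((start, length))
        start += length
    return batches


# (start, length) batches for a region with one inconsistent entry.
states = ["plain", "plain", "cached", "plain", "plain", "plain"]
print(whole_region_or_single(states))
print(contiguous_batches(states))
```

A single inconsistent entry splits the region into three batches instead of forcing six single-entry frees.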
And the cache reclaim scan is also now limited to each cluster. If a
cluster has any clean swap cache left after putting the swap count,
reclaim the cluster only instead of the whole region.
And since a folio's entries are always in the same cluster, putting swap
entries from a folio can also use the new helper directly.
This should be both an optimization and a cleanup, and the new helper is
adapted to the swap table.
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-17-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Fri, 19 Dec 2025 19:43:45 +0000 (03:43 +0800)]
mm, swap: check swap table directly for checking cache
Instead of looking at the swap map, check swap table directly to tell if a
swap slot is cached. Prepares for the removal of SWAP_HAS_CACHE.
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-16-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Fri, 19 Dec 2025 19:43:44 +0000 (03:43 +0800)]
mm, swap: add folio to swap cache directly on allocation
The allocator uses SWAP_HAS_CACHE to pin a swap slot upon allocation.
SWAP_HAS_CACHE is being deprecated as it caused a lot of confusion. This
pinning usage here can be dropped by adding the folio to swap cache
directly on allocation.
All swap allocations are folio-based now (except for hibernation), so the
swap allocator can always take the folio as the parameter. And now both
swap cache (swap table) and swap map are protected by the cluster lock,
scanning the map and inserting the folio can be done in the same critical
section. This eliminates the time window that a slot is pinned by
SWAP_HAS_CACHE, but it has no cache, and avoids touching the lock multiple
times.
This is both a cleanup and an optimization.
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-15-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Fri, 19 Dec 2025 19:43:43 +0000 (03:43 +0800)]
mm, swap: cleanup swap entry management workflow
The current swap entry allocation/freeing workflow has never had a clear
definition. This makes it hard to debug or add new optimizations.
This commit introduces a proper definition of how swap entries would be
allocated and freed. Now, most operations are folio based, so they will
never exceed one swap cluster, and we now have a cleaner border between
swap and the rest of mm, making it much easier to follow and debug,
especially with the newly added sanity checks. It also makes more
optimizations possible.
Swap entries will mostly be allocated and freed with a folio bound. The
folio lock will be useful for resolving many swap related races.
Now swap allocation (except hibernation) always starts with a folio in the
swap cache, and gets duped/freed protected by the folio lock:
- folio_alloc_swap() - The only allocation entry point now.
  Context: The folio must be locked.
  This allocates one or a set of contiguous swap slots for a folio and
  binds them to the folio by adding the folio to the swap cache. The
  swap slots' swap count starts at zero.
- folio_dup_swap() - Increase the swap count of one or more entries.
  Context: The folio must be locked and in the swap cache. For now, the
  caller still has to lock the new swap entry owner (e.g., PTL).
  This increases the ref count of swap entries allocated to a folio.
  Newly allocated swap slots' count has to be increased by this helper
  as the folio gets unmapped (and swap entries get installed).
- folio_put_swap() - Decrease the swap count of one or more entries.
  Context: The folio must be locked and in the swap cache. For now, the
  caller still has to lock the new swap entry owner (e.g., PTL).
  This decreases the ref count of swap entries allocated to a folio.
  Typically, swapin will decrease the swap count as the folio gets
  installed back and the swap entry gets uninstalled.
  This won't remove the folio from the swap cache or free the
  slot. Lazy freeing of swap cache is helpful for reducing IO.
  There is already a folio_free_swap() for immediate cache reclaim.
  This part could be further optimized later.
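The lifecycle described by the three folio helpers above can be sketched as a tiny state model. This is a hedged illustration only: SwapSlot, dup(), put(), drop_cache() and is_free() are hypothetical names, a boolean stands in for the folio lock, and a single slot stands in for the set of slots bound to a folio.

```python
class SwapSlot:
    """One swap slot as left by a folio_alloc_swap()-like step."""

    def __init__(self):
        self.count = 0      # swap count starts at zero
        self.cached = True  # the folio is already in the swap cache


def dup(slot, folio_locked):
    assert folio_locked and slot.cached, "dup needs a locked, cached folio"
    slot.count += 1


def put(slot, folio_locked):
    assert folio_locked and slot.cached, "put needs a locked, cached folio"
    assert slot.count > 0, "swap count underflow"
    slot.count -= 1  # the slot stays pinned by the swap cache


def drop_cache(slot):
    slot.cached = False  # folio_free_swap()-like cache reclaim


def is_free(slot):
    return slot.count == 0 and not slot.cached


slot = SwapSlot()
dup(slot, folio_locked=True)  # unmap: a swap entry is installed
put(slot, folio_locked=True)  # swapin: the swap entry is removed again
print(is_free(slot))          # the swap cache still pins the slot
drop_cache(slot)
print(is_free(slot))
```

The point of the model is that dropping the last swap count does not free the slot by itself. The slot only becomes free once the swap cache lets go of it as well.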
The above locking constraints could be further relaxed when the swap table
is fully implemented. Currently dup still needs the caller to lock the
swap entry container (e.g. PTL), or a concurrent zap may underflow the
swap count.
Some swap users need to interact with swap count without involving folio
(e.g. forking/zapping the page table or mapping truncate without swapin).
In such cases, the caller has to ensure there is no race condition on
whatever owns the swap count and use the below helpers:
- swap_put_entries_direct() - Decrease the swap count directly.
  Context: The caller must lock whatever is referencing the slots to
  avoid a race.
  Typically the page table zapping or shmem mapping truncate will need
  to free swap slots directly. If a slot is cached (has a folio bound),
  this will also try to release the swap cache.
- swap_dup_entry_direct() - Increase the swap count directly.
  Context: The caller must lock whatever is referencing the entries to
  avoid a race, and the entries must already have a swap count >= 1.
  Typically, forking will need to copy the page table and hence needs to
  increase the swap count of the entries in the table. The page table is
  locked while referencing the swap entries, so the entries all have a
  swap count >= 1 and can't be freed.
Hibernation subsystem is a bit different, so two special wrappers are here:
- swap_alloc_hibernation_slot() - Allocate one entry from one device.
- swap_free_hibernation_slot() - Free one entry allocated by the above
  helper.
All hibernation entries are exclusive to the hibernation subsystem and
should not interact with ordinary swap routines.
By separating the workflows, it will be possible to bind folio more
tightly with swap cache and get rid of the SWAP_HAS_CACHE as a temporary
pin.
This commit should not introduce any behavior change.
[kasong@tencent.com: fix leak, per Chris Mason. Remove WARN_ON, per Lai Yi] Link: https://lkml.kernel.org/r/CAMgjq7AUz10uETVm8ozDWcB3XohkOqf0i33KGrAquvEVvfp5cg@mail.gmail.com
[ryncsn@gmail.com: fix KSM copy pages for swapoff, per Chris] Link: https://lkml.kernel.org/r/aXxkANcET3l2Xu6J@KASONG-MC4 Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-14-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Signed-off-by: Kairui Song <ryncsn@gmail.com> Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Cc: Chris Mason <clm@fb.com> Cc: Chris Mason <clm@meta.com> Cc: Lai Yi <yi1.lai@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Fri, 19 Dec 2025 19:43:42 +0000 (03:43 +0800)]
mm, swap: remove workaround for unsynchronized swap map cache state
Remove the "skip if exists" check from commit a65b0e7607ccb ("zswap: make
shrinking memcg-aware"). It was needed because there is a tiny time
window between setting the SWAP_HAS_CACHE bit and actually adding the
folio to the swap cache. If a user is trying to add the folio into the
swap cache but another user was interrupted after setting SWAP_HAS_CACHE
but hasn't added the folio to the swap cache yet, it might lead to a
deadlock.
We have moved the bit setting to the same critical section as adding the
folio, so this is no longer needed. Remove it and clean it up.
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-13-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Fri, 19 Dec 2025 19:43:41 +0000 (03:43 +0800)]
mm, swap: use swap cache as the swap in synchronize layer
Current swap in synchronization mostly uses the swap_map's SWAP_HAS_CACHE
bit. Whoever sets the bit first does the actual work to swap in a folio.
This has been causing many issues as it's just a poor implementation of a
bit lock. Raced users have no idea what is pinning a slot, so it has to
loop with a schedule_timeout_uninterruptible(1), which is ugly and causes
long-tailing or other performance issues. Besides, the abuse of
SWAP_HAS_CACHE has been causing many other troubles for synchronization or
maintenance.
This is the first step to remove this bit completely.
Now all swap in paths are using the swap cache, and both the swap cache
and swap map are protected by the cluster lock. So we can just resolve
the swap synchronization with the swap cache layer directly using the
cluster lock and folio lock. Whoever inserts a folio in the swap cache
first does the swap in work. And because folios are locked during swap
operations, other raced swap operations will just wait on the folio lock.
The SWAP_HAS_CACHE will be removed in later commit. For now, we still set
it for some remaining users. But now we do the bit setting and swap cache
folio adding in the same critical section, after swap cache is ready. No
one will have to spin on the SWAP_HAS_CACHE bit anymore.
This both simplifies the logic and should improve the performance,
eliminating issues like the one solved in commit 01626a1823024 ("mm: avoid
unconditional one-tick sleep when swapcache_prepare fails"), or the
"skip_if_exists" from commit a65b0e7607ccb ("zswap: make shrinking
memcg-aware"), which will be removed very soon.
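
The "whoever inserts first does the swap in work" rule can be sketched in userspace as below. This is only a rough model under stated assumptions: struct folio, struct swap_slot, slot_insert() and swapin_begin() are hypothetical names, and a C11 compare-and-swap stands in for the cluster lock that the kernel actually uses to serialise the insertion.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's struct folio or swap table. */
struct folio {
    int locked;
};

struct swap_slot {
    _Atomic(struct folio *) cache;  /* NULL means no folio cached */
};

/*
 * Try to publish @new in the slot. Returns @new if this caller won,
 * or the folio that was already there if it lost the race.
 * Assumption: the kernel does this under the cluster lock; the CAS
 * here only keeps the sketch short.
 */
static struct folio *slot_insert(struct swap_slot *slot, struct folio *new)
{
    struct folio *expected = NULL;

    if (atomic_compare_exchange_strong(&slot->cache, &expected, new))
        return new;
    return expected;
}

/*
 * Returns 1 if the caller owns the swap-in work, 0 if it should wait
 * on the lock of the folio returned through @out instead.
 */
static int swapin_begin(struct swap_slot *slot, struct folio *mine,
                        struct folio **out)
{
    mine->locked = 1;           /* lock before the folio becomes visible */
    *out = slot_insert(slot, mine);
    if (*out == mine)
        return 1;
    mine->locked = 0;           /* lost the race: our folio is unused */
    return 0;
}
```

A loser gets back the winner's folio, so it has something concrete to sleep on rather than polling a bit with a one-tick timeout.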
[kasong@tencent.com: fix cgroup v1 accounting issue] Link: https://lkml.kernel.org/r/CAMgjq7CGUnzOVG7uSaYjzw9wD7w2dSKOHprJfaEp4CcGLgE3iw@mail.gmail.com Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-12-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Fri, 19 Dec 2025 19:43:40 +0000 (03:43 +0800)]
mm, swap: split locked entry duplicating into a standalone helper
No feature change. Split the common logic into a standalone helper to be
reused later.
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-11-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Fri, 19 Dec 2025 19:43:39 +0000 (03:43 +0800)]
mm, swap: consolidate cluster reclaim and usability check
Swap cluster cache reclaim requires releasing the lock, so the cluster may
become unusable after the reclaim. To prepare for checking swap cache
using the swap table directly, consolidate the swap cluster reclaim and
the check logic.
We will want to avoid touching the cluster's data completely with the swap
table, to avoid RCU overhead here. And by moving the cluster usable check
into the reclaim helper, it will also help avoid a redundant scan of the
slots if the cluster is no longer usable, and we will want to avoid
touching the cluster.
Also, adjust it very slightly while at it: always scan the whole region
during reclaim, don't skip slots covered by a reclaimed folio. Because
the reclaim is lockless, it's possible that new cache lands at any time.
And for allocation, we want all caches to be reclaimed to avoid
fragmentation. Besides, if the scan offset is not aligned with the size
of the reclaimed folio, we might skip some existing cache and fail the
reclaim unexpectedly.
There should be no observable behavior change. It might slightly improve
the fragmentation issue or performance.
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-10-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Fri, 19 Dec 2025 19:43:38 +0000 (03:43 +0800)]
mm, swap: swap entry of a bad slot should not be considered as swapped out
When checking if a swap entry is swapped out, we simply check if the
bitwise result of the count value is larger than 0. But SWAP_MAP_BAD will
also be considered as a swap count value larger than 0.
SWAP_MAP_BAD being considered as a count value larger than 0 is useful for
the swap allocator: they will be seen as a used slot, so the allocator
will skip them. But for the swapped out check, this isn't correct.
There is currently no observable issue. The swapped out check is only
useful for readahead and folio swapped-out status check. For readahead,
the swap cache layer will abort upon checking and updating the swap map.
For the folio swapped out status check, the swap allocator will never
allocate an entry of bad slots to folio, so that part is fine too. The
worst that could happen now is redundant allocation/freeing of folios and
wasted CPU time.
This also makes it easier to get rid of swap map checking and update
during folio insertion in the swap cache layer.
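
The distinction between the allocator's view and the swapped-out check can be sketched as follows. The constant values are assumptions modelled on the swap_map encoding, and slot_in_use() and entry_swapped_out() are hypothetical helper names, not the functions touched by this commit.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values; treat them as assumptions, not kernel ABI. */
#define SWAP_HAS_CACHE  0x40
#define SWAP_MAP_BAD    0x3f

/* Strip the cache flag so only the count part of the byte remains. */
static unsigned char swap_count(unsigned char ent)
{
    return ent & ~SWAP_HAS_CACHE;
}

/* Allocator view: a bad slot looks "used", so the allocator skips it. */
static bool slot_in_use(unsigned char ent)
{
    return swap_count(ent) != 0;
}

/* Swapped-out view: a bad slot holds no data, so it must not count. */
static bool entry_swapped_out(unsigned char ent)
{
    unsigned char count = swap_count(ent);

    return count != 0 && count != SWAP_MAP_BAD;
}
```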
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-9-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Nhat Pham [Fri, 19 Dec 2025 19:43:37 +0000 (03:43 +0800)]
mm/shmem, swap: remove SWAP_MAP_SHMEM
The SWAP_MAP_SHMEM state was introduced in the commit aaa468653b4a
("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a swap entry
belongs to shmem during swapoff.
However, swapoff has since been rewritten in the commit b56a2d8af914 ("mm:
rid swapoff of quadratic complexity"). Now having swap count ==
SWAP_MAP_SHMEM value is basically the same as having swap count == 1, and
swap_shmem_alloc() behaves analogously to swap_duplicate(). The only
difference of note is that swap_shmem_alloc() does not check for -ENOMEM
returned from __swap_duplicate(), but it is OK because shmem never
re-duplicates any swap entry it owns. This will still be safe if we use
(batched) swap_duplicate() instead.
This commit adds swap_duplicate_nr(), the batched variant of
swap_duplicate(), and removes the SWAP_MAP_SHMEM state and the associated
swap_shmem_alloc() helper to simplify the state machine (both mentally and
in terms of actual code). We will also have an extra state/special value
that can be repurposed (for swap entries that never get re-duplicated).
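
As a rough illustration of what a batched duplicate does, the sketch below bumps the count of several consecutive slots in one call. It is a simplified model, not the kernel's swap_duplicate_nr(): the function name, the flat byte array and the SLOT_MAX ceiling are assumptions, and count continuation and the -ENOMEM path are left out.

```c
#include <assert.h>
#include <stddef.h>

#define SLOT_MAX 0x3e   /* assumed per-slot count ceiling for this model */

/*
 * Increment the count of @nr consecutive slots starting at @start.
 * Returns 0 on success. Returns -1 and leaves the map untouched if
 * any slot in the range is free (count 0) or already at SLOT_MAX.
 */
static int swap_duplicate_nr_model(unsigned char *map, size_t start,
                                   size_t nr)
{
    size_t i;

    /* Validate the whole range first so a failure changes nothing. */
    for (i = 0; i < nr; i++) {
        if (map[start + i] == 0 || map[start + i] >= SLOT_MAX)
            return -1;
    }

    for (i = 0; i < nr; i++)
        map[start + i]++;

    return 0;
}
```

With a single count value per slot and no shmem-specific state, one helper covers both the anon and the shmem callers in this model.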
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-8-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Signed-off-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Fri, 19 Dec 2025 19:57:51 +0000 (03:57 +0800)]
mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO
Now that the overhead of the swap cache is trivial to none, bypassing the
swap cache is no longer a good optimization.
We have removed the cache bypass swapin for anon memory, now do the same
for shmem. Many helpers and functions can be dropped now.
The performance may slightly drop because of the co-existence and double
update of swap_map and swap table, and this problem will be improved very
soon in later commits by dropping the swap_map update partially:
Swapin of 24 GB file with tmpfs with
transparent_hugepage_tmpfs=within_size and ZRAM, 3 test runs on my
machine:
Before:    After this commit:    After this series:
5.99s      6.29s                 6.08s
And later swap table phases will drop the swap_map completely to avoid
overhead and reduce memory usage.
Link: https://lkml.kernel.org/r/20251219195751.61328-1-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Fri, 19 Dec 2025 19:43:35 +0000 (03:43 +0800)]
mm, swap: free the swap cache after folio is mapped
Currently, we remove the folio from the swap cache and free the swap cache
before mapping the PTE. To reduce repeated faults due to parallel swapins
of the same PTE, change it to remove the folio from the swap cache after
it is mapped. So new faults from the swap PTE will be much more likely to
see the folio in the swap cache and wait on it.
This does not eliminate all swapin races: an ongoing swapin fault may
still see an empty swap cache. That's harmless, as the PTE is changed
before the swap cache is cleared, so it will just return and not trigger
any repeated faults. This does help to reduce the chance.
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-6-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Fri, 19 Dec 2025 19:43:34 +0000 (03:43 +0800)]
mm, swap: simplify the code and reduce indention
Now that the swap cache is always used, multiple swap cache checks are no
longer useful. Remove them and reduce the code indentation.
No behavior change.
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-5-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Fri, 19 Dec 2025 19:43:33 +0000 (03:43 +0800)]
mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices
Now SWP_SYNCHRONOUS_IO devices are also using swap cache. One side effect
is that a folio may stay in swap cache for a longer time due to lazy
freeing (vm_swap_full()). This can help save some CPU / IO if folios are
being swapped out very frequently right after swapin, hence improving the
performance. But the long pinning of swap slots also increases the
fragmentation rate of the swap device significantly, and currently, all
in-tree SWP_SYNCHRONOUS_IO devices are RAM disks, so it also causes the
backing memory to be pinned, increasing the memory pressure.
So drop the swap cache immediately for SWP_SYNCHRONOUS_IO devices after
swapin finishes. Swap cache has served its role as a synchronization
layer to prevent any parallel swap-in from wasting CPU or memory
allocation, and the redundant IO is not a major concern for
SWP_SYNCHRONOUS_IO devices.
Worth noting, without this patch, this series so far can provide a ~30%
performance gain for certain workloads like MySQL or kernel compilation,
but causes significant regression or OOM when under extreme global
pressure. With this patch, we still have a nice performance gain for most
workloads, and without introducing any observable regressions. This is a
hint that further optimization can be done based on the new unified swapin
with swap cache, but for now, just keep the behaviour consistent with
before.
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-4-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Fri, 19 Dec 2025 19:43:32 +0000 (03:43 +0800)]
mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO
Now the overhead of the swap cache is trivial. Bypassing the swap cache
is no longer a valid optimization. So unify the swapin path using the
swap cache. This changes the swap in behavior in two observable ways.
Readahead is now always disabled for SWP_SYNCHRONOUS_IO devices, which is
a huge win for some workloads: We used to rely on `SWP_SYNCHRONOUS_IO &&
__swap_count(entry) == 1` as the indicator to bypass both the swap cache
and readahead, the swap count check made bypassing ineffective in many
cases, and it's not a good indicator. The limitation existed because the
current swap design made it hard to decouple readahead bypassing and swap
cache bypassing. We do want to always bypass readahead for
SWP_SYNCHRONOUS_IO devices, but bypassing swap cache at the same time will
cause repeated IO and memory overhead. Now that swap cache bypassing is
gone, this swap count check can be dropped.
The second thing here is that this enabled large swapin for all swap
entries on SWP_SYNCHRONOUS_IO devices. Previously, the large swap in is
also coupled with swap cache bypassing, and so the swap count checking
also makes large swapin less effective. Now this is also improved. We
will always have large swapin supported for all SWP_SYNCHRONOUS_IO cases.
And to catch potential issues with large swapin, especially with page
exclusiveness and swap cache, more debug sanity checks and comments are
added. But overall, the code is simpler. And new helper and routines
will be used by other components in later commits too. And now it's
possible to rely on the swap cache layer for resolving synchronization
issues, which will also be done by a later commit.
Worth mentioning that for a large folio workload, this may cause more
serious thrashing. This isn't a problem with this commit, but a generic
large folio issue. For a 4K workload, this commit increases the
performance.
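
The decoupling described above can be sketched as two sets of predicates. This is a minimal model, assuming an illustrative bit value for SWP_SYNCHRONOUS_IO; the function names are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bit; the real value is an assumption here. */
#define SWP_SYNCHRONOUS_IO (1u << 12)

/*
 * Old behaviour: a single predicate gated both swap cache bypass and
 * readahead bypass, so a shared entry (count > 1) got neither.
 */
static bool old_bypass_cache_and_readahead(unsigned int si_flags,
                                           int swap_count)
{
    return (si_flags & SWP_SYNCHRONOUS_IO) && swap_count == 1;
}

/* New behaviour: readahead is skipped based on the device type alone. */
static bool new_skip_readahead(unsigned int si_flags)
{
    return (si_flags & SWP_SYNCHRONOUS_IO) != 0;
}

/* New behaviour: the swap cache is never bypassed. */
static bool new_bypass_swap_cache(void)
{
    return false;
}
```

For an entry with swap count 2 on a synchronous device, the old predicate is false, so readahead stayed on; the new one skips readahead regardless of the count.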
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-3-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Fri, 19 Dec 2025 19:43:31 +0000 (03:43 +0800)]
mm, swap: split swap cache preparation loop into a standalone helper
To prepare for the removal of swap cache bypass swapin, introduce a new
helper that accepts an allocated and charged fresh folio, prepares the
folio, the swap map, and then adds the folio to the swap cache.
This doesn't change how swap cache works yet; we are still depending on
the SWAP_HAS_CACHE in the swap map for synchronization. But the
synchronization hacks are now all in this single helper.
No feature change.
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-2-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This series removes the SWP_SYNCHRONOUS_IO swap cache bypass swapin code
and special swap flag bits including SWAP_HAS_CACHE, along with many
historical issues. The performance is about 20% better for some
workloads, like Redis with persistence. This also cleans up the code to
prepare for later phases. Some patches are from a previously posted
series.
Swap cache bypassing and swap synchronization in general had many issues.
Some are solved as workarounds, and some are still there [1]. To resolve
them in a clean way, one good solution is to always use swap cache as the
synchronization layer [2]. So we have to remove the swap cache bypass
swap-in path first. It wasn't very doable due to performance issues, but
now combined with the swap table, removing the swap cache bypass path will
instead improve the performance, so there is no reason to keep it.
Now we can rework the swap entry and cache synchronization following the
new design. Swap cache synchronization was heavily relying on
SWAP_HAS_CACHE, which is the cause of many issues. By dropping the usage
of special swap map bits and related workarounds, we get a cleaner code
base and prepare for merging the swap count into the swap table in the
next step.
And swap_map is now only used for swap count, so in the next phase,
swap_map can be merged into the swap table, which will clean up more
things and start to reduce the static memory usage. Removal of
swap_cgroup_ctrl is also doable, but needs to be done after we also
simplify the allocation of swapin folios: always use the new
swap_cache_alloc_folio helper so the accounting will also be managed by
the swap layer by then.
Test results:
Redis / Valkey bench:
=====================
Testing on an ARM64 VM with 1.5G memory:
Server: valkey-server --maxmemory 2560M
Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
         no persistence           with BGSAVE
Before:  460475.84 RPS            311591.19 RPS
After:   451943.34 RPS (-1.9%)    371379.06 RPS (+19.2%)
Testing on an x86_64 VM with 4G memory (system components take about 2G):
Server:
Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
         no persistence           with BGSAVE
Before:  306044.38 RPS            102745.88 RPS
After:   309645.44 RPS (+1.2%)    125313.28 RPS (+22.0%)
The performance is a lot better when persistence is applied. This should
apply to many other workloads that involve sharing memory and COW. A
slight performance drop was observed for the ARM64 Redis test: We are
still using swap_map to track the swap count, which is causing redundant
cache and CPU overhead and is not very performance-friendly for some
arches. This will be improved once we merge the swap map into the swap
table (as already demonstrated previously [3]).
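
The percentages quoted next to the Redis results are plain relative changes against the "Before" column. A small helper, with the hypothetical name pct_change(), reproduces them:

```c
#include <assert.h>

/* Relative change of @after against @before, in percent. */
static double pct_change(double before, double after)
{
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}
```

For example, pct_change(311591.19, 371379.06) is about 19.2 for the ARM64 BGSAVE case, and pct_change(460475.84, 451943.34) is about -1.9 for the ARM64 case without persistence.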
vm-scalability
==============
usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure,
simulated PMEM as swap), average result of 6 test runs:
                            Before:          After:
System time:                282.22s          283.47s
Sum Throughput:             5677.35 MB/s     5688.78 MB/s
Single process Throughput:  176.41 MB/s      176.23 MB/s
Free latency:               518477.96 us     521488.06 us
Which is almost identical.
Build kernel test:
==================
Test using ZRAM as SWAP, make -j48, defconfig, on an x86_64 VM
with 4G RAM, under global pressure, avg of 32 test runs:
              Before:     After:
System time:  1379.91s    1364.22s (-0.11%)
Test using ZSWAP with NVME SWAP, make -j48, defconfig, on an x86_64 VM
with 4G RAM, under global pressure, avg of 32 test runs:
              Before:     After:
System time:  1822.52s    1803.33s (-0.11%)
Which is almost identical.
MySQL:
======
sysbench /usr/share/sysbench/oltp_read_only.lua --tables=16
--table-size=1000000 --threads=96 --time=600 (using ZRAM as SWAP, in a
512M memory cgroup, buffer pool set to 3G, 3 test runs and 180s warm up).
In conclusion, the result is looking better or identical for most cases,
and it's especially better for workloads with swap count > 1 on SYNC_IO
devices, about 20% gain in the above test. Next phases will start to merge
swap count into swap table and reduce memory usage.
One more gain here is that we now have better support for THP swapin.
Previously, the THP swapin was bound with swap cache bypassing, which only
works for single-mapped folios. Removing the bypassing path also enabled
THP swapin for all folios. The THP swapin is still limited to SYNC_IO
devices; the limitation can be removed later.
This may cause more serious THP thrashing for certain workloads, but
that's not an issue caused by this series, it's a common THP issue we
should resolve separately.
This patch (of 19):
__read_swap_cache_async is widely used to allocate and ensure a folio is
in swapcache, or get the folio if a folio is already there.
It's not async, and it's not doing any read. Rename it to better present
its usage, and prepare to be reworked as part of new swap cache APIs.
Also, add some comments for the function. Worth noting that the
skip_if_exists argument is a long-existing workaround that will be
dropped soon.
Mark Brown [Fri, 23 Jan 2026 22:39:24 +0000 (22:39 +0000)]
selftests/mm: have the harness run each test category separately
At present the mm selftests are integrated into the kselftest harness by
having it run run_vmtests.sh and letting it pick its default set of tests
to invoke, rather than by telling the kselftest framework about each test
program individually as is more standard. This has some unfortunate
interactions with the kselftest harness:
- If any of the tests hangs the harness will kill the entire mm
selftests run rather than just the individual test, meaning no
further tests get run.
- The timeout applied by the harness is applied to the whole run rather
than an individual test which frequently leads to the suite not being
completed in production testing.
Deploy a crude but effective mitigation for these issues by telling the
kselftest framework to run each of the test categories that run_vmtests.sh
has separately. Since kselftest really wants to run test programs this is
done by providing a trivial wrapper script for each category that invokes
run_vmtests.sh. This is not a thing of great elegance but it is clear and
simple. Since run_vmtests.sh is doing runtime support detection, scenario
enumeration and setup for many of the tests we can't consistently tell the
framework about the individual test programs.
This has the side effect of reordering the tests; hopefully the testing
is not overly sensitive to this.
Link: https://lkml.kernel.org/r/20260123-selftests-mm-run-suites-separately-v2-1-3e934edacbfa@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dennis Zhou [Fri, 23 Jan 2026 20:55:35 +0000 (12:55 -0800)]
percpu: add double free check to pcpu_free_area()
Percpu memory provides access via offsets into the percpu address space.
Offsets are essentially fixed for the lifetime of a chunk and therefore
require all users be good samaritans. If a user improperly handles the
lifetime of the percpu object, it can result in corruption in a couple of
ways:
- immediate double free - breaks percpu metadata accounting
- free after subsequent allocation
- corruption due to multiple owner problem (either prior owner still
writes or future allocation happens)
- potential for oops if the percpu pages are reclaimed as the
subsequent allocation isn't pinning the pages down
- can lead to page->private pointers pointing to freed chunks
Sebastian noticed that if this happens, none of the memory debugging
facilities add additional information [1].
This patch aims to catch invalid free scenarios within valid chunks. To
better guard free_percpu(), we can either add a magic number or some
tracking facility to the percpu subsystem in a separate patch.
The invalid free check in pcpu_free_area() validates that the allocation's
starting bit is set in both alloc_map and bound_map. The alloc_map bit
test ensures the area is allocated while the bound_map bit test checks we
are freeing from the beginning of an allocation. We choose not to check
the validity of the offset as that is encoded in page->private being a
valid chunk.
pcpu_stats_area_dealloc() is moved later to only be on the happy path so
stats are only updated on valid frees.
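
The two-bitmap check can be sketched with a toy chunk as below. This is a userspace model only: struct chunk_model, model_alloc() and model_free() are hypothetical, one bool is used per bit for readability, and the range guard in model_free() exists purely to keep the model memory-safe (the patch itself relies on page->private for offset validity).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NR_UNITS 64

/* Toy chunk; not the kernel's struct pcpu_chunk. */
struct chunk_model {
    bool alloc_map[NR_UNITS];       /* unit is currently allocated */
    bool bound_map[NR_UNITS + 1];   /* unit starts (or ends) an area */
};

/* Mark @n units from @off as one allocation with boundary bits. */
static void model_alloc(struct chunk_model *c, int off, int n)
{
    int i;

    assert(off >= 0 && n > 0 && off + n <= NR_UNITS);
    for (i = off; i < off + n; i++)
        c->alloc_map[i] = true;
    for (i = off + 1; i < off + n; i++)
        c->bound_map[i] = false;
    c->bound_map[off] = true;
    c->bound_map[off + n] = true;
}

/* Returns the number of units freed, or -1 for an invalid free. */
static int model_free(struct chunk_model *c, int off)
{
    int i, end;

    if (off < 0 || off >= NR_UNITS)
        return -1;
    /* Must be allocated AND be the first unit of an allocation. */
    if (!c->alloc_map[off] || !c->bound_map[off])
        return -1;

    /* The area ends at the next boundary bit. */
    for (end = off + 1; end < NR_UNITS && !c->bound_map[end]; end++)
        ;
    for (i = off; i < end; i++)
        c->alloc_map[i] = false;
    return end - off;
}
```

In this model an immediate double free fails the alloc_map test and a free from the middle of an area fails the bound_map test. A free that arrives after the same offset has been reallocated still passes, which matches the message's point that catching that case needs separate tracking.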
Li Zhe [Thu, 22 Jan 2026 03:50:02 +0000 (11:50 +0800)]
hugetlb: increase hugepage reservations when using node-specific "hugepages=" cmdline
Commit 3dfd02c90037 ("hugetlb: increase number of reserving hugepages via
cmdline") raised the number of hugepages that can be reserved through the
boot-time "hugepages=" parameter for the non-node-specific case, but left
the node-specific form of the same parameter unchanged.
This patch extends the same optimization to node-specific reservations.
When HugeTLB vmemmap optimization (HVO) is enabled and a node cannot
satisfy the requested hugepages, the code first releases ordinary
struct-page memory of hugepages obtained from the buddy allocator,
allowing their struct-page memory to be reclaimed and reused for
additional hugepage reservations on that node.
This is particularly beneficial for configurations that require identical,
large per-node hugepage reservations. On a four-node, 384 GB x86 VM, the
patch raises the attainable 2 MiB hugepage reservation from under 374 GB
to more than 379 GB.
Link: https://lkml.kernel.org/r/20260122035002.79958-1-lizhe.67@bytedance.com Signed-off-by: Li Zhe <lizhe.67@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 23 Jan 2026 20:12:20 +0000 (20:12 +0000)]
mm/vma: add and use vma_assert_stabilised()
Sometimes we wish to assert that a VMA is stable, that is - the VMA cannot
be changed underneath us. This will be the case if EITHER the VMA lock or
the mmap lock is held.
In order to do so, we introduce a new assert vma_assert_stabilised() -
this will make a lockdep assert if lockdep is enabled AND the VMA is
read-locked.
Currently lockdep tracking for VMA write locks is not implemented, so it
suffices to check in this case that we have either an mmap read or write
semaphore held.
Note that because the VMA lock uses the non-standard vmlock_dep_map naming
convention, we cannot use lockdep_assert_is_write_held() so have to open
code this ourselves via lockdep-asserting that
lock_is_held_type(&vma->vmlock_dep_map, 0).
We have to be careful here - for instance when merging a VMA, we use the
mmap write lock to stabilise the examination of adjacent VMAs which might
be simultaneously VMA read-locked whilst being faulted in.
If we were to assert VMA read lock using lockdep we would encounter an
incorrect lockdep assert.
Also, we have to be careful about asserting mmap locks are held - if we
try to address the above issue by first checking whether mmap lock is held
and if so asserting it via lockdep, we may find that we were raced by
another thread acquiring an mmap read lock simultaneously that either we
don't own (and thus can be released any time - so we are not stable) or
was indeed released since we last checked.
So to deal with these complexities we end up with either a precise (if
lockdep is enabled) or imprecise (if not) approach - in the first instance
we assert the lock is held using lockdep and thus whether we own it.
If we do own it, then the check is complete, otherwise we must check for
the VMA read lock being held (VMA write lock implies mmap write lock so
the mmap lock suffices for this).
If lockdep is not enabled we simply check if the mmap lock is held and
risk a false negative (i.e. not asserting when we should do).
There are a couple places in the kernel where we already do this
stabilisation check - the anon_vma_name() helper in mm/madvise.c and
vma_flag_set_atomic() in include/linux/mm.h, which we update to use
vma_assert_stabilised().
This change abstracts these into vma_assert_stabilised(), uses lockdep if
possible, and avoids a duplicate check of whether the mmap lock is held.
This is also self-documenting and lays the foundations for further VMA
stability checks in the code.
The only functional change here is adding the lockdep check.
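
The precise versus imprecise decision described above can be modelled with plain booleans. This is a sketch of the reasoning only, under the assumption that the four inputs below capture the relevant state; struct lock_state and vma_stabilised_model() are hypothetical and are not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

struct lock_state {
    bool lockdep_enabled;
    bool we_hold_mmap_lock;     /* ownership, as lockdep can prove it */
    bool mmap_lock_is_held;     /* held by anyone, possibly not us */
    bool vma_read_locked;       /* we hold the VMA read lock */
};

/* Returns true if the "VMA is stable" assertion would pass. */
static bool vma_stabilised_model(const struct lock_state *s)
{
    if (s->lockdep_enabled) {
        /* Precise: only a lock we own counts. */
        if (s->we_hold_mmap_lock)
            return true;
        return s->vma_read_locked;
    }

    /* Imprecise: any holder of the mmap lock satisfies the check. */
    if (s->mmap_lock_is_held)
        return true;
    return s->vma_read_locked;
}
```

When another thread holds the mmap lock and the caller holds nothing, the lockdep path fails the assertion while the non-lockdep path lets it pass, which is the false negative the message accepts.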
Link: https://lkml.kernel.org/r/6c9e64bb2b56ddb6f806fde9237f8a00cb3a776b.1769198904.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 23 Jan 2026 20:12:19 +0000 (20:12 +0000)]
mm/vma: update vma_assert_locked() to use lockdep
We can use lockdep to avoid unnecessary work here, otherwise update the
code to logically evaluate all pertinent cases and share code with
vma_assert_write_locked().
Make it clear here that we treat the VMA being detached at this point as a
bug, this was only implicit before.
Additionally, abstract references to vma->vmlock_dep_map by introducing a
macro helper __vma_lockdep_map() which accesses this field if lockdep is
enabled.
Since lock_is_held() is specified as an extern function if lockdep is
disabled, we can simply have __vma_lockdep_map() defined as NULL in this
case, and then use IS_ENABLED(CONFIG_LOCKDEP) to avoid ugly ifdeffery.
[lorenzo.stoakes@oracle.com: add helper macro __vma_lockdep_map(), per Vlastimil] Link: https://lkml.kernel.org/r/7c4b722e-604b-4b20-8e33-03d2f8d55407@lucifer.local Link: https://lkml.kernel.org/r/538762f079cc4fa76ff8bf30a8a9525a09961451.1769198904.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 23 Jan 2026 20:12:18 +0000 (20:12 +0000)]
mm/vma: improve and document __is_vma_write_locked()
We don't actually need to return an output parameter providing mm sequence
number, rather we can separate that out into another function -
__vma_raw_mm_seqnum() - and have any callers which need to obtain that
invoke that instead.
The access to the raw sequence number requires that we hold the exclusive
mmap lock such that we know we can't race vma_end_write_all(), so move the
assert to __vma_raw_mm_seqnum() to make this requirement clear.
Also while we're here, convert all of the VM_BUG_ON_VMA()'s to
VM_WARN_ON_ONCE_VMA()'s in line with the convention that we do not invoke
oopses when we can avoid it.
[lorenzo.stoakes@oracle.com: minor tweaks, per Vlastimil] Link: https://lkml.kernel.org/r/3fa89c13-232d-4eee-86cc-96caa75c2c67@lucifer.local Link: https://lkml.kernel.org/r/ef6c415c2d2c03f529dca124ccaed66bc2f60edc.1769198904.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
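The split described in the commit above, a predicate with no output parameter plus a separate sequence number accessor that carries the locking assertion, could look roughly like the following userspace sketch. The demo_ types and field names are hypothetical, the "sequence numbers match" rule is an assumption about how write-locked state is detected, and assert() merely stands in for the kernel's warning-style checks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the mm and VMA structures. */
struct demo_mm {
	bool mmap_write_locked;		/* models holding the exclusive mmap lock */
	unsigned int lock_seq;		/* models the mm-wide lock sequence number */
};

struct demo_vma {
	struct demo_mm *mm;
	unsigned int vm_lock_seq;	/* sequence number recorded at write-lock time */
};

/*
 * The raw sequence number is only stable under the exclusive mmap lock,
 * so the requirement is asserted here rather than in every caller.
 */
static unsigned int demo_raw_mm_seqnum(const struct demo_vma *vma)
{
	assert(vma->mm->mmap_write_locked);
	return vma->mm->lock_seq;
}

/* Pure predicate: no output parameter, callers fetch the seqnum separately. */
static bool demo_is_vma_write_locked(const struct demo_vma *vma)
{
	return vma->vm_lock_seq == demo_raw_mm_seqnum(vma);
}
```

A caller that needs the sequence number itself calls demo_raw_mm_seqnum() directly, which keeps the locking requirement visible at the one place the raw value is read.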
Lorenzo Stoakes [Fri, 23 Jan 2026 20:12:17 +0000 (20:12 +0000)]
mm/vma: introduce helper struct + thread through exclusive lock fns
It is confusing to have __vma_start_exclude_readers() return 0, 1 or an
error (but only when waiting for readers in TASK_KILLABLE state), and
having the return value be stored in a stack variable called 'locked'
only adds to the confusion.
More generally, we are doing a lot of rather finicky things during the
acquisition of a state in which readers are excluded and moving out of
this state, including tracking whether we are detached or not or
whether an error occurred.
We are implementing logic in __vma_start_exclude_readers() that
effectively acts as if 'if one caller calls us do X, if another then do
Y', which is very confusing from a control flow perspective.
Introducing the shared helper object state helps us avoid this, as we
can now handle the 'an error arose but we're detached' condition
correctly in both callers - a warning if not detaching, and treating
the situation as if no error arose in the case of a VMA detaching.
This also acts to help document what's going on and allows us to add
some more logical debug asserts.
Also update vma_mark_detached() to add a guard clause for the likely
'already detached' state (given we hold the mmap write lock), and add a
comment about ephemeral VMA read lock reference count increments to
clarify why we are entering/exiting an exclusive locked state here.
Finally, separate vma_mark_detached() into its fast-path component and
make it inline, then place the slow path for excluding readers in
mmap_lock.c.
Lorenzo Stoakes [Fri, 23 Jan 2026 20:12:16 +0000 (20:12 +0000)]
mm/vma: clean up __vma_enter/exit_locked()
These functions are very confusing indeed. 'Entering' a lock could be
interpreted as acquiring it, but this is not what these functions are
interacting with.
Equally they don't indicate at all what kind of lock we are 'entering' or
'exiting'. Finally they are misleading as we invoke these functions when
we already hold a write lock to detach a VMA.
These functions are explicitly simply 'entering' and 'exiting' a state in
which we hold the EXCLUSIVE lock in order that we can either mark the VMA
as being write-locked, or mark the VMA detached.
Rename the functions accordingly, and also update
__vma_end_exclude_readers() to return detached state with a __must_check
directive, as it is simply clumsy to pass an output pointer here to
detached state and inconsistent vs. __vma_start_exclude_readers().
Finally, remove the unnecessary 'inline' directives.
No functional change intended.
Link: https://lkml.kernel.org/r/33273be9389712347d69987c408ca7436f0c1b22.1769198904.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
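To illustrate why returning the detached state with a must-check annotation is tidier than an output pointer, here is a small sketch. The demo_ names are hypothetical, the flag value and the "refcount reaches zero means detached" rule are assumptions for this example, and demo_must_check maps to the GCC/Clang warn_unused_result attribute rather than the kernel's own __must_check definition.

```c
#include <stdbool.h>

/* GCC/Clang attribute used here as a stand-in for the kernel's __must_check. */
#define demo_must_check	__attribute__((warn_unused_result))

/* Assumed flag value for this sketch only. */
#define DEMO_EXCLUDE_READERS_FLAG	0x40000000u

struct demo_vma {
	unsigned int refcnt;
};

/*
 * Leave the "readers excluded" state by dropping the flag from the
 * reference count. The result reports whether the VMA ended up detached
 * (count reached zero), and the compiler warns if a caller ignores it.
 */
static demo_must_check bool demo_end_exclude_readers(struct demo_vma *vma)
{
	vma->refcnt -= DEMO_EXCLUDE_READERS_FLAG;
	return vma->refcnt == 0;
}
```

Compared with a bool output pointer, the return value cannot be left uninitialised by accident and cannot be silently dropped by a caller.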
The code is littered with inscrutable and duplicative lockdep
incantations. Replace these with defines which explain what is going on
and add commentary to explain what we're doing.
If lockdep is disabled these become no-ops. We must use defines so
_RET_IP_ remains meaningful.
These are self-documenting and aid readability of the code.
Additionally, instead of using the confusing rwsem_*() form for something
that is emphatically not an rwsem, we instead explicitly use
lock_[acquired, release]_shared/exclusive() lockdep invocations since we
are doing something rather custom here and these make more sense to use.
No functional change intended.
Link: https://lkml.kernel.org/r/fdae72441949ecf3b4a0ed3510da803e881bb153.1769198904.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 23 Jan 2026 20:12:13 +0000 (20:12 +0000)]
mm/vma: rename is_vma_writer_only(), separate out shared refcount put
The is_vma_writer_only() function is misnamed - this isn't determining if
there is only a write lock, as it checks for the presence of the
VM_REFCNT_EXCLUDE_READERS_FLAG.
Really, it is checking to see whether readers are excluded, with a
possibility of a false positive in the case of a detachment (there we
expect the vma->vm_refcnt to eventually be set to
VM_REFCNT_EXCLUDE_READERS_FLAG, whereas for an attached VMA we expect it
to eventually be set to VM_REFCNT_EXCLUDE_READERS_FLAG + 1).
Rename the function accordingly.
Relatedly, we use a __refcount_dec_and_test() primitive directly in
vma_refcount_put(), using the old value to determine what the reference
count ought to be after the operation is complete (ignoring racing
reference count adjustments).
Wrap this into a __vma_refcount_put_return() function, which we can then
utilise in vma_mark_detached() and thus keep the refcount primitive usage
abstracted.
This function, as the name implies, returns the value after the reference
count has been updated.
This reduces duplication in the two invocations of this function.
Also adjust comments, removing duplicative comments covered elsewhere and
adding more to aid understanding.
No functional change intended.
Link: https://lkml.kernel.org/r/32053580bff460eb1092ef780b526cefeb748bad.1769198904.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
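The two ideas in the commit above, a put helper that returns the post-decrement count and a "readers excluded" test that tolerates both the detaching and attached cases, can be sketched in userspace with C11 atomics. The demo_ names and the flag value are assumptions for illustration; the kernel uses its own refcount primitives rather than <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed flag value for this sketch only. */
#define DEMO_EXCLUDE_READERS_FLAG	0x40000000u

/*
 * Drop one reference and return the value the counter holds afterwards
 * (ignoring concurrent adjustments), derived from the old value that the
 * atomic operation hands back.
 */
static unsigned int demo_refcount_put_return(atomic_uint *refcnt)
{
	unsigned int old = atomic_fetch_sub(refcnt, 1);

	return old - 1;
}

/*
 * True when the flag is set and no reader references remain:
 * FLAG alone for a detaching VMA, FLAG + 1 for an attached one.
 */
static bool demo_readers_excluded(unsigned int refcnt)
{
	return (refcnt & DEMO_EXCLUDE_READERS_FLAG) &&
	       refcnt <= DEMO_EXCLUDE_READERS_FLAG + 1;
}
```

Keeping the "old value minus one" arithmetic inside one helper means callers reason only about the resulting count, which is what both call sites described in the commit need.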
Lorenzo Stoakes [Fri, 23 Jan 2026 20:12:12 +0000 (20:12 +0000)]
mm/vma: document possible vma->vm_refcnt values and reference comment
The possible vma->vm_refcnt values are confusing and vague, explain in
detail what these can be in a comment describing the vma->vm_refcnt field
and reference this comment in various places that read/write this field.
No functional change intended.
[akpm@linux-foundation.org: fix typo, per Suren] Link: https://lkml.kernel.org/r/d462e7678c6cc7461f94e5b26c776547d80a67e8.1769198904.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 23 Jan 2026 20:12:11 +0000 (20:12 +0000)]
mm/vma: rename VMA_LOCK_OFFSET to VM_REFCNT_EXCLUDE_READERS_FLAG
Patch series "mm: add and use vma_assert_stabilised() helper", v4.
This series first introduces a series of refactorings, intended to
significantly improve readability and abstraction of the code.
Sometimes we wish to assert that a VMA is stable, that is - the VMA cannot
be changed underneath us. This will be the case if EITHER the VMA lock or
the mmap lock is held.
We already open-code this in two places - anon_vma_name() in mm/madvise.c
and vma_flag_set_atomic() in include/linux/mm.h.
This series adds vma_assert_stabilised(), which abstracts this and can be
used in these callsites instead.
This implementation uses lockdep where possible - that is VMA read locks -
which correctly track read lock acquisition/release via:
We don't track the VMA locks using lockdep for VMA write locks, however
these are predicated upon mmap write locks whose lockdep state we do
track, and additionally vma_assert_stabilised() asserts this check if VMA
read lock is not held, so we get lockdep coverage in this case also.
We also add extensive comments to describe what we're doing.
There's some tricky stuff around mmap locking and stabilisation races that
we have to be careful of that I describe in the patch introducing
vma_assert_stabilised().
This change also lays the foundation for future series to add this assert
in further places where we wish to make it clear that we rely upon a
stabilised VMA.
The motivation for this change was precisely this.
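The stabilisation rule from the cover letter above (a VMA is stable if EITHER the VMA lock or the mmap lock is held) reduces to a simple disjunction. The sketch below models it with plain booleans; the demo_ names are hypothetical, and it deliberately ignores the lockdep tracking and the races that the cover letter says the real helper has to handle.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified lock state: real code would query the locks. */
struct demo_mm {
	bool mmap_locked;	/* mmap lock held (read or write) */
};

struct demo_vma {
	struct demo_mm *mm;
	bool vma_locked;	/* per-VMA lock held (read or write) */
};

/* A VMA cannot change underneath us if either lock is held. */
static bool demo_vma_is_stabilised(const struct demo_vma *vma)
{
	return vma->vma_locked || vma->mm->mmap_locked;
}

/* One shared assertion instead of open-coding the check at each call site. */
static void demo_vma_assert_stabilised(const struct demo_vma *vma)
{
	assert(demo_vma_is_stabilised(vma));
}
```

A call site that previously open-coded both checks would call demo_vma_assert_stabilised(vma) once on entry.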
This patch (of 10):
The VMA_LOCK_OFFSET value encodes a flag which vma->vm_refcnt is set to in
order to indicate that a VMA is in the process of having VMA read-locks
excluded in __vma_enter_locked() (that is, first checking if there are any
VMA read locks held, and if there are, waiting on them to be released).
This happens when a VMA write lock is being established, or a VMA is being
marked detached and discovers that the VMA reference count is elevated due
to read-locks temporarily elevating the reference count only to discover a
VMA write lock is in place.
The naming does not convey any of this, so rename VMA_LOCK_OFFSET to
VM_REFCNT_EXCLUDE_READERS_FLAG (with a sensible new prefix to
differentiate from the newly introduced VMA_*_BIT flags).
Also rename VMA_REF_LIMIT to VM_REFCNT_LIMIT to make this consistent also.
Update comments to reflect this.
No functional change intended.
Link: https://lkml.kernel.org/r/817bd763e5fe35f23e01347996f9007e6eb88460.1769198904.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The DAMON performance tests [1] use PARSEC 3.0 as their major test
workloads. But the official web site for PARSEC 3.0 is gone, so there is
no easy way to get the benchmark. Mainly because of this, the DAMON
performance tests are difficult to run, and effectively broken. Do not
request running them for now. Instead, suggest running any benchmarks or
real world workloads that make sense for performance changes.
SeongJae Park [Sun, 18 Jan 2026 18:02:58 +0000 (10:02 -0800)]
Docs/mm/damon/maintainer-profile: fix wrong MAINTAINERS section name
Commit 9044cbe50a70 ("MAINTAINERS: rename DAMON section") renamed the
section for DAMON from "DATA ACCESS MONITOR" to "DAMON". But the commit
forgot to update the name in the maintainer-profile document. Update.
Link: https://lkml.kernel.org/r/20260118180305.70023-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sun, 18 Jan 2026 18:02:57 +0000 (10:02 -0800)]
Docs/admin-guide/mm/damon/usage: update stats update process for refresh_ms
DAMOS stats on sysfs were only manually updated. The recent addition of
the 'refresh_ms' knob enabled periodic and automated updates of the stats.
The document for the stats update process has not been updated for the
change, however. Update.
Link: https://lkml.kernel.org/r/20260118180305.70023-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sun, 18 Jan 2026 18:02:56 +0000 (10:02 -0800)]
Docs/admin-guide/mm/damon/usage: introduce DAMON modules at the beginning
The DAMON usage document provides a list of available DAMON interfaces
with a brief introduction at the beginning of the doc. The list is missing
the DAMON modules for special purposes, even though they are one of the
major suggested interfaces. Add an item for those to the list.
Link: https://lkml.kernel.org/r/20260118180305.70023-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sun, 18 Jan 2026 18:02:55 +0000 (10:02 -0800)]
Docs/mm/damon/design: add reference to DAMON_STAT usage
The design document's special-purpose DAMON modules section provides the
list of links to the usage documents of existing DAMON modules. It is
missing the link for DAMON_STAT, though. Add the missing link.
Link: https://lkml.kernel.org/r/20260118180305.70023-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
People sometimes get confused about the purposes of DAMON special-purpose
modules and sample modules. Clarify those on the design document by
adding a section describing their existence and purposes.
Link: https://lkml.kernel.org/r/20260118180305.70023-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sun, 18 Jan 2026 18:02:53 +0000 (10:02 -0800)]
Docs/mm/damon/design: link repology instead of Fedora package
The document introduces Fedora as one way to get the DAMON user-space tool
(damo) from an OS-provided packaging system. Linux distros other than
Fedora also provide damo through their packaging systems, though. Replace
the Fedora part with the repology.org page that shows damo packaging
status for multiple Linux distros.
Link: https://lkml.kernel.org/r/20260118180305.70023-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sun, 18 Jan 2026 18:02:52 +0000 (10:02 -0800)]
Docs/mm/damon/index: simplify the intro
Patch series "Docs/mm/damon: update intro, modules, maintainer profile,
and misc".
Update DAMON documentation for wordsmithing, clarifications, and
miscellaneous outdated things with eight patches. Patch 1 simplifies the
brief introduction of DAMON. Patch 2 updates DAMON user-space tool
packaged distros information on design doc to include not only Fedora, but
refer to repology. Three following patches update design and usage
documents for clarifying DAMON sample modules purposes (patch 3), and
outdated information about usages of DAMON modules (patches 4 and 5).
Final three patches update usage and maintainer-profile for sysfs
refresh_ms feature behavior (patch 6), synchronize DAMON MAINTAINERS
section name (patch 7), and broken damon-tests performance tests (patch
8).
This patch (of 8):
The intro is a bit verbose and redundant. Simplify it by replacing
details with more links to the design docs, and refining the design points
list.
Link: https://lkml.kernel.org/r/20260118180305.70023-1-sj@kernel.org Link: https://lkml.kernel.org/r/20260118180305.70023-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Taeyang Kim [Sat, 17 Jan 2026 10:14:28 +0000 (19:14 +0900)]
mm: update kernel-doc for __swap_cache_clear_shadow()
The kernel-doc comment referred to swap_cache_clear_shadow(), but the
actual function name is __swap_cache_clear_shadow().
Update the comment to match the function name.
Link: https://lkml.kernel.org/r/20260117101428.113154-1-maainnewkin59@gmail.com Signed-off-by: Taeyang Kim <maainnewkin59@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sat, 17 Jan 2026 17:52:55 +0000 (09:52 -0800)]
mm/damon: rename min_sz_region of damon_ctx to min_region_sz
'min_sz_region' field of 'struct damon_ctx' represents the minimum size of
each DAMON region for the context. 'struct damos_access_pattern' has a
field of the same name. It confuses readers and makes 'grep' less optimal
for them. Rename it to 'min_region_sz'.
SeongJae Park [Sat, 17 Jan 2026 17:52:54 +0000 (09:52 -0800)]
mm/damon: rename DAMON_MIN_REGION to DAMON_MIN_REGION_SZ
The macro is for the default minimum size of each DAMON region. In one
case, a reader confused it with the minimum number of total DAMON regions,
which is set on damon_attrs->min_nr_regions. Make the name more explicit.
SeongJae Park [Sat, 17 Jan 2026 17:52:53 +0000 (09:52 -0800)]
mm/damon/core: rename damos_filter_out() to damos_core_filter_out()
DAMOS filters are processed on the core layer and operations layer,
depending on their types. damos_filter_out() in core.c handles only the
filters processed on the core layer, but its name does not make this
clear. Rename it to damos_core_filter_out() to be more explicit.
damon_call_control->dealloc_on_cancel works only when ->repeat is true.
But the behavior is not clearly documented. DAMON API callers can
understand the behavior only after reading kdamond_call() code. Document
the behavior on the kernel-doc comment of damon_call_control.