vma_mmu_pagesize() is also queried on non-hugetlb VMAs and does not really
belong into hugetlb.c.
PPC64 provides a custom overwrite with CONFIG_HUGETLB_PAGE, see
arch/powerpc/mm/book3s64/slice.c, so we cannot easily make this a static
inline function.
So let's move it to vma.c and add some proper kerneldoc.
To make vma tests happy, add a simple vma_kernel_pagesize() stub in
tools/testing/vma/include/custom.h.
Link: https://lkml.kernel.org/r/20260309151901.123947-3-david@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: move vma_kernel_pagesize() from hugetlb to mm.h
Patch series "mm: move vma_(kernel|mmu)_pagesize() out of hugetlb.c", v2.
Looking into vma_(kernel|mmu)_pagesize(), I realized that there is one
scenario where DAX would not do the right thing when the kernel is not
compiled with hugetlb support.
Without hugetlb support, vma_(kernel|mmu)_pagesize() will always return
PAGE_SIZE instead of using the ->pagesize() result provided by dax-device
code.
Fix that by moving vma_kernel_pagesize() to core MM code, where it
belongs. I don't think this is stable material, but am not 100% sure.
Also, move vma_mmu_pagesize() while at it. Remove the unnecessary
hugetlb.h inclusion from KVM code.
This patch (of 4):
In the past, only hugetlb had special "vma_kernel_pagesize()"
requirements, so it provided its own implementation.
In commit 05ea88608d4e ("mm, hugetlbfs: introduce ->pagesize() to
vm_operations_struct") we generalized that approach by providing a
vm_ops->pagesize() callback to be used by device-dax.
Once device-dax started using that callback in commit c1d53b92b95c
("device-dax: implement ->pagesize() for smaps to report MMUPageSize") it
was missed that CONFIG_DEV_DAX does not depend on hugetlb support.
So building a kernel with CONFIG_DEV_DAX but without CONFIG_HUGETLBFS
would not pick up that value.
Fix it by moving vma_kernel_pagesize() to mm.h, providing only a single
implementation. While at it, improve the kerneldoc a bit.
Ideally, we'd move vma_mmu_pagesize() as well to the header. However, its
__weak symbol might be overwritten by a PPC variant in hugetlb code. So
let's leave it in there for now, as it really only matters for some
hugetlb oddities.
This was found by code inspection.
Link: https://lkml.kernel.org/r/20260309151901.123947-1-david@kernel.org Link: https://lkml.kernel.org/r/20260309151901.123947-2-david@kernel.org Fixes: c1d53b92b95c ("device-dax: implement ->pagesize() for smaps to report MMUPageSize") Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Akinobu Mita [Tue, 10 Mar 2026 15:18:37 +0000 (00:18 +0900)]
docs: mm: fix typo in numa_memory_policy.rst
Fix a typo: MPOL_INTERLEAVED -> MPOL_INTERLEAVE.
Link: https://lkml.kernel.org/r/20260310151837.5888-1-akinobu.mita@gmail.com Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sat, 7 Mar 2026 19:53:54 +0000 (11:53 -0800)]
Docs/mm/damon/maintainer-profile: use flexible review cadence
The document mentions the maitainer is working in the usual 9-5 fashion.
The maintainer nowadays prefers working in a more flexible way. Update
the document to avoid contributors having a wrong time expectation.
Link: https://lkml.kernel.org/r/20260307195356.203753-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Acked-by: wang lian <lianux.mm@gmail.com> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sat, 7 Mar 2026 19:53:51 +0000 (11:53 -0800)]
mm/damon/core: clarify damon_set_attrs() usages
damon_set_attrs() is called for multiple purposes from multiple places.
Calling it in an unsafe context can make DAMON internal state polluted and
results in unexpected behaviors. Clarify when it is safe, and where it is
being called.
Link: https://lkml.kernel.org/r/20260307195356.203753-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Acked-by: wang lian <lianux.mm@gmail.com> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sat, 7 Mar 2026 19:53:49 +0000 (11:53 -0800)]
mm/damon/core: use mult_frac()
Patch series "mm/damon: improve/fixup/update ratio calculation, test and
documentation".
Yet another batch of misc/minor improvements and fixups. Use mult_frac()
instead of the worse open-coding for rate calculations (patch 1). Add a
test for a previously found and fixed bug (patch 2). Improve and update
comments and documentations for easier code review and up-to-date
information (patches 3-6). Finally, fix an obvious typo (patch 7).
This patch (of 7):
There are multiple places in core code that do open-code rate
calculations. Use mult_frac(), which is developed for doing that in a way
more safe from overflow and precision loss.
Link: https://lkml.kernel.org/r/20260307195356.203753-1-sj@kernel.org Link: https://lkml.kernel.org/r/20260307195356.203753-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Acked-by: wang lian <lianux.mm@gmail.com> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sat, 7 Mar 2026 19:49:14 +0000 (11:49 -0800)]
mm/damon/core: use time_after_eq() in kdamond_fn()
damon_ctx->passed_sample_intervals and damon_ctx->next_*_sis are unsigned
long. Those are compared in kdamond_fn() using normal comparison
operators. It is unsafe from overflow. Use time_after_eq(), which is
safe from overflows when correctly used, instead.
SeongJae Park [Sat, 7 Mar 2026 19:49:13 +0000 (11:49 -0800)]
mm/damon/core: use time_before() for next_apply_sis
damon_ctx->passed_sample_intervals and damos->next_apply_sis are unsigned
long, and compared via normal comparison operators. It is unsafe from
overflow. Use time_before(), which is safe from overflow when correctly
used, instead.
Patch series "mm/damon/core: make passed_sample_intervals comparisons
overflow-safe".
DAMON accounts time using its own jiffies-like time counter, namely
damon_ctx->passed_sample_intervals. The counter is incremented on each
iteration of kdamond_fn() main loop, which sleeps at least one sample
interval. Hence the name is like that.
DAMON has time-periodic operations including monitoring results
aggregation and DAMOS action application. DAMON sets the next time to do
each of such operations in the passed_sample_intervals unit. And it does
the operation when the counter becomes the same to or larger than the
pre-set values, and update the next time for the operation. Note that the
operation is done not only when the values exactly match but also when the
time is passed, because the values can be updated for online-committed
DAMON parameters.
The counter is 'unsigned long' type, and the comparison is done using
normal comparison operators. It is not safe from overflows. This can
cause rare and limited but odd situations.
Let's suppose there is an operation that should be executed every 20
sampling intervals, and the passed_sample_intervals value for next
execution of the operation is ULONG_MAX - 3. Once the
passed_sample_intervals reaches ULONG_MAX - 3, the operation will be
executed, and the next time value for doing the operation becomes 17
(ULONG_MAX - 3 + 20), since overflow happens. In the next iteration of
the kdamond_fn() main loop, passed_sample_intervals is larger than the
next operation time value, so the operation will be executed again. It
will continue executing the operation for each iteration, until the
passed_sample_intervals also overflows.
Note that this will not be common and problematic in the real world. The
sampling interval, which takes for each passed_sample_intervals increment,
is 5 ms by default. And it is usually [auto-]tuned for hundreds of
milliseconds. That means it takes about 248 days or 4,971 days to have
the overflow on 32 bit machines when the sampling interval is 5 ms and 100
ms, respectively (1<<32 * sampling_interval_in_seconds / 3600 / 24). On
64 bit machines, the numbers become 2924712086.77536 and 58494241735.5072
years. So the real user impact is negligible. But still this is better
to be fixed as long as the fix is simple and efficient.
Fix this by simply replacing the overflow-unsafe native comparison
operators with the existing overflow-safe time comparison helpers.
The first patch only cleans up the next DAMOS action application time
setup for consistency and reduced code. The second and the third patches
update DAMOS action application time setup and rest, respectively.
This patch (of 3):
There is a function for damos->next_apply_sis setup. But some places are
open-coding it. Consistently use the helper.
SeongJae Park [Sat, 7 Mar 2026 19:42:21 +0000 (11:42 -0800)]
Docs/mm/damon/design: document the power-of-two limitation for addr_unit
The min_region_sz is set as max(DAMON_MIN_REGION_SZ / addr_unit, 1).
DAMON_MIN_REGION_SZ is the same to PAGE_SIZE, and addr_unit is what the
user can arbitrarily set. Commit c80f46ac228b ("mm/damon/core: disallow
non-power of two min_region_sz") made min_region_sz to always be a power
of two. Hence, addr_unit should be a power of two when it is smaller than
PAGE_SIZE. While 'addr_unit' is a user-exposed parameter, the rule is not
documented. This can confuse users. Specifically, if the user sets
addr_unit as a value that is smaller than PAGE_SIZE and not a power of
two, the setup will explicitly fail.
Document the rule on the design document. Usage documents reference the
design document for detail, so updating only the design document should
suffice.
Link: https://lkml.kernel.org/r/20260307194222.202075-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sat, 7 Mar 2026 19:42:20 +0000 (11:42 -0800)]
mm/damon/tests/core-kunit: add a test for damon_commit_ctx()
Patch series "mm/damon: test and document power-of-2 min_region_sz
requirement".
Since commit c80f46ac228b ("mm/damon/core: disallow non-power of two
min_region_sz"), min_region_sz is always restricted to be a power of two.
Add a kunit test to confirm the functionality. Also, the change adds a
restriction to addr_unit parameter. Clarify it on the document.
This patch (of 2):
Add a kunit test for confirming the change that is made on commit c80f46ac228b ("mm/damon/core: disallow non-power of two min_region_sz")
functions as expected.
Link: https://lkml.kernel.org/r/20260307194222.202075-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: SeongJae Park <sj@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
At time of damon_reset_aggregated(), aggregation of the interval should be
completed, and hence nr_accesses and nr_accesses_bp should match. I found
a few bugs caused it to be broken in the past, from online parameters
update and complicated nr_accesses handling changes. Add a sanity check
for that under CONFIG_DAMON_DEBUG_SANITY.
Link: https://lkml.kernel.org/r/20260306152914.86303-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damon_merge_regions_of() should be called only after aggregation is
finished and therefore each region's nr_accesses and nr_accesses_bp match.
There were bugs that broke the assumption, during development of online
DAMON parameter updates and monitoring results handling changes. Add a
sanity check for that under CONFIG_DAMON_DEBUG_SANITY.
Link: https://lkml.kernel.org/r/20260306152914.86303-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
A data corruption could cause damon_merge_two_regions() creating zero
length DAMON regions. Add a sanity check for that under
CONFIG_DAMON_DEBUG_SANITY.
Link: https://lkml.kernel.org/r/20260306152914.86303-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damon_target->nr_regions is introduced to get the number quickly without
having to iterate regions always. Add a sanity check for that under
CONFIG_DAMON_DEBUG_SANITY.
Link: https://lkml.kernel.org/r/20260306152914.86303-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Fri, 6 Mar 2026 15:29:04 +0000 (07:29 -0800)]
mm/damon: add CONFIG_DAMON_DEBUG_SANITY
Patch series "mm/damon: add optional debugging-purpose sanity checks".
DAMON code has a few assumptions that can be critical if violated.
Validating the assumptions in code can be useful at finding such critical
bugs. I was actually adding some such additional sanity checks in my
personal tree, and those were useful at finding bugs that I made during
the development of new patches. We also found [1] sometimes the
assumptions are misunderstood. The validation can work as good
documentation for such cases.
Add some of such debugging purpose sanity checks. Because those
additional checks can impose more overhead, make those only optional via
new config, CONFIG_DAMON_DEBUG_SANITY, that is recommended for only
development and test setups. And as recommended, enable it for DAMON
kunit tests and selftests.
Note that the verification only WARN_ON() for each of the insanity. The
developer or tester may better to set panic_on_oops together, like
damon-tests/corr did [2].
This patch (of 10):
Add a new build config that will enable additional DAMON sanity checks.
It is recommended to be enabled on only development and test setups, since
it can impose additional overhead.
Usama Arif [Mon, 9 Mar 2026 21:25:02 +0000 (14:25 -0700)]
mm/migrate_device: document folio_get requirement before frozen PMD split
split_huge_pmd_address() with freeze=true splits a PMD migration entry
into PTE migration entries, consuming one folio reference in the process.
The folio_get() before it provides this reference.
Add a comment explaining this relationship. The expected folio refcount
at the start of migrate_vma_split_unmapped_folio() is 1.
Link: https://lkml.kernel.org/r/20260309212502.3922825-1-usama.arif@linux.dev Signed-off-by: Usama Arif <usama.arif@linux.dev> Suggested-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Nico Pache <npache@redhat.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Ying Huang <ying.huang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Arnd Bergmann [Fri, 6 Mar 2026 15:05:49 +0000 (16:05 +0100)]
ubsan: turn off kmsan inside of ubsan instrumentation
The structure initialization in the two type mismatch handling functions
causes a call to __msan_memset() to be generated inside of a UACCESS
block, which in turn leads to an objtool warning about possibly leaking
uaccess-enabled state:
lib/ubsan.o: warning: objtool: __ubsan_handle_type_mismatch+0xda: call to __msan_memset() with UACCESS enabled
lib/ubsan.o: warning: objtool: __ubsan_handle_type_mismatch_v1+0xf4: call to __msan_memset() with UACCESS enabled
Most likely __msan_memset() is safe to be called here and could be added
to the uaccess_safe_builtin[] list of safe functions, but seeing that the
ubsan file itself already has kasan, ubsan and kcsan disabled itself, it
is probably a good idea to also turn off kmsan here, in particular this
also avoids the risk of recursing between ubsan and kcsan checks in other
functions of this file.
I saw this happen while testing randconfig builds with clang-22, but did
not try older versions, or attempt to see which kernel change introduced
the warning.
Byungchul Park [Tue, 24 Feb 2026 05:13:47 +0000 (14:13 +0900)]
mm: introduce a new page type for page pool in page type
Currently, the condition 'page->pp_magic == PP_SIGNATURE' is used to
determine if a page belongs to a page pool. However, with the planned
removal of @pp_magic, we should instead leverage the page_type in struct
page, such as PGTY_netpp, for this purpose.
Introduce and use the page type APIs e.g. PageNetpp(), __SetPageNetpp(),
and __ClearPageNetpp() instead, and remove the existing APIs accessing
@pp_magic e.g. page_pool_page_is_pp(), netmem_or_pp_magic(), and
netmem_clear_pp_magic().
Plus, add @page_type to struct net_iov at the same offset as struct page
so as to use the page_type APIs for struct net_iov as well. While at it,
reorder @type and @owner in struct net_iov to avoid a hole and increasing
the struct size.
Chengkaitao [Sun, 1 Feb 2026 06:35:31 +0000 (14:35 +0800)]
sparc: use vmemmap_populate_hugepages for vmemmap_populate
Change sparc's implementation of vmemmap_populate() using
vmemmap_populate_hugepages() to streamline the code. Another benefit is
that it allows us to eliminate the external declarations of
vmemmap_p?d_populate functions and convert them to static functions.
Since vmemmap_populate_hugepages may fallback to vmemmap_populate-
_basepages, which differs from sparc's original implementation. During
the v1 discussion with Mike Rapoport, sparc uses base pages in the kernel
page tables, so it should be able to use them in vmemmap as well.
Consequently, no additional special handling is required.
1. In the SPARC architecture, reimplement vmemmap_populate using
vmemmap_populate_hugepages.
2. Allow the SPARC arch to fallback to vmemmap_populate_basepages(),
when vmemmap_alloc_block returns NULL.
Link: https://lkml.kernel.org/r/20260201063532.44807-2-pilgrimtao@gmail.com Signed-off-by: Chengkaitao <chengkaitao@kylinos.cn> Tested-by: Andreas Larsson <andreas@gaisler.com> Acked-by: Andreas Larsson <andreas@gaisler.com> Cc: David Hildenbrand <david@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
tools/testing/vma: add test for vma_flags_test(), vma_desc_test()
Now we have helpers which test singular VMA flags - vma_flags_test() and
vma_desc_test() - add a test to explicitly assert that these behave as
expected.
[ljs@kernel.org: test_vma_flags_test(): use struct initializer, per David] Link: https://lkml.kernel.org/r/f6f396d2-1ba2-426f-b756-d8cc5985cc7c@lucifer.local Link: https://lkml.kernel.org/r/376a39eb9e134d2c8ab10e32720dd292970b080a.1772704455.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Babu Moger <babu.moger@amd.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chao Yu <chao@kernel.org> Cc: Chatre, Reinette <reinette.chatre@intel.com> Cc: Chunhai Guo <guochunhai@vivo.com> Cc: Damien Le Maol <dlemoal@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Martin <dave.martin@arm.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hongbo Li <lihongbo22@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Morse <james.morse@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Johannes Thumshirn <jth@kernel.org> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naohiro Aota <naohiro.aota@wdc.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Yue Hu <zbestahu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: reintroduce vma_desc_test() as a singular flag test
Similar to vma_flags_test(), we have previously renamed vma_desc_test() to
vma_desc_test_any(). Now that is in place, we can reintroduce
vma_desc_test() to explicitly check for a single VMA flag.
As with vma_flags_test(), this is useful as often flag tests are against a
single flag, and vma_desc_test_any(flags, VMA_READ_BIT) reads oddly and
potentially causes confusion.
As with vma_flags_test() a combination of sparse and vma_flags_t being a
struct means that users cannot misuse this function without it getting
flagged.
Also update the VMA tests to reflect this change.
Link: https://lkml.kernel.org/r/3a65ca23defb05060333f0586428fe279a484564.1772704455.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Babu Moger <babu.moger@amd.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chao Yu <chao@kernel.org> Cc: Chatre, Reinette <reinette.chatre@intel.com> Cc: Chunhai Guo <guochunhai@vivo.com> Cc: Damien Le Maol <dlemoal@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Martin <dave.martin@arm.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hongbo Li <lihongbo22@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Morse <james.morse@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Johannes Thumshirn <jth@kernel.org> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naohiro Aota <naohiro.aota@wdc.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Yue Hu <zbestahu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: reintroduce vma_flags_test() as a singular flag test
Since we've now renamed vma_flags_test() to vma_flags_test_any() to be
very clear as to what we are in fact testing, we now have the opportunity
to bring vma_flags_test() back, but for explicitly testing a single VMA
flag.
This is useful, as often flag tests are against a single flag, and
vma_flags_test_any(flags, VMA_READ_BIT) reads oddly and potentially causes
confusion.
We use sparse to enforce that users won't accidentally pass vm_flags_t to
this function without it being flagged so this should make it harder to
get this wrong.
Of course, passing vma_flags_t to the function is impossible, as it is a
struct.
Also update the VMA tests to reflect this change.
Link: https://lkml.kernel.org/r/f33f8d7f16c3f3d286a1dc2cba12c23683073134.1772704455.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Babu Moger <babu.moger@amd.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chao Yu <chao@kernel.org> Cc: Chatre, Reinette <reinette.chatre@intel.com> Cc: Chunhai Guo <guochunhai@vivo.com> Cc: Damien Le Maol <dlemoal@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Martin <dave.martin@arm.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hongbo Li <lihongbo22@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Morse <james.morse@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Johannes Thumshirn <jth@kernel.org> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naohiro Aota <naohiro.aota@wdc.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Yue Hu <zbestahu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: always inline __mk_vma_flags() and invoked functions
Be explicit about __mk_vma_flags() (which is used by the mk_vma_flags()
macro) always being inline, as we rely on the compiler to evaluate the
loop in this function and determine that it can replace the code with the
an equivalent constant value, e.g. that:
Most likely an 'inline' will suffice for this, but be explicit as we can
be.
Also update all of the functions __mk_vma_flags() ultimately invokes to be
always inline too.
Note that test_bitmap_const_eval() asserts that the relevant bitmap
functions result in build time constant values.
Additionally, vma_flag_set() operates on a vma_flags_t type, so it is
inconsistently named versus other VMA flags functions.
We only use vma_flag_set() in __mk_vma_flags() so we don't need to worry
about its new name being rather cumbersome, so rename it to
vma_flags_set_flag() to disambiguate it from vma_flags_set().
Also update the VMA test headers to reflect the changes.
Link: https://lkml.kernel.org/r/241f49c52074d436edbb9c6a6662a8dc142a8f43.1772704455.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Babu Moger <babu.moger@amd.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chao Yu <chao@kernel.org> Cc: Chatre, Reinette <reinette.chatre@intel.com> Cc: Chunhai Guo <guochunhai@vivo.com> Cc: Damien Le Maol <dlemoal@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Martin <dave.martin@arm.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hongbo Li <lihongbo22@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Morse <james.morse@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Johannes Thumshirn <jth@kernel.org> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naohiro Aota <naohiro.aota@wdc.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Yue Hu <zbestahu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
erofs and zonefs are using vma_desc_test_any() twice to check whether all
of VMA_SHARED_BIT and VMA_MAYWRITE_BIT are set, this is silly, so add
vma_desc_test_all() to test all flags and update erofs and zonefs to use
it.
While we're here, update the helper function comments to be more
consistent.
Also add the same to the VMA test headers.
Link: https://lkml.kernel.org/r/568c8f8d6a84ff64014f997517cba7a629f7eed6.1772704455.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Babu Moger <babu.moger@amd.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chao Yu <chao@kernel.org> Cc: Chatre, Reinette <reinette.chatre@intel.com> Cc: Chunhai Guo <guochunhai@vivo.com> Cc: Damien Le Maol <dlemoal@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Martin <dave.martin@arm.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hongbo Li <lihongbo22@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Morse <james.morse@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Johannes Thumshirn <jth@kernel.org> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naohiro Aota <naohiro.aota@wdc.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Yue Hu <zbestahu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The ongoing work around introducing non-system word VMA flags has
introduced a number of helper functions and macros to make life easier
when working with these flags and to make conversions from the legacy use
of VM_xxx flags more straightforward.
This series improves these to reduce confusion as to what they do and to
improve consistency and readability.
Firstly the series renames vma_flags_test() to vma_flags_test_any() to
make it abundantly clear that this function tests whether any of the flags
are set (as opposed to vma_flags_test_all()).
It then renames vma_desc_test_flags() to vma_desc_test_any() for the same
reason. Note that we drop the 'flags' suffix here, as
vma_desc_test_any_flags() would be cumbersome and 'test' implies a flag
test.
Similarly, we rename vma_test_all_flags() to vma_test_all() for
consistency.
Next, we have a couple of instances (erofs, zonefs) where we are now
testing for vma_desc_test_any(desc, VMA_SHARED_BIT) &&
vma_desc_test_any(desc, VMA_MAYWRITE_BIT).
This is silly, so this series introduces vma_desc_test_all() so these
callers can instead invoke vma_desc_test_all(desc, VMA_SHARED_BIT,
VMA_MAYWRITE_BIT).
We then observe that quite a few instances of vma_flags_test_any() and
vma_desc_test_any() are in fact only testing against a single flag.
Using the _any() variant here is just confusing - 'any' of single item
reads strangely and is liable to cause confusion.
So in these instances the series reintroduces vma_flags_test() and
vma_desc_test() as helpers which test against a single flag.
The fact that vma_flags_t is a struct and that vma_flag_t utilises sparse
to avoid confusion with vm_flags_t makes it impossible for a user to
misuse these helpers without it getting flagged somewhere.
The series also updates __mk_vma_flags() and functions invoked by it to
explicitly mark them always inline to match expectation and to be
consistent with other VMA flag helpers.
It also renames vma_flag_set() to vma_flags_set_flag() (a function only
used by __mk_vma_flags()) to be consistent with other VMA flag helpers.
Finally it updates the VMA tests for each of these changes, and introduces
explicit tests for vma_flags_test() and vma_desc_test() to assert that
they behave as expected.
This patch (of 6):
On reflection, it's confusing to have vma_flags_test() and
vma_desc_test_flags() test whether any comma-separated VMA flag bit is
set, while also having vma_flags_test_all() and vma_test_all_flags()
separately test whether all flags are set.
Firstly, rename vma_flags_test() to vma_flags_test_any() to eliminate this
confusion.
Secondly, since the VMA descriptor flag functions are becoming rather
cumbersome, prefer vma_desc_test*() to vma_desc_test_flags*(), and also
rename vma_desc_test_flags() to vma_desc_test_any().
Finally, rename vma_test_all_flags() to vma_test_all() to keep the
VMA-specific helper consistent with the VMA descriptor naming convention
and to help avoid confusion vs. vma_flags_test_all().
While we're here, also update whitespace to be consistent in helper
functions.
Link: https://lkml.kernel.org/r/cover.1772704455.git.ljs@kernel.org Link: https://lkml.kernel.org/r/0f9cb3c511c478344fac0b3b3b0300bb95be95e9.1772704455.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Suggested-by: Pedro Falcato <pfalcato@suse.de> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Babu Moger <babu.moger@amd.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chao Yu <chao@kernel.org> Cc: Chatre, Reinette <reinette.chatre@intel.com> Cc: Chunhai Guo <guochunhai@vivo.com> Cc: Damien Le Maol <dlemoal@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Martin <dave.martin@arm.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hongbo Li <lihongbo22@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Morse <james.morse@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Johannes Thumshirn <jth@kernel.org> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naohiro Aota <naohiro.aota@wdc.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Yue Hu <zbestahu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Ryabinin [Thu, 5 Mar 2026 18:56:59 +0000 (19:56 +0100)]
kasan: fix bug type classification for SW_TAGS mode
kasan_non_canonical_hook() derives orig_addr from kasan_shadow_to_mem(),
but the pointer tag may remain in the top byte. In SW_TAGS mode this
tagged address is compared against PAGE_SIZE and TASK_SIZE, which leads to
incorrect bug classification.
As a result, NULL pointer dereferences may be reported as
"wild-memory-access".
Strip the tag before performing these range checks and use the untagged
value when reporting addresses in these ranges.
Before:
[ ] Unable to handle kernel paging request at virtual address ffef800000000000
[ ] KASAN: maybe wild-memory-access in range [0xff00000000000000-0xff0000000000000f]
After:
[ ] Unable to handle kernel paging request at virtual address ffef800000000000
[ ] KASAN: null-ptr-deref in range [0x0000000000000000-0x000000000000000f]
Link: https://lkml.kernel.org/r/20260305185659.20807-1-ryabinin.a.a@gmail.com Signed-off-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bing Jiao [Tue, 3 Mar 2026 05:25:17 +0000 (05:25 +0000)]
mm/vmscan: fix unintended mtc->nmask mutation in alloc_demote_folio()
In alloc_demote_folio(), mtc->nmask is set to NULL for the first
allocation. If that succeeds, it returns without restoring mtc->nmask to
allowed_mask. For subsequent allocations from the migrate_pages() batch,
mtc->nmask will be NULL. If the target node then becomes full, the
fallback allocation will use nmask = NULL, allocating from any node
allowed by the task cpuset, which for kswapd is all nodes.
To address this issue, use a local copy of the mtc structure with nmask =
NULL for the first allocation attempt specifically, ensuring the original
mtc remains unmodified.
Link: https://lkml.kernel.org/r/20260303052519.109244-1-bingjiao@google.com Fixes: 320080272892 ("mm/demotion: demote pages according to allocation fallback order") Signed-off-by: Bing Jiao <bingjiao@google.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yuvraj Sakshith [Tue, 3 Mar 2026 11:30:31 +0000 (03:30 -0800)]
mm/page_reporting: change PAGE_REPORTING_ORDER_UNSPECIFIED to -1
PAGE_REPORTING_ORDER_UNSPECIFIED is now set to zero. This means, pages of
order zero cannot be reported to a client/driver -- as zero is used to
signal a fallback to MAX_PAGE_ORDER.
Change PAGE_REPORTING_ORDER_UNSPECIFIED to (-1), so that zero can be used
as a valid order with which pages can be reported.
Patch series "Allow order zero pages in page reporting", v4.
Today, page reporting sets page_reporting_order in two ways:
(1) page_reporting.page_reporting_order cmdline parameter
(2) Driver can pass order while registering itself.
In both cases, order zero is ignored by free page reporting because it is
used to set page_reporting_order to a default value, like MAX_PAGE_ORDER.
In some cases we might want page_reporting_order to be zero.
For instance, when virtio-balloon runs inside a guest with tiny memory
(say, 16MB), it might not be able to find a order 1 page (or in the worst
case order MAX_PAGE_ORDER page) after some uptime. Page reporting should
be able to return order zero pages back for optimal memory relinquishment.
This patch changes the default fallback value from '0' to '-1' in all
possible clients of free page reporting (hv_balloon and virtio-balloon)
together with allowing '0' as a valid order in page_reporting_register().
This patch (of 5):
Drivers can pass order of pages to be reported while registering itself.
Today, this is a magic number, 0.
Label this with PAGE_REPORTING_ORDER_UNSPECIFIED and check for it when the
driver is being registered.
Johannes Weiner [Mon, 2 Mar 2026 19:50:18 +0000 (14:50 -0500)]
mm: memcg: separate slab stat accounting from objcg charge cache
Cgroup slab metrics are cached per-cpu the same way as the sub-page charge
cache. However, the intertwined code to manage those dependent caches
right now is quite difficult to follow.
Specifically, cached slab stat updates occur in consume() if there was
enough charge cache to satisfy the new object. If that fails, whole pages
are reserved, and slab stats are updated when the remainder of those
pages, after subtracting the size of the new slab object, are put into the
charge cache. This already juggles a delicate mix of the object size, the
page charge size, and the remainder to put into the byte cache. Doing
slab accounting in this path as well is fragile, and has recently caused a
bug where the input parameters between the two caches were mixed up.
Refactor the consume() and refill() paths into unlocked and locked
variants that only do charge caching. Then let the slab path manage its
own lock section and open-code charging and accounting.
This makes the slab stat cache subordinate to the charge cache:
__refill_obj_stock() is called first to prepare it; __account_obj_stock()
follows to hitch a ride.
This results in a minor behavioral change: previously, a mismatching
percpu stock would always be drained for the purpose of setting up slab
account caching, even if there was no byte remainder to put into the
charge cache. Now, the stock is left alone, and slab accounting takes the
uncached path if there is a mismatch. This is exceedingly rare, and it
was probably never worth draining the whole stock just to cache the slab
stat update.
Link: https://lkml.kernel.org/r/20260302195305.620713-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Hao Li <hao.li@linux.dev> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Johannes Weiner <jweiner@meta.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Mon, 2 Mar 2026 19:50:14 +0000 (14:50 -0500)]
mm: memcg: factor out trylock_stock() and unlock_stock()
Patch series "memcg: obj stock and slab stat caching cleanups".
This is a follow-up to `[PATCH] memcg: fix slab accounting in
refill_obj_stock() trylock path`. The way the slab stat cache and the
objcg charge cache interact appears a bit too fragile. This series
factors those paths apart as much as practical.
This patch (of 5):
Consolidate the local lock acquisition and the local stock lookup. This
allows subsequent patches to use !!stock as an easy way to disambiguate
the locked vs. contended cases through the callstack.
Link: https://lkml.kernel.org/r/20260302195305.620713-1-hannes@cmpxchg.org Link: https://lkml.kernel.org/r/20260302195305.620713-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Reviewed-by: Hao Li <hao.li@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Fri, 6 Mar 2026 06:43:42 +0000 (14:43 +0800)]
arm64: mm: implement the architecture-specific test_and_clear_young_ptes()
Implement the Arm64 architecture-specific test_and_clear_young_ptes() to
enable batched checking of young flags, improving performance during large
folio reclamation when MGLRU is enabled.
While we're at it, simplify ptep_test_and_clear_young() by calling
test_and_clear_young_ptes(). Since callers guarantee that PTEs are
present before calling these functions, we can use pte_cont() to check the
CONT_PTE flag instead of pte_valid_cont().
Performance testing:
Enable MGLRU, then allocate 10G clean file-backed folios by mmap() in a
memory cgroup, and try to reclaim 8G file-backed folios via the
memory.reclaim interface. I can observe 60%+ performance improvement on
my Arm64 32-core server (and about 15% improvement on my X86 machine).
W/o patchset:
real 0m0.470s
user 0m0.000s
sys 0m0.470s
W/ patchset:
real 0m0.180s
user 0m0.001s
sys 0m0.179s
Link: https://lkml.kernel.org/r/7f891d42a720cc2e57862f3b79e4f774404f313c.1772778858.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Fri, 6 Mar 2026 06:43:41 +0000 (14:43 +0800)]
mm: support batched checking of the young flag for MGLRU
Use the batched helper test_and_clear_young_ptes_notify() to check and
clear the young flag to improve the performance during large folio
reclamation when MGLRU is enabled.
Meanwhile, we can also support batched checking the young and dirty flag
when MGLRU walks the mm's pagetable to update the folios' generation
counter. Since MGLRU also checks the PTE dirty bit, use
folio_pte_batch_flags() with FPB_MERGE_YOUNG_DIRTY set to detect batches
of PTEs for a large folio.
Then we can remove the ptep_test_and_clear_young_notify() since it has no
users now.
Note that we also update the 'young' counter and 'mm_stats[MM_LEAF_YOUNG]'
counter with the batched count in the lru_gen_look_around() and
walk_pte_range(). However, the batched operations may inflate these two
counters, because in a large folio not all PTEs may have been accessed.
(Additionally, tracking how many PTEs have been accessed within a large
folio is not very meaningful, since the mm core actually tracks
access/dirty on a per-folio basis, not per page). The impact analysis is
as follows:
1. The 'mm_stats[MM_LEAF_YOUNG]' counter has no functional impact and
is mainly for debugging.
2. The 'young' counter is used to decide whether to place the current
PMD entry into the bloom filters by suitable_to_scan() (so that next
time we can check whether it has been accessed again), which may set
the hash bit in the bloom filters for a PMD entry that hasn't seen much
access. However, bloom filters inherently allow some error, so this
effect appears negligible.
Link: https://lkml.kernel.org/r/378f4acf7d07410aa7c2e4b49d56bb165918eb34.1772778858.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Fri, 6 Mar 2026 06:43:40 +0000 (14:43 +0800)]
mm: add a batched helper to clear the young flag for large folios
Currently, MGLRU will call ptep_test_and_clear_young_notify() to check and
clear the young flag for each PTE sequentially, which is inefficient for
large folios reclamation.
Moreover, on Arm64 architecture, which supports contiguous PTEs, the
Arm64- specific ptep_test_and_clear_young() already implements an
optimization to clear the young flags for PTEs within a contiguous range.
However, this is not sufficient. Similar to the Arm64 specific
clear_flush_young_ptes(), we can extend this to perform batched operations
for the entire large folio (which might exceed the contiguous range:
CONT_PTE_SIZE).
Thus, we can introduce a new batched helper: test_and_clear_young_ptes()
and its wrapper test_and_clear_young_ptes_notify() which are consistent
with the existing functions, to perform batched checking of the young
flags for large folios, which can help improve performance during large
folio reclamation when MGLRU is enabled. And it will be overridden by the
architecture that implements a more efficient batch operation in the
following patches.
Link: https://lkml.kernel.org/r/23ec671bfcc06cd24ee0fbff8e329402742274a0.1772778858.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Fri, 6 Mar 2026 06:43:39 +0000 (14:43 +0800)]
mm: rmap: add a ZONE_DEVICE folio warning in folio_referenced()
The folio_referenced() is used to test whether a folio was referenced
during reclaim. Moreover, ZONE_DEVICE folios are controlled by their
device driver, have a lifetime tied to that driver, and are never placed
on the LRU list. That means we should never try to reclaim ZONE_DEVICE
folios, so add a warning to catch this unexpected behavior in
folio_referenced() to avoid confusion, as discussed in the previous
thread[1].
[1] https://lore.kernel.org/all/16fb7985-ec0f-4b56-91e7-404c5114f899@kernel.org/ Link: https://lkml.kernel.org/r/64d6fb2a33f7101e1d4aca2c9052e0758b76d492.1772778858.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Fri, 6 Mar 2026 06:43:37 +0000 (14:43 +0800)]
mm: use inline helper functions instead of ugly macros
Patch series "support batched checking of the young flag for MGLRU", v3.
This is a follow-up to the previous work [1], to support batched checking
of the young flag for MGLRU.
Similarly, batched checking of young flag for large folios can improve
performance during large-folio reclamation when MGLRU is enabled. I
observed noticeable performance improvements (see patch 5) on an Arm64
machine that supports contiguous PTEs. All mm-selftests are passed.
Patch 1 - 3: cleanup patches.
Patch 4: add a new generic batched PTE helper: test_and_clear_young_ptes().
Patch 5: support batched young flag checking for MGLRU.
Patch 6: implement the Arm64 arch-specific test_and_clear_young_ptes().
This patch (of 6):
People have already complained that these *_clear_young_notify() related
macros are very ugly, so let's use inline helpers to make them more
readable.
In addition, we cannot implement these inline helper functions in the
mmu_notifier.h file, because some arch-specific files will include the
mmu_notifier.h, which introduces header compilation dependencies and
causes build errors (e.g., arch/arm64/include/asm/tlbflush.h). Moreover,
since these functions are only used in the mm, implementing these inline
helpers in the mm/internal.h header seems reasonable.
Link: https://lkml.kernel.org/r/cover.1772778858.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/ea14af84e7967ccebb25082c28a8669d6da8fe57.1772778858.git.baolin.wang@linux.alibaba.com Link: https://lore.kernel.org/all/cover.1770645603.git.baolin.wang@linux.alibaba.com/ Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Alistair Popple <apopple@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: rename zap_vma_ptes() to zap_special_vma_range()
zap_vma_ptes() is the only zapping function we export to modules.
It's essentially a wrapper around zap_vma_range(), however, with some
safety checks:
* That the passed range fits fully into the VMA
* That it's only used for VM_PFNMAP
We will add support for VM_MIXEDMAP next, so use the more-generic term
"special vma", although "special" is a bit overloaded. Maybe we'll later
just support any VM_SPECIAL flag.
While at it, improve the kerneldoc.
Link: https://lkml.kernel.org/r/20260227200848.114019-16-david@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Leon Romanovsky <leon@kernel.org> [drivers/infiniband] Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Arve <arve@android.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dave Airlie <airlied@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hartley Sweeten <hsweeten@visionengravers.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ian Abbott <abbotti@mev.co.uk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/memory: convert details->even_cows into details->skip_cows
The current semantics are confusing: simply because someone specifies an
empty zap_detail struct suddenly makes should_zap_cows() behave
differently. The default should be to also zap CoW'ed anonymous pages.
Really only unmap_mapping_pages() and friends want to skip zapping of
these anon folios.
So let's invert the meaning; turn the confusing "reclaim_pt" check that
overrides other properties in should_zap_cows() into a safety check.
Note that the only caller that sets reclaim_pt=true is
madvise_dontneed_single_vma(), which wants to zap any pages.
Link: https://lkml.kernel.org/r/20260227200848.114019-10-david@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Arve <arve@android.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dave Airlie <airlied@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hartley Sweeten <hsweeten@visionengravers.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ian Abbott <abbotti@mev.co.uk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/memory: move adjusting of address range to unmap_vmas()
__zap_vma_range() has two callers, whereby zap_page_range_single_batched()
documents that the range must fit into the VMA range.
So move adjusting the range to unmap_vmas() where it is actually required
and add a safety check in __zap_vma_range() instead. In unmap_vmas(),
we'd never expect to have empty ranges (otherwise, why have the vma in
there in the first place).
__zap_vma_range() will no longer be called with start == end, so cleanup
the function a bit. While at it, simplify the overly long comment to its
core message.
We will no longer call uprobe_munmap() for start == end, which actually
seems to be the right thing to do.
Note that hugetlb_zap_begin()->...->adjust_range_if_pmd_sharing_possible()
cannot result in the range exceeding the vma range.
Link: https://lkml.kernel.org/r/20260227200848.114019-9-david@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Arve <arve@android.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dave Airlie <airlied@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hartley Sweeten <hsweeten@visionengravers.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ian Abbott <abbotti@mev.co.uk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/oom_kill: use MMU_NOTIFY_CLEAR in __oom_reap_task_mm()
In commit 7269f999934b ("mm/mmu_notifier: use correct mmu_notifier events
for each invalidation") we converted all MMU_NOTIFY_UNMAP to
MMU_NOTIFY_CLEAR, except the ones that actually perform munmap() or
mremap() as documented.
__oom_reap_task_mm() behaves much more like MADV_DONTNEED. So use
MMU_NOTIFY_CLEAR as well.
This is a preparation for further changes.
Link: https://lkml.kernel.org/r/20260227200848.114019-6-david@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Arve <arve@android.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dave Airlie <airlied@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hartley Sweeten <hsweeten@visionengravers.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ian Abbott <abbotti@mev.co.uk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/memory: inline unmap_mapping_range_vma() into unmap_mapping_range_tree()
Let's remove the number of unmap-related functions that cause confusion by
inlining unmap_mapping_range_vma() into its single caller. The end result
looks pretty readable.
Link: https://lkml.kernel.org/r/20260227200848.114019-4-david@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Arve <arve@android.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dave Airlie <airlied@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hartley Sweeten <hsweeten@visionengravers.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ian Abbott <abbotti@mev.co.uk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/memory: remove "zap_details" parameter from zap_page_range_single()
Nobody except memory.c should really set that parameter to non-NULL. So
let's just drop it and make unmap_mapping_range_vma() use
zap_page_range_single_batched() instead.
[david@kernel.org: format on a single line] Link: https://lkml.kernel.org/r/8a27e9ac-2025-4724-a46d-0a7c90894ba7@kernel.org Link: https://lkml.kernel.org/r/20260227200848.114019-3-david@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: Puranjay Mohan <puranjay@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Arve <arve@android.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dave Airlie <airlied@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hartley Sweeten <hsweeten@visionengravers.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ian Abbott <abbotti@mev.co.uk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/madvise: drop range checks in madvise_free_single_vma()
Patch series "mm: cleanups around unmapping / zapping".
A bunch of cleanups around unmapping and zapping. Mostly simplifications,
code movements, documentation and renaming of zapping functions.
With this series, we'll have the following high-level zap/unmap functions
(excluding high-level folio zapping):
* unmap_vmas() for actual unmapping (vmas will go away)
* zap_vma(): zap all page table entries in a vma
* zap_vma_for_reaping(): zap_vma() that must not block
* zap_vma_range(): zap a range of page table entries
* zap_vma_range_batched(): zap_vma_range() with more options and batching
* zap_special_vma_range(): limited zap_vma_range() for modules
* __zap_vma_range(): internal helper
Patch #1 is not about unmapping/zapping, but I stumbled over it while
verifying MADV_DONTNEED range handling.
Patch #16 is related to [1], but makes sense even independent of that.
This patch (of 16):
madvise_vma_behavior()->
madvise_dontneed_free()->madvise_free_single_vma() is only called from
madvise_walk_vmas()
(a) After try_vma_read_lock() confirmed that the whole range falls into
a single VMA (see is_vma_lock_sufficient()).
(b) After adjusting the range to the VMA in the loop afterwards.
madvise_dontneed_free() might drop the MM lock when handling userfaultfd,
but it properly looks up the VMA again to adjust the range.
So in madvise_free_single_vma(), the given range should always fall into a
single VMA and should also span at least one page.
Let's drop the error checks.
The code now matches what we do in madvise_dontneed_single_vma(), where we
call zap_vma_range_batched() that documents: "The range must fit into one
VMA.". Although that function still adjusts that range, we'll change that
soon.
Link: https://lkml.kernel.org/r/20260227200848.114019-1-david@kernel.org Link: https://lkml.kernel.org/r/20260227200848.114019-2-david@kernel.org Link: https://lore.kernel.org/r/aYSKyr7StGpGKNqW@google.com Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Arve <arve@android.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dave Airlie <airlied@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hartley Sweeten <hsweeten@visionengravers.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ian Abbott <abbotti@mev.co.uk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kasan: docs: SLUB is the only remaining slab implementation
We have only the SLUB implementation left in the kernel (referred to as
"slab"). Therefore, there is nothing special regarding KASAN modes when
it comes to the slab allocator anymore.
Drop the stale comment regarding differing SLUB vs. SLAB support.
Michal Hocko [Mon, 2 Mar 2026 11:47:40 +0000 (12:47 +0100)]
vmalloc: support __GFP_RETRY_MAYFAIL and __GFP_NORETRY
__GFP_RETRY_MAYFAIL and __GFP_NORETRY haven't been supported so far
because their semantic (i.e. to not trigger OOM killer) is not possible
with the existing vmalloc page table allocation which is allowing for the
OOM killer.
There are usecases for these modifiers when a large allocation request
should rather fail than trigger OOM killer which wouldn't be able to
handle the situation anyway [1].
While we cannot change existing page table allocation code easily we can
piggy back on scoped NOWAIT allocation for them that we already have in
place. The rationale is that the bulk of the consumed memory is sitting
in pages backing the vmalloc allocation. Page tables are only
participating a tiny fraction. Moreover page tables for virtually
allocated areas are never reclaimed so the longer the system runs to less
likely they are. It makes sense to allow an approximation of
__GFP_RETRY_MAYFAIL and __GFP_NORETRY even if the page table allocation
part is much weaker. This doesn't break the failure mode while it allows
for the no OOM semantic.
mm/vmalloc: fix incorrect size reporting on allocation failure
When __vmalloc_area_node() fails to allocate pages, the failure message
may report an incorrect allocation size, for example:
vmalloc error: size 0, failed to allocate pages, ...
This happens because the warning prints area->nr_pages * PAGE_SIZE. At
this point, area->nr_pages may be zero or partly populated thus it is not
valid.
Report the originally requested allocation size instead by using
nr_small_pages * PAGE_SIZE, which reflects the actual number of pages
being requested by user.
Link: https://lkml.kernel.org/r/20260302114740.2668450-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jane Chu [Mon, 2 Mar 2026 20:10:15 +0000 (13:10 -0700)]
Documentation: fix a hugetlbfs reservation statement
Documentation/mm/hugetlbfs_reserv.rst has
if (resv_needed <= (resv_huge_pages - free_huge_pages))
resv_huge_pages += resv_needed;
which describes this code in gather_surplus_pages()
needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
if (needed <= 0) {
h->resv_huge_pages += delta;
return 0;
}
which means if there are enough free hugepages to account for the new
reservation, simply update the global reservation count without
further action.
But the description is backwards, it should be
if (resv_needed <= (free_huge_pages - resv_huge_pages))
instead.
Link: https://lkml.kernel.org/r/20260302201015.1824798-1-jane.chu@oracle.com Fixes: 70bc0dc578b3 ("Documentation: vm, add hugetlbfs reservation overview") Signed-off-by: Jane Chu <jane.chu@oracle.com> Cc: David Hildenbrand <david@kernel.org> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Gladyshev Ilya [Sun, 1 Mar 2026 13:19:39 +0000 (13:19 +0000)]
mm: make ref_unless functions unless_zero only
There are no users of (folio/page)_ref_add_unless(page, nr, u) with u != 0
[1] and all current users are "internal" for page refcounting API. This
allows us to safely drop this parameter and reduce function semantics to
the "unless zero" cases only.
If needed, these functions for the u!=0 cases can be trivially
reintroduced later using the same atomic_add_unless operations as before.
[1]: The last user was dropped in v5.18 kernel, commit 27674ef6c73f ("mm:
remove the extra ZONE_DEVICE struct page refcount"). There is no trace of
discussion as to why this cleanup wasn't done earlier.
Link: https://lkml.kernel.org/r/a0c89b49d38c671a0bdd35069d15ee13e08314d2.1772370066.git.gladyshev.ilya1@h-partners.com Co-developed-by: Gorbunov Ivan <gorbunov.ivan@h-partners.com> Signed-off-by: Gorbunov Ivan <gorbunov.ivan@h-partners.com> Signed-off-by: Gladyshev Ilya <gladyshev.ilya1@h-partners.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Kiryl Shutsemau <kas@kernel.org> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Fri, 27 Feb 2026 17:07:59 +0000 (18:07 +0100)]
mm/page_alloc: remove IRQ saving/restoring from pcp locking
Effectively revert commit 038a102535eb ("mm/page_alloc: prevent pcp
corruption with SMP=n"). The original problem is now avoided by
pcp_spin_trylock() always failing on CONFIG_SMP=n, so we do not need to
disable IRQs anymore.
It's not a complete revert, because keeping the pcp_spin_(un)lock()
wrappers is useful. Rename them from _maybe_irqsave/restore to _nopin.
The difference from pcp_spin_trylock()/pcp_spin_unlock() is that the
_nopin variants don't perform pcpu_task_pin/unpin().
Link: https://lkml.kernel.org/r/20260227-b4-pcp-locking-cleanup-v1-2-f7e22e603447@kernel.org Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Fri, 27 Feb 2026 17:07:58 +0000 (18:07 +0100)]
mm/page_alloc: effectively disable pcp with CONFIG_SMP=n
Patch series "mm/page_alloc: pcp locking cleanup".
This is a followup to the hotfix 038a102535eb ("mm/page_alloc: prevent pcp
corruption with SMP=n"), to simplify the code and deal with the original
issue properly. The previous RFC attempt [1] argued for changing the UP
spinlock implementation, which was discouraged, but thanks to David's
off-list suggestion, we can achieve the goal without changing the spinlock
implementation.
The main change in Patch 1 relies on the fact that on UP we don't need the
pcp lists for scalability, so just make them always bypassed during
alloc/free by making the pcp trylock an unconditional failure.
The various drain paths that use pcp_spin_lock_maybe_irqsave() continue to
exist but will never do any work in practice. In Patch 2 we can again
remove the irq saving from them that commit 038a102535eb added.
Besides simpler code with all the ugly UP_flags removed, we get less bloat
with CONFIG_SMP=n for mm/page_alloc.o as a result:
The page allocator has been using a locking scheme for its percpu page
caches (pcp) based on spin_trylock() with no _irqsave() part. The trick
is that if we interrupt the locked section, we fail the trylock and just
fallback to the slowpath taking the zone lock. That's more expensive, but
rare, so we don't need to pay the irqsave/restore cost all the time in the
fastpaths.
It's similar to but not exactly local_trylock_t (which is also newer
anyway) because in some cases we do lock the pcp of a non-local cpu to
drain it, in a way that's cheaper than using IPI or queue_work_on().
The complication of this scheme has been UP non-debug spinlock
implementation which assumes spin_trylock() can't fail on UP and has no
state to track whether it's locked. It just doesn't anticipate this usage
scenario. So to work around that we disable IRQs only on UP, complicating
the implementation. Also recently we found years old bug in where we
didn't disable IRQs in related paths - see 038a102535eb ("mm/page_alloc:
prevent pcp corruption with SMP=n").
We can avoid this UP complication by realizing that we do not need the pcp
caching for scalability on UP in the first place. Removing it completely
with #ifdefs is not worth the trouble either. Just make
pcp_spin_trylock() return NULL unconditionally on CONFIG_SMP=n. This
makes the slowpaths unconditional, and we can remove the IRQ save/restore
handling in pcp_spin_trylock()/unlock() completely.
SeongJae Park [Sat, 28 Feb 2026 22:28:26 +0000 (14:28 -0800)]
mm/damon/vaddr: do not split regions for min_nr_regions
The previous commit made DAMON core split regions at the beginning for
min_nr_regions. The virtual address space operation set (vaddr) does
similar work on its own, for a case user delegates entire initial
monitoring regions setup to vaddr. It is unnecessary now, as DAMON core
will do similar work for any case. Remove the duplicated work in vaddr.
Also, remove a helper function that was being used only for the work, and
the test code of the helper function.
Link: https://lkml.kernel.org/r/20260228222831.7232-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sat, 28 Feb 2026 22:28:25 +0000 (14:28 -0800)]
mm/damon/core: split regions for min_nr_regions
Patch series "mm/damon: strictly respect min_nr_regions".
DAMON core respects min_nr_regions only at merge operation. DAMON API
callers are therefore responsible to respect or ignore that. Only vaddr
ops is respecting that, but only for initial start time. DAMON sysfs
interface allows users to setup the initial regions that DAMON core also
respects. But, again, it works for only the initial time. Users setting
the regions for min_nr_regions can be difficult and inefficient, when the
min_nr_regions value is high. There was actually a report [1] from a
user. The use case was page granular access monitoring with a large
aggregation interval.
Make the following three changes for resolving the issue. First (patch
1), make DAMON core split regions at the beginning and every aggregation
interval, to respect the min_nr_regions. Second (patch 2), drop the
vaddr's split operations and related code that are no more needed. Third
(patch 3), add a kunit test for the newly introduced function.
This patch (of 3):
DAMON core layer respects the min_nr_regions parameter by setting the
maximum size of each region as total monitoring region size divided by the
parameter value. And the limit is applied by preventing merge of regions
that result in a region larger than the maximum size. The limit is
updated per ops update interval, because vaddr updates the monitoring
regions on the ops update callback.
It does nothing for the beginning state. That's because the users can set
the initial monitoring regions as they want. That is, if the users really
care about the min_nr_regions, they are supposed to set the initial
monitoring regions to have more than min_nr_regions regions. The virtual
address space operation set, vaddr, has an exceptional case. Users can
ask the ops set to configure the initial regions on its own. For the
case, vaddr sets up the initial regions to meet the min_nr_regions. So,
vaddr has exceptional support, but basically users are required to set the
regions on their own if they want min_nr_regions to be respected.
When 'min_nr_regions' is high, such initial setup is difficult. If DAMON
sysfs interface is used for that, the memory for saving the initial setup
is also a waste.
Even if the user forgives the setup, DAMON will eventually make more than
min_nr_regions regions by splitting operations. But it will take time.
If the aggregation interval is long, the delay could be problematic.
There was actually a report [1] of the case. The reporter wanted to do
page granular monitoring with a large aggregation interval.
Also, DAMON is doing nothing for online changes on monitoring regions and
min_nr_regions. For example, the user can remove a monitoring region or
increase min_nr_regions while DAMON is running.
Split regions larger than the size at the beginning of the kdamond main
loop, to fix the initial setup issue. Also do the split every aggregation
interval, for online changes. This means the behavior is slightly
changed. It is difficult to imagine a use case that actually depends on
the old behavior, though. So this change is arguably fine.
Note that the size limit is aligned by damon_ctx->min_region_sz and cannot
be zero. That is, if min_nr_region is larger than the total size of
monitoring regions divided by ->min_region_sz, that cannot be respected.
kasan_free_pxd() assumes the page table is always struct page aligned.
But that's not always the case for all architectures. E.g. In case of
powerpc with 64K pagesize, PUD table (of size 4096) comes from slab cache
named pgtable-2^9. Hence instead of page_to_virt(pxd_page()) let's just
directly pass the start of the pxd table which is passed as the 1st
argument.
This fixes the below double free kasan issue seen with PMEM:
radix-mmu: Mapped 0x0000047d10000000-0x0000047f90000000 with 2.00 MiB pages
==================================================================
BUG: KASAN: double-free in kasan_remove_zero_shadow+0x9c4/0xa20
Free of addr c0000003c38e0000 by task ndctl/2164
The buggy address belongs to the object at c0000003c38e0000
which belongs to the cache pgtable-2^9 of size 4096
The buggy address is located 0 bytes inside of
4096-byte region [c0000003c38e0000, c0000003c38e1000)
[ 138.953636] [ T2164] Memory state around the buggy address:
[ 138.953643] [ T2164] c0000003c38dff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 138.953652] [ T2164] c0000003c38dff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 138.953661] [ T2164] >c0000003c38e0000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 138.953669] [ T2164] ^
[ 138.953675] [ T2164] c0000003c38e0080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 138.953684] [ T2164] c0000003c38e0100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 138.953692] [ T2164] ==================================================================
[ 138.953701] [ T2164] Disabling lock debugging due to kernel taint
Link: https://lkml.kernel.org/r/2f9135c7866c6e0d06e960993b8a5674a9ebc7ec.1771938394.git.ritesh.list@gmail.com Fixes: 0207df4fa1a8 ("kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN") Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Replace READ_ONCE() with the existing standard page table accessor for PUD
aka pudp_get() in pud_trans_unstable(). This does not create any
functional change for platforms that do not override pudp_get(), which
still defaults to READ_ONCE().
Link: https://lkml.kernel.org/r/20260227040300.2091901-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/debug_vm_pgtable: replace WRITE_ONCE() with pxd_clear()
Replace WRITE_ONCE() with generic pxd_clear() to clear out the page table
entries as required. Besides this does not cause any functional change as
well.
Link: https://lkml.kernel.org/r/20260227061204.2215395-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Suggested-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Ackeed-by: SeongJae Park <sj@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Proof: Both loops in hpage_collapse_scan_file and collapse_file, which
iterate on the xarray, have the invariant that start <= folio->index <
start + HPAGE_PMD_NR ... (i)
A folio is always naturally aligned in the pagecache, therefore
folio_order == HPAGE_PMD_ORDER => IS_ALIGNED(folio->index, HPAGE_PMD_NR) == true ... (ii)
thp_vma_allowable_order -> thp_vma_suitable_order requires that the virtual
offsets in the VMA are aligned to the order,
=> IS_ALIGNED(start, HPAGE_PMD_NR) == true ... (iii)
Combining (i), (ii) and (iii), the claim is proven.
Therefore, remove this check.
While at it, simplify the comments.
Link: https://lkml.kernel.org/r/20260227143501.1488110-1-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Fri, 27 Feb 2026 17:06:21 +0000 (09:06 -0800)]
mm/damon/core: do non-safe region walk on kdamond_apply_schemes()
kdamond_apply_schemes() is using damon_for_each_region_safe(), which is
safe for deallocation of the region inside the loop. However, the loop
internal logic does not deallocate regions. Hence it is only wasting the
next pointer. Also, it causes a problem.
When an address filter is applied, and there is a region that intersects
with the filter, the filter splits the region on the filter boundary. The
intention is to let DAMOS apply action to only filtered-in address ranges.
However, it is using damon_for_each_region_safe(), which sets the next
region before the execution of the iteration. Hence, the region that
split and now will be next to the previous region, is simply ignored. As
a result, DAMOS applies the action to target regions bit slower than
expected, when the address filter is used. Shouldn't be a big problem but
definitely better to be fixed. damos_skip_charged_region() was working
around the issue using a double pointer hack.
Use damon_for_each_region(), which is safe for this use case. And drop
the work around in damos_skip_charged_region().
SeongJae Park [Fri, 27 Feb 2026 17:06:20 +0000 (09:06 -0800)]
mm/damon/core: set quota-score histogram with core filters
Patch series "mm/damon/core: improve DAMOS quota efficiency for core layer
filters".
Improve two below problematic behaviors of DAMOS that makes it less
efficient when core layer filters are used.
DAMOS generates the under-quota regions prioritization-purpose access
temperature histogram [1] with only the scheme target access pattern. The
DAMOS filters are ignored on the histogram, and this can result in the
scheme not applied to eligible regions. For working around this, users
had to use separate DAMON contexts. The memory tiering approaches are
such examples.
DAMOS splits regions that intersect with address filters, so that only
filtered-out part of the region is skipped. But, the implementation is
skipping the other part of the region that is not filtered out, too. As a
result, DAMOS can work slower than expected.
Improve the two inefficient behaviors with two patches, respectively.
Read the patches for more details about the problem and how those are
fixed.
This patch (of 2):
The histogram for under-quota region prioritization [1] is made for all
regions that are eligible for the DAMOS target access pattern. When there
are DAMOS filters, the prioritization-threshold access temperature that
generated from the histogram could be inaccurate.
For example, suppose there are three regions. Each region is 1 GiB. The
access temperature of the regions are 100, 50, and 0. And a DAMOS scheme
that targets _any_ access temperature with quota 2 GiB is being used. The
histogram will look like below:
temperature size of regions having >=temperature temperature
0 3 GiB
50 2 GiB
100 1 GiB
Based on the histogram and the quota (2 GiB), DAMOS applies the action to
only the regions having >=50 temperature. This is all good.
Let's suppose the region of temperature 50 is excluded by a DAMOS filter.
Regardless of the filter, DAMOS will try to apply the action on only
regions having >=50 temperature. Because the region of temperature 50 is
filtered out, the action is applied to only the region of temperature 100.
Worse yet, suppose the filter is excluding regions of temperature 50 and
100. Then no action is really applied to any region, while the region of
temperature 0 is there.
People used to work around this by utilizing multiple contexts, instead of
the core layer DAMOS filters. For example, DAMON-based memory tiering
approaches including the quota auto-tuning based one [2] are using a DAMON
context per NUMA node. If the above explained issue is effectively
alleviated, those can be configured again to run with single context and
DAMOS filters for applying the promotion and demotion to only specific
NUMA nodes.
Alleviate the problem by checking core DAMOS filters when generating the
histogram. The reason to check only core filters is the overhead. While
core filters are usually for coarse-grained filtering (e.g.,
target/address filters for process, NUMA, zone level filtering), operation
layer filters are usually for fine-grained filtering (e.g., for anon
page). Doing this for operation layer filters would cause significant
overhead. There is no known use case that is affected by the operation
layer filters-distorted histogram problem, though. Do this for only core
filters for now. We will revisit this for operation layer filters in
future. We might be able to apply a sort of sampling based operation
layer filtering.
After this fix is applied, for the first case that there is a DAMOS filter
excluding the region of temperature 50, the histogram will be like below:
temperature size of regions having >=temperature temperature
0 2 GiB
100 1 GiB
And DAMOS will set the temperature threshold as 0, allowing both regions
of temperatures 0 and 100 be applied.
For the second case that there is a DAMOS filter excluding the regions of
temperature 50 and 100, the histogram will be like below:
temperature size of regions having >=temperature temperature
0 1 GiB
And DAMOS will set the temperature threshold as 0, allowing the region of
temperature 0 be applied.
[1] 'Prioritization' section of Documentation/mm/damon/design.rst
[2] commit 0e1c773b501f ("mm/damon/core: introduce damos quota goal
metrics for memory node utilization")
Kiryl Shutsemau [Fri, 27 Feb 2026 19:42:56 +0000 (19:42 +0000)]
mm/slab: use compound_head() in page_slab()
page_slab() contained an open-coded implementation of compound_head().
Replace the duplicated code with a direct call to compound_head().
Link: https://lkml.kernel.org/r/20260227194302.274384-19-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kiryl Shutsemau [Fri, 27 Feb 2026 19:42:55 +0000 (19:42 +0000)]
hugetlb: update vmemmap_dedup.rst
Update the documentation regarding vmemmap optimization for hugetlb to
reflect the changes in how the kernel maps the tail pages.
Fake heads no longer exist. Remove their description.
[kas@kernel.org: update vmemmap_dedup.rst] Link: https://lkml.kernel.org/r/20260302105630.303492-1-kas@kernel.org Link: https://lkml.kernel.org/r/20260227194302.274384-18-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kiryl Shutsemau [Fri, 27 Feb 2026 19:42:54 +0000 (19:42 +0000)]
mm: remove the branch from compound_head()
The compound_head() function is a hot path. For example, the zap path
calls it for every leaf page table entry.
Rewrite the helper function in a branchless manner to eliminate the risk
of CPU branch misprediction.
Link: https://lkml.kernel.org/r/20260227194302.274384-17-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The hugetlb_optimize_vmemmap_key static key was used to guard fake head
detection in compound_head() and related functions. It allowed skipping
the fake head checks entirely when HVO was not in use.
With fake heads eliminated and the detection code removed, the static key
serves no purpose. Remove its definition and all increment/decrement
calls.
Link: https://lkml.kernel.org/r/20260227194302.274384-16-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kiryl Shutsemau [Fri, 27 Feb 2026 19:42:52 +0000 (19:42 +0000)]
hugetlb: remove VMEMMAP_SYNCHRONIZE_RCU
The VMEMMAP_SYNCHRONIZE_RCU flag triggered synchronize_rcu() calls to
prevent a race between HVO remapping and page_ref_add_unless(). The race
could occur when a speculative PFN walker tried to modify the refcount on
a struct page that was in the process of being remapped to a fake head.
With fake heads eliminated, page_ref_add_unless() no longer needs RCU
protection.
Remove the flag and synchronize_rcu() calls.
Link: https://lkml.kernel.org/r/20260227194302.274384-15-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kiryl Shutsemau [Fri, 27 Feb 2026 19:42:51 +0000 (19:42 +0000)]
mm: drop fake head checks
With fake head pages eliminated in the previous commit, remove the
supporting infrastructure:
- page_fixed_fake_head(): no longer needed to detect fake heads;
- page_is_fake_head(): no longer needed;
- page_count_writable(): no longer needed for RCU protection;
- RCU read_lock in page_ref_add_unless(): no longer needed;
This substantially simplifies compound_head() and page_ref_add_unless(),
removing both branches and RCU overhead from these hot paths.
RCU was required to serialize allocation of hugetlb page against
get_page_unless_zero() and prevent writing to read-only fake head. It is
redundant without fake heads.
See bd225530a4c7 ("mm/hugetlb_vmemmap: fix race with speculative PFN
walkers") for more details.
synchronize_rcu() in mm/hugetlb_vmemmap.c will be removed by a separate
patch.
Link: https://lkml.kernel.org/r/20260227194302.274384-14-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kiryl Shutsemau [Fri, 27 Feb 2026 19:42:50 +0000 (19:42 +0000)]
mm/hugetlb: remove fake head pages
HugeTLB Vmemmap Optimization (HVO) reduces memory usage by freeing most
vmemmap pages for huge pages and remapping the freed range to a single
page containing the struct page metadata.
With the new mask-based compound_info encoding (for power-of-2 struct page
sizes), all tail pages of the same order are now identical regardless of
which compound page they belong to. This means the tail pages can be
truly shared without fake heads.
Allocate a single page of initialized tail struct pages per zone per order
in the vmemmap_tails[] array in struct zone. All huge pages of that order
in the zone share this tail page, mapped read-only into their vmemmap.
The head page remains unique per huge page.
Redefine MAX_FOLIO_ORDER using ilog2(). The define has to produce a
compile-constant as it is used to specify vmemmap_tail array size. For
some reason, compiler is not able to solve get_order() at compile-time,
but ilog2() works.
Avoid PUD_ORDER to define MAX_FOLIO_ORDER as it adds dependency to
<linux/pgtable.h> which generates hard-to-break include loop.
This eliminates fake heads while maintaining the same memory savings, and
simplifies compound_head() by removing fake head detection.
Link: https://lkml.kernel.org/r/20260227194302.274384-13-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
x86/vdso: undefine CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP for vdso32
The 32-bit VDSO build on x86_64 uses fake_32bit_build.h to undefine
various kernel configuration options that are not suitable for the VDSO
context or may cause build issues when including kernel headers.
Undefine CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP in fake_32bit_build.h to
prepare for change in HugeTLB Vmemmap Optimization.
Link: https://lkml.kernel.org/r/20260227194302.274384-12-kas@kernel.org Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kiryl Shutsemau [Fri, 27 Feb 2026 19:42:48 +0000 (19:42 +0000)]
mm/hugetlb: refactor code around vmemmap_walk
To prepare for removing fake head pages, the vmemmap_walk code is being
reworked.
The reuse_page and reuse_addr variables are being eliminated. There will
no longer be an expectation regarding the reuse address in relation to the
operated range. Instead, the caller will provide head and tail vmemmap
pages.
Currently, vmemmap_head and vmemmap_tail are set to the same page, but
this will change in the future.
The only functional change is that __hugetlb_vmemmap_optimize_folio() will
abandon optimization if memory allocation fails.
Link: https://lkml.kernel.org/r/20260227194302.274384-11-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Hildenbrand (arm) <david@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/hugetlb: defer vmemmap population for bootmem hugepages
Currently, the vmemmap for bootmem-allocated gigantic pages is populated
early in hugetlb_vmemmap_init_early(). However, the zone information is
only available after zones are initialized. If it is later discovered
that a page spans multiple zones, the HVO mapping must be undone and
replaced with a normal mapping using vmemmap_undo_hvo().
Defer the actual vmemmap population to hugetlb_vmemmap_init_late(). At
this stage, zones are already initialized, so it can be checked if the
page is valid for HVO before deciding how to populate the vmemmap.
This allows us to remove vmemmap_undo_hvo() and the complex logic required
to rollback HVO mappings.
In hugetlb_vmemmap_init_late(), if HVO population fails or if the zones
are invalid, fall back to a normal vmemmap population.
Postponing population until hugetlb_vmemmap_init_late() also makes zone
information available from within vmemmap_populate_hvo().
Link: https://lkml.kernel.org/r/20260227194302.274384-10-kas@kernel.org Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kiryl Shutsemau [Fri, 27 Feb 2026 19:42:46 +0000 (19:42 +0000)]
mm/sparse: check memmap alignment for compound_info_has_mask()
If page->compound_info encodes a mask, it is expected that vmemmap to be
naturally aligned to the maximum folio size.
Add a VM_WARN_ON_ONCE() to check the alignment.
Link: https://lkml.kernel.org/r/20260227194302.274384-9-kas@kernel.org Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Baoquan He <bhe@redhat.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: David Hildenbrand (arm) <david@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>