SeongJae Park [Sun, 18 Jan 2026 18:02:56 +0000 (10:02 -0800)]
Docs/admin-guide/mm/damon/usage: introduce DAMON modules at the beginning
The DAMON usage document provides a list of available DAMON interfaces with
brief introductions at the beginning of the doc. The list is missing the
special-purpose DAMON modules, even though they are one of the major
suggested interfaces. Add an item for them to the list.
Link: https://lkml.kernel.org/r/20260118180305.70023-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sun, 18 Jan 2026 18:02:55 +0000 (10:02 -0800)]
Docs/mm/damon/design: add reference to DAMON_STAT usage
The design document's special-purpose DAMON modules section provides a
list of links to the usage documents of the existing DAMON modules. The
link for DAMON_STAT is missing, though. Add the missing link.
Link: https://lkml.kernel.org/r/20260118180305.70023-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
People sometimes get confused about the purposes of the DAMON
special-purpose modules versus the sample modules. Clarify this in the
design document by adding a section describing their existence and purposes.
Link: https://lkml.kernel.org/r/20260118180305.70023-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sun, 18 Jan 2026 18:02:53 +0000 (10:02 -0800)]
Docs/mm/damon/design: link repology instead of Fedora package
The document introduces Fedora as one way to get the DAMON user-space tool
(damo) from an OS packaging system. More Linux distros than Fedora
provide damo via their packaging systems, though. Replace the Fedora part
with the repology.org page that shows damo packaging status across
multiple Linux distros.
Link: https://lkml.kernel.org/r/20260118180305.70023-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sun, 18 Jan 2026 18:02:52 +0000 (10:02 -0800)]
Docs/mm/damon/index: simplify the intro
Patch series "Docs/mm/damon: update intro, modules, maintainer profile,
and misc".
Update the DAMON documentation with eight patches for wordsmithing,
clarification, and miscellaneous outdated content. Patch 1 simplifies the
brief introduction of DAMON. Patch 2 updates the DAMON user-space tool
distro packaging information in the design doc to refer to repology rather
than only Fedora. The three following patches update the design and usage
documents to clarify the purposes of DAMON sample modules (patch 3), and
outdated information about usage of DAMON modules (patches 4 and 5). The
final three patches update the usage and maintainer-profile documents for
the sysfs refresh_ms feature behavior (patch 6), synchronize the DAMON
MAINTAINERS section name (patch 7), and the broken damon-tests performance
tests (patch 8).
This patch (of 8):
The intro is a bit verbose and redundant. Simplify it by replacing
details with more links to the design docs, and refining the design points
list.
Link: https://lkml.kernel.org/r/20260118180305.70023-1-sj@kernel.org Link: https://lkml.kernel.org/r/20260118180305.70023-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Taeyang Kim [Sat, 17 Jan 2026 10:14:28 +0000 (19:14 +0900)]
mm: update kernel-doc for __swap_cache_clear_shadow()
The kernel-doc comment referred to swap_cache_clear_shadow(), but the
actual function name is __swap_cache_clear_shadow().
Update the comment to match the function name.
Link: https://lkml.kernel.org/r/20260117101428.113154-1-maainnewkin59@gmail.com Signed-off-by: Taeyang Kim <maainnewkin59@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sat, 17 Jan 2026 17:52:55 +0000 (09:52 -0800)]
mm/damon: rename min_sz_region of damon_ctx to min_region_sz
The 'min_sz_region' field of 'struct damon_ctx' represents the minimum size
of each DAMON region for the context. 'struct damos_access_pattern' has a
field of the same name. This confuses readers and makes 'grep' less useful
for them. Rename it to 'min_region_sz'.
SeongJae Park [Sat, 17 Jan 2026 17:52:54 +0000 (09:52 -0800)]
mm/damon: rename DAMON_MIN_REGION to DAMON_MIN_REGION_SZ
The macro is for the default minimum size of each DAMON region. There was
a case where a reader was confused about whether it is the minimum number of
total DAMON regions, which is set via damon_attrs->min_nr_regions. Make the
name more explicit.
SeongJae Park [Sat, 17 Jan 2026 17:52:53 +0000 (09:52 -0800)]
mm/damon/core: rename damos_filter_out() to damos_core_filter_out()
DAMOS filters are processed on the core layer or the operations layer,
depending on their types. damos_filter_out() in core.c, which is only for
core layer handled filters, can obscure that fact. Rename it to
damos_core_filter_out() to be more explicit about it.
damon_call_control->dealloc_on_cancel works only when ->repeat is true.
But the behavior is not clearly documented. DAMON API callers can
understand the behavior only after reading the kdamond_call() code. Document
the behavior in the kernel-doc comment of damon_call_control.
SeongJae Park [Sat, 17 Jan 2026 17:52:51 +0000 (09:52 -0800)]
mm/damon/core: process damon_call_control requests on a local list
kdamond_call() handles damon_call() requests on the ->call_controls list
of damon_ctx, which is shared with damon_call() callers. To protect the
list from concurrent accesses while keeping the callback function
independent of call_controls_lock, the function does complicated locking
operations. For each damon_call_control object on the list, the function
removes the control object from the list under the lock, invokes the
callback of the control object without the lock, and then puts the control
object back on the list if needed, again under the lock. This is
complicated, and can contend the lock more frequently with other DAMON API
caller threads as the number of concurrent callback requests increases.
The contention overhead is not a big deal, but the increased race
opportunity can cause headaches.
Simplify the locking sequence by moving all damon_call_control objects
from the shared list to a local list at once under a single lock
acquisition, processing the callback requests without the lock, and then
adding repeat mode controls back to the shared list at once, again under a
single lock acquisition. This change makes the number of locking
operations in kdamond_call() always two, regardless of the number of
queued requests.
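A conceptual sketch of the resulting flow follows; the list and callback
field names follow the damon_call_control description above, but the exact
code and helper name are assumptions, not the actual implementation:

    static void kdamond_call_sketch(struct damon_ctx *ctx)
    {
        struct damon_call_control *control, *next;
        LIST_HEAD(repeat_controls);
        LIST_HEAD(controls);

        /* First locking: move all queued requests to a local list at once. */
        mutex_lock(&ctx->call_controls_lock);
        list_splice_init(&ctx->call_controls, &controls);
        mutex_unlock(&ctx->call_controls_lock);

        /* Process the requests without holding the lock. */
        list_for_each_entry_safe(control, next, &controls, list) {
            list_del(&control->list);
            control->fn(control->data);
            if (control->repeat)
                list_add_tail(&control->list, &repeat_controls);
        }

        /* Second locking: put repeat mode requests back at once. */
        mutex_lock(&ctx->call_controls_lock);
        list_splice(&repeat_controls, &ctx->call_controls);
        mutex_unlock(&ctx->call_controls_lock);
    }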
SeongJae Park [Sat, 17 Jan 2026 17:52:50 +0000 (09:52 -0800)]
mm/damon/core: cancel damos_walk() before damon_ctx->kdamond reset
A damos_walk() request is canceled after damon_ctx->kdamond is reset. This
can create weird situations where damon_is_running() returns false but the
DAMON context still has the damos_walk() request linked. There was a similar
situation for damon_call() request handling [1], which _was_ able to
cause a racy use-after-free bug. Unlike the damon_call() case, because
damos_walk() is always handled synchronously and allows only a single
request at a time, there are no such problematic race cases. But keeping it
as is could seed another subtle race condition bug in the future.
Avoid that by cancelling the request before the ->kdamond reset. Note
that this change also makes all damon_ctx dependent resource cleanups
consistently done before the damon_ctx->kdamond reset.
SeongJae Park [Sat, 17 Jan 2026 17:52:49 +0000 (09:52 -0800)]
mm/damon/core: cleanup targets and regions at once on kdamond termination
When kdamond terminates, it destroys the regions of the context first, and
the targets of the context just before the kdamond main function returns.
Because regions are linked inside targets, doing these separately is
inefficient and looks weird. A more serious problem is that the cleanup
of the targets is done after the damon_ctx->kdamond reset, which is the
event that lets DAMON API callers know the kdamond is no longer actively
running. That is, some DAMON targets could still exist while kdamond is
not running. There are no real problems from this, but this implicit fact
could cause subtle racy issues in the future. Destroy targets and regions
at once.
Adding context on how the code evolved this way. Doing only regions
destruction was because putting the pids of the targets was done by DAMON
API callers. Commit 7114bc5e01cf ("mm/damon/core: add cleanup_target()
ops callback") moved that role to be done via the operations set on each
target destruction. Hence it removed the reason to do only regions
cleanup. Commit 3a69f1635769 ("mm/damon/core: destroy targets when
kdamond_fn() finish") therefore additionally destroyed targets at kdamond
termination time. It was still separated from regions destruction because
damon_operations->cleanup() may do additional targets cleanup. Placing
the targets destruction after the damon_ctx->kdamond reset was just an
unnecessary decision of that commit. The previous commit removed
damon_operations->cleanup(), so there is no more reason to do the
destruction of regions and targets separately.
SeongJae Park [Sat, 17 Jan 2026 17:52:48 +0000 (09:52 -0800)]
mm/damon: remove damon_operations->cleanup()
Patch series "mm/damon: cleanup kdamond, damon_call(), damos filter and
DAMON_MIN_REGION".
Do miscellaneous code cleanups to improve readability. The first three
patches clean up the kdamond termination process, by removing the unused
operations set cleanup callback (patch 1) and moving damon_ctx specific
resource cleanups on kdamond termination to a synchronization-easy place
(patches 2 and 3). The next two patches touch the damon_call()
infrastructure, by refactoring kdamond_call() to do fewer and simpler
locking operations (patch 4), and documenting when dealloc_on_cancel works
(patch 5). The final three patches rename things for clearer use. They
rename damos_filter_out() to be more explicit about the fact that it is
only for core-handled filters (patch 6), the DAMON_MIN_REGION macro to be
more explicit that it is not about the number of regions but the size of
each region (patch 7), and damon_ctx->min_sz_region to be different from
damos_access_pattern->min_sz_region (patch 8), so that they are not
confusing and are easy to grep.
This patch (of 8):
damon_operations->cleanup() was added for a case that an operation set
implementation requires additional cleanups. But no such implementation
exists at the moment. Remove it.
When the test fails, it shows the whole set of sampled working set size
measurements. The purpose is to show the distribution of the measured
values, to let the tester know whether it was just an intermittent failure.
Multiple identical values in the output are therefore unnecessary. It was
not a big deal since the test used to fail only once. But the test can now
fail multiple times with increasing working set sizes, until it passes or
the working set size reaches a limit. Hence the noisy output can be quite
long and annoying. Print only the deduplicated distribution information.
SeongJae Park [Sat, 17 Jan 2026 02:07:27 +0000 (18:07 -0800)]
selftests/damon/wss_estimation: ensure number of collected wss
DAMON selftest for working set size estimation collects DAMON's working
set size measurements of the running artificial memory access generator
program until the program is finished. Depending on how quickly the
program finishes, and how quickly DAMON starts, the number of collected
working set size measurements may vary, and make the test results
unreliable. Ensure it collects 40 measurements by using the repeat mode
of the artificial memory access generator program, and finish the
measurements only after the desired number of collections is made.
SeongJae Park [Sat, 17 Jan 2026 02:07:26 +0000 (18:07 -0800)]
selftests/damon/access_memory: add repeat mode
'access_memory' is an artificial memory access generator program that is
used for a few DAMON selftests. It accesses a given number of regions one
by one only once, and exits. Depending on systems, the test workload may
exit faster than expected, making the tests unreliable. For reliable
control of the artificial memory access pattern, add a mode to make it
repeat running.
SeongJae Park [Sat, 17 Jan 2026 02:07:25 +0000 (18:07 -0800)]
selftests/damon/wss_estimation: test for up to 160 MiB working set size
DAMON reads and writes Accessed bits of page tables without manual TLB
flush for two reasons. First, it minimizes the overhead. Second, real
systems that need DAMON are expected to be memory intensive enough to
cause periodic TLB flushes. For test setups that use small test
workloads, however, the system's TLB could be big enough to cover all or
most accesses of the test workload. In this case, no page table walk
happens and DAMON cannot show any access from the test workload.
The test workload for DAMON's working set size estimation selftest is such
a case. It accesses only 10 MiB working set, and it turned out there are
test setups that have TLBs large enough to cover the 10 MiB data accesses.
As a result, the test fails depending on the test machine.
Make it more reliable by trying larger working sets up to 160 MiB when it
fails.
SeongJae Park [Sat, 17 Jan 2026 02:07:24 +0000 (18:07 -0800)]
selftests/damon/sysfs_memcg_path_leak.sh: use kmemleak
Patch series "selftests/damon: improve leak detection and wss estimation
reliability".
Two DAMON selftests, namely 'sysfs_memcg_leak' and
'sysfs_update_schemes_tried_regions_wss_estimation' frequently show
intermittent failures due to their unreliable leak detection and working
set size estimation. Make those more reliable.
This patch (of 5):
sysfs_memcg_path_leak.sh determines if the memory leak has happened by
seeing if Slab size on /proc/meminfo increases more than expected after an
action. Depending on the system and background workloads, the reasonable
expectation varies. For this reason, the test frequently shows
intermittent failures. Use kmemleak, which is much more reliable and
correct, instead.
Lorenzo Stoakes [Fri, 16 Jan 2026 13:20:53 +0000 (13:20 +0000)]
selftests/mm: remove virtual_address_range test
This self test is asserting internal implementation details and is highly
vulnerable to internal kernel changes as a result.
It is currently failing locally from at least v6.17, and it seems that it
may have been failing for longer in many configurations/hardware as it
skips if e.g. CONFIG_ANON_VMA_NAME is not specified.
With these skips and the fact that run_vmtests.sh won't run the tests in
certain configurations it is likely we have simply missed this test being
broken in CI for a long while.
I have tried multiple versions of these tests and am unable to find a
working bisect as previous versions of the test fail also.
The tests are essentially mmap()'ing a series of mappings with no hint and
asserting what the get_unmapped_area*() functions will come up with, with
seemingly few checks for what other mappings may already be in place.
It then appears to be mmap()'ing with a hint, and making a series of
similar assertions about the internal implementation details of the
hinting logic.
Commit 0ef3783d7558 ("selftests/mm: add support to test 4PB VA on PPC64"),
commit 3bd6137220bb ("selftests/mm: virtual_address_range: avoid reading
from VM_IO mappings"), and especially commit a005145b9c96 ("selftests/mm:
virtual_address_range: mmap() without PROT_WRITE") are good examples of
the whack-a-mole nature of maintaining this test.
The last commit there being particularly pertinent as it was accounting
for an internal implementation detail change that really should have no
bearing on self-tests, that is commit e93d2521b27f ("x86/vdso: Split
virtual clock pages into dedicated mapping").
The purpose of the mm self-tests is to assert attributes of the API
exposed to users, and to ensure that expectations are met.
This test is emphatically not doing this, rather making a series of
assumptions about internal implementation details and asserting them.
It therefore, sadly, seems that the best course is to remove this test
altogether.
Link: https://lkml.kernel.org/r/20260116132053.857887-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Mon, 12 Jan 2026 15:09:54 +0000 (23:09 +0800)]
mm: hugetlb_cma: mark hugetlb_cma{_only} as __ro_after_init
hugetlb_cma and hugetlb_cma_only are initialized once during init and
never changed.
Link: https://lkml.kernel.org/r/20260112150954.1802953-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jane Chu <jane.chu@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Check hugetlb_cma_size, which helps avoid an unnecessary gfp check or
nodemask traversal.
Link: https://lkml.kernel.org/r/20260112150954.1802953-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jane Chu <jane.chu@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If no free hugepage folios are available, there is no need to perform any
replacement operations. Additionally, gigantic folios should not be
replaced under any circumstances. Therefore, we only check for the
presence of non-gigantic folios, also adding the gigantic folio check to
avoid accidental replacement.
To optimize performance, skip unnecessary pfn iterations over compound
pages and high-order buddy pages.
A simple test on a machine with 114G free memory, allocating 120 * 1G
HugeTLB folios (104 successfully returned):
time echo 120 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
Before: 0m0.602s
After: 0m0.431s
[wangkefeng.wang@huawei.com: v2] Link: https://lkml.kernel.org/r/20260114135512.2159799-1-wangkefeng.wang@huawei.com
[akpm@linux-foundation.org: use single-return-point style, tweak comment] Link: https://lkml.kernel.org/r/20260112150954.1802953-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jane Chu <jane.chu@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
To optimize this process, use the new helper page_is_unmovable() to avoid
more unnecessary iterations for compound pages, such as THP not on the LRU,
and high-order buddy pages, which significantly improves the efficiency
of contiguous memory allocation.
A simple test on a machine with 114G free memory, allocating 120 * 1G
HugeTLB folios (104 successfully returned):
time echo 120 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
Before: 0m3.605s
After: 0m0.602s
Link: https://lkml.kernel.org/r/20260112150954.1802953-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jane Chu <jane.chu@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Mon, 12 Jan 2026 15:09:50 +0000 (23:09 +0800)]
mm: page_isolation: introduce page_is_unmovable()
Patch series "mm: accelerate gigantic folio allocation".
Optimize pfn_range_valid_contig() and replace_free_hugepage_folios() in
alloc_contig_frozen_pages() to speed up gigantic folio allocation. The
allocation time for 120*1G folios drops from 3.605s to 0.431s.
This patch (of 5):
Factor out the check of whether a page is unmovable into a new helper,
which will be reused in the following patch.
No functional change intended; the minor changes are as follows:
1) Avoid unnecessary calls by checking CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
2) Directly call PageCompound since PageTransCompound may be dropped
3) Use folio_test_hugetlb()
Link: https://lkml.kernel.org/r/20260112150954.1802953-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20260112150954.1802953-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jane Chu <jane.chu@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kevin Brodsky [Thu, 22 Jan 2026 17:02:24 +0000 (17:02 +0000)]
selftests/mm: report SKIP in pfnmap if a check fails
pfnmap currently checks the target file in FIXTURE_SETUP(pfnmap), meaning
once for every test, and skips the test if any check fails.
The target file is the same for every test so this is a little overkill.
More importantly, this approach means that the whole suite will report
PASS even if all the tests are skipped, for instance because the kernel
configuration (e.g. CONFIG_STRICT_DEVMEM=y) prevented /dev/mem from being
mapped.
Let's ensure that KSFT_SKIP is returned as exit code if any check fails by
performing the checks in pfnmap_init(), run once. That function also
takes care of finding the offset of the pages to be mapped and saves it in
a global. The file is now opened only once and the fd saved in a global,
but it is still mapped/unmapped for every test, as some of them modify the
mapping.
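A minimal sketch of the described flow, with the include path and error
message as assumptions, and only the open() check shown:

    #include <fcntl.h>
    #include <unistd.h>
    #include "../kselftest.h"   /* ksft_exit_skip() */

    static int pfnmap_fd = -1;

    /* Sketch: run the checks once; exit with KSFT_SKIP if any check fails. */
    static void pfnmap_init(void)
    {
        pfnmap_fd = open("/dev/mem", O_RDONLY);
        if (pfnmap_fd < 0)
            ksft_exit_skip("cannot open /dev/mem (CONFIG_STRICT_DEVMEM?)\n");
        /* Finding the offset of the pages to be mapped is not shown. */
    }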
Link: https://lkml.kernel.org/r/20260122170224.4056513-10-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Usama Anjum <Usama.Anjum@arm.com> Cc: wang lian <lianux.mm@gmail.com> Cc: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kevin Brodsky [Thu, 22 Jan 2026 17:02:23 +0000 (17:02 +0000)]
selftests/mm: fix exit code in pagemap_ioctl
Make sure pagemap_ioctl exits with an appropriate value:
* If the tests are run, call ksft_finished() to report the right
status instead of reporting PASS unconditionally.
* Report SKIP if userfaultfd isn't available (in line with other
tests)
* Report FAIL if we failed to open /proc/self/pagemap, as this file
was added a long time ago and doesn't depend on any CONFIG
option (returning -EINVAL from main() is meaningless)
Link: https://lkml.kernel.org/r/20260122170224.4056513-9-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: wang lian <lianux.mm@gmail.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Usama Anjum <Usama.Anjum@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kevin Brodsky [Thu, 22 Jan 2026 17:02:22 +0000 (17:02 +0000)]
selftests/mm: fix faulting-in code in pagemap_ioctl test
One of the pagemap_ioctl tests attempts to fault in pages by memcpy()'ing
them to an unused buffer. This probably worked originally, but since
commit 46036188ea1f ("selftests/mm: build with -O2") the compiler is free
to optimise away that unused buffer and the memcpy() with it. As a result
there might not be any resident page in the mapping and the test may fail.
We don't need to copy all that memory anyway. Just fault in every page.
While at it also make sure to compute the number of pages once using
simple integer arithmetic instead of ceilf() and implicit conversions.
Link: https://lkml.kernel.org/r/20260122170224.4056513-8-kevin.brodsky@arm.com Fixes: 46036188ea1f ("selftests/mm: build with -O2") Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: wang lian <lianux.mm@gmail.com> Cc: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kevin Brodsky [Thu, 22 Jan 2026 17:02:21 +0000 (17:02 +0000)]
selftests/mm: introduce helper to read every page
FORCE_READ(*addr) ensures that the compiler will emit a load from addr.
Several tests need to trigger such a load for a range of pages, ensuring
that every page is faulted in, if it wasn't already.
Introduce a new helper force_read_pages() that does exactly that and
replace existing loops with a call to it.
The step size (regular/huge page size) is preserved for all loops, except
in split_huge_page_test. Reading every byte is unnecessary; we now read
every huge page, matching the following call to check_huge_file().
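A minimal sketch of such a helper, under the FORCE_READ() semantics
described above (the actual selftest helper may differ):

    /* Fault in every page (or huge page) in [addr, addr + size). */
    static void force_read_pages(const char *addr, size_t size, size_t step)
    {
        size_t offs;

        for (offs = 0; offs < size; offs += step) {
            /* A volatile access forces the compiler to keep the load. */
            (void)*(volatile const char *)(addr + offs);
        }
    }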
Link: https://lkml.kernel.org/r/20260122170224.4056513-7-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@arm.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: wang lian <lianux.mm@gmail.com> Cc: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kevin Brodsky [Thu, 22 Jan 2026 17:02:20 +0000 (17:02 +0000)]
selftests/mm: check that FORCE_READ() succeeded
Many cow tests rely on FORCE_READ() to populate pages. Introduce a helper
to make sure that the pages are actually populated, and fail otherwise.
Link: https://lkml.kernel.org/r/20260122170224.4056513-6-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Usama Anjum <Usama.Anjum@arm.com> Cc: wang lian <lianux.mm@gmail.com> Cc: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kevin Brodsky [Thu, 22 Jan 2026 17:02:19 +0000 (17:02 +0000)]
selftests/mm: fix usage of FORCE_READ() in cow tests
Commit 5bbc2b785e63 ("selftests/mm: fix FORCE_READ to read input value
correctly") modified FORCE_READ() to take a value instead of a pointer.
It also changed most of the call sites accordingly, but missed many of
them in cow.c. In those cases, we ended up with the pointer itself being
read, not the memory it points to.
No failure occurred as a result, so it looks like the tests work just fine
without faulting in. However, the huge_zeropage tests explicitly check
that pages are populated, so those became skipped.
Convert all the remaining FORCE_READ() to fault in the mapped page, as was
originally intended. This allows the huge_zeropage tests to run again (3
tests in total).
Link: https://lkml.kernel.org/r/20260122170224.4056513-5-kevin.brodsky@arm.com Fixes: 5bbc2b785e63 ("selftests/mm: fix FORCE_READ to read input value correctly") Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: wang lian <lianux.mm@gmail.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Usama Anjum <Usama.Anjum@arm.com> Cc: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kevin Brodsky [Thu, 22 Jan 2026 17:02:18 +0000 (17:02 +0000)]
selftests/mm: pass down full CC and CFLAGS to check_config.sh
check_config.sh checks that liburing is available by running the compiler
provided as its first argument. This makes two assumptions:
1. CC consists of only one word
2. No extra flag is required
Unfortunately, there are many situations where these assumptions don't
hold. For instance:
- When using Clang, CC consists of multiple words
- When cross-compiling, extra flags may be required to allow the
compiler to find headers
Remove these assumptions by passing down CC and CFLAGS as-is from the
Makefile, so that the same command line is used as when actually building
the tests.
Link: https://lkml.kernel.org/r/20260122170224.4056513-4-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Usama Anjum <Usama.Anjum@arm.com> Cc: wang lian <lianux.mm@gmail.com> Cc: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kevin Brodsky [Thu, 22 Jan 2026 17:02:17 +0000 (17:02 +0000)]
selftests/mm: remove flaky header check
Commit 96ed62ea0298 ("mm: page_frag: fix a compile error when kernel is
not compiled") introduced a check to avoid attempting to build the
page_frag module if <linux/page_frag_cache.h> is missing.
Unfortunately this check only works if KDIR points to /lib/modules/... or
an in-tree kernel build. It always fails if KDIR points to an out-of-tree
build (i.e. when the kernel was built with O=... make) because only
generated headers are present under $KDIR/include/ in that case.
A recent commit switched KDIR to default to the kernel's build directory,
so that check is no longer justified.
Link: https://lkml.kernel.org/r/20260122170224.4056513-3-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Yunsheng Lin <linyunsheng@huawei.com> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Usama Anjum <Usama.Anjum@arm.com> Cc: wang lian <lianux.mm@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kevin Brodsky [Thu, 22 Jan 2026 17:02:16 +0000 (17:02 +0000)]
selftests/mm: default KDIR to build directory
Patch series "Various mm kselftests improvements/fixes", v3.
Various improvements/fixes for the mm kselftests:
- Patch 1-3 extend support for more build configurations: out-of-tree
$KDIR, cross-compilation, etc.
- Patch 4-7 fix issues related to faulting in pages, introducing a new
helper for that purpose.
- Patch 8 fixes the value returned by pagemap_ioctl (PASS was always
returned, which explains why the issue fixed in patch 6 went
unnoticed).
- Patch 9 improves the exit code of pfnmap.
Net results:
- 1 test no longer fails (patch 7)
- 3 tests are no longer skipped (patch 4)
- More accurate return values for whole suites (patch 8, 9)
- Extra tests are more likely to be built (patch 1-3)
This patch (of 9):
KDIR currently defaults to the running kernel's modules directory when
building the page_frag module. The underlying assumption is that most
users build the kselftests in order to run them against the system they're
built on.
This assumption seems questionable, and there is no guarantee that the
module can actually be built against the running kernel.
Switch the default value of KDIR to the kernel's build directory, i.e.
$(O) if O= or KBUILD_OUTPUT= is used, and the source directory otherwise.
This seems like the least surprising option: the test module is built
against the kernel that has been previously built.
Note: we can't use $(top_srcdir) in mm/Makefile because it is only defined
once lib.mk is included.
Link: https://lkml.kernel.org/r/20260122170224.4056513-1-kevin.brodsky@arm.com Link: https://lkml.kernel.org/r/20260122170224.4056513-2-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Usama Anjum <Usama.Anjum@arm.com> Cc: wang lian <lianux.mm@gmail.com> Cc: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kevin Brodsky [Thu, 18 Dec 2025 10:05:40 +0000 (10:05 +0000)]
sparc/mm: export symbols for lazy_mmu_mode KUnit tests
The lazy_mmu_mode KUnit tests call lazy_mmu_mode_{enable,disable}. These
tests may be built as a module, and because of inlining this means that
arch_{enter,flush,leave}_lazy_mmu_mode need to be exported.
[akpm@linux-foundation.org: remove mm/tests/lazy_mmu_mode_kunit.c comment, per Kevin] Link: https://lkml.kernel.org/r/20251218100541.2667405-1-kevin.brodsky@arm.com Fixes: ee628d9cc8d5 ("mm: add basic tests for lazy_mmu") Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Andreas Larsson <andreas@gaisler.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Marco Crivellari [Tue, 13 Jan 2026 11:46:30 +0000 (12:46 +0100)]
mm: add WQ_PERCPU to alloc_workqueue users
This continues the effort to refactor workqueue APIs, which began with the
introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
The refactoring is going to alter the default behavior of
alloc_workqueue() to be unbound by default.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn't explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.
In order to keep alloc_workqueue() behavior identical, explicitly request
WQ_PERCPU.
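An illustrative conversion, where the workqueue name and the other flags
are hypothetical:

    struct workqueue_struct *wq;

    /* Before: per-CPU behavior was implicit (no WQ_UNBOUND). */
    wq = alloc_workqueue("example_wq", WQ_MEM_RECLAIM, 0);

    /* After: request the same per-CPU behavior explicitly. */
    wq = alloc_workqueue("example_wq", WQ_MEM_RECLAIM | WQ_PERCPU, 0);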
[akpm@linux-foundation.org: fix mm/slub.c]
[akpm@linux-foundation.org: fix kmem_cache_init_late() properly, per Sebastian] Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/ Link: https://lkml.kernel.org/r/20260113114630.152942-4-marco.crivellari@suse.com Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Suggested-by: Tejun Heo <tj@kernel.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lai jiangshan <jiangshanlai@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Marco Elver <elver@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Marco Crivellari [Tue, 13 Jan 2026 11:46:29 +0000 (12:46 +0100)]
mm: replace use of system_wq with system_percpu_wq
This patch continues the effort to refactor workqueue APIs, which began
with the changes introducing new workqueues and a new
alloc_workqueue flag:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
The point of the refactoring is to eventually alter the default behavior
of workqueues to become unbound by default so that their workload
placement is optimized by the scheduler.
Before that happens, workqueue users must be converted to the better named
new workqueues with no intended behaviour changes:
This way the old obsolete workqueues (system_wq, system_unbound_wq) can be
removed in the future.
2) [P 3] add WQ_PERCPU to remaining alloc_workqueue() users
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
WQ_UNBOUND will be removed in future.
For more information:
https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
This patch (of 3):
This patch continues the effort to refactor workqueue APIs, which began
with the changes introducing new workqueues and a new
alloc_workqueue flag:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
The point of the refactoring is to eventually alter the default behavior
of workqueues to become unbound by default so that their workload
placement is optimized by the scheduler.
Before that happens, workqueue users must be converted to the better named
new workqueues with no intended behaviour changes:
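For instance, a caller of the old default workqueue would be converted
like this (the work item below is hypothetical):

    /* Before: the old, ambiguously named default workqueue. */
    queue_work(system_wq, &example_work);

    /* After: the better named per-CPU workqueue; behavior is unchanged. */
    queue_work(system_percpu_wq, &example_work);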
Joshua Hahn [Fri, 16 Jan 2026 19:27:16 +0000 (14:27 -0500)]
mm/hugetlb: enforce brace style
Documentation/process/coding-style.rst explicitly notes that if only one
branch of a conditional statement is a single statement, braces should be
used in both branches. Enforce this in mm/hugetlb.c.
While at it, fix the indentation of vma_end_reservation().
No functional change intended.
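For reference, the rule in question with a generic example (not code taken
from mm/hugetlb.c):

    /* Wrong: only one branch uses braces. */
    if (condition) {
        do_this();
        do_that();
    } else
        otherwise();

    /* Right: if one branch needs braces, use them in both. */
    if (condition) {
        do_this();
        do_that();
    } else {
        otherwise();
    }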
Link: https://lkml.kernel.org/r/20260116192717.1600049-2-joshua.hahnjy@gmail.com Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joshua Hahn [Fri, 16 Jan 2026 19:27:15 +0000 (14:27 -0500)]
mm/hugetlb: remove unnecessary if condition
if (map_chg) is always true, since it is nested in another if statement
which checks for map_chg == MAP_CHG_NEEDED, which is equal to 1.
    if (unlikely(map_chg == MAP_CHG_NEEDED && retval == 0)) {
        ...
        if (map_chg) {
            ...
        }
    }
Remove the check, un-indent, and collapse the function call for
readability.
No functional change intended.
Link: https://lkml.kernel.org/r/20260116192717.1600049-1-joshua.hahnjy@gmail.com Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
William Tambe [Thu, 11 Dec 2025 20:38:19 +0000 (12:38 -0800)]
mm/highmem: fix __kmap_to_page() build error
This change fixes the following build error, which is a miss from ef6e06b2ef87
("highmem: fix kmap_to_page() for kmap_local_page() addresses").
mm/highmem.c:184:66: error: 'pteval' undeclared (first use in this
function); did you mean 'pte_val'?
184 | idx = arch_kmap_local_map_idx(i, pte_pfn(pteval));
In __kmap_to_page(), pteval is used but does not exist in the function.
(akpm: affects xtensa only)
Link: https://lkml.kernel.org/r/SJ0PR07MB86317E00EC0C59DA60935FDCD18DA@SJ0PR07MB8631.namprd07.prod.outlook.com Fixes: ef6e06b2ef87 ("highmem: fix kmap_to_page() for kmap_local_page() addresses") Signed-off-by: William Tambe <williamt@cadence.com> Reviewed-by: Max Filippov <jcmvbkbc@gmail.com> Cc: Chris Zankel <chris@zankel.net> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiayuan Chen [Tue, 20 Jan 2026 02:43:49 +0000 (10:43 +0800)]
mm/vmscan: add tracepoint and reason for kswapd_failures reset
Currently, kswapd_failures is reset in multiple places (kswapd, direct
reclaim, PCP freeing, memory-tiers), but there's no way to trace when and
why it was reset, making it difficult to debug memory reclaim issues.
This patch:
1. Introduce kswapd_clear_hopeless() as a wrapper function to
centralize kswapd_failures reset logic.
2. Introduce kswapd_test_hopeless() to encapsulate hopeless node
checks, replacing all open-coded kswapd_failures comparisons.
3. Add kswapd_clear_hopeless_reason enum to distinguish reset sources:
- KSWAPD_CLEAR_HOPELESS_KSWAPD: reset from kswapd context
- KSWAPD_CLEAR_HOPELESS_DIRECT: reset from direct reclaim
- KSWAPD_CLEAR_HOPELESS_PCP: reset from PCP page freeing
- KSWAPD_CLEAR_HOPELESS_OTHER: reset from other paths
4. Add tracepoints for better observability:
- mm_vmscan_kswapd_clear_hopeless: traces each reset with reason
- mm_vmscan_kswapd_reclaim_fail: traces each kswapd reclaim failure
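A rough sketch of the wrapper and check described above; the exact
signatures, field accesses, and tracepoint arguments are assumptions:

    /* Sketch: centralize the reset and record which path requested it. */
    static void kswapd_clear_hopeless(pg_data_t *pgdat,
                                      enum kswapd_clear_hopeless_reason reason)
    {
        trace_mm_vmscan_kswapd_clear_hopeless(pgdat->node_id, reason);
        WRITE_ONCE(pgdat->kswapd_failures, 0);
    }

    /* Sketch: encapsulate the "hopeless node" check. */
    static bool kswapd_test_hopeless(pg_data_t *pgdat)
    {
        return READ_ONCE(pgdat->kswapd_failures) >= MAX_RECLAIM_RETRIES;
    }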
Jiayuan Chen [Tue, 20 Jan 2026 02:43:48 +0000 (10:43 +0800)]
mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
Patch series "mm/vmscan: add tracepoint and reason for kswapd_failures
reset", v4.
Currently, kswapd_failures is reset in multiple places (kswapd,
direct reclaim, PCP freeing, memory-tiers), but there's no way to
trace when and why it was reset, making it difficult to debug
memory reclaim issues.
This patch:
1. Introduce kswapd_clear_hopeless() as a wrapper function to
centralize kswapd_failures reset logic.
2. Introduce kswapd_test_hopeless() to encapsulate hopeless node
checks, replacing all open-coded kswapd_failures comparisons.
3. Add kswapd_clear_hopeless_reason enum to distinguish reset sources:
- KSWAPD_CLEAR_HOPELESS_KSWAPD: reset from kswapd context
- KSWAPD_CLEAR_HOPELESS_DIRECT: reset from direct reclaim
- KSWAPD_CLEAR_HOPELESS_PCP: reset from PCP page freeing
- KSWAPD_CLEAR_HOPELESS_OTHER: reset from other paths
4. Add tracepoints for better observability:
- mm_vmscan_kswapd_clear_hopeless: traces each reset with reason
- mm_vmscan_kswapd_reclaim_fail: traces each kswapd reclaim failure
When kswapd fails to reclaim memory, kswapd_failures is incremented. Once
it reaches MAX_RECLAIM_RETRIES, kswapd stops running to avoid futile
reclaim attempts. However, any successful direct reclaim unconditionally
resets kswapd_failures to 0, which can cause problems.
We observed an issue in production on a multi-NUMA system where a process
allocated large amounts of anonymous pages on a single NUMA node, causing
its watermark to drop below high and evicting most file pages:
With swap disabled, only file pages can be reclaimed. When kswapd is
woken (e.g., via wake_all_kswapds()), it runs continuously but cannot
raise free memory above the high watermark since reclaimable file pages
are insufficient. Normally, kswapd would eventually stop after
kswapd_failures reaches MAX_RECLAIM_RETRIES.
However, containers on this machine have memory.high set in their cgroup.
Business processes continuously trigger the high limit, causing frequent
direct reclaim that keeps resetting kswapd_failures to 0. This prevents
kswapd from ever stopping.
The key insight is that direct reclaim triggered by cgroup memory.high
performs aggressive scanning to throttle the allocating process. With
sufficiently aggressive scanning, even hot pages will eventually be
reclaimed, making direct reclaim "successful" at freeing some memory.
However, this success does not mean the node has reached a balanced state
- the freed memory may still be insufficient to bring free pages above the
high watermark. Unconditionally resetting kswapd_failures in this case
keeps kswapd alive indefinitely.
The result is that kswapd runs endlessly. Unlike direct reclaim which
only reclaims from the allocating cgroup, kswapd scans the entire node's
memory. This causes hot file pages from all workloads on the node to be
evicted, not just those from the cgroup triggering memory.high. These
pages constantly refault, generating sustained heavy IO READ pressure
across the entire system.
Fix this by only resetting kswapd_failures when the node is actually
balanced. This allows both kswapd and direct reclaim to clear
kswapd_failures upon successful reclaim, but only when the reclaim
actually resolves the memory pressure (i.e., the node becomes balanced).
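Conceptually, the reset on the reclaim success path becomes conditional,
roughly like the simplified sketch below; the pgdat_balanced() arguments
and the exact call site are assumptions:

    /* Clear the failure counter only if the node actually became balanced. */
    if (nr_reclaimed && pgdat_balanced(pgdat, order, highest_zoneidx))
        kswapd_clear_hopeless(pgdat, KSWAPD_CLEAR_HOPELESS_DIRECT);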
Link: https://lkml.kernel.org/r/20260120024402.387576-1-jiayuan.chen@linux.dev Link: https://lkml.kernel.org/r/20260120024402.387576-2-jiayuan.chen@linux.dev Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: fix OOM killer inaccuracy on large many-core systems
Use the precise, albeit slower, RSS counter sums for OOM killer task
selection and console dumps. The approximated values are too imprecise on
large many-core systems.
The following rss tracking issues were noted by Sweet Tea Dorminy [1],
which led to picking the wrong tasks as OOM kill targets:
Recently, several internal services had an RSS usage regression as part of a
kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to
read RSS statistics in a backup watchdog process to monitor and decide if
they'd overrun their memory budget. Now, however, a representative service
with five threads, expected to use about a hundred MB of memory, on a 250-cpu
machine had memory usage tens of megabytes different from the expected amount
-- this constituted a significant percentage of inaccuracy, causing the
watchdog to act.
This was a result of commit f1a7941243c1 ("mm: convert mm's rss stats
into percpu_counter") [1]. Previously, the memory error was bounded by
64*nr_threads pages, a very livable megabyte. Now, however, as a result of
scheduler decisions moving the threads around the CPUs, the memory error could
be as large as a gigabyte.
This is a really tremendous inaccuracy for any few-threaded program on a
large machine and impedes monitoring significantly. These stat counters are
also used to make OOM killing decisions, so this additional inaccuracy could
make a big difference in OOM situations -- either resulting in the wrong
process being killed, or in less memory being returned from an OOM-kill than
expected.
Here is a (possibly incomplete) list of the prior approaches that were
used or proposed, along with their downsides:
1) Per-thread rss tracking: large error on many-thread processes.
2) Per-CPU counters: up to 12% slower for short-lived processes and 9%
increased system time in make test workloads [1]. Moreover, the
inaccuracy increases with O(n^2) with the number of CPUs.
3) Per-NUMA-node counters: requires atomics on fast-path (overhead),
error is high with systems that have lots of NUMA nodes (32 times
the number of NUMA nodes).
commit 82241a83cd15 ("mm: fix the inaccurate memory statistics issue for
users") introduced get_mm_counter_sum() for precise proc memory status
queries for some proc files.
The simple fix proposed here is to do the precise per-CPU counter sums
every time a counter value needs to be read. This applies to OOM killer
task selection and OOM task console dumps (printk).
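An illustrative sketch of the direction of the change: use the precise
per-CPU sums (get_mm_counter_sum(), from the commit cited above) instead of
the approximate get_mm_counter() when accounting a task; the helper below
is hypothetical:

    static unsigned long oom_mm_rss_precise(struct mm_struct *mm)
    {
        return get_mm_counter_sum(mm, MM_FILEPAGES) +
               get_mm_counter_sum(mm, MM_ANONPAGES) +
               get_mm_counter_sum(mm, MM_SHMEMPAGES);
    }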
This change increases the latency of OOM killer execution in favor of
doing a more precise OOM target task selection. Effectively, the OOM
killer iterates over all tasks, for all relevant page types, and for each
counter the precise sum iterates over all possible CPUs.
As a reference, here is the execution time of the OOM killer before/after
the change:
AMD EPYC 9654 96-Core (2 sockets)
Within a KVM, configured with 256 logical cpus.
                    |  before  |  after   |
--------------------|----------|----------|
 nr_processes=40    |  0.3 ms  |  0.5 ms  |
 nr_processes=10000 |  3.0 ms  | 80.0 ms  |
Link: https://lkml.kernel.org/r/20260114143642.47333-1-mathieu.desnoyers@efficios.com Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter") Link: https://lore.kernel.org/lkml/20250331223516.7810-2-sweettea-kernel@dorminy.me/ Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Martin Liu <liumartin@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: SeongJae Park <sj@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Liam R . Howlett" <liam.howlett@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Yu Zhao <yuzhao@google.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Aboorva Devarajan <aboorvad@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ran Xiaokai [Thu, 15 Jan 2026 03:15:36 +0000 (03:15 +0000)]
alloc_tag: fix rw permission issue when handling boot parameter
Boot parameters prefixed with "sysctl." are processed during the final
stage of system initialization via kernel_init()-> do_sysctl_args(). When
CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled, the sysctl.vm.mem_profiling
entry is not writable and will cause a warning.
Before run_init_process(), system initialization executes in kernel thread
context. Use current->mm to distinguish sysctl writes during
do_sysctl_args() from user-space triggered ones.
When the proc_handler is invoked from do_sysctl_args(), always return
success, because the same value was already set by
setup_early_mem_profiling(); this eliminates a permission-denied warning.
Link: https://lkml.kernel.org/r/20260115031536.164254-1-ranxiaokai627@163.com Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Suggested-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Manish Kumar [Thu, 15 Jan 2026 19:31:00 +0000 (01:01 +0530)]
mm: drop filename from page_alloc.c header comment
The file name in the header comment is redundant and not useful, as the
location is already known from the path. Remove it to align with kernel
coding style. No functional change.
init_lock has completely outgrown its initial purpose and is no longer
used only to "prevent concurrent execution of device init" as the stale
comment suggests. The scope of this lock is much bigger now.
These days this lock (rw_semaphore) controls how a task owns the
corresponding zram device: either in shared mode or in exclusive mode.
All zram device attribute writes should own the device in exclusive mode,
which synchronizes these tasks and prevents, for example, concurrent
execution of recompression and writeback.
All zram device attribute reads should own the device in shared mode.
Rename the lock to dev_lock to better reflect its current purpose.
Link: https://lkml.kernel.org/r/20260115080807.3957860-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: rename CONFIG_BALLOON_COMPACTION to CONFIG_BALLOON_MIGRATION
While compaction depends on migration, the reverse is not true. So let's
make it clearer that this is all about migration of balloon pages.
Adjust all comments/docs in the core to talk about "migration" instead of
"compaction".
While at it, add some "/* CONFIG_BALLOON_MIGRATION */" comments.
mm/kconfig: make BALLOON_COMPACTION depend on MIGRATION
Migration support for balloon memory depends on MIGRATION, not COMPACTION.
Compaction is simply another user of page migration.
The last dependency on compaction.c was effectively removed with commit 3d388584d599 ("mm: convert "movable" flag in page->mapping to a page
flag"). Ever since, everything for handling movable_ops page migration
resides in core migration code.
So let's change the dependency and adjust the description + help text.
drivers/virtio/virtio_balloon: stop using balloon_page_push/pop()
Let's stop using these functions so we can remove them. They look like
they belong to the balloon API for managing the device's balloon list, when
really they are just simple helpers used only by virtio-balloon.
Let's just inline them and switch to a proper list_for_each_entry_safe().
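A rough sketch of the resulting pattern (assumed list handling, not the actual virtio-balloon code):

    #include <linux/list.h>
    #include <linux/mm.h>

    /* Drain a balloon page list directly with list_for_each_entry_safe()
     * instead of the old balloon_page_push()/balloon_page_pop() helpers. */
    static void release_pages_example(struct list_head *pages)
    {
            struct page *page, *tmp;

            list_for_each_entry_safe(page, tmp, pages, lru) {
                    list_del(&page->lru);
                    __free_page(page);
            }
    }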
mm/balloon_compaction: drop fs.h include from balloon_compaction.h
Ever since commit 68f2736a8583 ("mm: Convert all PageMovable users to
movable_operations") we no longer store an inode in balloon_dev_info, so
we can stop including "fs.h".
mm/balloon_compaction: remove dependency on page lock
Let's stop using the page lock in balloon code and instead use only the
balloon_device_lock.
As soon as we set the PG_movable_ops flag, we might now get isolation
callbacks for that page as we are no longer holding the page lock. In
there, we'll simply synchronize using the balloon_device_lock.
So, in balloon_page_isolate(), look up the balloon_dev_info through
page->private under the balloon_device_lock.
It's crucial that we update page->private under the balloon_device_lock,
so the isolation callback can properly deal with concurrent deflation.
Consequently, make sure that balloon_page_finalize() is called under
balloon_device_lock as we remove a page from the list and clear
page->private. balloon_page_insert() is already called with the
balloon_device_lock held.
Note that the core will still lock the pages, for example in
isolate_movable_ops_page(). There, the lock is still relevant for handling
the PageMovableOpsIsolated flag, but that can later be changed to use an
atomic test-and-set instead, or be moved into the movable_ops backends.
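A simplified sketch of the synchronization described above (illustrative only; the real code is in mm/balloon_compaction.c and differs in detail):

    #include <linux/balloon_compaction.h>
    #include <linux/mm.h>
    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(balloon_device_lock);

    /* Isolation callback: page->private is read and cleared only under
     * balloon_device_lock, so isolation cannot race with deflation. */
    static bool balloon_isolate_example(struct page *page)
    {
            struct balloon_dev_info *b_dev_info;
            bool isolated = false;

            spin_lock(&balloon_device_lock);
            b_dev_info = (struct balloon_dev_info *)page_private(page);
            if (b_dev_info) {
                    list_del(&page->lru);
                    b_dev_info->isolated_pages++;
                    isolated = true;
            }
            spin_unlock(&balloon_device_lock);
            return isolated;
    }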
mm/balloon_compaction: use a device-independent balloon (list) lock
In order to remove the dependency on the page lock for balloon pages, we
need a lock that is independent of the page.
It's crucial that we can handle the scenario where balloon deflation
(clearing page->private) can race with page isolation (using page->private
to obtain the balloon_dev_info where the lock currently resides).
The current lock in balloon_dev_info is therefore not suitable.
Fortunately, we never really have more than a single balloon device per
VM, so we can just keep it simple and use a static lock to protect all
balloon devices.
Based on this change we will remove the dependency on the page lock next.
Let's not piggy-back on the existing lock and use a separate lock for the
huge page list. Now that we use a separate lock, there is no need to
disable interrupts, so use the non-irqsave variants. We only required the
irqsave variants because of the balloon device lock.
This is a preparation for changing the locking used to protect
balloon_dev_info.
While at it, talk about "page migration" instead of "page compaction".
We'll change that in core code soon as well.
Let's centralize it by allowing the driver to enable this handling
through a new flag (a bool for now) in the balloon device info.
Note that we now adjust the counter when adding/removing a page to/from the
balloon list: when removing a page to deflate it, the adjustment now happens
before the driver communicates with the hypervisor, not afterwards.
Let's update the balloon page references, the balloon page list, the
BALLOON_MIGRATE counter and the isolated-pages counter in
balloon_page_migrate(), after letting the balloon->migratepage() callback
deal with the actual inflation+deflation.
Note that we now perform the balloon list modifications outside of any
implementation-specific locks, which is fine: there is nothing special
about these page actions that such a lock would be protecting.
The old page is already no longer in the list (isolated) and the new page
is not yet in the list.
Let's use -ENOENT to communicate the special "inflation of the new page
failed after already deflating the old page" case to balloon_page_migrate()
so it can handle it accordingly.
While at it, rename balloon->b_dev_info to make it match the other
functions. Also, drop the comment above balloon_page_migrate(), which
seems unnecessary.
Now that there is not a lot of logic left, let's just inline setting up
the migration function.
To avoid an #ifdef in the caller, we can instead use IS_ENABLED() and keep
the compiler happy by providing only the function declaration.
Now that the function is gone, drop the "out_balloon_compaction" label.
Note that before commit 68f2736a8583 ("mm: Convert all PageMovable users
to movable_operations") we actually had to undo something there; that is
no longer the case.
Now that there is not a lot of logic left, let's just inline setting up
the migration function and drop all these excessive comments that are not
really required (or true) anymore.
To avoid an #ifdef in the caller, we can instead use IS_ENABLED() and keep
the compiler happy by providing only the function declaration.
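A small sketch of that IS_ENABLED() pattern (hypothetical names): the declaration is always visible, the definition is only built when the option is enabled, and the compiler drops the call entirely otherwise, so nothing unresolved ever reaches the linker.

    #include <linux/kconfig.h>

    /* Defined only when CONFIG_BALLOON_MIGRATION=y; the bare declaration
     * is enough for the disabled case because the call is compiled out. */
    int example_setup_balloon_migration(void);

    static int example_driver_init(void)
    {
            if (IS_ENABLED(CONFIG_BALLOON_MIGRATION))
                    return example_setup_balloon_migration();
            return 0;
    }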
vmw_balloon: adjust BALLOON_DEFLATE when deflating while migrating
Patch series "mm: balloon infrastructure cleanups", v3.
I started with wanting to remove the dependency of the balloon
infrastructure on the page lock, but ended up performing various other
cleanups, some of which I had on my todo list for years.
This series heavily cleans up and simplifies our balloon infrastructure,
including our balloon page migration functionality.
With this series, we no longer make use of the page lock for PageOffline
pages as part of the balloon infrastructure (preparing for memdescs, where
PageOffline pages won't have any such lock), and migration handling is
simplified so that refcounting can more easily be adjusted later (the
long-term goal is for PageOffline pages to not have a refcount either).
Plenty of related cleanups.
This patch (of 24):
When we're effectively deflating the balloon while migrating a page
because inflating the new page failed, we're not adjusting
BALLOON_DEFLATE.
Let's do that. This is a preparation for factoring out this handling to
the core code, making it work in a similar way first.
As this (deflating while migrating because of inflation error) is a corner
case that I don't really expect to happen in practice and the stats are
not that crucial, this likely doesn't classify as a fix.
Shivank Garg [Sun, 18 Jan 2026 19:23:01 +0000 (19:23 +0000)]
mm/khugepaged: make khugepaged_collapse_control static
The global variable 'khugepaged_collapse_control' is not used outside of
mm/khugepaged.c. Make it static to limit its scope.
Link: https://lkml.kernel.org/r/20260118192253.9263-14-shivankg@amd.com Signed-off-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shivank Garg [Sun, 18 Jan 2026 19:22:59 +0000 (19:22 +0000)]
mm/khugepaged: use enum scan_result for result variables and return types
Convert result variables and return types from int to enum scan_result
throughout khugepaged code. This improves type safety and code clarity by
making the intent explicit.
No functional change.
Link: https://lkml.kernel.org/r/20260118192253.9263-12-shivankg@amd.com Signed-off-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Tested-by: Nico Pache <npache@redhat.com> Reviewed-by: Nico Pache <npache@redhat.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shivank Garg [Sun, 18 Jan 2026 19:22:57 +0000 (19:22 +0000)]
mm/khugepaged: change collapse_pte_mapped_thp() to return void
The only external caller of collapse_pte_mapped_thp() is uprobe, which
ignores the return value. Change the external API to return void to
simplify the interface.
Introduce try_collapse_pte_mapped_thp() for internal use that preserves
the return value. This prepares for a future patch that will convert the
return type to use enum scan_result.
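The split roughly looks as follows (simplified sketch; the real code lives in mm/khugepaged.c and keeps the existing collapse logic):

    /* Internal helper: keeps returning the result for khugepaged itself. */
    static int try_collapse_pte_mapped_thp(struct mm_struct *mm,
                                           unsigned long addr, bool install_pmd)
    {
            /* ... existing collapse logic, returning a SCAN_* result ... */
            return 0;
    }

    /* External API (used by uprobes): the result is simply discarded. */
    void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
                                 bool install_pmd)
    {
            try_collapse_pte_mapped_thp(mm, addr, install_pmd);
    }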
Link: https://lkml.kernel.org/r/20260118192253.9263-10-shivankg@amd.com Signed-off-by: Shivank Garg <shivankg@amd.com> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org> Acked-by: Lance Yang <lance.yang@linux.dev> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Tested-by: Nico Pache <npache@redhat.com> Reviewed-by: Nico Pache <npache@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/khugepaged: cleanups and scan limit fix", v3.
This series contains several cleanups for mm/khugepaged.c to improve code
readability and type safety, and one functional fix to ensure
khugepaged_scan_mm_slot() correctly accounts for small VMAs towards the
scan limit.
This patch (of 4):
Replace goto skip with actual logic for better code readability.
No functional change.
Link: https://lkml.kernel.org/r/20260118192253.9263-4-shivankg@amd.com Link: https://lkml.kernel.org/r/20260118192253.9263-6-shivankg@amd.com Signed-off-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Tested-by: Nico Pache <npache@redhat.com> Reviewed-by: Nico Pache <npache@redhat.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Sat, 31 Jan 2026 22:20:03 +0000 (14:20 -0800)]
Merge branch 'mm-hotfixes-stable' into mm-stable to pick up "mm/shmem,
swap: fix race of truncate and swap entry split", needed for merging "mm,
swap: cleanup swap entry management workflow".
SeongJae Park [Thu, 15 Jan 2026 15:20:45 +0000 (07:20 -0800)]
mm/damon: hide kdamond and kdamond_lock of damon_ctx
There is no DAMON API caller that directly accesses the 'kdamond' and
'kdamond_lock' fields of 'struct damon_ctx'. Keeping those exposed could
only encourage creative but error-prone usages. Hide them from DAMON API
callers by marking them as private fields.
SeongJae Park [Thu, 15 Jan 2026 15:20:44 +0000 (07:20 -0800)]
mm/damon/reclaim: use damon_kdamond_pid()
DAMON_RECLAIM directly uses damon_ctx->kdamond field with manual
synchronization using damon_ctx->kdamond_lock, to get the pid of the
kdamond. Use a new dedicated function for the purpose, namely
damon_kdamond_pid(), since that doesn't require manual and error-prone
synchronization.
SeongJae Park [Thu, 15 Jan 2026 15:20:43 +0000 (07:20 -0800)]
mm/damon/lru_sort: use damon_kdamond_pid()
DAMON_LRU_SORT directly uses damon_ctx->kdamond field with manual
synchronization using damon_ctx->kdamond_lock, to get the pid of the
kdamond. Use a new dedicated function for the purpose, namely
damon_kdamond_pid(), since that doesn't require manual and error-prone
synchronization.
SeongJae Park [Thu, 15 Jan 2026 15:20:42 +0000 (07:20 -0800)]
mm/damon/sysfs: use damon_kdamond_pid()
DAMON sysfs interface directly uses damon_ctx->kdamond field with manual
synchronization using damon_ctx->kdamond_lock, to get the pid of the
kdamond. Use a new dedicated function for the purpose, namely
damon_kdamond_pid(), since that doesn't require manual and error-prone
synchronization.
SeongJae Park [Thu, 15 Jan 2026 15:20:41 +0000 (07:20 -0800)]
mm/damon/core: implement damon_kdamond_pid()
Patch series "mm/damon: hide kdamond and kdamond_lock from API callers".
The 'kdamond' and 'kdamond_lock' fields were initially exposed to DAMON API
callers for flexible synchronization and use cases. As the DAMON API became
somewhat complicated compared to the early days, keeping those exposed could
only encourage API callers to invent more creative but complicated and
difficult-to-debug use cases.
Fortunately, DAMON API callers didn't invent that many creative use cases.
There exist only two use cases of 'kdamond' and 'kdamond_lock': finding
whether the kdamond is actively running, and getting the pid of the
kdamond. For the first use case, a dedicated API function, namely
'damon_is_running()', is provided, and all DAMON API callers are using it.
Hence the second use case is the only place where the fields are directly
used by DAMON API callers.
To prevent future invention of complicated and erroneous use cases of the
fields, hide the fields from the API callers. For that, provide new
dedicated DAMON API functions for the remaining use case, namely
damon_kdamond_pid(), migrate DAMON API callers to use the new function,
and mark the fields as private fields.
This patch (of 5):
'kdamond' and 'kdamond_lock' are directly used by DAMON API callers to get
the pid of the corresponding kdamond. To discourage the invention of
creative but complicated and erroneous new usages of the fields, which
require careful synchronization, implement a new API function that can
simply be used without manual synchronization.
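A minimal sketch of what such a function could look like (illustrative; the exact return type and the actual implementation are in mm/damon/core.c):

    #include <linux/damon.h>
    #include <linux/sched.h>

    /* Return the pid of the kdamond of @ctx, or -1 if it is not running.
     * All kdamond_lock handling is done here, so callers need none. */
    int damon_kdamond_pid(struct damon_ctx *ctx)
    {
            int pid = -1;

            mutex_lock(&ctx->kdamond_lock);
            if (ctx->kdamond)
                    pid = ctx->kdamond->pid;
            mutex_unlock(&ctx->kdamond_lock);
            return pid;
    }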
Yury Norov [Wed, 14 Jan 2026 17:22:15 +0000 (12:22 -0500)]
cgroup: use nodes_and() output where appropriate
Now that nodes_and() returns true if the result nodemask is not empty,
drop the now-useless nodes_intersects() call in guarantee_online_mems() and
the nodes_empty() call in update_nodemasks_hier(), both of which are O(N).
Link: https://lkml.kernel.org/r/20260114172217.861204-4-ynorov@nvidia.com Signed-off-by: Yury Norov <ynorov@nvidia.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: David Hildenbrand <david@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yury Norov [Wed, 14 Jan 2026 17:22:14 +0000 (12:22 -0500)]
mm: use nodes_and() return value to simplify client code
establish_demotion_targets() and kernel_migrate_pages() call nodes_empty()
immediately after calling nodes_and(). Now that nodes_and() returns false
if the resulting nodemask is empty, drop the latter call.
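In client code the simplification looks roughly like this (illustrative only; "dst", "src" and "allowed" are hypothetical masks, not the exact mm call sites):

    #include <linux/errno.h>
    #include <linux/nodemask.h>

    static int example_check(nodemask_t *dst, const nodemask_t *src,
                             const nodemask_t *allowed)
    {
            /* Before: nodes_and(*dst, *src, *allowed); followed by an
             * extra O(N) nodes_empty(*dst) check. */
            if (!nodes_and(*dst, *src, *allowed))
                    return -EINVAL; /* intersection is empty */
            return 0;
    }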
Link: https://lkml.kernel.org/r/20260114172217.861204-3-ynorov@nvidia.com Signed-off-by: Yury Norov <ynorov@nvidia.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yury Norov [Wed, 14 Jan 2026 17:22:13 +0000 (12:22 -0500)]
nodemask: propagate boolean for nodes_and{,not}
Patch series "nodemask: align nodes_and{,not} with underlying bitmap ops".
nodes_and{,not}() are void despite the underlying bitmap_and{,not}()
returning a boolean that is true if the result bitmap is non-empty. Align
the nodemask API with that, and simplify client code.
This patch (of 3):
The bitmap functions bitmap_and{,not}() return a boolean depending on the
emptiness of the result bitmap. The corresponding nodemask helpers ignore
the returned value.
Propagate the underlying bitmap result to nodemask users, as it
simplifies user code.
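For reference, the change boils down to propagating the bitmap helper's return value; a sketch of the shape (exact macro plumbing in include/linux/nodemask.h omitted):

    #include <linux/nodemask.h>

    /* nodes_and() now tells the caller whether the result is non-empty,
     * exactly like the underlying bitmap_and(). */
    static inline bool example_nodes_and(nodemask_t *dst,
                                         const nodemask_t *src1,
                                         const nodemask_t *src2)
    {
            return bitmap_and(dst->bits, src1->bits, src2->bits, MAX_NUMNODES);
    }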
Link: https://lkml.kernel.org/r/20260114172217.861204-1-ynorov@nvidia.com Link: https://lkml.kernel.org/r/20260114172217.861204-2-ynorov@nvidia.com Signed-off-by: Yury Norov <ynorov@nvidia.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rohan McLure [Thu, 18 Dec 2025 17:09:44 +0000 (04:09 +1100)]
powerpc/mm: support page table check
On creation and clearing of a page table mapping, instrument such calls by
invoking page_table_check_pte_set and page_table_check_pte_clear
respectively. These calls serve as a sanity check against illegal
mappings.
Enable ARCH_SUPPORTS_PAGE_TABLE_CHECK on powerpc, except when HUGETLB_PAGE
is enabled (powerpc has some weirdness in how it implements
set_huge_pte_at(), which may require some further work).
See also:
riscv support in commit 3fee229a8eb9 ("riscv/mm: enable
ARCH_SUPPORTS_PAGE_TABLE_CHECK")
arm64 in commit 42b2547137f5 ("arm64/mm: enable
ARCH_SUPPORTS_PAGE_TABLE_CHECK")
x86_64 in commit d283d422c6c4 ("x86: mm: add x86_64 support for page table
check")
[ajd@linux.ibm.com: rebase, add additional instrumentation, misc fixes] Link: https://lkml.kernel.org/r/20251219-pgtable_check_v18rebase-v18-12-755bc151a50b@linux.ibm.com Signed-off-by: Rohan McLure <rmclure@linux.ibm.com> Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Guo Weikang <guoweikang.kernel@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Cc: Thomas Huth <thuth@redhat.com> Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rohan McLure [Thu, 18 Dec 2025 17:09:43 +0000 (04:09 +1100)]
powerpc/mm: use set_pte_at_unchecked() for internal usages
In the new set_ptes() API, set_pte_at() (a special case of set_ptes()) is
intended to be instrumented by the page table check facility. There are
however several other routines that constitute the API for setting page
table entries, including set_pmd_at() among others. Such routines are
themselves implemented in terms of set_ptes_at().
A future patch providing support for page table checking on powerpc must
take care to avoid duplicate calls to page_table_check_p{te,md,ud}_set().
Allow for assignment of pte entries without instrumentation through the
set_pte_at_unchecked() routine introduced in this patch.
Cause API-facing routines that call set_pte_at() to instead call
set_pte_at_unchecked(), which will remain uninstrumented by page table
check. set_ptes() is itself implemented by calls to __set_pte_at(), so
this eliminates redundant code.
[ajd@linux.ibm.com: don't change to unchecked for early boot/kernel mappings] Link: https://lkml.kernel.org/r/20251219-pgtable_check_v18rebase-v18-11-755bc151a50b@linux.ibm.com Signed-off-by: Rohan McLure <rmclure@linux.ibm.com> Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com> Acked-by: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Guo Weikang <guoweikang.kernel@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Cc: Thomas Huth <thuth@redhat.com> Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rohan McLure [Thu, 18 Dec 2025 17:09:42 +0000 (04:09 +1100)]
powerpc/mm: implement *_user_accessible_page() for ptes
Page table checking depends on architectures providing an implementation
of p{te,md,ud}_user_accessible_page. With refactorisations made on
powerpc/mm, the pte_access_permitted() and similar methods verify whether
a userland page is accessible with the required permissions.
Since page table checking is the only user of
p{te,md,ud}_user_accessible_page(), implement these for all platforms,
using some of the same preliminary checks taken by pte_access_permitted()
on that platform.
Since commit 8e9bd41e4ce1 ("powerpc/nohash: Replace pte_user() by
pte_read()") pte_user() is no longer required to be present on all
platforms as it may be equivalent to or implied by pte_read(). Hence
implementations of pte_user_accessible_page() are specialised.
[ajd@linux.ibm.com: rebase and clean up] Link: https://lkml.kernel.org/r/20251219-pgtable_check_v18rebase-v18-10-755bc151a50b@linux.ibm.com Signed-off-by: Rohan McLure <rmclure@linux.ibm.com> Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Guo Weikang <guoweikang.kernel@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Cc: Thomas Huth <thuth@redhat.com> Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rohan McLure [Thu, 18 Dec 2025 17:09:41 +0000 (04:09 +1100)]
mm: provide address parameter to p{te,md,ud}_user_accessible_page()
On several powerpc platforms, a page table entry may not imply whether the
relevant mapping is for userspace or kernelspace. Instead, such platforms
infer this from the address being accessed.
Add an additional address argument to each of these routines in order to
provide support for page table check on powerpc.
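For example, a platform could implement the check along these lines (a hedged sketch only, using powerpc's is_kernel_addr(); the actual per-platform implementations differ):

    /* On platforms where the PTE itself does not encode user vs. kernel,
     * the address decides which half of the address space is mapped. */
    static inline bool pte_user_accessible_page_example(pte_t pte,
                                                        unsigned long addr)
    {
            return pte_present(pte) && !is_kernel_addr(addr);
    }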
[ajd@linux.ibm.com: rebase on arm64 changes] Link: https://lkml.kernel.org/r/20251219-pgtable_check_v18rebase-v18-9-755bc151a50b@linux.ibm.com Signed-off-by: Rohan McLure <rmclure@linux.ibm.com> Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com> # riscv Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Guo Weikang <guoweikang.kernel@gmail.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Cc: Thomas Huth <thuth@redhat.com> Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rohan McLure [Thu, 18 Dec 2025 17:09:40 +0000 (04:09 +1100)]
mm/page_table_check: reinstate address parameter in [__]page_table_check_pte_clear()
This reverts commit aa232204c468 ("mm/page_table_check: remove unused
parameter in [__]page_table_check_pte_clear").
Reinstate previously unused parameters for the purpose of supporting
powerpc platforms, as many do not encode user/kernel ownership of the page
in the pte, but instead in the address of the access.
[ajd@linux.ibm.com: rebase, fix additional occurrence and loop handling] Link: https://lkml.kernel.org/r/20251219-pgtable_check_v18rebase-v18-8-755bc151a50b@linux.ibm.com Signed-off-by: Rohan McLure <rmclure@linux.ibm.com> Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com> # riscv Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Guo Weikang <guoweikang.kernel@gmail.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Cc: Thomas Huth <thuth@redhat.com> Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rohan McLure [Thu, 18 Dec 2025 17:09:39 +0000 (04:09 +1100)]
mm/page_table_check: reinstate address parameter in [__]page_table_check_pmd_clear()
This reverts commit 1831414cd729 ("mm/page_table_check: remove unused
parameter in [__]page_table_check_pmd_clear").
Reinstate previously unused parameters for the purpose of supporting
powerpc platforms, as many do not encode user/kernel ownership of the page
in the pte, but instead in the address of the access.
[ajd@linux.ibm.com: rebase on arm64 changes] Link: https://lkml.kernel.org/r/20251219-pgtable_check_v18rebase-v18-7-755bc151a50b@linux.ibm.com Signed-off-by: Rohan McLure <rmclure@linux.ibm.com> Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com> # riscv Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Guo Weikang <guoweikang.kernel@gmail.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Cc: Thomas Huth <thuth@redhat.com> Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rohan McLure [Thu, 18 Dec 2025 17:09:38 +0000 (04:09 +1100)]
mm/page_table_check: reinstate address parameter in [__]page_table_check_pud_clear()
This reverts commit 931c38e16499 ("mm/page_table_check: remove unused
parameter in [__]page_table_check_pud_clear").
Reinstate previously unused parameters for the purpose of supporting
powerpc platforms, as many do not encode user/kernel ownership of the page
in the pte, but instead in the address of the access.
[ajd@linux.ibm.com: rebase on arm64 changes] Link: https://lkml.kernel.org/r/20251219-pgtable_check_v18rebase-v18-6-755bc151a50b@linux.ibm.com Signed-off-by: Rohan McLure <rmclure@linux.ibm.com> Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Guo Weikang <guoweikang.kernel@gmail.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Cc: Thomas Huth <thuth@redhat.com> Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rohan McLure [Thu, 18 Dec 2025 17:09:37 +0000 (04:09 +1100)]
mm/page_table_check: provide addr parameter to page_table_check_ptes_set()
To provide support for powerpc platforms, provide an addr parameter to the
__page_table_check_ptes_set() and page_table_check_ptes_set() routines.
This parameter is needed on some powerpc platforms which do not encode
whether a mapping is for user or kernel in the pte. On such platforms,
this can be inferred from the addr parameter.
[ajd@linux.ibm.com: rebase on arm64 + riscv changes, update commit message] Link: https://lkml.kernel.org/r/20251219-pgtable_check_v18rebase-v18-5-755bc151a50b@linux.ibm.com Signed-off-by: Rohan McLure <rmclure@linux.ibm.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com> # riscv Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Guo Weikang <guoweikang.kernel@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Cc: Thomas Huth <thuth@redhat.com> Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rohan McLure [Thu, 18 Dec 2025 17:09:36 +0000 (04:09 +1100)]
mm/page_table_check: reinstate address parameter in [__]page_table_check_pmd[s]_set()
This reverts commit a3b837130b58 ("mm/page_table_check: remove unused
parameter in [__]page_table_check_pmd_set").
Reinstate previously unused parameters for the purpose of supporting
powerpc platforms, as many do not encode user/kernel ownership of the
page in the pte, but instead in the address of the access.
Apply this to __page_table_check_pmds_set(), page_table_check_pmds_set(), and
the page_table_check_pmd_set() wrapper macro.
[ajd@linux.ibm.com: rebase on arm64 + riscv changes, update commit message] Link: https://lkml.kernel.org/r/20251219-pgtable_check_v18rebase-v18-4-755bc151a50b@linux.ibm.com Signed-off-by: Rohan McLure <rmclure@linux.ibm.com> Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com> # riscv Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Guo Weikang <guoweikang.kernel@gmail.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Cc: Thomas Huth <thuth@redhat.com> Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rohan McLure [Thu, 18 Dec 2025 17:09:35 +0000 (04:09 +1100)]
mm/page_table_check: reinstate address parameter in [__]page_table_check_pud[s]_set()
This reverts commit 6d144436d954 ("mm/page_table_check: remove unused
parameter in [__]page_table_check_pud_set").
Reinstate previously unused parameters for the purpose of supporting
powerpc platforms, as many do not encode user/kernel ownership of the page
in the pte, but instead in the address of the access.
Apply this to __page_table_check_puds_set(), page_table_check_puds_set()
and the page_table_check_pud_set() wrapper macro.
[ajd@linux.ibm.com: rebase on riscv + arm64 changes, update commit message] Link: https://lkml.kernel.org/r/20251219-pgtable_check_v18rebase-v18-3-755bc151a50b@linux.ibm.com Signed-off-by: Rohan McLure <rmclure@linux.ibm.com> Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Ingo Molnar <mingo@kernel.org> # x86 Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com> # riscv Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Guo Weikang <guoweikang.kernel@gmail.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Cc: Thomas Huth <thuth@redhat.com> Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Donnellan [Thu, 18 Dec 2025 17:09:34 +0000 (04:09 +1100)]
arm64/mm: add addr parameter to __ptep_get_and_clear_anysz()
To provide support for page table check on powerpc, we need to
reinstate the address parameter in several functions, including
page_table_check_{pte,pmd,pud}_clear().
In preparation for this, add the addr parameter to arm64's
__ptep_get_and_clear_anysz() and change its callsites accordingly.
Link: https://lkml.kernel.org/r/20251219-pgtable_check_v18rebase-v18-2-755bc151a50b@linux.ibm.com Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Guo Weikang <guoweikang.kernel@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Rohan McLure <rmclure@linux.ibm.com> Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Cc: Thomas Huth <thuth@redhat.com> Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Donnellan [Thu, 18 Dec 2025 17:09:33 +0000 (04:09 +1100)]
arm64/mm: add addr parameter to __set_ptes_anysz()
Patch series "Support page table check on PowerPC", v18.
Support page table check on PowerPC. Page table check tracks the usage of
page table entries at each level to ensure that anonymous mappings have
at most one writable consumer, and likewise that file-backed mappings are
not simultaneously also anonymous mappings.
In order to support this infrastructure, a number of helpers or stubs must
be defined or updated for all powerpc platforms. Additionally, we
separate set_pte_at() and set_pte_at_unchecked(), to allow for internal,
uninstrumented mappings.
On some PowerPC platforms, implementing
{pte,pmd,pud}_user_accessible_page() requires the address. We revert
previous changes that removed the address parameter from various
interfaces, and add it to some other interfaces, in order to allow this.
For now, we don't allow page table check alongside HUGETLB_PAGE, due to
the arch-specific complexity of set_huge_pte_at(). (I'm sure I could
figure this out, but I have to get this version on this list before I
leave my job.)
This series was initially written by Rohan McLure, who has left IBM and is
no longer working on powerpc.
This patch (of 18):
To provide support for page table check on powerpc, we need to reinstate
the address parameter in several functions, including
page_table_check_{ptes,pmds,puds}_set().
In preparation for this, add the addr parameter to arm64's
__set_ptes_anysz() and change its callsites accordingly.
Link: https://lkml.kernel.org/r/20251219-pgtable_check_v18rebase-v18-0-755bc151a50b@linux.ibm.com Link: https://lkml.kernel.org/r/20251219-pgtable_check_v18rebase-v18-1-755bc151a50b@linux.ibm.com Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Alistair Popple <apopple@nvidia.com> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Guo Weikang <guoweikang.kernel@gmail.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Cc: Thomas Huth <thuth@redhat.com> Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Ingo Molnar <mingo@kernel.org> Cc: Rohan McLure <rmclure@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jinjiang Tu [Tue, 23 Dec 2025 11:05:23 +0000 (19:05 +0800)]
mm/mempolicy: fix mpol_rebind_nodemask() for MPOL_F_NUMA_BALANCING
commit bda420b98505 ("numa balancing: migrate on fault among multiple
bound nodes") adds a new flag, MPOL_F_NUMA_BALANCING, to enable NUMA
balancing for the MPOL_BIND memory policy.
When the cpuset of tasks changes, the mempolicy of the task is rebound by
mpol_rebind_nodemask(). When MPOL_F_STATIC_NODES and
MPOL_F_RELATIVE_NODES are both not set, the rebinding behaviour should be
the same whether MPOL_F_NUMA_BALANCING is set or not. So, when an
application calls set_mempolicy() with MPOL_F_NUMA_BALANCING set but both
MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES cleared,
mempolicy.w.cpuset_mems_allowed should be set to the
cpuset_current_mems_allowed nodemask. However, in the current
implementation, mpol_store_user_nodemask() wrongly returns true, causing
mempolicy->w.user_nodemask to be incorrectly set to the user-specified
nodemask. Later, when the cpuset of the application changes,
mpol_rebind_nodemask() ends up rebinding based on the user-specified
nodemask rather than the cpuset_mems_allowed nodemask as intended.
I can reproduce with the following steps in qemu with 4 NUMA nodes:
1. echo '+cpuset' > /sys/fs/cgroup/cgroup.subtree_control
2. mkdir /sys/fs/cgroup/test
3. ./reproducer &
4. cat /proc/$pid/numa_maps, the task is bound to NUMA 1
5. echo $pid > /sys/fs/cgroup/test/cgroup.procs
6. cat /proc/$pid/numa_maps, the task is bound to NUMA 0 now.
The reproducer code (build with "gcc reproducer.c -lnuma"):

    #include <numa.h>
    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
            struct bitmask *bmp;
            int ret;

            bmp = numa_parse_nodestring("1");
            ret = set_mempolicy(MPOL_BIND | MPOL_F_NUMA_BALANCING,
                                bmp->maskp, bmp->size + 1);
            if (ret < 0) {
                    perror("Failed to call set_mempolicy");
                    exit(-1);
            }
            while (1)
                    ;
            return 0;
    }
If I call set_mempolicy() without MPOL_F_NUMA_BALANCING in the reproducer
code, then after step 5 the task is still bound to NUMA 1.
To fix this, only set mempolicy->w.user_nodemask to the user-specified
nodemask if MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES is present.
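The fix therefore amounts to something like the following (sketch of the intended logic; the real helper lives in mm/mempolicy.c):

    /* Only policies using static or relative nodemask semantics need the
     * user-specified nodemask remembered; MPOL_F_NUMA_BALANCING alone
     * must not trigger this. */
    static bool mpol_store_user_nodemask_example(const struct mempolicy *pol)
    {
            return pol->flags & (MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES);
    }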
Link: https://lkml.kernel.org/r/20260120011018.1256654-1-tujinjiang@huawei.com Link: https://lkml.kernel.org/r/20251223110523.1161421-1-tujinjiang@huawei.com Fixes: bda420b98505 ("numa balancing: migrate on fault among multiple bound nodes") Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, zsmalloc performs address linearization on read (which
sometimes requires memcpy() to a local buffer). Not all zsmalloc users
need a linear address. For example, Crypto API supports SG-list,
performing linearization under the hood, if needed. In addition, some
compressors can have native SG-list support, completely avoiding the
linearization step.
Provide an SG-list based zsmalloc read API:
- zs_obj_read_sg_begin()
- zs_obj_read_sg_end()
This API allows callers to obtain an SG representation of the object (one
entry for objects that are contained in a single page and two entries for
spanning objects), avoiding the need for a bounce buffer and memcpy.
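To illustrate the "one or two entries" point, a sketch of what the SG representation of a page-spanning object looks like, built with standard scatterlist helpers (this intentionally does not show the new zsmalloc calls themselves):

    #include <linux/scatterlist.h>

    /* Two-entry SG list for an object whose tail crosses into a second page. */
    static void describe_spanning_object(struct scatterlist sg[2],
                                         struct page *first, unsigned int offset,
                                         unsigned int len_first,
                                         struct page *second, unsigned int len_second)
    {
            sg_init_table(sg, 2);
            sg_set_page(&sg[0], first, len_first, offset);
            sg_set_page(&sg[1], second, len_second, 0);
    }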
[senozhatsky@chromium.org: make zs_obj_read_sg_begin() return void, per Yosry] Link: https://lkml.kernel.org/r/20260117024900.792237-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20260113034645.2729998-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Brian Geffon <bgeffon@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Test that pages allocated with alloc_page() are uninitialized by default.
Link: https://lkml.kernel.org/r/20260113091151.4035013-2-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>