mm/page_alloc: let page freeing clear any set page type
Currently, any user of page types must clear that type before freeing a
page back to the buddy, otherwise we'll run into mapcount related sanity
checks (because the page type currently overlays the page mapcount).
Let's allow for not clearing the page type by page type users by letting
the buddy handle it instead.
We'll focus on having a page type set on the first page of a larger
allocation only.
With this change, we can reliably identify typed folios even though they
might be in the process of getting freed, which will come in handy in
migration code (at least in the transition phase).
In the future we might want to warn on some page types. Instead of having
an "allow list", let's rather wait until we know about once that should go
on such a "disallow list".
Link: https://lkml.kernel.org/r/20250704102524.326966-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Eugenio Pé rez <eperezma@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jason Wang <jasowang@redhat.com> Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/balloon_compaction: convert balloon_page_delete() to balloon_page_finalize()
Let's move the removal of the page from the balloon list into the single
caller, to remove the dependency on the PG_isolated flag and clarify
locking requirements.
Note that for now, balloon_page_delete() was used on two paths:
(1) Removing a page from the balloon for deflation through
balloon_page_list_dequeue()
(2) Removing an isolated page from the balloon for migration in the
per-driver migration handlers. Isolated pages were already removed from
the balloon list during isolation.
So instead of relying on the flag, we can just distinguish both cases
directly and handle it accordingly in the caller.
We'll shuffle the operations a bit such that they logically make more
sense (e.g., remove from the list before clearing flags).
In balloon migration functions we can now move the balloon_page_finalize()
out of the balloon lock and perform the finalization just before dropping
the balloon reference.
Document that the page lock is currently required when modifying the
movability aspects of a page; hopefully we can soon decouple this from the
page lock.
Link: https://lkml.kernel.org/r/20250704102524.326966-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Eugenio Pé rez <eperezma@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Gregory Price <gourry@gourry.net> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jason Wang <jasowang@redhat.com> Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/balloon_compaction: we cannot have isolated pages in the balloon list
Patch series "mm/migration: rework movable_ops page migration (part 1)",
v2.
In the future, as we decouple "struct page" from "struct folio", pages
that support "non-lru page migration" -- movable_ops page migration such
as memory balloons and zsmalloc -- will no longer be folios. They will
not have ->mapping, ->lru, and likely no refcount and no page lock. But
they will have a type and flags 🙂
This is the first part (other parts not written yet) of decoupling
movable_ops page migration from folio migration.
In this series, we get rid of the ->mapping usage, and start cleaning up
the code + separating it from folio migration.
Migration core will have to be further reworked to not treat movable_ops
pages like folios. This is the first step into that direction.
This patch (of 29):
The core will set PG_isolated only after mops->isolate_page() was called.
In case of the balloon, that is where we will remove it from the balloon
list. So we cannot have isolated pages in the balloon list.
Let's drop this unnecessary check.
Link: https://lkml.kernel.org/r/20250704102524.326966-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Eugenio Pé rez <eperezma@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jason Wang <jasowang@redhat.com> Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Wed, 2 Jul 2025 08:47:17 +0000 (09:47 +0100)]
tools/testing/selftests: add mremap() unfaulted/faulted test cases
Assert that mremap() behaviour is as expected when moving around unfaulted
VMAs immediately adjacent to faulted ones, as well as moving around
faulted VMAs and placing them back immediately adjacent to the VMA from
which they were moved.
This also introduces a shared helper for the syscall version of mremap()
so we don't encounter any issues with libc filtering parameters.
Dev Jain [Thu, 3 Jul 2025 06:33:38 +0000 (12:03 +0530)]
maple tree: add some comments
Add comments explaining the fields for maple_metadata, since "end" is
ambiguous and "gap" can be confused as the largest gap, whereas it is
actually the offset of the largest gap.
Add comment for mas_ascend() to explain, whose min and max we are trying
to find. Explain that, for example, if we are already on offset zero,
then the parent min is mas->min, otherwise we need to walk up to find the
implied pivot min.
Link: https://lkml.kernel.org/r/20250703063338.51509-1-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
__cma_declare_contiguous_nid() tries to allocate memory in several ways:
* on systems with 64 bit physical address and enough memory it first
attempts to allocate memory just above 4GiB
* if that fails, on systems with HIGHMEM the next attempt is from high
memory
* and at last, if none of the previous attempts succeeded, or was even
tried because of incompatible configuration, the memory is allocated
anywhere within specified limits.
Move all the allocation logic to a helper function to make these steps more
obvious.
Link: https://lkml.kernel.org/r/20250703184711.3485940-4-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Pratyush Yadav <ptyadav@amazon.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
cma: split reservation of fixed area into a helper function
Move the check that verifies that reservation of fixed area does not cross
HIGHMEM boundary and the actual memblock_resrve() call into a helper
function.
This makes code more readable and decouples logic related to
CONFIG_HIGHMEM from the core functionality of
__cma_declare_contiguous_nid().
Link: https://lkml.kernel.org/r/20250703184711.3485940-3-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Pratyush Yadav <ptyadav@amazon.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Thorsten Blum [Mon, 30 Jun 2025 13:23:18 +0000 (15:23 +0200)]
mm/cma: use str_plural() in cma_declare_contiguous_multi()
Use the string choice helper function str_plural() to simplify the code
and to fix the following Coccinelle/coccicheck warning reported by
string_choices.cocci:
opportunity for str_plural(nr)
Link: https://lkml.kernel.org/r/20250630132318.41339-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Dev Jain <dev.jain@arm.com> Tested-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 30 Jun 2025 14:42:11 +0000 (16:42 +0200)]
mm,hugetlb: drop obsolete comment about non-present pte and second faults
There is a comment in hugetlb_fault() that does not hold anymore. This
one:
/*
* vmf.orig_pte could be a migration/hwpoison vmf.orig_pte at this
* point, so this check prevents the kernel from going below assuming
* that we have an active hugepage in pagecache. This goto expects
* the 2nd page fault, and is_hugetlb_entry_(migration|hwpoisoned)
* check will properly handle it.
*/
This was written because back in the day we used to do:
hugetlb_fault () {
ptep = huge_pte_offset(...)
if (ptep) {
entry = huge_ptep_get(ptep)
if (unlikely(is_hugetlb_entry_migration(entry))
...
else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
...
}
...
...
/*
* entry could be a migration/hwpoison entry at this point, so this
* check prevents the kernel from going below assuming that we have
* a active hugepage in pagecache. This goto expects the 2nd page fault,
* and is_hugetlb_entry_(migration|hwpoisoned) check will properly
* handle it.
*/
if (!pte_present(entry))
goto out_mutex;
...
}
The code was designed to check for hwpoisoned/migration entries upfront,
and then bail out if further down the pte was not present anymore, relying
on the second fault to properly handle migration/hwpoison entries that
time around.
The way we handle this is different nowadays, so drop the misleading
comment.
Oscar Salvador [Mon, 30 Jun 2025 14:42:10 +0000 (16:42 +0200)]
mm,hugetlb: rename anon_rmap to new_anon_folio and make it boolean
anon_rmap is used to determine whether the new allocated folio is
anonymous. Rename it to something more meaningul like new_anon_folio and
make it boolean, as we use it like that.
While we are at it, drop 'new_pagecache_folio' as 'new_anon_folio' is
enough to check whether we need to restore the consumed reservation.
Oscar Salvador [Mon, 30 Jun 2025 14:42:09 +0000 (16:42 +0200)]
mm,hugetlb: sort out folio locking in the faulting path
Recent conversations showed that there was a misunderstanding about why we
were locking the folio prior to call in hugetlb_wp(). In fact, as soon as
we have the folio mapped into the pagetables, we no longer need to hold it
locked, because we know that no concurrent truncation could have happened.
There is only one case where the folio needs to be locked, and that is
when we are handling an anonymous folio, because hugetlb_wp() will check
whether it can re-use it exclusively for the process that is faulting it
in.
So, pass the folio locked to hugetlb_wp() when that is the case.
Oscar Salvador [Mon, 30 Jun 2025 14:42:08 +0000 (16:42 +0200)]
mm,hugetlb: change mechanism to detect a COW on private mapping
Patch series "Misc rework on hugetlb faulting path", v4.
This patchset aims to give some love to the hugetlb faulting path, doing
so by removing obsolete comments that are no longer true, sorting out the
folio lock, and changing the mechanism we use to determine whether we are
COWing a private mapping already.
The most important patch of the series is #1, as it fixes a deadlock that
was described in [1], where two processes were holding the same lock for
the folio in the pagecache, and then deadlocked in the mutex. Note that
this can also happen for anymous folios. This has been tested using this
reproducer, below
Looking up and locking the folio in the pagecache was done to check
whether that folio was the same folio we had mapped in our pagetables,
meaning that if it was different we knew that we already mapped that folio
privately, so any further CoW would be made on a private mapping, which
lead us to the question: __Was the reservation for that address
consumed?__ That is all we care about, because if it was indeed consumed
and we are the owner and we cannot allocate more folios, we need to unmap
the folio from the processes pagetables and make it exclusive for us.
We figured we do not need to look up the folio at all, and it is just
enough to check whether the folio we have mapped is anonymous, which means
we mapped it privately, so the reservation was indeed consumed.
Patch#2 sorts out folio locking in the faulting path, reducing the scope
of it ,only taking it when we are dealing with an anonymous folio and
document it. More details in the patch.
You will also have to add a delay in hugetlb_wp, after releasing the mutex
and before unmapping, so the window is large enough to reproduce it
reliably.
hugetlb_wp() checks whether the process is trying to COW on a private
mapping in order to know whether the reservation for that address was
already consumed. If it was consumed and we are the ownner of the
mapping, the folio will have to be unmapped from the other processes.
Currently, that check is done by looking up the folio in the pagecache and
compare it to the folio which is mapped in our pagetables. If it differs,
it means we already mapped it privately before, consuming a reservation on
the way. All we are interested in is whether the mapped folio is
anonymous, so we can simplify and check for that instead.
Gerald Schaefer [Mon, 30 Jun 2025 16:47:25 +0000 (18:47 +0200)]
mm/debug_vm_pgtable: use a swp_entry_t input value for swap tests
The various __pte/pmd_to_swp_entry and __swp_entry_to_pte/pmd helper
functions are expected to operate on swap PTE/PMD entries, not on present
and mapped entries.
Reflect this in the swap tests by using a swp_entry_t as input value, and
convert it to a swap PTE/PMD for testing, similar to how it is already
done in pte_swap_exclusive_tests(). Move the swap entry creation from
there to init_args() and store it in args, so it can also be used in other
functions.
The pte/pmd_swap_tests() are also changed to compare entries instead of
pfn values, again similar to pte_swap_exclusive_tests(). pte/pmd_pfn()
helpers are also not expected to operate on swap PTE/PMD entries at all.
Also update documentation, to reflect that the helpers operate on swap
PTE/PMD entries and not present and mapped entries, and use correct names,
i.e. __swp_to_pte/pmd_entry -> __swp_entry_to_pte/pmd.
For consistency, also change pte/pmd_swap_soft_dirty_tests() to use
args->swp_entry instead of a present and mapped PTE/PMD.
Thorsten Blum [Mon, 30 Jun 2025 17:18:26 +0000 (19:18 +0200)]
mm/hugetlb: use str_plural() in report_hugepages()
Use the string choice helper function str_plural() to simplify the code
and to fix the following Coccinelle/coccicheck warning reported by
string_choices.cocci:
opportunity for str_plural(nrinvalid)
Link: https://lkml.kernel.org/r/20250630171826.114008-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sat, 28 Jun 2025 16:04:26 +0000 (09:04 -0700)]
selftests/damon/sysfs.py: test monitoring attribute parameters
Add DAMON sysfs interface functionality tests for DAMON monitoring
attribute parameters, including intervals, intervals tuning goals, and
min/max number of regions.
SeongJae Park [Sat, 28 Jun 2025 16:04:25 +0000 (09:04 -0700)]
selftests/damon: add python and drgn-based DAMON sysfs test
Add a python-written DAMON sysfs functionality selftest. It sets DAMON
parameters using Python module _damon_sysfs, reads updated kernel internal
DAMON status and parameters using a 'drgn' script, namely
drgn_dump_damon_status.py, and compare if the resulted DAMON internal
status is as expected. The test is very minimum at the moment.
SeongJae Park [Sat, 28 Jun 2025 16:04:24 +0000 (09:04 -0700)]
selftests/damon/_damon_sysfs: set Kdamond.pid in start()
_damon_sysfs.py is a Python module for reading and writing DAMON sysfs for
testing. It is not reading resulting kdamond pids. Read and update those
when starting kdamonds.
SeongJae Park [Sat, 28 Jun 2025 16:04:23 +0000 (09:04 -0700)]
selftests/damon: add drgn script for extracting damon status
Patch series "selftests/damon: add python and drgn based DAMON sysfs
functionality tests".
DAMON sysfs interface is the bridge between the user space and the kernel
space for DAMON parameters. There is no good and simple test to see if
the parameters are set as expected. Existing DAMON selftests therefore
test end-to-end features. For example, damos_quota_goal.py runs a DAMOS
scheme with quota goal set against a test program running an artificial
access pattern, and see if the result is as expected. Such tests cover
only a few part of DAMON. Adding more tests is also complicated.
Finally, the reliability of the test itself on different systems is bad.
'drgn' is a tool that can extract kernel internal data structures like
DAMON parameters. Add a test that passes specific DAMON parameters via
DAMON sysfs reusing _damon_sysfs.py, extract resulting DAMON parameters
via 'drgn', and compare those. Note that this test is not adding
exhaustive tests of all DAMON parameters and input combinations but very
basic things. Advancing the test infrastructure and adding more tests are
future works.
This patch (of 6):
'drgn' is a useful tool for extracting kernel internal data structures
such as DAMON's parameter and running status. Add a 'drgn' script that
extracts such DAMON internal data at runtime, for using it as a tool for
seeing if a test input has made expected results in the kernel.
The script saves or prints out the DAMON internal data as a json file or
string. This is for making use of it not very depends on 'drgn'. If
'drgn' is not available on a test setup and we find alternative tools for
doing that, the json-based tests can be updated to use an alternative tool
in future.
Note that the script is tested with 'drgn v0.0.22'.
Peter Xu [Fri, 27 Jun 2025 16:07:39 +0000 (12:07 -0400)]
mm: deduplicate mm_get_unmapped_area()
Essentially it sets vm_flags==0 for mm_get_unmapped_area_vmflags(). Use
the helper instead to dedup the lines.
Link: https://lkml.kernel.org/r/20250627160739.2124768-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Fri, 27 Jun 2025 16:07:07 +0000 (12:07 -0400)]
mm/hugetlb: remove prepare_hugepage_range()
Only mips and loongarch implemented this API, however what it does was
checking against stack overflow for either len or addr. That's already
done in arch's arch_get_unmapped_area*() functions, even though it may not
be 100% identical checks.
For example, for both of the architectures, there will be a trivial
difference on how stack top was defined. The old code uses STACK_TOP
which may be slightly smaller than TASK_SIZE on either of them, but the
hope is that shouldn't be a problem.
It means the whole API is pretty much obsolete at least now, remove it
completely.
Link: https://lkml.kernel.org/r/20250627160707.2124580-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Muchun Song <muchun.song@linux.dev> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yunjeong Mun [Fri, 27 Jun 2025 16:33:29 +0000 (09:33 -0700)]
samples/damon/mtier: add parameters for node0 memory usage
Change the hard-coded quota goal metric values into sysfs knobs:
`node0_mem_used_bp` and `node0_mem_free_bp`. These knobs represent the
used and free memory ratio of node0 in basis points (bp, where 1 bp =
0.01%). As mentioned in [1], this patch is developed under the assumption
that node0 is always the fast-tier in a two-tiers memory setup.
Zi Yan [Tue, 17 Jun 2025 02:11:14 +0000 (22:11 -0400)]
mm/page_isolation: remove migratetype parameter from more functions
migratetype is no longer overwritten during pageblock isolation,
start_isolate_page_range(), has_unmovable_pages(), and
set_migratetype_isolate() no longer need which migratetype to restore
during isolation failure.
For has_unmoable_pages(), it needs to know if the isolation is for CMA
allocation, so adding PB_ISOLATE_MODE_CMA_ALLOC provide the information.
At the same time change isolation flags to enum pb_isolate_mode
(PB_ISOLATE_MODE_MEM_OFFLINE, PB_ISOLATE_MODE_CMA_ALLOC,
PB_ISOLATE_MODE_OTHER). Remove REPORT_FAILURE and check
PB_ISOLATE_MODE_MEM_OFFLINE, since only PB_ISOLATE_MODE_MEM_OFFLINE
reports isolation failures.
alloc_contig_range() no longer needs migratetype. Replace it with a newly
defined acr_flags_t to tell if an allocation is for CMA. So does
__alloc_contig_migrate_range(). Add ACR_FLAGS_NONE (set to 0) to indicate
ordinary allocations.
Link: https://lkml.kernel.org/r/20250617021115.2331563-7-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Richard Chang <richardycc@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Tue, 17 Jun 2025 02:11:13 +0000 (22:11 -0400)]
mm/page_isolation: remove migratetype from undo_isolate_page_range()
Since migratetype is no longer overwritten during pageblock isolation,
undoing pageblock isolation no longer needs which migratetype to restore.
Link: https://lkml.kernel.org/r/20250617021115.2331563-6-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Richard Chang <richardycc@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Tue, 17 Jun 2025 02:11:12 +0000 (22:11 -0400)]
mm/page_isolation: remove migratetype from move_freepages_block_isolate()
Since migratetype is no longer overwritten during pageblock isolation,
moving a pageblock out of MIGRATE_ISOLATE no longer needs a new
migratetype.
Add pageblock_isolate_and_move_free_pages() and
pageblock_unisolate_and_move_free_pages() to be explicit about the page
isolation operations. Both share the common code in
__move_freepages_block_isolate(), which is renamed from
move_freepages_block_isolate().
Add toggle_pageblock_isolate() to flip pageblock isolation bit in
__move_freepages_block_isolate().
Make set_pageblock_migratetype() only accept non MIGRATE_ISOLATE types, so
that one should use set_pageblock_isolate() to isolate pageblocks. As a
result, move pageblock migratetype code out of __move_freepages_block().
Link: https://lkml.kernel.org/r/20250617021115.2331563-5-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Richard Chang <richardycc@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Tue, 17 Jun 2025 02:11:11 +0000 (22:11 -0400)]
mm/page_alloc: add support for initializing pageblock as isolated
MIGRATE_ISOLATE is a standalone bit, so a pageblock cannot be initialized
to just MIGRATE_ISOLATE. Add init_pageblock_migratetype() to enable
initialize a pageblock with a migratetype and isolated.
Link: https://lkml.kernel.org/r/20250617021115.2331563-4-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Richard Chang <richardycc@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Tue, 17 Jun 2025 02:11:10 +0000 (22:11 -0400)]
mm/page_isolation: make page isolation a standalone bit
During page isolation, the original migratetype is overwritten, since
MIGRATE_* are enums and stored in pageblock bitmaps. Change
MIGRATE_ISOLATE to be stored a standalone bit, PB_migrate_isolate, like
PB_compact_skip, so that migratetype is not lost during pageblock
isolation.
Link: https://lkml.kernel.org/r/20250617021115.2331563-3-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Richard Chang <richardycc@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Tue, 17 Jun 2025 02:11:09 +0000 (22:11 -0400)]
mm/page_alloc: pageblock flags functions clean up
Patch series "Make MIGRATE_ISOLATE a standalone bit", v10.
This patchset moves MIGRATE_ISOLATE to a standalone bit to avoid being
overwritten during pageblock isolation process. Currently,
MIGRATE_ISOLATE is part of enum migratetype (in include/linux/mmzone.h),
thus, setting a pageblock to MIGRATE_ISOLATE overwrites its original
migratetype. This causes pageblock migratetype loss during
alloc_contig_range() and memory offline, especially when the process fails
due to a failed pageblock isolation and the code tries to undo the
finished pageblock isolations.
In terms of performance for changing pageblock types, no performance
change is observed:
1. I used perf to collect stats of offlining and onlining all memory
of a 40GB VM 10 times and see that get_pfnblock_flags_mask() and
set_pfnblock_flags_mask() take about 0.12% and 0.02% of the whole
process respectively with and without this patchset across 3 runs.
2. I used perf to collect stats of dd from /dev/random to a 40GB tmpfs
file and find get_pfnblock_flags_mask() takes about 0.05% of the
process with and without this patchset across 3 runs.
This patch (of 6):
No functional change is intended.
1. Add __NR_PAGEBLOCK_BITS for the number of pageblock flag bits and use
roundup_pow_of_two(__NR_PAGEBLOCK_BITS) as NR_PAGEBLOCK_BITS to take
right amount of bits for pageblock flags.
2. Rename PB_migrate_skip to PB_compact_skip.
3. Add {get,set,clear}_pfnblock_bit() to operate one a standalone bit,
like PB_compact_skip.
3. Make {get,set}_pfnblock_flags_mask() internal functions and use
{get,set}_pfnblock_migratetype() for pageblock migratetype operations.
4. Move pageblock flags common code to get_pfnblock_bitmap_bitidx().
3. Use MIGRATETYPE_MASK to get the migratetype of a pageblock from its
flags.
4. Use PB_migrate_end in the definition of MIGRATETYPE_MASK instead of
PB_migrate_bits.
5. Add a comment on is_migrate_cma_folio() to prevent one from changing it
to use get_pageblock_migratetype() and causing issues.
Link: https://lkml.kernel.org/r/20250617021115.2331563-1-ziy@nvidia.com Link: https://lkml.kernel.org/r/20250617021115.2331563-2-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Richard Chang <richardycc@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 16 Jun 2025 13:51:53 +0000 (15:51 +0200)]
mm,page_ext: derive the node from the pfn
page_ext is the only user of 'status_change_nid', which is set in
online/offline operations, to know to which node we are adding/removing
memory.
Prior to call any notifiers, the memmap is initialized via, which among
other things, sets the node the pages belong to, to all corresponging
pages. This means that there is no need to keep using 'status_change_nid'
since we can derive the node from the pfn. This will allow us to finally
drop 'status_change_nid' from the memory_notify struct.
Link: https://lkml.kernel.org/r/20250616135158.450136-11-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 16 Jun 2025 13:51:52 +0000 (15:51 +0200)]
mm,mempolicy: use node-notifier instead of memory-notifier
mempolicy is only concerned when a numa node changes its memory state,
because it needs to take this node into account for the auto-weighted
memory policy system. So stop using the memory notifier and use the new
numa node notifer instead.
Link: https://lkml.kernel.org/r/20250616135158.450136-10-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Rakie Kim <rakie.kim@sk.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Gregory Price <gourry@gourry.net> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 16 Jun 2025 13:51:51 +0000 (15:51 +0200)]
kernel,cpuset: use node-notifier instead of memory-notifier
cpuset is only concerned when a numa node changes its memory state, as it
needs to know the current numa nodes with memory to keep an updated
mems_allowed mask. So stop using the memory notifier and use the new numa
node notifer instead.
Link: https://lkml.kernel.org/r/20250616135158.450136-9-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 16 Jun 2025 13:51:50 +0000 (15:51 +0200)]
drivers,hmat: use node-notifier instead of memory-notifier
hmat driver is only concerned when a numa node changes its memory state,
specifically when a numa node with memory comes into play for the first
time, because it will register the memory_targets belonging to that numa
node. So stop using the memory notifier and use the new numa node notifer
instead.
Link: https://lkml.kernel.org/r/20250616135158.450136-8-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 16 Jun 2025 13:51:49 +0000 (15:51 +0200)]
drivers,cxl: use node-notifier instead of memory-notifier
memory-tier is only concerned when a numa node changes its memory state,
specifically when a numa node with memory comes into play for the first
time, because it needs to get its performance attributes to build a proper
demotion chain. So stop using the memory notifier and use the new numa
node notifer instead.
Link: https://lkml.kernel.org/r/20250616135158.450136-7-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 16 Jun 2025 13:51:48 +0000 (15:51 +0200)]
mm,memory-tiers: use node-notifier instead of memory-notifier
memory-tier is only concerned when a numa node changes its memory state,
because it then needs to re-create the demotion list. So stop using the
memory notifier and use the new numa node notifer instead.
Link: https://lkml.kernel.org/r/20250616135158.450136-6-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 16 Jun 2025 13:51:46 +0000 (15:51 +0200)]
mm,memory_hotplug: implement numa node notifier
There are at least six consumers of hotplug_memory_notifier that what they
really are interested in is whether any numa node changed its state, e.g:
going from having memory to not having memory and vice versa.
Implement a specific notifier for numa nodes when their state gets
changed, which will later be used by those consumers that are only
interested in numa node state changes.
Add documentation as well.
[dan.carpenter@linaro.org: set failure reason in offline_pages()] Link: https://lkml.kernel.org/r/be4fd31b-7d09-46b0-8329-6d0464ffa7a5@sabinyo.mountain Link: https://lkml.kernel.org/r/20250616135158.450136-4-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 16 Jun 2025 13:51:44 +0000 (15:51 +0200)]
mm,slub: do not special case N_NORMAL nodes for slab_nodes
Patch series "Implement numa node notifier", v7.
Memory notifier is a tool that allow consumers to get notified whenever
memory gets onlined or offlined in the system. Currently, there are 10
consumers of that, but 5 out of those 10 consumers are only interested in
getting notifications when a numa node changes its memory state. That
means going from memoryless to memory-aware of vice versa.
Which means that for every {online,offline}_pages operation they get
notified even though the numa node might not have changed its state. This
is suboptimal, and we want to decouple numa node state changes from memory
state changes.
While we are doing this, remove status_change_nid_normal, as the only
current user (slub) does not really need it. This allows us to further
simplify and clean up the code.
The first patch gets rid of status_change_nid_normal in slub. The second
patch implements a numa node notifier that does just that, and have those
consumers register in there, so they get notified only when they are
interested.
The third patch replaces 'status_change_nid{_normal}' fields within
memory_notify with a 'nid', as that is only what we need for memory
notifer and update the only user of it (page_ext).
Consumers that are only interested in numa node states change are:
Currently, slab_mem_going_online_callback() checks whether the node has
N_NORMAL memory in order to be set in slab_nodes. While it is true that
getting rid of that enforcing would mean ending up with movables nodes in
slab_nodes, the memory waste that comes with that is negligible.
So stop checking for status_change_nid_normal and just use
status_change_nid instead which works for both types of memory.
Also, once we allocate the kmem_cache_node cache for the node in
slab_mem_online_callback(), we never deallocate it in
slab_mem_offline_callback() when the node goes memoryless, so we can just
get rid of it.
The side effects are that we will stop clearing the node from slab_nodes,
and also that newly created kmem caches after node hotremove will now
allocate their kmem_cache_node for the node(s) that was hotremoved, but
these should be negligible.
Link: https://lkml.kernel.org/r/20250616135158.450136-1-osalvador@suse.de Link: https://lkml.kernel.org/r/20250616135158.450136-2-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Tue, 24 Jun 2025 13:03:48 +0000 (15:03 +0200)]
mm, madvise: use standard madvise locking in madvise_set_anon_name()
Use madvise_lock()/madvise_unlock() in madvise_set_anon_name() in the same
way as in do_madvise(). This narrows the lock scope a bit and reuses
existing functionality. get_lock_mode() already picks the correct
MADVISE_MMAP_WRITE_LOCK mode for __MADV_SET_ANON_VMA_NAME so we can just
remove the explicit assignment.
There is a user visible change in that the prctl(PR_SET_VMA,
PR_SET_VMA_ANON_NAME...) might now return -EINTR.
Link: https://lkml.kernel.org/r/20250624-anon_name_cleanup-v2-4-600075462a11@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Colin Cross <ccross@google.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Tue, 24 Jun 2025 13:03:46 +0000 (15:03 +0200)]
mm, madvise: extract mm code from prctl_set_vma() to mm/madvise.c
Setting anon_name is done via madvise_set_anon_name() and behaves a lot of
like other madvise operations. However, apparently because madvise() has
lacked the 4th argument and prctl() not, the userspace entry point has
been implemented via prctl(PR_SET_VMA, ...) and handled first by
prctl_set_vma().
Currently prctl_set_vma() lives in kernel/sys.c but setting the
vma->anon_name is mm-specific code so extract it to a new
set_anon_vma_name() function under mm. mm/madvise.c seems to be the most
straightforward place as that's where madvise_set_anon_name() lives. Stop
declaring the latter in mm.h and instead declare set_anon_vma_name().
Link: https://lkml.kernel.org/r/20250624-anon_name_cleanup-v2-2-600075462a11@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Colin Cross <ccross@google.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Tue, 24 Jun 2025 13:03:45 +0000 (15:03 +0200)]
mm, madvise: simplify anon_name handling
Patch series "madvise anon_name cleanups", v2.
While reviewing Lorenzo's madvise cleanups I've noticed that we can handle
anon_name in madvise code much better, so sending that as patch 1.
Initially I wanted to do first move the existing logic from
madvise_vma_behavior() to madvise_update_vma() as a separate patch before
the actual simplification but that would require adding
anon_vma_name_put() in error handling paths only to be removed again, so
it's a single patch to avoid churn.
It's also an opportunity to move some mm code from prctl under mm, hence
patch 2. After code moving preparation in patch 3, also unify madvise
lock handling for madvise_set_anon_name() in patch 4.
This patch (of 4):
Since the introduction in 9a10064f5625 ("mm: add a field to store names
for private anonymous memory") the code to set anon_name on a vma has been
using madvise_update_vma() to call replace_anon_vma_name(). Since the
former is called also by a number of other madvise behaviours that do not
set a new anon_name, they have been passing the existing anon_name of the
vma to make replace_anon_vma_name() a no-op.
This is rather wasteful as it needs anon_vma_name_eq() to determine the
no-op situations, and checks for when replace_anon_vma_name() is allowed
(the vma is anon/shmem) duplicate the checks already done earlier in
madvise_vma_behavior(). It has also lead to commit 942341dcc574 ("mm: fix
use-after-free when anon vma name is used after vma is freed") adding
anon_name refcount get/put operations exactly to the cases that actually
do not change anon_name - just so the replace_anon_vma_name() can keep
safely determining it has nothing to do.
The recent madvise cleanups made this suboptimal handling very obvious,
but happily also allow for an easy fix. madvise_update_vma() now has the
complete information whether it's been called to set a new anon_name, so
stop passing it the existing vma's name and doing the refcount get/put in
its only caller madvise_vma_behavior().
In madvise_update_vma() itself, limit calling of replace_anon_vma_name()
only to cases where we are setting a new name, otherwise we know it's a
no-op. We can rely solely on the __MADV_SET_ANON_VMA_NAME behaviour and
can remove the duplicate checks for vma being anon/shmem that were done
already in madvise_vma_behavior().
Additionally, by using vma_modify_flags() when not modifying the
anon_name, avoid explicitly passing the existing vma's anon_name and
storing a pointer to it in struct madv_behavior or a local variable. This
prevents the danger of accessing a freed anon_name after vma merging,
previously fixed by commit 942341dcc574.
Link: https://lkml.kernel.org/r/20250624-anon_name_cleanup-v2-0-600075462a11@suse.cz Link: https://lkml.kernel.org/r/20250624-anon_name_cleanup-v2-1-600075462a11@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Colin Cross <ccross@google.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 20 Jun 2025 15:33:05 +0000 (16:33 +0100)]
mm/madvise: eliminate very confusing manipulation of prev VMA
The madvise code has for the longest time had very confusing code around
the 'prev' VMA pointer passed around various functions which, in all cases
except madvise_update_vma(), is unused and instead simply updated as soon
as the function is invoked.
To compound the confusion, the prev pointer is also used to indicate to
the caller that the mmap lock has been dropped and that we can therefore
not safely access the end of the current VMA (which might have been
updated by madvise_update_vma()).
Clear up this confusion by not setting prev = vma anywhere except in
madvise_walk_vmas(), update all references to prev which will always be
equal to vma after madvise_vma_behavior() is invoked, and adding a flag to
indicate that the lock has been dropped to make this explicit.
Additionally, drop a redundant BUG_ON() from madvise_collapse(), which is
simply reiterating the BUG_ON(mmap_locked) above it (note that BUG_ON() is
not appropriate here, but we leave existing code as-is).
We finally adjust the madvise_walk_vmas() logic to be a little clearer -
delaying the assignment of the end of the range to the start of the new
range until the last moment and handling the lock being dropped scenario
immediately.
Additionally add some explanatory comments.
[lorenzo.stoakes@oracle.com: fix very subtle bug] Link: https://lkml.kernel.org/r/dca94cde-8afb-4eab-8e57-3f508624d670@lucifer.local Link: https://lkml.kernel.org/r/63d281c5df2e64225ab5b4bda398b45e22818701.1750433500.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 20 Jun 2025 15:33:04 +0000 (16:33 +0100)]
mm/madvise: thread all madvise state through madv_behavior
Doing so means we can get rid of all the weird struct vm_area_struct
**prev stuff, everything becomes consistent and in future if we want to
make change to behaviour there's a single place where all relevant state
is stored.
This also allows us to update try_vma_read_lock() to be a little more
succinct and set up state for us, as well as cleaning up
madvise_update_vma().
We also update the debug assertion prior to madvise_update_vma() to assert
that this is a write operation as correctly pointed out by Barry in the
relevant thread.
We can't reasonably update the madvise functions that live outside of
mm/madvise.c so we leave those as-is.
Link: https://lkml.kernel.org/r/7b345ab82ef51e551f8bc0c4f7be25712871629d.1750433500.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 20 Jun 2025 15:33:03 +0000 (16:33 +0100)]
mm/madvise: thread VMA range state through madvise_behavior
Rather than updating start and a confusing local parameter 'tmp' in
madvise_walk_vmas(), instead store the current range being operated upon
in the struct madvise_behavior helper object in a range pair and use this
consistently in all operations.
This makes it clearer what is going on and opens the door to further
cleanup now we store state regarding what is currently being operated upon
here.
Link: https://lkml.kernel.org/r/518480ceb48553d3c280bc2b0b5e77bbad840147.1750433500.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 20 Jun 2025 15:33:02 +0000 (16:33 +0100)]
mm/madvise: thread mm_struct through madvise_behavior
There's no need to thread a pointer to the mm_struct nor have different
functions signatures for each behaviour, instead store state in the struct
madvise_behavior object consistently and use it for all madvise() actions.
Link: https://lkml.kernel.org/r/a47d850b0111735e026d438c3300c0e3b7f439f4.1750433500.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 20 Jun 2025 15:33:01 +0000 (16:33 +0100)]
mm/madvise: remove the visitor pattern and thread anon_vma state
Patch series "madvise cleanup", v2.
This is a series of patches that helps address a number of historic
problems in the madvise() implementation:
* Eliminate the visitor pattern and having the code which is implemented
for both the anon_vma_name implementation and ordinary madvise()
operations use the same madvise_vma_behavior() implementation.
* Thread state through the madvise_behavior state object - this object,
very usefully introduced by SJ, is already used to transmit state
through operations. This series extends this by having all madvise()
operations use this, including anon_vma_name.
* Thread range, VMA state through madvise_behavior - This helps avoid a
lot of the confusing code around range and VMA state and again keeps
things consistent and with a single 'source of truth'.
* Addressing the very strange behaviour around the passed around struct
vm_area_struct **prev pointer - all read-only users do absolutely
nothing with the prev pointer. The only function that uses it is
madvise_update_vma(), and in all cases prev is always reset to VMA.
Fix this by no longer having aything but madvise_update_vma()
reference prev, and having madvise_walk_vmas() update prev in each
instance. Additionally make it clear that the meaningful change in vma
state is when madvise_update_vma() potentially merges a VMA, so
explicitly retrieve the VMA in this case.
* Update and clarify the madvise_walk_vmas() function - this is a source
of a great deal of confusion, so simplify, stop using prev = NULL to
signify that the mmap lock has been dropped (!) and make that explicit,
and add some comments to explain what's going on.
This patch (of 5):
Now we have the madvise_behavior helper struct we no longer need to mess
around with void* pointers in order to propagate anon_vma_name, and this
means we can get rid of the confusing and inconsistent visitor pattern
implementation in madvise_vma_anon_name().
This means we now have a single state object that threads through most of
madvise()'s logic and a single code path which executes the majority of
madvise() behaviour (we maintain separate logic for failure injection and
memory population for the time being).
We are able to remove the visitor pattern by handling the anon_vma_name
setting logic via an internal madvise flag - __MADV_SET_ANON_VMA_NAME.
This uses a negative value so it isn't reasonable that we will ever add
this as a UAPI flag.
Additionally, the madvise_behavior_valid() check ensures that
user-specified behaviours are strictly only those we permit which, of
course, this flag will be excluded from.
We are able to propagate the anon_vma_name object through use of the
madvise_behavior helper struct.
Doing this results in a can_modify_vma_madv() check for anonymous VMA name
changes, however this will cause no issues as this operation is not
prohibited.
We can also then reuse more code and drop the redundant
madvise_vma_anon_name() function altogether.
Additionally separate out behaviours that update VMAs from those that do
not.
Link: https://lkml.kernel.org/r/cover.1750433500.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/c5094bfccb41ecd19d4e9bcaa1c4a11e00158bba.1750433500.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Li Wang [Tue, 24 Jun 2025 03:27:48 +0000 (11:27 +0800)]
ksm_tests: skip hugepage test when Transparent Hugepages are disabled
Some systems (e.g. minimal or real-time kernels) may not enable
Transparent Hugepages (THP), causing MADV_HUGEPAGE to return EINVAL. This
patch introduces a runtime check using the existing THP sysfs interface
and skips the hugepage merging test (`-H`) when THP is not available.
# --------------------
# running ./khugepaged
# --------------------
# Reading PMD pagesize failed# [FAIL]
not ok 1 khugepaged # exit=1
# --------------------
# running ./soft-dirty
# --------------------
# TAP version 13
# 1..15
# ok 1 Test test_simple
# ok 2 Test test_vma_reuse dirty bit of allocated page
# ok 3 Test test_vma_reuse dirty bit of reused address page
# Bail out! Reading PMD pagesize failed# Planned tests != run tests (15 != 3)
# # Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0
# [FAIL]
not ok 1 soft-dirty # exit=1
# SUMMARY: PASS=0 SKIP=0 FAIL=1
# -------------------
# running ./migration
# -------------------
# TAP version 13
# 1..3
# # Starting 3 tests from 1 test cases.
# # RUN migration.private_anon ...
# # OK migration.private_anon
# ok 1 migration.private_anon
# # RUN migration.shared_anon ...
# # OK migration.shared_anon
# ok 2 migration.shared_anon
# # RUN migration.private_anon_thp ...
# # migration.c:196:private_anon_thp:Expected madvise(ptr, TWOMEG, MADV_HUGEPAGE) (-1) == 0 (0)
# # private_anon_thp: Test terminated by assertion
# # FAIL migration.private_anon_thp
# not ok 3 migration.private_anon_thp
# # FAILED: 2 / 3 tests passed.
# # Totals: pass:2 fail:1 xfail:0 xpass:0 skip:0 error:0
# [FAIL]
not ok 1 migration # exit=1
It's true that CONFIG_TRANSPARENT_HUGEPAGE=y is explicitly enabled in
tools/testing/selftests/mm/config, so ideally the runtime environment
should also support THP.
However, in practice, we've found that on some systems:
- THP is disabled at boot time (transparent_hugepage=never)
- Or manually disabled via sysfs
- Or unavailable in RT kernels, containers, or minimal CI environments
In these cases, the test will fail with EINVAL on madvise(MADV_HUGEPAGE),
even though the kernel config is correct.
To make the test suite more robust and avoid false negatives, this patch
adds a runtime check for /sys/kernel/mm/transparent_hugepage/enabled.
If THP is not available, the hugepage test (-H) is skipped with a clear
message.
Link: https://lkml.kernel.org/r/20250624032748.393836-1-liwang@redhat.com Signed-off-by: Li Wang <liwang@redhat.com> Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joey Gouly <joey.gouly@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Keith Lucas <keith.lucas@oracle.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Li Wang [Tue, 24 Jun 2025 04:24:11 +0000 (12:24 +0800)]
selftests/mm: fix UFFDIO_API usage with proper two-step feature negotiation
The current implementation of test_unmerge_uffd_wp() explicitly sets
`uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP` before calling
UFFDIO_API. This can cause the ioctl() call to fail with EINVAL on
kernels that do not support UFFD-WP, leading the test to fail
unnecessarily:
# ------------------------------
# running ./ksm_functional_tests
# ------------------------------
# TAP version 13
# 1..9
# # [RUN] test_unmerge
# ok 1 Pages were unmerged
# # [RUN] test_unmerge_zero_pages
# ok 2 KSM zero pages were unmerged
# # [RUN] test_unmerge_discarded
# ok 3 Pages were unmerged
# # [RUN] test_unmerge_uffd_wp
# not ok 4 UFFDIO_API failed <-----
# # [RUN] test_prot_none
# ok 5 Pages were unmerged
# # [RUN] test_prctl
# ok 6 Setting/clearing PR_SET_MEMORY_MERGE works
# # [RUN] test_prctl_fork
# # No pages got merged
# # [RUN] test_prctl_fork_exec
# ok 7 PR_SET_MEMORY_MERGE value is inherited
# # [RUN] test_prctl_unmerge
# ok 8 Pages were unmerged
# Bail out! 1 out of 8 tests failed
# # Planned tests != run tests (9 != 8)
# # Totals: pass:7 fail:1 xfail:0 xpass:0 skip:0 error:0
# [FAIL]
This patch improves compatibility and robustness of the UFFD-WP test
(test_unmerge_uffd_wp) by correctly implementing the UFFDIO_API two-step
handshake as recommended by the userfaultfd(2) man page.
Key changes:
1. Use features=0 in the initial UFFDIO_API call to query supported
feature bits, rather than immediately requesting WP support.
2. Skip the test gracefully if:
- UFFDIO_API fails with EINVAL (e.g. unsupported API version), or
- UFFD_FEATURE_PAGEFAULT_FLAG_WP is not advertised by the kernel.
3. Close the initial userfaultfd and create a new one before enabling
the required feature, since UFFDIO_API can only be called once per fd.
4. Improve diagnostics by distinguishing between expected and unexpected
failures, using strerror() to report errors.
This ensures the test behaves correctly across a wider range of kernel
versions and configurations, while preserving the intended behavior on
kernels that support UFFD-WP.
[liwang@redhat.com: fail the test if sys_userfaultfd() fails, per David] Link: https://lkml.kernel.org/r/20250625004645.400520-1-liwang@redhat.com Link: https://lkml.kernel.org/r/20250624042411.395285-1-liwang@redhat.com Signed-off-by: Li Wang <liwang@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Joey Gouly <joey.gouly@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Keith Lucas <keith.lucas@oracle.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Li Wang <liwang@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Tue, 24 Jun 2025 15:48:23 +0000 (11:48 -0400)]
maple_tree: add testing for restoring maple state to active
Restoring maple status to ma_active on overflow/underflow when mas->node
was NULL could have happened in the past, but was masked by a bug in
mas_walk(). Add test cases that triggered the bug when the node was
mas->node prior to fixing the maple state setup.
Add a few extra tests around restoring the active maple status.
Liam R. Howlett [Tue, 24 Jun 2025 15:48:22 +0000 (11:48 -0400)]
maple_tree: fix status setup on restore to active
During the initial call with a maple state, an error status may be set
before a valid node is populated into the maple state node. Subsequent
calls with the maple state may restore the state into an active state with
no node set. This was masked by the mas_walk() always resetting the
status to ma_start and result in an extra walk in this rare scenario.
Don't restore the state to active unless there is a value in the structs
node. This also allows mas_walk() to be fixed to use the active state
without exposing an issue.
User visible results are marginal performance improvements when an active
state can be restored and used instead of rewalking the tree.
Stable is not Cc'ed because the existing code is stable and the
performance gains are not worth the risk.
copy_to_kernel_nofault() is an internal helper which should not be visible
to loadable modules – exporting it would give exploit code a cheap
oracle to probe kernel addresses. Instead, keep the helper un-exported
and compile the kunit case that exercises it only when
mm/kasan/kasan_test.o is linked into vmlinux.
[snovitoll@gmail.com: add a brief comment to `#ifndef MODULE`] Link: https://lkml.kernel.org/r/20250622141142.79332-1-snovitoll@gmail.com Link: https://lkml.kernel.org/r/20250622051906.67374-1-snovitoll@gmail.com Fixes: ca79a00bb9a8 ("kasan: migrate copy_user_test to kunit") Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com> Suggested-by: Christoph Hellwig <hch@infradead.org> Suggested-by: Marco Elver <elver@google.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
lib/test_vmalloc.c: restrict default test mask to avoid test warnings
When the vmalloc test is built into the kernel, it runs automatically
during the boot. The current-default "run_test_mask" includes all test
cases, including those which are designed to fail and which trigger kernel
warnings.
These kernel splats can be misinterpreted as actual kernel bugs, leading
to false alarms and unnecessary reports.
To address this, limit the default test mask to only the first few tests
which are expected to pass cleanly. These tests are safe and should not
generate any warnings unless there is a real bug.
Users who wish to explicitly run specific test cases have to pass the
run_test_mask as a boot parameter or at module load time.
Link: https://lkml.kernel.org/r/20250623184035.581229-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: David Wang <00107082@163.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
lib/test_vmalloc.c: use late_initcall() if built-in for init ordering
When the vmalloc test code is compiled as a built-in, use late_initcall()
instead of module_init() to defer a vmalloc test execution until most
subsystems are up and running.
It avoids interfering with components that may not yet be initialized at
module_init() time. For example, there was a recent report of memory
profiling infrastructure not being ready early enough leading to kernel
crash.
By using late_initcall() in the built-in case, we ensure the tests are run
at a safer point during a boot sequence.
Link: https://lkml.kernel.org/r/20250623184035.581229-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: David Wang <00107082@163.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sun, 22 Jun 2025 21:37:59 +0000 (14:37 -0700)]
mm/damon/sysfs: decouple from damon_ops_id
Decouple DAMON sysfs interface from damon_ops_id. For this, define and
use new mm/damon/sysfs.c internal data structure that maps the user-space
keywords and damon_ops_id, instead of having the implicit and unflexible
array index rule.
SeongJae Park [Sun, 22 Jun 2025 21:37:58 +0000 (14:37 -0700)]
mm/damon/sysfs-schemes: decouple from damos_filter_type
Decouple DAMOS sysfs interface from damos_filter_type. For this, define
and use new sysfs-schemes internal data structure that maps the user-space
keywords and damos_filter_type, instead of having the implicit and
unflexible array index rule.
SeongJae Park [Sun, 22 Jun 2025 21:37:57 +0000 (14:37 -0700)]
mm/damon/sysfs-schemes: decouple from damos_wmark_metric
Decouple DAMOS sysfs interface from damos_wmark_metric. For this, define
and use new sysfs-schemes internal data structure that maps the user-space
keywords and damos_wmark_metric, instead of having the implicit and
unflexible array index rule.
SeongJae Park [Sun, 22 Jun 2025 21:37:56 +0000 (14:37 -0700)]
mm/damon/sysfs-schemes: decouple from damos_action
Decouple DAMOS sysfs interface from damos_action. For this, define and
use new sysfs-schemes internal data structure that maps the user-space
keywords and damos_action, instead of having the implicit and unflexible
array index rule.
[akpm@linux-foundation.org: make damos_sysfs_action_names static] Closes: https://lore.kernel.org/oe-kbuild-all/202506271655.b8yfEZIT-lkp@intel.com/ Link: https://lkml.kernel.org/r/20250622213759.50930-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sun, 22 Jun 2025 21:37:55 +0000 (14:37 -0700)]
mm/damon/sysfs-schemes: decouple from damos_quota_goal_metric
Patch series "mm/damon: decouple sysfs from core".
DAMON sysfs interface is coupled with core layer. It maintains some of
its keywords arrays be synchronized with matching DAMON core API enums.
It is unnecessary coupling that makes separated changes for different
layers difficult. Decouple the layers by introducing new data structure
for the mappings on DAMON sysfs interface.
This patch (of 5):
Decouple DAMOS sysfs interface from damos_quota_goal_metric. For this,
define and use new sysfs-schemes internal data structure that maps the
user-space keywords and damos_quota_goal_metric, instead of having the
implicit and unflexible array index rule.
mm/ptdump: take the memory hotplug lock inside ptdump_walk_pgd()
Memory hot remove unmaps and tears down various kernel page table regions
as required. The ptdump code can race with concurrent modifications of
the kernel page tables. When leaf entries are modified concurrently, the
dump code may log stale or inconsistent information for a VA range, but
this is otherwise not harmful.
But when intermediate levels of kernel page table are freed, the dump code
will continue to use memory that has been freed and potentially
reallocated for another purpose. In such cases, the ptdump code may
dereference bogus addresses, leading to a number of potential problems.
To avoid the above mentioned race condition, platforms such as arm64,
riscv and s390 take memory hotplug lock, while dumping kernel page table
via the sysfs interface /sys/kernel/debug/kernel_page_tables.
Similar race condition exists while checking for pages that might have
been marked W+X via /sys/kernel/debug/kernel_page_tables/check_wx_pages
which in turn calls ptdump_check_wx(). Instead of solving this race
condition again, let's just move the memory hotplug lock inside generic
ptdump_check_wx() which will benefit both the scenarios.
Drop get_online_mems() and put_online_mems() combination from all existing
platform ptdump code paths.
Link: https://lkml.kernel.org/r/20250620052427.2092093-1-anshuman.khandual@arm.com Fixes: bbd6ec605c0f ("arm64/mm: Enable memory hot remove") Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Fri, 20 Jun 2025 15:00:58 +0000 (11:00 -0400)]
selftests/mm: reduce uffd-unit-test poison test to minimum
The test will still generate quite some unwanted MCE error messages to
syslog. There was old proposal ratelimiting the MCE messages from kernel,
but that has risk of hiding real useful information on production systems.
We can at least reduce the test to minimum to not over-pollute dmesg,
however trying to not lose its coverage too much.
Dev Jain [Tue, 24 Jun 2025 08:07:48 +0000 (13:37 +0530)]
maple tree: use goto label to simplify code
Use the underflow goto label to set the status to ma_underflow and return
NULL, as is being done elsewhere.
[akpm@linux-foundation.org: add newline, per Liam (and remove one, per akpm)] Link: https://lkml.kernel.org/r/20250624080748.4855-1-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alistair Popple [Thu, 19 Jun 2025 08:58:04 +0000 (18:58 +1000)]
mm: remove PFN_DEV, PFN_MAP, PFN_SPECIAL, PFN_SG_CHAIN and PFN_SG_LAST
The PFN_MAP flag is no longer used for anything, so remove it. The
PFN_SG_CHAIN and PFN_SG_LAST flags never appear to have been used so also
remove them. The last user of PFN_SPECIAL was removed by 653d7825c149
("dcssblk: mark DAX broken, remove FS_DAX_LIMITED support").
Users of PFN_DEV were removed earlier in this series by "mm: Remove
remaining uses of PFN_DEV".
Link: https://lkml.kernel.org/r/670b3950d70b4d97b905bb597dadfd3633de4314.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alistair Popple [Thu, 19 Jun 2025 08:58:03 +0000 (18:58 +1000)]
mm: remove devmap related functions and page table bits
Now that DAX and all other reference counts to ZONE_DEVICE pages are
managed normally there is no need for the special devmap PTE/PMD/PUD page
table bits. So drop all references to these, freeing up a software
defined page table bit on architectures supporting it.
Link: https://lkml.kernel.org/r/6389398c32cc9daa3dfcaa9f79c7972525d310ce.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: Will Deacon <will@kernel.org> # arm64 Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chunyan Zhang <zhang.lyra@gmail.com> Reviewed-by: Björn Töpel <bjorn@rivosinc.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alistair Popple [Thu, 19 Jun 2025 08:58:02 +0000 (18:58 +1000)]
fs/dax: remove FS_DAX_LIMITED config option
The dcssblk driver was the last user of FS_DAX_LIMITED. That was marked
broken by 653d7825c149 ("dcssblk: mark DAX broken, remove FS_DAX_LIMITED
support") to allow removal of PFN_SPECIAL. However the FS_DAX_LIMITED
config option itself was not removed, so do that now.
Link: https://lkml.kernel.org/r/b47bf164b4a1013d736fa1a3d501bc8b8e71311f.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alistair Popple [Thu, 19 Jun 2025 08:58:01 +0000 (18:58 +1000)]
powerpc: remove checks for devmap pages and PMDs/PUDs
PFN_DEV no longer exists. This means no devmap PMDs or PUDs will be
created, so checking for them is redundant. Instead mappings of pages
that would have previously returned true for pXd_devmap() will return true
for pXd_trans_huge()
Link: https://lkml.kernel.org/r/31f63cc8dd518f9e2ec300681fe302eb4adf49b4.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The pmd_devmap() check in check_pmd_state() is redundant because the only
user of pmd_devmap were device dax and fs dax. However all callers of
check_pmd_state() first call thp_vma_allowable_order() to check if the vma
should be scanned. Except when called from a page fault this always
returns 0 for dax vma's, hence we would never expect to see a pmd_devmap
entry.
Link: https://lkml.kernel.org/r/a68175fd3a37e9b72cc82c1d63fd8b69691a85b5.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alistair Popple [Thu, 19 Jun 2025 08:57:59 +0000 (18:57 +1000)]
mm: remove redundant pXd_devmap calls
DAX was the only thing that created pmd_devmap and pud_devmap entries
however it no longer does as DAX pages are now refcounted normally and
pXd_trans_huge() returns true for those. Therefore checking both
pXd_devmap and pXd_trans_huge() is redundant and the former can be removed
without changing behaviour as it will always be false.
Link: https://lkml.kernel.org/r/d58f089dc16b7feb7c6728164f37dea65d64a0d3.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alistair Popple [Thu, 19 Jun 2025 08:57:58 +0000 (18:57 +1000)]
mm/huge_memory: remove pXd_devmap usage from insert_pXd_pfn()
Nothing uses PFN_DEV anymore so no need to create devmap pXd's when
mapping a PFN. Instead special mappings will be created which ensures
vm_normal_page_pXd() will not return pages which don't have an associated
page. This could change behaviour slightly on architectures where
pXd_devmap() does not imply pXd_special() as the normal page checks would
have fallen through to checking VM_PFNMAP/MIXEDMAP instead, which in
theory at least could have returned a page.
However vm_normal_page_pXd() should never have been returning pages for
pXd_devmap() entries anyway, so anything relying on that would have been a
bug.
Link: https://lkml.kernel.org/r/cd8658f9ff10afcfffd8b145a39d98bf1c595ffa.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alistair Popple [Thu, 19 Jun 2025 08:57:57 +0000 (18:57 +1000)]
mm/gup: remove pXX_devmap usage from get_user_pages()
GUP uses pXX_devmap() calls to see if it needs to a get a reference on the
associated pgmap data structure to ensure the pages won't go away.
However it's a driver responsibility to ensure that if pages are mapped
(ie. discoverable by GUP) that they are not offlined or removed from the
memmap so there is no need to hold a reference on the pgmap data structure
to ensure this.
Furthermore mappings with PFN_DEV are no longer created, hence this
effectively dead code anyway so can be removed.
Link: https://lkml.kernel.org/r/708b2be76876659ec5261fe5d059b07268b98b36.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alistair Popple [Thu, 19 Jun 2025 08:57:56 +0000 (18:57 +1000)]
mm: convert vmf_insert_mixed() from using pte_devmap to pte_special
DAX no longer requires device PTEs as it always has a ZONE_DEVICE page
associated with the PTE that can be reference counted normally. Other
users of pte_devmap are drivers that set PFN_DEV when calling
vmf_insert_mixed() which ensures vm_normal_page() returns NULL for these
entries.
There is no reason to distinguish these pte_devmap users so in order to
free up a PTE bit use pte_special instead for entries created with
vmf_insert_mixed(). This will ensure vm_normal_page() will continue to
return NULL for these pages.
Architectures that don't support pte_special also don't support pte_devmap
so those will continue to rely on pfn_valid() to determine if the page can
be mapped.
Link: https://lkml.kernel.org/r/93086bd446e7bf8e4c85345613ac18f706b01f60.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alistair Popple [Thu, 19 Jun 2025 08:57:55 +0000 (18:57 +1000)]
mm: remove remaining uses of PFN_DEV
PFN_DEV was used by callers of dax_direct_access() to figure out if the
returned PFN is associated with a page using pfn_t_has_page() or not.
However all DAX PFNs now require an assoicated ZONE_DEVICE page so can
assume a page exists.
Other users of PFN_DEV were setting it before calling vmf_insert_mixed().
This is unnecessary as it is no longer checked, instead relying on
pfn_valid() to determine if there is an associated page or not.
Link: https://lkml.kernel.org/r/74b293aebc21b941090bc3e7aeafa91b57c821a5.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alistair Popple [Thu, 19 Jun 2025 08:57:54 +0000 (18:57 +1000)]
mm: filter zone device pages returned from folio_walk_start()
Previously dax pages were skipped by the pagewalk code as pud_special() or
vm_normal_page{_pmd}() would be false for DAX pages. Now that dax pages
are refcounted normally that is no longer the case, so the pagewalk code
will start returning them.
Most callers already explicitly filter for DAX or zone device pages so
don't need updating. However some don't, so add checks to those callers.
Link: https://lkml.kernel.org/r/4ecb7b357fc5b435588024770b88bbb695c30090.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alistair Popple [Thu, 19 Jun 2025 08:57:53 +0000 (18:57 +1000)]
mm: convert pXd_devmap checks to vma_is_dax
Patch series "mm: Remove pXX_devmap page table bit and pfn_t type", v3.
All users of dax now require a ZONE_DEVICE page which is properly
refcounted. This means there is no longer any need for the PFN_DEV,
PFN_MAP and PFN_SPECIAL flags. Furthermore the PFN_SG_CHAIN and
PFN_SG_LAST flags never appear to have been used. It is therefore
possible to remove the pfn_t type and replace any usage with raw pfns.
The remaining users of PFN_DEV have simply passed this to
vmf_insert_mixed() to create pte_devmap() mappings. It is unclear why
this was the case but presumably to ensure vm_normal_page() does not
return these pages. These users can be trivially converted to raw pfns
and creating a pXX_special() mapping to ensure vm_normal_page() still
doesn't return these pages.
Now that there are no users of PFN_DEV we can remove the devmap page table
bit and all associated functions and macros, freeing up a software page
table bit.
This patch (of 14):
Currently dax is the only user of pmd and pud mapped ZONE_DEVICE pages.
Therefore page walkers that want to exclude DAX pages can check pmd_devmap
or pud_devmap. However soon dax will no longer set PFN_DEV, meaning dax
pages are mapped as normal pages.
Ensure page walkers that currently use pXd_devmap to skip DAX pages
continue to do so by adding explicit checks of the VMA instead.
Remove vma_is_dax() check from mm/userfaultfd.c as validate_move_areas()
will already skip DAX VMA's on account of them not being anonymous.
The check in userfaultfd_must_wait() is also redundant as
vma_can_userfault() should have already filtered out dax vma's.
For HMM the pud_devmap check seems unnecessary as there is no reason we
shouldn't be able to handle any leaf PUD here so remove it also.
Link: https://lkml.kernel.org/r/cover.176965585864cb8d2cf41464b44dcc0471e643a0.1750323463.git-series.apopple@nvidia.com Link: https://lkml.kernel.org/r/f0611f6f475f48fcdf34c65084a359aefef4cccc.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hao Ge [Wed, 18 Jun 2025 01:58:09 +0000 (09:58 +0800)]
mm/percpu: conditionally define _shared_alloc_tag via CONFIG_ARCH_MODULE_NEEDS_WEAK_PER_CPU
Recently discovered this entry while checking kallsyms on ARM64: ffff800083e509c0 D _shared_alloc_tag
If ARCH_NEEDS_WEAK_PER_CPU is not defined(it is only defined for s390 and
alpha architectures), there's no need to statically define the percpu
variable _shared_alloc_tag.
Therefore, we need to implement isolation for this purpose.
When building the core kernel code for s390 or alpha architectures,
ARCH_NEEDS_WEAK_PER_CPU remains undefined (as it is gated by #if
defined(MODULE)). However, when building modules for these architectures,
the macro is explicitly defined.
Therefore, we remove all instances of ARCH_NEEDS_WEAK_PER_CPU from the
code and introduced CONFIG_ARCH_MODULE_NEEDS_WEAK_PER_CPU to replace the
relevant logic. We can now conditionally define the perpcu variable
_shared_alloc_tag based on CONFIG_ARCH_MODULE_NEEDS_WEAK_PER_CPU. This
allows architectures (such as s390/alpha) that require weak definitions
for percpu variables in modules to include the definition, while others
can omit it via compile-time exclusion.
Link: https://lkml.kernel.org/r/20250618015809.1235761-1-hao.ge@linux.dev Signed-off-by: Hao Ge <gehao@kylinos.cn> Suggested-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390] Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Chistoph Lameter <cl@linux.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vivek Kasireddy [Wed, 18 Jun 2025 05:30:55 +0000 (22:30 -0700)]
selftests/udmabuf: add a test to pin first before writing to memfd
Unlike the existing tests, this new test will create a memfd (backed by
hugetlb) and pin the folios in it (a small subset) before writing/
populating it with data. This is a valid use-case that invokes the
memfd_alloc_folio() kernel API and is expected to work unless there aren't
enough hugetlb folios to satisfy the allocation needs.
Link: https://lkml.kernel.org/r/20250618053415.1036185-4-vivek.kasireddy@intel.com Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: Steve Sistare <steven.sistare@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vivek Kasireddy [Wed, 18 Jun 2025 05:30:54 +0000 (22:30 -0700)]
mm/memfd: reserve hugetlb folios before allocation
When we try to allocate a folio via alloc_hugetlb_folio_reserve(), we need
to ensure that there is an active reservation associated with the
allocation. Otherwise, our allocation request would fail if there are no
active reservations made at that moment against any other allocations.
This is because alloc_hugetlb_folio_reserve() checks h->resv_huge_pages
before proceeding with the allocation.
Therefore, to address this issue, we just need to make a reservation (by
calling hugetlb_reserve_pages()) before we try to allocate the folio.
This will also ensure that proper region/subpool accounting is done
associated with our allocation.
Link: https://lkml.kernel.org/r/20250618053415.1036185-3-vivek.kasireddy@intel.com Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Cc: Steve Sistare <steven.sistare@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vivek Kasireddy [Wed, 18 Jun 2025 05:30:53 +0000 (22:30 -0700)]
mm/hugetlb: make hugetlb_reserve_pages() return nr of entries updated
Patch series "mm/memfd: Reserve hugetlb folios before allocation", v4.
There are cases when we try to pin a folio but discover that it has not
been faulted-in. So, we try to allocate it in memfd_alloc_folio() but the
allocation request may not succeed if there are no active reservations in
the system at that instant.
Therefore, making a reservation (by calling hugetlb_reserve_pages())
associated with the allocation will ensure that our request would not fail
due to lack of reservations. This will also ensure that proper
region/subpool accounting is done with our allocation.
This patch (of 3):
Currently, hugetlb_reserve_pages() returns a bool to indicate whether the
reservation map update for the range [from, to] was successful or not.
This is not sufficient for the case where the caller needs to determine
how many entries were updated for the range.
Therefore, have hugetlb_reserve_pages() return the number of entries
updated in the reservation map associated with the range [from, to].
Also, update the callers of hugetlb_reserve_pages() to handle the new
return value.
Petr Pavlu [Wed, 18 Jun 2025 12:50:35 +0000 (14:50 +0200)]
codetag: avoid unused alloc_tags sections/symbols
With CONFIG_MEM_ALLOC_PROFILING=n, vmlinux and all modules unnecessarily
contain the symbols __start_alloc_tags and __stop_alloc_tags, which define
an empty range. In the case of modules, the presence of these symbols
also forces the linker to create an empty .codetag.alloc_tags section.
Update codetag.lds.h to make the data conditional on
CONFIG_MEM_ALLOC_PROFILING.
Link: https://lkml.kernel.org/r/20250618125037.53182-1-petr.pavlu@suse.com Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Casey Chen <cachen@purestorage.com> Cc: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Wed, 18 Jun 2025 19:42:54 +0000 (20:42 +0100)]
mm: update architecture and driver code to use vm_flags_t
In future we intend to change the vm_flags_t type, so it isn't correct for
architecture and driver code to assume it is unsigned long. Correct this
assumption across the board.
Overall, this patch does not introduce any functional change.
Link: https://lkml.kernel.org/r/b6eb1894abc5555ece80bb08af5c022ef780c8bc.1750274467.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Wed, 18 Jun 2025 19:42:53 +0000 (20:42 +0100)]
mm: update core kernel code to use vm_flags_t consistently
The core kernel code is currently very inconsistent in its use of
vm_flags_t vs. unsigned long. This prevents us from changing the type of
vm_flags_t in the future and is simply not correct, so correct this.
While this results in rather a lot of churn, it is a critical
pre-requisite for a future planned change to VMA flag type.
Additionally, update VMA userland tests to account for the changes.
To make review easier and to break things into smaller parts, driver and
architecture-specific changes is left for a subsequent commit.
The code has been adjusted to cascade the changes across all calling code
as far as is needed.
We will adjust architecture-specific and driver code in a subsequent patch.
Overall, this patch does not introduce any functional change.
Link: https://lkml.kernel.org/r/d1588e7bb96d1ea3fe7b9df2c699d5b4592d901d.1750274467.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Kees Cook <kees@kernel.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Wed, 18 Jun 2025 19:42:52 +0000 (20:42 +0100)]
mm: change vm_get_page_prot() to accept vm_flags_t argument
Patch series "use vm_flags_t consistently".
The VMA flags field vma->vm_flags is of type vm_flags_t. Right now this
is exactly equivalent to unsigned long, but it should not be assumed to
be.
Much code that references vma->vm_flags already correctly uses vm_flags_t,
but a fairly large chunk of code simply uses unsigned long and assumes
that the two are equivalent.
This series corrects that and has us use vm_flags_t consistently.
This series is motivated by the desire to, in a future series, adjust
vm_flags_t to be a u64 regardless of whether the kernel is 32-bit or
64-bit in order to deal with the VMA flag exhaustion issue and avoid all
the various problems that arise from it (being unable to use certain
features in 32-bit, being unable to add new flags except for 64-bit, etc.)
This is therefore a critical first step towards that goal. At any rate,
using the correct type is of value regardless.
We additionally take the opportunity to refer to VMA flags as vm_flags
where possible to make clear what we're referring to.
Overall, this series does not introduce any functional change.
This patch (of 3):
We abstract the type of the VMA flags to vm_flags_t, however in may places
it is simply assumed this is unsigned long, which is simply incorrect.
At the moment this is simply an incongruity, however in future we plan to
change this type and therefore this change is a critical requirement for
doing so.
Overall, this patch does not introduce any functional change.
[lorenzo.stoakes@oracle.com: add missing vm_get_page_prot() instance, remove include] Link: https://lkml.kernel.org/r/552f88e1-2df8-4e95-92b8-812f7c8db829@lucifer.local Link: https://lkml.kernel.org/r/cover.1750274467.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/a12769720a2743f235643b158c4f4f0a9911daf0.1750274467.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit a00ce85af2a1 ("mm: make alloc_demote_folio externally invokable for
migration") was made to let DAMOS_MIGRATE_{HOT,COLD} call the function.
But a previous commit made DAMOS_MIGRATE_{HOT,COLD} call
alloc_migration_target() instead. Hence there are no more callers of the
function outside of vmscan.c. Revert the commit to make the function
static again.
Link: https://lkml.kernel.org/r/20250616172346.67659-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Honggyu Kim <honggyu.kim@sk.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit 8f75267d22bd ("mm: rename alloc_demote_folio to
alloc_migrate_folio") was to reflect the fact the function is called for
not only demotion, but also general migrations from
DAMOS_MIGRATE_{HOT,COLD}. The previous commit made the DAMOS actions to
not use alloc_migrate_folio(), though. So, demote_folio_list() is the
only caller of alloc_migrate_folio(), and the name could now be rather
confusing. Revert the renaming commit.
Link: https://lkml.kernel.org/r/20250616172346.67659-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Honggyu Kim <honggyu.kim@sk.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Mon, 16 Jun 2025 17:23:44 +0000 (10:23 -0700)]
mm/damon/paddr: use alloc_migartion_target() with no migration fallback nodemask
Patch series "mm/damon: use alloc_migrate_target() for
DAMOS_MIGRATE_{HOT,COLD}".
DAMOS_MIGRATE_{HOT,COLD} implementation resembles that for demotion, and
hence the behavior is also similar to that. But, since those are not only
for demotion but general migrations, it would be better to match with that
for move_pages() system call. Make the implementation and the behavior
more similar to move_pages() by not setting migration fallback nodes, and
using alloc_migration_target() instead of alloc_migrate_folio().
alloc_migrate_folio() was renamed from alloc_demote_folio() and been
non-static function, to let DAMOS_MIGRATE_{HOT,COLD} call it. As
alloc_migration_target() is called instead, the renaming and de-static
changes are no more required but could only make future code readers be
confused. Revert the changes, too.
This patch (of 3):
DAMOS_MIGRATE_{HOT,COLD} implementation resembles that for
demote_folio_list(). Because those are not only for demotion but general
folio migrations, it makes more sense to behave similarly to move_pages()
system call. Make the behavior more similar to move_pages(), by using
alloc_migration_target() instead of alloc_migrate_folio(), without
fallback nodemask.
Link: https://lkml.kernel.org/r/20250616172346.67659-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Honggyu Kim <honggyu.kim@sk.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 16 Jun 2025 18:45:21 +0000 (14:45 -0400)]
tools/testing/radix-tree: test maple tree chaining mas_preallocate() calls
Testing calling multiple mas_preallocate() calls in a row after adjusting
the maple state. Ensures new calls to mas_preallocate() will change the
number of allocated nodes.
Link: https://lkml.kernel.org/r/20250616184521.3382795-4-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Hailong Liu <hailong.liu@oppo.com> Cc: "Liam R. Howlett" <howlett@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>