mm: convert "movable" flag in page->mapping to a page flag
Instead, let's use a page flag. As the page flag can result in
false-positives, glue it to the page types for which we support/implement
movable_ops page migration.
We are reusing PG_uptodate, which is used, for example, to track file
system state and does not apply to movable_ops pages. Warning in
page_has_movable_ops() when the bit is set on other page types could
therefore result in false-positive warnings.
We could likely set the bit using a non-atomic update. However, in
contrast to page->mapping, others could be trying to update the flags
concurrently when trying to lock the folio. In
isolate_movable_ops_page(), we already take care of that by checking if
the page has movable_ops before locking it. Let's start with the atomic
variant; we could later switch to the non-atomic variant once we are sure
other cases are similarly fine. Once we perform the switch, we'll have to
introduce __SETPAGEFLAG_NOOP().
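The idea of gluing the flag to the supporting page types can be sketched
in userspace C as follows. This is only an illustration under stated
assumptions: struct mock_page, the mock_page_type enum and
MOCK_PG_MOVABLE_OPS are hypothetical names and do not mirror the kernel's
definitions. The point is that the reused bit is honored only for page
types that implement movable_ops, so the same bit set on any other page
type cannot produce a false positive.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for page types; not the kernel's real encoding. */
enum mock_page_type {
	MOCK_TYPE_NONE,
	MOCK_TYPE_OFFLINE,	/* e.g. memory balloon pages */
	MOCK_TYPE_ZSMALLOC,
};

/* Hypothetical flag bit; the commit reuses PG_uptodate for this purpose. */
#define MOCK_PG_MOVABLE_OPS	(1UL << 0)

struct mock_page {
	unsigned long flags;
	enum mock_page_type type;
};

/* The flag only counts when glued to a type that supports movable_ops. */
static bool mock_page_has_movable_ops(const struct mock_page *page)
{
	if (!(page->flags & MOCK_PG_MOVABLE_OPS))
		return false;
	return page->type == MOCK_TYPE_OFFLINE ||
	       page->type == MOCK_TYPE_ZSMALLOC;
}
```

A page of another type that happens to have the bit set (for example
because a file system uses it for its own state) is simply reported as
not having movable_ops.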
[david@redhat.com: add missing `:' in kerneldoc]
Link: https://lkml.kernel.org/r/d96e2916-2c43-462c-b6a1-2375ef397d8b@redhat.com
Link: https://lkml.kernel.org/r/20250704102524.326966-21-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Gregory Price <gourry@gourry.net>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
... instead, look them up statically based on the page type. Maybe in
the future we want a registration interface? At least for now, it can be
easily handled using the two page types that actually support page
migration.
The remaining usage of page->mapping is to flag such pages as actually
being movable (having movable_ops), which we will change next.
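A static lookup keyed on the page type could look roughly like the
following userspace sketch. All names here (mock_page_type,
mock_movable_ops, mock_page_movable_ops() and the two ops instances) are
hypothetical and only illustrate the shape of the approach: with just two
supporting page types, a switch statement replaces any pointer stored in
the page itself.

```c
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's real types. */
enum mock_page_type {
	MOCK_TYPE_NONE,
	MOCK_TYPE_OFFLINE,	/* balloon pages */
	MOCK_TYPE_ZSMALLOC,
};

struct mock_page {
	enum mock_page_type type;
};

struct mock_movable_ops {
	const char *name;
};

static const struct mock_movable_ops mock_balloon_mops = { "balloon" };
static const struct mock_movable_ops mock_zsmalloc_mops = { "zsmalloc" };

/* Look up the ops from the page type instead of from the page itself. */
static const struct mock_movable_ops *
mock_page_movable_ops(const struct mock_page *page)
{
	switch (page->type) {
	case MOCK_TYPE_OFFLINE:
		return &mock_balloon_mops;
	case MOCK_TYPE_ZSMALLOC:
		return &mock_zsmalloc_mops;
	default:
		return NULL;	/* this type has no movable_ops */
	}
}
```

A registration interface, as floated above, would replace the switch with
a table filled in at runtime; the sketch deliberately keeps the simpler
static form.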
Link: https://lkml.kernel.org/r/20250704102524.326966-20-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Convert to page_has_movable_ops(). While at it, clean up the relevant
code a bit.
The data_race() in migrate_folio_unmap() is questionable: we already hold
a page reference, and concurrent modifications can no longer happen (iow:
__ClearPageMovable() no longer exists). Drop it for now, we'll rework
page_has_movable_ops() soon either way to no longer rely on page->mapping.
Wherever we cast from folio to page now is a clear sign that this code has
to be decoupled.
Link: https://lkml.kernel.org/r/20250704102524.326966-19-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Previously, if __ClearPageMovable() were invoked on a page, this would
cause __PageMovable() to return false, but due to the continued existence
of page movable ops, PageMovable() would have returned true.
With __ClearPageMovable() gone, the two are exactly equivalent.
So we can replace PageMovable() checks by __PageMovable(). In fact,
__PageMovable() cannot change until a page is freed, so we can turn some
PageMovable() into sanity checks for __PageMovable().
Link: https://lkml.kernel.org/r/20250704102524.326966-16-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The Chinese docs in Documentation/translations/zh_CN/mm/page_migration.rst
still mention it, but that whole document is destined to get outdated and
then updated by somebody who actually speaks that language.
Link: https://lkml.kernel.org/r/20250704102524.326966-15-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/balloon_compaction: stop using __ClearPageMovable()
We can just look at the balloon device (stored in page->private), to see
if the page is still part of the balloon.
As isolated balloon pages cannot get released (they are taken off the
balloon list while isolated), we don't have to worry about this case in
the putback and migration callback. Add a WARN_ON_ONCE for now.
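The check on the balloon device can be sketched in userspace C as below.
The struct layouts and function names (mock_balloon_dev, mock_page,
mock_balloon_page_isolate()) are hypothetical, and the sketch assumes that
a page which left the balloon has its private pointer reset to NULL. It
shows only the isolation side: a page without a balloon device is no
longer part of the balloon and must not be isolated.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the balloon device. */
struct mock_balloon_dev {
	unsigned long isolated_pages;
};

/* Hypothetical page: ->private points at the owning balloon, or is NULL. */
struct mock_page {
	struct mock_balloon_dev *private;
};

static struct mock_balloon_dev *
mock_balloon_page_device(const struct mock_page *page)
{
	return page->private;
}

/* Isolation callback: refuse pages that already left the balloon. */
static bool mock_balloon_page_isolate(struct mock_page *page)
{
	struct mock_balloon_dev *dev = mock_balloon_page_device(page);

	if (!dev)
		return false;
	dev->isolated_pages++;
	return true;
}
```

Because an isolated page is off the balloon list and therefore cannot be
released, the putback and migration callbacks in this model would never
observe a NULL device, which matches the reasoning for only warning there.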
Link: https://lkml.kernel.org/r/20250704102524.326966-14-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Gregory Price <gourry@gourry.net>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Instead, let's check in the callbacks if the page was already destroyed,
which can be checked by looking at zpdesc->zspage (see reset_zpdesc()).
If we detect that the page was destroyed:
(1) Fail isolation, just like the migration core would
(2) Fake migration success just like the migration core would
In the putback case there is nothing to do: just like the migration core,
we don't do anything.
In the future, we should look into not letting these pages get destroyed
while they are isolated -- and instead delaying that to the
putback/migration call. Add a TODO for that.
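The two callback behaviors listed above can be illustrated with the
following userspace sketch. The types and functions (mock_zspage,
mock_zpdesc, mock_zs_page_isolate(), mock_zs_page_migrate()) are
hypothetical, and the sketch assumes that a destroyed page is recognizable
by a NULL zspage pointer and that 0 stands for migration success.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for zsmalloc's descriptors. */
struct mock_zspage {
	int dummy;
};

struct mock_zpdesc {
	struct mock_zspage *zspage;	/* NULL once the page was destroyed */
};

static bool mock_zpdesc_destroyed(const struct mock_zpdesc *zpdesc)
{
	return zpdesc->zspage == NULL;
}

/* (1) Fail isolation if the page was already destroyed. */
static bool mock_zs_page_isolate(const struct mock_zpdesc *zpdesc)
{
	return !mock_zpdesc_destroyed(zpdesc);
}

/* (2) Fake success if the page was destroyed while isolated. */
static int mock_zs_page_migrate(struct mock_zpdesc *dst,
				struct mock_zpdesc *src)
{
	if (mock_zpdesc_destroyed(src))
		return 0;	/* nothing to move; report success */

	/* Real migration in this model: hand the zspage over to dst. */
	dst->zspage = src->zspage;
	src->zspage = NULL;
	return 0;
}
```

In the faked-success case the destination is left untouched, so the
caller ends up freeing both pages without any data having been copied.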
Link: https://lkml.kernel.org/r/20250704102524.326966-13-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/migrate: move movable_ops page handling out of move_to_new_folio()
Let's move that handling directly into migrate_folio_move(), so we can
simplify move_to_new_folio(). While at it, fix up the documentation a bit.
Note that unmap_and_move_huge_page() does not care, because it only deals
with actual folios (we only support migration of individual movable_ops
pages).
Link: https://lkml.kernel.org/r/20250704102524.326966-12-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/migrate: rename isolate_movable_page() to isolate_movable_ops_page()
... and start moving back to per-page things that will absolutely not be
folio things in the future. Add documentation and a comment that the
remaining folio stuff (lock, refcount) will have to be reworked as well.
While at it, convert the VM_BUG_ON() into a WARN_ON_ONCE() and handle it
gracefully (relevant with further changes), and convert a WARN_ON_ONCE()
into a VM_WARN_ON_ONCE_PAGE().
Note that we will leave anything that needs a rework (lock, refcount,
->lru) to be using folios for now: that perfectly highlights the
problematic bits.
Link: https://lkml.kernel.org/r/20250704102524.326966-8-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/zsmalloc: make PageZsmalloc() sticky until the page is freed
Let the page freeing code handle clearing the page type. Being able to
identify zsmalloc pages until actually freed is a requirement for upcoming
movable_ops migration changes.
Link: https://lkml.kernel.org/r/20250704102524.326966-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/balloon_compaction: make PageOffline sticky until the page is freed
Let the page freeing code handle clearing the page type. Being able to
identify balloon pages until actually freed is a requirement for upcoming
movable_ops migration changes.
Link: https://lkml.kernel.org/r/20250704102524.326966-6-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/page_alloc: let page freeing clear any set page type
Currently, any user of page types must clear that type before freeing a
page back to the buddy, otherwise we'll run into mapcount related sanity
checks (because the page type currently overlays the page mapcount).
Let's allow page type users to skip clearing the page type by letting
the buddy handle it instead.
We'll focus on having a page type set on the first page of a larger
allocation only.
With this change, we can reliably identify typed folios even though they
might be in the process of getting freed, which will come in handy in
migration code (at least in the transition phase).
In the future we might want to warn on some page types. Instead of having
an "allow list", let's rather wait until we know about ones that should go
on such a "disallow list".
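Why a leftover page type trips mapcount sanity checks, and how the free
path can absorb the cleanup, can be sketched in userspace C. Everything
here is a simplified model: struct mock_page, MOCK_PAGE_TYPE_UNSET and
mock_free_page() are hypothetical names, and the sketch assumes the type
is stored in a 32-bit field that shares storage with a signed mapcount
whose "unmapped" value is -1.

```c
#include <stdbool.h>
#include <stdint.h>

/* All bits set: reads back as mapcount -1 through the shared storage. */
#define MOCK_PAGE_TYPE_UNSET	UINT32_MAX

/* Hypothetical page where the type overlays the mapcount. */
struct mock_page {
	union {
		int32_t mapcount;	/* -1 means "not mapped" */
		uint32_t page_type;
	};
};

static bool mock_page_has_type(const struct mock_page *page)
{
	return page->page_type != MOCK_PAGE_TYPE_UNSET;
}

/* Sanity check a free path would perform on the mapcount. */
static bool mock_mapcount_is_sane_for_free(const struct mock_page *page)
{
	return page->mapcount == -1;
}

/*
 * Free path: clear any type that is still set, so page type users no
 * longer have to do it before freeing. Returns whether the page then
 * passes the mapcount sanity check.
 */
static bool mock_free_page(struct mock_page *page)
{
	if (mock_page_has_type(page))
		page->page_type = MOCK_PAGE_TYPE_UNSET;
	return mock_mapcount_is_sane_for_free(page);
}
```

Until the free path runs, the type stays readable, which is what allows
typed pages to be identified even while they are being freed.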
Link: https://lkml.kernel.org/r/20250704102524.326966-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/balloon_compaction: convert balloon_page_delete() to balloon_page_finalize()
Let's move the removal of the page from the balloon list into the single
caller, to remove the dependency on the PG_isolated flag and clarify
locking requirements.
Note that until now, balloon_page_delete() was used on two paths:
(1) Removing a page from the balloon for deflation through
balloon_page_list_dequeue()
(2) Removing an isolated page from the balloon for migration in the
per-driver migration handlers. Isolated pages were already removed from
the balloon list during isolation.
So instead of relying on the flag, we can just distinguish both cases
directly and handle it accordingly in the caller.
We'll shuffle the operations a bit such that they logically make more
sense (e.g., remove from the list before clearing flags).
In balloon migration functions we can now move the balloon_page_finalize()
out of the balloon lock and perform the finalization just before dropping
the balloon reference.
Document that the page lock is currently required when modifying the
movability aspects of a page; hopefully we can soon decouple this from the
page lock.
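The split of responsibilities described above can be sketched in
userspace C. The list helpers, struct mock_page and the functions
mock_balloon_page_finalize(), mock_deflate_one() and
mock_migrate_isolated() are hypothetical and only model the two paths:
finalization no longer touches the list, and each caller knows whether
the page is still on it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modeled after a kernel list_head. */
struct mock_list {
	struct mock_list *prev, *next;
};

static void mock_list_init(struct mock_list *head)
{
	head->prev = head->next = head;
}

static void mock_list_add(struct mock_list *node, struct mock_list *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

static void mock_list_del(struct mock_list *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->prev = node->next = node;
}

static bool mock_list_empty(const struct mock_list *head)
{
	return head->next == head;
}

/* Hypothetical balloon page: list linkage plus a balloon back-pointer. */
struct mock_page {
	struct mock_list lru;
	void *private;
};

/* Finalize only drops the balloon state; the caller handles the list. */
static void mock_balloon_page_finalize(struct mock_page *page)
{
	page->private = NULL;
}

/* Path (1): deflation. The page is still on the balloon list. */
static void mock_deflate_one(struct mock_page *page)
{
	mock_list_del(&page->lru);	/* remove from the list first */
	mock_balloon_page_finalize(page);
}

/* Path (2): migration. Isolation already took the page off the list. */
static void mock_migrate_isolated(struct mock_page *page)
{
	mock_balloon_page_finalize(page);
}
```

Because neither path needs to ask the page whether it is isolated, the
flag check that balloon_page_delete() relied on has no counterpart here.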
Link: https://lkml.kernel.org/r/20250704102524.326966-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Gregory Price <gourry@gourry.net>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/balloon_compaction: we cannot have isolated pages in the balloon list
Patch series "mm/migration: rework movable_ops page migration (part 1)",
v2.
In the future, as we decouple "struct page" from "struct folio", pages
that support "non-lru page migration" -- movable_ops page migration such
as memory balloons and zsmalloc -- will no longer be folios. They will
not have ->mapping, ->lru, and likely no refcount and no page lock. But
they will have a type and flags 🙂
This is the first part (other parts not written yet) of decoupling
movable_ops page migration from folio migration.
In this series, we get rid of the ->mapping usage, and start cleaning up
the code + separating it from folio migration.
Migration core will have to be further reworked to not treat movable_ops
pages like folios. This is the first step into that direction.
This patch (of 29):
The core will set PG_isolated only after mops->isolate_page() was called.
In case of the balloon, that is where we will remove it from the balloon
list. So we cannot have isolated pages in the balloon list.
Let's drop this unnecessary check.
Link: https://lkml.kernel.org/r/20250704102524.326966-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jason Wang <jasowang@redhat.com> Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Wed, 2 Jul 2025 08:47:17 +0000 (09:47 +0100)]
tools/testing/selftests: add mremap() unfaulted/faulted test cases
Assert that mremap() behaviour is as expected when moving around unfaulted
VMAs immediately adjacent to faulted ones, as well as moving around
faulted VMAs and placing them back immediately adjacent to the VMA from
which they were moved.
This also introduces a shared helper for the syscall version of mremap()
so we don't encounter any issues with libc filtering parameters.
Dev Jain [Thu, 3 Jul 2025 06:33:38 +0000 (12:03 +0530)]
maple tree: add some comments
Add comments explaining the fields for maple_metadata, since "end" is
ambiguous and "gap" can be confused as the largest gap, whereas it is
actually the offset of the largest gap.
Add a comment for mas_ascend() to explain whose min and max we are trying
to find. Explain that, for example, if we are already on offset zero,
then the parent min is mas->min, otherwise we need to walk up to find the
implied pivot min.
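As an illustration only, the distinction between "the largest gap" and "the offset of the largest gap" can be modelled in a few lines of Python. This is a sketch, not the maple tree's C code; the per-slot gap sizes and the helper name largest_gap_offset() are made up for the example.

```python
def largest_gap_offset(gaps):
    """Return the slot offset that holds the largest gap (first one on ties)."""
    if not gaps:
        raise ValueError("node has no slots")
    return max(range(len(gaps)), key=lambda i: gaps[i])


# Hypothetical per-slot gap sizes for a single node.
gaps = [3, 10, 0, 7]
offset = largest_gap_offset(gaps)   # what a "gap" metadata field would store
size = gaps[offset]                 # the gap size itself is looked up via the offset
```

The metadata field stores the offset, and the size is found by indexing with it.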
Link: https://lkml.kernel.org/r/20250703063338.51509-1-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
__cma_declare_contiguous_nid() tries to allocate memory in several ways:
* on systems with 64 bit physical address and enough memory it first
attempts to allocate memory just above 4GiB
* if that fails, on systems with HIGHMEM the next attempt is from high
memory
* and finally, if none of the previous attempts succeeded, or none was
even tried because of an incompatible configuration, the memory is
allocated anywhere within the specified limits.
Move all the allocation logic to a helper function to make these steps more
obvious.
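The three steps listed above can be sketched as a small Python model. This is a hedged illustration, not the kernel helper: pick_cma_base(), the try_alloc callback, the exact eligibility conditions and the lower bound of 0 for the last step are all assumptions made for the example.

```python
GIB = 1 << 30


def pick_cma_base(size, limit, try_alloc, *, phys_64bit, has_highmem,
                  highmem_start):
    """Model the fallback order; return (step_name, base) or None.

    try_alloc(size, start, end) is a hypothetical allocator callback that
    returns a base address inside [start, end) or None on failure.
    """
    attempts = []
    # Step 1: 64-bit physical addresses and room just above 4 GiB.
    if phys_64bit and limit >= 4 * GIB + size:
        attempts.append(("above-4G", 4 * GIB, limit))
    # Step 2: on systems with HIGHMEM, try high memory next.
    if has_highmem and highmem_start < limit:
        attempts.append(("highmem", highmem_start, limit))
    # Step 3: anywhere within the specified limits.
    attempts.append(("anywhere", 0, limit))

    for name, start, end in attempts:
        base = try_alloc(size, start, end)
        if base is not None:
            return name, base
    return None
```

Building the list of attempts first and then walking it keeps the order of the steps visible in one place.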
Link: https://lkml.kernel.org/r/20250703184711.3485940-4-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Pratyush Yadav <ptyadav@amazon.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
cma: split reservation of fixed area into a helper function
Move the check that verifies that reservation of fixed area does not cross
HIGHMEM boundary and the actual memblock_reserve() call into a helper
function.
This makes code more readable and decouples logic related to
CONFIG_HIGHMEM from the core functionality of
__cma_declare_contiguous_nid().
Link: https://lkml.kernel.org/r/20250703184711.3485940-3-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Pratyush Yadav <ptyadav@amazon.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Thorsten Blum [Mon, 30 Jun 2025 13:23:18 +0000 (15:23 +0200)]
mm/cma: use str_plural() in cma_declare_contiguous_multi()
Use the string choice helper function str_plural() to simplify the code
and to fix the following Coccinelle/coccicheck warning reported by
string_choices.cocci:
opportunity for str_plural(nr)
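For readers unfamiliar with the helper, its behaviour can be modelled in Python as below. This is only a sketch of the semantics (an empty suffix for exactly one item, "s" otherwise); describe() and its message text are hypothetical and not the kernel's log line.

```python
def str_plural(num):
    """Model of the helper: "" for exactly one item, "s" otherwise."""
    return "" if num == 1 else "s"


def describe(nr):
    # Hypothetical message used only to show the call site pattern.
    return f"{nr} range{str_plural(nr)}"
```

The call site no longer needs an open-coded conditional to choose between the singular and plural form.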
Link: https://lkml.kernel.org/r/20250630132318.41339-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Dev Jain <dev.jain@arm.com> Tested-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 30 Jun 2025 14:42:11 +0000 (16:42 +0200)]
mm,hugetlb: drop obsolete comment about non-present pte and second faults
There is a comment in hugetlb_fault() that does not hold anymore. This
one:
    /*
     * vmf.orig_pte could be a migration/hwpoison vmf.orig_pte at this
     * point, so this check prevents the kernel from going below assuming
     * that we have an active hugepage in pagecache. This goto expects
     * the 2nd page fault, and is_hugetlb_entry_(migration|hwpoisoned)
     * check will properly handle it.
     */
This was written because back in the day we used to do:
    hugetlb_fault () {
        ptep = huge_pte_offset(...)
        if (ptep) {
            entry = huge_ptep_get(ptep)
            if (unlikely(is_hugetlb_entry_migration(entry)))
                ...
            else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
                ...
        }
        ...
        ...
        /*
         * entry could be a migration/hwpoison entry at this point, so this
         * check prevents the kernel from going below assuming that we have
         * a active hugepage in pagecache. This goto expects the 2nd page fault,
         * and is_hugetlb_entry_(migration|hwpoisoned) check will properly
         * handle it.
         */
        if (!pte_present(entry))
            goto out_mutex;
        ...
    }
The code was designed to check for hwpoisoned/migration entries upfront,
and then bail out if further down the pte was not present anymore, relying
on the second fault to properly handle migration/hwpoison entries that
time around.
The way we handle this is different nowadays, so drop the misleading
comment.
Oscar Salvador [Mon, 30 Jun 2025 14:42:10 +0000 (16:42 +0200)]
mm,hugetlb: rename anon_rmap to new_anon_folio and make it boolean
anon_rmap is used to determine whether the newly allocated folio is
anonymous. Rename it to something more meaningful like new_anon_folio and
make it boolean, as we use it like that.
While we are at it, drop 'new_pagecache_folio' as 'new_anon_folio' is
enough to check whether we need to restore the consumed reservation.
Oscar Salvador [Mon, 30 Jun 2025 14:42:09 +0000 (16:42 +0200)]
mm,hugetlb: sort out folio locking in the faulting path
Recent conversations showed that there was a misunderstanding about why we
were locking the folio prior to calling hugetlb_wp(). In fact, as soon as
we have the folio mapped into the pagetables, we no longer need to hold it
locked, because we know that no concurrent truncation could have happened.
There is only one case where the folio needs to be locked, and that is
when we are handling an anonymous folio, because hugetlb_wp() will check
whether it can re-use it exclusively for the process that is faulting it
in.
So, pass the folio locked to hugetlb_wp() when that is the case.
Oscar Salvador [Mon, 30 Jun 2025 14:42:08 +0000 (16:42 +0200)]
mm,hugetlb: change mechanism to detect a COW on private mapping
Patch series "Misc rework on hugetlb faulting path", v4.
This patchset aims to give some love to the hugetlb faulting path, doing
so by removing obsolete comments that are no longer true, sorting out the
folio lock, and changing the mechanism we use to determine whether we are
COWing a private mapping already.
The most important patch of the series is #1, as it fixes a deadlock that
was described in [1], where two processes were holding the same lock for
the folio in the pagecache, and then deadlocked in the mutex. Note that
this can also happen for anonymous folios. This has been tested using this
reproducer, below
Looking up and locking the folio in the pagecache was done to check
whether that folio was the same folio we had mapped in our pagetables,
meaning that if it was different we knew that we already mapped that folio
privately, so any further CoW would be made on a private mapping, which
led us to the question: __Was the reservation for that address
consumed?__ That is all we care about, because if it was indeed consumed
and we are the owner and we cannot allocate more folios, we need to unmap
the folio from the processes' pagetables and make it exclusive for us.
We figured we do not need to look up the folio at all, and it is just
enough to check whether the folio we have mapped is anonymous, which means
we mapped it privately, so the reservation was indeed consumed.
Patch#2 sorts out folio locking in the faulting path, reducing the scope
of it, only taking it when we are dealing with an anonymous folio, and
documents it. More details in the patch.
You will also have to add a delay in hugetlb_wp, after releasing the mutex
and before unmapping, so the window is large enough to reproduce it
reliably.
hugetlb_wp() checks whether the process is trying to COW on a private
mapping in order to know whether the reservation for that address was
already consumed. If it was consumed and we are the owner of the
mapping, the folio will have to be unmapped from the other processes.
Currently, that check is done by looking up the folio in the pagecache and
comparing it to the folio which is mapped in our pagetables. If it differs,
it means we already mapped it privately before, consuming a reservation on
the way. All we are interested in is whether the mapped folio is
anonymous, so we can simplify and check for that instead.
Gerald Schaefer [Mon, 30 Jun 2025 16:47:25 +0000 (18:47 +0200)]
mm/debug_vm_pgtable: use a swp_entry_t input value for swap tests
The various __pte/pmd_to_swp_entry and __swp_entry_to_pte/pmd helper
functions are expected to operate on swap PTE/PMD entries, not on present
and mapped entries.
Reflect this in the swap tests by using a swp_entry_t as input value, and
convert it to a swap PTE/PMD for testing, similar to how it is already
done in pte_swap_exclusive_tests(). Move the swap entry creation from
there to init_args() and store it in args, so it can also be used in other
functions.
The pte/pmd_swap_tests() are also changed to compare entries instead of
pfn values, again similar to pte_swap_exclusive_tests(). pte/pmd_pfn()
helpers are also not expected to operate on swap PTE/PMD entries at all.
Also update documentation, to reflect that the helpers operate on swap
PTE/PMD entries and not present and mapped entries, and use correct names,
i.e. __swp_to_pte/pmd_entry -> __swp_entry_to_pte/pmd.
For consistency, also change pte/pmd_swap_soft_dirty_tests() to use
args->swp_entry instead of a present and mapped PTE/PMD.
Thorsten Blum [Mon, 30 Jun 2025 17:18:26 +0000 (19:18 +0200)]
mm/hugetlb: use str_plural() in report_hugepages()
Use the string choice helper function str_plural() to simplify the code
and to fix the following Coccinelle/coccicheck warning reported by
string_choices.cocci:
opportunity for str_plural(nrinvalid)
Link: https://lkml.kernel.org/r/20250630171826.114008-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sat, 28 Jun 2025 16:04:26 +0000 (09:04 -0700)]
selftests/damon/sysfs.py: test monitoring attribute parameters
Add DAMON sysfs interface functionality tests for DAMON monitoring
attribute parameters, including intervals, intervals tuning goals, and
min/max number of regions.
SeongJae Park [Sat, 28 Jun 2025 16:04:25 +0000 (09:04 -0700)]
selftests/damon: add python and drgn-based DAMON sysfs test
Add a python-written DAMON sysfs functionality selftest. It sets DAMON
parameters using Python module _damon_sysfs, reads updated kernel internal
DAMON status and parameters using a 'drgn' script, namely
drgn_dump_damon_status.py, and compares whether the resulting DAMON internal
status is as expected. The test is very minimal at the moment.
SeongJae Park [Sat, 28 Jun 2025 16:04:24 +0000 (09:04 -0700)]
selftests/damon/_damon_sysfs: set Kdamond.pid in start()
_damon_sysfs.py is a Python module for reading and writing DAMON sysfs for
testing. It is not reading resulting kdamond pids. Read and update those
when starting kdamonds.
SeongJae Park [Sat, 28 Jun 2025 16:04:23 +0000 (09:04 -0700)]
selftests/damon: add drgn script for extracting damon status
Patch series "selftests/damon: add python and drgn based DAMON sysfs
functionality tests".
DAMON sysfs interface is the bridge between the user space and the kernel
space for DAMON parameters. There is no good and simple test to see if
the parameters are set as expected. Existing DAMON selftests therefore
test end-to-end features. For example, damos_quota_goal.py runs a DAMOS
scheme with quota goal set against a test program running an artificial
access pattern, and checks whether the result is as expected. Such tests
cover only a small part of DAMON. Adding more tests is also complicated.
Finally, the reliability of the test itself on different systems is bad.
'drgn' is a tool that can extract kernel internal data structures like
DAMON parameters. Add a test that passes specific DAMON parameters via
DAMON sysfs reusing _damon_sysfs.py, extract resulting DAMON parameters
via 'drgn', and compare those. Note that this test does not exhaustively
cover all DAMON parameters and input combinations, only very basic
things. Advancing the test infrastructure and adding more tests are
future work.
This patch (of 6):
'drgn' is a useful tool for extracting kernel internal data structures
such as DAMON's parameter and running status. Add a 'drgn' script that
extracts such DAMON internal data at runtime, so it can be used to check
whether a test input produced the expected results in the kernel.
The script saves or prints out the DAMON internal data as a json file or
string. This keeps users of the output from depending heavily on 'drgn'.
If 'drgn' is not available on a test setup and we find alternative tools
for doing that, the json-based tests can be updated to use an alternative
tool in the future.
Note that the script is tested with 'drgn v0.0.22'.
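The compare step described above could look roughly like the following Python sketch. It is an assumption-laden illustration: find_mismatches() is a hypothetical helper, and the JSON keys ("attrs", "sample_interval", "min_nr_regions") are invented for the example and may not match what the drgn script actually dumps.

```python
import json


def find_mismatches(expected, actual, path=""):
    """Return 'path: ...' strings for every expected leaf that differs."""
    mismatches = []
    for key, want in expected.items():
        where = f"{path}/{key}"
        if key not in actual:
            mismatches.append(f"{where}: missing")
        elif isinstance(want, dict) and isinstance(actual[key], dict):
            # Recurse into nested parameter groups.
            mismatches.extend(find_mismatches(want, actual[key], where))
        elif actual[key] != want:
            mismatches.append(
                f"{where}: expected {want!r}, got {actual[key]!r}")
    return mismatches


# Hypothetical dump; the real script's JSON layout may differ.
dumped = json.loads(
    '{"attrs": {"sample_interval": 5000, "min_nr_regions": 10}}')
expected = {"attrs": {"sample_interval": 5000, "min_nr_regions": 20}}
problems = find_mismatches(expected, dumped)
```

Because the comparison works on plain JSON data, the tool that produces the dump could be swapped without touching the comparison itself.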
Peter Xu [Fri, 27 Jun 2025 16:07:39 +0000 (12:07 -0400)]
mm: deduplicate mm_get_unmapped_area()
Essentially it sets vm_flags==0 for mm_get_unmapped_area_vmflags(). Use
the helper instead to dedup the lines.
Link: https://lkml.kernel.org/r/20250627160739.2124768-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Fri, 27 Jun 2025 16:07:07 +0000 (12:07 -0400)]
mm/hugetlb: remove prepare_hugepage_range()
Only mips and loongarch implemented this API; however, all it did was
check for stack overflow for either len or addr. That's already done in
the arch's arch_get_unmapped_area*() functions, even though the checks may
not be 100% identical.
For example, for both of the architectures, there will be a trivial
difference on how stack top was defined. The old code uses STACK_TOP
which may be slightly smaller than TASK_SIZE on either of them, but the
hope is that shouldn't be a problem.
This means the whole API is pretty much obsolete by now, so remove it
completely.
Link: https://lkml.kernel.org/r/20250627160707.2124580-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Muchun Song <muchun.song@linux.dev> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yunjeong Mun [Fri, 27 Jun 2025 16:33:29 +0000 (09:33 -0700)]
samples/damon/mtier: add parameters for node0 memory usage
Change the hard-coded quota goal metric values into sysfs knobs:
`node0_mem_used_bp` and `node0_mem_free_bp`. These knobs represent the
used and free memory ratio of node0 in basis points (bp, where 1 bp =
0.01%). As mentioned in [1], this patch is developed under the assumption
that node0 is always the fast-tier in a two-tiers memory setup.
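As a quick reminder of the unit, the conversion between a memory ratio and basis points can be sketched in Python. The helper names and the sample numbers are hypothetical and are not the sample module's code.

```python
def to_basis_points(part, total):
    """Express part/total in basis points (1 bp = 0.01%, 100% = 10000 bp)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return part * 10000 // total


def bp_to_percent(bp):
    return bp / 100


# Hypothetical node with 4 GiB total, 3 GiB of it in use.
used_bp = to_basis_points(3 << 30, 4 << 30)
```

Integer basis points let a ratio with two decimal places of percent be expressed without floating point.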
Zi Yan [Tue, 17 Jun 2025 02:11:14 +0000 (22:11 -0400)]
mm/page_isolation: remove migratetype parameter from more functions
Since migratetype is no longer overwritten during pageblock isolation,
start_isolate_page_range(), has_unmovable_pages(), and
set_migratetype_isolate() no longer need to know which migratetype to
restore during isolation failure.
has_unmovable_pages() needs to know if the isolation is for CMA
allocation, so add PB_ISOLATE_MODE_CMA_ALLOC to provide the information.
At the same time change isolation flags to enum pb_isolate_mode
(PB_ISOLATE_MODE_MEM_OFFLINE, PB_ISOLATE_MODE_CMA_ALLOC,
PB_ISOLATE_MODE_OTHER). Remove REPORT_FAILURE and check
PB_ISOLATE_MODE_MEM_OFFLINE, since only PB_ISOLATE_MODE_MEM_OFFLINE
reports isolation failures.
alloc_contig_range() no longer needs migratetype. Replace it with a newly
defined acr_flags_t to tell if an allocation is for CMA. The same goes for
__alloc_contig_migrate_range(). Add ACR_FLAGS_NONE (set to 0) to indicate
ordinary allocations.
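The relationship between the new mode enum and the allocation flags can be modelled in Python as follows. This is a sketch under assumptions: the Python class names, the CMA flag member and its value, and the two helper functions are hypothetical; only the three mode names and ACR_FLAGS_NONE being 0 come from the description above.

```python
from enum import Enum, IntFlag


class PbIsolateMode(Enum):
    MEM_OFFLINE = "mem_offline"
    CMA_ALLOC = "cma_alloc"
    OTHER = "other"


class AcrFlags(IntFlag):
    NONE = 0   # ordinary allocation
    CMA = 1    # assumed name and value for "allocation is for CMA"


def reports_failure(mode):
    """Only memory offlining reports isolation failures."""
    return mode is PbIsolateMode.MEM_OFFLINE


def isolate_mode_for(acr_flags):
    """Pick the isolation mode a contiguous allocation would use."""
    if acr_flags & AcrFlags.CMA:
        return PbIsolateMode.CMA_ALLOC
    return PbIsolateMode.OTHER
```

With a dedicated mode for memory offlining, a separate "report failure" flag becomes redundant.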
Link: https://lkml.kernel.org/r/20250617021115.2331563-7-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Richard Chang <richardycc@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Tue, 17 Jun 2025 02:11:13 +0000 (22:11 -0400)]
mm/page_isolation: remove migratetype from undo_isolate_page_range()
Since migratetype is no longer overwritten during pageblock isolation,
undoing pageblock isolation no longer needs to know which migratetype to restore.
Link: https://lkml.kernel.org/r/20250617021115.2331563-6-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Richard Chang <richardycc@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Tue, 17 Jun 2025 02:11:12 +0000 (22:11 -0400)]
mm/page_isolation: remove migratetype from move_freepages_block_isolate()
Since migratetype is no longer overwritten during pageblock isolation,
moving a pageblock out of MIGRATE_ISOLATE no longer needs a new
migratetype.
Add pageblock_isolate_and_move_free_pages() and
pageblock_unisolate_and_move_free_pages() to be explicit about the page
isolation operations. Both share the common code in
__move_freepages_block_isolate(), which is renamed from
move_freepages_block_isolate().
Add toggle_pageblock_isolate() to flip pageblock isolation bit in
__move_freepages_block_isolate().
Make set_pageblock_migratetype() only accept non MIGRATE_ISOLATE types, so
that one should use set_pageblock_isolate() to isolate pageblocks. As a
result, move pageblock migratetype code out of __move_freepages_block().
Link: https://lkml.kernel.org/r/20250617021115.2331563-5-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Richard Chang <richardycc@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Tue, 17 Jun 2025 02:11:11 +0000 (22:11 -0400)]
mm/page_alloc: add support for initializing pageblock as isolated
MIGRATE_ISOLATE is a standalone bit, so a pageblock cannot be initialized
to just MIGRATE_ISOLATE. Add init_pageblock_migratetype() to allow
initializing a pageblock with a migratetype and as isolated.
Link: https://lkml.kernel.org/r/20250617021115.2331563-4-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Richard Chang <richardycc@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Tue, 17 Jun 2025 02:11:10 +0000 (22:11 -0400)]
mm/page_isolation: make page isolation a standalone bit
During page isolation, the original migratetype is overwritten, since
MIGRATE_* are enums and stored in pageblock bitmaps. Change
MIGRATE_ISOLATE to be stored as a standalone bit, PB_migrate_isolate, like
PB_compact_skip, so that migratetype is not lost during pageblock
isolation.
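The difference between the two schemes can be illustrated with a small Python model of the per-pageblock flags. All numeric values and bit positions below are illustrative assumptions, not the kernel's definitions.

```python
# Illustrative values only; the kernel's real definitions may differ.
MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RECLAIMABLE = 0, 1, 2
OLD_MIGRATE_ISOLATE = 5          # old scheme: just another migratetype value

MIGRATETYPE_MASK = 0b0111        # bits that hold the migratetype
PB_COMPACT_SKIP = 1 << 3         # an existing standalone bit
PB_MIGRATE_ISOLATE = 1 << 4      # the new standalone isolation bit


def migratetype(flags):
    return flags & MIGRATETYPE_MASK


def old_isolate(flags):
    """Old scheme: isolating overwrites the migratetype bits."""
    return (flags & ~MIGRATETYPE_MASK) | OLD_MIGRATE_ISOLATE


def new_isolate(flags):
    """New scheme: isolating only sets a separate bit."""
    return flags | PB_MIGRATE_ISOLATE


def new_unisolate(flags):
    return flags & ~PB_MIGRATE_ISOLATE
```

In the old scheme the original type has to be remembered elsewhere to undo an isolation; in the new one, clearing the bit is enough.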
Link: https://lkml.kernel.org/r/20250617021115.2331563-3-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Richard Chang <richardycc@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Tue, 17 Jun 2025 02:11:09 +0000 (22:11 -0400)]
mm/page_alloc: pageblock flags functions clean up
Patch series "Make MIGRATE_ISOLATE a standalone bit", v10.
This patchset moves MIGRATE_ISOLATE to a standalone bit to avoid being
overwritten during pageblock isolation process. Currently,
MIGRATE_ISOLATE is part of enum migratetype (in include/linux/mmzone.h),
thus, setting a pageblock to MIGRATE_ISOLATE overwrites its original
migratetype. This causes pageblock migratetype loss during
alloc_contig_range() and memory offline, especially when the process fails
due to a failed pageblock isolation and the code tries to undo the
finished pageblock isolations.
In terms of performance for changing pageblock types, no performance
change is observed:
1. I used perf to collect stats of offlining and onlining all memory
of a 40GB VM 10 times and see that get_pfnblock_flags_mask() and
set_pfnblock_flags_mask() take about 0.12% and 0.02% of the whole
process respectively with and without this patchset across 3 runs.
2. I used perf to collect stats of dd from /dev/random to a 40GB tmpfs
file and find get_pfnblock_flags_mask() takes about 0.05% of the
process with and without this patchset across 3 runs.
This patch (of 6):
No functional change is intended.
1. Add __NR_PAGEBLOCK_BITS for the number of pageblock flag bits and use
roundup_pow_of_two(__NR_PAGEBLOCK_BITS) as NR_PAGEBLOCK_BITS to take
right amount of bits for pageblock flags.
2. Rename PB_migrate_skip to PB_compact_skip.
3. Add {get,set,clear}_pfnblock_bit() to operate on a standalone bit,
like PB_compact_skip.
4. Make {get,set}_pfnblock_flags_mask() internal functions and use
{get,set}_pfnblock_migratetype() for pageblock migratetype operations.
5. Move pageblock flags common code to get_pfnblock_bitmap_bitidx().
6. Use MIGRATETYPE_MASK to get the migratetype of a pageblock from its
flags.
7. Use PB_migrate_end in the definition of MIGRATETYPE_MASK instead of
PB_migrate_bits.
8. Add a comment on is_migrate_cma_folio() to prevent one from changing it
to use get_pageblock_migratetype() and causing issues.
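The bit accounting in the list above can be sketched in Python. This is a model under assumptions: the bit positions, the packing of all pageblocks into one integer bitmap, and the Python function bodies are hypothetical; only the names mirror the helpers mentioned above. Rounding up to a power of two presumably keeps one pageblock's bits from straddling a word boundary.

```python
def roundup_pow_of_two(n):
    """Smallest power of two that is >= n (n must be at least 1)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return 1 << (n - 1).bit_length()


# Hypothetical bit layout for one pageblock.
PB_MIGRATE = 0
PB_MIGRATE_END = 2
PB_COMPACT_SKIP = 3
NR_USED_BITS = 4                                  # stands in for __NR_PAGEBLOCK_BITS
NR_PAGEBLOCK_BITS = roundup_pow_of_two(NR_USED_BITS)
MIGRATETYPE_MASK = (1 << (PB_MIGRATE_END + 1)) - 1


def bit_index(block, bit):
    """Position of a flag bit for a given pageblock in the packed bitmap."""
    return block * NR_PAGEBLOCK_BITS + bit


def set_pfnblock_bit(bitmap, block, bit):
    return bitmap | (1 << bit_index(block, bit))


def clear_pfnblock_bit(bitmap, block, bit):
    return bitmap & ~(1 << bit_index(block, bit))


def get_pfnblock_bit(bitmap, block, bit):
    return bool((bitmap >> bit_index(block, bit)) & 1)


def get_pfnblock_migratetype(bitmap, block):
    return (bitmap >> bit_index(block, PB_MIGRATE)) & MIGRATETYPE_MASK
```

Standalone bits are touched one at a time, while the migratetype is read through the mask, so the two kinds of accessors never disturb each other's bits.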
Link: https://lkml.kernel.org/r/20250617021115.2331563-1-ziy@nvidia.com Link: https://lkml.kernel.org/r/20250617021115.2331563-2-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Richard Chang <richardycc@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 16 Jun 2025 13:51:53 +0000 (15:51 +0200)]
mm,page_ext: derive the node from the pfn
page_ext is the only user of 'status_change_nid', which is set in
online/offline operations, to know to which node we are adding/removing
memory.
Prior to calling any notifiers, the memmap is initialized via, which among
other things, sets the node the pages belong to, to all corresponding
pages. This means that there is no need to keep using 'status_change_nid'
since we can derive the node from the pfn. This will allow us to finally
drop 'status_change_nid' from the memory_notify struct.
Link: https://lkml.kernel.org/r/20250616135158.450136-11-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 16 Jun 2025 13:51:52 +0000 (15:51 +0200)]
mm,mempolicy: use node-notifier instead of memory-notifier
mempolicy is only concerned when a numa node changes its memory state,
because it needs to take this node into account for the auto-weighted
memory policy system. So stop using the memory notifier and use the new
numa node notifier instead.
Link: https://lkml.kernel.org/r/20250616135158.450136-10-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Rakie Kim <rakie.kim@sk.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Gregory Price <gourry@gourry.net> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 16 Jun 2025 13:51:51 +0000 (15:51 +0200)]
kernel,cpuset: use node-notifier instead of memory-notifier
cpuset is only concerned when a numa node changes its memory state, as it
needs to know the current numa nodes with memory to keep an updated
mems_allowed mask. So stop using the memory notifier and use the new numa
node notifier instead.
Link: https://lkml.kernel.org/r/20250616135158.450136-9-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 16 Jun 2025 13:51:50 +0000 (15:51 +0200)]
drivers,hmat: use node-notifier instead of memory-notifier
hmat driver is only concerned when a numa node changes its memory state,
specifically when a numa node with memory comes into play for the first
time, because it will register the memory_targets belonging to that numa
node. So stop using the memory notifier and use the new numa node notifier
instead.
Link: https://lkml.kernel.org/r/20250616135158.450136-8-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 16 Jun 2025 13:51:49 +0000 (15:51 +0200)]
drivers,cxl: use node-notifier instead of memory-notifier
memory-tier is only concerned when a numa node changes its memory state,
specifically when a numa node with memory comes into play for the first
time, because it needs to get its performance attributes to build a proper
demotion chain. So stop using the memory notifier and use the new numa
node notifier instead.
Link: https://lkml.kernel.org/r/20250616135158.450136-7-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 16 Jun 2025 13:51:48 +0000 (15:51 +0200)]
mm,memory-tiers: use node-notifier instead of memory-notifier
memory-tier is only concerned when a numa node changes its memory state,
because it then needs to re-create the demotion list. So stop using the
memory notifier and use the new numa node notifier instead.
Link: https://lkml.kernel.org/r/20250616135158.450136-6-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 16 Jun 2025 13:51:46 +0000 (15:51 +0200)]
mm,memory_hotplug: implement numa node notifier
There are at least six consumers of hotplug_memory_notifier that are really
only interested in whether a numa node changed its state, e.g. going from
having memory to not having memory and vice versa.
Implement a specific notifier for numa nodes when their state gets
changed, which will later be used by those consumers that are only
interested in numa node state changes.
Add documentation as well.
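To illustrate the distinction, here is a minimal user-space model. It is not the kernel API, and every name in it is hypothetical. Each online/offline operation counts as a memory event, but the node-state callback fires only when a node goes from memoryless to having memory, or back:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel symbols. */
#define MODEL_MAX_NODES 4

enum model_node_event { MODEL_NODE_ADDED_MEMORY, MODEL_NODE_REMOVED_MEMORY };

typedef void (*model_node_cb)(int nid, enum model_node_event ev, void *arg);

struct model_hotplug {
	unsigned long pages[MODEL_MAX_NODES]; /* online pages per node */
	unsigned long memory_events;          /* every online/offline op */
	model_node_cb node_cb;                /* node-state consumer */
	void *node_cb_arg;
};

static void model_online_pages(struct model_hotplug *hp, int nid,
			       unsigned long nr)
{
	int was_memoryless;

	assert(nid >= 0 && nid < MODEL_MAX_NODES);
	was_memoryless = (hp->pages[nid] == 0);
	hp->pages[nid] += nr;
	hp->memory_events++;
	/* Notify only on the memoryless -> has-memory transition. */
	if (was_memoryless && nr > 0 && hp->node_cb)
		hp->node_cb(nid, MODEL_NODE_ADDED_MEMORY, hp->node_cb_arg);
}

static void model_offline_pages(struct model_hotplug *hp, int nid,
				unsigned long nr)
{
	assert(nid >= 0 && nid < MODEL_MAX_NODES);
	assert(nr <= hp->pages[nid]);
	hp->pages[nid] -= nr;
	hp->memory_events++;
	/* Notify only on the has-memory -> memoryless transition. */
	if (nr > 0 && hp->pages[nid] == 0 && hp->node_cb)
		hp->node_cb(nid, MODEL_NODE_REMOVED_MEMORY, hp->node_cb_arg);
}

/* Example consumer: counts node state transitions. */
static void model_count_transitions(int nid, enum model_node_event ev,
				    void *arg)
{
	(void)nid;
	(void)ev;
	(*(unsigned long *)arg)++;
}
```

In this model, onlining two blocks on an empty node and then offlining both produces four memory events but only two node-state notifications.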
[dan.carpenter@linaro.org: set failure reason in offline_pages()] Link: https://lkml.kernel.org/r/be4fd31b-7d09-46b0-8329-6d0464ffa7a5@sabinyo.mountain Link: https://lkml.kernel.org/r/20250616135158.450136-4-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 16 Jun 2025 13:51:44 +0000 (15:51 +0200)]
mm,slub: do not special case N_NORMAL nodes for slab_nodes
Patch series "Implement numa node notifier", v7.
Memory notifier is a tool that allows consumers to get notified whenever
memory gets onlined or offlined in the system. Currently, there are 10
consumers of that, but 5 out of those 10 consumers are only interested in
getting notifications when a numa node changes its memory state. That
means going from memoryless to memory-aware or vice versa.
Which means that for every {online,offline}_pages operation they get
notified even though the numa node might not have changed its state. This
is suboptimal, and we want to decouple numa node state changes from memory
state changes.
While we are doing this, remove status_change_nid_normal, as the only
current user (slub) does not really need it. This allows us to further
simplify and clean up the code.
The first patch gets rid of status_change_nid_normal in slub. The second
patch implements a numa node notifier that does just that, and has those
consumers register in there, so they get notified only when they are
interested.
The third patch replaces 'status_change_nid{_normal}' fields within
memory_notify with a 'nid', as that is all we need for the memory
notifier, and updates the only user of it (page_ext).
Consumers that are only interested in numa node states change are:
Currently, slab_mem_going_online_callback() checks whether the node has
N_NORMAL memory in order to be set in slab_nodes. While it is true that
dropping that check would mean ending up with movable nodes in
slab_nodes, the memory waste that comes with that is negligible.
So stop checking for status_change_nid_normal and just use
status_change_nid instead which works for both types of memory.
Also, once we allocate the kmem_cache_node cache for the node in
slab_mem_online_callback(), we never deallocate it in
slab_mem_offline_callback() when the node goes memoryless, so we can just
get rid of it.
The side effects are that we will stop clearing the node from slab_nodes,
and also that newly created kmem caches after node hotremove will now
allocate their kmem_cache_node for the node(s) that was hotremoved, but
these should be negligible.
Link: https://lkml.kernel.org/r/20250616135158.450136-1-osalvador@suse.de Link: https://lkml.kernel.org/r/20250616135158.450136-2-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Tue, 24 Jun 2025 13:03:48 +0000 (15:03 +0200)]
mm, madvise: use standard madvise locking in madvise_set_anon_name()
Use madvise_lock()/madvise_unlock() in madvise_set_anon_name() in the same
way as in do_madvise(). This narrows the lock scope a bit and reuses
existing functionality. get_lock_mode() already picks the correct
MADVISE_MMAP_WRITE_LOCK mode for __MADV_SET_ANON_VMA_NAME so we can just
remove the explicit assignment.
There is a user visible change in that the prctl(PR_SET_VMA,
PR_SET_VMA_ANON_NAME...) might now return -EINTR.
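For userspace callers, the practical consequence is that EINTR joins the set of errors to expect from this prctl(). The sketch below is a hypothetical helper (name_anon_region() is not part of any library) that simply retries on EINTR. Whether a retry is ever observable depends on how the lock acquisition is interrupted, which the commit message does not spell out, so treat the loop as a conservative pattern and not as required handling:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/prctl.h>

/* Fallback definitions for older headers; values match the Linux UAPI. */
#ifndef PR_SET_VMA
#define PR_SET_VMA 0x53564d41
#endif
#ifndef PR_SET_VMA_ANON_NAME
#define PR_SET_VMA_ANON_NAME 0
#endif

/*
 * Hypothetical helper: name an anonymous mapping, retrying if the call is
 * interrupted. Returns 0 on success or -1 with errno set (for example
 * EINVAL when the kernel was built without anonymous VMA name support).
 */
static int name_anon_region(void *addr, size_t len, const char *name)
{
	int ret;

	do {
		ret = prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
			    (unsigned long)addr, (unsigned long)len,
			    (unsigned long)name);
	} while (ret == -1 && errno == EINTR);

	return ret;
}
```

A caller would map an anonymous region with mmap() and pass its address and length; any error other than EINTR is reported to the caller unchanged.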
Link: https://lkml.kernel.org/r/20250624-anon_name_cleanup-v2-4-600075462a11@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Colin Cross <ccross@google.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Tue, 24 Jun 2025 13:03:46 +0000 (15:03 +0200)]
mm, madvise: extract mm code from prctl_set_vma() to mm/madvise.c
Setting anon_name is done via madvise_set_anon_name() and behaves a lot
like other madvise operations. However, apparently because madvise() lacks
a 4th argument and prctl() does not, the userspace entry point has
been implemented via prctl(PR_SET_VMA, ...) and handled first by
prctl_set_vma().
Currently prctl_set_vma() lives in kernel/sys.c but setting the
vma->anon_name is mm-specific code so extract it to a new
set_anon_vma_name() function under mm. mm/madvise.c seems to be the most
straightforward place as that's where madvise_set_anon_name() lives. Stop
declaring the latter in mm.h and instead declare set_anon_vma_name().
Link: https://lkml.kernel.org/r/20250624-anon_name_cleanup-v2-2-600075462a11@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Colin Cross <ccross@google.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Tue, 24 Jun 2025 13:03:45 +0000 (15:03 +0200)]
mm, madvise: simplify anon_name handling
Patch series "madvise anon_name cleanups", v2.
While reviewing Lorenzo's madvise cleanups I've noticed that we can handle
anon_name in madvise code much better, so sending that as patch 1.
Initially I wanted to first move the existing logic from
madvise_vma_behavior() to madvise_update_vma() as a separate patch before
the actual simplification but that would require adding
anon_vma_name_put() in error handling paths only to be removed again, so
it's a single patch to avoid churn.
It's also an opportunity to move some mm code from prctl under mm, hence
patch 2. After code moving preparation in patch 3, also unify madvise
lock handling for madvise_set_anon_name() in patch 4.
This patch (of 4):
Since the introduction in 9a10064f5625 ("mm: add a field to store names
for private anonymous memory") the code to set anon_name on a vma has been
using madvise_update_vma() to call replace_anon_vma_name(). Since the
former is called also by a number of other madvise behaviours that do not
set a new anon_name, they have been passing the existing anon_name of the
vma to make replace_anon_vma_name() a no-op.
This is rather wasteful as it needs anon_vma_name_eq() to determine the
no-op situations, and checks for when replace_anon_vma_name() is allowed
(the vma is anon/shmem) duplicate the checks already done earlier in
madvise_vma_behavior(). It has also led to commit 942341dcc574 ("mm: fix
use-after-free when anon vma name is used after vma is freed") adding
anon_name refcount get/put operations exactly to the cases that actually
do not change anon_name - just so the replace_anon_vma_name() can keep
safely determining it has nothing to do.
The recent madvise cleanups made this suboptimal handling very obvious,
but happily also allow for an easy fix. madvise_update_vma() now has the
complete information whether it's been called to set a new anon_name, so
stop passing it the existing vma's name and doing the refcount get/put in
its only caller madvise_vma_behavior().
In madvise_update_vma() itself, limit calling of replace_anon_vma_name()
only to cases where we are setting a new name, otherwise we know it's a
no-op. We can rely solely on the __MADV_SET_ANON_VMA_NAME behaviour and
can remove the duplicate checks for vma being anon/shmem that were done
already in madvise_vma_behavior().
Additionally, by using vma_modify_flags() when not modifying the
anon_name, avoid explicitly passing the existing vma's anon_name and
storing a pointer to it in struct madv_behavior or a local variable. This
prevents the danger of accessing a freed anon_name after vma merging,
previously fixed by commit 942341dcc574.
Link: https://lkml.kernel.org/r/20250624-anon_name_cleanup-v2-0-600075462a11@suse.cz Link: https://lkml.kernel.org/r/20250624-anon_name_cleanup-v2-1-600075462a11@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Colin Cross <ccross@google.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 20 Jun 2025 15:33:05 +0000 (16:33 +0100)]
mm/madvise: eliminate very confusing manipulation of prev VMA
The madvise code has for the longest time had very confusing code around
the 'prev' VMA pointer passed around various functions which, in all cases
except madvise_update_vma(), is unused and instead simply updated as soon
as the function is invoked.
To compound the confusion, the prev pointer is also used to indicate to
the caller that the mmap lock has been dropped and that we can therefore
not safely access the end of the current VMA (which might have been
updated by madvise_update_vma()).
Clear up this confusion by not setting prev = vma anywhere except in
madvise_walk_vmas(), update all references to prev which will always be
equal to vma after madvise_vma_behavior() is invoked, and adding a flag to
indicate that the lock has been dropped to make this explicit.
Additionally, drop a redundant BUG_ON() from madvise_collapse(), which is
simply reiterating the BUG_ON(mmap_locked) above it (note that BUG_ON() is
not appropriate here, but we leave existing code as-is).
We finally adjust the madvise_walk_vmas() logic to be a little clearer -
delaying the assignment of the end of the range to the start of the new
range until the last moment and handling the lock being dropped scenario
immediately.
Additionally add some explanatory comments.
[lorenzo.stoakes@oracle.com: fix very subtle bug] Link: https://lkml.kernel.org/r/dca94cde-8afb-4eab-8e57-3f508624d670@lucifer.local Link: https://lkml.kernel.org/r/63d281c5df2e64225ab5b4bda398b45e22818701.1750433500.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 20 Jun 2025 15:33:04 +0000 (16:33 +0100)]
mm/madvise: thread all madvise state through madv_behavior
Doing so means we can get rid of all the weird struct vm_area_struct
**prev stuff, everything becomes consistent and in future if we want to
make changes to behaviour there's a single place where all relevant state
is stored.
This also allows us to update try_vma_read_lock() to be a little more
succinct and set up state for us, as well as cleaning up
madvise_update_vma().
We also update the debug assertion prior to madvise_update_vma() to assert
that this is a write operation as correctly pointed out by Barry in the
relevant thread.
We can't reasonably update the madvise functions that live outside of
mm/madvise.c so we leave those as-is.
Link: https://lkml.kernel.org/r/7b345ab82ef51e551f8bc0c4f7be25712871629d.1750433500.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 20 Jun 2025 15:33:03 +0000 (16:33 +0100)]
mm/madvise: thread VMA range state through madvise_behavior
Rather than updating start and a confusing local parameter 'tmp' in
madvise_walk_vmas(), instead store the current range being operated upon
in the struct madvise_behavior helper object in a range pair and use this
consistently in all operations.
This makes it clearer what is going on and opens the door to further
cleanup now we store state regarding what is currently being operated upon
here.
Link: https://lkml.kernel.org/r/518480ceb48553d3c280bc2b0b5e77bbad840147.1750433500.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 20 Jun 2025 15:33:02 +0000 (16:33 +0100)]
mm/madvise: thread mm_struct through madvise_behavior
There's no need to thread a pointer to the mm_struct nor have different
function signatures for each behaviour; instead store state in the struct
madvise_behavior object consistently and use it for all madvise() actions.
Link: https://lkml.kernel.org/r/a47d850b0111735e026d438c3300c0e3b7f439f4.1750433500.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 20 Jun 2025 15:33:01 +0000 (16:33 +0100)]
mm/madvise: remove the visitor pattern and thread anon_vma state
Patch series "madvise cleanup", v2.
This is a series of patches that helps address a number of historic
problems in the madvise() implementation:
* Eliminate the visitor pattern and have the code which is implemented
for both the anon_vma_name implementation and ordinary madvise()
operations use the same madvise_vma_behavior() implementation.
* Thread state through the madvise_behavior state object - this object,
very usefully introduced by SJ, is already used to transmit state
through operations. This series extends this by having all madvise()
operations use this, including anon_vma_name.
* Thread range, VMA state through madvise_behavior - This helps avoid a
lot of the confusing code around range and VMA state and again keeps
things consistent and with a single 'source of truth'.
* Addressing the very strange behaviour around the passed around struct
vm_area_struct **prev pointer - all read-only users do absolutely
nothing with the prev pointer. The only function that uses it is
madvise_update_vma(), and in all cases prev is always reset to VMA.
Fix this by no longer having anything but madvise_update_vma()
reference prev, and having madvise_walk_vmas() update prev in each
instance. Additionally make it clear that the meaningful change in vma
state is when madvise_update_vma() potentially merges a VMA, so
explicitly retrieve the VMA in this case.
* Update and clarify the madvise_walk_vmas() function - this is a source
of a great deal of confusion, so simplify, stop using prev = NULL to
signify that the mmap lock has been dropped (!) and make that explicit,
and add some comments to explain what's going on.
This patch (of 5):
Now we have the madvise_behavior helper struct we no longer need to mess
around with void* pointers in order to propagate anon_vma_name, and this
means we can get rid of the confusing and inconsistent visitor pattern
implementation in madvise_vma_anon_name().
This means we now have a single state object that threads through most of
madvise()'s logic and a single code path which executes the majority of
madvise() behaviour (we maintain separate logic for failure injection and
memory population for the time being).
We are able to remove the visitor pattern by handling the anon_vma_name
setting logic via an internal madvise flag - __MADV_SET_ANON_VMA_NAME.
This uses a negative value so it isn't reasonable that we will ever add
this as a UAPI flag.
Additionally, the madvise_behavior_valid() check ensures that
user-specified behaviours are strictly only those we permit which, of
course, this flag will be excluded from.
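The user-visible side of this can be observed from userspace: an advice value that madvise() does not recognise is rejected with EINVAL. The sketch below is a hypothetical helper (madvise_errno_for() is an illustrative name) that reports the errno for a given advice value. The value -1 used in the usage note is just an arbitrary negative number and is not claimed to be the kernel's internal constant:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Returns the errno produced by madvise() for the given advice value on a
 * freshly mapped anonymous region, 0 if the call succeeded, or -1 if the
 * mapping itself could not be created.
 */
static int madvise_errno_for(int advice)
{
	size_t len = 4096;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int err = 0;

	if (p == MAP_FAILED)
		return -1;
	if (madvise(p, len, advice) == -1)
		err = errno;
	munmap(p, len);
	return err;
}
```

On Linux, madvise_errno_for(-1) is expected to yield EINVAL, while a regular value such as MADV_NORMAL succeeds.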
We are able to propagate the anon_vma_name object through use of the
madvise_behavior helper struct.
Doing this results in a can_modify_vma_madv() check for anonymous VMA name
changes, however this will cause no issues as this operation is not
prohibited.
We can also then reuse more code and drop the redundant
madvise_vma_anon_name() function altogether.
Additionally separate out behaviours that update VMAs from those that do
not.
Link: https://lkml.kernel.org/r/cover.1750433500.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/c5094bfccb41ecd19d4e9bcaa1c4a11e00158bba.1750433500.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Li Wang [Tue, 24 Jun 2025 03:27:48 +0000 (11:27 +0800)]
ksm_tests: skip hugepage test when Transparent Hugepages are disabled
Some systems (e.g. minimal or real-time kernels) may not enable
Transparent Hugepages (THP), causing MADV_HUGEPAGE to return EINVAL. This
patch introduces a runtime check using the existing THP sysfs interface
and skips the hugepage merging test (`-H`) when THP is not available.
# --------------------
# running ./khugepaged
# --------------------
# Reading PMD pagesize failed# [FAIL]
not ok 1 khugepaged # exit=1
# --------------------
# running ./soft-dirty
# --------------------
# TAP version 13
# 1..15
# ok 1 Test test_simple
# ok 2 Test test_vma_reuse dirty bit of allocated page
# ok 3 Test test_vma_reuse dirty bit of reused address page
# Bail out! Reading PMD pagesize failed# Planned tests != run tests (15 != 3)
# # Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0
# [FAIL]
not ok 1 soft-dirty # exit=1
# SUMMARY: PASS=0 SKIP=0 FAIL=1
# -------------------
# running ./migration
# -------------------
# TAP version 13
# 1..3
# # Starting 3 tests from 1 test cases.
# # RUN migration.private_anon ...
# # OK migration.private_anon
# ok 1 migration.private_anon
# # RUN migration.shared_anon ...
# # OK migration.shared_anon
# ok 2 migration.shared_anon
# # RUN migration.private_anon_thp ...
# # migration.c:196:private_anon_thp:Expected madvise(ptr, TWOMEG, MADV_HUGEPAGE) (-1) == 0 (0)
# # private_anon_thp: Test terminated by assertion
# # FAIL migration.private_anon_thp
# not ok 3 migration.private_anon_thp
# # FAILED: 2 / 3 tests passed.
# # Totals: pass:2 fail:1 xfail:0 xpass:0 skip:0 error:0
# [FAIL]
not ok 1 migration # exit=1
It's true that CONFIG_TRANSPARENT_HUGEPAGE=y is explicitly enabled in
tools/testing/selftests/mm/config, so ideally the runtime environment
should also support THP.
However, in practice, we've found that on some systems:
- THP is disabled at boot time (transparent_hugepage=never)
- Or manually disabled via sysfs
- Or unavailable in RT kernels, containers, or minimal CI environments
In these cases, the test will fail with EINVAL on madvise(MADV_HUGEPAGE),
even though the kernel config is correct.
To make the test suite more robust and avoid false negatives, this patch
adds a runtime check for /sys/kernel/mm/transparent_hugepage/enabled.
If THP is not available, the hugepage test (-H) is skipped with a clear
message.
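A runtime check of this kind could look roughly like the sketch below. It is not the code from the patch; the helper names are hypothetical, and it assumes the usual sysfs format in which all modes are listed and the active one is bracketed, e.g. "always [madvise] never":

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define THP_ENABLED_PATH "/sys/kernel/mm/transparent_hugepage/enabled"

/*
 * Returns 1 if the bracketed (active) mode is anything other than "never",
 * 0 otherwise, including when no bracketed mode is found.
 */
static int thp_mode_allows_hugepages(const char *contents)
{
	const char *lb = strchr(contents, '[');
	const char *rb = lb ? strchr(lb, ']') : NULL;
	size_t len;

	if (!lb || !rb)
		return 0;
	len = (size_t)(rb - lb - 1);
	return !(len == strlen("never") && strncmp(lb + 1, "never", len) == 0);
}

/* Returns 1 if THP looks usable, 0 if the file is missing or says "never". */
static int thp_is_available(void)
{
	char buf[128];
	FILE *f = fopen(THP_ENABLED_PATH, "r");
	int ok = 0;

	if (!f)
		return 0; /* no sysfs knob: treat THP as unavailable */
	if (fgets(buf, sizeof(buf), f))
		ok = thp_mode_allows_hugepages(buf);
	fclose(f);
	return ok;
}
```

A test would call thp_is_available() before the hugepage case and report a skip when it returns 0. Separating the string parsing from the file access keeps the parsing logic testable without touching sysfs.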
Link: https://lkml.kernel.org/r/20250624032748.393836-1-liwang@redhat.com Signed-off-by: Li Wang <liwang@redhat.com> Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joey Gouly <joey.gouly@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Keith Lucas <keith.lucas@oracle.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Li Wang [Tue, 24 Jun 2025 04:24:11 +0000 (12:24 +0800)]
selftests/mm: fix UFFDIO_API usage with proper two-step feature negotiation
The current implementation of test_unmerge_uffd_wp() explicitly sets
`uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP` before calling
UFFDIO_API. This can cause the ioctl() call to fail with EINVAL on
kernels that do not support UFFD-WP, leading the test to fail
unnecessarily:
# ------------------------------
# running ./ksm_functional_tests
# ------------------------------
# TAP version 13
# 1..9
# # [RUN] test_unmerge
# ok 1 Pages were unmerged
# # [RUN] test_unmerge_zero_pages
# ok 2 KSM zero pages were unmerged
# # [RUN] test_unmerge_discarded
# ok 3 Pages were unmerged
# # [RUN] test_unmerge_uffd_wp
# not ok 4 UFFDIO_API failed <-----
# # [RUN] test_prot_none
# ok 5 Pages were unmerged
# # [RUN] test_prctl
# ok 6 Setting/clearing PR_SET_MEMORY_MERGE works
# # [RUN] test_prctl_fork
# # No pages got merged
# # [RUN] test_prctl_fork_exec
# ok 7 PR_SET_MEMORY_MERGE value is inherited
# # [RUN] test_prctl_unmerge
# ok 8 Pages were unmerged
# Bail out! 1 out of 8 tests failed
# # Planned tests != run tests (9 != 8)
# # Totals: pass:7 fail:1 xfail:0 xpass:0 skip:0 error:0
# [FAIL]
This patch improves compatibility and robustness of the UFFD-WP test
(test_unmerge_uffd_wp) by correctly implementing the UFFDIO_API two-step
handshake as recommended by the userfaultfd(2) man page.
Key changes:
1. Use features=0 in the initial UFFDIO_API call to query supported
feature bits, rather than immediately requesting WP support.
2. Skip the test gracefully if:
- UFFDIO_API fails with EINVAL (e.g. unsupported API version), or
- UFFD_FEATURE_PAGEFAULT_FLAG_WP is not advertised by the kernel.
3. Close the initial userfaultfd and create a new one before enabling
the required feature, since UFFDIO_API can only be called once per fd.
4. Improve diagnostics by distinguishing between expected and unexpected
failures, using strerror() to report errors.
This ensures the test behaves correctly across a wider range of kernel
versions and configurations, while preserving the intended behavior on
kernels that support UFFD-WP.
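The two-step handshake described above can be sketched as follows. This is an illustration and not the selftest's code: open_uffd_wp() and the uffd_wp_probe enum are hypothetical names, and the block assumes kernel headers recent enough to define UFFD_FEATURE_PAGEFAULT_FLAG_WP:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

enum uffd_wp_probe { UFFD_WP_READY = 0, UFFD_WP_SKIP = 1, UFFD_WP_FAIL = -1 };

/* On UFFD_WP_READY, *out_fd holds a userfaultfd with WP enabled. */
static enum uffd_wp_probe open_uffd_wp(int *out_fd)
{
	struct uffdio_api api;
	int fd, err;

	*out_fd = -1;

	/* Step 1: features = 0 only queries what the kernel supports. */
	fd = (int)syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (fd < 0)
		return UFFD_WP_FAIL;
	memset(&api, 0, sizeof(api));
	api.api = UFFD_API;
	api.features = 0;
	if (ioctl(fd, UFFDIO_API, &api) < 0) {
		err = errno;
		close(fd);
		return err == EINVAL ? UFFD_WP_SKIP : UFFD_WP_FAIL;
	}
	/* UFFDIO_API can only be called once per fd, so drop this one. */
	close(fd);
	if (!(api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP))
		return UFFD_WP_SKIP;

	/* Step 2: fresh fd, now request the feature we know is there. */
	fd = (int)syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (fd < 0)
		return UFFD_WP_FAIL;
	memset(&api, 0, sizeof(api));
	api.api = UFFD_API;
	api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
	if (ioctl(fd, UFFDIO_API, &api) < 0) {
		close(fd);
		return UFFD_WP_FAIL;
	}
	*out_fd = fd;
	return UFFD_WP_READY;
}
```

The three-way result lets a caller distinguish "skip, feature not available" from "unexpected failure", which is the distinction the diagnostics change is after. A failing userfaultfd syscall is reported as a failure and not as a skip.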
[liwang@redhat.com: fail the test if sys_userfaultfd() fails, per David] Link: https://lkml.kernel.org/r/20250625004645.400520-1-liwang@redhat.com Link: https://lkml.kernel.org/r/20250624042411.395285-1-liwang@redhat.com Signed-off-by: Li Wang <liwang@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Joey Gouly <joey.gouly@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Keith Lucas <keith.lucas@oracle.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Li Wang <liwang@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Tue, 24 Jun 2025 15:48:23 +0000 (11:48 -0400)]
maple_tree: add testing for restoring maple state to active
Restoring maple status to ma_active on overflow/underflow when mas->node
was NULL could have happened in the past, but was masked by a bug in
mas_walk(). Add test cases that triggered the bug when the node was
mas->node prior to fixing the maple state setup.
Add a few extra tests around restoring the active maple status.
Liam R. Howlett [Tue, 24 Jun 2025 15:48:22 +0000 (11:48 -0400)]
maple_tree: fix status setup on restore to active
During the initial call with a maple state, an error status may be set
before a valid node is populated into the maple state node. Subsequent
calls with the maple state may restore the state into an active state with
no node set. This was masked by mas_walk() always resetting the
status to ma_start, which resulted in an extra walk in this rare scenario.
Don't restore the state to active unless there is a value in the struct's
node. This also allows mas_walk() to be fixed to use the active state
without exposing an issue.
User visible results are marginal performance improvements when an active
state can be restored and used instead of rewalking the tree.
Stable is not Cc'ed because the existing code is stable and the
performance gains are not worth the risk.
copy_to_kernel_nofault() is an internal helper which should not be visible
to loadable modules: exporting it would give exploit code a cheap
oracle to probe kernel addresses. Instead, keep the helper un-exported
and compile the kunit case that exercises it only when
mm/kasan/kasan_test.o is linked into vmlinux.
[snovitoll@gmail.com: add a brief comment to `#ifndef MODULE`] Link: https://lkml.kernel.org/r/20250622141142.79332-1-snovitoll@gmail.com Link: https://lkml.kernel.org/r/20250622051906.67374-1-snovitoll@gmail.com Fixes: ca79a00bb9a8 ("kasan: migrate copy_user_test to kunit") Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com> Suggested-by: Christoph Hellwig <hch@infradead.org> Suggested-by: Marco Elver <elver@google.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
lib/test_vmalloc.c: restrict default test mask to avoid test warnings
When the vmalloc test is built into the kernel, it runs automatically
during boot. The current default "run_test_mask" includes all test
cases, including those which are designed to fail and which trigger kernel
warnings.
These kernel splats can be misinterpreted as actual kernel bugs, leading
to false alarms and unnecessary reports.
To address this, limit the default test mask to only the first few tests
which are expected to pass cleanly. These tests are safe and should not
generate any warnings unless there is a real bug.
Users who wish to explicitly run specific test cases have to pass the
run_test_mask as a boot parameter or at module load time.
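The idea of a restricted default mask can be modelled in a few lines. Everything below is hypothetical: the test names, the number of tests, and the default value 0x7 are illustrative assumptions, not values taken from lib/test_vmalloc.c:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model; test names and the default mask are illustrative. */
struct model_test {
	const char *name;
	int expected_to_warn; /* designed to fail and trigger a warning */
};

static const struct model_test model_tests[] = {
	{ "small_alloc",     0 },
	{ "random_size",     0 },
	{ "aligned_alloc",   0 },
	{ "huge_alloc_fail", 1 },
	{ "bad_align_fail",  1 },
};

#define MODEL_NR_TESTS (sizeof(model_tests) / sizeof(model_tests[0]))
#define MODEL_DEFAULT_MASK 0x7u /* bits 0..2: the first three tests */

/* Counts how many selected tests would be expected to emit a warning. */
static size_t model_count_warning_tests(unsigned int mask)
{
	size_t i, n = 0;

	for (i = 0; i < MODEL_NR_TESTS; i++)
		if ((mask & (1u << i)) && model_tests[i].expected_to_warn)
			n++;
	return n;
}
```

With the restricted default, no warning-producing test is selected. Passing a wider mask explicitly opts in to the tests that are designed to fail.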
Link: https://lkml.kernel.org/r/20250623184035.581229-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: David Wang <00107082@163.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
lib/test_vmalloc.c: use late_initcall() if built-in for init ordering
When the vmalloc test code is compiled as a built-in, use late_initcall()
instead of module_init() to defer vmalloc test execution until most
subsystems are up and running.
This avoids interfering with components that may not yet be initialized at
module_init() time. For example, there was a recent report of the memory
profiling infrastructure not being ready early enough, leading to a kernel
crash.
By using late_initcall() in the built-in case, we ensure the tests are run
at a safer point during the boot sequence.
Link: https://lkml.kernel.org/r/20250623184035.581229-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: David Wang <00107082@163.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
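The ordering argument in the late_initcall() change above can be modelled
in userspace C. The sketch assumes a simplified boot that runs callbacks
level by level, and within a level in registration order. The level
numbers follow the kernel convention that a built-in module_init() runs at
the device initcall level, one level before late_initcall(). All demo_*
names and the dependency itself are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { LEVEL_DEVICE = 6, LEVEL_LATE = 7 };

struct initcall {
	int level;
	void (*fn)(void);
};

static int subsystem_ready;
static int test_saw_ready;

/* Stand-in for a subsystem the test depends on. */
static void demo_subsystem_init(void) { subsystem_ready = 1; }

/* Stand-in for the test: records whether its dependency was ready. */
static void demo_test_init(void) { test_saw_ready = subsystem_ready; }

/* Run callbacks level by level, in table order within one level. */
static void run_initcalls(const struct initcall *calls, size_t n)
{
	for (int level = 0; level <= LEVEL_LATE; level++)
		for (size_t i = 0; i < n; i++)
			if (calls[i].level == level)
				calls[i].fn();
}

/* Boot once with the test registered at test_level; return what it saw. */
static int demo_boot(int test_level)
{
	struct initcall calls[] = {
		{ test_level, demo_test_init },	/* registered first */
		{ LEVEL_DEVICE, demo_subsystem_init },
	};

	subsystem_ready = 0;
	test_saw_ready = -1;
	run_initcalls(calls, 2);
	return test_saw_ready;
}
```

In this model demo_boot(LEVEL_DEVICE) lets the test run before its
dependency, while demo_boot(LEVEL_LATE) guarantees the dependency has
already been initialized.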
SeongJae Park [Sun, 22 Jun 2025 21:37:59 +0000 (14:37 -0700)]
mm/damon/sysfs: decouple from damon_ops_id
Decouple the DAMON sysfs interface from damon_ops_id. For this, define and
use a new mm/damon/sysfs.c internal data structure that maps between the
user-space keywords and damon_ops_id, instead of relying on the implicit
and inflexible array index rule.
SeongJae Park [Sun, 22 Jun 2025 21:37:58 +0000 (14:37 -0700)]
mm/damon/sysfs-schemes: decouple from damos_filter_type
Decouple the DAMOS sysfs interface from damos_filter_type. For this, define
and use a new sysfs-schemes internal data structure that maps between the
user-space keywords and damos_filter_type, instead of relying on the
implicit and inflexible array index rule.
SeongJae Park [Sun, 22 Jun 2025 21:37:57 +0000 (14:37 -0700)]
mm/damon/sysfs-schemes: decouple from damos_wmark_metric
Decouple the DAMOS sysfs interface from damos_wmark_metric. For this, define
and use a new sysfs-schemes internal data structure that maps between the
user-space keywords and damos_wmark_metric, instead of relying on the
implicit and inflexible array index rule.
SeongJae Park [Sun, 22 Jun 2025 21:37:56 +0000 (14:37 -0700)]
mm/damon/sysfs-schemes: decouple from damos_action
Decouple the DAMOS sysfs interface from damos_action. For this, define and
use a new sysfs-schemes internal data structure that maps between the
user-space keywords and damos_action, instead of relying on the implicit
and inflexible array index rule.
[akpm@linux-foundation.org: make damos_sysfs_action_names static] Closes: https://lore.kernel.org/oe-kbuild-all/202506271655.b8yfEZIT-lkp@intel.com/ Link: https://lkml.kernel.org/r/20250622213759.50930-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sun, 22 Jun 2025 21:37:55 +0000 (14:37 -0700)]
mm/damon/sysfs-schemes: decouple from damos_quota_goal_metric
Patch series "mm/damon: decouple sysfs from core".
The DAMON sysfs interface is coupled with the core layer. It keeps some of
its keyword arrays synchronized with the matching DAMON core API enums.
This is unnecessary coupling that makes separate changes to the different
layers difficult. Decouple the layers by introducing a new data structure
for the mappings in the DAMON sysfs interface.
This patch (of 5):
Decouple the DAMOS sysfs interface from damos_quota_goal_metric. For this,
define and use a new sysfs-schemes internal data structure that maps
between the user-space keywords and damos_quota_goal_metric, instead of
relying on the implicit and inflexible array index rule.
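The kind of mapping structure described in this series can be sketched in
userspace C as follows. This is a minimal illustration, not the code in
mm/damon/sysfs*.c: the enum, the struct and the helper names are all
hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for a core-layer enum; names and values are illustrative. */
enum demo_ops_id { DEMO_OPS_VADDR, DEMO_OPS_FVADDR, DEMO_OPS_PADDR };

/* One explicit keyword <-> enum pair, owned by the interface layer. */
struct demo_ops_name {
	enum demo_ops_id id;
	const char *name;
};

/* Table order is free: nothing depends on index == enum value. */
static const struct demo_ops_name demo_ops_names[] = {
	{ DEMO_OPS_PADDR, "paddr" },
	{ DEMO_OPS_VADDR, "vaddr" },
};

#define DEMO_NR_NAMES (sizeof(demo_ops_names) / sizeof(demo_ops_names[0]))

/* Translate a user keyword; returns 0 on success, -1 if unknown. */
static int demo_name_to_id(const char *name, enum demo_ops_id *id)
{
	size_t i;

	for (i = 0; i < DEMO_NR_NAMES; i++) {
		if (!strcmp(demo_ops_names[i].name, name)) {
			*id = demo_ops_names[i].id;
			return 0;
		}
	}
	return -1;
}

/* Translate an enum value back to its keyword, or NULL if not exposed. */
static const char *demo_id_to_name(enum demo_ops_id id)
{
	size_t i;

	for (i = 0; i < DEMO_NR_NAMES; i++)
		if (demo_ops_names[i].id == id)
			return demo_ops_names[i].name;
	return NULL;
}
```

Because each entry names its enum value explicitly, the core enum can be
reordered or extended without touching the keyword table, and the
interface can choose not to expose a value (DEMO_OPS_FVADDR here) without
leaving a hole in an index-based array.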
mm/ptdump: take the memory hotplug lock inside ptdump_walk_pgd()
Memory hot remove unmaps and tears down various kernel page table regions
as required. The ptdump code can race with concurrent modifications of
the kernel page tables. When leaf entries are modified concurrently, the
dump code may log stale or inconsistent information for a VA range, but
this is otherwise not harmful.
But when intermediate levels of kernel page table are freed, the dump code
will continue to use memory that has been freed and potentially
reallocated for another purpose. In such cases, the ptdump code may
dereference bogus addresses, leading to a number of potential problems.
To avoid the above-mentioned race condition, platforms such as arm64,
riscv and s390 take the memory hotplug lock while dumping the kernel page
tables via the debugfs interface /sys/kernel/debug/kernel_page_tables.
A similar race condition exists while checking for pages that might have
been marked W+X via /sys/kernel/debug/kernel_page_tables/check_wx_pages,
which in turn calls ptdump_check_wx(). Instead of solving this race
condition again, let's just move the memory hotplug lock inside the generic
ptdump_walk_pgd(), which will benefit both scenarios.
Drop the get_online_mems() and put_online_mems() pairs from all existing
platform ptdump code paths.
Link: https://lkml.kernel.org/r/20250620052427.2092093-1-anshuman.khandual@arm.com Fixes: bbd6ec605c0f ("arm64/mm: Enable memory hot remove") Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
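The shape of the ptdump change above, taking the lock once in the shared
walker instead of in every caller, can be sketched in userspace C. The
lock is modelled as a simple depth counter so the invariant can be
checked. All demo_* names are hypothetical stand-ins, and the real
get_online_mems()/put_online_mems() are not called here.

```c
#include <assert.h>

static int demo_hotplug_lock_depth;	/* models the memory hotplug lock */
static int demo_unlocked_walks;		/* walks done without the lock */

static void demo_get_online_mems(void) { demo_hotplug_lock_depth++; }
static void demo_put_online_mems(void) { demo_hotplug_lock_depth--; }

/* Models the page table walk itself; records if it ran unprotected. */
static void demo_walk_tables(void)
{
	if (demo_hotplug_lock_depth == 0)
		demo_unlocked_walks++;
}

/* Generic walker: takes and releases the lock around the walk. */
static void demo_ptdump_walk_pgd(void)
{
	demo_get_online_mems();
	demo_walk_tables();
	demo_put_online_mems();
}

/* Two entry points; neither needs its own locking any more. */
static void demo_dump_page_tables(void) { demo_ptdump_walk_pgd(); }
static void demo_check_wx(void) { demo_ptdump_walk_pgd(); }
```

Calling demo_dump_page_tables() and demo_check_wx() both leave
demo_unlocked_walks at zero, which is the point of placing the lock in the
common path: a new caller cannot forget it.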
Peter Xu [Fri, 20 Jun 2025 15:00:58 +0000 (11:00 -0400)]
selftests/mm: reduce uffd-unit-test poison test to minimum
The test still generates quite a few unwanted MCE error messages in
syslog. There was an old proposal to ratelimit the MCE messages from the
kernel, but that risks hiding useful information on production systems.
We can at least reduce the test to a minimum so that it does not
over-pollute dmesg, while trying not to lose too much of its coverage.
Dev Jain [Tue, 24 Jun 2025 08:07:48 +0000 (13:37 +0530)]
maple tree: use goto label to simplify code
Use the underflow goto label to set the status to ma_underflow and return
NULL, as is being done elsewhere.
[akpm@linux-foundation.org: add newline, per Liam (and remove one, per akpm)] Link: https://lkml.kernel.org/r/20250624080748.4855-1-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alistair Popple [Thu, 19 Jun 2025 08:58:04 +0000 (18:58 +1000)]
mm: remove PFN_DEV, PFN_MAP, PFN_SPECIAL, PFN_SG_CHAIN and PFN_SG_LAST
The PFN_MAP flag is no longer used for anything, so remove it. The
PFN_SG_CHAIN and PFN_SG_LAST flags never appear to have been used so also
remove them. The last user of PFN_SPECIAL was removed by 653d7825c149
("dcssblk: mark DAX broken, remove FS_DAX_LIMITED support").
Users of PFN_DEV were removed earlier in this series by "mm: Remove
remaining uses of PFN_DEV".
Link: https://lkml.kernel.org/r/670b3950d70b4d97b905bb597dadfd3633de4314.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alistair Popple [Thu, 19 Jun 2025 08:58:03 +0000 (18:58 +1000)]
mm: remove devmap related functions and page table bits
Now that DAX and all other reference counts to ZONE_DEVICE pages are
managed normally, there is no need for the special devmap PTE/PMD/PUD page
table bits. So drop all references to these, freeing up a
software-defined page table bit on architectures supporting it.
Link: https://lkml.kernel.org/r/6389398c32cc9daa3dfcaa9f79c7972525d310ce.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: Will Deacon <will@kernel.org> # arm64 Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chunyan Zhang <zhang.lyra@gmail.com> Reviewed-by: Björn Töpel <bjorn@rivosinc.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alistair Popple [Thu, 19 Jun 2025 08:58:02 +0000 (18:58 +1000)]
fs/dax: remove FS_DAX_LIMITED config option
The dcssblk driver was the last user of FS_DAX_LIMITED. That was marked
broken by 653d7825c149 ("dcssblk: mark DAX broken, remove FS_DAX_LIMITED
support") to allow removal of PFN_SPECIAL. However the FS_DAX_LIMITED
config option itself was not removed, so do that now.
Link: https://lkml.kernel.org/r/b47bf164b4a1013d736fa1a3d501bc8b8e71311f.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alistair Popple [Thu, 19 Jun 2025 08:58:01 +0000 (18:58 +1000)]
powerpc: remove checks for devmap pages and PMDs/PUDs
PFN_DEV no longer exists. This means no devmap PMDs or PUDs will be
created, so checking for them is redundant. Instead, mappings of pages
that would previously have returned true for pXd_devmap() will return true
for pXd_trans_huge().
Link: https://lkml.kernel.org/r/31f63cc8dd518f9e2ec300681fe302eb4adf49b4.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The pmd_devmap() check in check_pmd_state() is redundant because the only
users of pmd_devmap were device dax and fs dax, and all callers of
check_pmd_state() first call thp_vma_allowable_order() to check if the vma
should be scanned. Except when called from a page fault, this always
returns 0 for dax vmas, hence we would never expect to see a pmd_devmap
entry.
Link: https://lkml.kernel.org/r/a68175fd3a37e9b72cc82c1d63fd8b69691a85b5.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>