powerpc/ptdump: rename "struct pgtable_level" to "struct ptdump_pg_level"
We want to make use of "pgtable_level" for an enum in core-mm. Other
architectures seem to call "struct pgtable_level" either:
* "struct pg_level" when not exposed in a header (riscv, arm)
* "struct ptdump_pg_level" when expose in a header (arm64)
So let's follow what arm64 does.
Link: https://lkml.kernel.org/r/20250811112631.759341-7-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Juegren Gross <jgross@suse.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/huge_memory: mark PMD mappings of the huge zero folio special
The huge zero folio is refcounted (and mapcounted) differently than "normal"
folios, similar to (but different from) the ordinary shared zeropage.
For this reason, we special-case these pages in
vm_normal_page*/vm_normal_folio*, and only allow selected callers to still
use them (e.g., GUP can still take a reference on them).
vm_normal_page_pmd() already filters out the huge zero folio, to indicate
that it is special (returning NULL). However, so far we are not making use of
pmd_special() on architectures that support it
(CONFIG_ARCH_HAS_PTE_SPECIAL), like we would with the ordinary shared
zeropage.
Let's similarly mark PMD mappings of the huge zero folio as special, so we
can avoid the manual check for the huge zero folio with
CONFIG_ARCH_HAS_PTE_SPECIAL next, and only perform the check on
!CONFIG_ARCH_HAS_PTE_SPECIAL.
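As a rough illustration, the intended check placement looks like this (a sketch of the logic described above, not the exact upstream code):

/* Sketch: how vm_normal_page_pmd() can treat the huge zero folio. */
#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
	if (unlikely(pmd_special(pmd)))
		return NULL;	/* the huge zero folio is now marked special, too */
#else
	if (is_huge_zero_pmd(pmd))
		return NULL;	/* manual check where pmd_special() is unavailable */
#endif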
In copy_huge_pmd(), where we have a manual pmd_special() check to handle
PFNMAP, we have to manually rule out the huge zero folio. That code needs
a serious cleanup, but that's something for another day.
While at it, update the doc regarding the shared zero folios.
No functional change intended: vm_normal_page_pmd() still returns NULL
when it encounters the huge zero folio.
Link: https://lkml.kernel.org/r/20250811112631.759341-6-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Juegren Gross <jgross@suse.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio
Let's convert to vmf_insert_folio_pmd().
There is a theoretical change in behavior: in the unlikely case there is
already something mapped, we'll now still call trace_dax_pmd_load_hole()
and return VM_FAULT_NOPAGE.
Previously, we would have returned VM_FAULT_FALLBACK, and the caller would
have zapped the PMD to try a PTE fault.
However, that behavior differed from other PTE+PMD faults when something was
already mapped, and it's not even clear whether that case can be triggered.
Assuming the huge zero folio is already mapped: all good, no need to fall
back to PTEs.
Assuming there is already a leaf page table: the behavior would be just like
when trying to insert a PMD mapping a folio through
dax_fault_iter()->vmf_insert_folio_pmd().
Assuming there is already something else mapped as a PMD: that sounds like a
BUG, and the behavior would again be just like when trying to insert a PMD
mapping a folio through dax_fault_iter()->vmf_insert_folio_pmd().
So it sounds reasonable not to handle the huge zero folio differently from
inserting PMDs mapping folios when something is already mapped.
Link: https://lkml.kernel.org/r/20250811112631.759341-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Juegren Gross <jgross@suse.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd()
Just like we do for vmf_insert_page_mkwrite() -> ... ->
insert_page_into_pte_locked() with the shared zeropage, support the huge
zero folio in vmf_insert_folio_pmd().
When (un)mapping the huge zero folio in page tables, we neither adjust the
refcount nor the mapcount, just like for the shared zeropage.
For now, the huge zero folio is not marked as special yet, although
vm_normal_page_pmd() really wants to treat it as special. We'll change
that next.
Link: https://lkml.kernel.org/r/20250811112631.759341-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Juegren Gross <jgross@suse.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/huge_memory: move more common code into insert_pud()
Let's clean it all further up.
No functional change intended.
Link: https://lkml.kernel.org/r/20250811112631.759341-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Juegren Gross <jgross@suse.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/huge_memory: move more common code into insert_pmd()
Patch series "mm: vm_normal_page*() improvements", v3.
Clean up and unify vm_normal_page_*() handling, also marking the huge
zero folio as special in the PMD. Add and use vm_normal_page_pud(), and clean
up the XEN vm_ops->find_special_page mechanism.
There are plans to use vm_normal_page_*() more widely soon.
This patch (of 11):
Let's clean it all further up.
No functional change intended.
Link: https://lkml.kernel.org/r/20250811112631.759341-1-david@redhat.com Link: https://lkml.kernel.org/r/20250811112631.759341-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Juegren Gross <jgross@suse.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
At this point MIGRATEPAGE_SUCCESS is misnamed for all folio users,
and now that we have removed MIGRATEPAGE_UNMAP, it's really the only "success"
return value that the code uses and expects.
Let's just get rid of MIGRATEPAGE_SUCCESS completely and just use "0"
for success.
Link: https://lkml.kernel.org/r/20250811143949.1117439-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> [mm] Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com> [jfs] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Byungchul Park <byungchul@sk.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Eugenio Pé rez <eperezma@redhat.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Lance Yang <lance.yang@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
migrate_folio_unmap() is the only user of MIGRATEPAGE_UNMAP. We want to
remove MIGRATEPAGE_* completely.
It's rather weird to have a generic MIGRATEPAGE_UNMAP, documented to be
returned from address-space callbacks, when it's only used for an internal
helper.
Let's start by having only a single "success" return value for
migrate_folio_unmap() -- 0 -- by moving the "folio was already freed"
check into the single caller.
There is a remaining comment for PG_isolated, which we renamed to
PG_movable_ops_isolated recently and forgot to update.
While we might still run into that case with zsmalloc, it's something we
want to get rid of soon. So let's just focus that optimization on real
folios only for now by excluding movable_ops pages. Note that concurrent
freeing can happen at any time and this "already freed" check is not
relevant for correctness.
[david@redhat.com: no need to pass "reason" to migrate_folio_unmap(), per Lance] Link: https://lkml.kernel.org/r/3bb725f8-28d7-4aa2-b75f-af40d5cab280@redhat.com Link: https://lkml.kernel.org/r/20250811143949.1117439-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Byungchul Park <byungchul@sk.com> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Eugenio Pé rez <eperezma@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Mon, 11 Aug 2025 17:20:18 +0000 (01:20 +0800)]
mm/mincore: use a helper for checking the swap cache
Introduce a mincore_swap() helper for checking swap entries. Move all
swap-related logic and the sanity debug check into it, and separate them from
the page cache checking.
The performance is better after this commit. mincore_page() is never called
on a swap cache space now, so the logic can be simpler. The sanity check
also covers more potential cases now: previously, the WARN_ON only caught a
potentially corrupted page table; now, if shmem contains a swap entry with
!CONFIG_SWAP, a WARN will be triggered. This changes the mincore value
when the WARN is triggered, but that shouldn't matter: the WARN_ON means
the data is already corrupted or something is very wrong, so it really
should not happen.
Before this series:
mincore on a swapped-out 16G anon mmap range:
Took 488220 us
mincore on 16G shmem mmap range:
Took 530272 us.
After this commit:
mincore on a swapped-out 16G anon mmap range:
Took 446763 us
mincore on 16G shmem mmap range:
Took 460496 us.
About 10% faster.
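For reference, a minimal user-space sketch of this kind of measurement (timing mincore() over a populated anonymous mapping; the 16G size and the swap-out step used above are assumed to be arranged separately):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
	size_t len = 1UL << 30;			/* 1G here; the test above used 16G */
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *vec = malloc(len / page);
	struct timeval a, b;
	char *buf;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED || !vec)
		return 1;
	memset(buf, 1, len);			/* populate; swapping out is not shown */

	gettimeofday(&a, NULL);
	if (mincore(buf, len, vec))
		perror("mincore");
	gettimeofday(&b, NULL);
	printf("Took %ld us\n",
	       (b.tv_sec - a.tv_sec) * 1000000L + (b.tv_usec - a.tv_usec));
	free(vec);
	munmap(buf, len);
	return 0;
}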
Link: https://lkml.kernel.org/r/20250811172018.48901-3-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Mon, 11 Aug 2025 17:20:17 +0000 (01:20 +0800)]
mm/mincore, swap: consolidate swap cache checking for mincore
Patch series "mm/mincore: minor clean up for swap cache checking".
This series cleans up a swap cache helper that is only used by mincore and
moves it back into the mincore code. It also separates the swap cache related
logic from the shmem / page cache logic in mincore.
With this series we have fewer lines of code and better performance.
Before this series:
mincore on a swapped-out 16G anon mmap range:
Took 488220 us
mincore on 16G shmem mmap range:
Took 530272 us.
After this series:
mincore on a swapped-out 16G anon mmap range:
Took 446763 us
mincore on 16G shmem mmap range:
Took 460496 us.
About 10% faster.
This patch (of 2):
The filemap_get_incore_folio (previously find_get_incore_page) helper was
introduced by commit 61ef18655704 ("mm: factor find_get_incore_page out of
mincore_page") to be used by later commit f5df8635c5a3 ("mm: use
find_get_incore_page in memcontrol"), so memory cgroup charge move code
can be simplified.
But commit 6b611388b626 ("memcg-v1: remove charge move code") removed that
user completely; it's only used by mincore now.
So this commit basically reverts commit 61ef18655704 ("mm: factor
find_get_incore_page out of mincore_page"). Move it back to the mincore side
to simplify the code.
Link: https://lkml.kernel.org/r/20250811172018.48901-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20250811172018.48901-2-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sang-Heon Jeon [Tue, 5 Aug 2025 12:39:40 +0000 (21:39 +0900)]
mm/damon: update expired description of damos_action
Nowadays, DAMOS operation actions support a larger set of operations, but the
comments (and thus the generated documentation) weren't updated. Fix the
comments to reflect the current support status.
fs/proc/task_mmu: execute PROCMAP_QUERY ioctl under per-vma locks
Utilize per-vma locks to stabilize the vma after lookup without taking
mmap_lock during PROCMAP_QUERY ioctl execution. If the vma lock is
contended, we fall back to mmap_lock, but take it only momentarily
to lock the vma and then release the mmap_lock. In the very unlikely case
of vm_refcnt overflow, this fallback path will fail and the ioctl is
done under mmap_lock protection.
This change is designed to reduce mmap_lock contention and prevent
PROCMAP_QUERY ioctl calls from blocking address space updates.
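A minimal user-space sketch of such an ioctl call (assuming the PROCMAP_QUERY uAPI from <linux/fs.h>; only a few struct procmap_query fields are shown here and the rest are left zeroed):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* PROCMAP_QUERY, struct procmap_query */

int main(void)
{
	struct procmap_query q;
	int fd = open("/proc/self/maps", O_RDONLY);

	if (fd < 0)
		return 1;
	memset(&q, 0, sizeof(q));
	q.size = sizeof(q);
	q.query_addr = (unsigned long)&q;	/* look up the VMA covering this address */
	if (ioctl(fd, PROCMAP_QUERY, &q) == 0)
		printf("vma: %llx-%llx\n",
		       (unsigned long long)q.vma_start,
		       (unsigned long long)q.vma_end);
	else
		perror("PROCMAP_QUERY");
	close(fd);
	return 0;
}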
Link: https://lkml.kernel.org/r/20250808152850.2580887-4-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: T.J. Mercier <tjmercier@google.com> Cc: Ye Bin <yebin10@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
fs/proc/task_mmu: factor out proc_maps_private fields used by PROCMAP_QUERY
Refactor struct proc_maps_private so that the fields used by PROCMAP_QUERY
ioctl are moved into a separate structure. In the next patch this allows
the ioctl to reuse some of the functions used for reading /proc/pid/maps
without using file->private_data. This prevents concurrent modification
of file->private_data members by ioctl and /proc/pid/maps readers.
The change is pure code refactoring and has no functional changes.
Link: https://lkml.kernel.org/r/20250808152850.2580887-3-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: T.J. Mercier <tjmercier@google.com> Cc: Ye Bin <yebin10@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified
Patch series " execute PROCMAP_QUERY ioctl under per-vma lock", v4.
With /proc/pid/maps now being read under per-vma lock protection we can
reuse parts of that code to execute PROCMAP_QUERY ioctl also without
taking mmap_lock. The change is designed to reduce mmap_lock contention
and prevent PROCMAP_QUERY ioctl calls from blocking address space updates.
This patchset was split out of the original patchset [1] that introduced
per-vma lock usage for /proc/pid/maps reading. It contains PROCMAP_QUERY
tests, a code refactoring patch to simplify the main change, and the actual
transition to per-vma locks.
This patch (of 3):
Extend /proc/pid/maps tearing tests to verify PROCMAP_QUERY ioctl operation
correctness while the vma is being concurrently modified.
Link: https://lkml.kernel.org/r/20250808152850.2580887-1-surenb@google.com Link: https://lkml.kernel.org/r/20250808152850.2580887-2-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: SeongJae Park <sj@kernel.org> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: T.J. Mercier <tjmercier@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Ye Bin <yebin10@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yueyang Pan [Sat, 2 Aug 2025 11:52:46 +0000 (11:52 +0000)]
mm/damon/vaddr: support stat-purpose DAMOS filters
This patch extends DAMOS_STAT handling of the DAMON operations sets for
virtual address spaces for ops-level DAMOS filters. It leverages
walk_page_range() to walk the page tables and gets the folio from the page table.
The last folio scanned is stored in damos->last_applied to prevent double
counting.
Yueyang Pan [Sat, 2 Aug 2025 11:52:45 +0000 (11:52 +0000)]
mm/damon/paddr: move filters existence check function to ops-common
Patch series "mm/damon/vaddr: support stat-purpose DAMOS filters", v4.
Extend DAMOS_STAT handling of the DAMON operations sets for virtual
address spaces for ops-level DAMOS filters.
Functionality Test
==================
I wrote a small test program which allocates 10GB of DRAM and uses
madvise(MADV_HUGEPAGE) to convert the base pages to 2MB huge pages. The
program then does the following things in order:
1. Write sequentially to the whole 10GB region
2. Read the first 5GB region sequentially for 10 times
3. Sleep 5s
4. Read the second 5GB region sequentially for 10 times
With a proper DAMON setting, we expect to see df-passed reach 10GB and the
hot region move around with the reads.
As you can see, the total df-passed region is 10GiB and the hot region
moves as the sequential read keeps going.
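A minimal sketch of such a test program (the 10GB size and access pattern follow the description above; the DAMON setup itself is not shown):

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t total = 10UL << 30, half = total / 2;
	volatile char sink = 0;
	size_t off;
	int i;
	char *buf = mmap(NULL, total, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;
	madvise(buf, total, MADV_HUGEPAGE);	/* ask for 2MB huge pages */

	memset(buf, 1, total);			/* 1. write the whole 10GB */
	for (i = 0; i < 10; i++)		/* 2. read the first 5GB, 10 times */
		for (off = 0; off < half; off += 4096)
			sink += buf[off];
	sleep(5);				/* 3. sleep 5s */
	for (i = 0; i < 10; i++)		/* 4. read the second 5GB, 10 times */
		for (off = half; off < total; off += 4096)
			sink += buf[off];
	return 0;
}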
This patch (of 2):
This patch moves damon_pa_scheme_has_filter() to ops-common, renaming it to
damos_ops_has_filter(). Doing so allows us to reuse its logic in the vaddr
version of DAMOS_STAT.
Bijan Tabatabai [Wed, 6 Aug 2025 23:42:54 +0000 (18:42 -0500)]
mm/damon/core: skip needless update of damon_attrs in damon_commit_ctx()
Currently, damon_commit_ctx() always calls damon_set_attrs() even if the
attributes have not been changed. This can be problematic when the DAMON
state is committed relatively frequently because damon_set_attrs() resets
ctx->next_{aggregation,ops_update}_sis, causing aggregation and ops update
operations to be needlessly delayed.
This patch avoids this by only calling damon_set_attrs() in
damon_commit_ctx() when the attributes have been changed.
Wei Yang [Mon, 4 Aug 2025 06:41:06 +0000 (06:41 +0000)]
mm/rmap: do __folio_mod_stat() in __folio_add_rmap()
It is required to modify folio statistics after rmap changes, so it looks
reasonable to do it in __folio_add_rmap(), which is the current behavior
of __folio_remove_rmap() and folio_add_new_anon_rmap().
Call __folio_mod_stat() in __folio_add_rmap(), so that rmap adjustment
family shares the same pattern.
Link: https://lkml.kernel.org/r/20250804064106.21269-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Rik van Riel <riel@surriel.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Qianfeng Rong [Mon, 4 Aug 2025 13:00:17 +0000 (21:00 +0800)]
xarray: remove redundant __GFP_NOWARN
Commit 16f5dfbc851b ("gfp: include __GFP_NOWARN in GFP_NOWAIT") made
GFP_NOWAIT implicitly include __GFP_NOWARN.
Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT (e.g.,
`GFP_NOWAIT | __GFP_NOWARN`) is now redundant. Let's clean up these
redundant flags across subsystems.
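As a hedged illustration of the pattern being removed (xa_store() is just an example call site, not necessarily one of the actual sites touched):

/* Before: __GFP_NOWARN spelled out explicitly. */
err = xa_err(xa_store(&xa, index, entry, GFP_NOWAIT | __GFP_NOWARN));

/* After: GFP_NOWAIT already implies __GFP_NOWARN since commit 16f5dfbc851b. */
err = xa_err(xa_store(&xa, index, entry, GFP_NOWAIT));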
Vitaly Wool [Wed, 6 Aug 2025 12:55:52 +0000 (14:55 +0200)]
rust: support large alignments in allocations
Add support for large (> PAGE_SIZE) alignments in Rust allocators. All
the preparations on the C side are already done; we just need to add
bindings for the <alloc>_node_align() functions and start using those.
Link: https://lkml.kernel.org/r/20250806125552.1727073-1-vitaly.wool@konsulko.se Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.se> Acked-by: Danilo Krummrich <dakr@kernel.org> Acked-by: Alice Ryhl <aliceryhl@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jann Horn <jannh@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vitaly Wool [Wed, 6 Aug 2025 12:55:22 +0000 (14:55 +0200)]
rust: add support for NUMA ids in allocations
Add a new type to support specifying NUMA identifiers in Rust allocators
and extend the allocators to have NUMA id as a parameter. Thus, modify
ReallocFunc to use the new extended realloc primitives from the C side of
the kernel (i.e. k[v]realloc_node_align/vrealloc_node_align) and add the
new function alloc_node to the Allocator trait while keeping the existing
one (alloc) for backward compatibility.
This will allow specifying the node to use for the allocation of e.g. {KV}Box,
as well as for future NUMA-aware users of the API.
[ojeda@kernel.org: fix missing import needed for `rusttest`] Link: https://lkml.kernel.org/r/20250816210214.2729269-1-ojeda@kernel.org Link: https://lkml.kernel.org/r/20250806125522.1726992-1-vitaly.wool@konsulko.se Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.se> Signed-off-by: Miguel Ojeda <ojeda@kernel.org> Acked-by: Danilo Krummrich <dakr@kernel.org> Acked-by: Alice Ryhl <aliceryhl@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jann Horn <jannh@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Miguel Ojeda <ojeda@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vitaly Wool [Wed, 6 Aug 2025 12:41:47 +0000 (14:41 +0200)]
mm/slub: allow to set node and align in k[v]realloc
Reimplement k[v]realloc_node() to be able to set node and alignment should
a user need to do so. In order to do that while retaining maximal
backward compatibility, add k[v]realloc_node_align() functions and
redefine the rest of the API using these new ones.
While doing that, we also keep the number of _noprof variants to a
minimum, which implies some changes to the existing users of the older _noprof
functions, basically bcachefs.
With that change we also provide the ability for the Rust part of the
kernel to set node and alignment in its K[v]xxx [re]allocations.
Link: https://lkml.kernel.org/r/20250806124147.1724658-1-vitaly.wool@konsulko.se Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.se> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jann Horn <jannh@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vitaly Wool [Wed, 6 Aug 2025 12:41:08 +0000 (14:41 +0200)]
mm/vmalloc: allow to set node and align in vrealloc
Patch series "support large align and nid in Rust allocators", v15.
The series provides the ability for Rust allocators to set NUMA node and
large alignment.
This patch (of 4):
Reimplement vrealloc() to be able to set node and alignment should a user
need to do so. Rename the function to vrealloc_node_align() to better
match what it actually does now and introduce macros for vrealloc() and
friends for backward compatibility.
With that change we also provide the ability for the Rust part of the
kernel to set node and alignment in its allocations.
Link: https://lkml.kernel.org/r/20250806124034.1724515-1-vitaly.wool@konsulko.se Link: https://lkml.kernel.org/r/20250806124108.1724561-1-vitaly.wool@konsulko.se Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.se> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jann Horn <jannh@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: correct misleading comment on mmap_lock field in mm_struct
The comment previously described the offset of mmap_lock as 0x120 (hex),
which is misleading. The correct offset is 56 bytes (decimal) from the
last cache line boundary. Using '0x120' could confuse readers trying to
understand why the count and owner fields reside in separate cachelines.
This change also removes an unnecessary space for improved formatting.
Link: https://lkml.kernel.org/r/20250806145906.24647-1-adrianhuang0701@gmail.com Signed-off-by: Adrian Huang (Lenovo) <adrianhuang0701@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Replace typeof() with __auto_type in the swap() macro in uffd-stress.c.
__auto_type was introduced in GCC 4.9 and reduces the compile time for all
compilers. No functional changes intended.
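A self-contained sketch of the two macro variants (the macro names here are illustrative; the selftest's macro is simply called swap()):

#include <stdio.h>

/* typeof()-based variant, as used before the change. */
#define swap_typeof(a, b) \
	do { typeof(a) __tmp = (a); (a) = (b); (b) = __tmp; } while (0)

/* __auto_type-based variant (GCC >= 4.9, also supported by clang). */
#define swap_auto(a, b) \
	do { __auto_type __tmp = (a); (a) = (b); (b) = __tmp; } while (0)

int main(void)
{
	int x = 1, y = 2;

	swap_auto(x, y);
	printf("%d %d\n", x, y);	/* prints "2 1" */
	return 0;
}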
Kairui Song [Wed, 6 Aug 2025 16:17:48 +0000 (00:17 +0800)]
mm, swap: prefer nonfull over free clusters
We prefer a free cluster over a nonfull cluster whenever a CPU local
cluster is drained, to respect the SSD discard behavior [1]. That is not
ideal for non-discarding devices and causes a higher fragmentation rate.
So for a non-discarding device, prefer nonfull over free clusters. This
reduces the fragmentation issue by a lot.
Testing with make -j96, defconfig, using 64k mTHP, 8G ZRAM:
Kairui Song [Wed, 6 Aug 2025 16:17:47 +0000 (00:17 +0800)]
mm, swap: remove fragment clusters counter
It was used for calculating the number of iterations when the swap allocator
wants to scan the whole fragment list. Now the allocator only scans one
fragment cluster at a time, so no one uses this counter anymore.
Remove it as a cleanup; the performance change is marginal:
Build linux kernel using 10G ZRAM, make -j96, defconfig with 2G cgroup
memory limit, on top of tmpfs, 64kB mTHP enabled:
Link: https://lkml.kernel.org/r/20250806161748.76651-3-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Wed, 6 Aug 2025 16:17:46 +0000 (00:17 +0800)]
mm, swap: only scan one cluster in fragment list
Patch series "mm, swap: improve cluster scan strategy", v2.
This series improves the large allocation performance and reduces the
failure rate. Some aspects of the cluster allocator's design were later found
to be improvable after thorough testing.
The allocator spent too much effort scanning the fragment list, which is
not helpful in most setups, but causes serious contention of the list lock
(si->lock). Besides, the allocator prefers free clusters when searching
for a new cluster due to historical reasons, which causes fragmentation
issues.
So make the allocator only scan one cluster for high order allocations, and
prefer nonfull clusters. This both improves the performance and reduces
fragmentation.
For example, build kernel test with make -j96 and 10G ZRAM with 64kB mTHP
enabled shows better performance and a lower failure rate:
Fragment clusters were mostly failing high order allocations already. The
reason we currently scan the whole list is that a swap slot may get freed
without releasing the swap cache, so a swap map entry will end up in
HAS_CACHE-only status, and the cluster won't be moved back to the nonfull or
free cluster list. This may cause a higher allocation failure rate.
Usually only !SWP_SYNCHRONOUS_IO devices may have a large number of slots
stuck in HAS_CACHE-only status, because when such a device's usage is low
(!vm_swap_full()), the swap cache is only lazily freed.
But scanning the whole fragment list for this is a bit overkill. Fragmentation is
only an issue for the allocator when the device is getting full, and by
that time, swap will be releasing the swap cache aggressively already.
Only scanning one fragment cluster at a time is good enough to reclaim
already pinned slots, and move the cluster back to nonfull.
Besides, only high order allocations require iterating over the list;
order 0 allocations will succeed on the first attempt. And a high order
allocation failure isn't a serious problem.
So the gain from iterating over fragment clusters is trivial, but it slows
down large allocations by a lot when the fragment cluster list is long. It's
better to drop this fragment cluster iteration design.
Test on a 48c96t system, build linux kernel using 10G ZRAM, make -j48,
defconfig with 768M cgroup memory limit, on top of tmpfs, 4K folio only:
The performance is a lot better for large folios, and the large order
allocation failure rate is only very slightly higher or unchanged, even
for !SWP_SYNCHRONOUS_IO devices under high pressure.
Link: https://lkml.kernel.org/r/20250806161748.76651-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20250806161748.76651-2-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: change vma_start_read() to drop RCU lock on failure
vma_start_read() can drop and reacquire the RCU lock in certain failure cases.
It's not apparent that the RCU session started by the caller of this
function might be interrupted when vma_start_read() fails to lock the vma.
This might become a source of subtle bugs, and to prevent that we change
the locking rules for vma_start_read() to drop the RCU read lock upon failure.
This way it's more obvious that RCU-protected objects are unsafe after
vma locking fails.
Limit the scope of vma_start_read(), as it is used only as a helper for
higher-level locking functions implemented inside mmap_lock.c, and we are
about to introduce more complex RCU rules for this function. The change
is pure code refactoring and has no functional changes.
selftests/mm: pass filename as input param to VM_PFNMAP tests
Enable these tests to be run on other pfnmap'ed memory like NVIDIA's EGM.
Add '--' as a separator to pass in the file path. This allows passing command
line arguments to kselftest_harness. Use '/dev/mem' as the default filename.
Existing test passes:
pfnmap
TAP version 13
1..6
# Starting 6 tests from 1 test cases.
# PASSED: 6 / 6 tests passed.
# Totals: pass:6 fail:0 xfail:0 xpass:0 skip:0 error:0
Pass params to kselftest_harness:
pfnmap -r pfnmap:mremap_fixed
TAP version 13
1..1
# Starting 1 tests from 1 test cases.
# RUN pfnmap.mremap_fixed ...
# OK pfnmap.mremap_fixed
ok 1 pfnmap.mremap_fixed
# PASSED: 1 / 1 tests passed.
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
Pass non-existent file name as input:
pfnmap -- /dev/blah
TAP version 13
1..6
# Starting 6 tests from 1 test cases.
# RUN pfnmap.madvise_disallowed ...
# SKIP Cannot open '/dev/blah'
Pass non pfnmap'ed file as input:
pfnmap -r pfnmap.madvise_disallowed -- randfile.txt
TAP version 13
1..1
# Starting 1 tests from 1 test cases.
# RUN pfnmap.madvise_disallowed ...
# SKIP Invalid file: 'randfile.txt'. Not pfnmap'ed
Link: https://lkml.kernel.org/r/20250805013629.47629-1-sudarsanm@google.com Signed-off-by: Sudarsan Mahendran <sudarsanm@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
zram: protect recomp_algorithm_show() with ->init_lock
sysfs handlers should be called under ->init_lock and are not supposed to
unlock it until return, otherwise e.g. a concurrent reset() can occur.
There is one handler that breaks that rule: recomp_algorithm_show().
Move ->init_lock handling outside of __comp_algorithm_show() (also drop it
and call zcomp_available_show() directly) so that the entire
recomp_algorithm_show() loop is protected by the lock, as opposed to
protecting individual iterations.
The patch does not need to go to -stable, as it does not fix any
runtime errors (at least I can't think of any). It makes
recomp_algorithm_show() "atomic" w.r.t. zram reset() (just like the
rest of zram sysfs show() handlers), that's a pretty minor change.
/dev/zero: try to align PMD_SIZE for private mapping
Attempt to map private mappings aligned to the huge page size, which can
achieve performance gains. The average execution time of mprot_tw4m in
libMicro on arm64:
- Test case: mprot_tw4m
- Before the patch: 22 us
- After the patch: 17 us
If THP is not configured, we fall back to system page size mappings.
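A minimal user-space sketch to observe the alignment (the 2MB PMD size is an assumption for 4K-page x86-64/arm64):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 32UL << 20;	/* 32MB, spans several PMDs */
	int fd = open("/dev/zero", O_RDWR);
	void *p;

	if (fd < 0)
		return 1;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	/* With the patch (and THP enabled), this is expected to print 0. */
	printf("offset within 2MB: %#lx\n", (unsigned long)p & ((2UL << 20) - 1));
	munmap(p, len);
	close(fd);
	return 0;
}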
Link: https://lkml.kernel.org/r/20250731122305.2669090-1-zhangqilong3@huawei.com Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Goto-san reported confusing pgpromote statistics where the
pgpromote_success count significantly exceeded pgpromote_candidate.
On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB):
# Enable demotion only
echo 1 > /sys/kernel/mm/numa/demotion_enabled
numactl -m 0-1 memhog -r200 3500M >/dev/null &
pid=$!
sleep 2
numactl memhog -r100 2500M >/dev/null &
sleep 10
kill -9 $pid # terminate the 1st memhog
# Enable promotion
echo 2 > /proc/sys/kernel/numa_balancing
After a few seconds, we observed `pgpromote_candidate < pgpromote_success`
$ grep -e pgpromote /proc/vmstat
pgpromote_success 2579
pgpromote_candidate 0
In this scenario, after terminating the first memhog, the conditions for
pgdat_free_space_enough() are quickly met, which triggers promotion.
However, these migrated pages are only counted in PGPROMOTE_SUCCESS,
not in PGPROMOTE_CANDIDATE.
To resolve these confusing statistics, introduce PGPROMOTE_CANDIDATE_NRL to
count the missed promotion pages. These pages are deliberately not counted
in PGPROMOTE_CANDIDATE, to avoid changing the existing algorithm or the
performance of the promotion rate limit.
Link: https://lkml.kernel.org/r/20250901090122.124262-1-ruansy.fnst@fujitsu.com Link: https://lkml.kernel.org/r/20250729035101.1601407-1-ruansy.fnst@fujitsu.com Fixes: c6833e10008f ("memory tiering: rate limit NUMA migration throughput") Co-developed-by: Li Zhijian <lizhijian@fujitsu.com> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Signed-off-by: Ruan Shiyang <ruansy.fnst@fujitsu.com> Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com> Suggested-by: Huang Ying <ying.huang@linux.alibaba.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/mglru: update MG-LRU proactive reclaim statistics only to memcg
Users can use /sys/kernel/debug/lru_gen to trigger proactive memory
reclaim of a specified memcg. Currently, statistics such as pgrefill,
pgscan and pgsteal are accounted in the /proc/vmstat system-wide memory
statistics.
This will confuse some system memory pressure monitoring tools, making it
difficult to determine whether pgscan and pgsteal are caused by
system-level pressure or by proactive memory reclaim of some specific
memory cgroup.
Therefore, make this interface behave similarly to memory.reclaim: update
proactive memory reclaim statistics only in the targeted memory cgroup.
Link: https://lkml.kernel.org/r/20250717082845.34673-1-jiahao.kernel@gmail.com Signed-off-by: Hao Jia <jiahao1@lixiang.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kinsey Ho <kinseyho@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joshua Hahn [Tue, 5 Aug 2025 20:50:47 +0000 (13:50 -0700)]
mempolicy: clarify what zone reclaim means
The zone_reclaim_mode API controls the reclaim behavior when a node runs
out of memory. Contrary to its user-facing name, it is internally
referred to as "node_reclaim_mode".
This can be confusing. But because we cannot change the name of the API,
which has been in place since at least 2.6, let's try to be more
explicit about what the behavior of this API is.
Change the description to clarify what zone reclaim entails, and be
explicit about the RECLAIM_ZONE bit, whose purpose has led to some
confusion in the past already [1] [2].
While at it, also soften the warning about changing these bits.
Merge tag 'sound-6.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A collection of small changes including a few regression fixes:
- Regression fix for Intel SKL/KBL HD-audio bindings
- Regression fix for missing Nvidia HDMI codec entries after the
recent code reorganization
- A few TAS2781 codec regression fixes
- Fix for ASoC component lookup breakage
- Usual HD-audio, USB-audio and SOF quirk entries"
* tag 'sound-6.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda/hdmi: Add pin fix for another HP EliteDesk 800 G4 model
ALSA: usb-audio: Allow Focusrite devices to use low samplerates
ALSA: hda: tas2781: reorder tas2563 calibration variables
ALSA: hda: tas2781: fix tas2563 EFI data endianness
ALSA: firewire-motu: drop EPOLLOUT from poll return values as write is not supported
ALSA: docs: Add documents for recently changes in snd-usb-audio
ALSA: usb-audio: Add mute TLV for playback volumes on more devices
ASoC: SOF: Intel: WCL: Add the sdw_process_wakeen op
ALSA: hda: Avoid binding with SOF for SKL/KBL platforms
ASoC: rsnd: tidyup direction name on rsnd_dai_connect()
ALSA: hda/tas2781: Fix EFI name for calibration beginning with 1 instead of 0
ALSA: usb-audio: move mixer_quirks' min_mute into common quirk
ALSA: hda/realtek: Fix headset mic for TongFang X6[AF]R5xxY
ALSA: hda/hdmi: Restore missing HDMI codec entries
ASoC: codecs: idt821034: fix wrong log in idt821034_chip_direction_output()
ASoC: soc-core: tidyup snd_soc_lookup_component_nolocked()
ASoC: soc-core: care NULL dirver name on snd_soc_lookup_component_nolocked()
ALSA: hda: intel-dsp-config: Select SOF driver on MTL Chromebooks
ALSA: usb-audio: Add mute TLV for playback volumes on some devices
Merge tag 'mm-hotfixes-stable-2025-09-01-17-20' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"17 hotfixes. 13 are cc:stable and the remainder address post-6.16
issues or aren't considered necessary for -stable kernels. 11 of these
fixes are for MM.
This includes a three-patch series from Harry Yoo which fixes an
intermittent boot failure which can occur on x86 systems. And a
two-patch series from Alexander Gordeev which fixes a KASAN crash on
S390 systems"
* tag 'mm-hotfixes-stable-2025-09-01-17-20' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: fix possible deadlock in kmemleak
x86/mm/64: define ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings()
mm: introduce and use {pgd,p4d}_populate_kernel()
mm: move page table sync declarations to linux/pgtable.h
proc: fix missing pde_set_flags() for net proc files
mm: fix accounting of memmap pages
mm/damon/core: prevent unnecessary overflow in damos_set_effective_quota()
kexec: add KEXEC_FILE_NO_CMA as a legal flag
kasan: fix GCC mem-intrinsic prefix with sw tags
mm/kasan: avoid lazy MMU mode hazards
mm/kasan: fix vmalloc shadow memory (de-)population races
kunit: kasan_test: disable fortify string checker on kasan_strings() test
selftests/mm: fix FORCE_READ to read input value correctly
mm/userfaultfd: fix kmap_local LIFO ordering for CONFIG_HIGHPTE
ocfs2: prevent release journal inode after journal shutdown
rust: mm: mark VmaNew as transparent
of_numa: fix uninitialized memory nodes causing kernel panic
Merge tag 'for-6.17-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix a few races related to inode link count
- fix inode leak on failure to add link to inode
- move transaction aborts closer to where they happen
* tag 'for-6.17-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: avoid load/store tearing races when checking if an inode was logged
btrfs: fix race between setting last_dir_index_offset and inode logging
btrfs: fix race between logging inode and checking if it was logged before
btrfs: simplify error handling logic for btrfs_link()
btrfs: fix inode leak on failure to add link to inode
btrfs: abort transaction on failure to add link to inode
To solve this problem, switch to printk_safe mode before printing the warning
message. This redirects all printk()s to a special per-CPU buffer,
which will be flushed later from a safe context (irq work), so the
deadlock problem can be avoided. The proper API to use is
printk_deferred_enter()/printk_deferred_exit() [2]. Another way is to
place the warning print after the kmemleak lock is released.
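A sketch of the printk_deferred_enter()/printk_deferred_exit() pattern mentioned above (the lock and condition names are illustrative, not the actual kmemleak code):

raw_spin_lock_irqsave(&object_lock, flags);	/* illustrative lock */
printk_deferred_enter();
WARN_ON(inconsistent_state);	/* printk output is buffered per-CPU ... */
printk_deferred_exit();		/* ... and flushed later from irq work */
raw_spin_unlock_irqrestore(&object_lock, flags);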
ALSA: hda/hdmi: Add pin fix for another HP EliteDesk 800 G4 model
It was reported that the HP EliteDesk 800 G4 DM 65W (SSID 103c:845a) needs
a similar quirk for enabling HDMI outputs, too. This patch adds the
corresponding quirk entry.
Tina Wuest [Mon, 1 Sep 2025 09:20:24 +0000 (12:20 +0300)]
ALSA: usb-audio: Allow Focusrite devices to use low samplerates
Commit 05f254a6369ac020fc0382a7cbd3ef64ad997c92 ("ALSA: usb-audio:
Improve filtering of sample rates on Focusrite devices") changed the
check for max_rate in a way which was overly restrictive, forcing
devices to use very high samplerates if they support them, despite
support existing for lower rates as well.
This maintains the intended outcome (ensuring samplerates selected are
supported) while allowing devices with higher maximum samplerates to be
opened at all supported samplerates.
This patch was tested with a Clarett+ 8Pre USB
Fixes: 05f254a6369a ("ALSA: usb-audio: Improve filtering of sample rates on Focusrite devices") Signed-off-by: Tina Wuest <tina@wuest.me> Link: https://patch.msgid.link/20250901092024.140993-1-tina@wuest.me Signed-off-by: Takashi Iwai <tiwai@suse.de>
Linus Torvalds [Sun, 31 Aug 2025 16:20:17 +0000 (09:20 -0700)]
Merge tag 'x86_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Convert the SSB mitigation to the attack vector controls which got
forgotten at the time
- Prevent the CPUID topology hierarchy detection on AMD from
overwriting the correct initial APIC ID
- Fix the case of a machine shipping without microcode in the BIOS, in
the AMD microcode loader
- Correct the Pentium 4 model range which has a constant TSC
* tag 'x86_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/bugs: Add attack vector controls for SSB
x86/cpu/topology: Use initial APIC ID from XTOPOLOGY leaf on AMD/HYGON
x86/microcode/AMD: Handle the case of no BIOS microcode
x86/cpu/intel: Fix the constant_tsc model check for Pentium 4
Linus Torvalds [Sun, 31 Aug 2025 16:13:00 +0000 (09:13 -0700)]
Merge tag 'sched_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Fix a stall on the CPU offline path due to mis-counting a deadline
server task twice as part of the runqueue's running tasks count
- Fix a realtime tasks starvation case where failure to enqueue a timer
whose expiration time is already in the past would cause repeated
attempts to re-enqueue a deadline server task which leads to starving
the former, realtime one
- Prevent a delayed deadline server task stop from breaking the
per-runqueue bandwidth tracking
- Have a function checking whether the deadline server task has
stopped, return the correct value
* tag 'sched_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Don't count nr_running for dl_server proxy tasks
sched/deadline: Fix RT task potential starvation when expiry time passed
sched/deadline: Always stop dl-server before changing parameters
sched/deadline: Fix dl_server_stopped()
Linus Torvalds [Sun, 31 Aug 2025 16:07:37 +0000 (09:07 -0700)]
Merge tag 'irq_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Borislav Petkov:
- Remove unnecessary and noisy WARN_ONs in gic-v5's init path
- Avoid a kmemleak false positive for the gic-v5's L2 IST table entries
- Fix a retval check in mvebu-gicp's probe function
- Fix a wrong conversion to guards in atmel-aic[5] irqchip
* tag 'irq_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v5: Remove undue WARN_ON()s in the IRS affinity parsing
irqchip/gic-v5: Fix kmemleak L2 IST table entries false positives
irqchip/mvebu-gicp: Fix an IS_ERR() vs NULL check in probe()
irqchip/atmel-aic[5]: Fix incorrect lock guard conversion
Linus Torvalds [Sun, 31 Aug 2025 15:56:45 +0000 (08:56 -0700)]
Merge tag 'hardening-v6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening fixes from Kees Cook:
- ARM: stacktrace: include asm/sections.h in asm/stacktrace.h (Arnd
Bergmann)
- ubsan: Fix incorrect hand-side used in handle (Junhui Pei)
- hardening: Require clang 20.1.0 for __counted_by (Nathan Chancellor)
* tag 'hardening-v6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
hardening: Require clang 20.1.0 for __counted_by
ARM: stacktrace: include asm/sections.h in asm/stacktrace.h
ubsan: Fix incorrect hand-side used in handle
Linus Torvalds [Sun, 31 Aug 2025 15:49:55 +0000 (08:49 -0700)]
Merge tag 'gpio-fixes-for-v6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- fix an off-by-one bug in interrupt handling in gpio-timberdale
- update MAINTAINERS
* tag 'gpio-fixes-for-v6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
MAINTAINERS: Change Altera-PIO driver maintainer
gpio: timberdale: fix off-by-one in IRQ type boundary check
Linus Torvalds [Sat, 30 Aug 2025 17:43:53 +0000 (10:43 -0700)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- CFI failure due to kpti_ng_pgd_alloc() signature mismatch
- Underallocation bug in the SVE ptrace kselftest
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
kselftest/arm64: Don't open code SVE_PT_SIZE() in fp-ptrace
arm64: mm: Fix CFI failure due to kpti_ng_pgd_alloc function signature
Mark Brown [Tue, 12 Aug 2025 14:49:27 +0000 (15:49 +0100)]
kselftest/arm64: Don't open code SVE_PT_SIZE() in fp-ptrace
In fp-ptrace, when allocating a buffer to write SVE register data, we open
code the addition of the header size to the VL-dependent register data
size, which led to an underallocation bug when we cut'n'pasted the code
for FPSIMD format writes. Use the SVE_PT_SIZE() macro that the kernel
UAPI provides for this.
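As a minimal sketch of the pattern the fix moves to (the buffer handling and
flag choice are illustrative, not the selftest's exact code), the UAPI macro
sizes the whole regset including the header:
  #include <stdlib.h>
  #include <asm/ptrace.h>	/* SVE_PT_SIZE(), SVE_PT_REGS_* */
  #include <asm/sigcontext.h>	/* sve_vq_from_vl() */

  /* Size the write buffer with the UAPI macro instead of open coding
   * "header + VL-dependent register data". */
  static void *alloc_sve_write_buf(unsigned int vl, unsigned int regs_flag)
  {
  	unsigned int vq = sve_vq_from_vl(vl);

  	/* regs_flag is SVE_PT_REGS_SVE or SVE_PT_REGS_FPSIMD */
  	return malloc(SVE_PT_SIZE(vq, regs_flag));
  }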
The tasdev_load_calibrated_data() function expects the calibration data
values in the cali_data buffer in the order R0, R0Low, InvR0, Power, TLim,
which is not the order in which tas2563_save_calibration() writes them to
the buffer.
Reorder the EFI variables in the tas2563_save_calibration() function
to put the values in the buffer in the correct order.
Fixes: 4fe238513407 ("ALSA: hda/tas2781: Move and unified the calibrated-data getting function for SPI and I2C into the tas2781_hda lib") Cc: <stable@vger.kernel.org> Signed-off-by: Gergo Koteles <soyer@irl.hu> Link: https://patch.msgid.link/20250829160450.66623-2-soyer@irl.hu Signed-off-by: Takashi Iwai <tiwai@suse.de>
Gergo Koteles [Fri, 29 Aug 2025 16:04:49 +0000 (18:04 +0200)]
ALSA: hda: tas2781: fix tas2563 EFI data endianness
Before conversion to unify the calibration data management, the
tas2563_apply_calib() function performed the big endian conversion and
wrote the calibration data to the device. The writing is now done by the
common tasdev_load_calibrated_data() function, but without conversion.
Put the values into the calibration data buffer with the expected
endianness.
Fixes: 4fe238513407 ("ALSA: hda/tas2781: Move and unified the calibrated-data getting function for SPI and I2C into the tas2781_hda lib") Cc: <stable@vger.kernel.org> Signed-off-by: Gergo Koteles <soyer@irl.hu> Link: https://patch.msgid.link/20250829160450.66623-1-soyer@irl.hu Signed-off-by: Takashi Iwai <tiwai@suse.de>
Takashi Sakamoto [Fri, 29 Aug 2025 23:37:49 +0000 (08:37 +0900)]
ALSA: firewire-motu: drop EPOLLOUT from poll return values as write is not supported
The ALSA HwDep character device of the firewire-motu driver incorrectly
returns EPOLLOUT in poll(2), even though the driver implements no operation
for write(2). This misleads userspace applications to believe write() is
allowed, potentially resulting in unnecessary wakeups.
This issue dates back to the driver's initial code added by commit
71c3797779d3 ("ALSA: firewire-motu: add hwdep interface"), and persisted
when POLLOUT was updated to EPOLLOUT by commit a9a08845e9ac ("vfs: do
bulk POLL* -> EPOLL* replacement").
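A hedged sketch of the intended poll semantics follows; the private-data
fields used here are assumptions about the driver, not a verified copy of
its hwdep code:
  #include <linux/poll.h>
  #include <sound/hwdep.h>

  /* Illustrative callback; hwdep_wait/lock/dev_lock_changed/msg are assumed
   * fields of the driver's private struct snd_motu. */
  static __poll_t motu_hwdep_poll(struct snd_hwdep *hwdep, struct file *file,
  				poll_table *wait)
  {
  	struct snd_motu *motu = hwdep->private_data;
  	__poll_t events = 0;

  	poll_wait(file, &motu->hwdep_wait, wait);

  	spin_lock_irq(&motu->lock);
  	if (motu->dev_lock_changed || motu->msg)
  		events |= EPOLLIN | EPOLLRDNORM;
  	spin_unlock_irq(&motu->lock);

  	/* never advertise EPOLLOUT: write(2) is not supported */
  	return events;
  }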
Linus Torvalds [Fri, 29 Aug 2025 20:54:26 +0000 (13:54 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Correctly handle 'invariant' system registers for protected VMs
- Improved handling of VNCR data aborts, including external aborts
- Fixes for handling of FEAT_RAS for NV guests, providing a sane
fault context during SEA injection and preventing the use of
RASv1p1 fault injection hardware
- Ensure that page table destruction when a VM is destroyed gives an
opportunity to reschedule
- Large fix to KVM's infrastructure for managing guest context loaded
on the CPU, addressing issues where the output of AT emulation
doesn't get reflected to the guest
- Fix AT S12 emulation to actually perform stage-2 translation when
necessary
- Avoid attempting vLPI irqbypass when GICv4 has been explicitly
disabled for a VM
- Minor KVM + selftest fixes
RISC-V:
- Fix pte settings within kvm_riscv_gstage_ioremap()
- Fix comments in kvm_riscv_check_vcpu_requests()
- Fix stack overrun when setting vlenb via ONE_REG
x86:
- Use array_index_nospec() to sanitize the target vCPU ID when
handling PV IPIs and yields as the ID is guest-controlled.
- Drop a superfluous cpumask_empty() check when reclaiming SEV
memory, as the common case, by far, is that at least one CPU will
have entered the VM, and wbnoinvd_on_cpus_mask() will naturally
handle the rare case where the set of have_run_cpus is empty.
Selftests (not KVM):
- Rename the is_signed_type() macro in kselftest_harness.h to
is_signed_var() to fix a collision with linux/overflow.h. The
collision generates compiler warnings due to the two macros having
different meaning"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (29 commits)
KVM: arm64: nv: Fix ATS12 handling of single-stage translation
KVM: arm64: Remove __vcpu_{read,write}_sys_reg_{from,to}_cpu()
KVM: arm64: Fix vcpu_{read,write}_sys_reg() accessors
KVM: arm64: Simplify sysreg access on exception delivery
KVM: arm64: Check for SYSREGS_ON_CPU before accessing the 32bit state
RISC-V: KVM: fix stack overrun when loading vlenb
RISC-V: KVM: Correct kvm_riscv_check_vcpu_requests() comment
RISC-V: KVM: Fix pte settings within kvm_riscv_gstage_ioremap()
KVM: arm64: selftests: Sync ID_AA64MMFR3_EL1 in set_id_regs
KVM: arm64: Get rid of ARM64_FEATURE_MASK()
KVM: arm64: Make ID_AA64PFR1_EL1.RAS_frac writable
KVM: arm64: Make ID_AA64PFR0_EL1.RAS writable
KVM: arm64: Ignore HCR_EL2.FIEN set by L1 guest's EL2
KVM: arm64: Handle RASv1p1 registers
arm64: Add capability denoting FEAT_RASv1p1
KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables
KVM: arm64: Split kvm_pgtable_stage2_destroy()
selftests: harness: Rename is_signed_type() to avoid collision with overflow.h
KVM: SEV: don't check have_run_cpus in sev_writeback_caches()
KVM: arm64: Correctly populate FAR_EL2 on nested SEA injection
...
After an innocuous change in -next that modified a structure that
contains __counted_by, clang-19 starts crashing when building certain
files in drivers/gpu/drm/xe. When assertions are enabled, the more
descriptive failure is:
clang: clang/lib/AST/RecordLayoutBuilder.cpp:3335: const ASTRecordLayout &clang::ASTContext::getASTRecordLayout(const RecordDecl *) const: Assertion `D && "Cannot get layout of forward declarations!"' failed.
According to a reverse bisect, a tangential change to the LLVM IR
generation phase of clang during the LLVM 20 development cycle [1]
resolves this problem. Bump the version of clang that enables
CONFIG_CC_HAS_COUNTED_BY to 20.1.0 to ensure that this issue cannot be
hit.
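For reference, a minimal illustration of the annotation gated by
CONFIG_CC_HAS_COUNTED_BY (the struct is made up, not the xe structure that
triggered the crash): __counted_by ties a flexible array to the member that
holds its element count so the compiler can bounds-check accesses.
  /* Illustrative only; not the drivers/gpu/drm/xe structure involved.
   * __counted_by() is provided by the kernel's compiler attribute headers. */
  struct item_table {
  	u32 nr_items;			/* element count for the array below */
  	u64 items[] __counted_by(nr_items);
  };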
Paolo Bonzini [Fri, 29 Aug 2025 16:57:31 +0000 (12:57 -0400)]
Merge tag 'kvmarm-fixes-6.17-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 changes for 6.17, take #2
- Correctly handle 'invariant' system registers for protected VMs
- Improved handling of VNCR data aborts, including external aborts
- Fixes for handling of FEAT_RAS for NV guests, providing a sane
fault context during SEA injection and preventing the use of
RASv1p1 fault injection hardware
- Ensure that page table destruction when a VM is destroyed gives an
opportunity to reschedule
- Large fix to KVM's infrastructure for managing guest context loaded
on the CPU, addressing issues where the output of AT emulation
doesn't get reflected to the guest
- Fix AT S12 emulation to actually perform stage-2 translation when
necessary
- Avoid attempting vLPI irqbypass when GICv4 has been explicitly
disabled for a VM
Paolo Bonzini [Fri, 29 Aug 2025 16:57:18 +0000 (12:57 -0400)]
Merge tag 'kvm-riscv-fixes-6.17-1' of https://github.com/kvm-riscv/linux into HEAD
KVM/riscv fixes for 6.17, take #1
- Fix pte settings within kvm_riscv_gstage_ioremap()
- Fix comments in kvm_riscv_check_vcpu_requests()
- Fix stack overrun when setting vlenb via ONE_REG
Linus Torvalds [Fri, 29 Aug 2025 16:15:46 +0000 (09:15 -0700)]
Merge tag 'efi-fixes-for-v6.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fixes from Ard Biesheuvel:
- Assorted fixes for the OP-TEE based pseudo-EFI variable store
- Fix for an OOB access when looking up the same non-existing efivarfs
entry multiple times in parallel
* tag 'efi-fixes-for-v6.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efivarfs: Fix slab-out-of-bounds in efivarfs_d_compare
efi: stmm: Drop unneeded null pointer check
efi: stmm: Drop unused EFI error from setup_mm_hdr arguments
efi: stmm: Do not return EFI_OUT_OF_RESOURCES on internal errors
efi: stmm: Fix incorrect buffer allocation method
Linus Torvalds [Fri, 29 Aug 2025 15:09:34 +0000 (08:09 -0700)]
Merge tag 'xfs-fixes-6.17-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
The highlight I'd like to point out here is related to the XFS_RT
Kconfig option, which has been updated to be enabled by default now if
CONFIG_BLK_DEV_ZONED is enabled.
This also contains a few fixes for zoned devices support in XFS,
specially related to swapon requests in inodes belonging to the zoned
FS.
A null-ptr dereference fix in the xattr code, due to mishandling of
medium errors generated by block devices, is also included"
* tag 'xfs-fixes-6.17-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: do not propagate ENODATA disk errors into xattr code
xfs: reject swapon for inodes on a zoned file system earlier
xfs: kick off inodegc when failing to reserve zoned blocks
xfs: remove xfs_last_used_zone
xfs: Default XFS_RT to Y if CONFIG_BLK_DEV_ZONED is enabled
Linus Torvalds [Fri, 29 Aug 2025 14:37:21 +0000 (07:37 -0700)]
Merge tag 'regulator-fix-v6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fix from Mark Brown:
"One simple fix for the pm8008 driver for poor error handling,
switching to use a helper which does the right thing in the
affected case"
* tag 'regulator-fix-v6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: pm8008: fix probe failure due to negative voltage selector
Linus Torvalds [Fri, 29 Aug 2025 14:29:17 +0000 (07:29 -0700)]
Merge tag 'ata-6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
Pull ata fixes from Damien Le Moal:
- Fix the type of return values to be signed in the ahci_xgen driver
(Qianfeng)
- Add the mask_port_ext module parameter to the ahci driver.
This is to allow a user to ignore ports that are advertised as
external (hotplug capable) in favor of lower link power management
policies instead of the default max_performance for these ports.
This is useful to allow e.g. laptops to go into low power states when
hooked up to a docking station with SATA slots connected via an
external hotplug-capable port (me)
* tag 'ata-6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: ahci_xgene: Use int type for 'rc' to store error codes
ata: ahci: Allow ignoring the external/hotplug capability of ports
Takashi Iwai [Fri, 29 Aug 2025 09:13:09 +0000 (11:13 +0200)]
Merge tag 'asoc-fix-v6.17-rc3' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v6.17
The main fixes here are for some of the cleanups done in the core in
this release, we had broken component lookup in the case with a single
bus and DMA controller. Otherwise it's driver specific changes, the
shortlogs for the Intel WCL and rsnd drivers look like minor cleanups
but are actually bugfixes (adding an op needed for correct functionality
and reverting an inappropriate helper usage).
Linus Torvalds [Fri, 29 Aug 2025 02:56:32 +0000 (19:56 -0700)]
Merge tag 'drm-fixes-2025-08-29' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Weekly fixes, feels a bit big.
The major piece is msm fixes, then the usual amdgpu/xe along with some
mediatek and nouveau fixes and a tegra revert.
gpuvm:
- fix some typos
xe:
- Fix user-fence race issue
- Couple xe_vm fixes
- Don't trigger rebind on initial dma-buf validation
- Fix a build issue related to basename() posix vs gnu discrepancy
nouveau:
- fix linear modifier
- remove some dead code
msm:
- Core/GPU:
- fix comment doc warning in gpuvm
- fix build with KMS disabled
- fix pgtable setup/teardown race
- global fault counter fix
- various error path fixes
- GPU devcoredump snapshot fixes
- handle in-place VM_BIND remaps to solve turnip vm update race
- skip re-emitting IBs for unusable VMs
- Don't use %pK through printk
- moved display snapshot init earlier, fixing a crash
- DPU:
- Fixed crash in virtual plane checking code
- Fixed mode comparison in virtual plane checking code
- DSI:
- Adjusted width of resolution-related registers
- Fixed locking issue on 14nm PLLs
- UBWC (per Bjorn's ack)
- Added UBWC configuration for several missing platforms (fixing
regression)
mediatek:
- Add error handling for old state CRTC in atomic_disable
- Fix DSI host and panel bridge pre-enable order
- Fix device/node reference count leaks in mtk_drm_get_all_drm_priv
- mtk_hdmi: Fix inverted parameters in some regmap_update_bits calls
tegra:
- revert dma-buf change"
* tag 'drm-fixes-2025-08-29' of https://gitlab.freedesktop.org/drm/kernel: (56 commits)
drm/mediatek: mtk_hdmi: Fix inverted parameters in some regmap_update_bits calls
drm/amdgpu/userq: fix error handling of invalid doorbell
drm/amdgpu: update firmware version checks for user queue support
drm/amd/amdgpu: disable hwmon power1_cap* for gfx 11.0.3 on vf mode
Revert "drm/amdgpu: fix incorrect vm flags to map bo"
drm/amdgpu/gfx12: set MQD as appriopriate for queue types
drm/amdgpu/gfx11: set MQD as appriopriate for queue types
drm/xe: switch to local xbasename() helper
drm/xe: Don't trigger rebind on initial dma-buf validation
drm/xe/vm: Clear the scratch_pt pointer on error
drm/xe/vm: Don't pin the vm_resv during validation
drm/xe/xe_sync: avoid race during ufence signaling
Revert "drm/tegra: Use dma_buf from GEM object instance"
soc: qcom: use no-UBWC config for MSM8956/76
soc: qcom: add configuration for MSM8929
soc: qcom: ubwc: add more missing platforms
soc: qcom: ubwc: use no-uwbc config for MSM8917
drm/msm/dpu: Add a null ptr check for dpu_encoder_needs_modeset
dt-bindings: display/msm: qcom,mdp5: drop lut clock
drm/gpuvm: fix various typos in .c and .h gpuvm file
...
Linus Torvalds [Fri, 29 Aug 2025 01:51:28 +0000 (18:51 -0700)]
Merge tag 'block-6.17-20250828' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Fix a lockdep spotted issue on recursive locking for zoned writes, in
case of errors
- Update bcache MAINTAINERS entry address for Coly
- Fix for a ublk release issue, with selftests
- Fix for a regression introduced in this cycle, where it assumed
q->rq_qos was always set if the bio flag indicated that
- Fix for a regression introduced in this cycle, where loop retrieving
block device sizes got broken
* tag 'block-6.17-20250828' of git://git.kernel.dk/linux:
bcache: change maintainer's email address
ublk selftests: add --no_ublk_fixed_fd for not using registered ublk char device
ublk: avoid ublk_io_release() called after ublk char dev is closed
block: validate QoS before calling __rq_qos_done_bio()
blk-zoned: Fix a lockdep complaint about recursive locking
loop: fix zero sized loop for block special file
Linus Torvalds [Fri, 29 Aug 2025 01:41:53 +0000 (18:41 -0700)]
Merge tag 'io_uring-6.17-20250828' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
- Use the proper type for min_t() in getting the min of the leftover
bytes and the buffer length.
- As good practice, use READ_ONCE() consistently for reading ring
provided buffer lengths. Additionally, stop looping for incremental
commits if a zero sized buffer is hit, as no further progress can be
made at that point.
* tag 'io_uring-6.17-20250828' of git://git.kernel.dk/linux:
io_uring/kbuf: always use READ_ONCE() to read ring provided buffer lengths
io_uring/kbuf: fix signedness in this_len calculation
Linus Torvalds [Fri, 29 Aug 2025 00:35:51 +0000 (17:35 -0700)]
Merge tag 'net-6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from Bluetooth.
Current release - regressions:
- ipv4: fix regression in local-broadcast routes
- vsock: fix error-handling regression introduced in v6.17-rc1
Previous releases - regressions:
- bluetooth:
- mark connection as closed during suspend disconnect
- fix set_local_name race condition
- eth:
- ice: fix NULL pointer dereference on reset
- mlx5: fix memory leak in hws_pool_buddy_init error path
- bnxt_en: fix stats context reservation logic
- hv: fix loss of receive events from host during channel open
Previous releases - always broken:
- page_pool: fix incorrect mp_ops error handling
- sctp: initialize more fields in sctp_v6_from_sk()
- eth:
- octeontx2-vf: fix max packet length errors
- idpf: fix Tx flow scheduling to avoid Tx timeouts
- bnxt_en: fix memory corruption during ifdown
- ice: fix incorrect counter for buffer allocation failures
- mlx5: fix lockdep assertion on sync reset unload event
- fbnic: fixup rtnl_lock and devl_lock handling
- xgmac: do not enable RX FIFO overflow interrupts
- phy: mscc: fix when the PTP clock is registered and unregistered
Misc:
- add Telit Cinterion LE910C4-WWX new compositions"
* tag 'net-6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (60 commits)
net: ipv4: fix regression in local-broadcast routes
net: macb: Disable clocks once
fbnic: Move phylink resume out of service_task and into open/close
fbnic: Fixup rtnl_lock and devl_lock handling related to mailbox code
net: rose: fix a typo in rose_clear_routes()
l2tp: do not use sock_hold() in pppol2tp_session_get_sock()
sctp: initialize more fields in sctp_v6_from_sk()
MAINTAINERS: rmnet: Update email addresses
net: rose: include node references in rose_neigh refcount
net: rose: convert 'use' field to refcount_t
net: rose: split remove and free operations in rose_remove_neigh()
net: hv_netvsc: fix loss of early receive events from host during channel open.
net: stmmac: Set CIC bit only for TX queues with COE
net: stmmac: xgmac: Correct supported speed modes
net: stmmac: xgmac: Do not enable RX FIFO Overflow interrupts
net/mlx5e: Set local Xoff after FW update
net/mlx5e: Update and set Xon/Xoff upon port speed set
net/mlx5e: Update and set Xon/Xoff upon MTU set
net/mlx5: Prevent flow steering mode changes in switchdev mode
net/mlx5: Nack sync reset when SFs are present
...
1. Add error handling for old state CRTC in atomic_disable
2. Fix DSI host and panel bridge pre-enable order
3. Fix device/node reference count leaks in mtk_drm_get_all_drm_priv
4. mtk_hdmi: Fix inverted parameters in some regmap_update_bits calls
Linus Torvalds [Thu, 28 Aug 2025 23:34:32 +0000 (16:34 -0700)]
Merge tag 'pm-6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"Add missing locking annotations to two recently introduced
list_for_each_entry_rcu() loops in the core device suspend/resume
code (Johannes Berg)"
* tag 'pm-6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: sleep: annotate RCU list iterations
drm/mediatek: mtk_hdmi: Fix inverted parameters in some regmap_update_bits calls
In the mtk_hdmi driver, a recent change replaced custom register access
function calls with regmap ones, but two of the regmap_update_bits()
replacements were done incorrectly because the original offset and mask
parameters were inverted, so fix them.
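As a minimal sketch of the argument-order issue (the register and mask names
are placeholders, not the mtk_hdmi definitions), regmap_update_bits() takes
the register offset before the mask:
  #include <linux/bits.h>
  #include <linux/regmap.h>

  #define HDMI_CTRL_REG	0x10	/* placeholder offset */
  #define HDMI_CTRL_EN	BIT(0)	/* placeholder mask */

  static int hdmi_set_enable(struct regmap *regs, bool enable)
  {
  	/* regmap_update_bits(map, reg, mask, val): the broken conversions
  	 * passed the mask where the register offset belongs. */
  	return regmap_update_bits(regs, HDMI_CTRL_REG, HDMI_CTRL_EN,
  				  enable ? HDMI_CTRL_EN : 0);
  }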
Linus Torvalds [Thu, 28 Aug 2025 23:04:14 +0000 (16:04 -0700)]
Merge tag 'dma-mapping-6.17-2025-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
- another small fix for arm64 systems with memory encryption (Shanker
Donthineni)
- fix for arm32 systems with non-standard CMA configuration (Oreoluwa
Babatunde)
* tag 'dma-mapping-6.17-2025-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma/pool: Ensure DMA_DIRECT_REMAP allocations are decrypted
of: reserved_mem: Restructure call site for dma_contiguous_early_fixup()
Linus Torvalds [Thu, 28 Aug 2025 22:46:06 +0000 (15:46 -0700)]
Merge tag 'fixes-2025-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock
Pull memblock fixes from Mike Rapoport:
- printk cleanups in memblock and numa_memblks
- update kernel-doc for MEMBLOCK_RSRV_NOINIT to be more accurate and
detailed
* tag 'fixes-2025-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
memblock: fix kernel-doc for MEMBLOCK_RSRV_NOINIT
mm: numa,memblock: Use SZ_1M macro to denote bytes to MB conversion
mm/numa_memblks: Use pr_debug instead of printk(KERN_DEBUG)
Dave Airlie [Thu, 28 Aug 2025 22:44:10 +0000 (08:44 +1000)]
Merge tag 'drm-misc-fixes-2025-08-28' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
Several nouveau fixes to remove unused code, fix an error path and be
less restrictive with the formats it accepts. A fix for amdgpu to pin
vmapped dma-buf, and a revert for tegra for a regression in the dma-buf
/ GEM code.
Linus Torvalds [Thu, 28 Aug 2025 22:39:06 +0000 (15:39 -0700)]
Merge tag 'powerpc-6.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Madhavan Srinivasan:
- Merge two CONFIG_POWERPC64_CPU entries in Kconfig.cputype
- Replace extra-y to always-y in Makefile
- Cleanup to use dev_fwnode helper
- Fix misleading comment in kvmppc_prepare_to_enter()
- misc cleanup and fixes
Thanks to Amit Machhiwal, Andrew Donnellan, Christophe Leroy, Gautam
Menghani, Jiri Slaby (SUSE), Masahiro Yamada, Shrikanth Hegde, Stephen
Rothwell, Venkat Rao Bagalkote, and Xichao Zhao
* tag 'powerpc-6.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/boot/install.sh: Fix shellcheck warnings
powerpc/prom_init: Fix shellcheck warnings
powerpc/kvm: Fix ifdef to remove build warning
powerpc: unify two CONFIG_POWERPC64_CPU entries in the same choice block
powerpc: use always-y instead of extra-y in Makefiles
powerpc/64: Drop unnecessary 'rc' variable
powerpc: Use dev_fwnode()
KVM: PPC: Fix misleading interrupts comment in kvmppc_prepare_to_enter()
Marc Zyngier [Sat, 9 Aug 2025 14:48:10 +0000 (15:48 +0100)]
KVM: arm64: nv: Fix ATS12 handling of single-stage translation
Volodymyr reports that using a Xen DomU as a nested guest (where
HCR_EL2.E2H == 0), ATS12 results in a translation that stops at
the L2's S1, which isn't something you'd normally expect.
Comparing the code against the spec proves to be illuminating,
and suggests that the author of such code must have been tired,
cross-eyed, drunk, or maybe all of the above.
The gist of it is that, apart from HCR_EL2.VM or HCR_EL2.DC being
0, only the use of the EL2&0 translation regime limits the walk
to S1 only, and that we must finish the S2 walk in any other case.
Which solves the above issue, as E2H==0 indicates that ATS12 walks
the EL1&0 translation regime.
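A hedged sketch of the resulting decision (the helper name is illustrative,
not KVM's actual AT emulation code): stage 2 is skipped only when it is
disabled or when the walk uses the EL2&0 translation regime.
  #include <asm/kvm_arm.h>	/* HCR_VM, HCR_DC */

  static bool ats12_stops_at_stage1(u64 hcr_el2, bool regime_is_el20)
  {
  	/* No stage 2 at all when both HCR_EL2.VM and HCR_EL2.DC are 0 ... */
  	if (!(hcr_el2 & (HCR_VM | HCR_DC)))
  		return true;

  	/* ... otherwise only the EL2&0 regime limits the walk to stage 1. */
  	return regime_is_el20;
  }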
Ajye Huang [Tue, 26 Aug 2025 15:40:40 +0000 (23:40 +0800)]
ASoC: SOF: Intel: WCL: Add the sdw_process_wakeen op
Add the missing op in the device description to avoid issues with jack
detection.
Fixes: 6b04629ae97a ("ASoC: SOF: Intel: add initial support for WCL") Acked-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com> Signed-off-by: Ajye Huang <ajye_huang@compal.corp-partner.google.com>
Message-ID: <20250826154040.2723998-1-ajye_huang@compal.corp-partner.google.com> Signed-off-by: Mark Brown <broonie@kernel.org>
Volodymyr reports (again!) that under some circumstances (E2H==0,
walking S1 PTs), PAR_EL1 doesn't report the value of the latest
walk in the CPU register, but that instead the value is written to
the backing store.
Further investigation indicates that the root cause of this is
that a group of registers (PAR_EL1, TPIDR*_EL{0,1}, the *32_EL2 dregs)
should always be considered as "on CPU", as they are not remapped
between EL1 and EL2.
We fail to treat them accordingly, and end up considering that
the register (PAR_EL1 in this example) should be written to memory
instead of to the CPU register.
While it would be possible to quickly work around it, it is obvious
that the way we track these things at the moment is pretty horrible,
and could do with some improvement.
Revamp the whole thing by:
- defining a location for a register (memory, cpu), potentially
depending on the state of the vcpu
- define a transformation for this register (mapped register, potential
translation, special register needing some particular attention)
- convey this information in a structure that can be easily passed
around
As a result, the accessors themselves become much simpler, as the
state is explicit instead of being driven by hard-to-understand
conventions.
We get rid of the "pure EL2 register" notion, which wasn't very
useful, and add sanitisation of the values by applying the RESx
masks as required, something that was missing so far.
And of course, we add the missing registers to the list, with the
indication that they are always loaded.
Marc Zyngier [Sun, 17 Aug 2025 12:19:24 +0000 (13:19 +0100)]
KVM: arm64: Simplify sysreg access on exception delivery
Distinguishing between NV and VHE is slightly pointless, and only
serves as an extra complication, or a way to introduce bugs, such
as the way SPSR_EL1 gets written without checking for the state
being resident.
Get rid of this silly distinction, and fix the bug in one go.
Marc Zyngier [Sun, 17 Aug 2025 12:19:23 +0000 (13:19 +0100)]
KVM: arm64: Check for SYSREGS_ON_CPU before accessing the 32bit state
Just like c6e35dff58d3 ("KVM: arm64: Check for SYSREGS_ON_CPU before
accessing the CPU state") fixed the 64bit state access, add a check
for the 32bit state actually being on the CPU before writing it.
Takashi Iwai [Thu, 28 Aug 2025 14:11:00 +0000 (16:11 +0200)]
ALSA: hda: Avoid binding with SOF for SKL/KBL platforms
For Intel SKL and KBL platforms, it may be bound with one of three
HD-audio drivers (AVS, SOF and legacy). AVS is the preferred one when
DMIC is detected, and that's how it's defined in the snd-intel-dspcfg
config table.
But when the AVS driver is disabled (CONFIG_SND_SOC_INTEL_AVS=n), the
device may be bound freely with either the SOF or the legacy driver.
Before 6.17, the legacy driver took it primarily, but on 6.17, likely
due to the recent code shuffling, the SOF driver seems to take it first
and fails to probe. To avoid the regression, we should force those
devices to bind with the legacy HD-audio driver when AVS is disabled.
This patch adds two extra entries to the intel-dspcfg table that are
applied only when CONFIG_SND_SOC_INTEL_AVS=n, binding them with the
legacy driver.
Note that there are entries for APL in that config table block, but
APL may be supported by SOF for certain setups, so the choice can't be
exclusive. Hence this patch includes only SKL and KBL.
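As a rough sketch of the shape of such entries (the flag and device-ID names
are placeholders, not the real intel-dsp-config table), the additions are
only compiled in when AVS is not built:
  /* Placeholder names; the real entries live in sound/hda/intel-dsp-config.c. */
  #if !IS_ENABLED(CONFIG_SND_SOC_INTEL_AVS)
  	{ .flags = FLAG_LEGACY_HDA, .device = HDA_SKL_LP_PCI_ID },
  	{ .flags = FLAG_LEGACY_HDA, .device = HDA_KBL_LP_PCI_ID },
  #endif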
Ming Lei [Wed, 27 Aug 2025 12:16:00 +0000 (20:16 +0800)]
ublk selftests: add --no_ublk_fixed_fd for not using registered ublk char device
Add a new command line option --no_ublk_fixed_fd that excludes the ublk
control device (/dev/ublkcN) from io_uring's registered files array.
When this option is used, only backing files are registered starting
from index 1, while the ublk control device is accessed using its raw
file descriptor.
Add ublk_get_registered_fd() helper function that returns the appropriate
file descriptor for use with io_uring operations.
Key optimizations implemented:
- Cache UBLKS_Q_NO_UBLK_FIXED_FD flag in ublk_queue.flags to avoid
reading dev->no_ublk_fixed_fd in fast path
- Cache ublk char device fd in ublk_queue.ublk_fd for fast access
- Update ublk_get_registered_fd() to use ublk_queue * parameter
- Update io_uring_prep_buf_register/unregister() to use ublk_queue *
- Replace ublk_device * access with ublk_queue * access in fast paths
Also pass --no_ublk_fixed_fd to test_stress_04.sh for covering
plain ublk char device mode.
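A minimal sketch of the helper, assuming the cached fields named above and
that the char device normally sits in fixed-file slot 0 (an assumption, not
stated here):
  static inline int ublk_get_registered_fd(const struct ublk_queue *q)
  {
  	/* UBLKS_Q_NO_UBLK_FIXED_FD and ublk_fd are the cached fields
  	 * described above; slot 0 for the registered case is assumed. */
  	if (q->flags & UBLKS_Q_NO_UBLK_FIXED_FD)
  		return q->ublk_fd;	/* raw /dev/ublkcN descriptor */
  	return 0;			/* io_uring fixed-file index */
  }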
Ming Lei [Wed, 27 Aug 2025 12:15:59 +0000 (20:15 +0800)]
ublk: avoid ublk_io_release() called after ublk char dev is closed
When running test_stress_04.sh, the following warning is triggered:
WARNING: CPU: 1 PID: 135 at drivers/block/ublk_drv.c:1933 ublk_ch_release+0x423/0x4b0 [ublk_drv]
This happens when the daemon is abruptly killed:
- some references may still be held, because registering an IO buffer
doesn't grab a ublk char device reference
OR
- io->task_registered_buffers won't be cleared because the io buffer is
released from a non-daemon context
For zero-copy and auto buffer register modes, the I/O reference crosses
syscalls, so it may not be dropped naturally when the ublk server is
killed abruptly. However, when releasing the io_uring context, it is
guaranteed
that the reference is dropped finally, see io_sqe_buffers_unregister() from
io_ring_ctx_free().
Fix this by adding ublk_drain_io_references() that:
- Waits for active I/O references to be dropped asynchronously by
scheduling a work function, avoiding a release dependency between the
ublk device and the io_uring file
- Reinitializes io->ref and io->task_registered_buffers to clean state
This ensures the reference count state is clean when ublk_queue_reinit()
is called, preventing the warning and potential use-after-free.
Fixes: 1f6540e2aabb ("ublk: zc register/unregister bvec") Fixes: 1ceeedb59749 ("ublk: optimize UBLK_IO_UNREGISTER_IO_BUF on daemon task") Fixes: 8a8fe42d765b ("ublk: optimize UBLK_IO_REGISTER_IO_BUF on daemon task") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250827121602.2619736-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 27 Aug 2025 21:27:30 +0000 (15:27 -0600)]
io_uring/kbuf: always use READ_ONCE() to read ring provided buffer lengths
Since the buffers are mapped from userspace, it is prudent to use
READ_ONCE() to read the value into a local variable, and use that for
any other actions taken. Having a stable read of the buffer length
avoids worrying about it changing after checking, or being read multiple
times.
Similarly, the buffer may well change in between it being picked and
being committed. Ensure the looping for incremental ring buffer commit
stops if it hits a zero sized buffer, as no further progress can be made
at that point.
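A minimal sketch of the pattern (struct layout and names are illustrative,
not io_uring's kbuf code): the userspace-mapped length is read once into a
local variable, and a zero length terminates the incremental commit.
  struct ring_buf {			/* illustrative, not struct io_uring_buf */
  	__u64	addr;
  	__u32	len;	/* mapped from userspace, may change under us */
  	__u16	bid;
  };

  static size_t pick_buf_len(const struct ring_buf *buf, size_t needed)
  {
  	u32 len = READ_ONCE(buf->len);	/* single, stable snapshot */

  	if (!len)
  		return 0;	/* zero-sized buffer: no further progress possible */

  	return min_t(size_t, len, needed);
  }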
ASoC: rsnd: tidyup direction name on rsnd_dai_connect()
commit 2c6b6a3e8b93 ("ASoC: rsnd: use snd_pcm_direction_name()") uses
snd_pcm_direction_name() instead of the original method to get the
string "Playback" or "Capture". But io->substream might be NULL at this
point. Let's re-use the original method.
Fixes: 2c6b6a3e8b93 ("ASoC: rsnd: use snd_pcm_direction_name()") Reported-by: Thuan Nguyen <thuan.nguyen-hong@banvien.com.vn> Tested-by: Thuan Nguyen <thuan.nguyen-hong@banvien.com.vn> Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Message-ID: <87zfbmwq6v.wl-kuninori.morimoto.gx@renesas.com> Signed-off-by: Mark Brown <broonie@kernel.org>
Oscar Maes [Wed, 27 Aug 2025 06:23:21 +0000 (08:23 +0200)]
net: ipv4: fix regression in local-broadcast routes
Commit 9e30ecf23b1b ("net: ipv4: fix incorrect MTU in broadcast routes")
introduced a regression where local-broadcast packets would have their
gateway set in __mkroute_output, which was caused by fi = NULL being
removed.
Fix this by resetting the fib_info for local-broadcast packets. This
preserves the intended changes for directed-broadcast packets.
Cc: stable@vger.kernel.org Fixes: 9e30ecf23b1b ("net: ipv4: fix incorrect MTU in broadcast routes") Reported-by: Brett A C Sheffield <bacs@librecast.net> Closes: https://lore.kernel.org/regressions/20250822165231.4353-4-bacs@librecast.net Signed-off-by: Oscar Maes <oscmaes92@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250827062322.4807-1-oscmaes92@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
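A hedged fragment of the idea inside __mkroute_output() (illustrative, not
the exact patch, which also preserves the directed-broadcast handling):
  	if (type == RTN_BROADCAST) {
  		flags |= RTCF_BROADCAST | RTCF_LOCAL;
  		/* keep local-broadcast routes gateway-free */
  		fi = NULL;
  	}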