git.ipfire.org Git - thirdparty/kernel/stable.git/log

execmem: move execmem_force_rw() and execmem_restore_rox() before use

to avoid static declarations.

Link: https://lkml.kernel.org/r/20250713071730.4117334-5-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

execmem: rework execmem_cache_free()

Currently execmem_cache_free() ignores potential allocation failures that
may happen in execmem_cache_add(). Besides, it uses text poking to fill
the memory with trapping instructions before returning it to cache
although it would be more efficient to make that memory writable, update
it using memcpy and then restore ROX protection.

Rework execmem_cache_free() so that in case of an error it will defer
freeing of the memory to a delayed work.

With this the happy fast path will now change permissions to RW, fill the
memory with trapping instructions using memcpy, restore ROX permissions,
add the memory back to the free cache and clear the relevant entry in
busy_areas.

If any step in the fast path fails, the entry in busy_areas will be marked
as pending_free. These entries will be handled by a delayed work and
freed asynchronously.

To make the fast path faster, use __GFP_NORETRY for memory allocations and
let asynchronous handler try harder with GFP_KERNEL.

Link: https://lkml.kernel.org/r/20250713071730.4117334-4-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

execmem: introduce execmem_alloc_rw()

Some callers of execmem_alloc() require the memory to be temporarily
writable even when it is allocated from ROX cache. These callers use
execemem_make_temp_rw() right after the call to execmem_alloc().

Wrap this sequence in execmem_alloc_rw() API.

Link: https://lkml.kernel.org/r/20250713071730.4117334-3-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

execmem: drop unused execmem_update_copy()

Patch series "x86: enable EXECMEM_ROX_CACHE for ftrace and kprobes", v3.

These patches enable use of EXECMEM_ROX_CACHE for ftrace and kprobes
allocations on x86.

They also include some ground work in execmem.

Since the execmem model for caching large ROX pages changed from the
initial assumption that the memory that is allocated from ROX cache is
always ROX to the current state where memory can be temporarily made RW
and then restored to ROX, we can stop using text poking to update it.
This also saves the hassle of trying lock text_mutex in
execmem_cache_free() when kprobes already hold that mutex.

This patch (of 8):

The execmem_update_copy() that used text poking was required when memory
allocated from ROX cache was always read-only. Since now its permissions
can be switched to read-write there is no need in a function that updates
memory with text poking.

Remove it.

Link: https://lkml.kernel.org/r/20250713071730.4117334-1-rppt@kernel.org
Link: https://lkml.kernel.org/r/20250713071730.4117334-2-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: fix a UAF when vma->mm is freed after vma->vm_refcnt got dropped

By inducing delays in the right places, Jann Horn created a reproducer for
a hard to hit UAF issue that became possible after VMAs were allowed to be
recycled by adding SLAB_TYPESAFE_BY_RCU to their cache.

Race description is borrowed from Jann's discovery report:
lock_vma_under_rcu() looks up a VMA locklessly with mas_walk() under
rcu_read_lock().  At that point, the VMA may be concurrently freed, and it
can be recycled by another process.  vma_start_read() then increments the
vma->vm_refcnt (if it is in an acceptable range), and if this succeeds,
vma_start_read() can return a recycled VMA.

In this scenario where the VMA has been recycled, lock_vma_under_rcu()
will then detect the mismatching ->vm_mm pointer and drop the VMA through
vma_end_read(), which calls vma_refcount_put().  vma_refcount_put() drops
the refcount and then calls rcuwait_wake_up() using a copy of vma->vm_mm.
This is wrong: It implicitly assumes that the caller is keeping the VMA's
mm alive, but in this scenario the caller has no relation to the VMA's mm,
so the rcuwait_wake_up() can cause UAF.

The diagram depicting the race:
T1         T2         T3
==         ==         ==
lock_vma_under_rcu
  mas_walk
          <VMA gets removed from mm>
                      mmap
                        <the same VMA is reallocated>
  vma_start_read
    __refcount_inc_not_zero_limited_acquire
                      munmap
                        __vma_enter_locked
                          refcount_add_not_zero
  vma_end_read
    vma_refcount_put
      __refcount_dec_and_test
                          rcuwait_wait_event
                            <finish operation>
      rcuwait_wake_up [UAF]

Note that rcuwait_wait_event() in T3 does not block because refcount was
already dropped by T1.  At this point T3 can exit and free the mm causing
UAF in T1.

To avoid this we move vma->vm_mm verification into vma_start_read() and
grab vma->vm_mm to stabilize it before vma_refcount_put() operation.

[surenb@google.com: v3]
Link: https://lkml.kernel.org/r/20250729145709.2731370-1-surenb@google.com
Link: https://lkml.kernel.org/r/20250728175355.2282375-1-surenb@google.com
Fixes: 3104138517fc ("mm: make vma cache SLAB_TYPESAFE_BY_RCU")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: Jann Horn <jannh@google.com>
Closes: https://lore.kernel.org/all/CAG48ez0-deFbVH=E3jbkWx=X3uVbd8nWeo6kbJPQ0KoUD+m2tA@mail.gmail.com/
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/rmap: add anon_vma lifetime debug check

If an anon folio is mapped into userspace, its anon_vma must be alive,
otherwise rmap walks can hit UAF.

There have been syzkaller reports a few months ago[1][2] of UAF in rmap
walks that seems to indicate that there can be pages with elevated
mapcount whose anon_vma has already been freed, but I think we never
figured out what the cause is; and syzkaller only hit these UAFs when
memory pressure randomly caused reclaim to rmap-walk the affected pages,
so it of course didn't manage to create a reproducer.

Add a VM_WARN_ON_FOLIO() when we add/remove mappings of anonymous folios
to hopefully catch such issues more reliably.

[1] https://lore.kernel.org/r/67abaeaf.050a0220.110943.0041.GAE@google.com
[2] https://lore.kernel.org/r/67a76f33.050a0220.3d72c.0028.GAE@google.com

Link: https://lkml.kernel.org/r/20250725-anonvma-uaf-debug-v2-1-bc3c7e5ba5b1@google.com
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: remove mm/io-mapping.c

This is dead code, which was used from commit b739f125e4eb ("i915: use
io_mapping_map_user") but reverted a month later by commit 0e4fe0c9f2f9
("Revert "i915: use io_mapping_map_user"") back in 2021.

Since then nobody has used it, so remove it.

[akpm@linux-foundation.org: update Documentation/core-api/mm-api.rst, per Vlastimil]
Link: https://lkml.kernel.org/r/20250725142901.81502-1-lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

khugepaged: optimize collapse_pte_mapped_thp() by PTE batching

Use PTE batching to batch process PTEs mapping the same large folio. An
improvement is expected due to batching mapcount manipulation on the
folios, and for arm64 which supports contig mappings, the number of
TLB flushes is also reduced.

Note that we do not need to make a change to the check
"if (folio_page(folio, i) != page)"; if i'th page of the folio is equal
to the first page of our batch, then i + 1, .... i + nr_batch_ptes - 1
pages of the folio will be equal to the corresponding pages of our
batch mapping consecutive pages.

Link: https://lkml.kernel.org/r/20250724052301.23844-4-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

khugepaged: optimize __collapse_huge_page_copy_succeeded() by PTE batching

Use PTE batching to batch process PTEs mapping the same large folio. An
improvement is expected due to batching refcount-mapcount manipulation on
the folios, and for arm64 which supports contig mappings, the number of
TLB flushes is also reduced.

Link: https://lkml.kernel.org/r/20250724052301.23844-3-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: add get_and_clear_ptes() and clear_ptes()

Patch series "Optimizations for khugepaged", v4.

If the underlying folio mapped by the ptes is large, we can process those
ptes in a batch using folio_pte_batch().

For arm64 specifically, this results in a 16x reduction in the number of
ptep_get() calls, since on a contig block, ptep_get() on arm64 will
iterate through all 16 entries to collect a/d bits.  Next, ptep_clear()
will cause a TLBI for every contig block in the range via
contpte_try_unfold().  Instead, use clear_ptes() to only do the TLBI at
the first and last contig block of the range.

For split folios, there will be no pte batching; the batch size returned
by folio_pte_batch() will be 1.  For pagetable split folios, the ptes will
still point to the same large folio; for arm64, this results in the
optimization described above, and for other arches, a minor improvement is
expected due to a reduction in the number of function calls and batching
atomic operations.

This patch (of 3):

Let's add variants to be used where "full" does not apply -- which will
be the majority of cases in the future. "full" really only applies if
we are about to tear down a full MM.

Use get_and_clear_ptes() in existing code, clear_ptes() users will
be added next.

Link: https://lkml.kernel.org/r/20250724052301.23844-2-dev.jain@arm.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/mincore: hold PTL in mincore_hugetlb

Hold PTL in mincore_hugetlb() to avoid operating on stale page, as
mincore_pte_range() have done.

Link: https://lkml.kernel.org/r/20250724090958.455887-4-tujinjiang@huawei.com
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Brahmajit Das <brahmajit.xyz@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Rientjes <rientjes@google.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joern Engel <joern@logfs.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/memory-failure: hold PTL in hwpoison_hugetlb_range

Hold PTL in hwpoison_hugetlb_range() to avoid operating on stale page, as
hwpoison_pte_range() have done.

This change is not known to address any issues which users have
experienced.

Link: https://lkml.kernel.org/r/20250725033112.2690158-1-tujinjiang@huawei.com
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Brahmajit Das <brahmajit.xyz@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Rientjes <rientjes@google.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joern Engel <joern@logfs.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/mseal: rework mseal apply logic

The logic can be simplified - firstly by renaming the inconsistently named
apply_mm_seal() to mseal_apply().

We then wrap mseal_fixup() into the main loop as the logic is simple
enough to not require it, equally it isn't a hugely pleasant pattern in
mprotect() etc. so it's not something we want to perpetuate.

We eliminate the need for invoking vma_iter_end() on each loop by directly
determining if the VMA was merged - the only thing we need concern
ourselves with is whether the start/end of the (gapless) range are offset
into VMAs.

This refactoring also avoids the rather horrid 'pass pointer to prev
around' pattern used in mprotect() et al.

No functional change intended.

Link: https://lkml.kernel.org/r/ddfa4376ce29f19a589d7dc8c92cb7d4f7605a4c.1753431105.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Jeff Xu <jeffxu@chromium.org>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/mseal: simplify and rename VMA gap check

The check_mm_seal() function is doing something general - checking whether
a range contains only VMAs (or rather that it does NOT contain any
unmapped regions).

So rename this function to range_contains_unmapped().

Additionally simplify the logic, we are simply checking whether the last
vma->vm_end has either a VMA starting after it or ends before the end
parameter.

This check is rather dubious, so it is sensible to keep it local to
mm/mseal.c as at a later stage it may be removed, and we don't want any
other mm code to perform such a check.

No functional change intended.

[lorenzo.stoakes@oracle.com: add comment explaining why we disallow gaps on mseal()]
Link: https://lkml.kernel.org/r/d85b3d55-09dc-43ba-8204-b48267a96751@lucifer.local
Link: https://lkml.kernel.org/r/dd50984eff1e242b5f7f0f070a3360ef760e06b8.1753431105.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Jeff Xu <jeffxu@chromium.org>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/mseal: small cleanups

Drop the wholly unnecessary set_vma_sealed() helper(), which is used only
once, and place VMA_ITERATOR() declarations in the correct place.

Retain vma_is_sealed(), and use it instead of the confusingly named
can_modify_vma(), so it's abundantly clear what's being tested, rather
then a nebulous sense of 'can the VMA be modified'.

No functional change intended.

Link: https://lkml.kernel.org/r/98cf28d04583d632a6eb698e9ad23733bb6af26b.1753431105.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Jeff Xu <jeffxu@chromium.org>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/mseal: update madvise() logic

The madvise() logic is inexplicably performed in mm/mseal.c - this ought
to be located in mm/madvise.c.

Additionally can_modify_vma_madv() is inconsistently named and, in
combination with is_ro_anon(), is very confusing logic.

Put a static function in mm/madvise.c instead - can_madvise_modify() -
that spells out exactly what's happening. Also explicitly check for an
anon VMA.

Also add commentary to explain what's going on.

Essentially - we disallow discarding of data in mseal()'d mappings in
instances where the user couldn't otherwise write to that data.

We retain the existing behaviour here regarding MAP_PRIVATE mappings of
file-backed mappings, which entails some complexity - while this, strictly
speaking - appears to violate mseal() semantics, it may interact badly
with users which expect to be able to madvise(MADV_DONTNEED) .text
mappings for instance.

We may revisit this at a later date.

No functional change intended.

Link: https://lkml.kernel.org/r/492a98d9189646e92c8f23f4cce41ed323fe01df.1753431105.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/mseal: always define VM_SEALED

Patch series "mseal cleanups", v4.

Perform a number of cleanups to the mseal logic.  Firstly, VM_SEALED is
treated differently from every other VMA flag, it really doesn't make
sense to do this, so we start by making this consistent with everything
else.

Next we place the madvise logic where it belongs - in mm/madvise.c.  It
really makes no sense to abstract this elsewhere.  In doing so, we go to
great lengths to explain very clearly the previously very confusing logic
as to what sealed mappings are impacted here.

In doing so, we retain existing logic regarding treatment of madvise()
discard operations for a sealed, read-only MAP_PRIVATE file-backed
mapping.  This is something we likely need to revisit.

We then abstract out and explain the 'are there are any gaps in this range
in the mm?' check being performed as a prerequisite to mseal being
performed.

Finally, we simplify the actual mseal logic which is really quite
straightforward.

No functional change is intended.

This patch (of 4):

There is no reason to treat VM_SEALED in a special way, in each other case
in which a VMA flag is unavailable due to configuration, we simply assign
that flag to VM_NONE, so make VM_SEALED consistent with all other VMA
flags in this respect.

Additionally, use the next available bit for VM_SEALED, 42, rather than
arbitrarily putting it at 63 and update the declaration to match all other
VMA flags.

No functional change intended.

Link: https://lkml.kernel.org/r/cover.1753431105.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/aeb398a77029b6e7377cd944328bc9bbc3c90537.1753431105.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/damon/vaddr: skip isolating folios already in destination nid

damos_va_migrate_dests_add() determines the node a folio should be in
based on the struct damos_migrate_dests associated with the migration
scheme and adds the folio to the linked list corresponding to that node so
it can be migrated later.  Currently, folios are isolated and added to the
list even if they are already in the node they should be in.

In using damon weighted interleave more, I've found that the overhead of
needlessly adding these folios to the migration lists can be quite high.
The overhead comes from isolating folios and placing them in the migration
lists inside of damos_va_migrate_dests_add(), as well as the cost of
handling those folios in damon_migrate_pages().  This patch eliminates
that overhead by simply avoiding the addition of folios that are already
in their intended location to the migration list.

To show the benefit of this patch, we start the test workload and start a
DAMON instance attached to that workload with a migrate_hot scheme that
has one dest field sending data to the local node.  This way, we are only
measuring the overheads of the scheme, and not the cost of migrating
pages, since data will be allocated to the local node by default.  I
tested with two workloads: the embedding reduction workload used in [1]
and a microbenchmark that allocates 20GB of data then sleeps, which is
similar to the memory usage of the embedding reduction workload.

The time taken in damos_va_migrate_dests_add() and damon_migrate_pages()
each aggregation interval is shown below.

Before this patch:
                       damos_va_migrate_dests_add damon_migrate_pages
microbenchmark                   ~2ms                      ~3ms
embedding reduction              ~1s                       ~3s

After this patch:
                       damos_va_migrate_dests_add damon_migrate_pages
microbenchmark                    0us                      ~40us
embedding reduction               0us                      ~100us

I did not do an in depth analysis for why things are much slower in the
embedding reduction workload than the microbenchmark.  However, I assume
it's because the embedding reduction workload oversaturates the bandwidth
of the local memory node, increasing the memory access latency, and in
turn making the pointer chasing involved in iterating through a linked
list much slower.  Regardless of that, this patch results in a significant
speedup.

[1] https://lore.kernel.org/damon/20250709005952.17776-1-bijan311@gmail.com/

Link: https://lkml.kernel.org/r/20250725163300.4602-1-bijan311@gmail.com
Fixes: 19c1dc15c859 ("mm/damon/vaddr: use damos->migrate_dests in migrate_{hot,cold}")
Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests: cachestat: add tests for mmap, refactor and enhance mmap test for cachestat validation

Add a cohesive test case that verifies cachestat behavior with
memory-mapped files using mmap(). Also refactor the test logic to reduce
redundancy, improve error reporting, and clarify failure messages for both
shmem and mmap file types.

[akpm@linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/20250709174657.6916-1-suresh.k.chandrappa@gmail.com
Signed-off-by: Suresh K C <suresh.k.chandrappa@gmail.com>
Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Tested-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: add process info to bad rss-counter warning

Enhance the debugging information in check_mm() by including the process
name and PID when reporting bad rss-counter states. This helps identify
which process is associated with the memory accounting issue.

Link: https://lkml.kernel.org/r/20250723100901.1909683-1-liuqiye2025@163.com
Signed-off-by: Xuanye Liu <liuqiye2025@163.com>
Acked-by: SeongJae Park <sj@kernel.org>
Cc: Ben Segall <bsegall@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

kasan: skip quarantine if object is still accessible under RCU

Currently, enabling KASAN masks bugs where a lockless lookup path gets a
pointer to a SLAB_TYPESAFE_BY_RCU object that might concurrently be
recycled and is insufficiently careful about handling recycled objects:
KASAN puts freed objects in SLAB_TYPESAFE_BY_RCU slabs onto its quarantine
queues, even when it can't actually detect UAF in these objects, and the
quarantine prevents fast recycling.

When I introduced CONFIG_SLUB_RCU_DEBUG, my intention was that enabling
CONFIG_SLUB_RCU_DEBUG should cause KASAN to mark such objects as freed
after an RCU grace period and put them on the quarantine, while disabling
CONFIG_SLUB_RCU_DEBUG should allow such objects to be reused immediately;
but that hasn't actually been working.

I discovered such a UAF bug involving SLAB_TYPESAFE_BY_RCU yesterday; I
could only trigger this bug in a KASAN build by disabling
CONFIG_SLUB_RCU_DEBUG and applying this patch.

Link: https://lkml.kernel.org/r/20250723-kasan-tsbrcu-noquarantine-v1-1-846c8645976c@google.com
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Alexander Potapenko <glider@google.com>
Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page-flags: remove folio_start_writeback_keepwrite()

Commit cd57b77197a4 ("ext4: Convert ext4_bio_write_page() to use a folio)
removed set_page_writeback_keepwrite() which was the last/only caller of
folio_start_writeback_keepwrite().

Link: https://lkml.kernel.org/r/20250722182230.2114587-1-joannelkoong@gmail.com
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests/mm: add process_madvise() tests

Add tests for process_madvise(), focusing on verifying behavior under
various conditions including valid usage and error cases.

[lianux.mm@gmail.com: v7]
Link: https://lkml.kernel.org/r/20250729113109.12272-1-lianux.mm@gmail.com
Link: https://lkml.kernel.org/r/20250729113109.12272-1-lianux.mm@gmail.com
Link: https://lkml.kernel.org/r/20250721114614.40996-1-lianux.mm@gmail.com
Signed-off-by: wang lian <lianux.mm@gmail.com>
Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Zi Yan <ziy@nvidia.com>
Suggested-by: Mark Brown <broonie@kernel.org>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Tested-by: Zi Yan <ziy@nvidia.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: shmem: fix the shmem large folio allocation for the i915 driver

After commit acd7ccb284b8 ("mm: shmem: add large folio support for
tmpfs"), we extend the 'huge=' option to allow any sized large folios for
tmpfs, which means tmpfs will allow getting a highest order hint based on
the size of write() and fallocate() paths, and then will try each
allowable large order.

However, when the i915 driver allocates shmem memory, it doesn't provide
hint information about the size of the large folio to be allocated,
resulting in the inability to allocate PMD-sized shmem, which in turn
affects GPU performance.

Patryk added:

: In my tests, the performance drop ranges from a few percent up to 13%
: in Unigine Superposition under heavy memory usage on the CPU Core Ultra
: 155H with the Xe 128 EU GPU. Other users have reported performance
: impact up to 30% on certain workloads. Please find more in the
: regressions reports:
: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14645
: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/13845
:
: I believe the change should be backported to all active kernel branches
: after version 6.12.

To fix this issue, we can use the inode's size as a write size hint in
shmem_read_folio_gfp() to help allocate PMD-sized large folios.

Link: https://lkml.kernel.org/r/f7e64e99a3a87a8144cc6b2f1dddf7a89c12ce44.1753926601.git.baolin.wang@linux.alibaba.com
Fixes: acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reported-by: Patryk Kowalczyk <patryk@kowalczyk.ws>
Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Tested-by: Patryk Kowalczyk <patryk@kowalczyk.ws>
Suggested-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

tools/getdelays: add backward compatibility for taskstats version

Add version checks to print_delayacct() to handle differences in struct
taskstats across kernel versions.  Field availability depends on taskstats
version (t->version), corresponding to TASKSTATS_VERSION in kernel headers
(see include/uapi/linux/taskstats.h).

Version feature mapping:
- version >= 11  - supports COMPACT statistics
- version >= 13  - supports WPCOPY statistics
- version >= 14  - supports IRQ statistics
- version >= 16  - supports *_max and *_min delay statistics

This ensures the tool works correctly with both older and newer kernel
versions by conditionally printing fields based on the reported version.

eg.1
bash# grep -r "#define TASKSTATS_VERSION" /usr/include/linux/taskstats.h
"#define TASKSTATS_VERSION       10"
bash# ./getdelays -d -p 1
CPU                 count     real total  virtual total    delay total  delay average
                     7481     3786181709     3807098291       36393725          0.005ms
IO                  count    delay total  delay average
                      369     1116046035          3.025ms
SWAP                count    delay total  delay average
                        0              0          0.000ms
RECLAIM             count    delay total  delay average
                        0              0          0.000ms
THRASHING           count    delay total  delay average
                        0              0          0.000ms

eg.2
bash# grep -r "#define TASKSTATS_VERSION" /usr/include/linux/taskstats.h
"#define TASKSTATS_VERSION       14"
bash# ./getdelays -d -p 1
CPU                 count     real total  virtual total    delay total  delay average
                    68862   163474790046   174584722267    19962496806          0.290ms
IO                  count    delay total  delay average
                        0              0          0.000ms
SWAP                count    delay total  delay average
                        0              0          0.000ms
RECLAIM             count    delay total  delay average
                        0              0          0.000ms
THRASHING           count    delay total  delay average
                        0              0          0.000ms
COMPACT             count    delay total  delay average
                        0              0          0.000ms
WPCOPY              count    delay total  delay average
                        0              0          0.000ms
IRQ                 count    delay total  delay average
                        0              0          0.000ms

Link: https://lkml.kernel.org/r/20250731225326549CttJ7g9NfjTlaqBwl015T@zte.com.cn
Signed-off-by: Fan Yu <fan.yu9@zte.com.cn>
Cc: Fan Yu <fan.yu9@zte.com.cn>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Wang Yaxin <wang.yaxin@zte.com.cn>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

kho: add test for kexec handover

Testing kexec handover requires a kernel driver that will generate some
data and preserve it with KHO on the first boot and then restore that data
and verify it was preserved properly after kexec.

To facilitate such test, along with the kernel driver responsible for data
generation, preservation and restoration add a script that runs a kernel
in a VM with a minimal /init. The /init enables KHO, loads a kernel image
for kexec and runs kexec reboot. After the boot of the kexeced kernel,
the driver verifies that the data was properly preserved.

[rppt@kernel.org: fix section mismatch]
Link: https://lkml.kernel.org/r/aIiRC8fXiOXKbPM_@kernel.org
Link: https://lkml.kernel.org/r/20250727083733.2590139-1-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

delaytop: enhance error logging and add PSI feature description

This patch improves error diagnostics and documentation for delaytop:

1) Enhanced error logging:
   - Added explicit error messages in critical failure paths
   - Implemented BOOL_FPRINT macro for robust output handling

2) PSI feature documentation:
   - Updated header comment to reflect PSI monitoring capability
   - Improved output formatting for PSI information

System Pressure Information: (avg10/avg60/avg300/total)
CPU some:       0.0%/   0.0%/   0.0%/     345(ms)
CPU full:       0.0%/   0.0%/   0.0%/       0(ms)
Memory full:    0.0%/   0.0%/   0.0%/       0(ms)
Memory some:    0.0%/   0.0%/   0.0%/       0(ms)
IO full:        0.0%/   0.0%/   0.0%/      65(ms)
IO some:        0.0%/   0.0%/   0.0%/      79(ms)
IRQ full:       0.0%/   0.0%/   0.0%/       0(ms)

Link: https://lkml.kernel.org/r/202507281628341752gMXCMN7S-Vz_LHYHum9r@zte.com.cn
Signed-off-by: Fan Yu <fan.yu9@zte.com.cn>
Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Acked-by: Yang Yang <yang.yang29@zte.com.cn>
Cc: Fan Yu <fan.yu9@zte.com.cn>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

samples: Kconfig: fix spelling mistake "instancess" -> "instances"

There is a spelling mistake in the SAMPLE_TRACE_ARRAY config. Fix it.

Link: https://lkml.kernel.org/r/20250724111715.141826-1-colin.i.king@gmail.com
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fat: fix too many log in fat_chain_add()

This log was excessive for a serial console. So use the ratelimited
version instead.

Link: https://lkml.kernel.org/r/87qzy611d9.fsf@mail.parknet.co.jp
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reported-by: syzbot+fa7ef54f66c189c04b73@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=fa7ef54f66c189c04b73
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

scripts/spelling.txt: add notifer||notifier to spelling.txt

This typo was not listed in scripts/spelling.txt, thus it was more
difficult to detect. Add it for convenience.

Link: https://lkml.kernel.org/r/02153C05ED7B49B7+20250722073431.21983-8-wangyuli@uniontech.com
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

xen/xenbus: fix typo "notifer"

There is a spelling mistake of 'notifer' in the comment which
should be 'notifier'.

Link: https://lkml.kernel.org/r/C6633C66376C709A+20250722073431.21983-7-wangyuli@uniontech.com
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

net: mvneta: fix typo "notifer"

There is a spelling mistake of 'notifer' in the comment which
should be 'notifier'.

Link: https://lkml.kernel.org/r/0CB4300CB6F49007+20250722073431.21983-4-wangyuli@uniontech.com
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drm/xe: fix typo "notifer"

There is a spelling mistake of 'notifer' in the comment which
should be 'notifier'.

Link: https://lkml.kernel.org/r/94190C5F54A19F3E+20250722073431.21983-3-wangyuli@uniontech.com
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

cxl: mce: fix typo "notifer"

According to the context, "mce_notifer" should be "mce_notifier".

Link: https://lkml.kernel.org/r/E1EB1BA9FDF07D53+20250722073431.21983-2-wangyuli@uniontech.com
Fixes: 516e5bd0b6bf ("cxl: Add mce notifier to emit aliased address for extended linear cache")
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

KVM: x86: fix typo "notifer"

Patch series "treewide: Fix typo "notifer"", v3.

There are some spelling mistakes of 'notifer' in comments which
should be 'notifier'.

Fix them and add it to scripts/spelling.txt.

This patch (of 8):

There are some spelling mistakes of 'notifer' which should be 'notifier'.

Link: https://lkml.kernel.org/r/576F0D85F6853074+20250722072734.19367-1-wangyuli@uniontech.com
Link: https://lkml.kernel.org/r/7F05778C3A1A9F8B+20250722073431.21983-1-wangyuli@uniontech.com
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

MAINTAINERS: add maintainers for delaytop

The delaytop tool supports showing system delays and task-level delays,
effectively identifying the top-n tasks with high latency in the system,
which is highly beneficial for improving system performance. Wang Yaxin
and her colleague Fan Yu focus on locating system delay issues. To
promote the thriving development of delaytop, we hope to serve as
maintainers to continuously improve it, aiming to provide a more effective
solution for system latency issues in the future.

Link: https://lkml.kernel.org/r/20250721094049958ImB8XG_imntcPqpQn1KfG@zte.com.cn
Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Signed-off-by: Fan Yu <fan.yu9@zte.com.cn>
Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ucount: use atomic_long_try_cmpxchg() in atomic_long_inc_below()

Use atomic_long_try_cmpxchg() instead of
atomic_long_cmpxchg (*ptr, old, new) == old in atomic_long_inc_below().
x86 CMPXCHG instruction returns success in ZF flag, so this change saves
a compare after cmpxchg (and related move instruction in front of cmpxchg).

Also, atomic_long_try_cmpxchg implicitly assigns old *ptr value to "old"
when cmpxchg fails, enabling further code simplifications.

No functional change intended.

Link: https://lkml.kernel.org/r/20250721174610.28361-2-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: Alexey Gladkov <legion@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Alexey Gladkov <legion@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: MengEn Sun <mengensun@tencent.com>
Cc: "Thomas Weißschuh" <linux@weissschuh.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ucount: fix atomic_long_inc_below() argument type

The type of u argument of atomic_long_inc_below() should be long to avoid
unwanted truncation to int.

The patch fixes the wrong argument type of an internal function to
prevent unwanted argument truncation. It fixes an internal locking
primitive; it should not have any direct effect on userspace.

Mark said

: AFAICT there's no problem in practice because atomic_long_inc_below()
: is only used by inc_ucount(), and it looks like the value is
: constrained between 0 and INT_MAX.
:
: In inc_ucount() the limit value is taken from
: user_namespace::ucount_max[], and AFAICT that's only written by
: sysctls, to the table setup by setup_userns_sysctls(), where
: UCOUNT_ENTRY() limits the value between 0 and INT_MAX.
:
: This is certainly a cleanup, but there might be no functional issue in
: practice as above.

Link: https://lkml.kernel.org/r/20250721174610.28361-1-ubizjak@gmail.com
Fixes: f9c82a4ea89c ("Increase size of ucounts to atomic_long_t")
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Alexey Gladkov <legion@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: MengEn Sun <mengensun@tencent.com>
Cc: "Thomas Weißschuh" <linux@weissschuh.net>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

kexec: enable CMA based contiguous allocation

When booting a new kernel with kexec_file, the kernel picks a target
location that the kernel should live at, then allocates random pages,
checks whether any of those patches magically happens to coincide with a
target address range and if so, uses them for that range.

For every page allocated this way, it then creates a page list that the
relocation code - code that executes while all CPUs are off and we are
just about to jump into the new kernel - copies to their final memory
location.  We can not put them there before, because chances are pretty
good that at least some page in the target range is already in use by the
currently running Linux environment.  Copying is happening from a single
CPU at RAM rate, which takes around 4-50 ms per 100 MiB.

All of this is inefficient and error prone.

To successfully kexec, we need to quiesce all devices of the outgoing
kernel so they don't scribble over the new kernel's memory.  We have seen
cases where that does not happen properly (*cough* GIC *cough*) and hence
the new kernel was corrupted.  This started a month long journey to root
cause failing kexecs to eventually see memory corruption, because the new
kernel was corrupted severely enough that it could not emit output to tell
us about the fact that it was corrupted.  By allocating memory for the
next kernel from a memory range that is guaranteed scribbling free, we can
boot the next kernel up to a point where it is at least able to detect
corruption and maybe even stop it before it becomes severe.  This
increases the chance for successful kexecs.

Since kexec got introduced, Linux has gained the CMA framework which can
perform physically contiguous memory mappings, while keeping that memory
available for movable memory when it is not needed for contiguous
allocations.  The default CMA allocator is for DMA allocations.

This patch adds logic to the kexec file loader to attempt to place the
target payload at a location allocated from CMA.  If successful, it uses
that memory range directly instead of creating copy instructions during
the hot phase.  To ensure that there is a safety net in case anything goes
wrong with the CMA allocation, it also adds a flag for user space to force
disable CMA allocations.

Using CMA allocations has two advantages:

  1) Faster by 4-50 ms per 100 MiB. There is no more need to copy in the
     hot phase.
  2) More robust. Even if by accident some page is still in use for DMA,
     the new kernel image will be safe from that access because it resides
     in a memory region that is considered allocated in the old kernel and
     has a chance to reinitialize that component.

Link: https://lkml.kernel.org/r/20250610085327.51817-1-graf@amazon.com
Signed-off-by: Alexander Graf <graf@amazon.com>
Acked-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Zhongkun He <hezhongkun.hzk@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

stackdepot: make max number of pools boot-time configurable

We're hitting the WARN in depot_init_pool() about reaching the stack depot
limit because we have long stacks that don't dedup very well.

Introduce a new start-up parameter to allow users to set the number of
maximum stack depot pools.

Link: https://lkml.kernel.org/r/20250718153928.94229-1-matt@readmodwrite.com
Signed-off-by: Matt Fleming <mfleming@cloudflare.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

lib/xxhash: remove unused functions

xxh32_digest() and xxh32_update() were added in 2017 in the original
xxhash commit, but have remained unused.

Remove them.

Link: https://lkml.kernel.org/r/20250716133245.243363-1-linux@treblig.org
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Dave Gilbert <linux@treblig.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

init/Kconfig: restore CONFIG_BROKEN help text

Linus added it in 2003, it later was removed. Put it back.

Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/shmem, swap: improve cached mTHP handling and fix potential hang

The current swap-in code assumes that, when a swap entry in shmem mapping
is order 0, its cached folios (if present) must be order 0 too, which
turns out not always correct.

The problem is shmem_split_large_entry is called before verifying the
folio will eventually be swapped in, one possible race is:

    CPU1                          CPU2
shmem_swapin_folio
/* swap in of order > 0 swap entry S1 */
  folio = swap_cache_get_folio
  /* folio = NULL */
  order = xa_get_order
  /* order > 0 */
  folio = shmem_swap_alloc_folio
  /* mTHP alloc failure, folio = NULL */
  <... Interrupted ...>
                                 shmem_swapin_folio
                                 /* S1 is swapped in */
                                 shmem_writeout
                                 /* S1 is swapped out, folio cached */
  shmem_split_large_entry(..., S1)
  /* S1 is split, but the folio covering it has order > 0 now */

Now any following swapin of S1 will hang: `xa_get_order` returns 0, and
folio lookup will return a folio with order > 0.  The
`xa_get_order(&mapping->i_pages, index) != folio_order(folio)` will always
return false causing swap-in to return -EEXIST.

And this looks fragile.  So fix this up by allowing seeing a larger folio
in swap cache, and check the whole shmem mapping range covered by the
swapin have the right swap value upon inserting the folio.  And drop the
redundant tree walks before the insertion.

This will actually improve performance, as it avoids two redundant Xarray
tree walks in the hot path, and the only side effect is that in the
failure path, shmem may redundantly reallocate a few folios causing
temporary slight memory pressure.

And worth noting, it may seems the order and value check before inserting
might help reducing the lock contention, which is not true.  The swap
cache layer ensures raced swapin will either see a swap cache folio or
failed to do a swapin (we have SWAP_HAS_CACHE bit even if swap cache is
bypassed), so holding the folio lock and checking the folio flag is
already good enough for avoiding the lock contention.  The chance that a
folio passes the swap entry value check but the shmem mapping slot has
changed should be very low.

Link: https://lkml.kernel.org/r/20250728075306.12704-1-ryncsn@gmail.com
Link: https://lkml.kernel.org/r/20250728075306.12704-2-ryncsn@gmail.com
Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Merge tag 'fbdev-for-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev

Pull fbdev updates from Helge Deller:
"One potential buffer overflow fix in the framebuffer registration
  function, some fixes for the imxfb, nvidiafb and simplefb drivers, and
  a bunch of cleanups for fbcon, kyrofb and svgalib.

  Framework fixes:
   - fix potential buffer overflow in do_register_framebuffer() [Yongzhen Zhang]

  Driver fixes:
   - imxfb: prevent null-ptr-deref [Chenyuan Yang]
   - nvidiafb: fix build on 32-bit ARCH=um [Johannes Berg]
   - nvidiafb: add depends on HAS_IOPORT [Randy Dunlap]
   - simplefb: Use of_reserved_mem_region_to_resource() for "memory-region" [Rob Herring]

  Cleanups:
   - fbcon: various code cleanups wrt blinking [Ville Syrjälä]
   - kyrofb: Convert to devm_*() functions [Giovanni Di Santi]
   - svgalib: Coding style cleanups [Darshan R.]
   - Fix typo in Kconfig text for FB_DEVICE [Daniel Palmer]"

* tag 'fbdev-for-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev:
  fbcon: Use 'bool' where appopriate
  fbcon: Introduce get_{fg,bg}_color()
  fbcon: fbcon_is_inactive() -> fbcon_is_active()
  fbcon: fbcon_cursor_noblink -> fbcon_cursor_blink
  fbdev: Fix typo in Kconfig text for FB_DEVICE
  fbdev: imxfb: Check fb_add_videomode to prevent null-ptr-deref
  fbdev: svgalib: Clean up coding style
  fbdev: kyro: Use devm_ioremap_wc() for screen mem
  fbdev: kyro: Use devm_ioremap() for mmio registers
  fbdev: kyro: Add missing PCI memory region request
  fbdev: simplefb: Use of_reserved_mem_region_to_resource() for "memory-region"
  fbdev: fix potential buffer overflow in do_register_framebuffer()
  fbdev: nvidiafb: add depends on HAS_IOPORT
  fbdev: nvidiafb: fix build on 32-bit ARCH=um

Merge tag 'firewire-updates-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394

Pull firewire updates from Takashi Sakamoto:
"This update replaces the remaining tasklet usage in the FireWire
  subsystem with workqueue for asynchronous packet transmission. With
  this change, tasklets are now fully eliminated from the subsystem.

  Asynchronous packet transmission is used for serial bus topology
  management as well as for the operation of the SBP-2 protocol driver
  (firewire-sbp2). To ensure reliability during low-memory conditions,
  the associated workqueue is created with the WQ_MEM_RECLAIM flag,
  allowing it to participate in memory reclaim paths. Other attributes
  are aligned with those used for isochronous packet handling, which was
  migrated to workqueues in v6.12.

  The workqueues are sleepable and support preemptible work items,
  making them more suitable for real-time workloads that benefit from
  timely task preemption at the system level.

  There remains an issue where 'schedule()' may be called within an RCU
  read-side critical section, due to a direct replacement of
  'tasklet_disable_in_atomic()' with 'disable_work_sync()'. A proposed
  fix for this has been posted[1], and is currently under review and
  testing. It is expected to be sent upstream later"

Link: https://lore.kernel.org/lkml/20250728015125.17825-1-o-takashi@sakamocchi.jp/
* tag 'firewire-updates-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
  firewire: ohci: reduce the size of common context structure by extracting members into AT structure
  firewire: core: minor code refactoring to localize table of gap count
  firewire: ohci: use workqueue to handle events of AT request/response contexts
  firewire: ohci: use workqueue to handle events of AR request/response contexts
  firewire: core: allocate workqueue for AR/AT request/response contexts
  firewire: core: use from_work() macro to expand parent structure of work_struct
  firewire: ohci: use from_work() macro to expand parent structure of work_struct
  firewire: ohci: correct code comments about bus_reset tasklet

bpf: Fix memory leak of bpf_scc_info objects

env->scc_info array contains references to bpf_scc_info objects
allocated lazily in verifier.c:scc_visit_alloc().
env->scc_cnt was supposed to track env->scc_info array size
in order to free referenced objects in verifier.c:free_states().
Fix initialization of env->scc_cnt that was omitted in
verifier.c:compute_scc().

To reproduce the bug:
- build with CONFIG_DEBUG_KMEMLEAK
- boot and load bpf program with loops, e.g.:
  ./veristat -q pyperf180.bpf.o
- initiate memleak scan and check results:
  echo scan > /sys/kernel/debug/kmemleak
  cat /sys/kernel/debug/kmemleak

Fixes: c9e31900b54c ("bpf: propagate read/precision marks over state graph backedges")
Reported-by: Jens Axboe <axboe@kernel.dk>
Closes: https://lore.kernel.org/bpf/CAADnVQKXUWg9uRCPD5ebRXwN4dmBCRUFFM7kN=GxymYz3zU25A@mail.gmail.com/T/
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250801232330.1800436-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

ALSA: usb-audio: Don't use printk_ratelimit for debug prints

printk_ratelimit is deprecated, since it shares state with all other
printk sites. Additionally, the suppression message is printed at
warning level even though the actual messages are printed at debug and
are (usually) invisible! This can result in thousands of messages like

retire_capture_urb: 4992 callbacks suppressed

in the console, and can inhibit debugging since it is unclear what the
source of the suppressed callbacks is.

Switch to dev_dbg_ratelimited which doesn't print anything unless debug
is enabled.

Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Link: https://patch.msgid.link/20250801154710.739464-1-sean.anderson@linux.dev
Signed-off-by: Takashi Iwai <tiwai@suse.de>

futex: Move futex cleanup to __mmdrop()

Futex hash allocations are done in mm_init() and the cleanup happens in
__mmput(). That works most of the time, but there are mm instances which
are instantiated via mm_alloc() and freed via mmdrop(), which causes the
futex hash to be leaked.

Move the cleanup to __mmdrop().

Fixes: 56180dd20c19 ("futex: Use RCU-based per-CPU reference counting instead of rcuref_t")
Reported-by: André Draszik <andre.draszik@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: André Draszik <andre.draszik@linaro.org>
Link: https://lore.kernel.org/all/87ldo5ihu0.ffs@tglx
Closes: https://lore.kernel.org/all/0c8cc83bb73abf080faf584f319008b67d0931db.camel@linaro.org

smp: Fix spelling in on_each_cpu_cond_mask()'s doc-comment

"boolean" is spelt as "blooean". Fix that.

Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250722161818.6139-1-romank@linux.microsoft.com

Merge tag 'bpf-next-6.17' into loongarch-next

LoongArch architecture changes for 6.17 have many bpf features such as
trampoline, so merge 'bpf-next-6.17' to create a base to make bpf work
well.

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Alexei Starovoitov:

- Fix kCFI failures in JITed BPF code on arm64 (Sami Tolvanen, Puranjay
   Mohan, Mark Rutland, Maxwell Bland)

- Disallow tail calls between BPF programs that use different cgroup
   local storage maps to prevent out-of-bounds access (Daniel Borkmann)

- Fix unaligned access in flow_dissector and netfilter BPF programs
   (Paul Chaignon)

- Avoid possible use of uninitialized mod_len in libbpf (Achill
   Gilgenast)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Test for unaligned flow_dissector ctx access
  bpf: Improve ctx access verifier error message
  bpf: Check netfilter ctx accesses are aligned
  bpf: Check flow_dissector ctx accesses are aligned
  arm64/cfi,bpf: Support kCFI + BPF on arm64
  cfi: Move BPF CFI types and helpers to generic code
  cfi: add C CFI type macro
  libbpf: Avoid possible use of uninitialized mod_len
  bpf: Fix oob access in cgroup local storage
  bpf: Move cgroup iterator helpers to bpf.h
  bpf: Move bpf map owner out of common struct
  bpf: Add cookie object to bpf maps

Merge tag 'perf-tools-for-v6.17-2025-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools

Pull perf tools updates from Namhyung Kim:
"Build-ID processing goodies:

     Build-IDs are content based hashes to link regions of memory to ELF
     files in post processing. They have been available in distros for
     quite a while:

       $ file /bin/bash
       /bin/bash: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV),
       dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2,
       BuildID[sha1]=707a1c670cd72f8e55ffedfbe94ea98901b7ce3a,
       for GNU/Linux 3.2.0, stripped

     It is possible to ask the kernel to get it from mmap executable
     backing storage at time they are being put in place and send it as
     metadata at that moment to have in perf.data.

     Prefer that across the board to speed up 'record' time - it post
     processes the samples to find binaries touched by any samples and
     to save them with build-ID. It can skip reading build-ID in
     userspace if it comes from the kernel.

  perf record:

   * Make --buildid-mmap default.  The kernel can generate MMAP2 events
     with a build-ID from ELF header.  Use that by default instead of using
     inode and device ID to identify binaries.  It also can be disabled
     with --no-buildid-mmap.

   * Use BPF for -u/--uid option to sample processes belong to a user.
     BPF can track user processes more accurately and the existing logic
     often fails to get the list of processes due to race with reading the
     /proc filesystem.

   * Generate PERF_RECORD_BPF_METADATA when it profiles BPF programs and
     they have variables starting with "bpf_metadata_".  This will help to
     identify BPF objects used in the profile.  This has been supported in
     bpftool for some time and allows the recording of metadata such as
     commit hashes, versions, etc, that now gets recorded in perf.data as
     well.

   * Collect list of DSOs touched in the sample callchains as well as in
     the sample itself.  This would increase the processing time at the end
     of record, but can improve the data quality.

  perf stat:

   * Add a new 'drm' pseudo-PMU support like in 'hwmon'.  It can collect
     DRM usage stats using fdinfo in /proc.

     On my Intel laptop, it shows like below:

       $ perf list drm
       ...

       drm:
         drm-active-stolen-system0
              [Total memory active in one or more engines. Unit: drm_i915]
         drm-active-system0
              [Total memory active in one or more engines. Unit: drm_i915]
         drm-engine-capacity-video
              [Engine capacity. Unit: drm_i915]
         drm-engine-copy
              [Utilization in ns. Unit: drm_i915]
         drm-engine-render
              [Utilization in ns. Unit: drm_i915]
         drm-engine-video
              [Utilization in ns. Unit: drm_i915]
         ...

       $ sudo perf stat -a -e drm-engine-render,drm-engine-video,drm-engine-capacity-video sleep 1

        Performance counter stats for 'system wide':

       48,137,316,988,873 ns       drm-engine-render
           34,452,696,746 ns       drm-engine-video
                       20 capacity drm-engine-capacity-video

              1.002086194 seconds time elapsed

  perf list

   * Add description for software events.  The description is in JSON format
     and the event parser now can handle the software events like others
     (for example, it's case-insensitive and subject to wildcard matching).

       $ perf list software

       List of pre-defined events (to be used in -e or -M):

       software:
         alignment-faults
              [Number of kernel handled memory alignment faults. Unit: software]
         bpf-output
              [An event used by BPF programs to write to the perf ring buffer. Unit: software]
         cgroup-switches
              [Number of context switches to a task in a different cgroup. Unit: software]
         context-switches
              [Number of context switches [This event is an alias of cs]. Unit: software]
         cpu-clock
              [Per-CPU high-resolution timer based event. Unit: software]
         cpu-migrations
              [Number of times a process has migrated to a new CPU [This event is an alias of migrations]. Unit: software]
         cs
              [Number of context switches [This event is an alias of context-switches]. Unit: software]
         dummy
              [A placeholder event that doesn't count anything. Unit: software]
         emulation-faults
              [Number of kernel handled unimplemented instruction faults handled through emulation. Unit: software]
         faults
              [Number of page faults [This event is an alias of page-faults]. Unit: software]
         major-faults
              [Number of major page faults. Major faults require I/O to handle. Unit: software]
         migrations
              [Number of times a process has migrated to a new CPU [This event is an alias of cpu-migrations]. Unit: software]
         minor-faults
              [Number of minor page faults. Minor faults don't require I/O to handle. Unit: software]
         page-faults
              [Number of page faults [This event is an alias of faults]. Unit: software]
         task-clock
              [Per-task high-resolution timer based event. Unit: software]

  perf ftrace:

   * Add -e/--events option to perf ftrace latency to measure latency
     between the two events instead of a function.

       $ sudo perf ftrace latency -ab -e i915_request_wait_begin,i915_request_wait_end --hide-empty -- sleep 1
       #   DURATION     |      COUNT | GRAPH                                |
          256 -  512 us |          4 | ######                               |
            2 -    4 ms |          2 | ###                                  |
            4 -    8 ms |         12 | ###################                  |
            8 -   16 ms |         10 | ################                     |

       # statistics  (in usec)
         total time:               194915
           avg time:                 6961
           max time:                12855
           min time:                  373
              count:                   28

   * Add new function graph tracer options (--graph-opts) to display more
     info like arguments and return value.  They will be passed to the
     kernel ftrace directly.

       $ sudo perf ftrace -G vfs_write --graph-opts retval,retaddr
       # tracer: function_graph
       #
       # CPU  DURATION                  FUNCTION CALLS
       # |     |   |                     |   |   |   |
       ...
       5)               |  mutex_unlock() { /* <-rb_simple_write+0xda/0x150 */
       5)   0.188 us    |    local_clock(); /* <-lock_release+0x2ad/0x440 ret=0x3bf2a3cf90e */
       5)               |    rt_mutex_slowunlock() { /* <-rb_simple_write+0xda/0x150 */
       5)               |      _raw_spin_lock_irqsave() { /* <-rt_mutex_slowunlock+0x4f/0x200 */
       5)   0.123 us    |        preempt_count_add(); /* <-_raw_spin_lock_irqsave+0x23/0x90 ret=0x0 */
       5)   0.128 us    |        local_clock(); /* <-__lock_acquire.isra.0+0x17a/0x740 ret=0x3bf2a3cfc8b */
       5)   0.086 us    |        do_raw_spin_trylock(); /* <-_raw_spin_lock_irqsave+0x4a/0x90 ret=0x1 */
       5)   0.845 us    |      } /* _raw_spin_lock_irqsave ret=0x292 */
       ...

  Misc:

   * Add perf archive --exclude-buildids <FILE> option to skip some binaries.
     The format of the FILE should be same as an output of perf buildid-list.

   * Get rid of dependency of libcrypto.  It was just to get SHA-1 hash so
     implement it directly like in the kernel.  A side effect is that it
     needs -fno-strict-aliasing compiler option (again, like in the kernel).

   * Convert all shell script tests to use bash"

* tag 'perf-tools-for-v6.17-2025-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (179 commits)
  perf record: Cache build-ID of hit DSOs only
  perf test: Ensure lock contention using pipe mode
  perf python: Stop using deprecated PyUnicode_AsString()
  perf list: Skip ABI PMUs when printing pmu values
  perf list: Remove tracepoint printing code
  perf tp_pmu: Add event APIs
  perf tp_pmu: Factor existing tracepoint logic to new file
  perf parse-events: Remove non-json software events
  perf jevents: Add common software event json
  perf tools: Remove libtraceevent in .gitignore
  perf test: Fix comment ordering
  perf sort: Use perf_env to set arch sort keys and header
  perf test: Move PERF_SAMPLE_WEIGHT_STRUCT parsing to common test
  perf sample: Remove arch notion of sample parsing
  perf env: Remove global perf_env
  perf trace: Avoid global perf_env with evsel__env
  perf auxtrace: Pass perf_env from session through to mmap read
  perf machine: Explicitly pass in host perf_env
  perf bench synthesize: Avoid use of global perf_env
  perf top: Make perf_env locally scoped
  ...

Merge tag 'parisc-for-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux

Pull parisc updates from Helge Deller:

- The parisc kernel wrongly allows reading from read-protected
   userspace memory without faulting, e.g. when userspace uses
   mprotect() to read-protect a memory area and then uses a pointer to
   this memory in a write(2, addr, 1) syscall.

   To fix this issue, Dave Anglin developed a set of patches which use
   the proberi assembler instruction to additionally check read access
   permissions at runtime.

- Randy Dunlap contributed two patches to fix a minor typo and to
   explain why a 32-bit compiler is needed although a 64-bit kernel is
   built

* tag 'parisc-for-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Revise __get_user() to probe user read access
  parisc: Revise gateway LWS calls to probe user read access
  parisc: Drop WARN_ON_ONCE() from flush_cache_vmap
  parisc: Try to fixup kernel exception in bad_area_nosemaphore path of do_page_fault()
  parisc: Define and use set_pte_at()
  parisc: Rename pte_needs_flush() to pte_needs_cache_flush() in cache.c
  parisc: Check region is readable by user in raw_copy_from_user()
  parisc: Update comments in make_insert_tlb
  parisc: Makefile: explain that 64BIT requires both 32-bit and 64-bit compilers
  parisc: Makefile: fix a typo in palo.conf

tracing: Have unsigned int function args displayed as hexadecimal

Most function arguments that are passed in as unsigned int or unsigned
long are better displayed as hexadecimal than normal integer. For example,
the functions:

static void __create_object(unsigned long ptr, size_t size,
int min_count, gfp_t gfp, unsigned int objflags);

static bool stack_access_ok(struct unwind_state *state, unsigned long _addr,
    size_t len);

void __local_bh_disable_ip(unsigned long ip, unsigned int cnt);

Show up in the trace as:

    __create_object(ptr=-131387050520576, size=4096, min_count=1, gfp=3264, objflags=0) <-kmem_cache_alloc_noprof
    stack_access_ok(state=0xffffc9000233fc98, _addr=-60473102566256, len=8) <-unwind_next_frame
    __local_bh_disable_ip(ip=-2127311112, cnt=256) <-handle_softirqs

Instead, by displaying unsigned as hexadecimal, they look more like this:

    __create_object(ptr=0xffff8881028d2080, size=0x280, min_count=1, gfp=0x82820, objflags=0x0) <-kmem_cache_alloc_node_noprof
    stack_access_ok(state=0xffffc90000003938, _addr=0xffffc90000003930, len=0x8) <-unwind_next_frame
    __local_bh_disable_ip(ip=0xffffffff8133cef8, cnt=0x100) <-handle_softirqs

Which is much easier to understand as most unsigned longs are usually just
pointers. Even the "unsigned int cnt" in __local_bh_disable_ip() looks
better as hexadecimal as a lot of flags are passed as unsigned.

Changes since v2: https://lore.kernel.org/20250801111453.01502861@gandalf.local.home

- Use btf_int_encoding() instead of open coding it (Martin KaFai Lau)

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Douglas Raillard <douglas.raillard@arm.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Link: https://lore.kernel.org/20250801165601.7770d65c@gandalf.local.home
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

Merge tag 'cxl-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl

Pull CXL updates from Dave Jiang:
"The most significant changes in this pull request is the series that
  introduces ACQUIRE() and ACQUIRE_ERR() macros to replace conditional
  locking and ease the pain points of scoped_cond_guard().

  The series also includes follow on changes that refactor the CXL
  sub-system to utilize the new macros.

  Detail summary:

   - Add documentation template for CXL conventions to document CXL
     platform quirks

   - Replace mutex_lock_io() with mutex_lock() for mailbox

   - Add location limit for fake CFMWS range for cxl_test, ARM platform
     enabling

   - CXL documentation typo and clarity fixes

   - Use correct format specifier for function cxl_set_ecs_threshold()

   - Make cxl_bus_type constant

   - Introduce new helper cxl_resource_contains_addr() to check address
     availability

   - Fix wrong DPA checking for PPR operation

   - Remove core/acpi.c and CXL core dependency on ACPI

   - Introduce ACQUIRE() and ACQUIRE_ERR() for conditional locks

   - Add CXL updates utilizing ACQUIRE() macro to remove gotos and
     improve readability

   - Add return for the dummy version of cxl_decoder_detach() without
     CONFIG_CXL_REGION

   - CXL events updates for spec r3.2

   - Fix return of __cxl_decoder_detach() error path

   - CXL debugfs documentation fix"

* tag 'cxl-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (28 commits)
  Documentation/ABI/testing/debugfs-cxl: Add 'cxl' to clear_poison path
  cxl/region: Fix an ERR_PTR() vs NULL bug
  cxl/events: Trace Memory Sparing Event Record
  cxl/events: Add extra validity checks for CVME count in DRAM Event Record
  cxl/events: Add extra validity checks for corrected memory error count in General Media Event Record
  cxl/events: Update Common Event Record to CXL spec rev 3.2
  cxl: Fix -Werror=return-type in cxl_decoder_detach()
  cleanup: Fix documentation build error for ACQUIRE updates
  cxl: Convert to ACQUIRE() for conditional rwsem locking
  cxl/region: Consolidate cxl_decoder_kill_region() and cxl_region_detach()
  cxl/region: Move ready-to-probe state check to a helper
  cxl/region: Split commit_store() into __commit() and queue_reset() helpers
  cxl/decoder: Drop pointless locking
  cxl/decoder: Move decoder register programming to a helper
  cxl/mbox: Convert poison list mutex to ACQUIRE()
  cleanup: Introduce ACQUIRE() and ACQUIRE_ERR() for conditional locks
  cxl: Remove core/acpi.c and cxl core dependency on ACPI
  cxl/core: Using cxl_resource_contains_addr() to check address availability
  cxl/edac: Fix wrong dpa checking for PPR operation
  cxl/core: Introduce a new helper cxl_resource_contains_addr()
  ...

net: Add locking to protect skb->dev access in ip_output

In ip_output() skb->dev is updated from the skb_dst(skb)->dev
this can become invalid when the interface is unregistered and freed,

Introduced new skb_dst_dev_rcu() function to be used instead of
skb_dst_dev() within rcu_locks in ip_output.This will ensure that
all the skb's associated with the dev being deregistered will
be transnmitted out first, before freeing the dev.

Given that ip_output() is called within an rcu_read_lock()
critical section or from a bottom-half context, it is safe to introduce
an RCU read-side critical section within it.

Multiple panic call stacks were observed when UL traffic was run
in concurrency with device deregistration from different functions,
pasting one sample for reference.

[496733.627565][T13385] Call trace:
[496733.627570][T13385] bpf_prog_ce7c9180c3b128ea_cgroupskb_egres+0x24c/0x7f0
[496733.627581][T13385] __cgroup_bpf_run_filter_skb+0x128/0x498
[496733.627595][T13385] ip_finish_output+0xa4/0xf4
[496733.627605][T13385] ip_output+0x100/0x1a0
[496733.627613][T13385] ip_send_skb+0x68/0x100
[496733.627618][T13385] udp_send_skb+0x1c4/0x384
[496733.627625][T13385] udp_sendmsg+0x7b0/0x898
[496733.627631][T13385] inet_sendmsg+0x5c/0x7c
[496733.627639][T13385] __sys_sendto+0x174/0x1e4
[496733.627647][T13385] __arm64_sys_sendto+0x28/0x3c
[496733.627653][T13385] invoke_syscall+0x58/0x11c
[496733.627662][T13385] el0_svc_common+0x88/0xf4
[496733.627669][T13385] do_el0_svc+0x2c/0xb0
[496733.627676][T13385] el0_svc+0x2c/0xa4
[496733.627683][T13385] el0t_64_sync_handler+0x68/0xb4
[496733.627689][T13385] el0t_64_sync+0x1a4/0x1a8

Changes in v3:
- Replaced WARN_ON() with WARN_ON_ONCE(), as suggested by Willem de Bruijn.
- Dropped legacy lines mistakenly pulled in from an outdated branch.

Changes in v2:
- Addressed review comments from Eric Dumazet
- Used READ_ONCE() to prevent potential load/store tearing
- Added skb_dst_dev_rcu() and used along with rcu_read_lock() in ip_output

Signed-off-by: Sharath Chandra Vurukala <quic_sharathv@quicinc.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250730105118.GA26100@hu-sharathv-hyd.qualcomm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: taprio: enforce minimum value for picos_per_byte

Syzbot reported a WARNING in taprio_get_start_time().

When link speed is 470,589 or greater, q->picos_per_byte becomes too
small, causing length_to_duration(q, ETH_ZLEN) to return zero.

This zero value leads to validation failures in fill_sched_entry() and
parse_taprio_schedule(), allowing arbitrary values to be assigned to
entry->interval and cycle_time. As a result, sched->cycle can become zero.

Since SPEED_800000 is the largest defined speed in
include/uapi/linux/ethtool.h, this issue can occur in realistic scenarios.

To ensure length_to_duration() returns a non-zero value for minimum-sized
Ethernet frames (ETH_ZLEN = 60), picos_per_byte must be at least 17
(60 * 17 > PSEC_PER_NSEC which is 1000).

This patch enforces a minimum value of 17 for picos_per_byte when the
calculated value would be lower, and adds a warning message to inform
users that scheduling accuracy may be affected at very high link speeds.

Fixes: fb66df20a720 ("net/sched: taprio: extend minimum interval restriction to entire cycle too")
Reported-by: syzbot+398e1ee4ca2cac05fddb@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=398e1ee4ca2cac05fddb
Signed-off-by: Takamitsu Iwai <takamitz@amazon.co.jp>
Link: https://patch.msgid.link/20250728173149.45585-1-takamitz@amazon.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: drop UFO packets in udp_rcv_segment()

When sending a packet with virtio_net_hdr to tun device, if the gso_type
in virtio_net_hdr is SKB_GSO_UDP and the gso_size is less than udphdr
size, below crash may happen.

  ------------[ cut here ]------------
  kernel BUG at net/core/skbuff.c:4572!
  Oops: invalid opcode: 0000 [#1] SMP NOPTI
  CPU: 0 UID: 0 PID: 62 Comm: mytest Not tainted 6.16.0-rc7 #203 PREEMPT(voluntary)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  RIP: 0010:skb_pull_rcsum+0x8e/0xa0
  Code: 00 00 5b c3 cc cc cc cc 8b 93 88 00 00 00 f7 da e8 37 44 38 00 f7 d8 89 83 88 00 00 00 48 8b 83 c8 00 00 00 5b c3 cc cc cc cc <0f> 0b 0f 0b 66 66 2e 0f 1f 84 00 000
  RSP: 0018:ffffc900001fba38 EFLAGS: 00000297
  RAX: 0000000000000004 RBX: ffff8880040c1000 RCX: ffffc900001fb948
  RDX: ffff888003e6d700 RSI: 0000000000000008 RDI: ffff88800411a062
  RBP: ffff8880040c1000 R08: 0000000000000000 R09: 0000000000000001
  R10: ffff888003606c00 R11: 0000000000000001 R12: 0000000000000000
  R13: ffff888004060900 R14: ffff888004050000 R15: ffff888004060900
  FS:  000000002406d3c0(0000) GS:ffff888084a19000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000020000040 CR3: 0000000004007000 CR4: 00000000000006f0
  Call Trace:
   <TASK>
   udp_queue_rcv_one_skb+0x176/0x4b0 net/ipv4/udp.c:2445
   udp_queue_rcv_skb+0x155/0x1f0 net/ipv4/udp.c:2475
   udp_unicast_rcv_skb+0x71/0x90 net/ipv4/udp.c:2626
   __udp4_lib_rcv+0x433/0xb00 net/ipv4/udp.c:2690
   ip_protocol_deliver_rcu+0xa6/0x160 net/ipv4/ip_input.c:205
   ip_local_deliver_finish+0x72/0x90 net/ipv4/ip_input.c:233
   ip_sublist_rcv_finish+0x5f/0x70 net/ipv4/ip_input.c:579
   ip_sublist_rcv+0x122/0x1b0 net/ipv4/ip_input.c:636
   ip_list_rcv+0xf7/0x130 net/ipv4/ip_input.c:670
   __netif_receive_skb_list_core+0x21d/0x240 net/core/dev.c:6067
   netif_receive_skb_list_internal+0x186/0x2b0 net/core/dev.c:6210
   napi_complete_done+0x78/0x180 net/core/dev.c:6580
   tun_get_user+0xa63/0x1120 drivers/net/tun.c:1909
   tun_chr_write_iter+0x65/0xb0 drivers/net/tun.c:1984
   vfs_write+0x300/0x420 fs/read_write.c:593
   ksys_write+0x60/0xd0 fs/read_write.c:686
   do_syscall_64+0x50/0x1c0 arch/x86/entry/syscall_64.c:63
   </TASK>

To trigger gso segment in udp_queue_rcv_skb(), we should also set option
UDP_ENCAP_ESPINUDP to enable udp_sk(sk)->encap_rcv. When the encap_rcv
hook return 1 in udp_queue_rcv_one_skb(), udp_csum_pull_header() will try
to pull udphdr, but the skb size has been segmented to gso size, which
leads to this crash.

Previous commit cf329aa42b66 ("udp: cope with UDP GRO packet misdirection")
introduces segmentation in UDP receive path only for GRO, which was never
intended to be used for UFO, so drop UFO packets in udp_rcv_segment().

Link: https://lore.kernel.org/netdev/20250724083005.3918375-1-wangliang74@huawei.com/
Link: https://lore.kernel.org/netdev/20250729123907.3318425-1-wangliang74@huawei.com/
Fixes: cf329aa42b66 ("udp: cope with UDP GRO packet misdirection")
Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250730101458.3470788-1-wangliang74@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'rproc-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux

Pull remoteproc updates from Bjorn Andersson:

- Make the Xilinx remoteproc driver support running on only a single
   core, disable still unsupported remoteproc features, and stop the
   remoteproc on shutdown to facilitate kexec.

- Conclude the renaming of the Qualcomm ADSP driver to "PAS" that was
   started many years ago.

* tag 'rproc-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux:
  remoteproc: xlnx: Fix kernel-doc warnings
  remoteproc: xlnx: Disable unsupported features
  remoteproc: xlnx: Add shutdown callback
  remoteproc: xlnx: Allow single core use in split mode
  dt-bindings: remoteproc: qcom,sa8775p-pas: Correct the interrupt number
  remoteproc: Don't use %pK through printk
  dt-bindings: remoteproc: qcom,sm8150-pas: Document QCS615 remoteproc
  remoteproc: qcom: pas: Conclude the rename from adsp

selftests/bpf: Test for unaligned flow_dissector ctx access

This patch adds tests for two context fields where unaligned accesses
were not properly rejected.

Note the new macro is similar to the existing narrow_load macro, but we
need a different description and access offset. Combining the two
macros into one is probably doable but I don't think it would help
readability.

vmlinux.h is included in place of bpf.h so we have the definition of
struct bpf_nf_ctx.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/bf014046ddcf41677fb8b98d150c14027e9fddba.1754039605.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

net: mdio: mdio-bcm-unimac: Correct rate fallback logic

When the parent clock is a gated clock which has multiple parents, the
clock provider (clk-scmi typically) might return a rate of 0 since there
is not one of those particular parent clocks that should be chosen for
returning a rate. Prior to ee975351cf0c ("net: mdio: mdio-bcm-unimac:
Manage clock around I/O accesses"), we would not always be passing a
clock reference depending upon how mdio-bcm-unimac was instantiated. In
that case, we would take the fallback path where the rate is hard coded
to 250MHz.

Make sure that we still fallback to using a fixed rate for the divider
calculation, otherwise we simply ignore the desired MDIO bus clock
frequency which can prevent us from interfacing with Ethernet PHYs
properly.

Fixes: ee975351cf0c ("net: mdio: mdio-bcm-unimac: Manage clock around I/O accesses")
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250730202533.3463529-1-florian.fainelli@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: reject malicious packets in ipv6_gso_segment()

syzbot was able to craft a packet with very long IPv6 extension headers
leading to an overflow of skb->transport_header.

This 16bit field has a limited range.

Add skb_reset_transport_header_careful() helper and use it
from ipv6_gso_segment()

WARNING: CPU: 0 PID: 5871 at ./include/linux/skbuff.h:3032 skb_reset_transport_header include/linux/skbuff.h:3032 [inline]
WARNING: CPU: 0 PID: 5871 at ./include/linux/skbuff.h:3032 ipv6_gso_segment+0x15e2/0x21e0 net/ipv6/ip6_offload.c:151
Modules linked in:
CPU: 0 UID: 0 PID: 5871 Comm: syz-executor211 Not tainted 6.16.0-rc6-syzkaller-g7abc678e3084 #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/12/2025
RIP: 0010:skb_reset_transport_header include/linux/skbuff.h:3032 [inline]
RIP: 0010:ipv6_gso_segment+0x15e2/0x21e0 net/ipv6/ip6_offload.c:151
Call Trace:
<TASK>
  skb_mac_gso_segment+0x31c/0x640 net/core/gso.c:53
  nsh_gso_segment+0x54a/0xe10 net/nsh/nsh.c:110
  skb_mac_gso_segment+0x31c/0x640 net/core/gso.c:53
  __skb_gso_segment+0x342/0x510 net/core/gso.c:124
  skb_gso_segment include/net/gso.h:83 [inline]
  validate_xmit_skb+0x857/0x11b0 net/core/dev.c:3950
  validate_xmit_skb_list+0x84/0x120 net/core/dev.c:4000
  sch_direct_xmit+0xd3/0x4b0 net/sched/sch_generic.c:329
  __dev_xmit_skb net/core/dev.c:4102 [inline]
  __dev_queue_xmit+0x17b6/0x3a70 net/core/dev.c:4679

Fixes: d1da932ed4ec ("ipv6: Separate ipv6 offload support")
Reported-by: syzbot+af43e647fd835acc02df@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/688a1a05.050a0220.5d226.0008.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250730131738.3385939-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: avoid using ifconfig

ifconfig is deprecated and not always present, use ip command instead.

Fixes: e0f3b3e5c77a ("selftests: Add test cases for vlan_filter modification during runtime")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dong Chenchen <dongchenchen2@huawei.com>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20250730115313.3356036-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpll: Make ZL3073X invisible

Currently, the user is always asked about the Microchip Azurite
DPLL/PTP/SyncE core driver, even when I2C and SPI are disabled, and thus
the driver cannot be used at all.

Fix this by making the Kconfig symbol for the core driver invisible
(unless compile-testing), and selecting it by the bus glue sub-drivers.
Drop the modular defaults, as drivers should not default to enabled.

Fixes: 2df8e64e01c10a4b ("dpll: Add basic Microchip ZL3073x support")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://patch.msgid.link/97804163aeb262f0e0706d00c29d9bb751844454.1753874405.git.geert+renesas@glider.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Correctly set gso_segs when LRO is used

When gso_segs is left at 0, a number of assumptions will end up being
incorrect throughout the stack.

For example, in the GRO-path, we set NAPI_GRO_CB()->count to gso_segs.
So, if a non-LRO'ed packet followed by an LRO'ed packet is being
processed in GRO, the first one will have NAPI_GRO_CB()->count set to 1 and
the next one to 0 (in dev_gro_receive()).
Since commit 531d0d32de3e
("net/mlx5: Correctly set gso_size when LRO is used")
these packets will get merged (as their gso_size now matches).
So, we end up in gro_complete() with NAPI_GRO_CB()->count == 1 and thus
don't call inet_gro_complete(). Meaning, checksum-validation in
tcp_checksum_complete() will fail with a "hw csum failure".

Even before the above mentioned commit, incorrect gso_segs means that other
things like TCP's accounting of incoming packets (tp->segs_in,
data_segs_in, rcv_ooopack) will be incorrect. Which means that if one
does bytes_received/data_segs_in, the result will be bigger than the
MTU.

Fix this by initializing gso_segs correctly when LRO is used.

Fixes: e586b3b0baee ("net/mlx5: Ethernet Datapath files")
Reported-by: Gal Pressman <gal@nvidia.com>
Closes: https://lore.kernel.org/netdev/6583783f-f0fb-4fb1-a415-feec8155bc69@nvidia.com/
Signed-off-by: Christoph Paasch <cpaasch@openai.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250729-mlx5_gso_segs-v1-1-b48c480c1c12@openai.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio updates from Michael Tsirkin:

- vhost can now support legacy threading if enabled in Kconfig

- vsock memory allocation strategies for large buffers have been
   improved, reducing pressure on kmalloc

- vhost now supports the in-order feature. guest bits missed the merge
   window.

- fixes, cleanups all over the place

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (30 commits)
  vsock/virtio: Allocate nonlinear SKBs for handling large transmit buffers
  vsock/virtio: Rename virtio_vsock_skb_rx_put()
  vhost/vsock: Allocate nonlinear SKBs for handling large receive buffers
  vsock/virtio: Move SKB allocation lower-bound check to callers
  vsock/virtio: Rename virtio_vsock_alloc_skb()
  vsock/virtio: Resize receive buffers so that each SKB fits in a 4K page
  vsock/virtio: Move length check to callers of virtio_vsock_skb_rx_put()
  vsock/virtio: Validate length in packet header before skb_put()
  vhost/vsock: Avoid allocating arbitrarily-sized SKBs
  vhost_net: basic in_order support
  vhost: basic in order support
  vhost: fail early when __vhost_add_used() fails
  vhost: Reintroduce kthread API and add mode selection
  vdpa: Fix IDR memory leak in VDUSE module exit
  vdpa/mlx5: Fix release of uninitialized resources on error path
  vhost-scsi: Fix check for inline_sg_cnt exceeding preallocated limit
  virtio: virtio_dma_buf: fix missing parameter documentation
  vhost: Fix typos
  vhost: vringh: Remove unused functions
  vhost: vringh: Remove unused iotlb functions
  ...

sfc: unfix not-a-typo in comment

Commit fe09560f8241 ("net: Fix typos") removed duplicated word 'fallback',
but this was not a typo and change altered the semantic meaning of
the comment.
Partially revert, using the phrase 'fallback of the fallback' to make
the meaning more clear to future readers so that they won't try to
change it again.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250731144138.2637949-1-edward.cree@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'pci-v6.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci

Pull PCI updates from Bjorn Helgaas:
"Enumeration:

   - Allow built-in drivers, not just modular drivers, to use async
     initial probing (Lukas Wunner)

   - Support Immediate Readiness even on devices with no PM Capability
     (Sean Christopherson)

   - Consolidate definition of PCIE_RESET_CONFIG_WAIT_MS (100ms), the
     required delay between a reset and sending config requests to a
     device (Niklas Cassel)

   - Add pci_is_display() to check for "Display" base class and use it
     in ALSA hda, vfio, vga_switcheroo, vt-d (Mario Limonciello)

   - Allow 'isolated PCI functions' (multi-function devices without a
     function 0) for LoongArch, similar to s390 and jailhouse (Huacai
     Chen)

  Power control:

   - Add ability to enable optional slot clock for cases where the PCIe
     host controller and the slot are supplied by different clocks
     (Marek Vasut)

  PCIe native device hotplug:

   - Fix runtime PM ref imbalance on Hot-Plug Capable ports caused by
     misinterpreting a config read failure after a device has been
     removed (Lukas Wunner)

   - Avoid creating a useless PCIe port service device for pciehp if the
     slot is handled by the ACPI hotplug driver (Lukas Wunner)

   - Ignore ACPI hotplug slots when calculating depth of pciehp hotplug
     ports (Lukas Wunner)

  Virtualization:

   - Save VF resizable BAR state and restore it after reset (Michał
     Winiarski)

   - Allow IOV resources (VF BARs) to be resized (Michał Winiarski)

   - Add pci_iov_vf_bar_set_size() so drivers can control VF BAR size
     (Michał Winiarski)

  Endpoint framework:

   - Add RC-to-EP doorbell support using platform MSI controller,
     including a test case (Frank Li)

   - Allow BAR assignment via configfs so platforms have flexibility in
     determining BAR usage (Jerome Brunet)

  Native PCIe controller drivers:

   - Convert amazon,al-alpine-v[23]-pcie, apm,xgene-pcie,
     axis,artpec6-pcie, marvell,armada-3700-pcie, st,spear1340-pcie to
     DT schema format (Rob Herring)

   - Use dev_fwnode() instead of of_fwnode_handle() to remove OF
     dependency in altera (fixes an unused variable), designware-host,
     mediatek, mediatek-gen3, mobiveil, plda, xilinx, xilinx-dma,
     xilinx-nwl (Jiri Slaby, Arnd Bergmann)

   - Convert aardvark, altera, brcmstb, designware-host, iproc,
     mediatek, mediatek-gen3, mobiveil, plda, rcar-host, vmd, xilinx,
     xilinx-dma, xilinx-nwl from using pci_msi_create_irq_domain() to
     using msi_create_parent_irq_domain() instead; this makes the
     interrupt controller per-PCI device, allows dynamic allocation of
     vectors after initialization, and allows support of IMS (Nam Cao)

  APM X-Gene PCIe controller driver:

   - Rewrite MSI handling to MSI CPU affinity, drop useless CPU hotplug
     bits, use device-managed memory allocations, and clean things up
     (Marc Zyngier)

   - Probe xgene-msi as a standard platform driver rather than a
     subsys_initcall (Marc Zyngier)

  Broadcom STB PCIe controller driver:

   - Add optional DT 'num-lanes' property and if present, use it to
     override the Maximum Link Width advertised in Link Capabilities
     (Jim Quinlan)

  Cadence PCIe controller driver:

   - Use PCIe Message routing types from the PCI core rather than
     defining private ones (Hans Zhang)

  Freescale i.MX6 PCIe controller driver:

   - Add IMX8MQ_EP third 64-bit BAR in epc_features (Richard Zhu)

   - Add IMX8MM_EP and IMX8MP_EP fixed 256-byte BAR 4 in epc_features
     (Richard Zhu)

   - Configure LUT for MSI/IOMMU in Endpoint mode so Root Complex can
     trigger doorbel on Endpoint (Frank Li)

   - Remove apps_reset (LTSSM_EN) from
     imx_pcie_{assert,deassert}_core_reset(), which fixes a hotplug
     regression on i.MX8MM (Richard Zhu)

   - Delay Endpoint link start until configfs 'start' written (Richard
     Zhu)

  Intel VMD host bridge driver:

   - Add Intel Panther Lake (PTL)-H/P/U Vendor ID (George D Sworo)

  Qualcomm PCIe controller driver:

   - Add DT binding and driver support for SA8255p, which supports ECAM
     for Configuration Space access (Mayank Rana)

   - Update DT binding and driver to describe PHYs and per-Root Port
     resets in a Root Port stanza and deprecate describing them in the
     host bridge; this makes it possible to support multiple Root Ports
     in the future (Krishna Chaitanya Chundru)

   - Add Qualcomm QCS615 to SM8150 DT binding (Ziyue Zhang)

   - Add Qualcomm QCS8300 to SA8775p DT binding (Ziyue Zhang)

   - Drop TBU and ref clocks from Qualcomm SM8150 and SC8180x DT
     bindings (Konrad Dybcio)

   - Document 'link_down' reset in Qualcomm SA8775P DT binding (Ziyue
     Zhang)

   - Add required PCIE_RESET_CONFIG_WAIT_MS delay after Link up IRQ
     (Niklas Cassel)

  Rockchip PCIe controller driver:

   - Drop unused PCIe Message routing and code definitions (Hans Zhang)

   - Remove several unused header includes (Hans Zhang)

   - Use standard PCIe config register definitions instead of
     rockchip-specific redefinitions (Geraldo Nascimento)

   - Set Target Link Speed to 5.0 GT/s before retraining so we have a
     chance to train at a higher speed (Geraldo Nascimento)

  Rockchip DesignWare PCIe controller driver:

   - Prevent race between link training and register update via DBI by
     inhibiting link training after hot reset and link down (Wilfred
     Mallawa)

   - Add required PCIE_RESET_CONFIG_WAIT_MS delay after Link up IRQ
     (Niklas Cassel)

  Sophgo PCIe controller driver:

   - Add DT binding and driver for Sophgo SG2044 PCIe controller driver
     in Root Complex mode (Inochi Amaoto)

  Synopsys DesignWare PCIe controller driver:

   - Add required PCIE_RESET_CONFIG_WAIT_MS after waiting for Link up on
     Ports that support > 5.0 GT/s. Slower Ports still rely on the
     not-quite-correct PCIE_LINK_WAIT_SLEEP_MS 90ms default delay while
     waiting for the Link (Niklas Cassel)"

* tag 'pci-v6.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (116 commits)
  dt-bindings: PCI: qcom,pcie-sa8775p: Document 'link_down' reset
  dt-bindings: PCI: Remove 83xx-512x-pci.txt
  dt-bindings: PCI: Convert amazon,al-alpine-v[23]-pcie to DT schema
  dt-bindings: PCI: Convert marvell,armada-3700-pcie to DT schema
  dt-bindings: PCI: Convert apm,xgene-pcie to DT schema
  dt-bindings: PCI: Convert axis,artpec6-pcie to DT schema
  dt-bindings: PCI: Convert st,spear1340-pcie to DT schema
  PCI: Move is_pciehp check out of pciehp_is_native()
  PCI: pciehp: Use is_pciehp instead of is_hotplug_bridge
  PCI/portdrv: Use is_pciehp instead of is_hotplug_bridge
  PCI/ACPI: Fix runtime PM ref imbalance on Hot-Plug Capable ports
  selftests: pci_endpoint: Add doorbell test case
  misc: pci_endpoint_test: Add doorbell test case
  PCI: endpoint: pci-epf-test: Add doorbell test support
  PCI: endpoint: Add pci_epf_align_inbound_addr() helper for inbound address alignment
  PCI: endpoint: pci-ep-msi: Add checks for MSI parent and mutability
  PCI: endpoint: Add RC-to-EP doorbell support using platform MSI controller
  PCI: dwc: Add Sophgo SG2044 PCIe controller driver in Root Complex mode
  PCI: vmd: Switch to msi_create_parent_irq_domain()
  PCI: vmd: Convert to lock guards
  ...

net: airoha: Fix PPE table access in airoha_ppe_debugfs_foe_show()

In order to avoid any possible race we need to hold the ppe_lock
spinlock accessing the hw PPE table. airoha_ppe_foe_get_entry routine is
always executed holding ppe_lock except in airoha_ppe_debugfs_foe_show
routine. Fix the problem introducing airoha_ppe_foe_get_entry_locked
routine.

Fixes: 3fe15c640f380 ("net: airoha: Introduce PPE debugfs support")
Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20250731-airoha_ppe_foe_get_entry_locked-v2-1-50efbd8c0fd6@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: Fix flaky neighbor garbage collection test

The purpose of the "Periodic garbage collection" test case is to make
sure that "extern_valid" neighbors are not flushed during periodic
garbage collection, unlike regular neighbor entries.

The test case is currently doing the following:

1. Changing the base reachable time to 10 seconds so that periodic
   garbage collection will run every 5 seconds.

2. Changing the garbage collection stale time to 5 seconds so that
   neighbors that have not been used in the last 5 seconds will be
   considered for removal.

3. Waiting for the base reachable time change to take effect.

4. Adding an "extern_valid" neighbor, a non-"extern_valid" neighbor and
   a bunch of other neighbors so that the threshold ("thresh1") will be
   crossed and stale neighbors will be flushed during garbage
   collection.

5. Waiting for 10 seconds to give garbage collection a chance to run.

6. Checking that the "extern_valid" neighbor was not flushed and that
   the non-"extern_valid" neighbor was flushed.

The test sometimes fails in the netdev CI because the non-"extern_valid"
neighbor was not flushed. I am unable to reproduce this locally, but my
theory that since we do not know exactly when the periodic garbage
collection runs, it is possible for it to run at a time when the
non-"extern_valid" neighbor is still not considered stale.

Fix by moving the addition of the two neighbors before step 3 and by
reducing the garbage collection stale time to 1 second, to ensure that
both neighbors are considered stale when garbage collection runs.

Fixes: 171f2ee31a42 ("selftests: net: Add a selftest for externally validated neighbor entries")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/netdev/20250728093504.4ebbd73c@kernel.org/
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250731110914.506890-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ring-buffer: Convert ring_buffer_write() to use guard(preempt_notrace)

The function ring_buffer_write() has a goto out to only do a
preempt_enable_notrace(). This can be replaced by a guard.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250801203858.205479143@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

tracing: Use __free(kfree) in trace.c to remove gotos

There's a couple of locations that have goto out in trace.c for the only
purpose of freeing a variable that was allocated. These can be replaced
with __free(kfree).

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250801203858.040892777@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

tracing: Add guard() around locks and mutexes in trace.c

There's several locations in trace.c that can be simplified by using
guards around raw_spin_lock_irqsave, mutexes and preempt disabling.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250801203857.879085376@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

tracing: Add guard(ring_buffer_nest)

Some calls to the tracing ring buffer can happen when the ring buffer is
already being written to by the same context (for example, a
trace_printk() in between a ring_buffer_lock_reserve() and a
ring_buffer_unlock_commit()).

In order to not trigger the recursion detection, these functions use
ring_buffer_nest_start() and ring_buffer_nest_end(). Create a guard() for
these functions so that their use cases can be simplified and not need to
use goto for the release.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250801203857.710501021@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

tracing: Remove unneeded goto out logic

Several places in the trace.c file there's a goto out where the out is
simply a return. There's no reason to jump to the out label if it's not
doing any more logic but simply returning from the function.

Replace the goto outs with a return and remove the out labels.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250801203857.538726745@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

Merge tag 'linux-watchdog-6.17-rc1' of git://www.linux-watchdog.org/linux-watchdog

Pull watchdog updates from Wim Van Sebroeck:

- sbsa: Adjust keepalive timeout to avoid MediaTek WS0 race condition

- Various improvements and fixes

* tag 'linux-watchdog-6.17-rc1' of git://www.linux-watchdog.org/linux-watchdog:
  watchdog: sbsa: Adjust keepalive timeout to avoid MediaTek WS0 race condition
  watchdog: dw_wdt: Fix default timeout
  watchdog: Don't use "proxy" headers
  watchdog: it87_wdt: Don't use "proxy" headers
  watchdog: renesas_wdt: Convert to DEFINE_SIMPLE_DEV_PM_OPS()
  watchdog: iTCO_wdt: Report error if timeout configuration fails
  watchdog: rti_wdt: Use of_reserved_mem_region_to_resource() for "memory-region"
  dt-bindings: watchdog: nxp,pnx4008-wdt: allow clocks property
  watchdog: ziirave_wdt: check record length in ziirave_firm_verify()

Add audio support for acp7.2 platform

Merge series from Venkata Prasad Potturu <venkataprasad.potturu@amd.com>:

This patch series is to add legacy and sof audio support
for acp7.2 platform.

Merge tag 'dmaengine-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine

Pull dmaengine updates from Vinod Koul:
"Core:

   - Managed API for dma channel request

  New support:

   - Sophgo CV18XX/SG200X dmamux driver

   - Qualcomm Milos GPI, sc8280xp GPI support

  Updates:

   - Conversion of brcm,iproc-sba and marvell,orion-xor binding

   - Unused code cleanup across drivers"

* tag 'dmaengine-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: (23 commits)
  dt-bindings: dma: fsl-mxs-dma: allow interrupt-names for fsl,imx23-dma-apbx
  dmaengine: xdmac: make it selectable for ARCH_MICROCHIP
  dt-bindings: dma: Convert marvell,orion-xor to DT schema
  dt-bindings: dma: Convert brcm,iproc-sba to DT schema
  dmaengine: nbpfaxi: Add missing check after DMA map
  dmaengine: mv_xor: Fix missing check after DMA map and missing unmap
  dt-bindings: dma: qcom,gpi: document the Milos GPI DMA Engine
  dmaengine: idxd: Remove __packed from structures
  dmaengine: ti: Do not enable by default during compile testing
  dmaengine: sh: Do not enable SH_DMAE_BASE by default during compile testing
  dmaengine: idxd: Fix warning for deadcode.deadstore
  dmaengine: mmp: Fix again Wvoid-pointer-to-enum-cast warning
  dmaengine: fsl-qdma: Add missing fsl_qdma_format kerneldoc
  dmaengine: qcom: gpi: Drop unused gpi_write_reg_field()
  dmaengine: fsl-dpaa2-qdma: Drop unused mc_enc()
  dmaengine: dw-edma: Drop unused dchan2dev() and chan2dev()
  dmaengine: stm32: Don't use %pK through printk
  dmaengine: stm32-dma: configure next sg only if there are more than 2 sgs
  dmaengine: sun4i: Simplify error handling in probe()
  dt-bindings: dma: qcom,gpi: Document the sc8280xp GPI DMA engine
  ...

Merge tag 'phy-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy

Pull phy updates from Vinod Koul:
"New Support:

   - Qualcomm Milos Synopsys eUSB2 PHY, SM8750 QMP phy support, M31
     eUSB2 PHY driver

   - Samsung Exynos990 usbdrd phy, Exynos7870 MIPI phy support

   - Renesas RZ/V2N usb2-phy support

  Updates:

   - Bulk Yaml binding conversion By Rob H (too many to be listed)

   - cadence: Sierra PCIe, USB PHY multilink configuration support

   - Qualcomm refactoring of UFS PHY reset and UFS driver support for
     phy calibrate API"

* tag 'phy-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy: (74 commits)
  phy: qcom: phy-qcom-m31: Update IPQ5332 M31 USB phy initialization sequence
  dt-bindings: phy: Convert brcm,sr-usb-combo-phy to DT schema
  dt-bindings: phy: Convert ti,da830-usb-phy to DT schema
  dt-bindings: phy: marvell,mmp2-usb-phy: Drop status from the example
  dt-bindings: phy: mixel, mipi-dsi-phy: Allow assigned-clock* properties
  phy: exynos-mipi-video: correct cam0 sysreg property name for exynos7870
  phy: qcom: phy-qcom-snps-eusb2: Update init sequence per HPG 1.0.2
  phy: qcom: phy-qcom-snps-eusb2: Add missing write from init sequence
  dt-bindings: phy: qcom,snps-eusb2: document the Milos Synopsys eUSB2 PHY
  dt-bindings: usb: qcom,snps-dwc3: Add Milos compatible
  phy: rockchip-pcie: Properly disable TEST_WRITE strobe signal
  phy: rockchip-pcie: Enable all four lanes if required
  dt-bindings: phy: qcom,sc8280xp-qmp-pcie-phy: Update pcie phy bindings for QCS615
  phy: qcom: qmp-combo: Add missing PLL (VCO) configuration on SM8750
  phy: qcom: m31-eusb2: drop registration printk
  phy: qcom: m31-eusb2: fix match data santity check
  phy: qcom: qmp-pcie: Update PHY settings for QCS8300 & SA8775P
  phy: qualcomm: phy-qcom-eusb2-repeater: Don't zero-out registers
  dt-bindings: phy: qcom,snps-eusb2-repeater: Remove default tuning values
  phy: mediatek: tphy: Cleanup and document slew calibration
  ...

Merge tag 'sound-6.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull more sound updates from Takashi Iwai:
"For catching up the remaining stuff for 6.17: only small updates and
  the rest are mostly small fixes.

   - Fixes in HD-audio codec driver Kconfig, so that configurations can
     be more easily/safely carried between different versions

   - Fixes in ASoC SDCA, FSL xcvr, AW88399

   - ASoC IMX WM8524 support

   - HD-audio and USB-audio quirks and fixes

   - A minor selftest fix"

* tag 'sound-6.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (21 commits)
  ALSA: usb: scarlett2: Fix missing NULL check
  mips: Update HD-audio configs again
  LoongArch: Update HD-audio codec configs
  arm: Update HD-audio configs again
  selftests: ALSA: fix memory leak in utimer test
  ALSA: usb-audio: Add DSD support for Comtrue USB Audio device
  ALSA: hda/hdmi: Enable drivers as default
  ALSA: hda/cirrus: Enable drivers as default
  ALSA: hda/realtek: Enable drivers as default
  ALSA: hda/realtek - Fix mute LED for HP Victus 16-d1xxx (MB 8A26)
  ALSA: hda/realtek - Fix mute LED for HP Victus 16-s0xxx
  ALSA: hda: Fix the wrong register was used for DVC of TAS2770
  ALSA: scarlett2: Add retry on -EPROTO from scarlett2_usb_tx()
  ALSA: hda/realtek - Fix mute LED for HP Victus 16-r1xxx
  ASoC: codecs: Add acpi_match_table for aw88399 driver
  ASoC: dt-bindings: atmel,at91-ssc: add microchip,sam9x7-ssc
  ASoC: imx-card: Add WM8524 support
  ASoC: fsl_xcvr: get channel status data with firmware exists
  ASoC: fsl_xcvr: get channel status data when PHY is not exists
  ASoC: SDCA: Add support for -cn- value properties
  ...

Merge tag 'soundwire-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire

Pull soundwire updates from Vinod Koul:
"A couple of small core changes and driver updates:

   - Core: handling of nesting irqs to outside the lock, stream
     parameters handing on port prep failures.

   - AMD driver support for ACP 7.2 platforms and improved handing of
     slave alerts and resume sequences

   - Qualcomm updating driver debug spew

   - Intel BPT message length limitations, rt721 codec as wake capable
     etc"

* tag 'soundwire-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire:
  soundwire: amd: Add support for acp7.2 platform
  soundwire: stream: restore params when prepare ports fail
  soundwire: debugfs: move debug statement outside of error handling
  soundwire: amd: add check for status update registers
  soundwire: intel_auxdevice: add rt721 codec to wake_capable_list
  soundwire: Correct some property names
  soundwire: update Intel BPT message length limitation
  soundwire: intel_ace2.x: Use str_read_write() helper
  soundwire: amd: cancel pending slave status handling workqueue during remove sequence
  soundwire: amd: serialize amd manager resume sequence during pm_prepare
  soundwire: qcom: demote probe registration printk
  ASoC: cs42l43: Remove unnecessary work functions
  soundwire: Move handle_nested_irq outside of sdw_dev_lock
  MAINTAINERS: Remove Sanyog Kale as reviewer on SoundWire

Merge tag 'trace-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing updates from Steven Rostedt:

- Deprecate auto-mounting tracefs to /sys/kernel/debug/tracing

   When tracefs was first introduced back in 2014, the directory
   /sys/kernel/tracing was added and is the designated location to mount
   tracefs. To keep backward compatibility, tracefs was auto-mounted in
   /sys/kernel/debug/tracing as well.

   All distros now mount tracefs on /sys/kernel/tracing. Having it seen
   in two different locations has lead to various issues and
   inconsistencies.

   The VFS folks have to also maintain debugfs_create_automount() for
   this single user.

   It's been over 10 years. Tooling and scripts should start replacing
   the debugfs location with the tracefs one. The reason tracefs was
   created in the first place was to allow access to the tracing
   facilities without the need to configure debugfs into the kernel.
   Using tracefs should now be more robust.

   A new config is created: CONFIG_TRACEFS_AUTOMOUNT_DEPRECATED which is
   default y, so that the kernel is still built with the automount. This
   config allows those that want to remove the automount from debugfs to
   do so.

   When tracefs is accessed from /sys/kernel/debug/tracing, the
   following printk is triggerd:

     pr_warn("NOTICE: Automounting of tracing to debugfs is deprecated and will be removed in 2030\n");

   This gives users another 5 years to fix their scripts.

- Use queue_rcu_work() instead of call_rcu() for freeing event filters

   The number of filters to be free can be many depending on the number
   of events within an event system. Freeing them from softirq context
   can potentially cause undesired latency. Use the RCU workqueue to
   free them instead.

- Remove pointless memory barriers in latency code

   Memory barriers were added to some of the latency code a long time
   ago with the idea of "making them visible", but that's not what
   memory barriers are for. They are to synchronize access between
   different variables. There was no synchronization here making them
   pointless.

- Remove "__attribute__()" from the type field of event format

   When LLVM is used to compile the kernel with CONFIG_DEBUG_INFO_BTF=y
   and PAHOLE_HAS_BTF_TAG=y, some of the format fields get expanded with
   the following:

     field:const char * filename;      offset:24;      size:8; signed:0;

   Turns into:

     field:const char __attribute__((btf_type_tag("user"))) * filename;      offset:24;      size:8; signed:0;

   This confuses parsers. Add code to strip these tags from the strings.

- Add eprobe config option CONFIG_EPROBE_EVENTS

   Eprobes were added back in 5.15 but were only enabled when another
   probe was enabled (kprobe, fprobe, uprobe, etc). The eprobes had no
   config option of their own. Add one as they should be a separate
   entity.

   It's default y to keep with the old kernels but still has
   dependencies on TRACING and HAVE_REGS_AND_STACK_ACCESS_API.

- Add eprobe documentation

   When eprobes were added back in 5.15 no documentation was added to
   describe them. This needs to be rectified.

- Replace open coded cpumask_next_wrap() in move_to_next_cpu()

- Have preemptirq_delay_run() use off-stack CPU mask

- Remove obsolete comment about pelt_cfs event

   DECLARE_TRACE() appends "_tp" to trace events now, but the comment
   above pelt_cfs still mentioned appending it manually.

- Remove EVENT_FILE_FL_SOFT_MODE flag

   The SOFT_MODE flag was required when the soft enabling and disabling
   of trace events was first introduced. But there was a bug with this
   approach as it only worked for a single instance. When multiple users
   required soft disabling and disabling the code was changed to have a
   ref count. The SOFT_MODE flag is now set iff the ref count is non
   zero. This is redundant and just reading the ref count is good
   enough.

- Fix typo in comment

* tag 'trace-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  Documentation: tracing: Add documentation about eprobes
  tracing: Have eprobes have their own config option
  tracing: Remove "__attribute__()" from the type field of event format
  tracing: Deprecate auto-mounting tracefs in debugfs
  tracing: Fix comment in trace_module_remove_events()
  tracing: Remove EVENT_FILE_FL_SOFT_MODE flag
  tracing: Remove pointless memory barriers
  tracing/sched: Remove obsolete comment on suffixes
  kernel: trace: preemptirq_delay_test: use offstack cpu mask
  tracing: Use queue_rcu_work() to free filters
  tracing: Replace opencoded cpumask_next_wrap() in move_to_next_cpu()

Merge tag 'trace-tools-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing tools updates from Steven Rostedt:

- Introduce enum timerlat_tracing_mode

   Now that BPF based sampling has been added to timerlat, add an enum
   to represent which mode timerlat is running in

- Add action on timelat threshold feature

   A new option, --on-threshold, is added, taking an argument that
   further specifies the action. Actions added in this patch are:

     - trace[,file=<filename>]: Saves tracefs buffer, optionally taking a
             filename
     - signal,num=<sig>,pid=<pid>: Sends signal to process. "parent" might
             be specified instead of number to send signal to parent process
     - shell,command=<command>: Execute shell command

- Allow resuming tracing in timerlat bpf

   rtla-timerlat BPF program uses a global variable stored in a .bss
   section to store whether tracing has been stopped. Map it to allow it
   to resume tracing after it has been stopped

- Add continue action to timerlat

   Introduce option to resume tracing after a latency threshold
   overflow. The option is implemented as an action named "continue"

- Add action on end feature to timerlat

   Implement actions on end next to actions on threshold. A new option,
   --on-end is added, parallel to --on-threshold. Instead of being
   executed whenever a latency threshold is reached, it is executed at
   the end of the measurement

- Have rtla tests check output with grep

   Add argument to the check command in the test suite that takes a
   regular expression that the output of rtla command is checked
   against. This allows testing for specific information in rtla output
   in addition to checking the return value

- Add tests for timerlat actions

- Update the documentation for the new features

* tag 'trace-tools-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  rtla/tests: Test timerlat -P option using actions
  rtla/tests: Add grep checks for base test cases
  Documentation/rtla: Add actions feature
  rtla/tests: Limit duration to maximum of 10s
  rtla/tests: Add tests for actions
  rtla/tests: Check rtla output with grep
  rtla/timerlat: Add action on end feature
  rtla/timerlat: Add continue action
  rtla/timerlat_bpf: Allow resuming tracing
  rtla/timerlat: Add action on threshold feature
  rtla/timerlat: Introduce enum timerlat_tracing_mode

Merge tag 'trace-deferred-unwind-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull initial deferred unwind infrastructure from Steven Rostedt:
"This is the core infrastructure for the deferred unwinder that is
  required for sframes[1]. Several other patch series are based on this
  work although those patch series are not dependent on each other. In
  order to simplify the development, having this core series upstream
  will allow the other series to be worked on in parallel. The other
  series are:

    - The two patches to implement x86 support [2] [3]

    - The s390 work [4]

    - The perf work [5]

    - The ftrace work [6]

    - The sframe work [7]

  And more is on the way.

  The core infrastructure adds the following in kernel APIs:

    - int unwind_user_faultable(struct unwind_stacktrace *trace);

        Performs a user space stack trace that may fault user pages in.

    - int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);

        Allows a tracer to register with the unwind deferred
        infrastructure.

    - int unwind_deferred_request(struct unwind_work *work, u64 *cookie);

        Used when a tracer request a deferred trace. Can be called from
        interrupt or NMI context.

    - void unwind_deferred_cancel(struct unwind_work *work);

        Called by a tracer to unregister from the deferred unwind
        infrastructure.

    - void unwind_deferred_task_exit(struct task_struct *task);

        Called by task exit code to flush any pending unwind requests.

    - void unwind_task_init(struct task_struct *task);

        Called by do_fork() to initialize the task struct for the
        deferred unwinder.

    - void unwind_task_free(struct task_struct *task);

        Called by do_exit() to free up any resources used by the
        deferred unwinder.

    None of the above is actually compiled unless an architecture enables it,
    which none currently do"

Link: https://sourceware.org/binutils/wiki/sframe
Link: https://lore.kernel.org/linux-trace-kernel/20250717004958.260781923@kernel.org/
Link: https://lore.kernel.org/linux-trace-kernel/20250717004958.432327787@kernel.org/
Link: https://lore.kernel.org/linux-trace-kernel/20250710163522.3195293-1-jremus@linux.ibm.com/
Link: https://lore.kernel.org/linux-trace-kernel/20250718164119.089692174@kernel.org/
Link: https://lore.kernel.org/linux-trace-kernel/20250424192612.505622711@goodmis.org/
Link: https://lore.kernel.org/linux-trace-kernel/20250717012848.927473176@kernel.org/
* tag 'trace-deferred-unwind-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  unwind: Finish up unwind when a task exits
  unwind deferred: Use SRCU unwind_deferred_task_work()
  unwind: Add USED bit to only have one conditional on way back to user space
  unwind deferred: Add unwind_completed mask to stop spurious callbacks
  unwind deferred: Use bitmask to determine which callbacks to call
  unwind_user/deferred: Make unwind deferral requests NMI-safe
  unwind_user/deferred: Add deferred unwinding interface
  unwind_user/deferred: Add unwind cache
  unwind_user/deferred: Add unwind_user_faultable()
  unwind_user: Add user space unwinding API with frame pointer support

bpf: Improve ctx access verifier error message

We've already had two "error during ctx access conversion" warnings
triggered by syzkaller. Let's improve the error message by dumping the
cnt variable so that we can more easily differentiate between the
different error cases.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/cc94316c30dd76fae4a75a664b61a2dbfe68e205.1754039605.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Check netfilter ctx accesses are aligned

Similarly to the previous patch fixing the flow_dissector ctx accesses,
nf_is_valid_access also doesn't check that ctx accesses are aligned.
Contrary to flow_dissector programs, netfilter programs don't have
context conversion. The unaligned ctx accesses are therefore allowed by
the verifier.

Fixes: fd9c663b9ad6 ("bpf: minimal support for programs hooked into netfilter framework")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/853ae9ed5edaa5196e8472ff0f1bb1cc24059214.1754039605.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Check flow_dissector ctx accesses are aligned

flow_dissector_is_valid_access doesn't check that the context access is
aligned. As a consequence, an unaligned access within one of the exposed
field is considered valid and later rejected by
flow_dissector_convert_ctx_access when we try to convert it.

The later rejection is problematic because it's reported as a verifier
bug with a kernel warning and doesn't point to the right instruction in
verifier logs.

Fixes: d58e468b1112 ("flow_dissector: implements flow dissector BPF hook")
Reported-by: syzbot+ccac90e482b2a81d74aa@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=ccac90e482b2a81d74aa
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/cc1b036be484c99be45eddf48bd78cc6f72839b1.1754039605.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

x86/cpu: Add new Intel CPU model numbers for Wildcatlake and Novalake

Wildcatlake is a mobile CPU. Novalake has both desktop and mobile
versions.

[ bp: Merge into a single patch. ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250730150437.4701-1-tony.luck@intel.com

spi: cs42l43: Property entry should be a null-terminated array

The software node does not specify a count of property entries, so the
array must be null-terminated.

When unterminated, this can lead to a fault in the downstream cs35l56
amplifier driver, because the node parse walks off the end of the
array into unknown memory.

Fixes: 0ca645ab5b15 ("spi: cs42l43: Add speaker id support to the bridge configuration")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220371
Signed-off-by: Simon Trimmer <simont@opensource.cirrus.com>
Link: https://patch.msgid.link/20250731160109.1547131-1-simont@opensource.cirrus.com
Signed-off-by: Mark Brown <broonie@kernel.org>

ASoC: Intel: sof_sdw: Add quirk for Alienware Area 51 (2025) 0CCC SKU

Add DMI quirk entry for Alienware systems with SKU "0CCC" to enable
proper speaker codec configuration (SOC_SDW_CODEC_SPKR).

This system requires the same audio configuration as some existing Dell systems.
Without this patch, the laptop's speakers and microphone will not work.

Signed-off-by: Peter Jakubek <peterjakubek@gmail.com>
Link: https://patch.msgid.link/20250731172104.2009007-1-peterjakubek@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>

ASoC: tas2781: Fix the wrong step for TLV on tas2781

The step for TLV on tas2781, should be 50 (-0.5dB).

Fixes: 678f38eba1f2 ("ASoC: tas2781: Add Header file for tas2781 driver")
Signed-off-by: Baojun Xu <baojun.xu@ti.com>
Link: https://patch.msgid.link/20250801021618.64627-1-baojun.xu@ti.com
Signed-off-by: Mark Brown <broonie@kernel.org>

ASoC: amd: acp: Add SoundWire SOF machine driver support for acp7.2 platform

Add SoundWire SOF machine driver support for acp7.2 platform.

Signed-off-by: Venkata Prasad Potturu <venkataprasad.potturu@amd.com>
Link: https://patch.msgid.link/20250801062207.579388-5-venkataprasad.potturu@amd.com
Signed-off-by: Mark Brown <broonie@kernel.org>

ASoC: amd: acp: Add SoundWire legacy machine driver support for acp7.2 platform

Add SoundWire legacy machine driver support for acp7.2 platform.

Signed-off-by: Venkata Prasad Potturu <venkataprasad.potturu@amd.com>
Link: https://patch.msgid.link/20250801062207.579388-4-venkataprasad.potturu@amd.com
Signed-off-by: Mark Brown <broonie@kernel.org>

ASoC: amd: ps: Add SoundWire pci and dma driver support for acp7.2 platform

Add SoundWire pci and dma driver support for acp7.2 platform.

Signed-off-by: Venkata Prasad Potturu <venkataprasad.potturu@amd.com>
Link: https://patch.msgid.link/20250801062207.579388-3-venkataprasad.potturu@amd.com
Signed-off-by: Mark Brown <broonie@kernel.org>

ASoC: SOF: amd: Add sof audio support for acp7.2 platform

Add pci revision id to support sof audio for acp7.2 platfom.

Signed-off-by: Venkata Prasad Potturu <venkataprasad.potturu@amd.com>
Link: https://patch.msgid.link/20250801062207.579388-2-venkataprasad.potturu@amd.com
Signed-off-by: Mark Brown <broonie@kernel.org>

vsock/virtio: Allocate nonlinear SKBs for handling large transmit buffers

When transmitting a vsock packet, virtio_transport_send_pkt_info() calls
virtio_transport_alloc_linear_skb() to allocate and fill SKBs with the
transmit data. Unfortunately, these are always linear allocations and
can therefore result in significant pressure on kmalloc() considering
that the maximum packet size (VIRTIO_VSOCK_MAX_PKT_BUF_SIZE +
VIRTIO_VSOCK_SKB_HEADROOM) is a little over 64KiB, resulting in a 128KiB
allocation for each packet.

Rework the vsock SKB allocation so that, for sizes with page order
greater than PAGE_ALLOC_COSTLY_ORDER, a nonlinear SKB is allocated
instead with the packet header in the SKB and the transmit data in the
fragments. Note that this affects both the vhost and virtio transports.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-10-will@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vsock/virtio: Rename virtio_vsock_skb_rx_put()

In preparation for using virtio_vsock_skb_rx_put() when populating SKBs
on the vsock TX path, rename virtio_vsock_skb_rx_put() to
virtio_vsock_skb_put().

No functional change.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-9-will@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vhost/vsock: Allocate nonlinear SKBs for handling large receive buffers

When receiving a packet from a guest, vhost_vsock_handle_tx_kick()
calls vhost_vsock_alloc_linear_skb() to allocate and fill an SKB with
the receive data. Unfortunately, these are always linear allocations and
can therefore result in significant pressure on kmalloc() considering
that the maximum packet size (VIRTIO_VSOCK_MAX_PKT_BUF_SIZE +
VIRTIO_VSOCK_SKB_HEADROOM) is a little over 64KiB, resulting in a 128KiB
allocation for each packet.

Rework the vsock SKB allocation so that, for sizes with page order
greater than PAGE_ALLOC_COSTLY_ORDER, a nonlinear SKB is allocated
instead with the packet header in the SKB and the receive data in the
fragments. Finally, add a debug warning if virtio_vsock_skb_rx_put() is
ever called on an SKB with a non-zero length, as this would be
destructive for the nonlinear case.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-8-will@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vsock/virtio: Move SKB allocation lower-bound check to callers

virtio_vsock_alloc_linear_skb() checks that the requested size is at
least big enough for the packet header (VIRTIO_VSOCK_SKB_HEADROOM).

Of the three callers of virtio_vsock_alloc_linear_skb(), only
vhost_vsock_alloc_skb() can potentially pass a packet smaller than the
header size and, as it already has a check against the maximum packet
size, extend its bounds checking to consider the minimum packet size
and remove the check from virtio_vsock_alloc_linear_skb().

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-7-will@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vsock/virtio: Rename virtio_vsock_alloc_skb()

In preparation for nonlinear allocations for large SKBs, rename
virtio_vsock_alloc_skb() to virtio_vsock_alloc_linear_skb() to indicate
that it returns linear SKBs unconditionally and switch all callers over
to this new interface for now.

No functional change.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-6-will@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>