Currently execmem_cache_free() ignores potential allocation failures that
may happen in execmem_cache_add(). Besides, it uses text poking to fill
the memory with trapping instructions before returning it to cache
although it would be more efficient to make that memory writable, update
it using memcpy and then restore ROX protection.
Rework execmem_cache_free() so that in case of an error it will defer
freeing of the memory to a delayed work.
With this the happy fast path will now change permissions to RW, fill the
memory with trapping instructions using memcpy, restore ROX permissions,
add the memory back to the free cache and clear the relevant entry in
busy_areas.
If any step in the fast path fails, the entry in busy_areas will be marked
as pending_free. These entries will be handled by a delayed work and
freed asynchronously.
To make the fast path faster, use __GFP_NORETRY for memory allocations and
let asynchronous handler try harder with GFP_KERNEL.
Link: https://lkml.kernel.org/r/20250713071730.4117334-4-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Some callers of execmem_alloc() require the memory to be temporarily
writable even when it is allocated from ROX cache. These callers use
execemem_make_temp_rw() right after the call to execmem_alloc().
Wrap this sequence in execmem_alloc_rw() API.
Link: https://lkml.kernel.org/r/20250713071730.4117334-3-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "x86: enable EXECMEM_ROX_CACHE for ftrace and kprobes", v3.
These patches enable use of EXECMEM_ROX_CACHE for ftrace and kprobes
allocations on x86.
They also include some ground work in execmem.
Since the execmem model for caching large ROX pages changed from the
initial assumption that the memory that is allocated from ROX cache is
always ROX to the current state where memory can be temporarily made RW
and then restored to ROX, we can stop using text poking to update it.
This also saves the hassle of trying lock text_mutex in
execmem_cache_free() when kprobes already hold that mutex.
This patch (of 8):
The execmem_update_copy() that used text poking was required when memory
allocated from ROX cache was always read-only. Since now its permissions
can be switched to read-write there is no need in a function that updates
memory with text poking.
mm: fix a UAF when vma->mm is freed after vma->vm_refcnt got dropped
By inducing delays in the right places, Jann Horn created a reproducer for
a hard to hit UAF issue that became possible after VMAs were allowed to be
recycled by adding SLAB_TYPESAFE_BY_RCU to their cache.
Race description is borrowed from Jann's discovery report:
lock_vma_under_rcu() looks up a VMA locklessly with mas_walk() under
rcu_read_lock(). At that point, the VMA may be concurrently freed, and it
can be recycled by another process. vma_start_read() then increments the
vma->vm_refcnt (if it is in an acceptable range), and if this succeeds,
vma_start_read() can return a recycled VMA.
In this scenario where the VMA has been recycled, lock_vma_under_rcu()
will then detect the mismatching ->vm_mm pointer and drop the VMA through
vma_end_read(), which calls vma_refcount_put(). vma_refcount_put() drops
the refcount and then calls rcuwait_wake_up() using a copy of vma->vm_mm.
This is wrong: It implicitly assumes that the caller is keeping the VMA's
mm alive, but in this scenario the caller has no relation to the VMA's mm,
so the rcuwait_wake_up() can cause UAF.
The diagram depicting the race:
T1 T2 T3
== == ==
lock_vma_under_rcu
mas_walk
<VMA gets removed from mm>
mmap
<the same VMA is reallocated>
vma_start_read
__refcount_inc_not_zero_limited_acquire
munmap
__vma_enter_locked
refcount_add_not_zero
vma_end_read
vma_refcount_put
__refcount_dec_and_test
rcuwait_wait_event
<finish operation>
rcuwait_wake_up [UAF]
Note that rcuwait_wait_event() in T3 does not block because refcount was
already dropped by T1. At this point T3 can exit and free the mm causing
UAF in T1.
To avoid this we move vma->vm_mm verification into vma_start_read() and
grab vma->vm_mm to stabilize it before vma_refcount_put() operation.
If an anon folio is mapped into userspace, its anon_vma must be alive,
otherwise rmap walks can hit UAF.
There have been syzkaller reports a few months ago[1][2] of UAF in rmap
walks that seems to indicate that there can be pages with elevated
mapcount whose anon_vma has already been freed, but I think we never
figured out what the cause is; and syzkaller only hit these UAFs when
memory pressure randomly caused reclaim to rmap-walk the affected pages,
so it of course didn't manage to create a reproducer.
Add a VM_WARN_ON_FOLIO() when we add/remove mappings of anonymous folios
to hopefully catch such issues more reliably.
Lorenzo Stoakes [Fri, 25 Jul 2025 14:29:01 +0000 (15:29 +0100)]
mm: remove mm/io-mapping.c
This is dead code, which was used from commit b739f125e4eb ("i915: use
io_mapping_map_user") but reverted a month later by commit 0e4fe0c9f2f9
("Revert "i915: use io_mapping_map_user"") back in 2021.
Since then nobody has used it, so remove it.
[akpm@linux-foundation.org: update Documentation/core-api/mm-api.rst, per Vlastimil] Link: https://lkml.kernel.org/r/20250725142901.81502-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dev Jain [Thu, 24 Jul 2025 05:23:01 +0000 (10:53 +0530)]
khugepaged: optimize collapse_pte_mapped_thp() by PTE batching
Use PTE batching to batch process PTEs mapping the same large folio. An
improvement is expected due to batching mapcount manipulation on the
folios, and for arm64 which supports contig mappings, the number of
TLB flushes is also reduced.
Note that we do not need to make a change to the check
"if (folio_page(folio, i) != page)"; if i'th page of the folio is equal
to the first page of our batch, then i + 1, .... i + nr_batch_ptes - 1
pages of the folio will be equal to the corresponding pages of our
batch mapping consecutive pages.
Link: https://lkml.kernel.org/r/20250724052301.23844-4-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Barry Song <baohua@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dev Jain [Thu, 24 Jul 2025 05:23:00 +0000 (10:53 +0530)]
khugepaged: optimize __collapse_huge_page_copy_succeeded() by PTE batching
Use PTE batching to batch process PTEs mapping the same large folio. An
improvement is expected due to batching refcount-mapcount manipulation on
the folios, and for arm64 which supports contig mappings, the number of
TLB flushes is also reduced.
Link: https://lkml.kernel.org/r/20250724052301.23844-3-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Barry Song <baohua@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If the underlying folio mapped by the ptes is large, we can process those
ptes in a batch using folio_pte_batch().
For arm64 specifically, this results in a 16x reduction in the number of
ptep_get() calls, since on a contig block, ptep_get() on arm64 will
iterate through all 16 entries to collect a/d bits. Next, ptep_clear()
will cause a TLBI for every contig block in the range via
contpte_try_unfold(). Instead, use clear_ptes() to only do the TLBI at
the first and last contig block of the range.
For split folios, there will be no pte batching; the batch size returned
by folio_pte_batch() will be 1. For pagetable split folios, the ptes will
still point to the same large folio; for arm64, this results in the
optimization described above, and for other arches, a minor improvement is
expected due to a reduction in the number of function calls and batching
atomic operations.
This patch (of 3):
Let's add variants to be used where "full" does not apply -- which will
be the majority of cases in the future. "full" really only applies if
we are about to tear down a full MM.
Use get_and_clear_ptes() in existing code, clear_ptes() users will
be added next.
Link: https://lkml.kernel.org/r/20250724052301.23844-2-dev.jain@arm.com Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 25 Jul 2025 08:29:45 +0000 (09:29 +0100)]
mm/mseal: rework mseal apply logic
The logic can be simplified - firstly by renaming the inconsistently named
apply_mm_seal() to mseal_apply().
We then wrap mseal_fixup() into the main loop as the logic is simple
enough to not require it, equally it isn't a hugely pleasant pattern in
mprotect() etc. so it's not something we want to perpetuate.
We eliminate the need for invoking vma_iter_end() on each loop by directly
determining if the VMA was merged - the only thing we need concern
ourselves with is whether the start/end of the (gapless) range are offset
into VMAs.
This refactoring also avoids the rather horrid 'pass pointer to prev
around' pattern used in mprotect() et al.
No functional change intended.
Link: https://lkml.kernel.org/r/ddfa4376ce29f19a589d7dc8c92cb7d4f7605a4c.1753431105.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Jeff Xu <jeffxu@chromium.org> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 25 Jul 2025 08:29:44 +0000 (09:29 +0100)]
mm/mseal: simplify and rename VMA gap check
The check_mm_seal() function is doing something general - checking whether
a range contains only VMAs (or rather that it does NOT contain any
unmapped regions).
So rename this function to range_contains_unmapped().
Additionally simplify the logic, we are simply checking whether the last
vma->vm_end has either a VMA starting after it or ends before the end
parameter.
This check is rather dubious, so it is sensible to keep it local to
mm/mseal.c as at a later stage it may be removed, and we don't want any
other mm code to perform such a check.
Lorenzo Stoakes [Fri, 25 Jul 2025 08:29:43 +0000 (09:29 +0100)]
mm/mseal: small cleanups
Drop the wholly unnecessary set_vma_sealed() helper(), which is used only
once, and place VMA_ITERATOR() declarations in the correct place.
Retain vma_is_sealed(), and use it instead of the confusingly named
can_modify_vma(), so it's abundantly clear what's being tested, rather
then a nebulous sense of 'can the VMA be modified'.
No functional change intended.
Link: https://lkml.kernel.org/r/98cf28d04583d632a6eb698e9ad23733bb6af26b.1753431105.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Jeff Xu <jeffxu@chromium.org> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 25 Jul 2025 08:29:42 +0000 (09:29 +0100)]
mm/mseal: update madvise() logic
The madvise() logic is inexplicably performed in mm/mseal.c - this ought
to be located in mm/madvise.c.
Additionally can_modify_vma_madv() is inconsistently named and, in
combination with is_ro_anon(), is very confusing logic.
Put a static function in mm/madvise.c instead - can_madvise_modify() -
that spells out exactly what's happening. Also explicitly check for an
anon VMA.
Also add commentary to explain what's going on.
Essentially - we disallow discarding of data in mseal()'d mappings in
instances where the user couldn't otherwise write to that data.
We retain the existing behaviour here regarding MAP_PRIVATE mappings of
file-backed mappings, which entails some complexity - while this, strictly
speaking - appears to violate mseal() semantics, it may interact badly
with users which expect to be able to madvise(MADV_DONTNEED) .text
mappings for instance.
We may revisit this at a later date.
No functional change intended.
Link: https://lkml.kernel.org/r/492a98d9189646e92c8f23f4cce41ed323fe01df.1753431105.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Kees Cook <kees@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 25 Jul 2025 08:29:41 +0000 (09:29 +0100)]
mm/mseal: always define VM_SEALED
Patch series "mseal cleanups", v4.
Perform a number of cleanups to the mseal logic. Firstly, VM_SEALED is
treated differently from every other VMA flag, it really doesn't make
sense to do this, so we start by making this consistent with everything
else.
Next we place the madvise logic where it belongs - in mm/madvise.c. It
really makes no sense to abstract this elsewhere. In doing so, we go to
great lengths to explain very clearly the previously very confusing logic
as to what sealed mappings are impacted here.
In doing so, we retain existing logic regarding treatment of madvise()
discard operations for a sealed, read-only MAP_PRIVATE file-backed
mapping. This is something we likely need to revisit.
We then abstract out and explain the 'are there are any gaps in this range
in the mm?' check being performed as a prerequisite to mseal being
performed.
Finally, we simplify the actual mseal logic which is really quite
straightforward.
No functional change is intended.
This patch (of 4):
There is no reason to treat VM_SEALED in a special way, in each other case
in which a VMA flag is unavailable due to configuration, we simply assign
that flag to VM_NONE, so make VM_SEALED consistent with all other VMA
flags in this respect.
Additionally, use the next available bit for VM_SEALED, 42, rather than
arbitrarily putting it at 63 and update the declaration to match all other
VMA flags.
mm/damon/vaddr: skip isolating folios already in destination nid
damos_va_migrate_dests_add() determines the node a folio should be in
based on the struct damos_migrate_dests associated with the migration
scheme and adds the folio to the linked list corresponding to that node so
it can be migrated later. Currently, folios are isolated and added to the
list even if they are already in the node they should be in.
In using damon weighted interleave more, I've found that the overhead of
needlessly adding these folios to the migration lists can be quite high.
The overhead comes from isolating folios and placing them in the migration
lists inside of damos_va_migrate_dests_add(), as well as the cost of
handling those folios in damon_migrate_pages(). This patch eliminates
that overhead by simply avoiding the addition of folios that are already
in their intended location to the migration list.
To show the benefit of this patch, we start the test workload and start a
DAMON instance attached to that workload with a migrate_hot scheme that
has one dest field sending data to the local node. This way, we are only
measuring the overheads of the scheme, and not the cost of migrating
pages, since data will be allocated to the local node by default. I
tested with two workloads: the embedding reduction workload used in [1]
and a microbenchmark that allocates 20GB of data then sleeps, which is
similar to the memory usage of the embedding reduction workload.
The time taken in damos_va_migrate_dests_add() and damon_migrate_pages()
each aggregation interval is shown below.
Before this patch:
damos_va_migrate_dests_add damon_migrate_pages
microbenchmark ~2ms ~3ms
embedding reduction ~1s ~3s
After this patch:
damos_va_migrate_dests_add damon_migrate_pages
microbenchmark 0us ~40us
embedding reduction 0us ~100us
I did not do an in depth analysis for why things are much slower in the
embedding reduction workload than the microbenchmark. However, I assume
it's because the embedding reduction workload oversaturates the bandwidth
of the local memory node, increasing the memory access latency, and in
turn making the pointer chasing involved in iterating through a linked
list much slower. Regardless of that, this patch results in a significant
speedup.
Link: https://lkml.kernel.org/r/20250725163300.4602-1-bijan311@gmail.com Fixes: 19c1dc15c859 ("mm/damon/vaddr: use damos->migrate_dests in migrate_{hot,cold}") Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Suresh K C [Wed, 9 Jul 2025 17:46:57 +0000 (23:16 +0530)]
selftests: cachestat: add tests for mmap, refactor and enhance mmap test for cachestat validation
Add a cohesive test case that verifies cachestat behavior with
memory-mapped files using mmap(). Also refactor the test logic to reduce
redundancy, improve error reporting, and clarify failure messages for both
shmem and mmap file types.
[akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20250709174657.6916-1-suresh.k.chandrappa@gmail.com Signed-off-by: Suresh K C <suresh.k.chandrappa@gmail.com> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Tested-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Xuanye Liu [Wed, 23 Jul 2025 10:09:00 +0000 (18:09 +0800)]
mm: add process info to bad rss-counter warning
Enhance the debugging information in check_mm() by including the process
name and PID when reporting bad rss-counter states. This helps identify
which process is associated with the memory accounting issue.
Link: https://lkml.kernel.org/r/20250723100901.1909683-1-liuqiye2025@163.com Signed-off-by: Xuanye Liu <liuqiye2025@163.com> Acked-by: SeongJae Park <sj@kernel.org> Cc: Ben Segall <bsegall@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mel Gorman <mgorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kasan: skip quarantine if object is still accessible under RCU
Currently, enabling KASAN masks bugs where a lockless lookup path gets a
pointer to a SLAB_TYPESAFE_BY_RCU object that might concurrently be
recycled and is insufficiently careful about handling recycled objects:
KASAN puts freed objects in SLAB_TYPESAFE_BY_RCU slabs onto its quarantine
queues, even when it can't actually detect UAF in these objects, and the
quarantine prevents fast recycling.
When I introduced CONFIG_SLUB_RCU_DEBUG, my intention was that enabling
CONFIG_SLUB_RCU_DEBUG should cause KASAN to mark such objects as freed
after an RCU grace period and put them on the quarantine, while disabling
CONFIG_SLUB_RCU_DEBUG should allow such objects to be reused immediately;
but that hasn't actually been working.
I discovered such a UAF bug involving SLAB_TYPESAFE_BY_RCU yesterday; I
could only trigger this bug in a KASAN build by disabling
CONFIG_SLUB_RCU_DEBUG and applying this patch.
Commit cd57b77197a4 ("ext4: Convert ext4_bio_write_page() to use a folio)
removed set_page_writeback_keepwrite() which was the last/only caller of
folio_start_writeback_keepwrite().
Link: https://lkml.kernel.org/r/20250722182230.2114587-1-joannelkoong@gmail.com Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
wang lian [Mon, 21 Jul 2025 11:46:14 +0000 (19:46 +0800)]
selftests/mm: add process_madvise() tests
Add tests for process_madvise(), focusing on verifying behavior under
various conditions including valid usage and error cases.
[lianux.mm@gmail.com: v7] Link: https://lkml.kernel.org/r/20250729113109.12272-1-lianux.mm@gmail.com Link: https://lkml.kernel.org/r/20250729113109.12272-1-lianux.mm@gmail.com Link: https://lkml.kernel.org/r/20250721114614.40996-1-lianux.mm@gmail.com Signed-off-by: wang lian <lianux.mm@gmail.com> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: David Hildenbrand <david@redhat.com> Suggested-by: Zi Yan <ziy@nvidia.com> Suggested-by: Mark Brown <broonie@kernel.org> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Tested-by: Zi Yan <ziy@nvidia.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Thu, 31 Jul 2025 01:53:43 +0000 (09:53 +0800)]
mm: shmem: fix the shmem large folio allocation for the i915 driver
After commit acd7ccb284b8 ("mm: shmem: add large folio support for
tmpfs"), we extend the 'huge=' option to allow any sized large folios for
tmpfs, which means tmpfs will allow getting a highest order hint based on
the size of write() and fallocate() paths, and then will try each
allowable large order.
However, when the i915 driver allocates shmem memory, it doesn't provide
hint information about the size of the large folio to be allocated,
resulting in the inability to allocate PMD-sized shmem, which in turn
affects GPU performance.
Patryk added:
: In my tests, the performance drop ranges from a few percent up to 13%
: in Unigine Superposition under heavy memory usage on the CPU Core Ultra
: 155H with the Xe 128 EU GPU. Other users have reported performance
: impact up to 30% on certain workloads. Please find more in the
: regressions reports:
: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14645
: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/13845
:
: I believe the change should be backported to all active kernel branches
: after version 6.12.
To fix this issue, we can use the inode's size as a write size hint in
shmem_read_folio_gfp() to help allocate PMD-sized large folios.
Link: https://lkml.kernel.org/r/f7e64e99a3a87a8144cc6b2f1dddf7a89c12ce44.1753926601.git.baolin.wang@linux.alibaba.com Fixes: acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reported-by: Patryk Kowalczyk <patryk@kowalczyk.ws> Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Tested-by: Patryk Kowalczyk <patryk@kowalczyk.ws> Suggested-by: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fan Yu [Thu, 31 Jul 2025 14:53:26 +0000 (22:53 +0800)]
tools/getdelays: add backward compatibility for taskstats version
Add version checks to print_delayacct() to handle differences in struct
taskstats across kernel versions. Field availability depends on taskstats
version (t->version), corresponding to TASKSTATS_VERSION in kernel headers
(see include/uapi/linux/taskstats.h).
Version feature mapping:
- version >= 11 - supports COMPACT statistics
- version >= 13 - supports WPCOPY statistics
- version >= 14 - supports IRQ statistics
- version >= 16 - supports *_max and *_min delay statistics
This ensures the tool works correctly with both older and newer kernel
versions by conditionally printing fields based on the reported version.
eg.1
bash# grep -r "#define TASKSTATS_VERSION" /usr/include/linux/taskstats.h
"#define TASKSTATS_VERSION 10"
bash# ./getdelays -d -p 1
CPU count real total virtual total delay total delay average
7481 3786181709380709829136393725 0.005ms
IO count delay total delay average
369 1116046035 3.025ms
SWAP count delay total delay average
0 0 0.000ms
RECLAIM count delay total delay average
0 0 0.000ms
THRASHING count delay total delay average
0 0 0.000ms
eg.2
bash# grep -r "#define TASKSTATS_VERSION" /usr/include/linux/taskstats.h
"#define TASKSTATS_VERSION 14"
bash# ./getdelays -d -p 1
CPU count real total virtual total delay total delay average
68862 16347479004617458472226719962496806 0.290ms
IO count delay total delay average
0 0 0.000ms
SWAP count delay total delay average
0 0 0.000ms
RECLAIM count delay total delay average
0 0 0.000ms
THRASHING count delay total delay average
0 0 0.000ms
COMPACT count delay total delay average
0 0 0.000ms
WPCOPY count delay total delay average
0 0 0.000ms
IRQ count delay total delay average
0 0 0.000ms
Link: https://lkml.kernel.org/r/20250731225326549CttJ7g9NfjTlaqBwl015T@zte.com.cn Signed-off-by: Fan Yu <fan.yu9@zte.com.cn> Cc: Fan Yu <fan.yu9@zte.com.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Wang Yaxin <wang.yaxin@zte.com.cn> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Testing kexec handover requires a kernel driver that will generate some
data and preserve it with KHO on the first boot and then restore that data
and verify it was preserved properly after kexec.
To facilitate such test, along with the kernel driver responsible for data
generation, preservation and restoration add a script that runs a kernel
in a VM with a minimal /init. The /init enables KHO, loads a kernel image
for kexec and runs kexec reboot. After the boot of the kexeced kernel,
the driver verifies that the data was properly preserved.
Link: https://lkml.kernel.org/r/202507281628341752gMXCMN7S-Vz_LHYHum9r@zte.com.cn Signed-off-by: Fan Yu <fan.yu9@zte.com.cn> Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn> Acked-by: Yang Yang <yang.yang29@zte.com.cn> Cc: Fan Yu <fan.yu9@zte.com.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wang Yaxin [Mon, 21 Jul 2025 01:40:49 +0000 (09:40 +0800)]
MAINTAINERS: add maintainers for delaytop
The delaytop tool supports showing system delays and task-level delays,
effectively identifying the top-n tasks with high latency in the system,
which is highly beneficial for improving system performance. Wang Yaxin
and her colleague Fan Yu focus on locating system delay issues. To
promote the thriving development of delaytop, we hope to serve as
maintainers to continuously improve it, aiming to provide a more effective
solution for system latency issues in the future.
Link: https://lkml.kernel.org/r/20250721094049958ImB8XG_imntcPqpQn1KfG@zte.com.cn Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn> Signed-off-by: Fan Yu <fan.yu9@zte.com.cn> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn> Cc: Balbir Singh <bsingharora@gmail.com> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ucount: use atomic_long_try_cmpxchg() in atomic_long_inc_below()
Use atomic_long_try_cmpxchg() instead of
atomic_long_cmpxchg (*ptr, old, new) == old in atomic_long_inc_below().
x86 CMPXCHG instruction returns success in ZF flag, so this change saves
a compare after cmpxchg (and related move instruction in front of cmpxchg).
Also, atomic_long_try_cmpxchg implicitly assigns old *ptr value to "old"
when cmpxchg fails, enabling further code simplifications.
No functional change intended.
Link: https://lkml.kernel.org/r/20250721174610.28361-2-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-by: Alexey Gladkov <legion@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Alexey Gladkov <legion@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: MengEn Sun <mengensun@tencent.com> Cc: "Thomas Weißschuh" <linux@weissschuh.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The type of u argument of atomic_long_inc_below() should be long to avoid
unwanted truncation to int.
The patch fixes the wrong argument type of an internal function to
prevent unwanted argument truncation. It fixes an internal locking
primitive; it should not have any direct effect on userspace.
Mark said
: AFAICT there's no problem in practice because atomic_long_inc_below()
: is only used by inc_ucount(), and it looks like the value is
: constrained between 0 and INT_MAX.
:
: In inc_ucount() the limit value is taken from
: user_namespace::ucount_max[], and AFAICT that's only written by
: sysctls, to the table setup by setup_userns_sysctls(), where
: UCOUNT_ENTRY() limits the value between 0 and INT_MAX.
:
: This is certainly a cleanup, but there might be no functional issue in
: practice as above.
Link: https://lkml.kernel.org/r/20250721174610.28361-1-ubizjak@gmail.com Fixes: f9c82a4ea89c ("Increase size of ucounts to atomic_long_t") Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Alexey Gladkov <legion@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: MengEn Sun <mengensun@tencent.com> Cc: "Thomas Weißschuh" <linux@weissschuh.net> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alexander Graf [Tue, 10 Jun 2025 08:53:27 +0000 (08:53 +0000)]
kexec: enable CMA based contiguous allocation
When booting a new kernel with kexec_file, the kernel picks a target
location that the kernel should live at, then allocates random pages,
checks whether any of those patches magically happens to coincide with a
target address range and if so, uses them for that range.
For every page allocated this way, it then creates a page list that the
relocation code - code that executes while all CPUs are off and we are
just about to jump into the new kernel - copies to their final memory
location. We can not put them there before, because chances are pretty
good that at least some page in the target range is already in use by the
currently running Linux environment. Copying is happening from a single
CPU at RAM rate, which takes around 4-50 ms per 100 MiB.
All of this is inefficient and error prone.
To successfully kexec, we need to quiesce all devices of the outgoing
kernel so they don't scribble over the new kernel's memory. We have seen
cases where that does not happen properly (*cough* GIC *cough*) and hence
the new kernel was corrupted. This started a month long journey to root
cause failing kexecs to eventually see memory corruption, because the new
kernel was corrupted severely enough that it could not emit output to tell
us about the fact that it was corrupted. By allocating memory for the
next kernel from a memory range that is guaranteed scribbling free, we can
boot the next kernel up to a point where it is at least able to detect
corruption and maybe even stop it before it becomes severe. This
increases the chance for successful kexecs.
Since kexec got introduced, Linux has gained the CMA framework which can
perform physically contiguous memory mappings, while keeping that memory
available for movable memory when it is not needed for contiguous
allocations. The default CMA allocator is for DMA allocations.
This patch adds logic to the kexec file loader to attempt to place the
target payload at a location allocated from CMA. If successful, it uses
that memory range directly instead of creating copy instructions during
the hot phase. To ensure that there is a safety net in case anything goes
wrong with the CMA allocation, it also adds a flag for user space to force
disable CMA allocations.
Using CMA allocations has two advantages:
1) Faster by 4-50 ms per 100 MiB. There is no more need to copy in the
hot phase.
2) More robust. Even if by accident some page is still in use for DMA,
the new kernel image will be safe from that access because it resides
in a memory region that is considered allocated in the old kernel and
has a chance to reinitialize that component.
Link: https://lkml.kernel.org/r/20250610085327.51817-1-graf@amazon.com Signed-off-by: Alexander Graf <graf@amazon.com> Acked-by: Baoquan He <bhe@redhat.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Zhongkun He <hezhongkun.hzk@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
xxh32_digest() and xxh32_update() were added in 2017 in the original
xxhash commit, but have remained unused.
Remove them.
Link: https://lkml.kernel.org/r/20250716133245.243363-1-linux@treblig.org Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Dave Gilbert <linux@treblig.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Mon, 28 Jul 2025 07:52:59 +0000 (15:52 +0800)]
mm/shmem, swap: improve cached mTHP handling and fix potential hang
The current swap-in code assumes that, when a swap entry in shmem mapping
is order 0, its cached folios (if present) must be order 0 too, which
turns out not always correct.
The problem is shmem_split_large_entry is called before verifying the
folio will eventually be swapped in, one possible race is:
CPU1 CPU2
shmem_swapin_folio
/* swap in of order > 0 swap entry S1 */
folio = swap_cache_get_folio
/* folio = NULL */
order = xa_get_order
/* order > 0 */
folio = shmem_swap_alloc_folio
/* mTHP alloc failure, folio = NULL */
<... Interrupted ...>
shmem_swapin_folio
/* S1 is swapped in */
shmem_writeout
/* S1 is swapped out, folio cached */
shmem_split_large_entry(..., S1)
/* S1 is split, but the folio covering it has order > 0 now */
Now any following swapin of S1 will hang: `xa_get_order` returns 0, and
folio lookup will return a folio with order > 0. The
`xa_get_order(&mapping->i_pages, index) != folio_order(folio)` will always
return false causing swap-in to return -EEXIST.
And this looks fragile. So fix this up by allowing seeing a larger folio
in swap cache, and check the whole shmem mapping range covered by the
swapin have the right swap value upon inserting the folio. And drop the
redundant tree walks before the insertion.
This will actually improve performance, as it avoids two redundant Xarray
tree walks in the hot path, and the only side effect is that in the
failure path, shmem may redundantly reallocate a few folios causing
temporary slight memory pressure.
And worth noting, it may seems the order and value check before inserting
might help reducing the lock contention, which is not true. The swap
cache layer ensures raced swapin will either see a swap cache folio or
failed to do a swapin (we have SWAP_HAS_CACHE bit even if swap cache is
bypassed), so holding the folio lock and checking the folio flag is
already good enough for avoiding the lock contention. The chance that a
folio passes the swap entry value check but the shmem mapping slot has
changed should be very low.
Link: https://lkml.kernel.org/r/20250728075306.12704-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20250728075306.12704-2-ryncsn@gmail.com Fixes: 809bc86517cc ("mm: shmem: support large folio swap out") Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Dev Jain <dev.jain@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Linus Torvalds [Sat, 2 Aug 2025 16:58:11 +0000 (09:58 -0700)]
Merge tag 'fbdev-for-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev
Pull fbdev updates from Helge Deller:
"One potential buffer overflow fix in the framebuffer registration
function, some fixes for the imxfb, nvidiafb and simplefb drivers, and
a bunch of cleanups for fbcon, kyrofb and svgalib.
Driver fixes:
- imxfb: prevent null-ptr-deref [Chenyuan Yang]
- nvidiafb: fix build on 32-bit ARCH=um [Johannes Berg]
- nvidiafb: add depends on HAS_IOPORT [Randy Dunlap]
- simplefb: Use of_reserved_mem_region_to_resource() for "memory-region" [Rob Herring]
Cleanups:
- fbcon: various code cleanups wrt blinking [Ville Syrjälä]
- kyrofb: Convert to devm_*() functions [Giovanni Di Santi]
- svgalib: Coding style cleanups [Darshan R.]
- Fix typo in Kconfig text for FB_DEVICE [Daniel Palmer]"
* tag 'fbdev-for-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev:
fbcon: Use 'bool' where appopriate
fbcon: Introduce get_{fg,bg}_color()
fbcon: fbcon_is_inactive() -> fbcon_is_active()
fbcon: fbcon_cursor_noblink -> fbcon_cursor_blink
fbdev: Fix typo in Kconfig text for FB_DEVICE
fbdev: imxfb: Check fb_add_videomode to prevent null-ptr-deref
fbdev: svgalib: Clean up coding style
fbdev: kyro: Use devm_ioremap_wc() for screen mem
fbdev: kyro: Use devm_ioremap() for mmio registers
fbdev: kyro: Add missing PCI memory region request
fbdev: simplefb: Use of_reserved_mem_region_to_resource() for "memory-region"
fbdev: fix potential buffer overflow in do_register_framebuffer()
fbdev: nvidiafb: add depends on HAS_IOPORT
fbdev: nvidiafb: fix build on 32-bit ARCH=um
Linus Torvalds [Sat, 2 Aug 2025 16:52:53 +0000 (09:52 -0700)]
Merge tag 'firewire-updates-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394
Pull firewire updates from Takashi Sakamoto:
"This update replaces the remaining tasklet usage in the FireWire
subsystem with workqueue for asynchronous packet transmission. With
this change, tasklets are now fully eliminated from the subsystem.
Asynchronous packet transmission is used for serial bus topology
management as well as for the operation of the SBP-2 protocol driver
(firewire-sbp2). To ensure reliability during low-memory conditions,
the associated workqueue is created with the WQ_MEM_RECLAIM flag,
allowing it to participate in memory reclaim paths. Other attributes
are aligned with those used for isochronous packet handling, which was
migrated to workqueues in v6.12.
The workqueues are sleepable and support preemptible work items,
making them more suitable for real-time workloads that benefit from
timely task preemption at the system level.
There remains an issue where 'schedule()' may be called within an RCU
read-side critical section, due to a direct replacement of
'tasklet_disable_in_atomic()' with 'disable_work_sync()'. A proposed
fix for this has been posted[1], and is currently under review and
testing. It is expected to be sent upstream later"
Link: https://lore.kernel.org/lkml/20250728015125.17825-1-o-takashi@sakamocchi.jp/
* tag 'firewire-updates-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
firewire: ohci: reduce the size of common context structure by extracting members into AT structure
firewire: core: minor code refactoring to localize table of gap count
firewire: ohci: use workqueue to handle events of AT request/response contexts
firewire: ohci: use workqueue to handle events of AR request/response contexts
firewire: core: allocate workqueue for AR/AT request/response contexts
firewire: core: use from_work() macro to expand parent structure of work_struct
firewire: ohci: use from_work() macro to expand parent structure of work_struct
firewire: ohci: correct code comments about bus_reset tasklet
env->scc_info array contains references to bpf_scc_info objects
allocated lazily in verifier.c:scc_visit_alloc().
env->scc_cnt was supposed to track env->scc_info array size
in order to free referenced objects in verifier.c:free_states().
Fix initialization of env->scc_cnt that was omitted in
verifier.c:compute_scc().
To reproduce the bug:
- build with CONFIG_DEBUG_KMEMLEAK
- boot and load bpf program with loops, e.g.:
./veristat -q pyperf180.bpf.o
- initiate memleak scan and check results:
echo scan > /sys/kernel/debug/kmemleak
cat /sys/kernel/debug/kmemleak
Fixes: c9e31900b54c ("bpf: propagate read/precision marks over state graph backedges") Reported-by: Jens Axboe <axboe@kernel.dk> Closes: https://lore.kernel.org/bpf/CAADnVQKXUWg9uRCPD5ebRXwN4dmBCRUFFM7kN=GxymYz3zU25A@mail.gmail.com/T/ Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Tested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250801232330.1800436-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Sean Anderson [Fri, 1 Aug 2025 15:47:10 +0000 (11:47 -0400)]
ALSA: usb-audio: Don't use printk_ratelimit for debug prints
printk_ratelimit is deprecated, since it shares state with all other
printk sites. Additionally, the suppression message is printed at
warning level even though the actual messages are printed at debug and
are (usually) invisible! This can result in thousands of messages like
retire_capture_urb: 4992 callbacks suppressed
in the console, and can inhibit debugging since it is unclear what the
source of the suppressed callbacks is.
Switch to dev_dbg_ratelimited which doesn't print anything unless debug
is enabled.
Thomas Gleixner [Wed, 30 Jul 2025 19:44:55 +0000 (21:44 +0200)]
futex: Move futex cleanup to __mmdrop()
Futex hash allocations are done in mm_init() and the cleanup happens in
__mmput(). That works most of the time, but there are mm instances which
are instantiated via mm_alloc() and freed via mmdrop(), which causes the
futex hash to be leaked.
Linus Torvalds [Sat, 2 Aug 2025 00:13:26 +0000 (17:13 -0700)]
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:
- Fix kCFI failures in JITed BPF code on arm64 (Sami Tolvanen, Puranjay
Mohan, Mark Rutland, Maxwell Bland)
- Disallow tail calls between BPF programs that use different cgroup
local storage maps to prevent out-of-bounds access (Daniel Borkmann)
- Fix unaligned access in flow_dissector and netfilter BPF programs
(Paul Chaignon)
- Avoid possible use of uninitialized mod_len in libbpf (Achill
Gilgenast)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Test for unaligned flow_dissector ctx access
bpf: Improve ctx access verifier error message
bpf: Check netfilter ctx accesses are aligned
bpf: Check flow_dissector ctx accesses are aligned
arm64/cfi,bpf: Support kCFI + BPF on arm64
cfi: Move BPF CFI types and helpers to generic code
cfi: add C CFI type macro
libbpf: Avoid possible use of uninitialized mod_len
bpf: Fix oob access in cgroup local storage
bpf: Move cgroup iterator helpers to bpf.h
bpf: Move bpf map owner out of common struct
bpf: Add cookie object to bpf maps
Linus Torvalds [Fri, 1 Aug 2025 23:55:47 +0000 (16:55 -0700)]
Merge tag 'perf-tools-for-v6.17-2025-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
Pull perf tools updates from Namhyung Kim:
"Build-ID processing goodies:
Build-IDs are content based hashes to link regions of memory to ELF
files in post processing. They have been available in distros for
quite a while:
$ file /bin/bash
/bin/bash: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV),
dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2,
BuildID[sha1]=707a1c670cd72f8e55ffedfbe94ea98901b7ce3a,
for GNU/Linux 3.2.0, stripped
It is possible to ask the kernel to get it from mmap executable
backing storage at time they are being put in place and send it as
metadata at that moment to have in perf.data.
Prefer that across the board to speed up 'record' time - it post
processes the samples to find binaries touched by any samples and
to save them with build-ID. It can skip reading build-ID in
userspace if it comes from the kernel.
perf record:
* Make --buildid-mmap default. The kernel can generate MMAP2 events
with a build-ID from ELF header. Use that by default instead of using
inode and device ID to identify binaries. It also can be disabled
with --no-buildid-mmap.
* Use BPF for -u/--uid option to sample processes belong to a user.
BPF can track user processes more accurately and the existing logic
often fails to get the list of processes due to race with reading the
/proc filesystem.
* Generate PERF_RECORD_BPF_METADATA when it profiles BPF programs and
they have variables starting with "bpf_metadata_". This will help to
identify BPF objects used in the profile. This has been supported in
bpftool for some time and allows the recording of metadata such as
commit hashes, versions, etc, that now gets recorded in perf.data as
well.
* Collect list of DSOs touched in the sample callchains as well as in
the sample itself. This would increase the processing time at the end
of record, but can improve the data quality.
perf stat:
* Add a new 'drm' pseudo-PMU support like in 'hwmon'. It can collect
DRM usage stats using fdinfo in /proc.
On my Intel laptop, it shows like below:
$ perf list drm
...
drm:
drm-active-stolen-system0
[Total memory active in one or more engines. Unit: drm_i915]
drm-active-system0
[Total memory active in one or more engines. Unit: drm_i915]
drm-engine-capacity-video
[Engine capacity. Unit: drm_i915]
drm-engine-copy
[Utilization in ns. Unit: drm_i915]
drm-engine-render
[Utilization in ns. Unit: drm_i915]
drm-engine-video
[Utilization in ns. Unit: drm_i915]
...
$ sudo perf stat -a -e drm-engine-render,drm-engine-video,drm-engine-capacity-video sleep 1
* Add description for software events. The description is in JSON format
and the event parser now can handle the software events like others
(for example, it's case-insensitive and subject to wildcard matching).
$ perf list software
List of pre-defined events (to be used in -e or -M):
software:
alignment-faults
[Number of kernel handled memory alignment faults. Unit: software]
bpf-output
[An event used by BPF programs to write to the perf ring buffer. Unit: software]
cgroup-switches
[Number of context switches to a task in a different cgroup. Unit: software]
context-switches
[Number of context switches [This event is an alias of cs]. Unit: software]
cpu-clock
[Per-CPU high-resolution timer based event. Unit: software]
cpu-migrations
[Number of times a process has migrated to a new CPU [This event is an alias of migrations]. Unit: software]
cs
[Number of context switches [This event is an alias of context-switches]. Unit: software]
dummy
[A placeholder event that doesn't count anything. Unit: software]
emulation-faults
[Number of kernel handled unimplemented instruction faults handled through emulation. Unit: software]
faults
[Number of page faults [This event is an alias of page-faults]. Unit: software]
major-faults
[Number of major page faults. Major faults require I/O to handle. Unit: software]
migrations
[Number of times a process has migrated to a new CPU [This event is an alias of cpu-migrations]. Unit: software]
minor-faults
[Number of minor page faults. Minor faults don't require I/O to handle. Unit: software]
page-faults
[Number of page faults [This event is an alias of faults]. Unit: software]
task-clock
[Per-task high-resolution timer based event. Unit: software]
perf ftrace:
* Add -e/--events option to perf ftrace latency to measure latency
between the two events instead of a function.
# statistics (in usec)
total time: 194915
avg time: 6961
max time: 12855
min time: 373
count: 28
* Add new function graph tracer options (--graph-opts) to display more
info like arguments and return value. They will be passed to the
kernel ftrace directly.
* Add perf archive --exclude-buildids <FILE> option to skip some binaries.
The format of the FILE should be same as an output of perf buildid-list.
* Get rid of dependency of libcrypto. It was just to get SHA-1 hash so
implement it directly like in the kernel. A side effect is that it
needs -fno-strict-aliasing compiler option (again, like in the kernel).
* Convert all shell script tests to use bash"
* tag 'perf-tools-for-v6.17-2025-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (179 commits)
perf record: Cache build-ID of hit DSOs only
perf test: Ensure lock contention using pipe mode
perf python: Stop using deprecated PyUnicode_AsString()
perf list: Skip ABI PMUs when printing pmu values
perf list: Remove tracepoint printing code
perf tp_pmu: Add event APIs
perf tp_pmu: Factor existing tracepoint logic to new file
perf parse-events: Remove non-json software events
perf jevents: Add common software event json
perf tools: Remove libtraceevent in .gitignore
perf test: Fix comment ordering
perf sort: Use perf_env to set arch sort keys and header
perf test: Move PERF_SAMPLE_WEIGHT_STRUCT parsing to common test
perf sample: Remove arch notion of sample parsing
perf env: Remove global perf_env
perf trace: Avoid global perf_env with evsel__env
perf auxtrace: Pass perf_env from session through to mmap read
perf machine: Explicitly pass in host perf_env
perf bench synthesize: Avoid use of global perf_env
perf top: Make perf_env locally scoped
...
Linus Torvalds [Fri, 1 Aug 2025 23:15:53 +0000 (16:15 -0700)]
Merge tag 'parisc-for-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc updates from Helge Deller:
- The parisc kernel wrongly allows reading from read-protected
userspace memory without faulting, e.g. when userspace uses
mprotect() to read-protect a memory area and then uses a pointer to
this memory in a write(2, addr, 1) syscall.
To fix this issue, Dave Anglin developed a set of patches which use
the proberi assembler instruction to additionally check read access
permissions at runtime.
- Randy Dunlap contributed two patches to fix a minor typo and to
explain why a 32-bit compiler is needed although a 64-bit kernel is
built
* tag 'parisc-for-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Revise __get_user() to probe user read access
parisc: Revise gateway LWS calls to probe user read access
parisc: Drop WARN_ON_ONCE() from flush_cache_vmap
parisc: Try to fixup kernel exception in bad_area_nosemaphore path of do_page_fault()
parisc: Define and use set_pte_at()
parisc: Rename pte_needs_flush() to pte_needs_cache_flush() in cache.c
parisc: Check region is readable by user in raw_copy_from_user()
parisc: Update comments in make_insert_tlb
parisc: Makefile: explain that 64BIT requires both 32-bit and 64-bit compilers
parisc: Makefile: fix a typo in palo.conf
Steven Rostedt [Fri, 1 Aug 2025 20:56:01 +0000 (16:56 -0400)]
tracing: Have unsigned int function args displayed as hexadecimal
Most function arguments that are passed in as unsigned int or unsigned
long are better displayed as hexadecimal than normal integer. For example,
the functions:
static void __create_object(unsigned long ptr, size_t size,
int min_count, gfp_t gfp, unsigned int objflags);
static bool stack_access_ok(struct unwind_state *state, unsigned long _addr,
size_t len);
void __local_bh_disable_ip(unsigned long ip, unsigned int cnt);
Which is much easier to understand as most unsigned longs are usually just
pointers. Even the "unsigned int cnt" in __local_bh_disable_ip() looks
better as hexadecimal as a lot of flags are passed as unsigned.
Linus Torvalds [Fri, 1 Aug 2025 22:47:06 +0000 (15:47 -0700)]
Merge tag 'cxl-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull CXL updates from Dave Jiang:
"The most significant changes in this pull request is the series that
introduces ACQUIRE() and ACQUIRE_ERR() macros to replace conditional
locking and ease the pain points of scoped_cond_guard().
The series also includes follow on changes that refactor the CXL
sub-system to utilize the new macros.
Detail summary:
- Add documentation template for CXL conventions to document CXL
platform quirks
- Replace mutex_lock_io() with mutex_lock() for mailbox
- Add location limit for fake CFMWS range for cxl_test, ARM platform
enabling
- CXL documentation typo and clarity fixes
- Use correct format specifier for function cxl_set_ecs_threshold()
- Make cxl_bus_type constant
- Introduce new helper cxl_resource_contains_addr() to check address
availability
- Fix wrong DPA checking for PPR operation
- Remove core/acpi.c and CXL core dependency on ACPI
- Introduce ACQUIRE() and ACQUIRE_ERR() for conditional locks
- Add CXL updates utilizing ACQUIRE() macro to remove gotos and
improve readability
- Add return for the dummy version of cxl_decoder_detach() without
CONFIG_CXL_REGION
- CXL events updates for spec r3.2
- Fix return of __cxl_decoder_detach() error path
- CXL debugfs documentation fix"
* tag 'cxl-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (28 commits)
Documentation/ABI/testing/debugfs-cxl: Add 'cxl' to clear_poison path
cxl/region: Fix an ERR_PTR() vs NULL bug
cxl/events: Trace Memory Sparing Event Record
cxl/events: Add extra validity checks for CVME count in DRAM Event Record
cxl/events: Add extra validity checks for corrected memory error count in General Media Event Record
cxl/events: Update Common Event Record to CXL spec rev 3.2
cxl: Fix -Werror=return-type in cxl_decoder_detach()
cleanup: Fix documentation build error for ACQUIRE updates
cxl: Convert to ACQUIRE() for conditional rwsem locking
cxl/region: Consolidate cxl_decoder_kill_region() and cxl_region_detach()
cxl/region: Move ready-to-probe state check to a helper
cxl/region: Split commit_store() into __commit() and queue_reset() helpers
cxl/decoder: Drop pointless locking
cxl/decoder: Move decoder register programming to a helper
cxl/mbox: Convert poison list mutex to ACQUIRE()
cleanup: Introduce ACQUIRE() and ACQUIRE_ERR() for conditional locks
cxl: Remove core/acpi.c and cxl core dependency on ACPI
cxl/core: Using cxl_resource_contains_addr() to check address availability
cxl/edac: Fix wrong dpa checking for PPR operation
cxl/core: Introduce a new helper cxl_resource_contains_addr()
...
net: Add locking to protect skb->dev access in ip_output
In ip_output() skb->dev is updated from the skb_dst(skb)->dev
this can become invalid when the interface is unregistered and freed,
Introduced new skb_dst_dev_rcu() function to be used instead of
skb_dst_dev() within rcu_locks in ip_output.This will ensure that
all the skb's associated with the dev being deregistered will
be transnmitted out first, before freeing the dev.
Given that ip_output() is called within an rcu_read_lock()
critical section or from a bottom-half context, it is safe to introduce
an RCU read-side critical section within it.
Multiple panic call stacks were observed when UL traffic was run
in concurrency with device deregistration from different functions,
pasting one sample for reference.
Changes in v3:
- Replaced WARN_ON() with WARN_ON_ONCE(), as suggested by Willem de Bruijn.
- Dropped legacy lines mistakenly pulled in from an outdated branch.
Changes in v2:
- Addressed review comments from Eric Dumazet
- Used READ_ONCE() to prevent potential load/store tearing
- Added skb_dst_dev_rcu() and used along with rcu_read_lock() in ip_output
net/sched: taprio: enforce minimum value for picos_per_byte
Syzbot reported a WARNING in taprio_get_start_time().
When link speed is 470,589 or greater, q->picos_per_byte becomes too
small, causing length_to_duration(q, ETH_ZLEN) to return zero.
This zero value leads to validation failures in fill_sched_entry() and
parse_taprio_schedule(), allowing arbitrary values to be assigned to
entry->interval and cycle_time. As a result, sched->cycle can become zero.
Since SPEED_800000 is the largest defined speed in
include/uapi/linux/ethtool.h, this issue can occur in realistic scenarios.
To ensure length_to_duration() returns a non-zero value for minimum-sized
Ethernet frames (ETH_ZLEN = 60), picos_per_byte must be at least 17
(60 * 17 > PSEC_PER_NSEC which is 1000).
This patch enforces a minimum value of 17 for picos_per_byte when the
calculated value would be lower, and adds a warning message to inform
users that scheduling accuracy may be affected at very high link speeds.
Wang Liang [Wed, 30 Jul 2025 10:14:58 +0000 (18:14 +0800)]
net: drop UFO packets in udp_rcv_segment()
When sending a packet with virtio_net_hdr to tun device, if the gso_type
in virtio_net_hdr is SKB_GSO_UDP and the gso_size is less than udphdr
size, below crash may happen.
To trigger gso segment in udp_queue_rcv_skb(), we should also set option
UDP_ENCAP_ESPINUDP to enable udp_sk(sk)->encap_rcv. When the encap_rcv
hook return 1 in udp_queue_rcv_one_skb(), udp_csum_pull_header() will try
to pull udphdr, but the skb size has been segmented to gso size, which
leads to this crash.
Previous commit cf329aa42b66 ("udp: cope with UDP GRO packet misdirection")
introduces segmentation in UDP receive path only for GRO, which was never
intended to be used for UFO, so drop UFO packets in udp_rcv_segment().
Linus Torvalds [Fri, 1 Aug 2025 22:02:25 +0000 (15:02 -0700)]
Merge tag 'rproc-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux
Pull remoteproc updates from Bjorn Andersson:
- Make the Xilinx remoteproc driver support running on only a single
core, disable still unsupported remoteproc features, and stop the
remoteproc on shutdown to facilitate kexec.
- Conclude the renaming of the Qualcomm ADSP driver to "PAS" that was
started many years ago.
* tag 'rproc-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux:
remoteproc: xlnx: Fix kernel-doc warnings
remoteproc: xlnx: Disable unsupported features
remoteproc: xlnx: Add shutdown callback
remoteproc: xlnx: Allow single core use in split mode
dt-bindings: remoteproc: qcom,sa8775p-pas: Correct the interrupt number
remoteproc: Don't use %pK through printk
dt-bindings: remoteproc: qcom,sm8150-pas: Document QCS615 remoteproc
remoteproc: qcom: pas: Conclude the rename from adsp
Paul Chaignon [Fri, 1 Aug 2025 09:49:44 +0000 (11:49 +0200)]
selftests/bpf: Test for unaligned flow_dissector ctx access
This patch adds tests for two context fields where unaligned accesses
were not properly rejected.
Note the new macro is similar to the existing narrow_load macro, but we
need a different description and access offset. Combining the two
macros into one is probably doable but I don't think it would help
readability.
vmlinux.h is included in place of bpf.h so we have the definition of
struct bpf_nf_ctx.
When the parent clock is a gated clock which has multiple parents, the
clock provider (clk-scmi typically) might return a rate of 0 since there
is not one of those particular parent clocks that should be chosen for
returning a rate. Prior to ee975351cf0c ("net: mdio: mdio-bcm-unimac:
Manage clock around I/O accesses"), we would not always be passing a
clock reference depending upon how mdio-bcm-unimac was instantiated. In
that case, we would take the fallback path where the rate is hard coded
to 250MHz.
Make sure that we still fallback to using a fixed rate for the divider
calculation, otherwise we simply ignore the desired MDIO bus clock
frequency which can prevent us from interfacing with Ethernet PHYs
properly.
Fixes: ee975351cf0c ("net: mdio: mdio-bcm-unimac: Manage clock around I/O accesses") Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250730202533.3463529-1-florian.fainelli@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Wed, 30 Jul 2025 11:53:13 +0000 (11:53 +0000)]
selftests: avoid using ifconfig
ifconfig is deprecated and not always present, use ip command instead.
Fixes: e0f3b3e5c77a ("selftests: Add test cases for vlan_filter modification during runtime") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Dong Chenchen <dongchenchen2@huawei.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20250730115313.3356036-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, the user is always asked about the Microchip Azurite
DPLL/PTP/SyncE core driver, even when I2C and SPI are disabled, and thus
the driver cannot be used at all.
Fix this by making the Kconfig symbol for the core driver invisible
(unless compile-testing), and selecting it by the bus glue sub-drivers.
Drop the modular defaults, as drivers should not default to enabled.
Christoph Paasch [Tue, 29 Jul 2025 18:34:00 +0000 (11:34 -0700)]
net/mlx5: Correctly set gso_segs when LRO is used
When gso_segs is left at 0, a number of assumptions will end up being
incorrect throughout the stack.
For example, in the GRO-path, we set NAPI_GRO_CB()->count to gso_segs.
So, if a non-LRO'ed packet followed by an LRO'ed packet is being
processed in GRO, the first one will have NAPI_GRO_CB()->count set to 1 and
the next one to 0 (in dev_gro_receive()).
Since commit 531d0d32de3e
("net/mlx5: Correctly set gso_size when LRO is used")
these packets will get merged (as their gso_size now matches).
So, we end up in gro_complete() with NAPI_GRO_CB()->count == 1 and thus
don't call inet_gro_complete(). Meaning, checksum-validation in
tcp_checksum_complete() will fail with a "hw csum failure".
Even before the above mentioned commit, incorrect gso_segs means that other
things like TCP's accounting of incoming packets (tp->segs_in,
data_segs_in, rcv_ooopack) will be incorrect. Which means that if one
does bytes_received/data_segs_in, the result will be bigger than the
MTU.
Fix this by initializing gso_segs correctly when LRO is used.
Fixes: e586b3b0baee ("net/mlx5: Ethernet Datapath files") Reported-by: Gal Pressman <gal@nvidia.com> Closes: https://lore.kernel.org/netdev/6583783f-f0fb-4fb1-a415-feec8155bc69@nvidia.com/ Signed-off-by: Christoph Paasch <cpaasch@openai.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250729-mlx5_gso_segs-v1-1-b48c480c1c12@openai.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Edward Cree [Thu, 31 Jul 2025 14:41:38 +0000 (15:41 +0100)]
sfc: unfix not-a-typo in comment
Commit fe09560f8241 ("net: Fix typos") removed duplicated word 'fallback',
but this was not a typo and change altered the semantic meaning of
the comment.
Partially revert, using the phrase 'fallback of the fallback' to make
the meaning more clear to future readers so that they won't try to
change it again.
Linus Torvalds [Fri, 1 Aug 2025 20:59:07 +0000 (13:59 -0700)]
Merge tag 'pci-v6.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull PCI updates from Bjorn Helgaas:
"Enumeration:
- Allow built-in drivers, not just modular drivers, to use async
initial probing (Lukas Wunner)
- Support Immediate Readiness even on devices with no PM Capability
(Sean Christopherson)
- Consolidate definition of PCIE_RESET_CONFIG_WAIT_MS (100ms), the
required delay between a reset and sending config requests to a
device (Niklas Cassel)
- Add pci_is_display() to check for "Display" base class and use it
in ALSA hda, vfio, vga_switcheroo, vt-d (Mario Limonciello)
- Allow 'isolated PCI functions' (multi-function devices without a
function 0) for LoongArch, similar to s390 and jailhouse (Huacai
Chen)
Power control:
- Add ability to enable optional slot clock for cases where the PCIe
host controller and the slot are supplied by different clocks
(Marek Vasut)
PCIe native device hotplug:
- Fix runtime PM ref imbalance on Hot-Plug Capable ports caused by
misinterpreting a config read failure after a device has been
removed (Lukas Wunner)
- Avoid creating a useless PCIe port service device for pciehp if the
slot is handled by the ACPI hotplug driver (Lukas Wunner)
- Ignore ACPI hotplug slots when calculating depth of pciehp hotplug
ports (Lukas Wunner)
Virtualization:
- Save VF resizable BAR state and restore it after reset (Michał
Winiarski)
- Allow IOV resources (VF BARs) to be resized (Michał Winiarski)
- Add pci_iov_vf_bar_set_size() so drivers can control VF BAR size
(Michał Winiarski)
Endpoint framework:
- Add RC-to-EP doorbell support using platform MSI controller,
including a test case (Frank Li)
- Allow BAR assignment via configfs so platforms have flexibility in
determining BAR usage (Jerome Brunet)
Native PCIe controller drivers:
- Convert amazon,al-alpine-v[23]-pcie, apm,xgene-pcie,
axis,artpec6-pcie, marvell,armada-3700-pcie, st,spear1340-pcie to
DT schema format (Rob Herring)
- Use dev_fwnode() instead of of_fwnode_handle() to remove OF
dependency in altera (fixes an unused variable), designware-host,
mediatek, mediatek-gen3, mobiveil, plda, xilinx, xilinx-dma,
xilinx-nwl (Jiri Slaby, Arnd Bergmann)
- Convert aardvark, altera, brcmstb, designware-host, iproc,
mediatek, mediatek-gen3, mobiveil, plda, rcar-host, vmd, xilinx,
xilinx-dma, xilinx-nwl from using pci_msi_create_irq_domain() to
using msi_create_parent_irq_domain() instead; this makes the
interrupt controller per-PCI device, allows dynamic allocation of
vectors after initialization, and allows support of IMS (Nam Cao)
APM X-Gene PCIe controller driver:
- Rewrite MSI handling to MSI CPU affinity, drop useless CPU hotplug
bits, use device-managed memory allocations, and clean things up
(Marc Zyngier)
- Probe xgene-msi as a standard platform driver rather than a
subsys_initcall (Marc Zyngier)
Broadcom STB PCIe controller driver:
- Add optional DT 'num-lanes' property and if present, use it to
override the Maximum Link Width advertised in Link Capabilities
(Jim Quinlan)
Cadence PCIe controller driver:
- Use PCIe Message routing types from the PCI core rather than
defining private ones (Hans Zhang)
Freescale i.MX6 PCIe controller driver:
- Add IMX8MQ_EP third 64-bit BAR in epc_features (Richard Zhu)
- Add IMX8MM_EP and IMX8MP_EP fixed 256-byte BAR 4 in epc_features
(Richard Zhu)
- Configure LUT for MSI/IOMMU in Endpoint mode so Root Complex can
trigger doorbel on Endpoint (Frank Li)
- Remove apps_reset (LTSSM_EN) from
imx_pcie_{assert,deassert}_core_reset(), which fixes a hotplug
regression on i.MX8MM (Richard Zhu)
- Delay Endpoint link start until configfs 'start' written (Richard
Zhu)
Intel VMD host bridge driver:
- Add Intel Panther Lake (PTL)-H/P/U Vendor ID (George D Sworo)
Qualcomm PCIe controller driver:
- Add DT binding and driver support for SA8255p, which supports ECAM
for Configuration Space access (Mayank Rana)
- Update DT binding and driver to describe PHYs and per-Root Port
resets in a Root Port stanza and deprecate describing them in the
host bridge; this makes it possible to support multiple Root Ports
in the future (Krishna Chaitanya Chundru)
- Add Qualcomm QCS615 to SM8150 DT binding (Ziyue Zhang)
- Add Qualcomm QCS8300 to SA8775p DT binding (Ziyue Zhang)
- Drop TBU and ref clocks from Qualcomm SM8150 and SC8180x DT
bindings (Konrad Dybcio)
- Add required PCIE_RESET_CONFIG_WAIT_MS delay after Link up IRQ
(Niklas Cassel)
Rockchip PCIe controller driver:
- Drop unused PCIe Message routing and code definitions (Hans Zhang)
- Remove several unused header includes (Hans Zhang)
- Use standard PCIe config register definitions instead of
rockchip-specific redefinitions (Geraldo Nascimento)
- Set Target Link Speed to 5.0 GT/s before retraining so we have a
chance to train at a higher speed (Geraldo Nascimento)
Rockchip DesignWare PCIe controller driver:
- Prevent race between link training and register update via DBI by
inhibiting link training after hot reset and link down (Wilfred
Mallawa)
- Add required PCIE_RESET_CONFIG_WAIT_MS delay after Link up IRQ
(Niklas Cassel)
Sophgo PCIe controller driver:
- Add DT binding and driver for Sophgo SG2044 PCIe controller driver
in Root Complex mode (Inochi Amaoto)
Synopsys DesignWare PCIe controller driver:
- Add required PCIE_RESET_CONFIG_WAIT_MS after waiting for Link up on
Ports that support > 5.0 GT/s. Slower Ports still rely on the
not-quite-correct PCIE_LINK_WAIT_SLEEP_MS 90ms default delay while
waiting for the Link (Niklas Cassel)"
* tag 'pci-v6.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (116 commits)
dt-bindings: PCI: qcom,pcie-sa8775p: Document 'link_down' reset
dt-bindings: PCI: Remove 83xx-512x-pci.txt
dt-bindings: PCI: Convert amazon,al-alpine-v[23]-pcie to DT schema
dt-bindings: PCI: Convert marvell,armada-3700-pcie to DT schema
dt-bindings: PCI: Convert apm,xgene-pcie to DT schema
dt-bindings: PCI: Convert axis,artpec6-pcie to DT schema
dt-bindings: PCI: Convert st,spear1340-pcie to DT schema
PCI: Move is_pciehp check out of pciehp_is_native()
PCI: pciehp: Use is_pciehp instead of is_hotplug_bridge
PCI/portdrv: Use is_pciehp instead of is_hotplug_bridge
PCI/ACPI: Fix runtime PM ref imbalance on Hot-Plug Capable ports
selftests: pci_endpoint: Add doorbell test case
misc: pci_endpoint_test: Add doorbell test case
PCI: endpoint: pci-epf-test: Add doorbell test support
PCI: endpoint: Add pci_epf_align_inbound_addr() helper for inbound address alignment
PCI: endpoint: pci-ep-msi: Add checks for MSI parent and mutability
PCI: endpoint: Add RC-to-EP doorbell support using platform MSI controller
PCI: dwc: Add Sophgo SG2044 PCIe controller driver in Root Complex mode
PCI: vmd: Switch to msi_create_parent_irq_domain()
PCI: vmd: Convert to lock guards
...
Lorenzo Bianconi [Thu, 31 Jul 2025 10:29:08 +0000 (12:29 +0200)]
net: airoha: Fix PPE table access in airoha_ppe_debugfs_foe_show()
In order to avoid any possible race we need to hold the ppe_lock
spinlock accessing the hw PPE table. airoha_ppe_foe_get_entry routine is
always executed holding ppe_lock except in airoha_ppe_debugfs_foe_show
routine. Fix the problem introducing airoha_ppe_foe_get_entry_locked
routine.
selftests: net: Fix flaky neighbor garbage collection test
The purpose of the "Periodic garbage collection" test case is to make
sure that "extern_valid" neighbors are not flushed during periodic
garbage collection, unlike regular neighbor entries.
The test case is currently doing the following:
1. Changing the base reachable time to 10 seconds so that periodic
garbage collection will run every 5 seconds.
2. Changing the garbage collection stale time to 5 seconds so that
neighbors that have not been used in the last 5 seconds will be
considered for removal.
3. Waiting for the base reachable time change to take effect.
4. Adding an "extern_valid" neighbor, a non-"extern_valid" neighbor and
a bunch of other neighbors so that the threshold ("thresh1") will be
crossed and stale neighbors will be flushed during garbage
collection.
5. Waiting for 10 seconds to give garbage collection a chance to run.
6. Checking that the "extern_valid" neighbor was not flushed and that
the non-"extern_valid" neighbor was flushed.
The test sometimes fails in the netdev CI because the non-"extern_valid"
neighbor was not flushed. I am unable to reproduce this locally, but my
theory that since we do not know exactly when the periodic garbage
collection runs, it is possible for it to run at a time when the
non-"extern_valid" neighbor is still not considered stale.
Fix by moving the addition of the two neighbors before step 3 and by
reducing the garbage collection stale time to 1 second, to ensure that
both neighbors are considered stale when garbage collection runs.
Fixes: 171f2ee31a42 ("selftests: net: Add a selftest for externally validated neighbor entries") Reported-by: Jakub Kicinski <kuba@kernel.org> Closes: https://lore.kernel.org/netdev/20250728093504.4ebbd73c@kernel.org/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250731110914.506890-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Steven Rostedt [Fri, 1 Aug 2025 20:37:26 +0000 (16:37 -0400)]
tracing: Use __free(kfree) in trace.c to remove gotos
There's a couple of locations that have goto out in trace.c for the only
purpose of freeing a variable that was allocated. These can be replaced
with __free(kfree).
Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20250801203858.040892777@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Steven Rostedt [Fri, 1 Aug 2025 20:37:24 +0000 (16:37 -0400)]
tracing: Add guard(ring_buffer_nest)
Some calls to the tracing ring buffer can happen when the ring buffer is
already being written to by the same context (for example, a
trace_printk() in between a ring_buffer_lock_reserve() and a
ring_buffer_unlock_commit()).
In order to not trigger the recursion detection, these functions use
ring_buffer_nest_start() and ring_buffer_nest_end(). Create a guard() for
these functions so that their use cases can be simplified and not need to
use goto for the release.
Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20250801203857.710501021@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Steven Rostedt [Fri, 1 Aug 2025 20:37:23 +0000 (16:37 -0400)]
tracing: Remove unneeded goto out logic
Several places in the trace.c file there's a goto out where the out is
simply a return. There's no reason to jump to the out label if it's not
doing any more logic but simply returning from the function.
Replace the goto outs with a return and remove the out labels.
Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20250801203857.538726745@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Linus Torvalds [Fri, 1 Aug 2025 19:35:12 +0000 (12:35 -0700)]
Merge tag 'dmaengine-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine
Pull dmaengine updates from Vinod Koul:
"Core:
- Managed API for dma channel request
New support:
- Sophgo CV18XX/SG200X dmamux driver
- Qualcomm Milos GPI, sc8280xp GPI support
Updates:
- Conversion of brcm,iproc-sba and marvell,orion-xor binding
- Unused code cleanup across drivers"
* tag 'dmaengine-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: (23 commits)
dt-bindings: dma: fsl-mxs-dma: allow interrupt-names for fsl,imx23-dma-apbx
dmaengine: xdmac: make it selectable for ARCH_MICROCHIP
dt-bindings: dma: Convert marvell,orion-xor to DT schema
dt-bindings: dma: Convert brcm,iproc-sba to DT schema
dmaengine: nbpfaxi: Add missing check after DMA map
dmaengine: mv_xor: Fix missing check after DMA map and missing unmap
dt-bindings: dma: qcom,gpi: document the Milos GPI DMA Engine
dmaengine: idxd: Remove __packed from structures
dmaengine: ti: Do not enable by default during compile testing
dmaengine: sh: Do not enable SH_DMAE_BASE by default during compile testing
dmaengine: idxd: Fix warning for deadcode.deadstore
dmaengine: mmp: Fix again Wvoid-pointer-to-enum-cast warning
dmaengine: fsl-qdma: Add missing fsl_qdma_format kerneldoc
dmaengine: qcom: gpi: Drop unused gpi_write_reg_field()
dmaengine: fsl-dpaa2-qdma: Drop unused mc_enc()
dmaengine: dw-edma: Drop unused dchan2dev() and chan2dev()
dmaengine: stm32: Don't use %pK through printk
dmaengine: stm32-dma: configure next sg only if there are more than 2 sgs
dmaengine: sun4i: Simplify error handling in probe()
dt-bindings: dma: qcom,gpi: Document the sc8280xp GPI DMA engine
...
Linus Torvalds [Fri, 1 Aug 2025 19:26:24 +0000 (12:26 -0700)]
Merge tag 'sound-6.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull more sound updates from Takashi Iwai:
"For catching up the remaining stuff for 6.17: only small updates and
the rest are mostly small fixes.
- Fixes in HD-audio codec driver Kconfig, so that configurations can
be more easily/safely carried between different versions
- Fixes in ASoC SDCA, FSL xcvr, AW88399
- ASoC IMX WM8524 support
- HD-audio and USB-audio quirks and fixes
- A minor selftest fix"
* tag 'sound-6.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (21 commits)
ALSA: usb: scarlett2: Fix missing NULL check
mips: Update HD-audio configs again
LoongArch: Update HD-audio codec configs
arm: Update HD-audio configs again
selftests: ALSA: fix memory leak in utimer test
ALSA: usb-audio: Add DSD support for Comtrue USB Audio device
ALSA: hda/hdmi: Enable drivers as default
ALSA: hda/cirrus: Enable drivers as default
ALSA: hda/realtek: Enable drivers as default
ALSA: hda/realtek - Fix mute LED for HP Victus 16-d1xxx (MB 8A26)
ALSA: hda/realtek - Fix mute LED for HP Victus 16-s0xxx
ALSA: hda: Fix the wrong register was used for DVC of TAS2770
ALSA: scarlett2: Add retry on -EPROTO from scarlett2_usb_tx()
ALSA: hda/realtek - Fix mute LED for HP Victus 16-r1xxx
ASoC: codecs: Add acpi_match_table for aw88399 driver
ASoC: dt-bindings: atmel,at91-ssc: add microchip,sam9x7-ssc
ASoC: imx-card: Add WM8524 support
ASoC: fsl_xcvr: get channel status data with firmware exists
ASoC: fsl_xcvr: get channel status data when PHY is not exists
ASoC: SDCA: Add support for -cn- value properties
...
Linus Torvalds [Fri, 1 Aug 2025 17:29:36 +0000 (10:29 -0700)]
Merge tag 'trace-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
- Deprecate auto-mounting tracefs to /sys/kernel/debug/tracing
When tracefs was first introduced back in 2014, the directory
/sys/kernel/tracing was added and is the designated location to mount
tracefs. To keep backward compatibility, tracefs was auto-mounted in
/sys/kernel/debug/tracing as well.
All distros now mount tracefs on /sys/kernel/tracing. Having it seen
in two different locations has lead to various issues and
inconsistencies.
The VFS folks have to also maintain debugfs_create_automount() for
this single user.
It's been over 10 years. Tooling and scripts should start replacing
the debugfs location with the tracefs one. The reason tracefs was
created in the first place was to allow access to the tracing
facilities without the need to configure debugfs into the kernel.
Using tracefs should now be more robust.
A new config is created: CONFIG_TRACEFS_AUTOMOUNT_DEPRECATED which is
default y, so that the kernel is still built with the automount. This
config allows those that want to remove the automount from debugfs to
do so.
When tracefs is accessed from /sys/kernel/debug/tracing, the
following printk is triggerd:
pr_warn("NOTICE: Automounting of tracing to debugfs is deprecated and will be removed in 2030\n");
This gives users another 5 years to fix their scripts.
- Use queue_rcu_work() instead of call_rcu() for freeing event filters
The number of filters to be free can be many depending on the number
of events within an event system. Freeing them from softirq context
can potentially cause undesired latency. Use the RCU workqueue to
free them instead.
- Remove pointless memory barriers in latency code
Memory barriers were added to some of the latency code a long time
ago with the idea of "making them visible", but that's not what
memory barriers are for. They are to synchronize access between
different variables. There was no synchronization here making them
pointless.
- Remove "__attribute__()" from the type field of event format
When LLVM is used to compile the kernel with CONFIG_DEBUG_INFO_BTF=y
and PAHOLE_HAS_BTF_TAG=y, some of the format fields get expanded with
the following:
This confuses parsers. Add code to strip these tags from the strings.
- Add eprobe config option CONFIG_EPROBE_EVENTS
Eprobes were added back in 5.15 but were only enabled when another
probe was enabled (kprobe, fprobe, uprobe, etc). The eprobes had no
config option of their own. Add one as they should be a separate
entity.
It's default y to keep with the old kernels but still has
dependencies on TRACING and HAVE_REGS_AND_STACK_ACCESS_API.
- Add eprobe documentation
When eprobes were added back in 5.15 no documentation was added to
describe them. This needs to be rectified.
- Replace open coded cpumask_next_wrap() in move_to_next_cpu()
- Have preemptirq_delay_run() use off-stack CPU mask
- Remove obsolete comment about pelt_cfs event
DECLARE_TRACE() appends "_tp" to trace events now, but the comment
above pelt_cfs still mentioned appending it manually.
- Remove EVENT_FILE_FL_SOFT_MODE flag
The SOFT_MODE flag was required when the soft enabling and disabling
of trace events was first introduced. But there was a bug with this
approach as it only worked for a single instance. When multiple users
required soft disabling and disabling the code was changed to have a
ref count. The SOFT_MODE flag is now set iff the ref count is non
zero. This is redundant and just reading the ref count is good
enough.
- Fix typo in comment
* tag 'trace-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
Documentation: tracing: Add documentation about eprobes
tracing: Have eprobes have their own config option
tracing: Remove "__attribute__()" from the type field of event format
tracing: Deprecate auto-mounting tracefs in debugfs
tracing: Fix comment in trace_module_remove_events()
tracing: Remove EVENT_FILE_FL_SOFT_MODE flag
tracing: Remove pointless memory barriers
tracing/sched: Remove obsolete comment on suffixes
kernel: trace: preemptirq_delay_test: use offstack cpu mask
tracing: Use queue_rcu_work() to free filters
tracing: Replace opencoded cpumask_next_wrap() in move_to_next_cpu()
Linus Torvalds [Fri, 1 Aug 2025 17:23:13 +0000 (10:23 -0700)]
Merge tag 'trace-tools-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing tools updates from Steven Rostedt:
- Introduce enum timerlat_tracing_mode
Now that BPF based sampling has been added to timerlat, add an enum
to represent which mode timerlat is running in
- Add action on timelat threshold feature
A new option, --on-threshold, is added, taking an argument that
further specifies the action. Actions added in this patch are:
- trace[,file=<filename>]: Saves tracefs buffer, optionally taking a
filename
- signal,num=<sig>,pid=<pid>: Sends signal to process. "parent" might
be specified instead of number to send signal to parent process
- shell,command=<command>: Execute shell command
- Allow resuming tracing in timerlat bpf
rtla-timerlat BPF program uses a global variable stored in a .bss
section to store whether tracing has been stopped. Map it to allow it
to resume tracing after it has been stopped
- Add continue action to timerlat
Introduce option to resume tracing after a latency threshold
overflow. The option is implemented as an action named "continue"
- Add action on end feature to timerlat
Implement actions on end next to actions on threshold. A new option,
--on-end is added, parallel to --on-threshold. Instead of being
executed whenever a latency threshold is reached, it is executed at
the end of the measurement
- Have rtla tests check output with grep
Add argument to the check command in the test suite that takes a
regular expression that the output of rtla command is checked
against. This allows testing for specific information in rtla output
in addition to checking the return value
- Add tests for timerlat actions
- Update the documentation for the new features
* tag 'trace-tools-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rtla/tests: Test timerlat -P option using actions
rtla/tests: Add grep checks for base test cases
Documentation/rtla: Add actions feature
rtla/tests: Limit duration to maximum of 10s
rtla/tests: Add tests for actions
rtla/tests: Check rtla output with grep
rtla/timerlat: Add action on end feature
rtla/timerlat: Add continue action
rtla/timerlat_bpf: Allow resuming tracing
rtla/timerlat: Add action on threshold feature
rtla/timerlat: Introduce enum timerlat_tracing_mode
Linus Torvalds [Fri, 1 Aug 2025 16:46:24 +0000 (09:46 -0700)]
Merge tag 'trace-deferred-unwind-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull initial deferred unwind infrastructure from Steven Rostedt:
"This is the core infrastructure for the deferred unwinder that is
required for sframes[1]. Several other patch series are based on this
work although those patch series are not dependent on each other. In
order to simplify the development, having this core series upstream
will allow the other series to be worked on in parallel. The other
series are:
- The two patches to implement x86 support [2] [3]
- The s390 work [4]
- The perf work [5]
- The ftrace work [6]
- The sframe work [7]
And more is on the way.
The core infrastructure adds the following in kernel APIs:
- int unwind_user_faultable(struct unwind_stacktrace *trace);
Performs a user space stack trace that may fault user pages in.
- int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
Allows a tracer to register with the unwind deferred
infrastructure.
- int unwind_deferred_request(struct unwind_work *work, u64 *cookie);
Used when a tracer request a deferred trace. Can be called from
interrupt or NMI context.
Paul Chaignon [Fri, 1 Aug 2025 09:49:15 +0000 (11:49 +0200)]
bpf: Improve ctx access verifier error message
We've already had two "error during ctx access conversion" warnings
triggered by syzkaller. Let's improve the error message by dumping the
cnt variable so that we can more easily differentiate between the
different error cases.
Paul Chaignon [Fri, 1 Aug 2025 09:48:15 +0000 (11:48 +0200)]
bpf: Check netfilter ctx accesses are aligned
Similarly to the previous patch fixing the flow_dissector ctx accesses,
nf_is_valid_access also doesn't check that ctx accesses are aligned.
Contrary to flow_dissector programs, netfilter programs don't have
context conversion. The unaligned ctx accesses are therefore allowed by
the verifier.
Paul Chaignon [Fri, 1 Aug 2025 09:47:23 +0000 (11:47 +0200)]
bpf: Check flow_dissector ctx accesses are aligned
flow_dissector_is_valid_access doesn't check that the context access is
aligned. As a consequence, an unaligned access within one of the exposed
field is considered valid and later rejected by
flow_dissector_convert_ctx_access when we try to convert it.
The later rejection is problematic because it's reported as a verifier
bug with a kernel warning and doesn't point to the right instruction in
verifier logs.
Fixes: d58e468b1112 ("flow_dissector: implements flow dissector BPF hook") Reported-by: syzbot+ccac90e482b2a81d74aa@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=ccac90e482b2a81d74aa Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/cc1b036be484c99be45eddf48bd78cc6f72839b1.1754039605.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Simon Trimmer [Thu, 31 Jul 2025 16:01:09 +0000 (16:01 +0000)]
spi: cs42l43: Property entry should be a null-terminated array
The software node does not specify a count of property entries, so the
array must be null-terminated.
When unterminated, this can lead to a fault in the downstream cs35l56
amplifier driver, because the node parse walks off the end of the
array into unknown memory.
Fixes: 0ca645ab5b15 ("spi: cs42l43: Add speaker id support to the bridge configuration") Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220371 Signed-off-by: Simon Trimmer <simont@opensource.cirrus.com> Link: https://patch.msgid.link/20250731160109.1547131-1-simont@opensource.cirrus.com Signed-off-by: Mark Brown <broonie@kernel.org>
Peter Jakubek [Thu, 31 Jul 2025 17:21:04 +0000 (18:21 +0100)]
ASoC: Intel: sof_sdw: Add quirk for Alienware Area 51 (2025) 0CCC SKU
Add DMI quirk entry for Alienware systems with SKU "0CCC" to enable
proper speaker codec configuration (SOC_SDW_CODEC_SPKR).
This system requires the same audio configuration as some existing Dell systems.
Without this patch, the laptop's speakers and microphone will not work.
Will Deacon [Thu, 17 Jul 2025 09:01:16 +0000 (10:01 +0100)]
vsock/virtio: Allocate nonlinear SKBs for handling large transmit buffers
When transmitting a vsock packet, virtio_transport_send_pkt_info() calls
virtio_transport_alloc_linear_skb() to allocate and fill SKBs with the
transmit data. Unfortunately, these are always linear allocations and
can therefore result in significant pressure on kmalloc() considering
that the maximum packet size (VIRTIO_VSOCK_MAX_PKT_BUF_SIZE +
VIRTIO_VSOCK_SKB_HEADROOM) is a little over 64KiB, resulting in a 128KiB
allocation for each packet.
Rework the vsock SKB allocation so that, for sizes with page order
greater than PAGE_ALLOC_COSTLY_ORDER, a nonlinear SKB is allocated
instead with the packet header in the SKB and the transmit data in the
fragments. Note that this affects both the vhost and virtio transports.
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-10-will@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Will Deacon [Thu, 17 Jul 2025 09:01:15 +0000 (10:01 +0100)]
vsock/virtio: Rename virtio_vsock_skb_rx_put()
In preparation for using virtio_vsock_skb_rx_put() when populating SKBs
on the vsock TX path, rename virtio_vsock_skb_rx_put() to
virtio_vsock_skb_put().
No functional change.
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-9-will@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Will Deacon [Thu, 17 Jul 2025 09:01:14 +0000 (10:01 +0100)]
vhost/vsock: Allocate nonlinear SKBs for handling large receive buffers
When receiving a packet from a guest, vhost_vsock_handle_tx_kick()
calls vhost_vsock_alloc_linear_skb() to allocate and fill an SKB with
the receive data. Unfortunately, these are always linear allocations and
can therefore result in significant pressure on kmalloc() considering
that the maximum packet size (VIRTIO_VSOCK_MAX_PKT_BUF_SIZE +
VIRTIO_VSOCK_SKB_HEADROOM) is a little over 64KiB, resulting in a 128KiB
allocation for each packet.
Rework the vsock SKB allocation so that, for sizes with page order
greater than PAGE_ALLOC_COSTLY_ORDER, a nonlinear SKB is allocated
instead with the packet header in the SKB and the receive data in the
fragments. Finally, add a debug warning if virtio_vsock_skb_rx_put() is
ever called on an SKB with a non-zero length, as this would be
destructive for the nonlinear case.
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-8-will@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Will Deacon [Thu, 17 Jul 2025 09:01:13 +0000 (10:01 +0100)]
vsock/virtio: Move SKB allocation lower-bound check to callers
virtio_vsock_alloc_linear_skb() checks that the requested size is at
least big enough for the packet header (VIRTIO_VSOCK_SKB_HEADROOM).
Of the three callers of virtio_vsock_alloc_linear_skb(), only
vhost_vsock_alloc_skb() can potentially pass a packet smaller than the
header size and, as it already has a check against the maximum packet
size, extend its bounds checking to consider the minimum packet size
and remove the check from virtio_vsock_alloc_linear_skb().
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-7-will@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Will Deacon [Thu, 17 Jul 2025 09:01:12 +0000 (10:01 +0100)]
vsock/virtio: Rename virtio_vsock_alloc_skb()
In preparation for nonlinear allocations for large SKBs, rename
virtio_vsock_alloc_skb() to virtio_vsock_alloc_linear_skb() to indicate
that it returns linear SKBs unconditionally and switch all callers over
to this new interface for now.
No functional change.
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Will Deacon <will@kernel.org>
Message-Id: <20250717090116.11987-6-will@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>