mm/mglru: use folio_mark_accessed to replace folio_set_active
MGLRU gives high priority to folios mapped in page tables. As a result,
folio_set_active() is invoked for all folios read during page faults. In
practice, however, readahead can bring in many folios that are never
accessed via page tables.
A previous attempt by Lei Liu proposed introducing a separate LRU for
readahead[1] to make readahead pages easier to reclaim, but that approach
is likely over-engineered.
Before commit 4d5d14a01e2c ("mm/mglru: rework workingset protection"),
folios with PG_active were always placed in the youngest generation,
leading to over-protection and increased refaults. After that commit,
PG_active folios are placed in the second youngest generation, which is
still too optimistic given the presence of readahead. In contrast, the
classic active/inactive scheme is more conservative.
This patch switches to folio_mark_accessed() and starts prefaulted file
folios in the second oldest generation instead of the active
generations.
We should also adjust the following accordingly:
- WORKINGSET_ACTIVATE: aligned with setting active for refaulted workingset
folios;
- lru_gen_folio_seq(): place (pre)faulted file folios into the second
oldest generation;
- promote second-scanned folios to workingset in
folio_check_references(): we now have to depend on
folio_lru_refs() > 1, since we previously relied on PG_referenced
being set during the first scan, but PG_referenced is now set
earlier.
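The placement change described above can be pictured with a toy model. This is only a sketch, not kernel code: the enum, the function name toy_initial_seq() and the idea of expressing placement as a plain sequence number are all assumptions made for illustration. Generations are numbered from min_seq (oldest) to max_seq (youngest).

```c
#include <assert.h>

/* Hypothetical policies, named after the three behaviours in the text. */
enum toy_policy {
	TOY_YOUNGEST,		/* before 4d5d14a01e2c: PG_active folios */
	TOY_SECOND_YOUNGEST,	/* after 4d5d14a01e2c: PG_active folios */
	TOY_SECOND_OLDEST,	/* this patch: (pre)faulted file folios */
};

/* Return the generation sequence number a new folio would start in. */
static unsigned long toy_initial_seq(enum toy_policy policy,
				     unsigned long min_seq,
				     unsigned long max_seq)
{
	/* The sketch assumes at least two generations exist. */
	assert(max_seq > min_seq);

	switch (policy) {
	case TOY_YOUNGEST:
		return max_seq;
	case TOY_SECOND_YOUNGEST:
		return max_seq - 1;
	case TOY_SECOND_OLDEST:
		return min_seq + 1;
	}
	return min_seq;
}
```

With four generations (min_seq = 4, max_seq = 7), the three policies start a folio at 7, 6 and 5 respectively, so the new placement leaves far less distance to the eviction end for folios that readahead brought in but nobody touched.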
On x86, running a kernel build inside a memcg with a 1GB memory
limit using 20 threads.
Wang Wensheng [Sun, 24 May 2026 03:10:53 +0000 (11:10 +0800)]
kasan/test: only do kmalloc_double_kzfree for generic mode
kmalloc_double_kzfree() would corrupt kernel memory when the just-freed
memory was allocated by another thread before the second call to
kfree_sensitive() and the new allocation tag happened to match the old
one.
This could not happen in GENERIC mode as it uses quarantine.
Link: https://lore.kernel.org/20260524031053.381776-1-wsw9603@163.com Signed-off-by: Wang Wensheng <wsw9603@163.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Wed, 20 May 2026 15:03:10 +0000 (08:03 -0700)]
mm/damon/core: trace esz at first setup
DAMON traces effective size quota from the second update, only if a change
has been made by the update. Tracing only changed updates was an
intentional decision to avoid unnecessary same value tracing. Always
skipping the first value is just an unintended mistake.
The mistake makes the tracepoint based investigation incomplete, because
the first effective size quota is never traced. It is not a big issue
when the 'consist' quota tuner is used, because it keeps changing the
quota in the usual setup.
However, when the 'temporal' tuner is used, the quota value is not changed
before the goal achievement status is completely changed. For example, if
the DAMOS scheme is started with an under-achieved goal, the quota is set
to the maximum value, and kept the same value until the goal is achieved.
Because DAMON skips the first value, the user cannot know what effective
quota the current scheme is using. Only after the goal is achieved, the
effective quota is changed to zero, and traced.
Unconditionally trace the initial quota value to fix this problem.
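The intended tracing rule can be sketched in userspace as follows. This is a minimal model under assumptions: struct toy_quota, its fields and toy_set_esz() are hypothetical names, and a counter stands in for the tracepoint.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_quota {
	unsigned long esz;	/* last effective size quota */
	bool esz_set;		/* false until the first setup */
	unsigned int nr_traced;	/* stands in for emitted trace events */
};

static void toy_set_esz(struct toy_quota *q, unsigned long new_esz)
{
	assert(q);

	/* Trace unconditionally at first setup, afterwards only on change. */
	if (!q->esz_set || q->esz != new_esz)
		q->nr_traced++;

	q->esz = new_esz;
	q->esz_set = true;
}
```

In this model, a scheme that starts at the maximum quota and keeps it for a long time still produces one event at setup, so the value in use is visible before the goal is achieved.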
Note that the 'temporal' quota tuner was introduced by commit af738a6a00c1
("mm/damon/core: introduce DAMOS_QUOTA_GOAL_TUNER_TEMPORAL"), which was
added to 7.1-rc1. But even with the 'consist' quota tuner, the tracing is
unintentionally incomplete. Hence this commit marks the introduction of
the trace event as the broken commit.
Link: https://lore.kernel.org/20260520150311.80925-1-sj@kernel.org Fixes: a86d695193bf ("mm/damon: add trace event for effective size quota") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.17.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Maksym Shcherba [Thu, 21 May 2026 20:20:20 +0000 (23:20 +0300)]
Docs/admin-guide/mm/damon/usage: clarify current_value of quota goals
The sysfs interface for DAMON quota goals includes a `current_value` file.
This file is not updated by the kernel and only serves to receive user
input.
Clarify in the documentation that the kernel does not update
`current_value`, and that reading it only has meaning when `target_metric`
is set to `user_input`.
While at it, fix missing commas in the goal files list.
Maksym Shcherba [Thu, 21 May 2026 20:20:19 +0000 (23:20 +0300)]
mm/damon: fix missing parens in macro arguments
Patch series "mm/damon: fix macro arguments and clarify quota goals doc",
v2.
This patch (of 2):
The DAMON iterator macros do not wrap their pointer arguments with
parentheses. This can cause build failures when the argument is a complex
expression due to operator precedence issues.
Add missing parentheses around the arguments in the following macros
to prevent potential build failures:
- damon_for_each_region()
- damon_for_each_region_from()
- damon_for_each_region_safe()
- damos_for_each_quota_goal()
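The precedence problem can be reproduced with a small userspace macro. The sketch below does not use the DAMON macros themselves; struct toy_target and toy_for_each_region() are hypothetical stand-ins that only mirror the shape of an iterator macro taking a pointer argument.

```c
#include <assert.h>
#include <stddef.h>

struct toy_target {
	int regions[4];
	size_t nr_regions;
};

/*
 * Unparenthesized form, kept in a comment because it fails to build for
 * complex arguments:
 *
 *   #define toy_for_each_region_bad(r, t) \
 *	for (r = t->regions; r < t->regions + t->nr_regions; r++)
 *
 * toy_for_each_region_bad(r, targets + 1) expands to
 * "targets + 1->regions", where "->" binds to the literal 1.
 */
#define toy_for_each_region(r, t) \
	for ((r) = (t)->regions; (r) < (t)->regions + (t)->nr_regions; (r)++)

/* Sum the regions of targets[idx], passing an expression as the argument. */
static int toy_sum_regions(struct toy_target *targets, size_t idx)
{
	int *r, sum = 0;

	assert(targets);
	toy_for_each_region(r, targets + idx)
		sum += *r;
	return sum;
}
```

Callers that pass a plain variable never notice the difference, which is presumably why the missing parentheses went unnoticed; the failure only shows up once an expression such as `targets + idx` is used.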
SeongJae Park [Fri, 22 May 2026 15:40:22 +0000 (08:40 -0700)]
selftests/damon/sysfs.py: stop kdamonds before failing
When an assertion fails, the sysfs.py DAMON selftest immediately exits the
test program, leaving DAMON running behind. Many of the following
tests need to start DAMON on their own. But because the DAMON instance
started by sysfs.py is still running, those start attempts fail, and the
tests fail or are skipped. Update sysfs.py to stop DAMON before exiting
the test program due to the assertion failure.
Link: https://lore.kernel.org/20260522154026.80546-12-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Fri, 22 May 2026 15:40:21 +0000 (08:40 -0700)]
mm/damon/tests/core-kunit: add damon_set_regions() test cases
damon_set_regions() is one of the main DAMON kernel API functions that set
up the monitoring target memory region boundaries. Implement unit tests
for verifying its basic functionalities.
Link: https://lore.kernel.org/20260522154026.80546-11-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Fri, 22 May 2026 15:40:20 +0000 (08:40 -0700)]
mm/damon/core: remove damon_verify_nr_regions()
When CONFIG_DAMON_DEBUG_SANITY is enabled, damon_verify_nr_regions() is
called for each damon_nr_regions() invocation. damon_verify_nr_regions()
iterates all regions. damon_nr_regions() is called for each region in
kdamond_reset_aggregated() and damos_apply_scheme(). Hence it imposes
O(n**2) overhead where n is the number of regions.
Though the verification is enabled only under DAMON_DEBUG_SANITY, which is
not for production use cases, it could be too high overhead. Meanwhile,
damon_verify_ctx() is doing the damon_nr_regions() test. Because
damon_verify_ctx() is called for each kdamond_call(), the test coverage
from damon_verify_ctx() could be sufficient. Remove damon_nr_regions()
verification.
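The quadratic cost can be made concrete with a counting model. This is a sketch only: the function names are hypothetical and a simple counter replaces the real region list walk.

```c
#include <assert.h>
#include <stddef.h>

static unsigned long toy_visits;	/* region visits across all calls */

/* O(n) verification: walks all n regions once. */
static size_t toy_verify_nr_regions(size_t n)
{
	size_t i, counted = 0;

	for (i = 0; i < n; i++) {
		toy_visits++;
		counted++;
	}
	return counted;
}

/* A per-region loop that re-verifies the count on every iteration. */
static unsigned long toy_per_region_loop(size_t n)
{
	size_t i;

	toy_visits = 0;
	for (i = 0; i < n; i++) {
		size_t counted = toy_verify_nr_regions(n);

		assert(counted == n);
	}
	return toy_visits;
}
```

In this model 10 regions cost 100 visits and 1000 regions cost 1000000 visits per pass, which illustrates why a single check per kdamond_call() is the cheaper place for the same test.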
Link: https://lore.kernel.org/20260522154026.80546-10-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kdamond_call() is the place where DAMON API callers are allowed to access
the DAMON context's public internal state including the monitoring
results. Hence it is important to ensure it is called with the expected
DAMON context state. Do the check under DAMON_DEBUG_SANITY.
Link: https://lore.kernel.org/20260522154026.80546-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Fri, 22 May 2026 15:40:18 +0000 (08:40 -0700)]
mm/damon/core: hide damon_destroy_region()
damon_destroy_region() is being used by only DAMON core, but exposed to
DAMON API callers. Exposing something that is not really being used by
others will only increase the maintenance cost. Hide it.
Link: https://lore.kernel.org/20260522154026.80546-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Fri, 22 May 2026 15:40:17 +0000 (08:40 -0700)]
mm/damon/core: hide damon_insert_region()
damon_insert_region() is being used by only DAMON core, but exposed to
DAMON API callers. Exposing something that is not really being used by
others will only increase the maintenance cost. Hide it.
Link: https://lore.kernel.org/20260522154026.80546-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Fri, 22 May 2026 15:40:16 +0000 (08:40 -0700)]
mm/damon/core: hide damon_add_region()
damon_add_region() is being used by only DAMON core, but exposed to DAMON
API callers. Exposing something that is not really being used by others
will only increase the maintenance cost. Hide it.
Link: https://lore.kernel.org/20260522154026.80546-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Fri, 22 May 2026 15:40:15 +0000 (08:40 -0700)]
mm/damon/tests/vaddr-kunit: replace damon_add_region() with damon_set_regions()
The DAMON virtual address operation set (vaddr) unit tests are using
damon_add_region() to set up DAMON monitoring target region boundaries.
But damon_set_regions() is designed for exactly that purpose. All
other DAMON API callers use the function for the purpose. Replace
damon_add_region() usage in the unit tests with damon_set_regions(), for
unifying the use case and reducing the maintenance cost.
Link: https://lore.kernel.org/20260522154026.80546-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Fri, 22 May 2026 15:40:14 +0000 (08:40 -0700)]
samples/damon/mtier: replace damon_add_region() with damon_set_regions()
The mtier DAMON sample module and DAMON virtual address operation set
(vaddr) unit tests are using damon_add_region() to set up DAMON
monitoring target region boundaries. But damon_set_regions() is designed
for exactly that purpose. All other DAMON API callers use the function for the
purpose. Replace damon_add_region() usage in mtier sample module with
damon_set_regions(), for unifying the use case and reducing the
maintenance cost.
Link: https://lore.kernel.org/20260522154026.80546-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Fri, 22 May 2026 15:40:13 +0000 (08:40 -0700)]
mm/damon/core: do not use region out of a loop in damon_set_regions()
damon_set_regions() assumes the DAMON region iterator is referencing the
last region after the region iteration loop is completed. The code is
indeed implemented in that way, but that is not a documented safe behavior.
Hence it is unreliable and difficult to read. Clean up the code to avoid
the case.
No behavioral change is intended.
Link: https://lore.kernel.org/20260522154026.80546-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Fri, 22 May 2026 15:40:12 +0000 (08:40 -0700)]
mm/damon/core: safely handle no region case in damon_set_regions()
Patch series "mm/damon: minor improvements for code readability and tests".
Implement minor improvements on code readability and tests for DAMON.
The first seven patches are for DAMON code readability and resulting
maintenance. Patches 1 and 2 make damon_set_regions() safer and easier to
read. Patches 3 and 4 remove fragmented DAMON API use cases. Patches 5-7
hide unused core functions that are unnecessarily exposed to API callers.
The following seven patches are for DAMON tests improvement. Patches 8
and 9 add and remove DAMON_DEBUG_SANITY verifications to ensure
reasonable test coverage without too high overhead. Patch 10 adds a new
kunit test for damon_set_regions(). Patch 11 makes the sysfs.py selftest
finish more gracefully under test failures. Patches 12-13 add simple
sysfs.sh test cases for the monitoring intervals goal directory, the
addr_unit file and the pause file.
This patch (of 14):
damon_set_regions() calls damon_first_region() regardless of the number of
DAMON regions in a given DAMON target. damon_first_region() internally
uses list_first_entry(), which clearly documents the list is expected to
be not empty. Due to the internal implementation of the macro,
damon_set_regions() is safe for now. But the internal implementation of
the macro can be changed in future. Refactor the function to explicitly
and safely handle the empty region list case without depending on the
internal implementation.
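The hazard and the explicit empty-list handling can be sketched with a minimal userspace list. This is an assumption-laden illustration: the toy_ types and helpers below only imitate the shape of the kernel list API, and whether the actual patch uses a helper of this form is not stated here. Taking the "first entry" of an empty circular list yields a pointer computed from the list head itself, so the sketch checks for emptiness first.

```c
#include <assert.h>
#include <stddef.h>

struct toy_list_head {
	struct toy_list_head *next, *prev;
};

#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_region {
	unsigned long start, end;
	struct toy_list_head list;
};

static void toy_list_init(struct toy_list_head *h)
{
	h->next = h->prev = h;
}

static int toy_list_empty(const struct toy_list_head *h)
{
	return h->next == h;
}

static void toy_list_add_tail(struct toy_list_head *node,
			      struct toy_list_head *h)
{
	node->prev = h->prev;
	node->next = h;
	h->prev->next = node;
	h->prev = node;
}

/* Return NULL for an empty list instead of a pointer derived from the head. */
static struct toy_region *toy_first_region_or_null(struct toy_list_head *h)
{
	assert(h);
	if (toy_list_empty(h))
		return NULL;
	return toy_container_of(h->next, struct toy_region, list);
}
```

A caller can then branch on the NULL result for the no-region case rather than relying on how the first-entry macro happens to behave on an empty list.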
Rather than providing a hook, simplify things by providing the ability to
override mmap action errors. This allows us to more carefully validate
the value provided and thus ensure only a valid error code is specified,
and simplifies the interface.
This way, we eliminate all hooks but mmap_prepare and allow only mmap
actions to be specified (which core mm controls).
This significantly improves robustness and eliminates any unnecessary code
duplication in driver mmap hooks.
We also update the /dev/mem logic (the only user) to use
mmap_action->error_override instead.
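One way to picture a validated error override is the sketch below. It is a model under assumptions, not the kernel's implementation: the function names are hypothetical, 0 is assumed to mean "no override", and TOY_MAX_ERRNO mirrors the kernel's MAX_ERRNO value of 4095.

```c
#include <assert.h>

#define TOY_MAX_ERRNO 4095	/* same range as the kernel's MAX_ERRNO */

/* A valid override is 0 (unset) or a negative errno in [-4095, -1]. */
static int toy_error_override_valid(int error_override)
{
	return error_override == 0 ||
	       (error_override < 0 && error_override >= -TOY_MAX_ERRNO);
}

/* Map the result of an mmap action to what the caller should see. */
static int toy_complete_action(int err, int error_override)
{
	assert(toy_error_override_valid(error_override));

	/* Only failures are rewritten; success is passed through. */
	if (err && error_override)
		return error_override;
	return err;
}
```

Compared with a callback, a plain field like this can be range-checked up front and cannot run arbitrary driver code on the error path.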
Link: https://lore.kernel.org/55d13f7d016b827c459946d46a56105635be111c.1780397980.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Tue, 2 Jun 2026 11:06:26 +0000 (12:06 +0100)]
mm/vma: remove mmap_action->success_hook
This hook was introduced to work around code that seemed to absolutely
require access to a VMA pointer upon mmap().
However, providing this hook leaves a backdoor to drivers getting access
to the very thing mmap_prepare eliminates - a pointer to the VMA.
Let's solve this contradiction by removing it. The key intended user was
hugetlb, however it seems that the best course now is to avoid allowing
all drivers the ability to work around mmap_prepare, and find a different
solution there.
Link: https://lore.kernel.org/f79434e6d30af6d92999be6b76e197f1847105fa.1780397980.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Tue, 2 Jun 2026 11:06:25 +0000 (12:06 +0100)]
drivers/char/mem: eliminate unnecessary use of success_hook
Patch series "remove mmap_action success, error hooks", v3.
The mmap_action->success_hook was a strange beast added to enable code
which appeared to absolutely require access to a VMA pointer to work
correctly.
Primarily this was for hugetlb, however a different approach will be taken
there, as clearly more work is required to figure out a sensible way of
converting hugetlb to use mmap_prepare.
The other user was the memory char driver, specifically /dev/zero which
has the unusual property of explicitly setting file-backed VMAs anonymous.
Providing the success hook was always foolish, as it allowed drivers a way
to work around the restriction that they should not access a pointer to a
not-yet-correctly-initialised VMA - which defeats the purpose of the
mmap_prepare work.
We can achieve the same thing in memory char driver without needing the
success hook, so this series removes that, then removes the success hook
altogether.
The error hook is also unnecessary - the motivation for this was for
functions which need to override the error code when performing an mmap
action in order to avoid breaking userspace.
We can achieve this by just providing a field for the error code. Doing
this means we don't have to worry about the hook doing anything odd.
We also add a check to ensure the error code is in fact valid.
Again the memory char driver is the only current user of this, so this
series updates it to use that.
After this change mmap_action has no custom hooks at all, which seems
rather more cromulent than before.
This patch (of 3):
/dev/zero, uniquely, marks memory mapped there as anonymous. This is
currently achieved using the mmap_action->success_hook.
However this hook circumvents the abstraction of VMA initialisation so
it's preferable to do things a different way.
To achieve this, this patch firstly defaults the VMA descriptor's vm_ops
field to the dummy VMA operations, which is what file-backed VMAs default
this field to.
That way, we can detect whether a driver sets this field to NULL in order
to mark it anonymous.
We then introduce vma_desc_set_anonymous() to do this explicitly, and
invoke it in mmap_zero_prepare().
This way, any driver which does not explicitly set desc->vm_ops retains
the dummy vm_ops as it did previously.
We also update set_vma_user_defined_fields() to make clear that we are
either setting vma->vm_ops to what is provided by the driver (or
defaulting to dummy_vm_ops if not set), or setting the VMA anonymous.
This lays the groundwork for removing the success hook.
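The defaulting scheme can be modelled in a few lines of userspace C. Every type and function here is a hypothetical stand-in (the toy_ prefix marks them as such); only the idea comes from the text: the descriptor starts with dummy operations, and a NULL vm_ops is what marks the mapping anonymous.

```c
#include <assert.h>
#include <stddef.h>

struct toy_vm_ops {
	int unused;
};

/* Stands in for the dummy VMA operations that file-backed VMAs default to. */
static const struct toy_vm_ops toy_dummy_vm_ops = { 0 };

struct toy_vma_desc {
	const struct toy_vm_ops *vm_ops;
};

struct toy_vma {
	const struct toy_vm_ops *vm_ops;
};

/* The descriptor starts out with the dummy operations. */
static void toy_desc_init(struct toy_vma_desc *desc)
{
	desc->vm_ops = &toy_dummy_vm_ops;
}

/* Explicit request to make the mapping anonymous. */
static void toy_desc_set_anonymous(struct toy_vma_desc *desc)
{
	desc->vm_ops = NULL;
}

/* Copy the driver's choice (or the untouched default) into the VMA. */
static void toy_set_user_defined_fields(struct toy_vma *vma,
					const struct toy_vma_desc *desc)
{
	assert(vma && desc);
	vma->vm_ops = desc->vm_ops;
}

static int toy_vma_is_anonymous(const struct toy_vma *vma)
{
	return vma->vm_ops == NULL;
}
```

Because the default is non-NULL, a NULL value can only come from an explicit request, which is what lets the prepare path express "anonymous" without touching the VMA afterwards.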
Link: https://lore.kernel.org/cover.1780397980.git.ljs@kernel.org Link: https://lore.kernel.org/010579cca6787cf7bb057ab1f7228978b10601c8.1780397980.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wei Yang [Wed, 20 May 2026 02:03:36 +0000 (02:03 +0000)]
selftests/mm/split_huge_page_test.c: close fd on write error
When the write to /proc/sys/vm/drop_caches in
create_pagecache_thp_and_fd() returns an error, the function just does
"goto err_out_unlink", which leaves the fd open.
Use "goto err_out_close" to close the fd.
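The label ordering behind this fix follows the usual goto cleanup ladder. The sketch below is a toy model, not the selftest code: struct toy_state, enum toy_fail and toy_create_thp_and_fd() are hypothetical, and boolean flags stand in for the real file and descriptor.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_state {
	bool file_exists;	/* stands in for the created test file */
	bool fd_open;		/* stands in for the open descriptor */
};

enum toy_fail { TOY_FAIL_NONE, TOY_FAIL_OPEN, TOY_FAIL_WRITE };

static int toy_create_thp_and_fd(struct toy_state *s, enum toy_fail fail)
{
	assert(s);
	s->fd_open = false;
	s->file_exists = true;

	if (fail == TOY_FAIL_OPEN)
		goto err_out_unlink;	/* no fd yet, only the file to remove */
	s->fd_open = true;

	if (fail == TOY_FAIL_WRITE)
		goto err_out_close;	/* fd is open, so close it first */
	return 0;

err_out_close:
	s->fd_open = false;
	/* fall through to remove the file as well */
err_out_unlink:
	s->file_exists = false;
	return -1;
}
```

Each label undoes one more step than the label below it, so a failure after the open must jump to the higher label to release everything acquired so far.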
Link: https://lore.kernel.org/20260520020336.28914-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Lance Yang <lance.yang@linux.dev> Cc: "Liam R. Howlett" <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dmitry Ilvokhin [Wed, 20 May 2026 12:22:28 +0000 (12:22 +0000)]
mm/page_alloc: fix defrag_mode for non-reclaimable allocations
When defrag_mode is enabled, ALLOC_NOFRAGMENT is enforced to prevent
migratetype fallbacks and keep pageblocks clean. The allocator relies on
reclaim and compaction to free pages of the correct type before allowing
fallback as a last resort.
However, non-reclaimable allocations such as GFP_ATOMIC cannot invoke
direct reclaim or compaction. With defrag_mode=1, these allocations hit
the !can_direct_reclaim bailout in __alloc_pages_slowpath() with
ALLOC_NOFRAGMENT still set, and fail without ever attempting a fallback.
This causes a large number of SLUB allocation failures for
skbuff_head_cache under network-heavy workloads, despite free memory being
available in other migratetype freelists.
We observed it on a few of the Meta workloads that adopted
defrag_mode=1.
For the service under load there were 85509 SLUB allocation failure
messages in dmesg within 2 hours. All of them are GFP_ATOMIC
allocations for skbuff_head_cache, despite free pages being available
in other migratetype freelists (~13 GB free).
Since this is a networking path, from a practical point of view this
means dropped packets, failed RPC requests, tail latency spikes and
overall service degradation.
Clear ALLOC_NOFRAGMENT and retry for allocations that request kswapd
reclaim but cannot do direct reclaim themselves (GFP_ATOMIC). Purely
speculative allocations like GFP_TRANSHUGE_LIGHT that don't set
__GFP_KSWAPD_RECLAIM are left to fail, since they have reasonable
fallbacks and should not cause fragmentation.
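The decision described above can be sketched as a small predicate. This is a userspace model under assumptions: the flag values are illustrative rather than the kernel's, and toy_should_retry_with_fallback() is a hypothetical name, not a function in __alloc_pages_slowpath().

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values, not the kernel's. */
#define TOY_GFP_DIRECT_RECLAIM	0x1u
#define TOY_GFP_KSWAPD_RECLAIM	0x2u
#define TOY_ALLOC_NOFRAGMENT	0x1u

/*
 * Return true when the allocation should be retried with migratetype
 * fallbacks allowed; TOY_ALLOC_NOFRAGMENT is cleared in that case.
 */
static bool toy_should_retry_with_fallback(unsigned int gfp,
					   unsigned int *alloc_flags)
{
	assert(alloc_flags);

	if (gfp & TOY_GFP_DIRECT_RECLAIM)
		return false;	/* reclaim and compaction can run instead */
	if (!(*alloc_flags & TOY_ALLOC_NOFRAGMENT))
		return false;	/* fallback was already allowed */
	if (!(gfp & TOY_GFP_KSWAPD_RECLAIM))
		return false;	/* speculative allocation, left to fail */

	*alloc_flags &= ~TOY_ALLOC_NOFRAGMENT;
	return true;
}
```

In this model a GFP_ATOMIC-like request (kswapd reclaim only) gets one retry with fallbacks enabled, while a request with no reclaim flags at all, like GFP_TRANSHUGE_LIGHT in the text, keeps the no-fragmentation constraint and fails.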
Link: https://lore.kernel.org/20260520122228.201550-1-d@ilvokhin.com Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode") Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Linus Torvalds [Thu, 4 Jun 2026 21:35:55 +0000 (14:35 -0700)]
Merge tag 'net-7.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from Netfilter, wireless and Bluetooth.
Current release - fix to a fix:
- Bluetooth: MGMT: fix backward compatibility with bluetoothd
which adds stray bytes to MGMT_OP_ADD_EXT_ADV_DATA
Previous releases - regressions:
- af_unix: fix inq_len update inaccuracy on partial read
- eth: fec: fix pinctrl default state restore order on resume
- wifi: iwlwifi:
- mvm: don't support the reset handshake for old firmwares
- pcie: simplify the resume flow if fast resume is not used,
work around NIC access failures
Previous releases - always broken:
- Bluetooth: L2CAP: reject BR/EDR signaling packets over MTUsig
- sctp: fix a couple of bugs in COOKIE_ECHO processing
- sched: fix pedit partial COW leading to page cache corruption
- wifi: nl80211: reject oversized EMA RNR lists
- netfilter:
- conntrack_irc: fix possible out-of-bounds read
- bridge: make ebt_snat ARP rewrite writable
- appletalk: zero-initialize aarp_entry to prevent heap info leak
- ipv4: restrict IPOPT_SSRR and IPOPT_LSRR options
- mptcp: fix a number of bugs reported by AI scans and discovered
during NVMe over MPTCP testing"
* tag 'net-7.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (85 commits)
Reapply "bnxt_en: bring back rtnl_lock() in the bnxt_open() path"
udp: clear skb->dev before running a sockmap verdict
sctp: purge outqueue on stale COOKIE-ECHO handling
bonding: annotate data-races arcound churn variables
net/802/mrp: fix vector attribute parsing in mrp_pdu_parse_vecattr
rtase: Avoid sleeping in get_stats64()
ieee802154: 6lowpan: only accept IPv6 packets in lowpan_xmit()
ipv6: mcast: Fix use-after-free when processing MLD queries
selftests: net: add vxlan vnifilter notification test
vxlan: vnifilter: fix spurious notification on VNI update
vxlan: vnifilter: send notification on VNI add
rtase: Reset TX subqueue when clearing TX ring
octeontx2-af: npc: Fix CPT channel mask in npc_install_flow
dt-bindings: ethernet: eswin: fix hsp-sp-csr backward compatibility
sctp: validate cached peer INIT chunk length in COOKIE_ECHO processing
net/sched: fix pedit partial COW leading to page cache corruption
vsock/vmci: fix sk_ack_backlog leak on failed handshake
net: bonding: fix NULL pointer dereference in bond_do_ioctl()
geneve: fix length used in GRO hint UDP checksum adjustment
net: ethernet: mtk_eth_soc: Fix use-after-free in metadata dst teardown
...
====================
net: ethtool: make sure __ethtool_get_link_ksettings() is ops-locked
This is prep for the series which will make most of the ethtool ops
run without rtnl_lock. The AI bots surfaced a number of callers of
__ethtool_get_link_ksettings() which need fixing, so I decided to
send that as a smaller prep-series. Each driver changed separately
for ease of review.
Full series unlocking ethtool ops AKA v1::
https://lore.kernel.org/20260528231637.251822-1-kuba@kernel.org
====================
Jakub Kicinski [Wed, 3 Jun 2026 01:28:40 +0000 (18:28 -0700)]
net: ethtool: make sure __ethtool_get_link_ksettings() is ops-locked
All drivers which may call *_get_link_ksettings() on ops-locked
devices from paths already holding the ops lock are ready now.
Make __ethtool_get_link_ksettings() take the ops lock, and assert
that it's held in netif_get_link_ksettings().
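The split between a locking entry point and a lock-asserting helper can be sketched as follows. It is a toy model with hypothetical names: a boolean replaces the real netdev instance lock, and a single speed field replaces the link settings.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_netdev {
	bool ops_locked;	/* toy stand-in for the netdev ops lock */
	int speed;
};

static void toy_lock_ops(struct toy_netdev *dev)
{
	/* Not recursive: taking it twice models the deadlock. */
	assert(!dev->ops_locked);
	dev->ops_locked = true;
}

static void toy_unlock_ops(struct toy_netdev *dev)
{
	assert(dev->ops_locked);
	dev->ops_locked = false;
}

/* For callers that already hold the ops lock, e.g. notifier paths. */
static int toy_netif_get_speed(struct toy_netdev *dev)
{
	assert(dev->ops_locked);
	return dev->speed;
}

/* For callers that do not hold the lock: take it around the query. */
static int toy_ethtool_get_speed(struct toy_netdev *dev)
{
	int speed;

	toy_lock_ops(dev);
	speed = toy_netif_get_speed(dev);
	toy_unlock_ops(dev);
	return speed;
}
```

A caller that already holds the lock and still used the locking variant would trip the assertion in toy_lock_ops(), which is the model's version of the recursion the per-driver prep patches remove.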
Jakub Kicinski [Wed, 3 Jun 2026 01:28:39 +0000 (18:28 -0700)]
scsi: fcoe: don't recurse on the netdev's ops lock
fcoe_link_speed_update() calls __ethtool_get_link_ksettings() on the
lport's netdev, which will soon take the dev's ops lock. Some notifier
callers already arrive with this lock held. Switch to
netif_get_link_ksettings() and adjust the explicit call sites to take
the netdev lock explicitly.
Within fcoe_device_notification() try to only query the link speed
from notifiers which announce link state change (UP / CHANGE),
DOWN / GOING_DOWN notifiers are slightly sketchy when it comes
to ops locking right now, and the code already special-cases
those by maintaining the local link_possible variable.
Also take the lock in bnx2fc_net_config(), even though I think
that bnx2fc call sites are largely irrelevant since it's not
an ops-locked driver.
Jakub Kicinski [Wed, 3 Jun 2026 01:28:38 +0000 (18:28 -0700)]
leds: trigger: netdev: don't recurse on the netdev ops lock
get_device_state() calls __ethtool_get_link_ksettings() on the trigger's
netdev, which will soon take the dev's ops lock. Three of its callers
already hold that lock and one doesn't, so the function would either
deadlock or run unprotected depending on the path.
Make get_device_state() expect the dev's ops lock held and switch to
netif_get_link_ksettings():
* netdev_trig_notify() NETDEV_UP / NETDEV_CHANGE / NETDEV_CHANGENAME
arrive with the dev's ops lock held (per netdevices.rst).
* set_device_name() does not hold the lock, take it explicitly.
Due to lock ordering we need to reshuffle the code in set_device_name()
a little bit. We need to find the device earlier on, so that we can
lock it before we take trigger_data->lock.
Jakub Kicinski [Wed, 3 Jun 2026 01:28:36 +0000 (18:28 -0700)]
net: bridge: don't recurse on the port's netdev ops lock
port_cost() calls __ethtool_get_link_ksettings() on the port device,
which will soon take the port's ops lock. br_port_carrier_check()
is reached via the NETDEV_CHANGE notifier from linkwatch, which
already holds the port's ops lock, so the call would deadlock.
Make port_cost() expect the port's ops lock held and switch to
netif_get_link_ksettings(). The only other caller is new_nbp(),
make sure it takes the lock explicitly.
Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260603012840.2254293-8-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 3 Jun 2026 01:28:35 +0000 (18:28 -0700)]
net: team: don't recurse on the port's netdev ops lock
__team_port_change_send() calls __ethtool_get_link_ksettings() on
the port, which will soon take the port's ops lock. The notifier
caller already holds it while the slave-add/del callers do not,
so the function would either deadlock or run unprotected depending
on the path.
Make __team_port_change_send() expect the port's ops lock held and
switch to netif_get_link_ksettings(). team_device_event()'s NETDEV_UP /
NETDEV_CHANGE already arrive with the port's ops lock held.
team_port_add() now takes it explicitly.
Note that NETDEV_DOWN and team_port_del() will pass false as @linkup
so they will not execute netif_get_link_ksettings(). This is fortunate
as NETDEV_DOWN has somewhat mixed locking right now.
Jakub Kicinski [Wed, 3 Jun 2026 01:28:34 +0000 (18:28 -0700)]
net: bonding: don't recurse on the slave's netdev ops lock
bond_update_speed_duplex() calls __ethtool_get_link_ksettings() on
the slave, which will soon take the slave's ops lock. One of its
callers already holds it and the other three don't, so the function
would either deadlock or run unprotected depending on the path.
Make the helper expect the slave's ops lock held and switch to
netif_get_link_ksettings(). Wrap the three call sites that don't
already hold it:
* bond_enslave() (rtnl held; core drops the lower's ops lock
around ->ndo_add_slave).
* bond_miimon_commit() (rtnl_trylock'd from the mii workqueue).
* bond_ethtool_get_link_ksettings() (rtnl held via ethtool layer,
bond device itself is not ops locked).
The call site which does already hold the ops lock is
bond_slave_netdev_event() via NETDEV_UP / NETDEV_CHANGE notifiers,
so it stays as-is.
Jakub Kicinski [Wed, 3 Jun 2026 01:28:33 +0000 (18:28 -0700)]
net: ethtool: add netif_get_link_ksettings() for correct ops-locked use
__ethtool_get_link_ksettings() is exported and called from sysfs
and many drivers. It invokes ethtool_ops->get_link_ksettings
so by our own docs it should be holding netdev lock for ops locked
devices. Looks like commit 2bcf4772e45a ("net: ethtool:
try to protect all callback with netdev instance lock")
missed adding the ops lock here.
There's a number of callers we need to fix up so let's add the
netif_get_link_ksettings() helper first, without any actual
locking changes (this commit is a nop).
Not treating this as a fix because I don't think any driver cares
at this point, but if we want to remove the rtnl_lock protection
this will become critical.
Jakub Kicinski [Wed, 3 Jun 2026 01:28:32 +0000 (18:28 -0700)]
net: document NETDEV_CHANGENAME as ops locked
NETDEV_CHANGENAME is only emitted from netif_change_name().
netif_change_name() has two callers both of which hold netdev_lock_ops()
around the call site:
- dev_change_name()
- do_setlink()
Jakub Kicinski [Wed, 3 Jun 2026 01:28:31 +0000 (18:28 -0700)]
net: ethtool: cmis_cdb: hold instance lock for ops locked devices
FW module flashing was written so that the flashing happens
without holding rtnl_lock. This allows flashing multiple modules
at once. Current drivers can handle that well, but we should
let drivers depend on the netdev instance lock. Instance lock
is per netdev, and so is the module so we won't break parallel
updates.
Linus Torvalds [Thu, 4 Jun 2026 20:38:42 +0000 (13:38 -0700)]
Merge tag 'trace-v7.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fix from Steven Rostedt:
- Fix CFI violation in probestub function
The probestub is a function to allow tprobes to hook to a tracepoint
to gain access to its parameters.
The function itself is only referenced by the tracepoint structure
which lives in the __tracepoint section. objtool explicitly ignores
that section and when processing functions in the kernel, if it
detects one that has no references it will seal it to have its ENDBR
stripped on boot up.
This means the probestub function will have its ENDBR stripped and if
a tprobe is attached to it with IBT enabled, it will go *boom*.
* tag 'trace-v7.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix CFI violation in probestub being called by tprobes
Ashutosh Desai [Sat, 2 May 2026 16:00:57 +0000 (16:00 +0000)]
rust: sync: add #[must_use] to GlobalGuard and GlobalLock::try_lock
Guard is marked #[must_use] since dropping it releases the lock. GlobalGuard
wraps Guard with identical semantics but was missing the annotation, so
discarding it would silently compile without warning.
Similarly, GlobalLock::try_lock was missing #[must_use]. Option<T> does not
propagate #[must_use] from T, so the attribute needs to be on the function
directly - same reason Lock::try_lock has it.
Michel Dänzer [Mon, 18 May 2026 15:48:09 +0000 (17:48 +0200)]
drm/amd/display: Consult MCCS FreeSync cap only if requested & supported
When the do_mccs parameter is false, we don't call
dm_helpers_read_mccs_caps, so sink->mccs_caps.freesync_supported is
unlikely to be true.
Fixes: 6f71d5dd3206 ("drm/amd/display: Read sink freesync support via mccs")
Bug: https://gitlab.freedesktop.org/drm/amd/-/work_items/5286 Signed-off-by: Michel Dänzer <mdaenzer@redhat.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 115bf5ca318e18a3dc1888ec6271c7052774952a)
Yongqiang Sun [Tue, 2 Jun 2026 13:59:44 +0000 (09:59 -0400)]
drm/amdkfd: Unwind debug trap enable on copy_to_user failure
If kfd_dbg_trap_enable() fails while copying runtime_info to userspace,
it had already activated the trap, set debug_trap_enabled, taken an extra
process reference, and opened the debug event file. It then returned -EFAULT without
unwinding that state, leaving inconsistent trap state and a refcount
imbalance that could break later DISABLE/ENABLE.
On copy_to_user failure, deactivate the trap and undo the rest of the
enable setup before returning.
Signed-off-by: Yongqiang Sun <Yongqiang.Sun@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 01112e241e37f9ac98b6f418d93ce2e0b87b7ee0)
Sunday Clement [Tue, 19 May 2026 14:02:30 +0000 (10:02 -0400)]
drm/amdkfd: Add bounds check for AMDKFD_IOC_WAIT_EVENTS
The kfd_wait_on_events ioctl passes a user-supplied num_events parameter
directly to alloc_event_waiters() which calls kcalloc() without validation.
This allows unprivileged users with /dev/kfd access to trigger large kernel
memory allocations, potentially causing memory exhaustion and denial of
service via the OOM killer.
Add a check to reject num_events values exceeding KFD_SIGNAL_EVENT_LIMIT
(4096), which is the maximum number of events a single process can create.
Signed-off-by: Sunday Clement <Sunday.Clement@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 39eb6da7acee8d0cc12a8959235b590f295d7b4c)
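The bounds check described above can be sketched in userspace C. This is only an illustration under stated assumptions: EVENT_LIMIT, struct waiter and alloc_waiters() are hypothetical stand-ins, calloc() replaces kcalloc(), and the -EINVAL return code is assumed rather than taken from the patch.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the 4096 limit quoted in the commit message. */
#define EVENT_LIMIT 4096u

/* Hypothetical waiter record; the real structure is not shown in the log. */
struct waiter {
	uint64_t event_id;
	int activated;
};

/*
 * Validate the caller-supplied count before sizing the allocation.
 * Zero is rejected here only to keep calloc() behaviour portable;
 * that is an assumption of this sketch, not something the commit states.
 */
static int alloc_waiters(uint32_t num_events, struct waiter **out)
{
	*out = NULL;
	if (num_events == 0 || num_events > EVENT_LIMIT)
		return -EINVAL;

	*out = calloc(num_events, sizeof(**out));
	if (!*out)
		return -ENOMEM;
	return 0;
}
```

The point of the sketch is the ordering: the untrusted count is compared against a fixed ceiling before it ever reaches the allocator.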
David Rosca [Sat, 13 Sep 2025 14:51:02 +0000 (16:51 +0200)]
drm/amdgpu/userq: Fix reading timeline points in wait ioctl
Use correct u64 type.
Signed-off-by: David Rosca <david.rosca@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 0ac98160dfb6ab3c6d7b38e0ff9687780beed9cb)
Yongqiang Sun [Wed, 27 May 2026 13:50:47 +0000 (09:50 -0400)]
drm/amdkfd: fix SMI event cross-process information leak
kfd_smi_ev_enabled() skips the suser privilege check when pid=0.
PROCESS_START, PROCESS_END, and VMFAULT events are emitted with
pid=0 while carrying another process's PID and command name, so any
/dev/kfd user in the render group can monitor all GPU workloads.
Pass the target process PID into kfd_smi_event_add() for these events
so the existing per-client filter restricts delivery to the owning
process or CAP_SYS_ADMIN subscribers.
Signed-off-by: Yongqiang Sun <Yongqiang.Sun@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 92a8dba246d371fe268280e5fd74b0955688e6df)
Wandun Chen [Thu, 4 Jun 2026 01:53:32 +0000 (09:53 +0800)]
of: reserved_mem: avoid post-init UAF when alloc_reserved_mem_array() fails
The global pointer 'reserved_mem' continues to reference the
reserved_mem_array which lives in __initdata if
alloc_reserved_mem_array() fails. of_reserved_mem_lookup() is
exported for post-init use, so it would dereference freed memory
and trigger a use-after-free.
So reset reserved_mem_count to 0 when alloc_reserved_mem_array()
fails.
Alex Deucher [Wed, 6 May 2026 20:50:42 +0000 (16:50 -0400)]
drm/amdkfd: always resume_all after suspend_all
Need to restore any good queues even if the suspend_all
failed for some. Always run remove_queue as that will
schedule a GPU reset if removing the queue fails.
v2: move resume_all after remove
Fixes: eb067d65c33e ("drm/amdkfd: Update BadOpcode Interrupt handling with MES") Reviewed-by: Amber Lin <Amber.Lin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Yunxiang Li [Wed, 27 May 2026 22:05:37 +0000 (18:05 -0400)]
drm/amdgpu/gfx: move fault and EOP IRQ get/put to hw_init/hw_fini
priv_reg / priv_inst / bad_op and (on v11+) userq EOP IRQs are
acquired in late_init but released in hw_fini. This split forced
gfx_v9_0_hw_fini() to defensively guard each put with
amdgpu_irq_enabled() because hw_fini runs on paths that may not
reach late_init.
amdgpu_ip_block_hw_fini() only runs after hw_init returns success,
and suspend / resume cycle the refs through the same path, so
hw_init / hw_fini pair without any extra tracking. Move the gets
there and drop the guards.
While here, fix the pre-existing partial-failure leak in
set_userq_eop_interrupts() (gfx11 / 12_0 / 12_1). amdgpu_irq_get()
increments the refcount before calling .set, so a failure partway
through the loop leaves earlier successful gets stranded. Track
the loop position and roll back on the enable path.
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
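The "track the loop position and roll back" pattern mentioned above might look roughly like the following userspace sketch. Everything here is hypothetical (struct irq_src, src_get(), src_put(), get_all()); it assumes a failed get leaves no reference of its own, which the commit message does not spell out.

```c
#include <stddef.h>

/* Hypothetical refcounted source standing in for an IRQ source. */
struct irq_src {
	int refcount;
	int fail_on_get;	/* test knob: make src_get() fail */
};

/* Take one reference; assumed to leave the count untouched on failure. */
static int src_get(struct irq_src *s)
{
	if (s->fail_on_get)
		return -1;
	s->refcount++;
	return 0;
}

static void src_put(struct irq_src *s)
{
	s->refcount--;
}

/* Take a reference on every source, or on none of them. */
static int get_all(struct irq_src *srcs, size_t n)
{
	size_t i;
	int ret = 0;

	for (i = 0; i < n; i++) {
		ret = src_get(&srcs[i]);
		if (ret)
			break;
	}

	/* On failure, i is the index that failed: drop refs on [0, i). */
	if (ret) {
		while (i-- > 0)
			src_put(&srcs[i]);
	}
	return ret;
}
```

Keeping the loop index alive after the break is what makes the unwind exact: only the gets that actually succeeded are undone.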
Linus Torvalds [Thu, 4 Jun 2026 19:31:20 +0000 (12:31 -0700)]
Merge tag 's390-7.1-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Alexander Gordeev:
- Enable IOMMUFD and VFIO cdev such that PCI pass-through to
QEMU/KVM can optionally utilize native IOMMUFD
- With HAVE_ARCH_BUG_FORMAT enabled the BUG infrastructure might
misinterpret flags or fault. Fix this by moving the "format"
field emission into __BUG_ENTRY()
- The generic version of _THIS_IP_ is known to be brittle and may
break with current and future GCC and Clang optimizations. Fix
it by overriding _THIS_IP_
* tag 's390-7.1-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390: Implement _THIS_IP_ using inline asm
s390/bug: Always emit format word in __BUG_ENTRY
s390/configs: Enable IOMMUFD and VFIO cdev in defconfigs
Michel Dänzer [Mon, 18 May 2026 15:48:09 +0000 (17:48 +0200)]
drm/amd/display: Consult MCCS FreeSync cap only if requested & supported
When the do_mccs parameter is false, we don't call
dm_helpers_read_mccs_caps, so sink->mccs_caps.freesync_supported is
unlikely to be true.
Fixes: 6f71d5dd3206 ("drm/amd/display: Read sink freesync support via mccs")
Bug: https://gitlab.freedesktop.org/drm/amd/-/work_items/5286 Signed-off-by: Michel Dänzer <mdaenzer@redhat.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Lijo Lazar [Tue, 19 May 2026 13:00:03 +0000 (18:30 +0530)]
drm/amd/pm: Use strscpy in profile mode parsing
Use strscpy to copy the buffer which makes it explicit that a valid NUL-
terminated string gets copied. Also, make it explicit that the source
buffer can be copied safely to the temporary buffer by checking against
its size.
Yongqiang Sun [Mon, 1 Jun 2026 19:28:30 +0000 (15:28 -0400)]
drm/amdkfd: Fix infinite loop parsing CRAT with zero subtype length
Malformed ACPI CRAT tables can advertise a zero or undersized subtype
length. The parser then fails to advance the cursor and loops forever
while the remaining image still looks large enough for a generic header.
Validate sub_type_hdr->length on each iteration before parsing or
advancing. Return -EINVAL and warn when length is zero or smaller than
the generic subtype header.
Signed-off-by: Yongqiang Sun <Yongqiang.Sun@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
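A minimal userspace sketch of the per-iteration length validation described above. struct sub_hdr and walk_subtypes() are hypothetical and do not match the real CRAT layout; the extra check that an entry does not run past the image is an addition of this sketch, not something the commit message claims.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical generic subtype header; the real CRAT layout differs. */
struct sub_hdr {
	uint8_t type;
	uint8_t length;	/* total entry length, header included */
};

/* Returns the number of entries walked, or -EINVAL on a malformed length. */
static int walk_subtypes(const uint8_t *image, size_t size)
{
	size_t off = 0;
	int count = 0;

	while (size - off >= sizeof(struct sub_hdr)) {
		struct sub_hdr hdr;

		memcpy(&hdr, image + off, sizeof(hdr));

		/* A zero or undersized length would never advance the cursor. */
		if (hdr.length < sizeof(struct sub_hdr))
			return -EINVAL;
		/* Sketch-only extra check: entry must fit in what remains. */
		if (hdr.length > size - off)
			return -EINVAL;

		off += hdr.length;
		count++;
	}
	return count;
}
```

Because the length is checked before the cursor moves, every iteration either advances by at least one header or returns an error, so the loop is guaranteed to terminate.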
Yongqiang Sun [Mon, 1 Jun 2026 19:48:44 +0000 (15:48 -0400)]
drm/amdkfd: fix sysfs topology prop length on buffer truncation
sysfs_show_gen_prop() accumulated snprintf()'s return value into the
offset. snprintf() reports bytes that would have been written, not
bytes actually written, so a truncated sysfs show could over-report
its length. Use sysfs_emit_at(), which returns only the bytes written.
Signed-off-by: Yongqiang Sun <Yongqiang.Sun@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
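The snprintf() behaviour described above is easy to reproduce in userspace. The sketch below is not the kernel code: emit_naive() and emit_clamped() are hypothetical helpers, and emit_clamped() only imitates the "bytes actually written" contract rather than reimplementing sysfs_emit_at().

```c
#include <stddef.h>
#include <stdio.h>

/* Buggy pattern: returns the would-be length reported by snprintf(). */
static size_t emit_naive(char *buf, size_t size, size_t off,
			 const char *name, unsigned long val)
{
	int n = snprintf(buf + off, size - off, "%s %lu\n", name, val);

	return n < 0 ? 0 : (size_t)n;
}

/* Returns only the bytes that were stored, excluding the trailing NUL. */
static size_t emit_clamped(char *buf, size_t size, size_t off,
			   const char *name, unsigned long val)
{
	int n;

	if (off >= size)
		return 0;

	n = snprintf(buf + off, size - off, "%s %lu\n", name, val);
	if (n < 0)
		return 0;
	/* Output was truncated: only size - off - 1 characters fit. */
	if ((size_t)n >= size - off)
		return size - off - 1;
	return (size_t)n;
}
```

With an 8-byte buffer and a 19-character line, the naive helper reports 19 while only 7 characters were stored, which is the over-reporting the commit fixes.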
Honglei Huang [Fri, 29 May 2026 02:23:17 +0000 (10:23 +0800)]
drm/amdgpu: drop retry loop in amdgpu_hmm_range_get_pages
Since commit c08972f55594 ("drm/amdgpu: fix amdgpu_hmm_range_get_pages")
moved mmu_interval_read_begin() out of the per-chunk loop, the
captured notifier_seq is no longer refreshed across retries. As a
result, the existing -EBUSY retry path can never make progress:
hmm_range_fault() returns -EBUSY only when
mmu_interval_check_retry(notifier, notifier_seq) reports that the
sequence is stale. Once the sequence has advanced, the stored seq
will never match again, so every subsequent call within the same
invocation returns -EBUSY immediately.
The "goto retry" therefore degenerates into a busy spin that simply
burns CPU for the full HMM_RANGE_DEFAULT_TIMEOUT (~1s) window before
finally bailing out with -EAGAIN. This is pure latency with no chance
of recovery, and it actively hurts the KFD userptr stack: the caller
ends up blocked for a second while holding mmap_lock, only to return
-EAGAIN to the restore worker (or to userspace) which would have
re-driven the operation immediately anyway.
Drop the retry/timeout entirely and let -EBUSY propagate straight to
out_free_pfns, where it is already translated to -EAGAIN. Recovery is
handled at a higher level: the KFD restore_userptr_worker reschedules
itself, and the userptr ioctl path returns -EAGAIN to userspace.
No functional regression: the previous behaviour on -EBUSY was already
to fail with -EAGAIN after a 1s stall; we just skip the stall.
Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Honglei Huang <honghuan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Asad Kamal [Wed, 3 Jun 2026 07:11:33 +0000 (15:11 +0800)]
drm/amd/pm: Stop pp_od_clk_voltage emit at PAGE_SIZE
Stop appending OD sections in amdgpu_get_pp_od_clk_voltage()
once the sysfs page is full, instead of checking every sysfs_emit_at()
in SMU helpers. This is purely defensive hardening.
v2: Drop the prior series that checked sysfs_emit_at()
return values in every SMU *_emit_clk_levels() helper and
smu_cmn_print_*(). (Kevin)
v3: Update description, remove all clamping
Signed-off-by: Asad Kamal <asad.kamal@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Yongqiang Sun [Tue, 2 Jun 2026 13:59:44 +0000 (09:59 -0400)]
drm/amdkfd: Unwind debug trap enable on copy_to_user failure
If kfd_dbg_trap_enable() fails while copying runtime_info to userspace,
it had already activated the trap, set debug_trap_enabled, taken an extra
process reference, and opened the debug event file. It then returned -EFAULT without
unwinding that state, leaving inconsistent trap state and a refcount
imbalance that could break later DISABLE/ENABLE.
On copy_to_user failure, deactivate the trap and undo the rest of the
enable setup before returning.
Signed-off-by: Yongqiang Sun <Yongqiang.Sun@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Sunil Khatri [Mon, 1 Jun 2026 14:45:34 +0000 (20:15 +0530)]
drm/amdgpu: validate the mes firmware version for gfx12.1
MES firmware should report the same version whether read from
the register or from the firmware ucode binary. This is not
always the case, so add a log when they mismatch.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Sunil Khatri [Mon, 1 Jun 2026 14:44:50 +0000 (20:14 +0530)]
drm/amdgpu: validate the mes firmware version for gfx12
MES firmware should report the same version whether read from
the register or from the firmware ucode binary. This is not
always the case, so add a log when they mismatch.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Sunil Khatri [Mon, 1 Jun 2026 14:41:17 +0000 (20:11 +0530)]
drm/amdgpu: compare MES firmware version ucode for gfx11
MES firmware should report the same version whether read from
the register or from the firmware ucode binary. This is not
always the case, so add a log when they mismatch.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Sunday Clement [Tue, 19 May 2026 14:02:30 +0000 (10:02 -0400)]
drm/amdkfd: Add bounds check for AMDKFD_IOC_WAIT_EVENTS
The kfd_wait_on_events ioctl passes a user-supplied num_events parameter
directly to alloc_event_waiters() which calls kcalloc() without validation.
This allows unprivileged users with /dev/kfd access to trigger large kernel
memory allocations, potentially causing memory exhaustion and denial of
service via the OOM killer.
Add a check to reject num_events values exceeding KFD_SIGNAL_EVENT_LIMIT
(4096), which is the maximum number of events a single process can create.
Christian König [Wed, 25 Feb 2026 14:12:02 +0000 (15:12 +0100)]
drm/amdgpu: restart the CS if some parts of the VM are still invalidated
Make sure that we only submit work with fully up-to-date VM page tables.
Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Tested-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drm/amd/display: use unsigned types for local pipe and REG_GET counters
Two small type fixes that match how the values are actually consumed:
- decide_zstate_support() iterates from 0 to pipe_count, which is
unsigned. Make the loop index unsigned int.
- hpo_enc401_read_state() reads HDMI_PIXEL_ENCODING and
HDMI_DEEP_COLOR_DEPTH via REG_GET_2(), which internally casts the
output pointer to (uint32_t *). Passing the address of an int is a
strict-aliasing wart even when the sizes match. Declare the locals
as uint32_t.
No behavioural change since the values are only compared against small
non-negative constants.
Signed-off-by: Aurabindo Pillai <aurabindo.pillai@amd.com> Reviewed-by: Alex Hung <alex.hung@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drm/amd/display: widen dc_hdmi_frl_flags.force_frl_rate to unsigned int
dc_hdmi_frl_flags.force_frl_rate mirrors dc_debug_options.force_frl_rate,
which was just widened to unsigned int. Match the type here too so the
assignment in link_hdmi_frl.c does not narrow from unsigned to signed.
All call sites in link_hdmi_frl.c only compare the value against 0, 0xF,
or an hdmi_frl_link_rate enum whose values are non-negative, so the
change is behaviour-preserving and does not introduce sign-compare
warnings.
Signed-off-by: Aurabindo Pillai <aurabindo.pillai@amd.com> Reviewed-by: Alex Hung <alex.hung@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
David Rosca [Sat, 13 Sep 2025 14:51:02 +0000 (16:51 +0200)]
drm/amdgpu/userq: Fix reading timeline points in wait ioctl
Use correct u64 type.
Signed-off-by: David Rosca <david.rosca@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Jeevana Muthyala [Thu, 14 May 2026 10:56:17 +0000 (16:26 +0530)]
drm/amdgpu/vcn5.0.0: enable secure submission on unified ring for VCN 5.3.0
Enable secure submission support on the unified ring for VCN IP version
5.3.0 by setting `secure_submission_supported = true` in
vcn_v5_0_0_unified_ring_vm_funcs.
Secure IB submission is supported on VCN 5.3.0 hardware/firmware,
allowing protected decode workloads to bypass the common IB gate.
Without this, secure playback submissions can be blocked and fail.
Other VCN 5.x variants using the same vcn_v5_0_0_ip_block
(e.g. IP_VERSION(5, 0, 0)) do not support secure submission
on the unified ring and therefore continue using non-secure paths.
This change only advertises existing hardware/firmware capability;
non-secure decode paths remain unaffected.
Signed-off-by: Jeevana Muthyala <Jeevana.Muthyala2@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Christian König [Tue, 5 May 2026 13:40:04 +0000 (15:40 +0200)]
drm/amdgpu: deprecate guilty handling
The guilty handling tried to establish a second way of signaling problems with
the GPU back to userspace. This caused quite a few issues we had to work
around, especially lifetime issues with the drm_sched_entity.
Just drop the handling altogether and use the dma_fence based approach instead.
v2: fix reversed condition in entity check (Alex)
Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Vitaly Prosyak [Wed, 13 May 2026 20:08:30 +0000 (16:08 -0400)]
drm/amdgpu: Add lockdep annotations for lock ordering validation
Add lockdep annotations to teach lockdep the correct lock hierarchy
and catch ordering violations during development. This follows the
pattern established by dma-resv in drivers/dma-buf/dma-resv.c.
The implementation provides:
- Lock ordering training at module init (amdgpu_lockdep_init)
- Lock class association for real driver locks (amdgpu_lockdep_set_class)
Dummy locks are associated with the same class keys as real driver locks
via lockdep_set_class(), ensuring lockdep connects the training ordering
with actual runtime locks.
Testing:
Build the kernel with CONFIG_PROVE_LOCKING=y (enables CONFIG_LOCKDEP):
scripts/config --enable PROVE_LOCKING
scripts/config --enable DEBUG_LOCKDEP
make -j$(nproc)
On boot, dmesg should show:
AMDGPU: Lockdep annotations initialized (9 lock levels)
The companion IGT test (tests/amdgpu/amd_lockdep) exercises lock-heavy
GPU code paths concurrently to trigger lockdep warnings on violations:
sudo ./build/tests/amdgpu/amd_lockdep
sudo dmesg | grep -A 50 "circular locking dependency"
IGT subtests:
concurrent-reset-and-submit - reset_sem vs submission locks
concurrent-mmap-and-evict - mmap_lock vs vram_lock
concurrent-userptr-and-reset - notifier_lock vs reset_sem
stress-all-paths - all of the above simultaneously
A clean dmesg (no "circular locking dependency" or "possible recursive
locking detected" messages) confirms no lock ordering violations.
For CI integration, the test should be run on kernels compiled with
CONFIG_LOCKDEP=y; dmesg is scanned post-run for lockdep splats.
v2: (Christian)
- Move notifier_lock and vram_lock before reset locks in hierarchy.
HMM invalidation holds notifier_lock and can wait for GPU reset
completion, so notifier_lock must be outer to reset_domain->sem.
- Associate dummy locks with lock class keys via lockdep_set_class()
so lockdep connects training with real driver locks.
- Update commit message to list all 9 lock levels.
Requires CONFIG_PROVE_LOCKING=y to activate.
Cc: Christian Konig <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Reviewed-by: Christian Konig <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Yongqiang Sun [Wed, 27 May 2026 13:50:47 +0000 (09:50 -0400)]
drm/amdkfd: fix SMI event cross-process information leak
kfd_smi_ev_enabled() skips the suser privilege check when pid=0.
PROCESS_START, PROCESS_END, and VMFAULT events are emitted with
pid=0 while carrying another process's PID and command name, so any
/dev/kfd user in the render group can monitor all GPU workloads.
Pass the target process PID into kfd_smi_event_add() for these events
so the existing per-client filter restricts delivery to the owning
process or CAP_SYS_ADMIN subscribers.
Signed-off-by: Yongqiang Sun <Yongqiang.Sun@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Matthew Stewart [Thu, 28 May 2026 22:21:54 +0000 (18:21 -0400)]
drm/amd/display: Add DCN42B to dml21_translation_helper
Needed for DML to function with DCN42B.
Signed-off-by: Matthew Stewart <Matthew.Stewart2@amd.com> Reviewed-by: Roman Li <roman.li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Matthew Stewart [Wed, 27 May 2026 14:07:02 +0000 (10:07 -0400)]
drm/amd/display: Fix DCN42B version detection
In resource_parse_asic_id, the check for GC_11_0_4 was unbounded, which
caused it to override the detection of DCN42B.
Signed-off-by: Matthew Stewart <Matthew.Stewart2@amd.com> Reviewed-by: Roman Li <Roman.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Replace BUG()/BUG_ON() with error logs and safe returns in several
places where they can be triggered by invalid userspace input,
preventing DoS via kernel panic.
Signed-off-by: Ce Sun <cesun102@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
selftests/bpf: ignore call depth accounting for retbleed in verifier tests
When running the selftests on a retbleed-affected platform (e.g.
Skylake), with call depth accounting enabled
(CONFIG_CALL_DEPTH_TRACKING=y) _and_ with retbleed=stuff, some verifier
selftests fail to validate the jited instructions. For example:
Those affected selftests always fail on some call instruction: this
failure is due to the JIT compiler emitting call depth accounting for
retbleed mitigation (see x86_call_depth_emit_accounting calls in
bpf_jit_comp.c), resulting in an additional instruction being inserted
in front of every call instruction, similar to this one:
sarq $0x5, %gs:-0x39882741(%rip)
Fix those selftests by allowing them to ignore this possibly present
call depth accounting instruction.
zap_vma_range() requires the owning mm's mmap_lock to be held.
Taking mmap_read_lock under arena->lock would AB-BA against
arena_vm_close() and arena_map_mmap(), both of which run with
mmap_write_lock held and then acquire arena->lock. Instead drop
arena->lock, mmget_not_zero() the vma's mm, take mmap_read_lock, and
re-resolve the vma via find_vma() since it may have been unmapped or
replaced while waiting.
Track processed vmls with a per-call generation in vml->zap_gen and
serialize zap_pages() callers with a new arena->zap_mutex so
concurrent callers on different uaddr ranges do not mark each other's
vmls processed before the zap is done.
Matt Bobrowski [Wed, 3 Jun 2026 20:18:22 +0000 (20:18 +0000)]
bpf: clean up btf_scan_decl_tags()
Refactor the newly introduced btf_scan_decl_tags() to improve
readability and maintainability. The current implementation uses a
manual if-else chain and a magic number offset to strip the "arg:"
prefix from declaration tags.
Replace the if-else logic with a table-driven approach using a static
const array. This separates the tag data from the scanning logic, making
the helper more extensible for future tags. Additionally, replace the
magic number '4' with a sizeof-based calculation on the prefix string to
ensure the offset remains synchronized with the search key.
Finally, optimize the loop by moving the is_global check to the top of
the block. This allows the verifier to fail-fast on static subprograms
without performing unnecessary BTF string and type lookups.
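As an illustration of the table-driven approach and the sizeof-based prefix length described above, here is a standalone sketch. It is not the kernel implementation: enum arg_tag, the arg_tags[] table and parse_arg_tag() are hypothetical, and the tag names in the table are chosen only for the example.

```c
#include <stddef.h>
#include <string.h>

#define ARG_PREFIX "arg:"

/* Hypothetical tag identifiers for this sketch. */
enum arg_tag { TAG_NONE, TAG_CTX, TAG_NONNULL, TAG_NULLABLE };

/* Tag data lives in a table, separate from the scanning logic. */
static const struct {
	const char *name;
	enum arg_tag tag;
} arg_tags[] = {
	{ "ctx",      TAG_CTX },
	{ "nonnull",  TAG_NONNULL },
	{ "nullable", TAG_NULLABLE },
};

static enum arg_tag parse_arg_tag(const char *decl_tag)
{
	/* Derived from the prefix string, so it cannot drift out of sync. */
	const size_t prefix_len = sizeof(ARG_PREFIX) - 1;
	size_t i;

	if (strncmp(decl_tag, ARG_PREFIX, prefix_len) != 0)
		return TAG_NONE;
	decl_tag += prefix_len;

	for (i = 0; i < sizeof(arg_tags) / sizeof(arg_tags[0]); i++) {
		if (strcmp(decl_tag, arg_tags[i].name) == 0)
			return arg_tags[i].tag;
	}
	return TAG_NONE;
}
```

Adding a new tag in this shape means adding one table row; the lookup loop and the prefix handling stay untouched.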
Daniel Borkmann [Wed, 3 Jun 2026 21:16:58 +0000 (23:16 +0200)]
selftests/bpf: Test signed loader error paths
The positive path for signed BPF loaders is covered today by the
signed lskels (fentry_test, fexit_test, atomics).
But the runtime metadata check the generated loader performs (libbpf
gen_loader's emit_signature_match), the map content hash it relies
on, the load-time signature, and the immutability invariants of its
metadata map are not yet covered.
Thus, add a new, extensive test suite which drives libbpf's gen_loader
(bpf_object__gen_loader, gen_hash=true), the same machinery which
bpftool uses for signed light skeletons, and exercise corner cases
so that we can assert this in BPF CI:
map_excl exercises exclusive-map binding (allowed/denied), map-in-map
and map iterator rejection. It does not cover the create-time validation
of excl_prog_hash: the kernel only accepts a SHA-256-sized hash and
requires the pointer and size to be consistent.
Add map_excl_create_validation to check the rejected combinations:
Kenny Glowner [Thu, 21 May 2026 16:14:05 +0000 (11:14 -0500)]
rust: module_param: add missing newline to pr_warn_once
Add a trailing newline ('\n') to the pr_warn_once! call in set_param to
ensure the kernel ring buffer flushes the message correctly and
prevents log line smearing.
Signed-off-by: Kenny Glowner <SisyphusCode0311@gmail.com> Suggested-by: Miguel Ojeda <ojeda@kernel.org> Link: https://github.com/Rust-for-Linux/linux/issues/1139
[Sami: Updated the commit message as we use pr_warn_once now.] Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Andrii Kuchmenko [Mon, 18 May 2026 14:32:33 +0000 (17:32 +0300)]
module: decompress: check return value of module_extend_max_pages()
module_extend_max_pages() calls kvrealloc() internally and returns
-ENOMEM on allocation failure. The return value is never checked.
If the initial allocation fails, info->pages remains NULL and
info->max_pages remains 0. Subsequent calls to module_get_next_page()
will attempt to dynamically grow the array by calling
module_extend_max_pages(info, 0) since info->used_pages is 0. This
results in kvrealloc(NULL, 0) returning ZERO_SIZE_PTR, which is treated
as a success, leading to a dereference of ZERO_SIZE_PTR and a kernel
oops.
Fix: add the missing error check after module_extend_max_pages() and
return immediately on failure. This matches the pattern used by every
other kvrealloc() caller in the module loading path.
Fixes: b1ae6dc41eaa ("module: add in-kernel support for decompressing") Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Andrii Kuchmenko <capyenglishlite@gmail.com> Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
[Sami: Corrected the analysis in the commit message.] Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
net/mlx5: convert miss_list allocation to kvmalloc_array()
dr_icm_buddy_init_ste_cache() allocates the per-buddy miss_list using
the open-coded kvmalloc(n * sizeof(*p), ...) form. The neighbouring
allocations in the same function already use the kvcalloc()/
kvzalloc_objs() forms; switch this last one to kvmalloc_array() for
consistency and for the size_mul overflow check that kvmalloc_array()
performs.
The semantics are unchanged: kvmalloc_array() returns a non-zeroed
buffer, just like the previous kvmalloc() call. Existing callers of
buddy->miss_list initialise each list_head before use.
Thomas Weißschuh [Thu, 21 May 2026 06:53:16 +0000 (08:53 +0200)]
vdso/datastore: Always provide symbol declarations
Allow callers to easily reference these symbols in code that is built
even when the generic datastore is disabled.
As there are no good default no-op variants of these symbols, do not
provide stubs but require users to have their own fallback handling
using IS_ENABLED(CONFIG_HAVE_GENERIC_VDSO).
Originally this function was supposed to work the same way as
__arch_get_vdso_u_time_data() and be overridden on some architectures.
However the actually used implementation, which just adds PAGE_SIZE, does
not need this override mechanism.
Adjust the name to reflect the true nature of the function.
Breno reports a lockdep warning in bnxt. During FW reset the driver
may end up calling netif_set_real_num_tx_queues() (if queue count
changes), so calls to bnxt_open() still require rtnl_lock.
Sechang Lim [Wed, 3 Jun 2026 16:27:33 +0000 (16:27 +0000)]
udp: clear skb->dev before running a sockmap verdict
On the UDP receive path skb->dev is repurposed as dev_scratch (the
truesize/state cache set by udp_set_dev_scratch()), through the
union { struct net_device *dev; unsigned long dev_scratch; } in sk_buff.
When a UDP socket is in a sockmap, sk_data_ready is
sk_psock_verdict_data_ready(), which calls udp_read_skb() -> recv_actor()
(sk_psock_verdict_recv) to run the attached SK_SKB verdict program in softirq.
If that program calls a socket-lookup helper (bpf_sk_lookup_tcp/udp,
bpf_skc_lookup_tcp), bpf_skc_lookup() does:
if (skb->dev)
caller_net = dev_net(skb->dev);
skb->dev still holds the dev_scratch value (a non-NULL integer), so dev_net()
dereferences it as a struct net_device * and the kernel takes a general
protection fault on a non-canonical address in softirq:
The rmem charge that dev_scratch accounted for is released by skb_recv_udp() on
dequeue, just above, so the scratch is dead by the time recv_actor() runs. Clear
skb->dev so bpf_skc_lookup() falls back to sock_net(skb->sk), which
skb_set_owner_sk_safe() set just above.
Fixes: 965b57b469a5 ("net: Introduce a new proto_ops ->read_skb()") Cc: stable@vger.kernel.org Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260603162737.697215-1-rhkrqnwk98@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
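The aliasing problem described above can be shown with a small userspace model of the union. struct net_device_stub, struct skb_stub and dev_looks_set() are hypothetical stand-ins; the sketch only demonstrates why a leftover scratch value passes an "if (skb->dev)" style check, and it never dereferences the bogus pointer.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct net_device. */
struct net_device_stub {
	int ifindex;
};

/* Models only the dev / dev_scratch union of the real sk_buff. */
struct skb_stub {
	union {
		struct net_device_stub *dev;
		unsigned long dev_scratch;
	};
};

/* Mirrors the shape of the "if (skb->dev)" test quoted above. */
static int dev_looks_set(const struct skb_stub *skb)
{
	return skb->dev != NULL;
}
```

Storing any non-zero value through dev_scratch makes dev_looks_set() return true, and assigning NULL to dev afterwards makes it return false again, which is the effect the fix relies on.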
Xin Long [Wed, 3 Jun 2026 18:11:44 +0000 (14:11 -0400)]
sctp: purge outqueue on stale COOKIE-ECHO handling
sctp_stream_update() is only invoked when the association is moved into
COOKIE_WAIT during association setup/reconfiguration. In this path, the
outbound stream scheduler state (stream->out_curr) is expected to be
clean, since no user data should have been transmitted yet unless the
state machine has already partially progressed.
However, a corner case exists in sctp_sf_do_5_2_6_stale(): when a
Stale Cookie ERROR is received, the association is rolled back from
COOKIE_ECHOED to COOKIE_WAIT. In this scenario, user data may already
have been queued and even bundled with the COOKIE-ECHO chunk.
During the rollback, sctp_stream_update() frees the old stream table
and installs a new one, but it does not invalidate stream->out_curr.
As a result, out_curr may still point to a freed sctp_stream_out
entry from the previous stream state.
Later, SCTP scheduler dequeue paths (FCFS, RR, PRIO, etc.) rely on
stream->out_curr->ext, which can lead to use-after-free once the old
stream state has been released via sctp_stream_free().
This results in crashes such as (reported by Yuqi):
BUG: KASAN: slab-use-after-free in sctp_sched_fcfs_dequeue+0x13a/0x140
Read of size 8 at addr ff1100004d4d3208 by task mini_poc/9312
CPU: 1 UID: 1001 PID: 9312 Comm: mini_poc Not tainted 7.1.0-rc1-00305-gbd3a4795d574 #5 PREEMPT(full)
sctp_sched_fcfs_dequeue+0x13a/0x140
sctp_outq_flush+0x1603/0x33e0
sctp_do_sm+0x31c9/0x5d30
sctp_assoc_bh_rcv+0x392/0x6f0
sctp_inq_push+0x1db/0x270
sctp_rcv+0x138d/0x3c10
Fix this by fully purging the association outqueue when handling the
Stale Cookie case. This ensures all pending transmit and retransmit
state is dropped, and any scheduler cached pointers are invalidated,
making it safe to rebuild stream state during COOKIE_WAIT restart.
Updating only stream->out_curr would be insufficient, since queued
and retransmittable data would still reference the old stream state and
trigger later use-after-free in dequeue paths.
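The ordering problem can be sketched in user space as follows. This is a minimal model under stated assumptions, not the SCTP code: stream_model, model_purge(), model_update() and model_dequeue() are hypothetical names, and the purge is reduced to dropping queued counts and forgetting the cached pointer.

```c
#include <assert.h>
#include <stdlib.h>

/* User-space model only; hypothetical names, not the SCTP structures. */
struct out_stream { int queued; };

struct stream_model {
	struct out_stream *out;      /* stream table */
	size_t outcnt;
	struct out_stream *out_curr; /* scheduler's cached pointer into out[] */
};

/* Models the purge: drop all queued data and forget the cached stream. */
static void model_purge(struct stream_model *s)
{
	for (size_t i = 0; i < s->outcnt; i++)
		s->out[i].queued = 0;
	s->out_curr = NULL;
}

/*
 * Models the table swap: free the old table and install a fresh one.
 * Like the rollback path described above, it does not touch out_curr,
 * so the caller must have purged first.
 */
static int model_update(struct stream_model *s, size_t outcnt)
{
	struct out_stream *fresh = calloc(outcnt, sizeof(*fresh));

	if (!fresh)
		return -1;
	free(s->out);
	s->out = fresh;
	s->outcnt = outcnt;
	return 0;
}

/* Dequeue path: dereferences out_curr only when it is set. */
static int model_dequeue(struct stream_model *s)
{
	if (!s->out_curr || s->out_curr->queued == 0)
		return 0;
	s->out_curr->queued--;
	return 1;
}
```

Calling model_update() without a preceding model_purge() would leave out_curr pointing into the freed table, which is the situation the KASAN report above describes.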
Fixes: 5bbbbe32a431 ("sctp: introduce stream scheduler foundations")
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Zhengchuan Liang <zcliangcn@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Reported-by: Yuqi Xu <xuyq21@lenovo.com>
Reported-by: Ren Wei <n05ec@lzu.edu.cn>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/94318159b9052907a6cbb7256aee8b5f8dfbfccb.1780510304.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mark Brown [Thu, 4 Jun 2026 16:00:29 +0000 (17:00 +0100)]
ASoC: Intel: catpt: Code cleanup
Cezary Rojewski <cezary.rojewski@intel.com> says:
All of the changes found here are cleanups and, from a functional
perspective, have no impact - either unused code is being removed or
existing code is altered to use helpers/macros to improve readability.
Collateral of recent fixes [1]. There is one more patchset with a similar
goal following this one. Before the team managed to actually fix the
problem, a number of changes were added to make the code easier to
understand for people who are not the author (me).
Cezary Rojewski [Wed, 3 Jun 2026 08:58:26 +0000 (10:58 +0200)]
ASoC: Intel: catpt: Drop manipulation of the obsolete direction flag
Setting up direction for struct dma_slave_config is obsolete, see the
description of the struct. The transfer performed by the catpt-driver
is also always DMA_MEM_TO_MEM, not DMA_MEM_TO_DEV, with the preparation
step being dmaengine_prep_dma_memcpy().
DW's ->device_prep_dma_memcpy() always fixes the direction to
DMA_MEM_TO_MEM even if its user fails to do so, see
drivers/dma/dw/core.c. The change does affect the number of checks done
by ->device_config(), as the p/m buswidth checks are now skipped, but the
fields fixed up by those checks, i.e. .dst_addr_width and
.src_addr_width, do not take part in a DMA_MEM_TO_MEM transfer.
Cezary Rojewski [Wed, 3 Jun 2026 08:58:25 +0000 (10:58 +0200)]
ASoC: Intel: catpt: Remove unused WAVES controls
Support for the WAVES module was never officially published. The
kcontrols present in the existing code were added to retain 1:1 UAPI
with catpt-driver's predecessor, the haswell-driver, despite the lack of
users for the functionality. Several years have passed since the
successful transition to the catpt-driver and the removal of its
predecessor, and there is no reason to keep the unused code.
Cezary Rojewski [Wed, 3 Jun 2026 08:58:22 +0000 (10:58 +0200)]
ASoC: Intel: catpt: Replace RAM-helpers with resource_xxx()
For catpt_sram_init(), the exact same functionality has been available in
ioport.h since commit 9fb6fef0fb49 ("resource: Add resource set range and
size helpers").
As for catpt_dram/iram_size(), keep them for the driver initialization
only. Have all other manipulations done using the resource_xxx() API,
which is more familiar to kernel developers.
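For readers unfamiliar with the helpers in question, the arithmetic they encapsulate can be sketched in user space. This is a model written from memory of the inclusive start/end convention used by struct resource, not a copy of ioport.h: the model_ prefixed names and the model_resource type are hypothetical, and the exact kernel signatures should be checked against the header.

```c
#include <assert.h>
#include <stdint.h>

/* User-space model; the kernel's struct resource has more fields. */
typedef uint64_t model_size_t;

struct model_resource {
	model_size_t start;
	model_size_t end; /* inclusive, so a one-byte range has start == end */
};

/* Size of an inclusive [start, end] range. */
static model_size_t model_resource_size(const struct model_resource *res)
{
	return res->end - res->start + 1;
}

/* Keep start, move end so that the range covers 'size' bytes. */
static void model_resource_set_size(struct model_resource *res,
				    model_size_t size)
{
	assert(size > 0);
	res->end = res->start + size - 1;
}

/* Set both ends at once from a start address and a size. */
static void model_resource_set_range(struct model_resource *res,
				     model_size_t start, model_size_t size)
{
	res->start = start;
	model_resource_set_size(res, size);
}
```

A driver-private helper that initializes an SRAM region from a base and a length is equivalent to the set-range operation above, which is presumably why the private variant became redundant.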
Johan Hovold [Thu, 4 Jun 2026 11:59:12 +0000 (13:59 +0200)]
regulator: bq257xx: drop confusing configuration of_node
The driver reuses the OF node of the parent multi-function device but
still sets the of_node field of the regulator configuration to any prior
OF node.
Since the MFD child device does not have an OF node set until probe is
called, this field is set to NULL on first probe and to the reused OF
node if the driver is later rebound.
As the device_set_of_node_from_dev() helper drops a reference to any
prior OF node before taking a reference to the new one, this can
apparently also confuse LLMs like Sashiko, which flags it as a potential
use-after-free (which it is not).
Drop the confusing and redundant configuration of_node assignment.