Avoid an extra indirect function call by using bh_submit() instead of
submit_bh(). Also simplify the control flow now that the buffer
refcount is not put by bh_end_read().
Avoid an extra indirect function call by using bh_submit() instead
of submit_bh(). Also simplify the control flow now that the buffer
refcount is not put by bh_end_read().
ocfs2: Convert ocfs2_write_super_or_backup to bh_submit()
Avoid an extra indirect function call and changing the buffer refcount
by using bh_submit() instead of submit_bh().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20260528173150.1093780-22-willy@infradead.org Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: ocfs2-devel@lists.linux.dev Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Avoid an extra indirect function call and changing the buffer refcount
by using bh_submit() instead of submit_bh().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20260528173150.1093780-21-willy@infradead.org Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: ocfs2-devel@lists.linux.dev Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Avoid an extra indirect function call and changing the buffer refcount
by using bh_submit() instead of submit_bh().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20260528173150.1093780-20-willy@infradead.org Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: ocfs2-devel@lists.linux.dev Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Avoid an extra indirect function call and changing the buffer
refcount by using bh_submit() instead of submit_bh().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20260528173150.1093780-19-willy@infradead.org Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: ocfs2-devel@lists.linux.dev Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Avoid an extra indirect function call by using bh_submit()
instead of submit_bh() in journal_submit_commit_record()
and jbd2_journal_commit_transaction(). These both use
journal_end_buffer_io_sync(), so it's more straightforward to do them
both at once.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20260528173150.1093780-17-willy@infradead.org Acked-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: linux-ext4@vger.kernel.org Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Avoid an extra indirect function call and changing the buffer refcount
by converting ext4_end_bitmap_read() from bh_end_io_t to bio_end_io_t
and calling bh_submit().
buffer: Convert __block_write_full_folio to __bh_submit()
Avoid an extra indirect function call by using __bh_submit() instead
of submit_bh_wbc(). Since there is only one caller of submit_bh_wbc()
left, inline it into submit_bh().
buffer: Convert block_read_full_folio to bh_submit()
Avoid an extra indirect function call by using bh_submit() instead of
submit_bh(). Since mark_buffer_async_read() would collapse to a single
function call, inline it into block_read_full_folio() along with its
extensive comment. Convert end_buffer_async_read_io() to
bh_end_async_read().
Avoid an extra indirect function call and changing the buffer refcount
by using bh_submit() instead of submit_bh().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20260528173150.1093780-10-willy@infradead.org Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
buffer: Add bh_end_read(), bh_end_write() and bh_end_async_write()
These are the bio_end_io_t versions of end_buffer_read_sync(),
end_buffer_write_sync() and end_buffer_async_write(). They do not
contain a put_bh() call as it is no longer necessary.
All callers of mark_buffer_async_write_endio() pass
end_buffer_async_write, so we can inline mark_buffer_async_write_endio()
into mark_buffer_async_write() and just call that instead.
bh_submit() takes a bio_end_io allowing users to avoid the indirect
function call through bh->b_end_io, and eventually allowing us to remove
bh->b_end_io.
Nam Cao [Tue, 2 Jun 2026 17:51:46 +0000 (19:51 +0200)]
eventpoll: Fix epoll_wait() report false negative
ep_events_available() checks for available events by looking at
ep->rdllist and ep_is_scanning(). However, this is done without a lock
and can report false negative if ep_start_scan() or ep_done_scan() are
executed by another task concurrently. For example:
_________________________________________________________________________
|ep_start_scan()
| list_splice_init(&ep->rdllist, ...)
ep_events_available() |
!list_empty_careful(&ep->rdllist)|
|| ep_is_scanning(ep) |
| ep_enter_scan(ep)
___________________________________|_____________________________________
In the above examples, ep_events_available() sees no event despite
events being available. In case epoll_wait() is called with timeout=0,
epoll_wait() will wrongly return "no event" to user.
Introduce a sequence lock to resolve this issue.
Measuring the time consumption of 10 million loop iterations doing
epoll_wait(), the following performance drop is observed:
timeout #event before after diff
0ms 0 3727ms 3974ms +6.6%
0ms 1 8099ms 9134ms +13%
1ms 1 13525ms 13586ms +0.45%
Considering the use case of epoll_wait() (wait for events, do something
with the events, repeat), it should only contribute to a small portion of
user's CPU consumption. Therefore this performance drop is not alarming.
Nam Cao [Tue, 2 Jun 2026 17:51:45 +0000 (19:51 +0200)]
selftests/eventpoll: Add test for multiple waiters
Add a test whichs creates 64 threads who all epoll_wait() on the same
eventpoll. The source eventfd is written but never read, therefore all the
threads should always see an EPOLLIN event.
This test fails because of a kernel bug, which will be fixed by a follow-up
commit.
Merge patch series "mm: improve write performance with RWF_DONTCACHE"
Jeff Layton <jlayton@kernel.org> says:
This patch series is intended to improve write performance with
RWF_DONTCACHE. This version fixes additional stat accounting issues
found during review: integer promotion on 32-bit, cgroup writeback
domain migration, folio split flag preservation, and a UAF that could
occur in filemap_dontcache_kick_writeback().
* patches from https://patch.msgid.link/20260511-dontcache-v7-0-2848ddce8090@kernel.org:
mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
mm: track DONTCACHE dirty pages per bdi_writeback
mm: preserve PG_dropbehind flag during folio split
Jeff Layton [Mon, 11 May 2026 11:58:29 +0000 (07:58 -0400)]
mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
The IOCB_DONTCACHE writeback path in generic_write_sync() calls
filemap_flush_range() on every write, submitting writeback inline in
the writer's context. Perf lock contention profiling shows the
performance problem is not lock contention but the writeback submission
work itself — walking the page tree and submitting I/O blocks the writer
for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms
(dontcache).
Replace the inline filemap_flush_range() call with a flusher kick that
drains dirty pages in the background. This moves writeback submission
completely off the writer's hot path.
To avoid flushing unrelated buffered dirty data, add a dedicated
WB_start_dontcache bit and wb_check_start_dontcache() handler that uses
the per-wb WB_DONTCACHE_DIRTY counter to determine how many pages to
write back. The flusher writes back that many pages from the oldest dirty
inodes (not restricted to dontcache-specific inodes). This helps
preserve I/O batching while limiting the scope of expedited writeback.
Like WB_start_all, the WB_start_dontcache bit coalesces multiple
DONTCACHE writes into a single flusher wakeup without per-write
allocations. Use test_and_clear_bit to atomically consume the kick
request before reading the dirty counter and starting writeback, so that
concurrent DONTCACHE writes during writeback can re-set the bit and
schedule a follow-up flusher run.
Read the dirty counter with wb_stat_sum() (aggregating per-CPU batches)
rather than wb_stat() (which reads only the global counter) to ensure
small writes below the percpu batch threshold are visible to the flusher.
In filemap_dontcache_kick_writeback(), set the WB_start_dontcache bit
inside the unlocked_inode_to_wb_begin/end section for correct cgroup
writeback domain targeting, but defer the wb_wakeup() call until after
the section ends, since wb_wakeup() uses spin_unlock_irq() which would
unconditionally re-enable interrupts while the i_pages xa_lock may still
be held under irqsave during a cgroup writeback switch. Pin the wb with
wb_get() inside the RCU critical section before calling wb_wakeup()
outside it, since cgroup bdi_writeback structures are RCU-freed and the
wb pointer could become invalid after unlocked_inode_to_wb_end() drops
the RCU read lock.
Also add WB_REASON_DONTCACHE as a new writeback reason for tracing
visibility.
Dontcache now reaches 81% of buffered throughput (was 35%).
Competing writers (dontcache vs buffered, separate files):
Before After
buffered writer 868 433 MB/s
dontcache writer 415 433 MB/s
Aggregate 1,284 866 MB/s
Previously the buffered writer starved the dontcache writer 2:1.
With per-bdi_writeback tracking, both writers now receive equal
bandwidth. The aggregate matches the buffered-vs-buffered baseline
(863 MB/s), indicating fair sharing regardless of I/O mode.
The dontcache writer's p99.9 latency collapsed from 119 ms to
33 ms (-73%), eliminating the severe periodic stalls seen in the
baseline. Both writers now share identical latency profiles,
matching the buffered-vs-buffered pattern.
The per-bdi_writeback dirty tracking dramatically reduces peak dirty
pages in dontcache workloads, with the 32-file test dropping from
1.8 GB to 213 MB. Dontcache sequential write throughput triples and
multi-writer throughput reaches parity with buffered I/O, with tail
latencies collapsing by 1-2 orders of magnitude.
Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260511-dontcache-v7-3-2848ddce8090@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Jeff Layton [Mon, 11 May 2026 11:58:28 +0000 (07:58 -0400)]
mm: track DONTCACHE dirty pages per bdi_writeback
Add a per-wb WB_DONTCACHE_DIRTY counter that tracks the number of dirty
pages with the dropbehind flag set (i.e., pages dirtied via RWF_DONTCACHE
writes).
Increment the counter alongside WB_RECLAIMABLE in folio_account_dirtied()
when the folio has the dropbehind flag set, and decrement it in
folio_clear_dirty_for_io() and folio_account_cleaned(). Also decrement it
when a non-DONTCACHE lookup atomically clears the dropbehind flag on a
dirty folio in __filemap_get_folio_mpol(), using folio_test_clear_dropbehind()
to prevent concurrent lookups from double-decrementing the counter, and
guarding the decrement with mapping_can_writeback() to match the increment
path.
Transfer the counter alongside WB_RECLAIMABLE in inode_do_switch_wbs() so
that the stat is properly migrated when an inode switches cgroup writeback
domains.
The counter will be used by the writeback flusher to determine how many
pages to write back when expediting writeback for IOCB_DONTCACHE writes,
without flushing the entire BDI's dirty pages.
Suggested-by: Jan Kara <jack@suse.cz> Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260511-dontcache-v7-2-2848ddce8090@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Jeff Layton [Mon, 11 May 2026 11:58:27 +0000 (07:58 -0400)]
mm: preserve PG_dropbehind flag during folio split
__split_folio_to_order() copies page flags from the original folio to
newly created sub-folios using an explicit allowlist, but PG_dropbehind
is not included. When a large folio with PG_dropbehind set is split,
only the head sub-folio retains the flag; all tail sub-folios silently
lose it and will not be reclaimed eagerly after writeback completes.
Add PG_dropbehind to the flag copy mask so that the drop-behind hint
is preserved across folio splits.
Fixes: a323281cdfec ("mm: add PG_dropbehind folio flag") Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260511-dontcache-v7-1-2848ddce8090@kernel.org Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Merge patch series "libfs: set SB_I_NOEXEC and SB_I_NODEV in init_pseudo()"
John Hubbard <jhubbard@nvidia.com> says:
This began as a one-line dma-buf fix for a path_noexec() warning added
by commit 1e7ab6f67824 ("anon_inode: rework assertions"). Christoph
pointed out that the fix belongs higher up: a pseudo filesystem has no
reason not to set SB_I_NOEXEC by default. This series does that.
* Patch 1 sets both flags in init_pseudo(), so every pseudo
filesystem gets them. This is the only patch that changes a flag,
and the only one with Fixes:/Cc: stable.
* Patch 2 drops the assignments that are now redundant in the callers
that set them by hand.
Most callers already set one or both flags. I audited every
init_pseudo() caller. Here is what patch 1 actually changes for each.
The only visible effect is on dma-buf, where SB_I_NOEXEC silences the
warning. SB_I_NODEV is never consulted on these SB_NOUSER mounts, and
none of the callers that gain SB_I_NOEXEC are executed from.
caller had patch 1 adds
--------------------------- -------- --------------
fs/anon_inodes.c both nothing new
mm/secretmem.c both nothing new
virt/kvm/guest_memfd.c both nothing new
fs/nsfs.c both nothing new
fs/pidfs.c both nothing new
fs/aio.c NOEXEC NODEV
drivers/dma-buf/dma-buf.c neither NOEXEC + NODEV
net/socket.c neither NOEXEC + NODEV
fs/pipe.c neither NOEXEC + NODEV
kernel/resource.c neither NOEXEC + NODEV
fs/erofs/super.c neither NOEXEC + NODEV
fs/btrfs/tests/... neither NOEXEC + NODEV
drivers/vfio/vfio_main.c neither NOEXEC + NODEV
drivers/gpu/drm/drm_drv.c neither NOEXEC + NODEV
drivers/dax/super.c neither NOEXEC + NODEV
block/bdev.c neither NOEXEC + NODEV
* patches from https://patch.msgid.link/20260604025315.245910-1-jhubbard@nvidia.com:
libfs: drop redundant SB_I_NOEXEC/SB_I_NODEV in init_pseudo() callers
libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
John Hubbard [Thu, 4 Jun 2026 02:53:14 +0000 (19:53 -0700)]
libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
Since commit 1e7ab6f67824 ("anon_inode: rework assertions"),
path_noexec() warns when an anonymous-inode file is mmap'd from a
superblock that has not set SB_I_NOEXEC. dma-buf backs its files this
way and never set the flag, so mmap of any exported buffer trips the
warning on a CONFIG_DEBUG_VFS=y kernel:
init_pseudo() sets up internal SB_NOUSER mounts that are never
path-reachable. Set both flags here so every pseudo filesystem gets
them by default instead of each caller setting them.
SB_I_NODEV is inert for unreachable mounts. SB_I_NOEXEC has one
visible effect: an executable mapping of a pseudo-fs fd, such as a
dma-buf, now fails with -EPERM, which is the invariant the assertion
enforces. No in-tree caller maps these executable.
Reproduce on CONFIG_DEBUG_VFS=y:
make -C tools/testing/selftests/dmabuf-heaps
sudo ./tools/testing/selftests/dmabuf-heaps/dmabuf-heap -t system
Fixes: 1e7ab6f67824 ("anon_inode: rework assertions") Suggested-by: Christoph Hellwig <hch@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: John Hubbard <jhubbard@nvidia.com> Link: https://patch.msgid.link/20260604025315.245910-2-jhubbard@nvidia.com Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Joanne Koong [Thu, 4 Jun 2026 01:18:58 +0000 (18:18 -0700)]
iomap: avoid potential null folio->mapping deref during error reporting
When a buffered read fails, iomap_finish_folio_read() reports the error
with fserror_report_io(folio->mapping->host, ...). This is called after
ifs->read_bytes_pending has been decremented by the bytes attempted to
be read.
For a folio split across multiple read completions, the folio is only
guaranteed to stay locked while read_bytes_pending > 0. Once
iomap_finish_folio_read() decrements read_bytes_pending, another
in-flight read can complete and end the read on the folio, which unlocks
it. This allows truncate logic to run and detach the folio (set
folio->mapping to NULL). The error reporting path then can dereference a
NULL folio->mapping. As reported by Sam Sun, this is the race that can
occur:
CPU0: failed completion CPU1: final completion CPU2: truncate
----------------------- ---------------------- --------------
read_bytes_pending -= len
finished = false
/* preempted before
fserror_report_io() */
read_bytes_pending -= len
finished = true
folio_end_read()
truncate clears
folio->mapping
fserror_report_io(
folio->mapping->host, ...)
^ NULL deref
Fix this by reporting the error first before decrementing
ifs->read_bytes_pending.
Fixes: a9d573ee88af ("iomap: report file I/O errors to the VFS") Cc: stable@vger.kernel.org Reported-by: Sam Sun <samsun1006219@gmail.com> Closes: https://lore.kernel.org/linux-fsdevel/CAEkJfYPhWdd59RKmuNLJg-bkypHz7xiOwaWyNVu3A8CUqQCnvg@mail.gmail.com/ Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://patch.msgid.link/20260604011858.2297561-1-joannelkoong@gmail.com Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Jann Horn [Wed, 3 Jun 2026 19:31:57 +0000 (21:31 +0200)]
fhandle: fix UAF due to unlocked ->mnt_ns read in may_decode_fh()
may_decode_fh() accesses mount::mnt_ns without holding any locks; that
means the mount can concurrently be unmounted, and the mnt_namespace can
concurrently be freed after an RCU grace period.
This race can happens as follows, assuming that the mount point was
created by open_tree(..., OPEN_TREE_CLONE):
Fix it by taking rcu_read_lock() around the mount::mnt_ns access, like
in __prepend_path().
Additionally, document the semantics of mount::mnt_ns, and use WRITE_ONCE()
for writers that can race with lockless readers.
This bug is unreachable unless one of the following is set:
because it requires an RCU grace period to happen during a syscall without
an explicit preemption.
This doesn't seem to have interesting security impact; worst-case, it could
leak the result of an integer comparison to userspace (from the level
check in cap_capable()), cause an endless loop, or crash the kernel by
dereferencing an invalid address.
Miguel Ojeda [Tue, 2 Jun 2026 15:16:38 +0000 (17:16 +0200)]
kbuild: rust: rename flag to `-Zdebuginfo-for-profiling` for Rust >= 1.98
Starting with Rust 1.98.0 (expected 2026-08-20), the
`-Zdebug-info-for-profiling` flag has been renamed to
`-Zdebuginfo-for-profiling` (i.e. one less dash, to match `debuginfo`s
in other flags) [1].
Without this change, one gets in the latest nightlies:
Andrew Jones [Wed, 27 May 2026 14:27:03 +0000 (09:27 -0500)]
kconfig: add kconfig-sym-check static checker
Add 'make kconfig-sym-check', a static checker that finds Kconfig
symbols referenced in expressions (select, depends on, default, etc.)
but never defined via config/menuconfig anywhere in the tree. New
dangling symbols are reported as errors (exit 1) unless they are
listed in an exclusion file, e.g.
KCONFIG_SYM_CHECK_EXCLUDES=sym-check-excludes make kconfig-sym-check
The exclusion file lists one symbol per line; blank lines and lines
starting with '#' are ignored.
The checker also warns about uppercase N/Y/M used as tristate literal
values following the same logic as checkpatch.
This new static checker is the script used for [1] with a few
improvements to avoid some false positives.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216748 Assisted-by: Claude:claude-sonnet-4-6 Signed-off-by: Andrew Jones <andrew.jones@linux.dev> Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Julian Braha <julianbraha@gmail.com> Tested-by: Nicolas Schier <nsc@kernel.org> Acked-by: Nicolas Schier <nsc@kernel.org> Link: https://patch.msgid.link/20260527142703.107110-1-andrew.jones@linux.dev Signed-off-by: Nathan Chancellor <nathan@kernel.org>
====================
Fix use-after-free in metadata dst teardown in airoha_eth and mtk_eth_soc drivers
airoha_metadata_dst_free() and mtk_free_dev() call metadata_dst_free()
which frees the metadata_dst with kfree() immediately, bypassing the RCU
grace period.
Replace metadata_dst_free() with dst_release() which properly goes
through the refcount path and runs call_rcu_hurry() if refcount goes to
zero.
====================
net: ethernet: mtk_eth_soc: Fix use-after-free in metadata dst teardown
mtk_free_dev() calls metadata_dst_free() which frees the metadata_dst
with kfree() immediately, bypassing the RCU grace period.
In the RX path, skb_dst_set_noref() sets a non-refcounted pointer from
the skb to the metadata_dst. This function requires RCU read-side
protection and the dst must remain valid until all RCU readers complete.
Since metadata_dst_free() calls kfree() directly, a use-after-free can
occur if any skb still holds a noref pointer to the dst when the driver
tears it down.
Replace metadata_dst_free() with dst_release() which properly goes
through the refcount path: when the refcount drops to zero, it schedules
the actual free via call_rcu_hurry(), ensuring all RCU readers have
completed before the memory is freed.
net: airoha: Fix use-after-free in metadata dst teardown
airoha_metadata_dst_free() runs metadata_dst_free() which frees the
metadata_dst with kfree() immediately, bypassing the RCU grace period.
In the RX path, skb_dst_set_noref() sets a non-refcounted pointer from
the skb to the metadata_dst. This function requires RCU read-side
protection and the dst must remain valid until all RCU readers complete.
Since metadata_dst_free() calls kfree() directly, an use-after-free can
occur if any skb still holds a noref pointer to the dst when the driver
tears it down.
Replace metadata_dst_free() with dst_release() which properly goes
through the refcount path: when the refcount drops to zero, it schedules
the actual free via call_rcu_hurry(), ensuring all RCU readers have
completed before the memory is freed.
Jakub Kicinski [Thu, 4 Jun 2026 02:07:46 +0000 (19:07 -0700)]
Merge tag 'for-net-2026-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- hci_core: fix memory leak in error path of hci_alloc_dev()
- hci_sync: reject oversized Broadcast Announcement prepend
- MGMT: Fix backward compatibility with userspace
- MGMT: validate advertising TLV before type checks
- L2CAP: reject BR/EDR signaling packets over MTUsig
- RFCOMM: validate skb length in MCC handlers
- RFCOMM: hold listener socket in rfcomm_connect_ind()
- ISO: Fix not releasing hdev reference on iso_conn_big_sync
- ISO: Fix a use-after-free of the hci_conn pointer
- ISO: Fix data-race on iso_pi fields in hci_get_route calls
- SCO: Fix data-race on sco_pi fields in sco_connect
- BNEP: reject short frames before parsing
* tag 'for-net-2026-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: MGMT: Fix backward compatibility with userspace
Bluetooth: SCO: Fix data-race on sco_pi fields in sco_connect
Bluetooth: ISO: Fix data-race on iso_pi fields in hci_get_route calls
Bluetooth: ISO: Fix a use-after-free of the hci_conn pointer
Bluetooth: ISO: Fix not releasing hdev reference on iso_conn_big_sync
Bluetooth: fix memory leak in error path of hci_alloc_dev()
Bluetooth: bnep: reject short frames before parsing
Bluetooth: hci_sync: reject oversized Broadcast Announcement prepend
Bluetooth: L2CAP: reject BR/EDR signaling packets over MTUsig
Bluetooth: RFCOMM: validate skb length in MCC handlers
Bluetooth: MGMT: validate advertising TLV before type checks
Bluetooth: RFCOMM: hold listener socket in rfcomm_connect_ind()
====================
Jakub Kicinski [Thu, 4 Jun 2026 02:07:34 +0000 (19:07 -0700)]
Merge tag 'wireless-2026-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
Things are finally quieting down:
- iwlwifi:
- FW reset handshake removal for older devices
- NIC access fix in fast resume
- avoid too large command for some BIOSes
- fix TX power constraints in AP mode
- cfg80211:
- fix netlink parse overflow
- fix potential 6 GHz scan memory leak
- enforce HE/EHT consistency to avoid mac80211 crash
- mac80211: guard radiotap antenna parsing
* tag 'wireless-2026-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: cfg80211: enforce HE/EHT cap/oper consistency
wifi: fix leak if split 6 GHz scanning fails
wifi: mac80211: limit injected antenna index in ieee80211_parse_tx_radiotap
wifi: nl80211: reject oversized EMA RNR lists
wifi: iwlwifi: pcie: simplify the resume flow if fast resume is not used
wifi: iwlwifi: mvm: avoid oversized UATS command copy
wifi: iwlwifi: mld: send tx power constraints before link activation
wifi: iwlwifi: mvm: don't support the reset handshake for old firmwares
====================
Jakub Kicinski [Thu, 4 Jun 2026 02:04:46 +0000 (19:04 -0700)]
Merge branch 'mptcp-misc-fixes-for-v7-1-rc7'
Matthieu Baerts says:
====================
mptcp: misc fixes for v7.1-rc7
Here are various unrelated fixes:
- Patch 1: fix missing wakeups when multiple threads are reading from
the same fd. A fix for v5.7.
- Patch 2: fix retransmission loop when MPTCP checksum is enabled. A fix
for v5.14.
- Patch 3: fix a TOCTOU race while computing rcv_wnd. A fix for v5.11.
- Patch 4: allow subflows receive window to shrink if needed. A fix for
v5.19.
- Patches 5-6: avoid 'extra_subflows' to underflow with the userspace
PM. A fix for v5.19.
- Patch 7: report errors if one subflow cannot set SO_TIMESTAMPING. A
fix for v5.14.
- Patch 8: try to set TCP_MAXSEG on all subflows, before reporting
errors, if any. A fix for v6.17.
- Patch 9: check desc->count in read_sock, to act as expected. A fix
for v7.0.
- Patch 10: fix an uninit value in mptcp_established_options, reported
by syzbot. A fix for v7.1-rc1.
- Patch 11: fix a similar issue than the previous patch, exposed by the
same modification from v7.1-rc1, but was already causing issues since
v5.15.
====================
When an ADD_ADDR needs to be sent, it could be prepared if there is
enough remaining space and even if the packet is not a pure ACK. But it
would be dropped soon after.
Indeed, in mptcp_pm_add_addr_signal(), there is enough space to fit a
DSS of 20 octets and an ADD_ADDR echo containing an IPv4 address on 8
octets for example. In this case, the packet would be prepared, the
MPTCP_ADD_ADDR_ECHO bit would be removed from pm->addr_signal, but the
option would be silently dropped in mptcp_established_options_add_addr()
not to override DSS info in the union from 'struct mptcp_out_options',
and also because mptcp_write_options() will enforce mutually exclusion
with DSS.
Instead, don't even try to send an ADD_ADDR if it is not a pure ACK.
Retry for each new packet until a pure-ACK is emitted. That's fine to do
that, because each time an ADD_ADDR (echo) is scheduled, a pure ACK is
queued.
This also simplifies the code, and the skb checks can be done earlier,
before the lock.
Note: also, since commit 6d0060f600ad ("mptcp: Write MPTCP DSS headers
to outgoing data packets"), opts->ahmac would not have been set to 0
when other suboptions were not dropped, and when sending an ADD_ADDR
echo. That would have resulted in sending an ADD_ADDR using garbage
info, where there was not enough space, instead of an echo one without
the ADD_ADDR HMAC.
Local variable opts created at:
__tcp_transmit_skb+0x4d/0x5fe0 net/ipv4/tcp_output.c:1536
__tcp_send_ack+0x967/0xad0 net/ipv4/tcp_output.c:4499
The output path currently omits initializing the mptcp extension
`use_map` flag in a few corner cases.
Address the issue always zeroing all the extensions flags before
eventually initializing the individual bits. To that extent, introduce
and use a struct_group to avoid multiple bitwise operations.
Gang Yan [Tue, 2 Jun 2026 12:14:16 +0000 (22:14 +1000)]
mptcp: check desc->count in read_sock
__tcp_read_sock() checks desc->count after each skb is consumed and
breaks the loop when it reaches 0. The MPTCP variant lacks this check.
This is a functional bug, other subsystems also rely on this check:
TLS strparser sets desc->count to 0 once a full TLS record is assembled
and depends on this break to stop reading.
Add the same desc->count check to __mptcp_read_sock(), mirroring
__tcp_read_sock().
The mptcp_setsockopt_all_sf(), currently used only with TCP_MAXSEG,
stopped when one subflow returned an error.
Even if it is not wrong, this is different from the other helpers trying
to set the option on all subflows, and then returning an error if at
least one of them had an issue.
Follow this behaviour, for a question of uniformity.
sock_set_timestamping() can fail for different reasons. The returned
value should then be checked.
If sock_set_timestamping() fails for at least one subflow, the first
error is now reported to the userspace, similar to what is done with
other socket options.
Fixes: 9061f24bf82e ("mptcp: sockopt: propagate timestamp request to subflows") Cc: stable@vger.kernel.org Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Closes: https://lore.kernel.org/willemdebruijn.kernel.178a41a53d041@gmail.com Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-7-856831229976@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tao Cui [Tue, 2 Jun 2026 12:14:13 +0000 (22:14 +1000)]
selftests: mptcp: add test for extra_subflows underflow on userspace PM
Add a test to verify that when userspace PM fails to create a subflow
(e.g. using an unreachable address), the extra_subflows counter is not
decremented below zero.
Fixes: 77e4b94a3de6 ("mptcp: update userspace pm infos") Cc: stable@vger.kernel.org Signed-off-by: Tao Cui <cuitao@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-6-856831229976@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tao Cui [Tue, 2 Jun 2026 12:14:12 +0000 (22:14 +1000)]
mptcp: pm: fix extra_subflows underflow on userspace PM subflow creation
The userspace PM increments extra_subflows after __mptcp_subflow_connect()
succeeds, but __mptcp_subflow_connect() calls mptcp_pm_close_subflow()
on failure to roll back the pre-increment done by the kernel PM's fill_*()
helpers. Because the userspace PM hasn't incremented yet at that point,
this decrement is spurious and causes extra_subflows to underflow.
Fix it by aligning the userspace PM with the kernel PM: increment
extra_subflows before calling __mptcp_subflow_connect(), so the existing
error path in subflow.c correctly rolls it back on failure. Also simplify
the error handling by taking pm.lock only when needed for cleanup.
Fixes: 77e4b94a3de6 ("mptcp: update userspace pm infos") Cc: stable@vger.kernel.org Signed-off-by: Tao Cui <cuitao@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-5-856831229976@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Tue, 2 Jun 2026 12:14:11 +0000 (22:14 +1000)]
mptcp: allow subflow rcv wnd to shrink
In MPTCP connection, the `window` field in the TCP header refers to the
MPTCP-level rcv_nxt and it's right edge should not move backward. Such
constraint is enforced at DSS option generation time.
At the same time, the TCP stack ensures independently that the TCP-level
rcv wnd right's edge does not move backward. That in turn causes artificial
inflating of the MPTCP rcv window when the incoming data is acked at the
TCP level and is OoO in the MPTCP sequence space (or lands in the backlog).
As a consequence, the incoming traffic can exceed the receiver rcvbuf size
even when the sender is not misbehaving.
Prevent such scenario forcibly allowing the TCP subflow to shrink the
TCP-level rcv wnd regardless of the current netns setting.
Fixes: f3589be0c420 ("mptcp: never shrink offered window") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-4-856831229976@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Tue, 2 Jun 2026 12:14:10 +0000 (22:14 +1000)]
mptcp: close TOCTOU race while computing rcv_wnd
The MPTCP output path access locklessly the MPTCP-level ack_seq
in multiple times, using possibly different values for the data_ack
in the DSS option and to compute the announced rcv wnd for the same
packet.
Refactor the cote to avoid inconsistencies which may confuse the
peer. Also ensure that the MPTCP level rcv wnd is updated only when
the egress packet actually contains a DSS ack.
Fixes: fa3fe2b15031 ("mptcp: track window announced to peer") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-3-856831229976@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Tue, 2 Jun 2026 12:14:09 +0000 (22:14 +1000)]
mptcp: fix retransmission loop when csum is enabled
Sashiko noted that retransmission with csum enabled can actually
transmit new data, but currently the relevant code does not update
accordingly snd_nxt.
The may cause incoming ack drop and an endless retransmission loop.
Paolo Abeni [Tue, 2 Jun 2026 12:14:08 +0000 (22:14 +1000)]
mptcp: fix missing wakeups in edge scenarios
The mptcp_recvmsg() can fill MPTCP socket receive queue via
mptcp_move_skbs(), but currently does not try to wakeup any listener,
because the same process is going to check the receive queue soon.
When multiple threads are reading from the same fd, the above can
cause stall. Add the missing wakeup.
Fixes: 6771bfd9ee24 ("mptcp: update mptcp ack sequence from work queue") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-1-856831229976@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kurt Kanzenbach [Fri, 29 May 2026 17:11:47 +0000 (19:11 +0200)]
ptp: vclock: Switch from RCU to SRCU
The usage of PTP vClocks leads immediately to the following issues with
ptp4l with LOCKDEP and DEBUG_ATOMIC_SLEEP enabled: "BUG: sleeping function
called from invalid context".
ptp_convert_timestamp() acquires a mutex_t within a RCU read section. This
is illegal, because acquiring a mutex_t can result in voluntary scheduling
request which is not allowed within a RCU read section.
Replace the RCU usage with SRCU where sleeping is allowed.
Reported-by: Florian Zeitz <florian.zeitz@schettke.com> Closes: https://lore.kernel.org/all/00a8cce8-410e-4038-98af-49be6d93d7bd@schettke.com/ Fixes: 67d93ffc0f3c ("ptp: vclock: use mutex to fix "sleep on atomic" bug") Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20260529-vclock_rcu-v2-1-02a5531fab92@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Tue, 2 Jun 2026 16:15:47 +0000 (16:15 +0000)]
ipv4: restrict IPOPT_SSRR and IPOPT_LSRR options
This patch restricts setting Loose Source and Record Route (LSRR)
and Strict Source and Record Route (SSRR) IP options to users
with CAP_NET_RAW capability.
This prevents unprivileged applications from forcing packets to route
through attacker-controlled nodes to leak TCP ISN and possibly other
protocol information.
While LSRR and SSRR are commonly filtered in many network environments,
they may still be supported and forwarded along some network paths.
RFC 7126 (Recommendations on Filtering of IPv4 Packets Containing
IPv4 Options) recommend to drop these options in 4.3 and 4.4.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Tamir Shahar <tamirthesis@gmail.com> Reported-by: Amit Klein <aksecurity@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260602161547.2642155-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jianyu Li [Mon, 1 Jun 2026 11:36:40 +0000 (19:36 +0800)]
af_unix: Add test for SCM_INQ on partial read
Add test to verify that when a skb is partially consumed,
unix_inq_len() return correct remaining byte count.
Before:
# RUN scm_inq.stream.partial_read ...
# scm_inq.c:165:partial_read:Expected remain (512) == *(int *)CMSG_DATA(cmsg) (768)
# partial_read: Test terminated by assertion
# FAIL scm_inq.stream.partial_read
not ok 2 scm_inq.stream.partial_read
After:
# RUN scm_inq.stream.partial_read ...
# OK scm_inq.stream.partial_read
ok 2 scm_inq.stream.partial_read
Jianyu Li [Mon, 1 Jun 2026 11:36:39 +0000 (19:36 +0800)]
af_unix: Fix inq_len update problem in partial read
Currently inq_len is updated only when the whole skb is consumed.
If only part of the data is read, following SIOCINQ query would
get value greater than what actually left.
This change update inq_len timely in unix_stream_read_generic(),
and adjust unix_stream_read_skb() accordingly to prevent
repetitive update.
Fixes: f4e1fb04c123 ("af_unix: Use cached value for SOCK_STREAM in unix_inq_len().") Signed-off-by: Jianyu Li <jianyu.li@mediatek.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260601113640.231897-2-jianyu.li@mediatek.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Suman Ghosh [Fri, 29 May 2026 11:37:05 +0000 (17:07 +0530)]
octeontx2-af: Fix initialization of mcam's entry2target_pffunc field
NPC mcam entry stores a mapping between mcam entry and target pcifunc.
During initialization of this field, API kmalloc_array has been used which
caused some junk values to array. Whereas, the array is expected to be
initialized by 0. This patch fixes the same by using kcalloc instead of
kmalloc_array.
Fixes: 55307fcb9258 ("octeontx2-af: Add mbox messages to install and delete MCAM rules") Signed-off-by: Suman Ghosh <sumang@marvell.com> Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1780054625-17090-1-git-send-email-sbhatta@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Geetha sowjanya [Fri, 29 May 2026 11:37:57 +0000 (17:07 +0530)]
octeontx2-pf: Fix NDC sync operation errors
On system reboot "rvu_nicpf 0002:03:00.0: NDC sync operation failed"
error messages are shown, even if the operations is successful.
This is due to wrong if error check in ndc_syc() function.
Fixes: 42c45ac1419c ("octeontx2-af: Sync NIX and NPA contexts from NDC to LLC/DRAM") Signed-off-by: Geetha sowjanya <gakula@marvell.com> Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1780054677-17249-1-git-send-email-sbhatta@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yizhou Zhao [Fri, 29 May 2026 10:50:16 +0000 (18:50 +0800)]
appletalk: aarp: zero-initialize aarp_entry to prevent heap info leak
aarp_alloc() allocates struct aarp_entry without zeroing it, but only
initializes refcnt and packet_queue. When an unresolved AARP entry is
created, hwaddr[ETH_ALEN] is left uninitialized.
aarp_seq_show() later prints this field with %pM when users read
/proc/net/atalk/arp. This can expose 6 bytes of stale heap data for
each unresolved entry.
Fix this by zero-initializing struct aarp_entry at allocation time.
Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260529105017.81531-1-zhaoyz24@mails.tsinghua.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jonas Jelonek [Thu, 28 May 2026 20:52:40 +0000 (20:52 +0000)]
net: sfp: initialize i2c_block_size at adapter configure time
sfp->i2c_block_size is only assigned in sfp_sm_mod_probe(), which runs
from the state machine timer after SFP_F_PRESENT has been set. Between
those two points, sfp_module_eeprom() (the ethtool -m callback) gates
only on SFP_F_PRESENT and can be entered with i2c_block_size still at
its kzalloc'd value of 0.
On a pure-I2C adapter, sfp_i2c_read() then issues an i2c_transfer()
with msgs[1].len = 0 inside a loop that subtracts this_len from len
each iteration; on adapters that succeed a zero-length read the loop
never advances, spinning while holding rtnl_lock.
This was previously addressed by initializing i2c_block_size in
sfp_alloc() (commit 813c2dd78618), but the initialization was dropped
when i2c_block_size was split from i2c_max_block_size.
Initialize sfp->i2c_block_size from sfp->i2c_max_block_size in
sfp_i2c_configure(), so the field is valid as soon as the adapter is
known. sfp_sm_mod_probe() still reassigns it on each module insertion
to recover from a per-module clamp to 1 (sfp_id_needs_byte_io).
Jason Xing [Sat, 30 May 2026 04:26:30 +0000 (12:26 +0800)]
xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()
The TX metadata area resides in the UMEM buffer which is memory-mapped
and concurrently writable by userspace. In xsk_skb_metadata(),
csum_start and csum_offset are read from shared memory for bounds
validation, then read again for skb assignment. A malicious userspace
application can race to overwrite these values between the two reads,
bypassing the bounds check and causing out-of-bounds memory access
during checksum computation in the transmit path.
Fix this by reading csum_start and csum_offset into local variables
once, then using the local copies for both validation and assignment.
Note that other metadata fields (flags, launch_time) and the cached
csum fields may be mutually inconsistent due to concurrent userspace
writes, but this is benign: the only security-critical invariant is
that each field's validated value is the same one used, which local
caching guarantees.
Closes: https://lore.kernel.org/all/20260503200927.73EA1C2BCB4@smtp.kernel.org/ Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Fixes: 48eb03dd2630 ("xsk: Add TX timestamp and TX checksum offload support") Link: https://patch.msgid.link/20260530042630.80626-1-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Usama Arif [Tue, 2 Jun 2026 17:22:47 +0000 (10:22 -0700)]
mm/mincore: handle non-swap entries before !CONFIG_SWAP guard
mincore_swap() also fields migration/hwpoison entries (and shmem
swapin-error entries), which can exist on !CONFIG_SWAP builds when
CONFIG_MIGRATION or CONFIG_MEMORY_FAILURE is enabled. The
!IS_ENABLED(CONFIG_SWAP) guard ran before the non-swap-entry early return,
so mincore_pte_range() can spuriously WARN and report these pages
nonresident on !CONFIG_SWAP kernels.
Move the guard below the non-swap-entry check so only true swap entries
trip the WARN, and migration/hwpoison entries take the existing "uptodate
/ non-shmem" path.
Link: https://lore.kernel.org/20260602172247.279421-1-usama.arif@linux.dev Fixes: 1f2052755c15 ("mm/mincore: use a helper for checking the swap cache") Signed-off-by: Usama Arif <usama.arif@linux.dev> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Kairui Song <kasong@tencent.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Baoquan He <baoquan.he@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alistair Popple [Thu, 21 May 2026 03:27:30 +0000 (13:27 +1000)]
arm64: mm: call pagetable dtor when freeing hot-removed page tables
Since 5e8eb9aeeda3 ("arm64: mm: always call PTE/PMD ctor in
__create_pgd_mapping()") page-table allocation on ARM64 always calls
pagetable_{pte,pmd,pud,p4d}_ctor(). This sets the page_type to
PGTY_table, increments NR_PAGETABLE and possible allocates a PTL. However
the matching pagetable_dtor() calls were never added.
With DEBUG_VM enabled on kernel versions prior to v6.17 without 2dfcd1608f3a9 ("mm/page_alloc: let page freeing clear any set page type")
this leads to the following warning when freeing these pages due to
page->page_type sharing page->_mapcount:
BUG: Bad page state in process ... pfn:284fbb
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x284fbb
flags: 0x17fffc000000000(node=0|zone=2|lastcpupid=0x1ffff)
page_type: f2(table)
page dumped because: nonzero mapcount
Call trace:
bad_page+0x13c/0x160
__free_frozen_pages+0x6cc/0x860
___free_pages+0xf4/0x180
free_pages+0x54/0x80
free_hotplug_page_range.part.0+0x58/0x90
free_empty_tables+0x438/0x500
__remove_pgd_mapping.constprop.0+0x60/0xa8
arch_remove_memory+0x48/0x80
try_remove_memory+0x158/0x1d8
offline_and_remove_memory+0x138/0x180
It can also lead to leaking the ptl allocation if ALLOC_SPLIT_PTLOCKS is
defined and incorrect NR_PAGETABLE stats. Fix this by calling
pagetable_dtor() in free_hotplug_pgtable_page() prior to freeing the page
to undo the effects of calling pagetable_*_ctor().
Link: https://lore.kernel.org/20260521032730.2104017-1-apopple@nvidia.com Fixes: 5e8eb9aeeda3 ("arm64: mm: always call PTE/PMD ctor in __create_pgd_mapping()") Signed-off-by: Alistair Popple <apopple@nvidia.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shakeel Butt [Mon, 1 Jun 2026 16:15:01 +0000 (09:15 -0700)]
mm/list_lru: drain before clearing xarray entry on reparent
memcg_reparent_list_lrus() clears the dying memcg's xarray entry with
xas_store(&xas, NULL) before reparenting its per-node lists into the
parent. This opens a window where a concurrent list_lru_del() arriving
for the dying memcg sees xa_load() == NULL, walks to the parent in
lock_list_lru_of_memcg(), takes the parent's per-node lock, and calls
list_del_init() on an item still physically linked on the dying memcg's
list.
If another in-flight thread holds the dying memcg's per-node lock at the
same moment (another list_lru_del, or a list_lru_walk_one running an
isolate callback), both threads modify ->next/->prev pointers on the same
physical list under different locks. Adjacent items can corrupt each
other's links.
Fix it by reversing the order: reparent each per-node list and mark the
child's list lru dead and then clear the xarray entry. Any concurrent
list_lru op that finds the still-set xarray entry either takes the dying
memcg's per-node lock (synchronizing with the drain) or sees LONG_MIN and
walks to the parent, where the items now live.
Link: https://lore.kernel.org/20260601161501.1444829-1-shakeel.butt@linux.dev Fixes: fb56fdf8b9a2 ("mm/list_lru: split the lock to per-cgroup scope") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reported-by: Chris Mason <clm@fb.com> Reviewed-by: Kairui Song <kasong@tencent.com> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Mon, 1 Jun 2026 08:30:44 +0000 (09:30 +0100)]
mm/huge_memory: use correct flags for device private PMD entry
Commit 65edfda6f3f2 ("mm/rmap: extend rmap and migration support
device-private entries") updated set_pmd_migration_entry() to use
pmdp_huge_get_and_clear() in the softleaf case, but made no further
adjustments to the function itself.
Therefore this function continues to incorrectly use pmd_write(),
pmd_soft_dirty() and pmd_uffd_wp() to determine whether the installed
migration entry should be marked writable, softdirty or uffd-wp
respectively.
Whilst all are incorrect, the most problematic of these is pmd_write(), as
this can lead to corrupted rmap state.
On x86-64 _PAGE_SWP_SOFT_DIRTY is aliased to _PAGE_RW. So calling
pmd_write() on a softleaf will return the softdirty state encoded in the
entry, assuming CONFIG_MEM_SOFT_DIRTY was enabled.
This was observed when running the hmm.hmm_device_private.anon_write_child
selftest:
1. The test faults in a range then migrates it such that a device-private
THP range is established.
2. The parent then migrates it to a device-private writable PMD entry whose
folio is entirely AnonExclusive with entire_mapcount=1, softdirty set
(accidentally correct write state).
3. The parent forks and the PMD entries are set to device-private read only
entries, entire_mapcount=2, softdirty still set.
4. [BUG] The child writes to the range then migrates to RAM - intending to
install non-writable migration entries - but replacing parent and child
PMD mappings with WRITABLE entries due to misinterpreting the softdirty
bit.
5. In remove_migration_pmd(), if !softleaf_is_migration_read(entry) we
set the RMAP_EXCLUSIVE flag when calling folio_add_anon_rmap_pmd() for
both parent and child, which are therefore AnonExclusive.
6. [SPLAT] Child sets migrated folio entire_mapcount=1, parent sets
entire_mapcount=2 and we end up with an AnonExclusive folio with
entire_mapcount=2! Assert fires in __folio_add_anon_rmap():
This patch fixes the issue by correctly referencing the softleaf entry
fields for writable, softdirty and uffd-wp in set_pmd_migration_entry().
It also only updates A/D flags if the entry is present as these are
otherwise not meaningful for a softleaf entry.
This patch also flips the if (!present) { ... } else { ... } logic in
set_pmd_migration_entry() so it is easier to understand, and adds some
comments to make things clearer.
I was able to bisect this to commit 775465fd26a3 ("lib/test_hmm: add zone
device private THP test infrastructure") which first exposes this bug as
it was the commit that permitted test_hmm to generate the test.
However commit 65edfda6f3f2 ("mm/rmap: extend rmap and migration support
device-private entries") is the commit that actually enabled this
behaviour.
Link: https://lore.kernel.org/20260601083044.57132-1-ljs@kernel.org Fixes: 65edfda6f3f2 ("mm/rmap: extend rmap and migration support device-private entries") Signed-off-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Balbir Singh <balbirs@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Oscar Salvador (SUSE) <osalvador@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Fri, 29 May 2026 00:01:03 +0000 (17:01 -0700)]
mm/damon/lru_sort: handle ctx allocation failure
DAMON_LRU_SORT allocates the damon_ctx object for its kdamond in its init
function. damon_lru_sort_enabled_store() wrongly assumes the allocation
will always succeed once tried. If the damon_ctx allocation was failed,
therefore, code execution reaches to damon_commit_ctx() while 'ctx' is
NULL. As a result, it dereferences the NULL 'ctx' pointer. Avoid the
NULL dereference by returning -ENOMEM if 'ctx' is NULL.
Link: https://lore.kernel.org/20260529000104.7006-3-sj@kernel.org Fixes: c4a8e662c839 ("mm/damon/lru_sort: use damon_initialized()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.18.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Fri, 29 May 2026 00:01:02 +0000 (17:01 -0700)]
mm/damon/reclaim: handle ctx allocation failure
Patch series "mm/damon/{reclaim,lru_sort}: handle ctx allocation failures".
DAMON_RECLAIM and DAMON_LRU_SORT could dereference NULL pointers if their
damon_ctx object allocations fail. The bugs are expected to happen
infrequently because the allocations are arguably too small to fail on
common setups. But theoretically they are possible and the consequences
are bad. Fix those.
The issues were discovered [1] by Sashiko.
This patch (of 2):
DAMON_RECLAIM allocates the damon_ctx object for its kdamond in its init
function. damon_reclaim_enabled_store() wrongly assumes the allocation
will always succeed once tried. If the damon_ctx allocation was failed,
therefore, code execution reaches to damon_commit_ctx() while 'ctx' is
NULL. As a result, it dereferences the NULL 'ctx' pointer. Avoid the
NULL dereference by returning -ENOMEM if 'ctx' is NULL.
Cunlong Li [Thu, 28 May 2026 02:48:44 +0000 (10:48 +0800)]
zram: fix use-after-free in zram_bvec_write_partial()
zram_read_page() picks the sync or async backing device read path based on
whether the parent bio is NULL. zram_bvec_write_partial() passes its
parent bio down, so for ZRAM_WB slots the read is dispatched
asynchronously and zram_read_page() returns 0 while the bio is still in
flight. The caller then runs memcpy_from_bvec(), zram_write_page() and
__free_page() on the buffer, leaving the async read to write into a freed
page.
zram_bvec_read_partial() was switched to NULL in commit 4e3c87b9421d
("zram: fix synchronous reads") for the same reason; the write_partial
counterpart was missed.
Link: https://lore.kernel.org/20260528-zram-v3-1-cab86eef8764@gmail.com Fixes: 8e654f8fbff5 ("zram: read page from backing device") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Cunlong Li <shenxiaogll@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: Yisheng Xie <xieyisheng1@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wang Yaxin [Wed, 27 May 2026 13:35:58 +0000 (21:35 +0800)]
tools headers UAPI: sync linux/taskstats.h for procacct.c
After commit 9b93f7e32774 ("tools/getdelays: use the static UAPI headers
from tools/include/uapi"), the Makefile was changed to use
-I../include/uapi/ instead of -I../../usr/include to ensure tools always
use the up-to-date UAPI headers.
However, only linux/taskstats.h was added to tools/include/uapi/ in commit e5bbb35a07b3 ("tools headers UAPI: sync linux/taskstats.h"), but
linux/acct.h was missing.
This causes procacct.c to fail to compile with:
procacct.c:234:37: error: 'AGROUP' undeclared (first use in this function)
gcc -I../include/uapi/ getdelays.c -o getdelays
gcc -I../include/uapi/ procacct.c -o procacct
procacct.c: In function `print_procacct':
procacct.c:234:37: error: `AGROUP' undeclared (first use in this function)
did you mean `NOGROUP'?
234 | , t->version >= 12 ? (t->ac_flag & AGROUP ? 'P' : 'T') : '?'
| ^~~~~~
| NOGROUP
procacct.c:234:37: note: each undeclared ident
because procacct.c uses the AGROUP macro defined in linux/acct.h.
Add the missing linux/acct.h to complete the static UAPI header set.
Link: https://lore.kernel.org/20260527213558929EhiHHy9EDTMjmg3uuDOMi@zte.com.cn Fixes: 9b93f7e32774 ("tools/getdelays: use the static UAPI headers from tools/include/uapi") Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn> Reviewed-by: Thomas Weißschuh <linux@weissschuh.net> Cc: Fan Yu <fan.yu9@zte.com.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kaitao Cheng [Fri, 22 May 2026 13:14:34 +0000 (21:14 +0800)]
mm/cma_sysfs: skip inactive CMA areas in sysfs
cma_activate_area() can fail after a CMA area has already been added to
cma_areas[]. In that case the area is left in the global array, but it
does not reach the point where CMA_ACTIVATED is set.
cma_sysfs_init() currently walks all cma_area_count entries and creates
sysfs files for every area, including ones that failed activation. These
areas are not usable CMA areas and should not be exposed to userspace as
valid CMA regions.
If such an inactive area is exposed, userspace sees a CMA directory whose
read-only accounting files report zeros. total_pages and available_pages
report zero because the failed activation path clears cma->count and
cma->available_count, while the allocation and release counters also stay
at zero because the area cannot service CMA allocations. This makes the
failed area look like a valid but empty CMA region and can mislead tests,
monitoring, and diagnostics.
Skip CMA areas that did not reach CMA_ACTIVATED when creating the sysfs
objects. Since inactive entries can now be skipped, make the error unwind
tolerate entries that never had cma_kobj initialized.
Link: https://lore.kernel.org/20260524140420.61864-1-kaitao.cheng@linux.dev Link: https://lore.kernel.org/20260522131434.78532-1-kaitao.cheng@linux.dev Fixes: 43ca106fa8ec ("mm: cma: support sysfs") Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn> Reported-by: David Hildenbrand (Arm) <david@kernel.org> Suggested-by: David Hildenbrand (Arm) <david@kernel.org> Suggested-by: Muchun Song <songmuchun@bytedance.com> Reported-by: Muchun Song <songmuchun@bytedance.com> Closes: https://lore.kernel.org/linux-mm/55481a8b-dcfc-4bef-ba59-aa0b43dca88b@kernel.org/ Acked-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Dmitry Osipenko <digetx@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ipc/shm: serialize orphan cleanup with shm_nattch updates
shm_destroy_orphaned() walks the shm idr under shm_ids(ns).rwsem, but that
does not serialize all fields tested by shm_may_destroy(). In particular,
shm_nattch is updated while holding shm_perm.lock, and attach paths can do
that without holding the rwsem.
Do not decide that an orphaned segment is unused before taking the object
lock. Move the shm_may_destroy() check under shm_perm.lock, matching the
other destroy paths, and unlock the segment when it no longer qualifies
for removal.
Link: https://lore.kernel.org/9d97cc1031de2d0bace0edf3a668818aa2f4eca6.1777410234.git.zylzyl2333@gmail.com Fixes: 4c677e2eefdb ("shm: optimize locking and ipc_namespace getting") Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Yilin Zhu <zylzyl2333@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Cc: Christian Brauner <brauner@kernel.org> Cc: Jeongjun Park <aha310510@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Serge Hallyn <sergeh@kernel.org> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Serge Hallyn <serge@hallyn.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tianyang Zhang [Wed, 13 May 2026 01:28:35 +0000 (09:28 +0800)]
irqchip/loongarch-ir: Add IR (interrupt redirection) irqchip support
The main function of the redirect interrupt controller is to manage
the redirected-interrupt table, which consists of many redirected entries.
When MSI interrupts are requested, the driver creates a corresponding
redirected entry that describes the target CPU/vector number and the
operating mode of the interrupt. The redirected interrupt module has an
independent cache, and during the interrupt routing process, it will
prioritize the redirected entries that hit the cache. The irqchip driver
can invalidate certain entry caches via a command queue.
Co-developed-by: Liupu Wang <wangliupu@loongson.cn> Signed-off-by: Liupu Wang <wangliupu@loongson.cn> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Acked-by: Huacai Chen <chenhuacai@loongson.cn> Link: https://patch.msgid.link/20260513012839.2856463-5-zhangtianyang@loongson.cn
Tianyang Zhang [Wed, 13 May 2026 01:28:34 +0000 (09:28 +0800)]
irqchip/loongarch-avec: Return IRQ_SET_MASK_OK_DONE when keep affinity
Interrupt redirection support requires a new redirect domain, which will
appear as a child domain of avecintc domain. For each interrupt source,
avecintc domain only provides the CPU/interrupt vectors, while redirect
domain provides other operations to synchronize the interrupt affinity
information among multiple cores.
When modifying the affinity of an interrupt associated with the redirect
domain, if the avecintc domain detects that the actual interrupt affinity
hasn't been changed, then the redirect domain doesn't need to perform any
operations.
To achieve the above purpose, in avecintc_set_affinity() when the current
affinity remains valid, then return value is set to IRQ_SET_MASK_OK_DONE.
This doesn't introduce any compatibility issues, even if the new return
value causing msi_domain_set_affinity() to no longer perform the call to
irq_chip_write_msi_msg():
1) When both avecintc and redirect exist in the system, the msg_address
and msg_data no longer change after the allocation phase, so it does
not actually require updating the MSI message info.
2) When only avecintc exists in the system, the irq_domain_activate_irq()
interface will be responsible for the initial configuration of the MSI
message info, which is unconditional. After that, if unnecessary,
there is no modification to the MSI message info.
Tianyang Zhang [Wed, 13 May 2026 01:28:32 +0000 (09:28 +0800)]
Docs/LoongArch: Add advanced extended IRQ model
Introduce a new advanced extended interrupt model with redirect interrupt
controllers. When the redirect interrupt controller is enabled, the routing
target of MSI interrupts is no longer a specific CPU and vector number, but
a specific redirect entry. The actual CPU and vector number used are
described by the redirect entry.
Davidlohr Bueso [Thu, 7 May 2026 11:29:13 +0000 (04:29 -0700)]
locking/rtmutex: Skip remove_waiter() when waiter is not enqueued
syzbot triggered the following splat in remove_waiter() via
FUTEX_CMP_REQUEUE_PI:
KASAN: null-ptr-deref in range [0x0000000000000a88-0x0000000000000a8f]
class_raw_spinlock_constructor
remove_waiter+0x159/0x1200 kernel/locking/rtmutex.c:1561
rt_mutex_start_proxy_lock+0x103/0x120
futex_requeue+0x10e4/0x20d0
__x64_sys_futex+0x34f/0x4d0
task_blocks_on_rt_mutex() does not arm the waiter upon deadlock detection,
leaving waiter->task nil, where 3bfdc63936dd ("rtmutex: Use waiter::task instead
of current in remove_waiter()") made this fatal.
Furthermore, rt_mutex_start_proxy_lock() should not be calling into remove_waiter()
upon a successfully grabbing the rtmutex. 1a1fb985f2e2 ("futex: Handle early deadlock
return correctly"), moved the remove_waiter() out of __rt_mutex_start_proxy_lock()
(where 'ret' was only ever 0 or < 0) into the wrapper. Tighten this check to
account for try_to_take_rt_mutex().
Fixes: 3bfdc63936dd ("rtmutex: Use waiter::task instead of current in remove_waiter()") Reported-by: syzbot+78147abe6c524f183ee9@syzkaller.appspotmail.com Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Closes: https://lore.kernel.org/all/69f114ac.050a0220.ac8b.0003.GAE@google.com/ Link: https://patch.msgid.link/20260507112913.1019537-1-dave@stgolabs.net
Priya Hosur [Thu, 7 May 2026 08:01:37 +0000 (13:31 +0530)]
drm/amd/pm: smu_v14_0_0: use SoftMin for gfxclk in set_soft_freq_limited_range
In smu_v14_0_0_set_soft_freq_limited_range(), the gfxclk floor is
programmed via SetHardMinGfxClk together with SetSoftMaxGfxClk. Under
power_dpm_force_performance_level=high this pins HardMin to peak gfxclk.
In PMFW arbitration HardMin has higher priority than SoftMax, so the
firmware thermal/PPT throttler cannot clamp gfxclk via SoftMax once
HardMin is set to peak. Replace SetHardMinGfxClk with SetSoftMinGfxclk
so the driver still requests peak performance but the firmware
throttler retains the ability to clamp gfxclk under thermal/PPT
pressure. SoftMax handling is unchanged and no other clock domains
are affected.
Signed-off-by: Priya Hosur <Priya.Hosur@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 3ea273267fd29cbf6d83ee72329f59eb5042605b) Cc: stable@vger.kernel.org
Donet Tom [Wed, 27 May 2026 13:19:31 +0000 (18:49 +0530)]
drm/amdgpu: Fix incorrect VRAM GART mappings on non-4K page size systems
When mapping VRAM pages into the GART page table,
amdgpu_gart_map_vram_range() assumes that the system page size is the
same as the GPU page size.
On systems with non-4K page sizes, multiple GPU pages can exist within
a single CPU page. As a result, the mappings are created incorrectly
because fewer page table entries are programmed than required.
Fix this by programming the mappings correctly for non-4K page size
systems.
Fixes: 237d623ae659 ("drm/amdgpu/gart: Add helper to bind VRAM pages (v2)") Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Donet Tom <donettom@linux.ibm.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit a8f0bc22388f74e0cf4ed8b7d1846c580eaf44cc) Cc: stable@vger.kernel.org
Sunil Khatri [Mon, 25 May 2026 04:26:23 +0000 (09:56 +0530)]
drm/amdgpu/userq: move wptr_obj cleanup in mqd_destroy
In case when queue_create fails and mqd has already been
allocated and hence wptr_obj is not cleaned up.
So moving that cleanup part to mqd_destroy so it takes
care of all the cases of clean up and during tear down of
the queue.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 43355f62cd2ef5386c2693df537c232ea0f2ce6c)
Prike Liang [Tue, 26 May 2026 02:25:26 +0000 (10:25 +0800)]
drm/amdgpu: improve the userq seq BO free bit lookup
Use find_next_zero_bit() to locate the next free seq slot bit
instead of the current walk, for more efficient bitmap scanning.
Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit ff905a9b6228de9eedd0db71ecb1bdde91fb898d)
Sunil Khatri [Mon, 25 May 2026 07:48:00 +0000 (13:18 +0530)]
drm/amdgpu/userq: remove the vital queue unmap logging
Mesa userqueues free does not wait for the free to complete and go ahead
in unmapping the vital bos while kernel is still in queue free and
corresponding cleanup.
So ideally we don't need the logging for that and hence remove the warn
message as this is expected behaviour and functionally, we are making
sure to wait for the required fences before unmap.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 758a868043dcb07eca923bc451c16da3e73dc47c)
Andrew Martin [Thu, 28 May 2026 16:54:39 +0000 (12:54 -0400)]
drm/amdkfd: Fix buffer overflow in SDMA queue checkpoint/restore on GFX11
The v11 MQD manager incorrectly assigned the CP-compute variants of
checkpoint_mqd/restore_mqd for KFD_MQD_TYPE_SDMA queues. These functions
use sizeof(struct v11_compute_mqd) (2048 bytes) instead of sizeof(struct
v11_sdma_mqd) (512 bytes), causing a 1536-byte overflow.
During CRIU checkpoint of an SDMA queue on Navi3x:
- checkpoint_mqd() reads 2048 bytes from a 512-byte SDMA MQD buffer,
leaking 1536 bytes of adjacent GTT memory to userspace
During CRIU restore:
- restore_mqd() writes 2048 bytes into a 512-byte SDMA MQD buffer,
corrupting 1536 bytes of adjacent GTT memory (often the ring buffer
or neighboring MQDs)
This is a copy-paste regression unique to v11. All other ASIC backends
(cik, vi, v9, v10, v12) correctly use the SDMA-specific variants.
Add checkpoint_mqd_sdma() and restore_mqd_sdma() functions that properly
handle the smaller v11_sdma_mqd structure, matching the pattern used in
other MQD managers.
Fixes: cc009e613de6 ("drm/amdkfd: Add KFD support for soc21 v3") Assisted-by: Claude:Sonnet 4-5 Signed-off-by: Andrew Martin <andrew.martin@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 6fa41db7ffdec97d62433adf03b7b9b759af8c2c) Cc: stable@vger.kernel.org
Muhammad Bilal [Sat, 23 May 2026 16:56:46 +0000 (16:56 +0000)]
drm/amdkfd: fix NULL dereference in get_queue_ids()
When usr_queue_id_array is NULL and num_queues is non-zero,
get_queue_ids() returns NULL. The callers check only IS_ERR() on the
return value; since IS_ERR(NULL) == false the check passes, and
suspend_queues() calls q_array_invalidate() which immediately
dereferences NULL while iterating num_queues times.
Userspace can trigger this via kfd_ioctl_set_debug_trap() by supplying
num_queues > 0 with a zero queue_array_ptr, causing a kernel panic.
A NULL usr_queue_id_array with num_queues == 0 is a legitimate no-op
(q_array_invalidate never executes, and resume_queues already guards
all queue_ids dereferences behind a NULL check). Return ERR_PTR(-EINVAL)
only when num_queues is non-zero and the pointer is absent; both callers
already propagate IS_ERR() returns correctly to userspace.
Fixes: a70a93fa568b ("drm/amdkfd: add debug suspend and resume process queues operation") Signed-off-by: Muhammad Bilal <meatuni001@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit f165a82cdf503884bb1797771c61b2fcc72113d4) Cc: stable@vger.kernel.org
Vitaly Prosyak [Fri, 29 May 2026 17:50:38 +0000 (13:50 -0400)]
drm/amdgpu: set noretry=1 as default for GFX 10.1.x (Navi10/12/14)
Problem:
While developing the amd_close_race IGT test (which intentionally triggers
execute permission faults by removing VM_PAGE_EXECUTABLE from GPU page table
entries), we discovered that on Navi10 (GFX 10.1.x) these faults produce
zero diagnostic output. The GPU simply hangs silently for ~10s until the
scheduler timeout fires. There is no way to distinguish an execute
permission fault from any other type of GPU hang.
Root cause:
GFX 10.1.x defaults to noretry=0, which sets
RETRY_PERMISSION_OR_INVALID_PAGE_FAULT=1 in the GFXHUB UTCL2 registers
(gfxhub_v2_0.c line 313). With this bit set, permission faults (valid PTE,
wrong R/W/X bits) are handled entirely within the UTCL1/UTCL2 hardware
loop: UTCL2 returns an XNACK to UTCL1, and UTCL1 re-requests the
translation indefinitely, expecting software to eventually fix the
permission bits (as happens in SVM/HMM recovery). No interrupt of any kind
reaches the IH ring.
This is different from invalid-page faults (V=0) which DO generate a retry
fault interrupt that the driver can escalate to a no-retry fault. Permission
faults with valid PTEs loop silently forever in hardware.
GFX 10.3+ already defaults to noretry=1, which makes permission faults
generate immediate L2 protection fault interrupts. GFX 10.1.x was
inadvertently left out of this default.
Fix:
Change the noretry=1 threshold from IP_VERSION(10, 3, 0) to
IP_VERSION(10, 1, 0) in amdgpu_gmc_noretry_set(). This is a one-line
change that aligns GFX 10.1.x behavior with GFX 10.3+ and all newer
generations.
With noretry=1, the existing non-retry fault handler
(gmc_v10_0_process_interrupt) already decodes and prints the full
GCVM_L2_PROTECTION_FAULT_STATUS register including PERMISSION_FAULTS,
faulting address, VMID, PASID, and process name. No additional logging
code is needed — the fix is purely routing permission faults to the
existing, fully-capable non-retry interrupt handler.
v2: Dropped GFX10-specific logging from gmc_v10_0.c and
kfd_int_process_v10.c (Felix Kuehling). v1 added logging in the retry
fault handler, but with noretry=1 permission faults take the non-retry
path — the v1 retry handler code was dead and would never execute.
Tested on Navi10 (GFX 10.1.10):
- Execute permission faults now produce immediate, clear output:
[gfxhub] page fault (src_id:0 ring:64 vmid:4 pasid:592)
Process amd_close_race pid 13380 thread amd_close_race pid 13384
in page at address 0x40001000 from client 0x1b (UTCL2)
GCVM_L2_PROTECTION_FAULT_STATUS:0x00700881
PERMISSION_FAULTS: 0x8
- No regressions with properly-mapped GPU workloads
Cc: Christian Koenig <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit eb21edd24c40d81066753f8ac6f23bce15745395) Cc: stable@vger.kernel.org
Timur Kristóf [Mon, 25 May 2026 11:45:02 +0000 (13:45 +0200)]
drm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed
When the fault stop mode isn't AMDGPU_VM_FAULT_STOP_ALWAYS,
these bits should be programmed to 0.
Program CRASH_ON_NO_RETRY_FAULT and CRASH_ON_RETRY_FAULT
always, to make sure to clear the bits when we don't want
to crash.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit d0cd99e73090700b7a942b98a3327ec966597d0a)
Yang Wang [Mon, 11 May 2026 08:33:37 +0000 (16:33 +0800)]
drm/amd/pm: zero unused SMU argument registers
SMU messages may use fewer arguments than the available argument registers,
the previous code only wrote used registers and left the rest unchanged,
so stale values from a prior message could persist.
Write all argument registers for each message and zero the unused tail
to keep command arguments deterministic and avoid unintended carry-over.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit e03b635f61f77ebd5107ef82f48e3221cb695856)
Yang Wang [Fri, 29 May 2026 03:47:31 +0000 (11:47 +0800)]
drm/amd/pm: mark metrics.energy_accumulator is invalid for smu 14.0.2
EnergyAccumulator is unsupported on SMU 14.0.2, mark it invalid.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 646b05043eeed04b51c14aad22a400a8250af4b7) Cc: stable@vger.kernel.org
Yang Wang [Tue, 19 May 2026 03:18:12 +0000 (11:18 +0800)]
drm/amd/pm: fix smu13 power limit default/cap calculation
smu_v13_0_0_get_power_limit() and smu_v13_0_7_get_power_limit() mix
runtime power_limit with PP table limits when reporting default/min/max.
When current power limit query succeeds, default_power_limit was set to the
runtime value instead of the PP table default, and min/max could be derived
from inconsistent bases (MsgLimits/runtime), leading to incorrect cap info.
Use SocketPowerLimitAc/Dc as the PP default base (pp_limit), keep
current_power_limit as runtime value, and derive min/max from pp_limit with
OD percentages.
Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/5227 Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Kenneth Feng <kenneth.feng@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 1eaf26db95901ca70737503a89b831dd763c8453) Cc: stable@vger.kernel.org
Yang Wang [Thu, 21 May 2026 14:36:37 +0000 (22:36 +0800)]
drm/amd/pm: apply SMU 13.0.10 workaround during MP1 unload
On SMU v13.0.10, sending PrepareMp1ForUnload with the default
parameter may leave the device in an inaccessible state. This can
affect runtime power management and partial PnP flows.
e.g: kexec, driver unload, boco/d3cold.
Pass the required workaround parameter 0x55, when preparing MP1 for
unload on SMU v13.0.10, keep the existing behavior for other SMU
versions.
Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/5133 Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Kenneth Feng <kenneth.feng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 4e8ee1afeedb8d24dd22cdd5ae9f98a6d76ebe4b) Cc: stable@vger.kernel.org