git.ipfire.org Git - thirdparty/linux.git/log

Merge patch series "eventpoll: Fix epoll_wait() report false negative"

Nam Cao <namcao@linutronix.de> says:

While staring at epoll, I noticed ep_events_available() looks wrong. I
wrote a small program to confirm, and yes it is definitely wrong.

This series adds a reproducer to kselftest, and fix the bug.

* patches from https://patch.msgid.link/cover.1780422137.git.namcao@linutronix.de:
eventpoll: Fix epoll_wait() report false negative
selftests/eventpoll: Add test for multiple waiters

Link: https://patch.msgid.link/cover.1780422137.git.namcao@linutronix.de
Signed-off-by: Christian Brauner <brauner@kernel.org>

eventpoll: Fix epoll_wait() report false negative

ep_events_available() checks for available events by looking at
ep->rdllist and ep_is_scanning(). However, this is done without a lock
and can report false negative if ep_start_scan() or ep_done_scan() are
executed by another task concurrently. For example:
_________________________________________________________________________
                                   |ep_start_scan()
                                   |  list_splice_init(&ep->rdllist, ...)
ep_events_available()              |
  !list_empty_careful(&ep->rdllist)|
  || ep_is_scanning(ep)            |
                           |  ep_enter_scan(ep)
___________________________________|_____________________________________

Another example:
_________________________________________________________________________
ep_events_available()              |
                                   |ep_start_scan()
                                   |  list_splice_init(&ep->rdllist, ...)
                           |  ep_enter_scan(ep)
  !list_empty_careful(&ep->rdllist)|
                                   |ep_done_scan()
                                   |  ep_exit_scan(ep)
                                   |  list_splice(..., &ep->rdllist)
  || ep_is_scanning(ep)            |
___________________________________|_____________________________________

In the above examples, ep_events_available() sees no event despite
events being available. In case epoll_wait() is called with timeout=0,
epoll_wait() will wrongly return "no event" to user.

Introduce a sequence lock to resolve this issue.

Measuring the time consumption of 10 million loop iterations doing
epoll_wait(), the following performance drop is observed:

   timeout  #event  before    after    diff
     0ms      0     3727ms   3974ms   +6.6%
     0ms      1     8099ms   9134ms    +13%
     1ms      1    13525ms  13586ms  +0.45%

Considering the use case of epoll_wait() (wait for events, do something
with the events, repeat), it should only contribute to a small portion of
user's CPU consumption. Therefore this performance drop is not alarming.

Fixes: c5a282e9635e ("fs/epoll: reduce the scope of wq lock in epoll_wait()")
Suggested-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Link: https://patch.msgid.link/4363cd8e34a21d4f0d257be1b33e84dc25030fdf.1780422138.git.namcao@linutronix.de
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

selftests/eventpoll: Add test for multiple waiters

Add a test whichs creates 64 threads who all epoll_wait() on the same
eventpoll. The source eventfd is written but never read, therefore all the
threads should always see an EPOLLIN event.

This test fails because of a kernel bug, which will be fixed by a follow-up
commit.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Link: https://patch.msgid.link/b11947013563875c046c0b0959c29fd95eeebd34.1780422138.git.namcao@linutronix.de
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

ALSA: seq: oss: Use scoped cleanup for temporary MIDI use lock

The OSS sequencer write and out-of-band paths may receive a temporary
snd_use_lock_t reference from snd_seq_oss_process_event(). This was added
to keep MIDI device data alive until events with embedded SysEx data are
dispatched.

Use a scoped cleanup helper for that temporary reference. This keeps the
lifetime rule local to the variable declaration and avoids future missing
snd_use_lock_free() paths if these event handling paths gain more exits.

No functional change is intended.

Signed-off-by: Cássio Gabriel <cassiogabrielcontato@gmail.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/20260604-alsa-scoped-cleanups-v1-3-10c43152a728@gmail.com

ALSA: core: Add scoped cleanup helper for card references

Several ALSA paths acquire temporary card references with snd_card_ref()
and release them manually with snd_card_unref(). control_led.c already
defines a local cleanup helper for this pattern, while other core paths
still open-code the release.

Move the helper to the common ALSA core header and use it in control-layer
card-reference paths. This makes the ownership rule explicit and avoids
future missing-unref mistakes when adding early exits.

No functional change is intended.

Signed-off-by: Cássio Gabriel <cassiogabrielcontato@gmail.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/20260604-alsa-scoped-cleanups-v1-2-10c43152a728@gmail.com

ALSA: control: Use scoped cleanup for user control buffers

User-defined control TLV data and enum names are copied from user space
with vmemdup_user() before being installed in the user_element. Until
ownership is transferred, these temporary buffers have to be released on
every validation exit.

Use __free(kvfree) for the temporary buffers and no_free_ptr() when
ownership is transferred to the user_element. This removes the manual
kvfree() calls from the unchanged-TLV and enum-name validation paths,
makes the ownership hand-off explicit, and keeps the existing allocation
accounting and ABI unchanged.

No functional change is intended.

Signed-off-by: Cássio Gabriel <cassiogabrielcontato@gmail.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/20260604-alsa-scoped-cleanups-v1-1-10c43152a728@gmail.com

nvme-tcp: lockdep: use dynamic lockdep keys per socket instance

When NVMe-TCP controller setup and teardown are repeated with lockdep
enabled, lockdep reports false positives WARN for the following locks:

  1) &q->elevator_lock        : IO scheduler change context
  2) &q->q_usage_counter(io)  : SCSI disk probe context
  3) fs_reclaim               : CPU hotplug bring-up context
  4) cpu_hotplug_lock         : socket establishment context
  5) sk_lock-AF_INET-NVME     : MQ sched dispatch context for the socket
  6) set->srcu                : NVMe controller delete context

The lockdep WARN was observed by running blktests test case nvme/005 for
tcp transport on v7.1-rc1 kernel with a patch. Refer to the Link tag for
the details of the WARN.

This is a false positive because lockdep confuses lock 4) (socket
establishment) with lock 5) (socket in use) for different socket
instances. The locks belong to different sockets, but lockdep treats
them as the same due to shared static lockdep keys.

Fix this by using dynamically allocated lockdep keys per socket instance
instead of static keys nvme_tcp_sk_key[] and nvme_tcp_slock_key[]. Add
nvme_tcp_sk_key and nvme_tcp_slock_key fields to struct nvme_tcp_queue
and pass them to sock_lock_init_class_and_name() for proper lockdep
tracking. Change the argument of nvme_tcp_reclassify_socket() from
'struct socket *' to 'struct nvme_tcp_queue *' to pass both the socket
and the keys. Add CONFIG_DEBUG_LOCK_ALLOC guards to nvme_tcp_alloc_queue()
and nvme_tcp_free_queue() to register and unregister the dynamic keys.
Additionally, move nvme_tcp_reclassify_socket() inside these guards since
it's only needed when lockdep is enabled.

Link: https://lore.kernel.org/linux-nvme/afB5syZbUrppgsDQ@shinmob/
Suggested-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

Merge patch series "mm: improve write performance with RWF_DONTCACHE"

Jeff Layton <jlayton@kernel.org> says:

This patch series is intended to improve write performance with
RWF_DONTCACHE. This version fixes additional stat accounting issues
found during review: integer promotion on 32-bit, cgroup writeback
domain migration, folio split flag preservation, and a UAF that could
occur in filemap_dontcache_kick_writeback().

* patches from https://patch.msgid.link/20260511-dontcache-v7-0-2848ddce8090@kernel.org:
  mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
  mm: track DONTCACHE dirty pages per bdi_writeback
  mm: preserve PG_dropbehind flag during folio split

Link: https://patch.msgid.link/20260511-dontcache-v7-0-2848ddce8090@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>

ALSA: hda/realtek: Add quirk for ASUS VivoBook X509DAP

The internal microphone on ASUS VivoBook X509DAP (subsystem ID
0x1043:0x197e) is not detected without a quirk entry. Add
ALC256_FIXUP_ASUS_MIC_NO_PRESENCE to fix the issue.

Signed-off-by: Andrei Faleichyk <andrei.faleichyk@noogadev.com>
Link: https://patch.msgid.link/20260603213313.6298-1-andrei.faleichyk@noogadev.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>

mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking

The IOCB_DONTCACHE writeback path in generic_write_sync() calls
filemap_flush_range() on every write, submitting writeback inline in
the writer's context.  Perf lock contention profiling shows the
performance problem is not lock contention but the writeback submission
work itself — walking the page tree and submitting I/O blocks the writer
for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms
(dontcache).

Replace the inline filemap_flush_range() call with a flusher kick that
drains dirty pages in the background.  This moves writeback submission
completely off the writer's hot path.

To avoid flushing unrelated buffered dirty data, add a dedicated
WB_start_dontcache bit and wb_check_start_dontcache() handler that uses
the per-wb WB_DONTCACHE_DIRTY counter to determine how many pages to
write back.  The flusher writes back that many pages from the oldest dirty
inodes (not restricted to dontcache-specific inodes). This helps
preserve I/O batching while limiting the scope of expedited writeback.

Like WB_start_all, the WB_start_dontcache bit coalesces multiple
DONTCACHE writes into a single flusher wakeup without per-write
allocations.  Use test_and_clear_bit to atomically consume the kick
request before reading the dirty counter and starting writeback, so that
concurrent DONTCACHE writes during writeback can re-set the bit and
schedule a follow-up flusher run.

Read the dirty counter with wb_stat_sum() (aggregating per-CPU batches)
rather than wb_stat() (which reads only the global counter) to ensure
small writes below the percpu batch threshold are visible to the flusher.

In filemap_dontcache_kick_writeback(), set the WB_start_dontcache bit
inside the unlocked_inode_to_wb_begin/end section for correct cgroup
writeback domain targeting, but defer the wb_wakeup() call until after
the section ends, since wb_wakeup() uses spin_unlock_irq() which would
unconditionally re-enable interrupts while the i_pages xa_lock may still
be held under irqsave during a cgroup writeback switch. Pin the wb with
wb_get() inside the RCU critical section before calling wb_wakeup()
outside it, since cgroup bdi_writeback structures are RCU-freed and the
wb pointer could become invalid after unlocked_inode_to_wb_end() drops
the RCU read lock.

Also add WB_REASON_DONTCACHE as a new writeback reason for tracing
visibility.

dontcache-bench results (same host, T6F_SKL_1920GBF, 251 GiB RAM,
xfs on NVMe, fio io_uring):

Buffered and direct I/O paths are unaffected by this patchset. All
improvements are confined to the dontcache path:

Single-stream throughput (MB/s):
                        Before    After    Change
  seq-write/dontcache      298      897    +201%
  rand-write/dontcache     131      236     +80%

Tail latency improvements (seq-write/dontcache):
  p99:    135,266 us  ->  23,986 us   (-82%)
  p99.9: 8,925,479 us ->  28,443 us   (-99.7%)

Multi-writer (4 jobs, sequential write):
                                Before    After    Change
  dontcache aggregate (MB/s)     2,529    4,532     +79%
  dontcache p99 (us)             8,553    1,002     -88%
  dontcache p99.9 (us)         109,314    1,057     -99%

  Dontcache multi-writer throughput now matches buffered (4,532 vs
  4,616 MB/s).

32-file write (Axboe test):
                                Before    After    Change
  dontcache aggregate (MB/s)     1,548    3,499    +126%
  dontcache p99 (us)            10,170      602     -94%
  Peak dirty pages (MB)          1,837      213     -88%

  Dontcache now reaches 81% of buffered throughput (was 35%).

Competing writers (dontcache vs buffered, separate files):
                                Before    After
  buffered writer                  868      433 MB/s
  dontcache writer                 415      433 MB/s
  Aggregate                      1,284      866 MB/s

  Previously the buffered writer starved the dontcache writer 2:1.
  With per-bdi_writeback tracking, both writers now receive equal
  bandwidth. The aggregate matches the buffered-vs-buffered baseline
  (863 MB/s), indicating fair sharing regardless of I/O mode.

  The dontcache writer's p99.9 latency collapsed from 119 ms to
  33 ms (-73%), eliminating the severe periodic stalls seen in the
  baseline. Both writers now share identical latency profiles,
  matching the buffered-vs-buffered pattern.

The per-bdi_writeback dirty tracking dramatically reduces peak dirty
pages in dontcache workloads, with the 32-file test dropping from
1.8 GB to 213 MB. Dontcache sequential write throughput triples and
multi-writer throughput reaches parity with buffered I/O, with tail
latencies collapsing by 1-2 orders of magnitude.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260511-dontcache-v7-3-2848ddce8090@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

mm: track DONTCACHE dirty pages per bdi_writeback

Add a per-wb WB_DONTCACHE_DIRTY counter that tracks the number of dirty
pages with the dropbehind flag set (i.e., pages dirtied via RWF_DONTCACHE
writes).

Increment the counter alongside WB_RECLAIMABLE in folio_account_dirtied()
when the folio has the dropbehind flag set, and decrement it in
folio_clear_dirty_for_io() and folio_account_cleaned(). Also decrement it
when a non-DONTCACHE lookup atomically clears the dropbehind flag on a
dirty folio in __filemap_get_folio_mpol(), using folio_test_clear_dropbehind()
to prevent concurrent lookups from double-decrementing the counter, and
guarding the decrement with mapping_can_writeback() to match the increment
path.

Transfer the counter alongside WB_RECLAIMABLE in inode_do_switch_wbs() so
that the stat is properly migrated when an inode switches cgroup writeback
domains.

The counter will be used by the writeback flusher to determine how many
pages to write back when expediting writeback for IOCB_DONTCACHE writes,
without flushing the entire BDI's dirty pages.

Suggested-by: Jan Kara <jack@suse.cz>
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260511-dontcache-v7-2-2848ddce8090@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

mm: preserve PG_dropbehind flag during folio split

__split_folio_to_order() copies page flags from the original folio to
newly created sub-folios using an explicit allowlist, but PG_dropbehind
is not included. When a large folio with PG_dropbehind set is split,
only the head sub-folio retains the flag; all tail sub-folios silently
lose it and will not be reclaimed eagerly after writeback completes.

Add PG_dropbehind to the flag copy mask so that the drop-behind hint
is preserved across folio splits.

Fixes: a323281cdfec ("mm: add PG_dropbehind folio flag")
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260511-dontcache-v7-1-2848ddce8090@kernel.org
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

ALSA: usb-audio: qcom: Use PAGE_ALIGN macro for buffer size calculation

Use the kernel's PAGE_ALIGN() macro instead of open-coding the page
alignment calculation. This improves code readability and follows
kernel coding style.

The manual calculation:
  mult = len / PAGE_SIZE;
  remainder = len % PAGE_SIZE;
  len = mult * PAGE_SIZE;
  len += remainder ? PAGE_SIZE : 0;

is equivalent to:
  len = PAGE_ALIGN(len);

Signed-off-by: wangdicheng <wangdicheng@kylinos.cn>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/20260603091102.231370-4-wangdich9700@163.com

ALSA: usb-audio: qcom: Fix return value in qc_usb_audio_offload_fill_avail_pcms

The function qc_usb_audio_offload_fill_avail_pcms() always returns -1
regardless of how many PCM devices were successfully filled. This makes
it impossible for callers to know the actual number of available PCMs.

Return the actual count of filled PCM devices instead, which allows
callers to verify that all expected PCMs were properly enumerated.

Signed-off-by: wangdicheng <wangdicheng@kylinos.cn>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/20260603091102.231370-3-wangdich9700@163.com

ALSA: usb-audio: qcom: Use snprintf for mixer control name formatting

The current code uses sprintf() to format control names without bounds
checking, which could lead to buffer overflow if PCM index is large.
Replace sprintf with snprintf to ensure buffer safety.

The ctl_name buffer is 48 bytes, and the formatted string could exceed
this with large PCM index values. Using snprintf with sizeof(ctl_name)
prevents potential buffer overflow.

Signed-off-by: wangdicheng <wangdicheng@kylinos.cn>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/20260603091102.231370-2-wangdich9700@163.com

ALSA: usb-audio: qcom: Improve error logging in USB offload

Add error codes to error messages for better debugging.
This helps identify the root cause when USB audio offload fails.

Error messages now include the actual error code returned by
xhci_sideband operations, making it easier to diagnose failures
during USB audio offload setup.

Signed-off-by: wangdicheng <wangdicheng@kylinos.cn>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/20260603091102.231370-1-wangdich9700@163.com

ALSA: hda/realtek: ALC882: Fixup for Clevo P775TM1

Clevo P775TM1 laptops come with an ESS Sabre HiFi DAC. Setting
0x1b pin VREF to 80% enables said DAC output.

Signed-off-by: Evelyn Ali <evelynali99@gmail.com>
Link: https://patch.msgid.link/20260602214122.78020-1-evelynali99@gmail.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>

Merge patch series "libfs: set SB_I_NOEXEC and SB_I_NODEV in init_pseudo()"

John Hubbard <jhubbard@nvidia.com> says:

This began as a one-line dma-buf fix for a path_noexec() warning added
by commit 1e7ab6f67824 ("anon_inode: rework assertions"). Christoph
pointed out that the fix belongs higher up: a pseudo filesystem has no
reason not to set SB_I_NOEXEC by default. This series does that.

  * Patch 1 sets both flags in init_pseudo(), so every pseudo
    filesystem gets them. This is the only patch that changes a flag,
    and the only one with Fixes:/Cc: stable.

  * Patch 2 drops the assignments that are now redundant in the callers
    that set them by hand.

Most callers already set one or both flags. I audited every
init_pseudo() caller. Here is what patch 1 actually changes for each.
The only visible effect is on dma-buf, where SB_I_NOEXEC silences the
warning. SB_I_NODEV is never consulted on these SB_NOUSER mounts, and
none of the callers that gain SB_I_NOEXEC are executed from.

  caller                       had        patch 1 adds
  ---------------------------  --------   --------------
  fs/anon_inodes.c             both       nothing new
  mm/secretmem.c               both       nothing new
  virt/kvm/guest_memfd.c       both       nothing new
  fs/nsfs.c                    both       nothing new
  fs/pidfs.c                   both       nothing new
  fs/aio.c                     NOEXEC     NODEV
  drivers/dma-buf/dma-buf.c    neither    NOEXEC + NODEV
  net/socket.c                 neither    NOEXEC + NODEV
  fs/pipe.c                    neither    NOEXEC + NODEV
  kernel/resource.c            neither    NOEXEC + NODEV
  fs/erofs/super.c             neither    NOEXEC + NODEV
  fs/btrfs/tests/...           neither    NOEXEC + NODEV
  drivers/vfio/vfio_main.c     neither    NOEXEC + NODEV
  drivers/gpu/drm/drm_drv.c    neither    NOEXEC + NODEV
  drivers/dax/super.c          neither    NOEXEC + NODEV
  block/bdev.c                 neither    NOEXEC + NODEV

* patches from https://patch.msgid.link/20260604025315.245910-1-jhubbard@nvidia.com:
  libfs: drop redundant SB_I_NOEXEC/SB_I_NODEV in init_pseudo() callers
  libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()

Link: https://patch.msgid.link/20260604025315.245910-1-jhubbard@nvidia.com
Signed-off-by: Christian Brauner <brauner@kernel.org>

libfs: drop redundant SB_I_NOEXEC/SB_I_NODEV in init_pseudo() callers

init_pseudo() now sets SB_I_NOEXEC and SB_I_NODEV by default, so the
per-caller assignments are redundant. Drop them.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Link: https://patch.msgid.link/20260604025315.245910-3-jhubbard@nvidia.com
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()

Since commit 1e7ab6f67824 ("anon_inode: rework assertions"),
path_noexec() warns when an anonymous-inode file is mmap'd from a
superblock that has not set SB_I_NOEXEC. dma-buf backs its files this
way and never set the flag, so mmap of any exported buffer trips the
warning on a CONFIG_DEBUG_VFS=y kernel:

  WARNING: CPU: 11 PID: 121813 at fs/exec.c:118 path_noexec+0x47/0x50
   do_mmap+0x2b5/0x680
   vm_mmap_pgoff+0x129/0x210
   ksys_mmap_pgoff+0x177/0x240
   __x64_sys_mmap+0x33/0x70

init_pseudo() sets up internal SB_NOUSER mounts that are never
path-reachable. Set both flags here so every pseudo filesystem gets
them by default instead of each caller setting them.

SB_I_NODEV is inert for unreachable mounts. SB_I_NOEXEC has one
visible effect: an executable mapping of a pseudo-fs fd, such as a
dma-buf, now fails with -EPERM, which is the invariant the assertion
enforces. No in-tree caller maps these executable.

Reproduce on CONFIG_DEBUG_VFS=y:

  make -C tools/testing/selftests/dmabuf-heaps
  sudo ./tools/testing/selftests/dmabuf-heaps/dmabuf-heap -t system

Fixes: 1e7ab6f67824 ("anon_inode: rework assertions")
Suggested-by: Christoph Hellwig <hch@infradead.org>
Cc: stable@vger.kernel.org
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Link: https://patch.msgid.link/20260604025315.245910-2-jhubbard@nvidia.com
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

iomap: avoid potential null folio->mapping deref during error reporting

When a buffered read fails, iomap_finish_folio_read() reports the error
with fserror_report_io(folio->mapping->host, ...). This is called after
ifs->read_bytes_pending has been decremented by the bytes attempted to
be read.

For a folio split across multiple read completions, the folio is only
guaranteed to stay locked while read_bytes_pending > 0. Once
iomap_finish_folio_read() decrements read_bytes_pending, another
in-flight read can complete and end the read on the folio, which unlocks
it. This allows truncate logic to run and detach the folio (set
folio->mapping to NULL). The error reporting path then can dereference a
NULL folio->mapping. As reported by Sam Sun, this is the race that can
occur:

CPU0: failed completion      CPU1: final completion     CPU2: truncate
-----------------------      ----------------------     --------------
read_bytes_pending -= len
finished = false
/* preempted before
   fserror_report_io() */
     read_bytes_pending -= len
     finished = true
     folio_end_read()
truncate clears
folio->mapping
fserror_report_io(
  folio->mapping->host, ...)
      ^ NULL deref

Fix this by reporting the error first before decrementing
ifs->read_bytes_pending.

Fixes: a9d573ee88af ("iomap: report file I/O errors to the VFS")
Cc: stable@vger.kernel.org
Reported-by: Sam Sun <samsun1006219@gmail.com>
Closes: https://lore.kernel.org/linux-fsdevel/CAEkJfYPhWdd59RKmuNLJg-bkypHz7xiOwaWyNVu3A8CUqQCnvg@mail.gmail.com/
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20260604011858.2297561-1-joannelkoong@gmail.com
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

fhandle: fix UAF due to unlocked ->mnt_ns read in may_decode_fh()

may_decode_fh() accesses mount::mnt_ns without holding any locks; that
means the mount can concurrently be unmounted, and the mnt_namespace can
concurrently be freed after an RCU grace period.

This race can happens as follows, assuming that the mount point was
created by open_tree(..., OPEN_TREE_CLONE):

thread 1            thread 2            RCU
                    __do_sys_open_by_handle_at
                      do_handle_open
                        handle_to_path
                          may_decode_fh
                            is_mounted
                              [mount::mnt_ns access]
                            [mount::mnt_ns access]
__do_sys_close
  fput_close_sync
    __fput
      dissolve_on_fput
        umount_tree
        class_namespace_excl_destructor
          namespace_unlock
            free_mnt_ns
              mnt_ns_tree_remove
                call_rcu(mnt_ns_release_rcu)
                                        mnt_ns_release_rcu
                                          mnt_ns_release
                                            kfree
                            [mnt_namespace::user_ns access] **UAF**

Fix it by taking rcu_read_lock() around the mount::mnt_ns access, like
in __prepend_path().
Additionally, document the semantics of mount::mnt_ns, and use WRITE_ONCE()
for writers that can race with lockless readers.

This bug is unreachable unless one of the following is set:

- CONFIG_PREEMPTION
- CONFIG_RCU_STRICT_GRACE_PERIOD

because it requires an RCU grace period to happen during a syscall without
an explicit preemption.

This doesn't seem to have interesting security impact; worst-case, it could
leak the result of an integer comparison to userspace (from the level
check in cap_capable()), cause an endless loop, or crash the kernel by
dereferencing an invalid address.

Fixes: 620c266f3949 ("fhandle: relax open_by_handle_at() permission checks")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Link: https://patch.msgid.link/20260603-vfs-fhandle-uaf-fix-v2-1-d05db76a5084@google.com
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Merge tag 'zynqmp-soc-for-7.2' of https://github.com/Xilinx/linux-xlnx into soc/drivers

arm64: Xilinx SOC changes for 7.2

firmware:
- Add CSU register discovery with sysfs interface

zynqmp_power:
- Fix race condition in event registration
- Fix shutdown and free rx mailbox channel

* tag 'zynqmp-soc-for-7.2' of https://github.com/Xilinx/linux-xlnx:
  firmware: zynqmp: Add dynamic CSU register discovery and sysfs interface
  Documentation: ABI: add sysfs interface for ZynqMP CSU registers
  soc: xilinx: Shutdown and free rx mailbox channel
  soc: xilinx: Fix race condition in event registration

Signed-off-by: Linus Walleij <linusw@kernel.org>

kbuild: rust: rename flag to `-Zdebuginfo-for-profiling` for Rust >= 1.98

Starting with Rust 1.98.0 (expected 2026-08-20), the
`-Zdebug-info-for-profiling` flag has been renamed to
`-Zdebuginfo-for-profiling` (i.e. one less dash, to match `debuginfo`s
in other flags) [1].

Without this change, one gets in the latest nightlies:

error: unknown unstable option: `debug-info-for-profiling`

Thus pass the right name.

Link: https://github.com/rust-lang/rust/pull/156887
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Acked-by: Nathan Chancellor <nathan@kernel.org>
Link: https://patch.msgid.link/20260602151638.14358-1-ojeda@kernel.org
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>

arm64: dts: hisilicon: hi3660-hikey960: move role-switch endpoint into connector

The rt1711h Type-C controller on the HiKey960 has the USB role-switch
endpoint placed as a top-level 'port' node, outside the connector
subnode. This triggers two dtbs_check warnings against
richtek,rt1711h.yaml:

- 'port' does not match any of the regexes: '^pinctrl-[0-9]+$'
- connector:ports: 'port@0' is a required property

Move the role-switch endpoint into the connector's port@0, which is
where usb-connector.yaml expects it. Update the DWC3 remote-endpoint
phandle accordingly.

The TCPM core (tcpm.c) looks up the role switch starting from the
connector fwnode via fwnode_usb_role_switch_get(). With the endpoint
inside the connector's port@0, it is found through the primary lookup
path rather than the device-level fallback.

Cross-compiled for arm64. Verified with dt_binding_check and
dtbs_check. Not runtime-tested on hardware.

Signed-off-by: Akash Sukhavasi <akash.sukhavasi@gmail.com>
Signed-off-by: Wei Xu <xuwei5@hisilicon.com>

iommu/vt-d: Fix RB-tree corruption in probe error path

The info->node RB-tree member is zero-initialized via kzalloc. If
a device does not support ATS, the device_rbtree_insert() call is
skipped. If a subsequent probe step fails, the error path jumps to
device_rbtree_remove(), which misinterprets the zeroed node as
a tree root and corrupts the device RB-tree.

Fix this by explicitly initializing the RB-node as empty using
RB_CLEAR_NODE() during initialization and guarding the removal with
RB_EMPTY_NODE().

Fixes: 4f1492efb495 ("iommu/vt-d: Revert ATS timing change to fix boot failure")
Reported-by: sashiko-bot@kernel.org
Closes: https://lore.kernel.org/all/20260525205628.CD4431F000E9@smtp.kernel.org/
Suggested-by: Baolu Lu <baolu.lu@linux.intel.com>
Signed-off-by: Pranjal Shrivastava <praan@google.com>
Link: https://lore.kernel.org/r/20260531170254.60493-2-praan@google.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

iommu/vt-d: Improve IOMMU fault information

In some environments, multiple PCIe segments exist, and PCIe device
information needs to be differentiated and identified based on the
segment. When an IOMMU fault event occurs, the IOMMU and device segment
information should be output in detail in dmar_fault_do_one.

Signed-off-by: Guanghui Feng <guanghuifeng@linux.alibaba.com>
Link: https://lore.kernel.org/r/20260528022943.1697564-1-guanghuifeng@linux.alibaba.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

iommu/vt-d: Remove typo from pasid_pte_config_nested()

Rename pasid_pte_config_nestd() into pasid_pte_config_nested(). Do it to
match other function names ending with _nested().

Signed-off-by: Michał Grzelak <michal.grzelak@intel.com>
Link: https://lore.kernel.org/r/20260509174503.831134-1-michal.grzelak@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

iommu/vt-d: Clear Present bit before tearing down scalable-mode context entry

device_pasid_table_teardown() zeroes the 128-bit scalable-mode context
entry with context_clear_entry() while the Present bit is still set. This
creates a window where the hardware can fetch a torn entry, with some
fields already zeroed while Present is still set, leading to unpredictable
behavior or spurious faults. The context-cache invalidation is issued only
after the entry has been zeroed, and intel_pasid_free_table() then frees
the PASID directory pages, so the IOMMU can keep walking a stale Present=1
entry that points at freed memory.

While x86 provides strong write ordering, the compiler may reorder the two
64-bit writes to the entry, and the hardware fetch is not guaranteed to be
atomic with respect to multiple CPU writes.

Commit c1e4f1dccbe9d ("iommu/vt-d: Clear Present bit before tearing down
context entry") fixed this exact pattern in domain_context_clear_one() and
the copied-context path, but device_pasid_table_teardown() was not
converted.

Align it with the "Guidance to Software for Invalidations" in the VT-d
spec, Section 6.5.3.3, using the same ownership handshake as the sibling
fix: clear only the Present bit, flush it to the IOMMU, perform the
context-cache invalidation, and only then zero the rest of the entry.

Fixes: 81e921fd32161 ("iommu/vt-d: Fix NULL domain on device release")
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Assisted-by: Claude:claude-opus-4-7
Link: https://lore.kernel.org/r/20260528025557.3209367-1-michael.bommarito@gmail.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

iommu/vt-d: Avoid WARNING in sva unbind path

The Intel IOMMU driver allows SVA on devices even if they do not support
PCI/PRI. Commit 39c20c4e83b9 ("iommu/vt-d: Only handle IOPF for SVA when
PRI is supported") modified the SVA bind path to allow this configuration
by skipping IOPF enablement when PRI is missing. However, it failed to
update the unbind path.

This creates an imbalance: the unbind path attempts to disable IOPF for
a device that never had it enabled, triggering a WARNING in
intel_iommu_disable_iopf():

WARNING: drivers/iommu/intel/iommu.c:3475 at intel_iommu_disable_iopf+0x4f/0x90d
Call Trace:
  <TASK>
  blocking_domain_set_dev_pasid+0x50/0x70
  iommu_detach_device_pasid+0x89/0xc0
  iommu_sva_unbind_device+0x73/0x150
  xe_vm_close_and_put+0x4d2/0x1200 [xe]

Fix this by bypassing IOPF operations for SVA domains on non-PRI hardware
in both the bind and unbind paths.

Fixes: 39c20c4e83b9 ("iommu/vt-d: Only handle IOPF for SVA when PRI is supported")
Cc: stable@vger.kernel.org
Reported-by: Nareshkumar Gollakoti <naresh.kumar.g@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20260519052917.3729796-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>

USB: serial: option: add usb-id for Dell Wireless DW5826e-m

Add support for Dell DW5826e-m with USB-id 0x413c:0x81ea

T:  Bus=03 Lev=01 Prnt=01 Port=04 Cnt=01 Dev#=  8 Spd=480  MxCh= 0
D:  Ver= 2.10 Cls=ef(misc ) Sub=02 Prot=01 MxPS=64 #Cfgs=  1
P:  Vendor=413c ProdID=81ea Rev= 5.04
S:  Manufacturer=DELL
S:  Product=DW5826e-m Qualcomm Snapdragon X12 Global LTE-A
S:  SerialNumber=358988870177734
C:* #Ifs= 7 Cfg#= 1 Atr=a0 MxPwr=500mA
A:  FirstIf#=12 IfCount= 2 Cls=02(comm.) Sub=0e Prot=00
I:* If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=30 Driver=option
E:  Ad=01(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=81(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
I:* If#= 1 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=42 Prot=01 Driver=usbfs
E:  Ad=02(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=82(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
I:* If#= 2 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=60 Driver=option
E:  Ad=84(I) Atr=03(Int.) MxPS=  10 Ivl=32ms
E:  Ad=83(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=03(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
I:* If#= 3 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=40 Driver=option
E:  Ad=86(I) Atr=03(Int.) MxPS=  10 Ivl=32ms
E:  Ad=85(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=04(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
I:* If#= 4 Alt= 0 #EPs= 1 Cls=ff(vend.) Sub=ff Prot=ff Driver=(none)
E:  Ad=87(I) Atr=03(Int.) MxPS=  64 Ivl=32ms
I:* If#=12 Alt= 0 #EPs= 1 Cls=02(comm.) Sub=0e Prot=00 Driver=cdc_mbim
E:  Ad=88(I) Atr=03(Int.) MxPS=  64 Ivl=32ms
I:  If#=13 Alt= 0 #EPs= 0 Cls=0a(data ) Sub=00 Prot=02 Driver=cdc_mbim
I:* If#=13 Alt= 1 #EPs= 2 Cls=0a(data ) Sub=00 Prot=02 Driver=cdc_mbim
E:  Ad=8e(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=0f(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms

Signed-off-by: Jack Wu <jackbb_wu@compal.com>
Reviewed-by: Lars Melin <larsm17@gmail>
Cc: stable@vger.kernel.org
[ johan: reserve also interface 4 ]
Signed-off-by: Johan Hovold <johan@kernel.org>

hwrng: virtio: clamp device-reported used.len at copy_data()

random_recv_done() stores the device-reported used.len directly into
vi->data_avail.  copy_data() then indexes vi->data[] using
vi->data_idx (advanced by previous copy_data() calls) and issues a
memcpy() without re-validating either value against the posted
buffer size sizeof(vi->data) (SMP_CACHE_BYTES bytes, typically 32
or 64).

A malicious or buggy virtio-rng backend can set used.len beyond
sizeof(vi->data), steering the memcpy() past the end of the inline
array into adjacent kmalloc-1k slab bytes.  hwrng_fillfn() mixes
those bytes into the guest RNG, and guest root can also observe
them directly via /dev/hwrng.

Concrete impact is inside the guest:

- Memory-safety / hardening: any virtio-rng backend that
   over-reports used.len causes the driver to read past vi->data
   into unrelated slab contents.  hwrng_fillfn() is a kernel thread
   that runs as soon as the device is probed; no guest userspace
   interaction is required to first-trigger the OOB.

- Cross-boundary leak (confidential-compute threat model): a
   malicious hypervisor cooperating with a malicious or compromised
   guest root userspace can use /dev/hwrng as a leak channel for
   guest-kernel heap data.  The host sets a large used.len, guest
   root reads /dev/hwrng, and the returned bytes contain guest
   kernel slab contents that were adjacent to vi->data.  In
   practice, confidential-compute guests (SEV-SNP, TDX) usually
   disable virtio-rng entirely, so this path is narrow, but the
   fix is still worth carrying because the underlying
   memory-safety bug contaminates the guest RNG on any host.

KASAN confirms the OOB on a 7.1-rc4 guest whose virtio-rng backend
has been patched to report used.len = 0x10000:

  BUG: KASAN: slab-out-of-bounds in virtio_read+0x394/0x5d0
  Read of size 64 at addr ffff88800ae0ba20 by task hwrng/52
  Call Trace:
   __asan_memcpy+0x23/0x60
   virtio_read+0x394/0x5d0
   hwrng_fillfn+0xb2/0x470
   kthread+0x2cc/0x3a0
  Allocated by task 1:
   probe_common+0xa5/0x660
   virtio_dev_probe+0x549/0xbc0
  The buggy address belongs to the object at ffff88800ae0b800
   which belongs to the cache kmalloc-1k of size 1024
  The buggy address is located 0 bytes to the right of
   allocated 544-byte region [ffff88800ae0b800, ffff88800ae0ba20)

Same class of bug as commit c04db81cd028 ("net/9p: Fix buffer
overflow in USB transport layer"), which hardened
usb9pfs_rx_complete() against unchecked device-reported length in
the USB 9p transport.

With the clamp at point of use and array_index_nospec() in place,
the same harness boots cleanly: copy_data() returns zero for the
bogus report, the device-supplied bytes after data_idx are
discarded, and the driver issues a fresh request.

Fixes: f7f510ec1957 ("virtio: An entropy device, as suggested by hpa.")
Cc: stable@vger.kernel.org
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20260531142251.2792061-1-michael.bommarito@gmail.com>

virtio_pci: fix vq info pointer lookup via wrong index

Unbinding a virtio balloon device:

    echo virtio0 > /sys/bus/virtio/drivers/virtio_balloon/unbind

triggers a NULL pointer dereference. The dmesg says:

    BUG: kernel NULL pointer dereference, address: 0000000000000008
    [...]
    RIP: 0010:__list_del_entry_valid_or_report+0x5/0xf0
    Call Trace:
    <TASK>
    vp_del_vqs+0x121/0x230
    remove_common+0x135/0x150
    virtballoon_remove+0xee/0x100
    virtio_dev_remove+0x3b/0x80
    device_release_driver_internal+0x187/0x2c0
    unbind_store+0xb9/0xe0
    kernfs_fop_write_iter.llvm.11660790530567441834+0xf6/0x180
    vfs_write+0x2a9/0x3b0
    ksys_write+0x5c/0xd0
    do_syscall_64+0x54/0x230
    entry_SYSCALL_64_after_hwframe+0x29/0x31
    [...]
    </TASK>

The virtio_balloon device registers 5 queues (inflate, deflate, stats,
free_page, reporting) but only the first two are unconditional. The
stats, free_page and reporting queues are each conditional on their
respective feature bits. When any of these features are absent, the
corresponding vqs_info entry has name == NULL, creating holes in the
array.

The root cause is an indexing mismatch introduced when vq info storage
was changed to be passed as an argument. vp_find_vqs_msix() and
vp_find_vqs_intx() store the info pointer at vp_dev->vqs[i], where 'i'
is the caller's sparse array index. However, the virtqueue itself gets
vq->index assigned from queue_idx, a dense index that skips NULL
entries. When holes exist, 'i' and queue_idx diverge. Later,
vp_del_vqs() looks up info via vp_dev->vqs[vq->index] using the dense
index into the sparsely-populated array, and hits NULL.

Fix this by storing info at vp_dev->vqs[queue_idx] instead of
vp_dev->vqs[i], so the store index matches the lookup index
(vq->index). Apply the fix to both the MSIX and INTX paths.

Cc: Yichun Zhang <yichun@openresty.com>
Cc: Jiri Pirko <jiri@nvidia.com>
Cc: stable@vger.kernel.org # v6.11+
Tested-by: Yuka <yuka@umeyashiki.org>
Fixes: 89a1c435aec2 ("virtio_pci: pass vq info as an argument to vp_setup_vq()")
Signed-off-by: Ammar Faizi <ammarfaizi2@openresty.com>
Message-Id: <20260315141808.547081-1-ammarfaizi2@openresty.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

vhost: fix vhost_get_avail_idx for a non empty ring

vhost_get_avail_idx is supposed to report whether it has updated
vq->avail_idx. Instead, it returns whether all entries have been
consumed, which is usually the same. But not always - in
drivers/vhost/net.c and when mergeable buffers have been enabled, the
driver checks whether the combined entries are big enough to store an
incoming packet. If not, the driver re-enables notifications with
available entries still in the ring. The incorrect return value from
vhost_get_avail_idx propagates through vhost_enable_notify and causes
the host to livelock if the guest is not making progress, as vhost will
immediately disable notifications and retry using the available entries.

This goes back to commit d3bb267bbdcb ("vhost: cache avail index in
vhost_enable_notify()") which changed vhost_enable_notify() to compare
the freshly read avail index against vq->last_avail_idx instead of the
previously cached vq->avail_idx. Commit 7ad472397667 ("vhost: move
smp_rmb() into vhost_get_avail_idx()") then carried over the same
comparison when refactoring vhost_enable_notify() to call the unified
vhost_get_avail_idx().

The obvious fix is to make vhost_get_avail_idx do what the comment
says it does and report whether new entries have been added.

Reported-by: ShuangYu <shuangyu@yunyoo.cc>
Fixes: d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()")
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <559b04ae6ce52973c535dc47e461638b7f4c3d63.1772441455.git.mst@redhat.com>

net: axienet: Use dedicated ethtool_ops for the dmaengine path

The dmaengine path shares ethtool_ops with the legacy AXI DMA path,
including .get_coalesce/.set_coalesce that poke XAXIDMA_*_CR_OFFSET
directly. In dmaengine mode lp->dma_regs is not mapped by axienet, so
those ethtool calls touch unmapped/unrelated memory and report values
unrelated to the channel actually in use.

.get_ringparam/.set_ringparam only touch lp->rx_bd_num/lp->tx_bd_num,
fields used only by the legacy path for BD ring sizing. In dmaengine
mode the descriptor ring is owned by the dmaengine provider and these
fields are not consulted, so reporting them is misleading.

No dmaengine API exists today to query or program either coalescing
or ring size on behalf of the client, so neither can be exposed
meaningfully in dmaengine mode.

Add axienet_ethtool_dmaengine_ops without the coalesce and ringparam
hooks. Also move the ethtool_ops assignment from early probe into the
if/else alongside netdev_ops, so the legacy and dmaengine paths pick
their respective ops in one place. No functional change for the
legacy DMA path.

Signed-off-by: Suraj Gupta <suraj.gupta2@amd.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260601124454.3384601-1-suraj.gupta2@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

kconfig: add kconfig-sym-check static checker

Add 'make kconfig-sym-check', a static checker that finds Kconfig
symbols referenced in expressions (select, depends on, default, etc.)
but never defined via config/menuconfig anywhere in the tree. New
dangling symbols are reported as errors (exit 1) unless they are
listed in an exclusion file, e.g.

KCONFIG_SYM_CHECK_EXCLUDES=sym-check-excludes make kconfig-sym-check

The exclusion file lists one symbol per line; blank lines and lines
starting with '#' are ignored.

The checker also warns about uppercase N/Y/M used as tristate literal
values following the same logic as checkpatch.

This new static checker is the script used for [1] with a few
improvements to avoid some false positives.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216748
Assisted-by: Claude:claude-sonnet-4-6
Signed-off-by: Andrew Jones <andrew.jones@linux.dev>
Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Julian Braha <julianbraha@gmail.com>
Tested-by: Nicolas Schier <nsc@kernel.org>
Acked-by: Nicolas Schier <nsc@kernel.org>
Link: https://patch.msgid.link/20260527142703.107110-1-andrew.jones@linux.dev
Signed-off-by: Nathan Chancellor <nathan@kernel.org>

net/sun: Fix multiple typos in comments

There are some typos in comments and while they are harmless and not visible,
there is no reason not to fix them. Fix the ones that are not register related,
which might have intentional naming convention.

Signed-off-by: Jakub Raczynski <j.raczynski@samsung.com>
Link: https://patch.msgid.link/20260601163727.554364-1-j.raczynski@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eth: bnxt: disable rx-copybreak by default

rx-copybreak requires an extra slab allocation. Since bnxt uses
page pool frags and HDS by default, the rx-copybreak doesn't
buy us anything. The extra pressure on slab causes overload
on pre-sheaves kernels on modern AMD platforms.

In synthetic testing on net-next this patch shows little difference
but I think copybreak is "obvious waste" at this point.

Default rx-copybreak threshold to 0 / disabled.

The "copybreak" defines are really the size bounds for the Rx header
buffer. Rename them.

Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Link: https://patch.msgid.link/20260602003759.1545645-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'fix-use-after-free-in-metadata-dst-teardown-in-airoha_eth-and-mtk_eth_soc-drivers'

Lorenzo Bianconi says:

====================
Fix use-after-free in metadata dst teardown in airoha_eth and mtk_eth_soc drivers

airoha_metadata_dst_free() and mtk_free_dev() call metadata_dst_free()
which frees the metadata_dst with kfree() immediately, bypassing the RCU
grace period.
Replace metadata_dst_free() with dst_release() which properly goes
through the refcount path and runs call_rcu_hurry() if refcount goes to
zero.
====================

Link: https://patch.msgid.link/20260602-airoha-mtk-metadata-uaf-fix-v1-0-3aaa99d83351@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: mtk_eth_soc: Fix use-after-free in metadata dst teardown

mtk_free_dev() calls metadata_dst_free() which frees the metadata_dst
with kfree() immediately, bypassing the RCU grace period.
In the RX path, skb_dst_set_noref() sets a non-refcounted pointer from
the skb to the metadata_dst. This function requires RCU read-side
protection and the dst must remain valid until all RCU readers complete.
Since metadata_dst_free() calls kfree() directly, a use-after-free can
occur if any skb still holds a noref pointer to the dst when the driver
tears it down.
Replace metadata_dst_free() with dst_release() which properly goes
through the refcount path: when the refcount drops to zero, it schedules
the actual free via call_rcu_hurry(), ensuring all RCU readers have
completed before the memory is freed.

Fixes: 2d7605a72906 ("net: ethernet: mtk_eth_soc: enable hardware DSA untagging")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260602-airoha-mtk-metadata-uaf-fix-v1-2-3aaa99d83351@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Fix use-after-free in metadata dst teardown

airoha_metadata_dst_free() runs metadata_dst_free() which frees the
metadata_dst with kfree() immediately, bypassing the RCU grace period.
In the RX path, skb_dst_set_noref() sets a non-refcounted pointer from
the skb to the metadata_dst. This function requires RCU read-side
protection and the dst must remain valid until all RCU readers complete.
Since metadata_dst_free() calls kfree() directly, an use-after-free can
occur if any skb still holds a noref pointer to the dst when the driver
tears it down.
Replace metadata_dst_free() with dst_release() which properly goes
through the refcount path: when the refcount drops to zero, it schedules
the actual free via call_rcu_hurry(), ensuring all RCU readers have
completed before the memory is freed.

Fixes: af3cf757d5c9 ("net: airoha: Move DSA tag in DMA descriptor")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260602-airoha-mtk-metadata-uaf-fix-v1-1-3aaa99d83351@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: exthdrs: recompute network header pointer once

In ip6_parse_tlv(), recompute the network header pointer once regardless
of the option processed (Hbh or Dest), as missing recomputation for
specific options has caused issues in the past.

Signed-off-by: Justin Iurman <justin.iurman@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260602213033.12244-1-justin.iurman@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'amd-drm-next-7.2-2026-06-03' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-7.2-2026-06-03:

amdgpu:
- BT.2020 fix for DCE
- DC bounds checking fixes
- SDMA 7.1 fix
- UserQ fixes
- SI fix
- SMU 13 fixes
- SMU 14 fixes
- GC 12.1 fix
- Userptr fix
- GC 10.1 fix
- GART fix for non-4K pages
- DCN 4.x fixes
- DCN 4.2 updates
- More DC KUnit tests
- PSR cleanup
- Support for connectors without DDC pins
- Initial DCN 4.2.1 support
- Initial HDMI 2.1 FRL support
- Misc bounds check fixes
- RAS fixes
- GC 11.5.6 support
- SDMA 6.4.0 support
- NBIO 7.11.5 support
- IH 6.4.0 support
- HDP 6.4.0 support
- MMHUB 3.4.2 support
- SMU 15.0.5 support
- ATHUB 3.4.2 support
- VPE 2.2 support
- Devcoredump fixes
- _PR3 fix

amdkfd:
- UAF race fix
- Fix a potential NULL pointer dereference
- GC 11 buffer overflow fix for SDMA
- Profiler locking order fix

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patch.msgid.link/20260604013527.2373534-1-alexander.deucher@amd.com

net: ibm: emac: fix unchecked platform_get_irq return value

platform_get_irq() returns a negative errno on failure.
Commit a598f66d9169 replaced irq_of_parse_and_map() (which returns 0
on failure) with platform_get_irq() but dropped the error check.
Without it, a negative IRQ number is passed to devm_request_irq(),
which fails with -EINVAL instead of propagating the real error
from platform_get_irq().

Add the missing error check and goto err_gone.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260601040201.103481-1-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'for-net-2026-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

- hci_core: fix memory leak in error path of hci_alloc_dev()
- hci_sync: reject oversized Broadcast Announcement prepend
- MGMT: Fix backward compatibility with userspace
- MGMT: validate advertising TLV before type checks
- L2CAP: reject BR/EDR signaling packets over MTUsig
- RFCOMM: validate skb length in MCC handlers
- RFCOMM: hold listener socket in rfcomm_connect_ind()
- ISO: Fix not releasing hdev reference on iso_conn_big_sync
- ISO: Fix a use-after-free of the hci_conn pointer
- ISO: Fix data-race on iso_pi fields in hci_get_route calls
- SCO: Fix data-race on sco_pi fields in sco_connect
- BNEP: reject short frames before parsing

* tag 'for-net-2026-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: MGMT: Fix backward compatibility with userspace
  Bluetooth: SCO: Fix data-race on sco_pi fields in sco_connect
  Bluetooth: ISO: Fix data-race on iso_pi fields in hci_get_route calls
  Bluetooth: ISO: Fix a use-after-free of the hci_conn pointer
  Bluetooth: ISO: Fix not releasing hdev reference on iso_conn_big_sync
  Bluetooth: fix memory leak in error path of hci_alloc_dev()
  Bluetooth: bnep: reject short frames before parsing
  Bluetooth: hci_sync: reject oversized Broadcast Announcement prepend
  Bluetooth: L2CAP: reject BR/EDR signaling packets over MTUsig
  Bluetooth: RFCOMM: validate skb length in MCC handlers
  Bluetooth: MGMT: validate advertising TLV before type checks
  Bluetooth: RFCOMM: hold listener socket in rfcomm_connect_ind()
====================

Link: https://patch.msgid.link/20260603162714.342496-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'wireless-2026-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes Berg says:

====================
Things are finally quieting down:
- iwlwifi:
   - FW reset handshake removal for older devices
   - NIC access fix in fast resume
   - avoid too large command for some BIOSes
   - fix TX power constraints in AP mode
- cfg80211:
   - fix netlink parse overflow
   - fix potential 6 GHz scan memory leak
   - enforce HE/EHT consistency to avoid mac80211 crash
- mac80211: guard radiotap antenna parsing

* tag 'wireless-2026-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: cfg80211: enforce HE/EHT cap/oper consistency
  wifi: fix leak if split 6 GHz scanning fails
  wifi: mac80211: limit injected antenna index in ieee80211_parse_tx_radiotap
  wifi: nl80211: reject oversized EMA RNR lists
  wifi: iwlwifi: pcie: simplify the resume flow if fast resume is not used
  wifi: iwlwifi: mvm: avoid oversized UATS command copy
  wifi: iwlwifi: mld: send tx power constraints before link activation
  wifi: iwlwifi: mvm: don't support the reset handshake for old firmwares
====================

Link: https://patch.msgid.link/20260603113208.171874-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

x86/cpuid: Update bitfields to x86-cpuid-db v3.1

Update leaf_types.h to version 3.1, as generated by x86-cpuid-db.

Summary of the v3.1 changes:

* Fix a few typos that were found during the kernel CPUID data model
review. Also include fixes found using an LLM agent review, from Ahmed.

* Rename thrd_director_nclasses to hw_feedback_nclasses as it's the
name used in Intel SDM.

See https://gitlab.com/x86-cpuid.org/x86-cpuid-db/-/blob/v3.1/CHANGELOG.rst
for more info.

Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/9653d8690ec7093c8190b12d1fa8c689c4da50fe.1780506200.git.m.wieczorretman@pm.me

Merge branch 'mptcp-misc-fixes-for-v7-1-rc7'

Matthieu Baerts says:

====================
mptcp: misc fixes for v7.1-rc7

Here are various unrelated fixes:

- Patch 1: fix missing wakeups when multiple threads are reading from
  the same fd. A fix for v5.7.

- Patch 2: fix retransmission loop when MPTCP checksum is enabled. A fix
  for v5.14.

- Patch 3: fix a TOCTOU race while computing rcv_wnd. A fix for v5.11.

- Patch 4: allow subflows receive window to shrink if needed. A fix for
  v5.19.

- Patches 5-6: avoid 'extra_subflows' to underflow with the userspace
  PM. A fix for v5.19.

- Patch 7: report errors if one subflow cannot set SO_TIMESTAMPING. A
  fix for v5.14.

- Patch 8: try to set TCP_MAXSEG on all subflows, before reporting
  errors, if any. A fix for v6.17.

- Patch 9: check desc->count in read_sock, to act as expected. A fix
  for v7.0.

- Patch 10: fix an uninit value in mptcp_established_options, reported
  by syzbot. A fix for v7.1-rc1.

- Patch 11: fix a similar issue than the previous patch, exposed by the
  same modification from v7.1-rc1, but was already causing issues since
  v5.15.
====================

Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-0-856831229976@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: change mptcp_established_options() to return opt_size

Instead of passing opt_size address to mptcp_established_options(),
change this function to return it by value.

This removes the need for an expensive stack canary in
tcp_established_options() when CONFIG_STACKPROTECTOR_STRONG=y.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-92 (-92)
Function                                     old     new   delta
tcp_options_write.isra                      1423    1407     -16
mptcp_established_options                   2746    2720     -26
tcp_established_options                      553     503     -50
Total: Before=22110750, After=22110658, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260602125138.2317015-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: add-addr: always drop other suboptions

When an ADD_ADDR needs to be sent, it could be prepared if there is
enough remaining space and even if the packet is not a pure ACK. But it
would be dropped soon after.

Indeed, in mptcp_pm_add_addr_signal(), there is enough space to fit a
DSS of 20 octets and an ADD_ADDR echo containing an IPv4 address on 8
octets for example. In this case, the packet would be prepared, the
MPTCP_ADD_ADDR_ECHO bit would be removed from pm->addr_signal, but the
option would be silently dropped in mptcp_established_options_add_addr()
not to override DSS info in the union from 'struct mptcp_out_options',
and also because mptcp_write_options() will enforce mutually exclusion
with DSS.

Instead, don't even try to send an ADD_ADDR if it is not a pure ACK.
Retry for each new packet until a pure-ACK is emitted. That's fine to do
that, because each time an ADD_ADDR (echo) is scheduled, a pure ACK is
queued.

This also simplifies the code, and the skb checks can be done earlier,
before the lock.

Note: also, since commit 6d0060f600ad ("mptcp: Write MPTCP DSS headers
to outgoing data packets"), opts->ahmac would not have been set to 0
when other suboptions were not dropped, and when sending an ADD_ADDR
echo. That would have resulted in sending an ADD_ADDR using garbage
info, where there was not enough space, instead of an echo one without
the ADD_ADDR HMAC.

Fixes: 1bff1e43a30e ("mptcp: optimize out option generation")
Cc: stable@vger.kernel.org
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-11-856831229976@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: fix uninit-value in mptcp_established_options

syzbot reported the following uninit splat:

  BUG: KMSAN: uninit-value in mptcp_write_data_fin net/mptcp/options.c:542 [inline]
  BUG: KMSAN: uninit-value in mptcp_established_options_dss net/mptcp/options.c:590 [inline]
  BUG: KMSAN: uninit-value in mptcp_established_options+0x112f/0x3530 net/mptcp/options.c:874
   mptcp_write_data_fin net/mptcp/options.c:542 [inline]
   mptcp_established_options_dss net/mptcp/options.c:590 [inline]
   mptcp_established_options+0x112f/0x3530 net/mptcp/options.c:874
   tcp_established_options+0x312/0xcc0 net/ipv4/tcp_output.c:1192
   __tcp_transmit_skb+0x5dc/0x5fe0 net/ipv4/tcp_output.c:1575
   __tcp_send_ack+0x967/0xad0 net/ipv4/tcp_output.c:4499
   tcp_send_ack+0x3d/0x60 net/ipv4/tcp_output.c:4505
   mptcp_subflow_shutdown+0x164/0x690 net/mptcp/protocol.c:3137
   mptcp_check_send_data_fin+0x31b/0x3d0 net/mptcp/protocol.c:3218
   __mptcp_wr_shutdown net/mptcp/protocol.c:3234 [inline]
   __mptcp_close+0x860/0x1360 net/mptcp/protocol.c:3313
   mptcp_close+0x42/0x260 net/mptcp/protocol.c:3367
   inet_release+0x1ee/0x2a0 net/ipv4/af_inet.c:442
   __sock_release net/socket.c:722 [inline]
   sock_close+0xd6/0x2f0 net/socket.c:1514
   __fput+0x60e/0x1010 fs/file_table.c:510
   ____fput+0x25/0x30 fs/file_table.c:538
   task_work_run+0x208/0x2b0 kernel/task_work.c:233
   resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
   __exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
   exit_to_user_mode_loop+0x306/0x1b60 kernel/entry/common.c:98
   __exit_to_user_mode_prepare include/linux/irq-entry-common.h:207 [inline]
   syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:238 [inline]
   syscall_exit_to_user_mode include/linux/entry-common.h:318 [inline]
   __do_fast_syscall_32+0x2c7/0x460 arch/x86/entry/syscall_32.c:310
   do_fast_syscall_32+0x37/0x80 arch/x86/entry/syscall_32.c:332
   do_SYSENTER_32+0x1f/0x30 arch/x86/entry/syscall_32.c:370
   entry_SYSENTER_compat_after_hwframe+0x84/0x8e

  Local variable opts created at:
   __tcp_transmit_skb+0x4d/0x5fe0 net/ipv4/tcp_output.c:1536
   __tcp_send_ack+0x967/0xad0 net/ipv4/tcp_output.c:4499

The output path currently omits initializing the mptcp extension
`use_map` flag in a few corner cases.

Address the issue always zeroing all the extensions flags before
eventually initializing the individual bits. To that extent, introduce
and use a struct_group to avoid multiple bitwise operations.

Fixes: cfcceb7a39fc ("tcp: shrink per-packet memset in __tcp_transmit_skb()")
Cc: stable@vger.kernel.org
Reported-by: syzbot+ff020673c5e3d94d9478@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=ff020673c5e3d94d9478
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-10-856831229976@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: check desc->count in read_sock

__tcp_read_sock() checks desc->count after each skb is consumed and
breaks the loop when it reaches 0. The MPTCP variant lacks this check.

This is a functional bug, other subsystems also rely on this check:
TLS strparser sets desc->count to 0 once a full TLS record is assembled
and depends on this break to stop reading.

Add the same desc->count check to __mptcp_read_sock(), mirroring
__tcp_read_sock().

Fixes: 250d9766a984 ("mptcp: implement .read_sock")
Cc: stable@vger.kernel.org
Co-developed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-9-856831229976@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: sockopt: set sockopt on all subflows

The mptcp_setsockopt_all_sf(), currently used only with TCP_MAXSEG,
stopped when one subflow returned an error.

Even if it is not wrong, this is different from the other helpers trying
to set the option on all subflows, and then returning an error if at
least one of them had an issue.

Follow this behaviour, for a question of uniformity.

Fixes: 51c5fd09e1b4 ("mptcp: add TCP_MAXSEG sockopt support")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-8-856831229976@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: sockopt: check timestamping ret value

sock_set_timestamping() can fail for different reasons. The returned
value should then be checked.

If sock_set_timestamping() fails for at least one subflow, the first
error is now reported to the userspace, similar to what is done with
other socket options.

Fixes: 9061f24bf82e ("mptcp: sockopt: propagate timestamp request to subflows")
Cc: stable@vger.kernel.org
Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Closes: https://lore.kernel.org/willemdebruijn.kernel.178a41a53d041@gmail.com
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-7-856831229976@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: add test for extra_subflows underflow on userspace PM

Add a test to verify that when userspace PM fails to create a subflow
(e.g. using an unreachable address), the extra_subflows counter is not
decremented below zero.

Fixes: 77e4b94a3de6 ("mptcp: update userspace pm infos")
Cc: stable@vger.kernel.org
Signed-off-by: Tao Cui <cuitao@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-6-856831229976@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: fix extra_subflows underflow on userspace PM subflow creation

The userspace PM increments extra_subflows after __mptcp_subflow_connect()
succeeds, but __mptcp_subflow_connect() calls mptcp_pm_close_subflow()
on failure to roll back the pre-increment done by the kernel PM's fill_*()
helpers. Because the userspace PM hasn't incremented yet at that point,
this decrement is spurious and causes extra_subflows to underflow.

Fix it by aligning the userspace PM with the kernel PM: increment
extra_subflows before calling __mptcp_subflow_connect(), so the existing
error path in subflow.c correctly rolls it back on failure. Also simplify
the error handling by taking pm.lock only when needed for cleanup.

Fixes: 77e4b94a3de6 ("mptcp: update userspace pm infos")
Cc: stable@vger.kernel.org
Signed-off-by: Tao Cui <cuitao@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-5-856831229976@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: allow subflow rcv wnd to shrink

In MPTCP connection, the `window` field in the TCP header refers to the
MPTCP-level rcv_nxt and it's right edge should not move backward. Such
constraint is enforced at DSS option generation time.

At the same time, the TCP stack ensures independently that the TCP-level
rcv wnd right's edge does not move backward. That in turn causes artificial
inflating of the MPTCP rcv window when the incoming data is acked at the
TCP level and is OoO in the MPTCP sequence space (or lands in the backlog).

As a consequence, the incoming traffic can exceed the receiver rcvbuf size
even when the sender is not misbehaving.

Prevent such scenario forcibly allowing the TCP subflow to shrink the
TCP-level rcv wnd regardless of the current netns setting.

Fixes: f3589be0c420 ("mptcp: never shrink offered window")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-4-856831229976@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: close TOCTOU race while computing rcv_wnd

The MPTCP output path access locklessly the MPTCP-level ack_seq
in multiple times, using possibly different values for the data_ack
in the DSS option and to compute the announced rcv wnd for the same
packet.

Refactor the cote to avoid inconsistencies which may confuse the
peer. Also ensure that the MPTCP level rcv wnd is updated only when
the egress packet actually contains a DSS ack.

Fixes: fa3fe2b15031 ("mptcp: track window announced to peer")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-3-856831229976@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: fix retransmission loop when csum is enabled

Sashiko noted that retransmission with csum enabled can actually
transmit new data, but currently the relevant code does not update
accordingly snd_nxt.

The may cause incoming ack drop and an endless retransmission loop.

Address the issue incrementing snd_nxt as needed.

Fixes: 4e14867d5e91 ("mptcp: tune re-injections for csum enabled mode")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-2-856831229976@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: fix missing wakeups in edge scenarios

The mptcp_recvmsg() can fill MPTCP socket receive queue via
mptcp_move_skbs(), but currently does not try to wakeup any listener,
because the same process is going to check the receive queue soon.

When multiple threads are reading from the same fd, the above can
cause stall. Add the missing wakeup.

Fixes: 6771bfd9ee24 ("mptcp: update mptcp ack sequence from work queue")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-1-856831229976@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

drivers/of: fdt: Make ibm,phandle logic only happen on pseries

The "ibm,phandle" thing only seems to be needed on pseries
machines but everyone gets it so they get a string and a little
bit of useless code.

In __of_attach_node() the pseries specific part uses
IS_ENABLED(CONFIG_PPC_PSERIES) so do that here too.

Signed-off-by: Daniel Palmer <daniel@thingy.jp>
Link: https://patch.msgid.link/20260603151809.3256280-1-daniel@thingy.jp
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>

tools/x86/kcpuid: Update bitfields to x86-cpuid-db v3.1

Update kcpuid's CSV file to version 3.1, as generated by x86-cpuid-db.

Summary of the v3.1 changes:

* Fix a few typos that were found during the kernel CPUID data model
review. Also include fixes found using an LLM agent review.

* Rename thrd_director_nclasses to hw_feedback_nclasses as it's the
name used in Intel SDM.

See https://gitlab.com/x86-cpuid.org/x86-cpuid-db/-/blob/v3.1/CHANGELOG.rst
for more info.

Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/cbe9ff395b3269e112ff7ca414d726ffd7bf0787.1780506200.git.m.wieczorretman@pm.me

ptp: vclock: Switch from RCU to SRCU

The usage of PTP vClocks leads immediately to the following issues with
ptp4l with LOCKDEP and DEBUG_ATOMIC_SLEEP enabled: "BUG: sleeping function
called from invalid context".

ptp_convert_timestamp() acquires a mutex_t within a RCU read section. This
is illegal, because acquiring a mutex_t can result in voluntary scheduling
request which is not allowed within a RCU read section.

Replace the RCU usage with SRCU where sleeping is allowed.

Reported-by: Florian Zeitz <florian.zeitz@schettke.com>
Closes: https://lore.kernel.org/all/00a8cce8-410e-4038-98af-49be6d93d7bd@schettke.com/
Fixes: 67d93ffc0f3c ("ptp: vclock: use mutex to fix "sleep on atomic" bug")
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20260529-vclock_rcu-v2-1-02a5531fab92@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: raw: remove six obsolete EXPORT_SYMBOL_GPL()

IPv6 can not be a module anymore, we no longer need to export:

- raw_hash_sk()
- raw_unhash_sk()
- raw_abort()
- raw_seq_start()
- raw_seq_next()
- raw_seq_stop()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260602165036.2712408-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: restrict IPOPT_SSRR and IPOPT_LSRR options

This patch restricts setting Loose Source and Record Route (LSRR)
and Strict Source and Record Route (SSRR) IP options to users
with CAP_NET_RAW capability.

This prevents unprivileged applications from forcing packets to route
through attacker-controlled nodes to leak TCP ISN and possibly other
protocol information.

While LSRR and SSRR are commonly filtered in many network environments,
they may still be supported and forwarded along some network paths.

RFC 7126 (Recommendations on Filtering of IPv4 Packets Containing
IPv4 Options) recommend to drop these options in 4.3 and 4.4.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Tamir Shahar <tamirthesis@gmail.com>
Reported-by: Amit Klein <aksecurity@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260602161547.2642155-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'af_unix-fix-inq_len-update-issue'

Jianyu Li says:

====================
af_unix: Fix inq_len update issue

From: Jianyu Li <jianyu.li@mediatek.com>

This series fix the problem that inq_len is inconsistent with
actual remaining byte count when only part of a skb is consumed.
====================

Link: https://patch.msgid.link/20260601113640.231897-1-jianyu.li@mediatek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

af_unix: Add test for SCM_INQ on partial read

Add test to verify that when a skb is partially consumed,
unix_inq_len() return correct remaining byte count.

Before:

  #  RUN           scm_inq.stream.partial_read ...
  # scm_inq.c:165:partial_read:Expected remain (512) == *(int *)CMSG_DATA(cmsg) (768)
  # partial_read: Test terminated by assertion
  #          FAIL  scm_inq.stream.partial_read
  not ok 2 scm_inq.stream.partial_read

After:

  #  RUN           scm_inq.stream.partial_read ...
  #            OK  scm_inq.stream.partial_read
  ok 2 scm_inq.stream.partial_read

Signed-off-by: Jianyu Li <jianyu.li@mediatek.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260601113640.231897-3-jianyu.li@mediatek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

af_unix: Fix inq_len update problem in partial read

Currently inq_len is updated only when the whole skb is consumed.
If only part of the data is read, following SIOCINQ query would
get value greater than what actually left.

This change update inq_len timely in unix_stream_read_generic(),
and adjust unix_stream_read_skb() accordingly to prevent
repetitive update.

Fixes: f4e1fb04c123 ("af_unix: Use cached value for SOCK_STREAM in unix_inq_len().")
Signed-off-by: Jianyu Li <jianyu.li@mediatek.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260601113640.231897-2-jianyu.li@mediatek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sched_ext: Make scx_bpf_kick_cid() return s32

Switch scx_bpf_kick_cid() from void to s32 so future cap enforcement can
surface failures. cid interface is introduced in this cycle and has no
external users, so the ABI change is safe. Subsequent patches will add
-EPERM returns when the calling sub-sched lacks the required cap on the
target cid.

v2: Return scx_cid_to_cpu()'s errno instead of -EINVAL. (Andrea)

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: Add scx_cmask_test() and scx_cmask_for_each_cid()

Add single-bit test and iterator over set cids in an scx_cmask.

v2: Bound scx_cmask_for_each_cid() to the active span. (sashiko AI)

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

tools/sched_ext: Order single-cid cmask helpers as (cid, mask)

The BPF arena single-cid cmask helpers take the cmask first and the cid
second. Reorder them to (cid, mask) to match the kernel-side helpers and
the test_bit(nr, addr), cpumask_test_cpu(cpu, mask) convention. Range and
iteration helpers keep (mask, start).

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

sched_ext: Order single-cid cmask helpers as (cid, mask)

__scx_cmask_set(), __scx_cmask_contains() and __scx_cmask_word() take the
cmask first and the cid second. The kernel's bit and cpumask predicates put
the index first: test_bit(nr, addr), cpumask_test_cpu(cpu, mask). Reorder
the cmask helpers to (cid, mask) for consistency, ahead of new single-cid
ops added next. Mask-level ops (and/or/andnot/copy/subset/intersects) keep
(dst, src).

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>

Merge tag 'amd-drm-fixes-7.1-2026-06-03' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes

amd-drm-fixes-7.1-2026-06-03:

amdgpu:
- BT.2020 fix for DCE
- DC bounds checking fixes
- SDMA 7.1 fix
- UserQ fixes
- SI fix
- SMU 13 fixes
- SMU 14 fixes
- GC 12.1 fix
- Userptr fix
- GC 10.1 fix
- GART fix for non-4K pages

amdkfd:
- UAF race fix
- Fix a potential NULL pointer dereference
- GC 11 buffer overflow fix for SDMA

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patch.msgid.link/20260604011351.2373027-1-alexander.deucher@amd.com

octeontx2-af: Fix initialization of mcam's entry2target_pffunc field

NPC mcam entry stores a mapping between mcam entry and target pcifunc.
During initialization of this field, API kmalloc_array has been used which
caused some junk values to array. Whereas, the array is expected to be
initialized by 0. This patch fixes the same by using kcalloc instead of
kmalloc_array.

Fixes: 55307fcb9258 ("octeontx2-af: Add mbox messages to install and delete MCAM rules")
Signed-off-by: Suman Ghosh <sumang@marvell.com>
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1780054625-17090-1-git-send-email-sbhatta@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-pf: Fix NDC sync operation errors

On system reboot "rvu_nicpf 0002:03:00.0: NDC sync operation failed"
error messages are shown, even if the operations is successful.
This is due to wrong if error check in ndc_syc() function.

Fixes: 42c45ac1419c ("octeontx2-af: Sync NIX and NPA contexts from NDC to LLC/DRAM")
Signed-off-by: Geetha sowjanya <gakula@marvell.com>
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1780054677-17249-1-git-send-email-sbhatta@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

appletalk: aarp: zero-initialize aarp_entry to prevent heap info leak

aarp_alloc() allocates struct aarp_entry without zeroing it, but only
initializes refcnt and packet_queue. When an unresolved AARP entry is
created, hwaddr[ETH_ALEN] is left uninitialized.

aarp_seq_show() later prints this field with %pM when users read
/proc/net/atalk/arp. This can expose 6 bytes of stale heap data for
each unresolved entry.

Fix this by zero-initializing struct aarp_entry at allocation time.

Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn>
Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn>
Reported-by: Ao Wang <wangao@seu.edu.cn>
Reported-by: Xuewei Feng <fengxw06@126.com>
Reported-by: Qi Li <qli01@tsinghua.edu.cn>
Reported-by: Ke Xu <xuke@tsinghua.edu.cn>
Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260529105017.81531-1-zhaoyz24@mails.tsinghua.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'geneve-allow-binding-udp-socket-to-a-specific-address'

Kuniyuki Iwashima says:

====================
geneve: Allow binding UDP socket to a specific address.

By default, a GENEVE device bind()s its underlying UDP socket(s) to
the IPv4 or IPv6 wildcard address because there is no way to specify
a specific local IP address to bind() to.

This prevents deploying multiple GENEVE devices on a multi-homed host
where each device should be isolated and bound to a different local IP
address on the same UDP port.

This series introduces two options to specify local IPv4 or IPv6
addresses for a GENEVE device.
====================

Link: https://patch.msgid.link/20260602190436.139591-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

geneve: Introduce IFLA_GENEVE_LOCAL and IFLA_GENEVE_LOCAL6.

By default, a GENEVE device bind()s its underlying UDP socket(s) to
the IPv4 or IPv6 wildcard address because there is no way to specify
a specific local IP address to bind() to.

This prevents deploying multiple GENEVE devices on a multi-homed host
where each device should be isolated and bound to a different local IP
address on the same UDP port.

Let's introduce new options, IFLA_GENEVE_LOCAL and IFLA_GENEVE_LOCAL6,
to allow specifying a local IPv4/IPv6 address for the backend UDP
socket.

By default, when collect metadata mode (IFLA_GENEVE_COLLECT_METADATA)
is enabled, both IPv4 and IPv6 sockets are created.  However, if a
source address is specified via the new attributes, only a single
socket corresponding to that specific address family is created.

Accordingly, geneve_find_sock() and geneve_find_dev() are updated to
take the source address into account, ensuring that multiple devices
and sockets configured with different source addresses can coexist
without conflict.

In addition, the source address is validated in geneve_xmit_skb()
and geneve6_xmit_skb(), so the BPF prog must set it in bpf_tunnel_key.

With this change, multiple GENEVE devices can be successfully created
and bound to their respective local IP addresses:

  (*) "local" is the keyword for IFLA_GENEVE_LOCAL / IFLA_GENEVE_LOCAL6

  # for i in $(seq 1 2);
  do
          ip link add geneve4_${i} type geneve local 192.168.0.${i} external
          ip addr add 192.168.0.${i}/24 dev geneve4_${i}
          ip link set geneve4_${i} up

          ip link add geneve6_${i} type geneve local 2001:9292::${i} external
          ip addr add 2001:9292::${i}/64 dev geneve6_${i} nodad
          ip link set geneve6_${i} up
  done

  # ip -d l | grep geneve
  9: geneve4_1: <BROADCAST,MULTICAST,UP,LOWER_UP> ...
      geneve external id 0 local 192.168.0.1 ...
  10: geneve6_1: <BROADCAST,MULTICAST,UP,LOWER_UP> ...
      geneve external id 0 local 2001:9292::1 ...
  11: geneve4_2: <BROADCAST,MULTICAST,UP,LOWER_UP> ...
      geneve external id 0 local 192.168.0.2 ...
  12: geneve6_2: <BROADCAST,MULTICAST,UP,LOWER_UP> ...
      geneve external id 0 local 2001:9292::2 ...

  # ss -ua | grep geneve
  UNCONN 0      0         192.168.0.2:geneve      0.0.0.0:*
  UNCONN 0      0         192.168.0.1:geneve      0.0.0.0:*
  UNCONN 0      0      [2001:9292::2]:geneve            *:*
  UNCONN 0      0      [2001:9292::1]:geneve            *:*

Note that even if the local address is explicitly configured with
the wildcard address, kernel does not dump it except for devices with
IFLA_GENEVE_COLLECT_METADATA.  This is consistent with the behaviour
of is_tnl_info_zero(), which treats the wildcard remote address as not
configured.

  ## ynl example.
  # ./tools/net/ynl/pyynl/cli.py \
    --spec ./Documentation/netlink/specs/rt-link.yaml \
    --do newlink --create \
    --json '{"ifname": "geneve0",
             "linkinfo": {"kind":"geneve",
                  "data": {"local": "0.0.0.0",
           "collect-metadata": true}}}'

  # ./tools/net/ynl/pyynl/cli.py \
    --spec ./Documentation/netlink/specs/rt-link.yaml \
    --do getlink \
    --json '{"ifname": "geneve0"}' --output-json | \
    jq .linkinfo.data.local
  "0.0.0.0"

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260602190436.139591-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

geneve: Add dualstack flag to struct geneve_config.

When collect metadata mode (IFLA_GENEVE_COLLECT_METADATA) is
enabled, the GENEVE device creates both IPv4 and IPv6 sockets
and bind()s them to wildcard addresses.

The next patch allows creating only one socket bound to a
specific address even when the collect metadata mode is
enabled.

Then, we need a flag to distinguish dualstack GENEVE devices
to detect local address conflict.

Let's add the dualstack flag to struct geneve_config.

IFLA_GENEVE_COLLECT_METADATA processing is moved up in
geneve_nl2info() for the next patch to overwrite dualstack
to false while keeping collect_md true.

Note that IFLA_GENEVE_REMOTE and IFLA_GENEVE_REMOTE6 does not
set cfg->dualstack to false since is_tnl_info_zero() ignores
the wildcard remote address:

  # ip link add geneve0 type geneve external remote 0.0.0.1
  Error: Device is externally controlled, so attributes (VNI, Port, and so on) must not be specified.

  # ip link add geneve0 type geneve external remote 0.0.0.0
  # ss -ua | grep geneve
  UNCONN 0      0            0.0.0.0:geneve      0.0.0.0:*
  UNCONN 0      0                  *:geneve            *:*

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260602190436.139591-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

geneve: Pass struct geneve_dev to geneve_find_sock().

This is a prep patch to make a subsequent patch clean.

We will need to access geneve_dev->cfg.info.key.u.{ipv4,ipv6}.src
in geneve_find_sock() later and extend conditions there.

Let's pass down struct geneve from geneve_sock_add() to
geneve_find_sock() and flatten the conditional logic.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260602190436.139591-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

geneve: Pass struct geneve_dev to geneve_create_sock().

This is a prep patch to make a subsequent patch clean.

We will need to access geneve_dev->cfg.info.key.u.{ipv4,ipv6}.src
in geneve_create_sock() later.

Let's pass down struct geneve_dev from geneve_sock_add() to
geneve_create_sock() instead of individual config fields.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260602190436.139591-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

geneve: Reuse ipv6_addr_type() result in geneve_nl2info().

geneve_nl2info() calls ipv6_addr_type() to check if the remote
IPv6 address is link-local.

Then, it also calls ipv6_addr_is_multicast() for the same address.

Let's not call ipv6_addr_is_multicast() and reuse ipv6_addr_type().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260602190436.139591-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sfp: initialize i2c_block_size at adapter configure time

sfp->i2c_block_size is only assigned in sfp_sm_mod_probe(), which runs
from the state machine timer after SFP_F_PRESENT has been set. Between
those two points, sfp_module_eeprom() (the ethtool -m callback) gates
only on SFP_F_PRESENT and can be entered with i2c_block_size still at
its kzalloc'd value of 0.

On a pure-I2C adapter, sfp_i2c_read() then issues an i2c_transfer()
with msgs[1].len = 0 inside a loop that subtracts this_len from len
each iteration; on adapters that succeed a zero-length read the loop
never advances, spinning while holding rtnl_lock.

This was previously addressed by initializing i2c_block_size in
sfp_alloc() (commit 813c2dd78618), but the initialization was dropped
when i2c_block_size was split from i2c_max_block_size.

Initialize sfp->i2c_block_size from sfp->i2c_max_block_size in
sfp_i2c_configure(), so the field is valid as soon as the adapter is
known. sfp_sm_mod_probe() still reassigns it on each module insertion
to recover from a per-module clamp to 1 (sfp_id_needs_byte_io).

Fixes: 7662abf4db94 ("net: phy: sfp: Add support for SMBus module access")
Cc: stable@vger.kernel.org
Signed-off-by: Jonas Jelonek <jelonek.jonas@gmail.com>
Link: https://patch.msgid.link/20260528205242.971410-2-jelonek.jonas@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()

The TX metadata area resides in the UMEM buffer which is memory-mapped
and concurrently writable by userspace. In xsk_skb_metadata(),
csum_start and csum_offset are read from shared memory for bounds
validation, then read again for skb assignment. A malicious userspace
application can race to overwrite these values between the two reads,
bypassing the bounds check and causing out-of-bounds memory access
during checksum computation in the transmit path.

Fix this by reading csum_start and csum_offset into local variables
once, then using the local copies for both validation and assignment.

Note that other metadata fields (flags, launch_time) and the cached
csum fields may be mutually inconsistent due to concurrent userspace
writes, but this is benign: the only security-critical invariant is
that each field's validated value is the same one used, which local
caching guarantees.

Closes: https://lore.kernel.org/all/20260503200927.73EA1C2BCB4@smtp.kernel.org/
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Fixes: 48eb03dd2630 ("xsk: Add TX timestamp and TX checksum offload support")
Link: https://patch.msgid.link/20260530042630.80626-1-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: b44: use ethtool_puts

There's a subtle error with the memcpy here, where b44_gstrings should
not be dereferenced. Dereferening causes the following error with W=1:

In file included from drivers/net/ethernet/broadcom/b44.c:17:
In file included from ./include/linux/module.h:18:
In file included from ./include/linux/kmod.h:9:
In file included from ./include/linux/umh.h:4:
In file included from ./include/linux/gfp.h:7:
In file included from ./include/linux/mmzone.h:8:
In file included from ./include/linux/spinlock.h:56:
In file included from ./include/linux/preempt.h:79:
In file included from ./arch/powerpc/include/asm/preempt.h:5:
In file included from ./include/asm-generic/preempt.h:5:
In file included from ./include/linux/thread_info.h:23:
In file included from ./arch/powerpc/include/asm/current.h:13:
In file included from ./arch/powerpc/include/asm/paca.h:16:
In file included from ./include/linux/string.h:386:
./include/linux/fortify-string.h:578:4: error: call to
'__read_overflow2_field' declared with 'warning' attribute: detected read
beyond size of field (2nd parameter); maybe use>
578 | __read_overflow2_field(q_size_field, size);
| ^

Instead of fixing the memcpy, use ethtool_puts, which is the proper
helper for printing ethtool gstrings.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
Link: https://patch.msgid.link/20260531000334.388351-1-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-mlx5-add-switchdev-mode-support-for-socket-direct-single-netdev-part-1-2'

Tariq Toukan says:

====================
net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 1/2

This series enables Socket Direct single netdev to operate in switchdev
mode with shared FDB. SD single netdev combines multiple PCI functions
behind a single netdev interface. To support switchdev offloads, these
functions must participate in virtual LAG (shared FDB).

Design

Rather than introducing a separate LAG instance for SD, this series
integrates SD secondary devices into the existing LAG structure
(priv.lag) created at probe time. Each lag_func entry carries a
group_id field that identifies its SD group membership (0 means not
part of any SD group). An xarray mark (XA_MARK_PORT) distinguishes
physical port entries from SD secondaries, enabling a single unified
iterator that filters by group:

  - MLX5_LAG_FILTER_PORTS: iterate port-level entries only (existing
    behavior, used by bonding, FW LAG commands, v2p_map)
  - MLX5_LAG_FILTER_ALL: iterate all devices including SD secondaries
    (used by MPESW shared FDB across all devices)
  - specific group_id: iterate only devices in that SD group (used by
    per-group SD shared FDB operations)

Existing callers use mlx5_ldev_for_each() which maps to
MLX5_LAG_FILTER_PORTS, preserving current behavior for non-SD
configurations.

Lifecycle and ownership

The SD LAG lifecycle is tied to the SD group, not to bonding events:

1. At PCI probe, mlx5_lag_add_mdev() creates the LAG structure
   (priv.lag) for each LAG-capable PF. e.g.: SD primary devices

2. During mlx5_sd_init(), after the SD group is fully formed (primary
   and secondaries paired), sd_lag_init() registers the secondary
   devices into the primary's existing priv.lag by calling
   mlx5_ldev_add_mdev() with the SD group_id. The primary's lag_func
   also gets its group_id set. No separate LAG instance is created.

3. After all the devices in SD group transition to switchdev,
   mlx5_lag_shared_fdb_create() is invoked with the group_id to create
   a software-only shared FDB scoped to that SD group. This sets
   sd_fdb_active on all lag_func entries in the group. No FW LAG
   commands are issued since SD devices share the same physical port.

4. If MPESW (multi-port eswitch) is enabled on top of SD groups, the
   per-group SD shared FDB is torn down first, then MPESW shared FDB is
   created spanning all devices (ports + SD secondaries) using
   MLX5_LAG_FILTER_ALL. On MPESW disable, per-group SD shared FDB is
   restored.

5. On SD teardown (mlx5_sd_cleanup or device unbind), sd_lag_cleanup()
   removes secondaries from priv.lag and clears the primary's group_id.
   The LAG structure itself is not destroyed.

The sd_fdb_active flag is set on all lag_func entries in a group (not
just the primary), so any device can detect the SD shared FDB state
during lag_disable_change teardown without needing to look up peer
entries.

SD shared FDB is a pure software construct -- unlike regular LAG modes
(ROCE, SRIOV, MPESW), it does not issue FW create_lag/destroy_lag
commands. The software vport LAG for SD is implemented via eswitch
egress ACL bounce rules, managed by the IB layer through
mlx5_eth_lag_init(). And the software LAG demux is implemented via
steering rules that utilize new destination, VHCA_RX.

Patches

Infrastructure (patches 1, 5-6):
  - Factor out shared FDB code into a dedicated file
  - Extend lag_func with group_id and sd_fdb_active fields;
    add XA_MARK_PORT and unified iterator with group_id filter
  - Extend shared FDB API with group_id parameter

E-Switch preparation (patches 2-3):
  - Align eswitch disable sequence ordering
  - Move devcom init from TC to eswitch layer

SD group management (patches 4, 7-9):
  - Replace peer count check with direct peer lookup
  - Register SD secondaries in the existing LAG at SD init time
  - Block RoCE and VF LAG for SD devices
  - Block multipath LAG for SD devices

Switchdev integration (patch 10):
  - Keep netdev resources local in switchdev mode

Steering (patches 11-12):
  - Track peer flow slots with bitmap for selective peer flow deletion
  - Enable TC flow steering for SD LAG

Enablement (patch 13):
  - Verify unique vhca_id count for cross-VHCA RQT
====================

Link: https://patch.msgid.link/20260531113954.395443-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Verify unique vhca_id count instead of range

Change verify_num_vhca_ids() to count the number of unique vhca_ids
and verify this count doesn't exceed max_num_vhca_id, rather than
validating individual vhca_id values are within a specific range.

The previous implementation checked if each vhca_id was in the range
[0, max_num_vhca_id - 1], which is overly restrictive. The hardware
capability max_rqt_vhca_id represents the maximum number of unique
vhca_ids that can be used, not a range constraint on individual IDs.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-14-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: TC, enable steering for SD LAG

Enable TC flow steering for SD LAG mode by extending multiport
eligibility checks and peer flow handling.

SD LAG operates similarly to MPESW for TC offloads - flows on
secondary devices need peer flow creation on the primary, and
multiport forwarding rules are eligible when either MPESW or SD LAG
is active.

Add mlx5_lag_is_sd() helper to query SD LAG mode, and
mlx5_sd_is_primary() to identify the primary device. Redirect uplink
priv/proto_dev queries to the primary device's eswitch in SD
configurations.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-13-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: TC, track peer flow slots with bitmap

With SD devices joining the LAG, peer flows are not created for all
devcom peers - SD devices skip peers that belong to a different SD
group. However, the delete path iterated all devcom peers
unconditionally, attempting to delete from slots that were never
populated.

Track which peer slots are populated using a bitmap in mlx5e_tc_flow.
The delete path now iterates only set bits, matching exactly the slots
that were set up during flow creation.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-12-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: SD, keep netdev resources on same PF in switchdev mode

In SD switchdev mode, network device resources such as channels and
completion vectors must remain on the same PF rather than being
distributed across SD group members.

Modify mlx5_sd_ch_ix_get_dev_ix() to return 0 and
mlx5_sd_ch_ix_get_vec_ix() to return the channel index directly when
in switchdev mode, keeping resources local to the requesting PF.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-11-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: LAG, block multipath LAG for SD devices

SD devices are not compatible with multipath LAG since they use
dedicated SD LAG for cross-socket connectivity. Add an SD check
to the multipath prereq validation to prevent multipath LAG
activation on SD-configured ports.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-10-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: LAG, block RoCE and VF LAG for SD devices

Socket Direct devices manage their own LAG via SD LAG infrastructure.
Block the standard netdev-event-driven LAG path (RoCE LAG and VF LAG)
for SD devices to prevent conflicting LAG configurations.

Expose mlx5_sd_is_supported() as a public helper that encapsulates all
SD eligibility checks. Use it in mlx5_lag_dev_alloc() to skip netdev
notifier registration for SD-capable devices at alloc time. Some sd
code is reordered to expose the new function, no logic is changed.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-9-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: SD, introduce Socket Direct LAG

Register SD secondary devices with the existing LAG structure by
adding them to the primary's ldev xarray with a shared group_id.
This ties the SD LAG lifecycle to the SD group lifecycle.

Add sd_lag_state debugfs entry for LAG state visibility. To avoid
race between this entry and LAG deletion, have debugfs creation
and deletion done last on SD init and first on SD cleanup.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-8-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: LAG, extend shared FDB API with group_id filter

Add a group_id parameter to mlx5_lag_shared_fdb_create() and
mlx5_lag_shared_fdb_destroy() to scope shared FDB operations to a
specific SD group.

When group_id is U32_MAX, the functions operate on all LAG devices. When
group_id is non-zero, they operate only on devices in that SD group
without issuing FW LAG commands, since SD LAG is a pure software
construct.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-7-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: LAG, prepare for SD device integration

Socket Direct (SD) secondaries devices will participate in LAG, even
though they are silent. SD secondary devices share the same physical
port as their primary but are separate PCI functions that need to be
tracked alongside regular LAG ports.

Extend lag_func with a group_id field to identify SD group membership
and introduce a unified iterator that can filter by group. Add APIs
for registering SD secondary devices in an existing LAG.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: LAG, replace peer count check with direct peer lookup

Replace mlx5_eswitch_get_npeers() count-based check with a new
mlx5_eswitch_is_peer() function that directly verifies the peer
relationship between two eswitches.

This change prepares for SD LAG support, which is a virtual LAG that
does not have num_lag_ports capability and cannot use the count-based
peer validation.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: E-Switch, move devcom init from TC to eswitch layer

Move the E-swtich devcom component management from TC layer to ESW
layer.

This refactoring places devcom lifecycle management at the appropriate
layer and prepares for SD LAG which needs devcom registration
independent of the TC/representor initialization.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: E-Switch, align disable sequence with switchdev-to-legacy transition

This patch align the eswitch disable sequence with the
switchdev-to-legacy mode transition, where eswitch must be disabled
before device detachment. The consistent ordering is required for proper
SD LAG cleanup which depends on eswitch state during teardown.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: LAG, factor out shared FDB code into dedicated file

Refactor shared FDB LAG logic into a new lag/shared_fdb.c file to
improve code organization and enable reuse. Move shared FDB specific
functions from lag.c and introduce consolidated APIs:
- mlx5_lag_shared_fdb_create() handles LAG activation with shared FDB
- mlx5_lag_shared_fdb_destroy() handles LAG deactivation with shared FDB

Update mlx5_do_bond(), mlx5_disable_lag() and mpesw.c to use the new
APIs, which simplifies the shared FDB code paths.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260531113954.395443-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mm/mincore: handle non-swap entries before !CONFIG_SWAP guard

mincore_swap() also fields migration/hwpoison entries (and shmem
swapin-error entries), which can exist on !CONFIG_SWAP builds when
CONFIG_MIGRATION or CONFIG_MEMORY_FAILURE is enabled. The
!IS_ENABLED(CONFIG_SWAP) guard ran before the non-swap-entry early return,
so mincore_pte_range() can spuriously WARN and report these pages
nonresident on !CONFIG_SWAP kernels.

Move the guard below the non-swap-entry check so only true swap entries
trip the WARN, and migration/hwpoison entries take the existing "uptodate
/ non-shmem" path.

Link: https://lore.kernel.org/20260602172247.279421-1-usama.arif@linux.dev
Fixes: 1f2052755c15 ("mm/mincore: use a helper for checking the swap cache")
Signed-off-by: Usama Arif <usama.arif@linux.dev>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Baoquan He <baoquan.he@linux.dev>
Cc: Chris Li <chrisl@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>