git.ipfire.org Git - thirdparty/kernel/linux.git/log
10 days ago  selftests: netfilter: add phony nft_offload test
Florian Westphal [Fri, 12 Jun 2026 09:22:09 +0000 (11:22 +0200)] 
selftests: netfilter: add phony nft_offload test

... "phony", because it's not testing offloads, it tests the control
plane code.  Also test error unwind via the fault injection framework.

For a proper test, real hardware would be required, given that we'd have
to check whether 'previously handed off to hardware' offload commands are
properly removed again on failure or rule flush.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260612092209.11966-3-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 days ago  netdevsim: tc: allow to test nf_tables offload control plane code
Florian Westphal [Fri, 12 Jun 2026 09:22:08 +0000 (11:22 +0200)] 
netdevsim: tc: allow to test nf_tables offload control plane code

The actual 'offload' is phony, all commands are ignored: this is only
useful to test control plane code.

Tag the existing callback to permit error injection to test rollback/abort
code in nf_tables.  This is also for fuzzers - the fault injection
framework allows probabilistic error insertion.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260612092209.11966-2-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 days ago  net: airoha: Fix error handling in airoha_ppe_flush_sram_entries()
Wayen.Yan [Fri, 12 Jun 2026 09:37:00 +0000 (17:37 +0800)] 
net: airoha: Fix error handling in airoha_ppe_flush_sram_entries()

In airoha_ppe_flush_sram_entries(), the outer "err" variable was never
updated when the inner loop variable shadowed it, causing the function
to always return 0 even when airoha_ppe_foe_commit_sram_entry() fails.

Drop the outer "err" variable and return directly on error, propagating
the error code from airoha_ppe_foe_commit_sram_entry() correctly.

Fixes: 620d7b91aadb ("net: airoha: ppe: Flush PPE SRAM table during PPE setup")
Link: https://lore.kernel.org/netdev/6a2b40e4.4dd82583.3a5c46.e52f@mx.google.com/
Signed-off-by: Wayen.Yan <win847@gmail.com>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/6a2bd37a.4034e349.1b41bb.1caf@mx.google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
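
The shadowing pattern described in this commit can be reproduced in a few lines of plain C. This is a minimal userspace sketch, not the driver code: commit_entry(), flush_shadowed() and flush_fixed() are hypothetical names, and commit_entry() only stands in for airoha_ppe_foe_commit_sram_entry().

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for airoha_ppe_foe_commit_sram_entry():
 * fails for one chosen index, succeeds otherwise. */
static int commit_entry(int idx, int bad_idx)
{
	return idx == bad_idx ? -EIO : 0;
}

/* Buggy shape: the inner "err" shadows the outer one, so the outer
 * variable is never updated and the function always returns 0. */
static int flush_shadowed(int n, int bad_idx)
{
	int err = 0;

	for (int i = 0; i < n; i++) {
		int err = commit_entry(i, bad_idx);	/* shadows outer err */

		if (err)
			break;
	}
	return err;
}

/* Fixed shape: no outer variable, return directly on error. */
static int flush_fixed(int n, int bad_idx)
{
	for (int i = 0; i < n; i++) {
		int err = commit_entry(i, bad_idx);

		if (err)
			return err;
	}
	return 0;
}
```

Building with -Wshadow flags the inner declaration in flush_shadowed(), which is one way to catch this class of bug early.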
10 days ago  MAINTAINERS: Update Coly Li's email address
Coly Li [Sat, 13 Jun 2026 15:04:58 +0000 (23:04 +0800)] 
MAINTAINERS: Update Coly Li's email address

I have switched to colyli@fygo.io as my current email address.

Signed-off-by: Coly Li <colyli@fygo.io>
Link: https://patch.msgid.link/20260613150458.682707-1-colyli@fygo.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>
10 days ago  Merge tag 'core-urgent-2026-06-13' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 13 Jun 2026 15:23:36 +0000 (08:23 -0700)] 
Merge tag 'core-urgent-2026-06-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull debugobjects fix from Ingo Molnar:

 - Fix potential debugobjects deadlock on PREEMPT_RT kernels (Waiman
   Long)

* tag 'core-urgent-2026-06-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  debugobjects: Don't call fill_pool() in early boot hardirq context

10 days ago  Merge tag 'i2c-for-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa...
Linus Torvalds [Sat, 13 Jun 2026 15:14:17 +0000 (08:14 -0700)] 
Merge tag 'i2c-for-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

Pull i2c fixes from Wolfram Sang:
 "The biggest news here is that this is my last pull request as I2C
  maintainer after 13.5 years. Starting with the 7.2 cycle, Andi Shyti,
  who has helped me greatly with maintaining the host drivers for a
  while now, is taking over. Thank you, Andi, and good luck with the
  subsystem. I'll be around for help, of course.

  Technically, there are two patches which might be a tad large for this
  late in the cycle, but most of that is explanatory comments, so I
  think they are suitable.

   - MAINTAINERS:
      - hand over I2C maintainership to Andi
      - minor updates

   - rust: fix I2cAdapter refcount double increment

   - imx: keep clock and pinctrl states consistent in runtime PM

   - imx-lpi2c: fix DMA resource leaks on PIO fallback

   - qcom-cci: fix NULL pointer dereference on remove

   - riic: fix reset refcount leak on resume_noirq error path

   - stm32f7: account for analog filter in timing computation

   - tegra:
      - fix suspend/resume handling in NOIRQ phase
      - update Tegra410 I2C timings to match hardware specs"

* tag 'i2c-for-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  dt-bindings: i2c: mux-gpio: name correct maintainer
  MAINTAINERS: hand over I2C to Andi Shyti
  i2c: imx-lpi2c: fix resource leaks switching to devm_dma_request_chan()
  MAINTAINERS: i2c: designware: Remove inactive reviewer
  i2c: tegra: Fix NOIRQ suspend/resume
  i2c: tegra: Update Tegra410 I2C timing parameters
  i2c: qcom-cci: Fix NULL pointer dereference in cci_remove()
  i2c: stm32f7: fix timing computation ignoring i2c-analog-filter
  i2c: imx: fix clock and pinctrl state inconsistency in runtime PM
  i2c: riic: fix refcount leak in riic_i2c_resume_noirq()
  rust: i2c: fix I2cAdapter refcounts double increment

10 days ago  Merge tag 'timers-v7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/daniel...
Thomas Gleixner [Sat, 13 Jun 2026 14:24:29 +0000 (16:24 +0200)] 
Merge tag 'timers-v7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/daniel.lezcano/linux into timers/clocksource

Pull clocksource/driver updates from Daniel Lezcano:

  - Remove the sifive,fine-ctr-bits property bindings because it is
    redundant information (Nick Hu)

  - Remove the TCIU8 interrupt bindings on Renesas because the
    documentation marks the interrupt as reserved, so it should not be
    described, and fix the conditional reset line for the RZ/{T2H,N2H}
    (Cosmin Tanislav)

  - Add the StarFive JHB100 clint DT bindings compatible string (Ley
    Foon Tan)

  - Extend schema condition for interrupts to cover D1 compatible
    variant and add the D1 hstimer support (Michal Piekos)

  - Update the ARM architected timer support to handle the ACPI GTDT v3
    format and the EL2 virtual timer, enabling Linux to use the most
    appropriate timer when running with VHE, while also fixing several
    Device Trees to accurately reflect the underlying hardware (Marc
    Zyngier)

  - Cleanup and add the clocksource and the clockevent in the TI DM
    timer (Markus Schneider-Pargmann)

  - Add the multiple watchdogs support in the tegra186 and
    tegra234. Dedicate one as a kernel watchdog (Kartik Rajput)

  - Add the NXP clocksource selection for the scheduler in the Kconfig
    (Enric Balletbo i Serra)

Link: https://lore.kernel.org/all/1e55e8d6-8024-4f17-8620-ab3385465d76@oss.qualcomm.com
10 days ago  posix-cpu-timers: Fix pid refcount leak in do_cpu_nanosleep() error path
WenTao Liang [Thu, 11 Jun 2026 16:17:38 +0000 (00:17 +0800)] 
posix-cpu-timers: Fix pid refcount leak in do_cpu_nanosleep() error path

In do_cpu_nanosleep(), posix_cpu_timer_create() takes a pid reference
via get_pid() and stores it in timer.it.cpu.pid. If the subsequent
posix_cpu_timer_set() call fails, the function returns immediately
without calling posix_cpu_timer_del() to release the pid reference,
causing a leak.

Fix it by calling posix_cpu_timer_del() before the unlock-and-return
on the error path, consistent with the other exit paths in the same
function.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: WenTao Liang <vulab@iscas.ac.cn>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260611161738.97043-1-vulab@iscas.ac.cn
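
The leak and its fix can be modelled in userspace with a plain counter. This is only a sketch under stated assumptions: struct pid_sketch and the *_sketch() helpers are hypothetical stand-ins for the pid reference, posix_cpu_timer_create(), posix_cpu_timer_set() and posix_cpu_timer_del(), and locking is left out entirely.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for struct pid: only the refcount matters here. */
struct pid_sketch {
	int refs;
};

/* Models posix_cpu_timer_create(): takes a reference, like get_pid(). */
static int timer_create_sketch(struct pid_sketch *pid)
{
	pid->refs++;
	return 0;
}

/* Models posix_cpu_timer_set(); "fail" forces the error path. */
static int timer_set_sketch(int fail)
{
	return fail ? -EINVAL : 0;
}

/* Models posix_cpu_timer_del(): drops the reference taken at create. */
static void timer_del_sketch(struct pid_sketch *pid)
{
	assert(pid->refs > 0);
	pid->refs--;
}

/* Buggy shape: the early return skips the release. */
static int nanosleep_leaky(struct pid_sketch *pid, int fail_set)
{
	int err = timer_create_sketch(pid);

	if (err)
		return err;
	err = timer_set_sketch(fail_set);
	if (err)
		return err;		/* reference leaked here */
	timer_del_sketch(pid);
	return 0;
}

/* Fixed shape: the error path releases before returning. */
static int nanosleep_fixed(struct pid_sketch *pid, int fail_set)
{
	int err = timer_create_sketch(pid);

	if (err)
		return err;
	err = timer_set_sketch(fail_set);
	if (err) {
		timer_del_sketch(pid);
		return err;
	}
	timer_del_sketch(pid);
	return 0;
}
```

Starting from a count of 1, a failing set leaves the count at 2 in the leaky variant and at 1 in the fixed one.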
10 days ago  x86/irq: Add missing 's' back to thermal event printout
Thomas Gleixner [Sat, 13 Jun 2026 13:31:03 +0000 (15:31 +0200)] 
x86/irq: Add missing 's' back to thermal event printout

The /proc/interrupts handling rework dropped an 's' in the thermal event
printout, which breaks the thermal test in the Intel LKVS suite.

Bring the important letter back.

Fixes: 2b57c69917ee ("x86/irq: Make irqstats array based")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Closes: https://lore.kernel.org/oe-lkp/202606121325.97b29701-lkp@intel.com
10 days ago  time/jiffies: Register jiffies clocksource before usage
Thomas Gleixner [Tue, 9 Jun 2026 15:14:45 +0000 (17:14 +0200)] 
time/jiffies: Register jiffies clocksource before usage

Teddy reported that a Xen HVM guest has a long boot delay, which was bisected to
the recent enhancements to the negative motion detection. It turned out
that the jiffies clocksource is used in early boot before it is registered,
which leaves the max_delta_raw field at zero. That causes the read out to
be clamped to the max delta of 0, which means time is not making progress.

Cure it by ensuring that it is initialized before its first usage in
timekeeping_init().

Fixes: 76031d9536a0 ("clocksource: Make negative motion detection more robust")
Reported-by: Teddy Astie <teddy.astie@vates.tech>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Teddy Astie <teddy.astie@vates.tech>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/87y0gn3fve.ffs@fw13
Closes: https://lore.kernel.org/all/1780914594.8631fc262581453bbf619ec5b2062170.19ea6c8227b000701b@vates.tech
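
Why a zero maximum delta stalls time can be seen with a toy model. This is a deliberately simplified sketch and not the timekeeping code: clamp_delta() and elapsed_after() are hypothetical helpers, and the real readout path does considerably more than cap a value.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model: a clocksource readout delta is capped at max_delta. */
static uint64_t clamp_delta(uint64_t delta, uint64_t max_delta)
{
	return delta > max_delta ? max_delta : delta;
}

/* Accumulate "ticks" readouts of "step" cycles each and return the
 * total amount of time that was accounted. */
static uint64_t elapsed_after(unsigned int ticks, uint64_t step,
			      uint64_t max_delta)
{
	uint64_t total = 0;

	for (unsigned int i = 0; i < ticks; i++)
		total += clamp_delta(step, max_delta);
	return total;
}
```

With max_delta left at zero, every readout contributes nothing, so elapsed_after() stays at zero no matter how many ticks pass. A non-zero limit lets the same readouts accumulate.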
10 days ago  hwmon: tmp401: Read "ti,n-factor" as signed
Rob Herring (Arm) [Fri, 12 Jun 2026 21:53:32 +0000 (16:53 -0500)] 
hwmon: tmp401: Read "ti,n-factor" as signed

The "ti,n-factor" binding and examples allow negative correction
values. Reading it as u32 makes the helper type disagree with the
documented signed value and hides real schema mismatches.

Use the signed helper so the DT access matches the s32 value stored by
the driver.

Assisted-by: Codex:gpt-5-5
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://lore.kernel.org/r/20260612215332.1889497-1-robh@kernel.org
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
10 days ago  io_uring/bpf-ops: add a separate maintainer entry
Pavel Begunkov [Fri, 12 Jun 2026 17:36:22 +0000 (18:36 +0100)] 
io_uring/bpf-ops: add a separate maintainer entry

Add a maintainer entry for io_uring bpf struct_ops related files.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/d89f3b89e77b09a18daa45476fd1a40f2ee253cd.1780930463.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
10 days ago  block: check bio split for unaligned bvec
Keith Busch [Fri, 12 Jun 2026 22:32:04 +0000 (15:32 -0700)] 
block: check bio split for unaligned bvec

Offsets and lengths need to be validated against the dma alignment. This
check was skipped for a sufficiently small bio with a single bvec, which
may allow an invalid request to be dispatched to the driver. Force the
validation for an unaligned bvec by forcing the bio split path that
handles this condition.

Fixes: 7eac33186957 ("iomap: simplify direct io validity check")
Fixes: 5ff3f74e145a ("block: simplify direct io validity check")
Reported-by: Carlos Maiolino <cem@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://patch.msgid.link/20260612223205.465913-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
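
The decision described in this commit can be sketched as a small predicate. Everything here is an assumption made for illustration: the function names are hypothetical, the alignment is taken to be a mask (alignment minus one), and small_limit is an invented threshold for "sufficiently small". The block layer's real logic is more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A bvec is aligned when neither its offset nor its length has any of
 * the mask bits set. */
static bool bvec_is_aligned(uint32_t offset, uint32_t len, uint32_t align_mask)
{
	return ((offset | len) & align_mask) == 0;
}

/* Old shape: a small single-bvec bio bypasses the split path, and with
 * it the alignment validation. */
static bool takes_split_path_old(unsigned int nr_bvecs, uint32_t len,
				 uint32_t small_limit)
{
	return !(nr_bvecs == 1 && len <= small_limit);
}

/* New shape: an unaligned bvec always goes through the split path, so
 * the validation there gets a chance to reject it. */
static bool takes_split_path_new(unsigned int nr_bvecs, uint32_t offset,
				 uint32_t len, uint32_t small_limit,
				 uint32_t align_mask)
{
	if (!bvec_is_aligned(offset, len, align_mask))
		return true;
	return takes_split_path_old(nr_bvecs, len, small_limit);
}
```

For a 512-byte bvec at offset 3 with a mask of 511, the old predicate skips the split path while the new one forces it.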
10 days ago  nbd: Reclassify sockets to avoid lockdep circular dependency
Eric Dumazet [Sat, 13 Jun 2026 04:26:19 +0000 (04:26 +0000)] 
nbd: Reclassify sockets to avoid lockdep circular dependency

syzbot reported a possible circular locking dependency in udp_sendmsg()
where fs_reclaim can be triggered while holding sk_lock, and fs_reclaim
can eventually depend on another sk_lock (e.g., if NBD is used for swap
or writeback and NBD uses TLS/TCP which acquires sk_lock).

Since the UDP socket and the NBD TCP/TLS socket are different, this is a
false positive. Fix this by reclassifying NBD sockets to a separate lock
class when they are added to the NBD device.

This is similar to what nvme-tcp and other network block devices do.

Fixes: ffa1e7ada456 ("block: Make request_queue lockdep splats show up earlier")
Reported-by: syzbot+607cdcf978b3e79da878@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6a2cdafe.428ffe26.258b27.0161.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260613042619.1108126-1-edumazet@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
10 days ago  io_uring/net: make POLL_FIRST receive side checks consistent
Jens Axboe [Sat, 30 May 2026 02:03:47 +0000 (20:03 -0600)] 
io_uring/net: make POLL_FIRST receive side checks consistent

io_recv() and io_recvzc() are the odd ones out, as they check whether
POLL_FIRST should be honored before checking if the file is a
socket. It doesn't really matter, but might as well make it consistent
across all receive and send types.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
10 days ago  io_uring: remove the per-ctx fallback task_work machinery
Jens Axboe [Thu, 11 Jun 2026 17:44:47 +0000 (11:44 -0600)] 
io_uring: remove the per-ctx fallback task_work machinery

With the tctx fallback running its entries directly, the per-ctx
fallback work has a single user left: moving local (DEFER_TASKRUN)
task_work entries out of a ring that is going away. Both of its call
sites are process context and don't hold ->uring_lock, the same
conditions the deferred fallback work itself ran under - so run the
entries in cancel mode right there instead, and rename the helper to
io_cancel_local_task_work() to match what it now does.

With that, ->fallback_llist, ->fallback_work, io_fallback_req_func()
and __io_fallback_tw() can all go away, along with the fallback work
flushing in the ring exit and cancel paths. Requests that get
orphaned by an exiting task now run via the tctx fallback work, which
the ring exit side implicitly waits on through the ctx refs those
requests hold.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
10 days ago  io_uring: run the tctx task_work fallback directly
Jens Axboe [Thu, 11 Jun 2026 17:41:25 +0000 (11:41 -0600)] 
io_uring: run the tctx task_work fallback directly

The fallback work drains the tctx queue only to redistribute the entries
into the per-ctx fallback lists, bouncing them through a second
(per-ctx) work item before they finally run. That made sense when the
producer side did the draining and could be in any context, but the
fallback work is a regular process context kworker: it can just run the
entries itself. Reuse the normal run loop - if run from the fallback
kernel thread, ts.cancel will get set, and the work terminated.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
10 days ago  io_uring: switch normal task_work to a mpscq
Jens Axboe [Thu, 11 Jun 2026 16:13:22 +0000 (10:13 -0600)] 
io_uring: switch normal task_work to a mpscq

Like the local task_work list, the normal (tctx) task_work list is an
llist, and hence needs the O(n) llist_reverse_order() pass before
running entries in queue order. On top of that, capped runs - sqpoll
processing IORING_TW_CAP_ENTRIES_VALUE entries at a time - need the
claimed-but-unprocessed leftovers carried in a separate retry_list,
as they can't be pushed back to the shared list.

Switch tctx->task_list to a mpscq, like what was done for the
DEFER_TASKRUN paths as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
10 days ago  io_uring: switch local task_work to a mpscq
Jens Axboe [Wed, 10 Jun 2026 21:19:35 +0000 (15:19 -0600)] 
io_uring: switch local task_work to a mpscq

The local (DEFER_TASKRUN) task_work list is an llist, which is LIFO
ordered, and hence __io_run_local_work() has to restore the right
running order with an O(n) llist_reverse_order() pass first. On top of
that, a batch that gets capped by max_events needs the leftover entries
parked on a separate ->retry_llist, as they can't be pushed back to the
shared list.

Switch it to the FIFO mpscq. Adds are wait-free instead of a cmpxchg
retry loop, entries are popped in queue order with no reversal pass,
capping a run simply leaves the remainder on the queue, and
->retry_llist goes away entirely. The consumer cursor, ->work_head,
lives with the rest of the ->uring_lock protected state rather than
next to the queue, so that popping entries doesn't dirty the producer
side cacheline.
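
The reversal cost mentioned above comes from the shape of a LIFO list. The following single-threaded sketch shows it with hypothetical names (struct lnode, lifo_push(), reverse_order()). It has none of the atomics of the kernel's llist and only illustrates the ordering problem.

```c
#include <assert.h>
#include <stddef.h>

struct lnode {
	struct lnode *next;
	int id;
};

/* Push at the head: the most recently added entry comes out first. */
static void lifo_push(struct lnode **head, struct lnode *n)
{
	n->next = *head;
	*head = n;
}

/* O(n) pass that restores queue order, in the spirit of
 * llist_reverse_order(). */
static struct lnode *reverse_order(struct lnode *head)
{
	struct lnode *new_head = NULL;

	while (head) {
		struct lnode *tmp = head;

		head = head->next;
		tmp->next = new_head;
		new_head = tmp;
	}
	return new_head;
}
```

Pushing entries 1, 2 and 3 leaves 3 at the head, and only the extra pass brings 1 back to the front. A FIFO queue avoids that pass because entries are already linked in arrival order.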

For low amounts of task_work, this ends up being a bit more efficient
than the existing scheme. As an example of that, doing multishot
receives for 8 clients has the following task_work overhead:

     1.02%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     0.88%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work_loop
     0.60%  sock-test  [kernel.kallsyms]  [k] llist_reverse_order
     0.14%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     2.64% at ~46Gb/sec

and after this change:

     1.08%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     1.03%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     2.11% at ~53Gb/sec

which has less overhead even though that test run was faster. For a case
of having 1024 clients on a single ring:

     2.22%  sock-test  [kernel.kallsyms]  [k] llist_reverse_order
     0.84%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work_loop
     0.42%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     0.02%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     3.50% at ~24Gb/sec

we start to see the llist reversing taking a considerable amount of
time, and the total add+run task_work overhead is around 3.5%. After
the change:

     0.90%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     0.42%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     1.32% at ~26Gb/sec

most of that overhead is gone, and performance is better as well.

Caleb Sander Mateos <csander@purestorage.com> reports that it improves
the performance of a ublk 4kb workload by 4% [1], while testing v1 of
this patchset.

[1] https://lore.kernel.org/io-uring/CADUfDZr-MMYBaP-e+y9+xuRhuiunO2sBTUCmwZyd7AgT8sVtiQ@mail.gmail.com/

Signed-off-by: Jens Axboe <axboe@kernel.dk>
10 days ago  io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue
Jens Axboe [Wed, 10 Jun 2026 21:19:15 +0000 (15:19 -0600)] 
io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue

Local task_work is currently using llists for managing the work,
but that's a LIFO type of list. This means that running this task_work
needs to reverse the list first, to ensure fairness in running the
queued items.

Add a lockless FIFO queue, based on Dmitry Vyukov's intrusive MPSC
node-based queue algorithm, modified with an externally held consumer
cursor and conditional stub reinsertion. See comments in the header.

Producers are wait-free: a push is a single xchg() on the queue tail,
which serializes concurrent producers and defines the FIFO order, plus
a store linking the node to its predecessor. There are no cmpxchg retry
loops, and pushing is safe from any context, including hardirq.

The cost of linked list FIFO ordering is that a push publishes the node
in two steps - the xchg() makes it visible as the new tail before the
subsequent store links it into the chain that is reachable from the
head. A consumer hitting that window gets a NULL from mpscq_pop() while
mpscq_empty() reports false, and must retry later rather than treat the
queue as empty. The window is two instructions wide, but a producer can
get preempted inside it, so the consumer must not busy wait on it.

The consumer side supports a single consumer at a time, with callers
providing their own serialization. A stub node, which also defines the
empty state (tail == stub), allows the consumer to detach the final
node without racing against producer link stores: that node is only
handed out once the stub has been cmpxchg'ed back in as the tail. This
also guarantees that the previous tail returned by mpscq_push() cannot
get freed before that push has linked it, making it always valid for
comparisons.

The consumer cursor is deliberately not part of the queue struct - the
caller owns it and passes it to mpscq_pop(). This is done to keep the
consumer and producer cachelines separate. The cursor is written for every
popped entry, and keeping it on the same cacheline as ->tail would have
the consumer invalidating the line that producers need for every push.
Keeping it external lets the caller place it with its own consumer side
data instead.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
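
For readers unfamiliar with the underlying algorithm, here is a minimal C11 sketch of the classic Vyukov intrusive MPSC queue with a caller-held cursor. It is not the kernel's mpscq: all sk_* names are hypothetical, and the stub is reinserted with a plain push as in the classic algorithm rather than with the cmpxchg-based conditional reinsertion described above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sk_node {
	_Atomic(struct sk_node *) next;
	int value;
};

struct sk_queue {
	_Atomic(struct sk_node *) tail;	/* producers exchange this */
	struct sk_node stub;		/* tail == &stub means empty */
};

static void sk_init(struct sk_queue *q, struct sk_node **cursor)
{
	atomic_init(&q->stub.next, NULL);
	atomic_init(&q->tail, &q->stub);
	*cursor = &q->stub;		/* the consumer cursor lives with the caller */
}

/* Wait-free push: one exchange on the tail, then one linking store. */
static void sk_push(struct sk_queue *q, struct sk_node *n)
{
	struct sk_node *prev;

	atomic_store_explicit(&n->next, NULL, memory_order_relaxed);
	prev = atomic_exchange_explicit(&q->tail, n, memory_order_acq_rel);
	/* Window: n is the tail but not yet reachable from the cursor. */
	atomic_store_explicit(&prev->next, n, memory_order_release);
}

/* Single consumer only. NULL means empty, or a push is mid-flight. */
static struct sk_node *sk_pop(struct sk_queue *q, struct sk_node **cursor)
{
	struct sk_node *head = *cursor;
	struct sk_node *next = atomic_load_explicit(&head->next,
						    memory_order_acquire);

	if (head == &q->stub) {		/* skip over the stub */
		if (!next)
			return NULL;
		*cursor = head = next;
		next = atomic_load_explicit(&head->next, memory_order_acquire);
	}
	if (next) {
		*cursor = next;
		return head;
	}
	/* head has no successor: either it is the tail or a push is unlinked */
	if (head != atomic_load_explicit(&q->tail, memory_order_acquire))
		return NULL;
	sk_push(q, &q->stub);		/* reinsert stub so head can detach */
	next = atomic_load_explicit(&head->next, memory_order_acquire);
	if (next) {
		*cursor = next;
		return head;
	}
	return NULL;
}
```

A single-threaded run pushing three nodes and popping them back returns them in push order, and a further pop returns NULL. That exercises only the ordering and the stub handling, not the concurrent window between the exchange and the linking store.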
10 days ago  io_uring: grab RCU read lock marking task run
Jens Axboe [Fri, 12 Jun 2026 02:27:22 +0000 (20:27 -0600)] 
io_uring: grab RCU read lock marking task run

Not required right now, as io_req_local_work_add() already calls this
helper with the RCU read lock held. But in preparation for that not
being the case, grab it locally.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
10 days ago  Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net...
Paolo Abeni [Sat, 13 Jun 2026 09:50:31 +0000 (11:50 +0200)] 
Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2026-06-09 (idpf, ixgbe, igc)

Przemyslaw adds needed padding to idpf PTP structures to match firmware
expectations.

Larysa bypasses XPS configuration on XDP queues for ixgbe.

Khai Wen corrects the offset into the packet buffer when handling frame
preemption on igc.

* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
  igc: skip RX timestamp header for frame preemption verification
  ixgbe: do not configure xps for XDP queues
  idpf: add padding to PTP virtchnl structures
====================

Link: https://patch.msgid.link/
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 days ago  octeontx2-af: npc: Fix size of entry2cntr_map
Ratheesh Kannoth [Wed, 10 Jun 2026 02:23:44 +0000 (07:53 +0530)] 
octeontx2-af: npc: Fix size of entry2cntr_map

KASAN prints the splat below. It is caused by allocating a counter for
the reserved mcam entry used for the cpt 2nd pass entry, while
mcam->entry2cntr_map is not allocated for reserved entries.

BUG: KASAN: slab-out-of-bounds in npc_map_mcam_entry_and_cntr+0xb0/0x1a0
Write of size 2 at addr ffff0001033e7ffe by task kworker/0:1/14

CPU: 0 PID: 14 Comm: kworker/0:1 Not tainted 6.1.67 #1
Hardware name: Marvell CN106XX board (DT)
Workqueue: events work_for_cpu_fn
Call trace:
 dump_backtrace.part.0+0xe4/0xf0
 show_stack+0x18/0x30
 dump_stack_lvl+0x88/0xb4
 print_report+0x154/0x458
 kasan_report+0xb8/0x194
 __asan_store2+0x7c/0xa0
 npc_map_mcam_entry_and_cntr+0xb0/0x1a0
 rvu_mbox_handler_npc_mcam_write_entry+0x268/0x280
 npc_install_flow+0x840/0xfe0
 rvu_npc_install_cpt_pass2_entry+0x138/0x190
 rvu_nix_init+0x148c/0x2880
 rvu_probe+0x1800/0x30b0
 local_pci_probe+0x78/0xe0
 work_for_cpu_fn+0x30/0x50
 process_one_work+0x4cc/0x97c
 worker_thread+0x360/0x630
 kthread+0x1a0/0x1b0
 ret_from_fork+0x10/0x20

Fixes: 55307fcb9258 ("octeontx2-af: Add mbox messages to install and delete MCAM rules")
Cc: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260610022344.969774-1-rkannoth@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 days ago  selftests/bpf: Add arena direct-value one-past-end reject test
Woojin Ji [Fri, 12 Jun 2026 05:26:55 +0000 (14:26 +0900)] 
selftests/bpf: Add arena direct-value one-past-end reject test

BPF_MAP_TYPE_ARENA supports direct-value pseudo loads, but unlike array
maps its map value_size is zero and the valid direct-value range is the
arena mmap size, max_entries * PAGE_SIZE.

Commit 3ac1a467e376 ("bpf: Fix off-by-one boundary validation in arena
direct-value access") fixed arena_map_direct_value_addr() to reject an
offset exactly at the end of the arena mapping. Add a regression test
that loads a BPF_PSEUDO_MAP_VALUE with off == arena_size and verifies
that the verifier rejects it with the expected offset in the log.

This is intentionally kept as a userspace raw-instruction test. I tried
expressing the same BPF_PSEUDO_MAP_VALUE + off == arena_size case in
verifier_arena.c with inline assembly. The only form that produces the
desired instruction bytes uses __imm_addr(arena), but that emits
R_BPF_64_NODYLD32, which the libbpf/bpftool link step rejects. Other
register, immediate, and memory constraints either fail in the BPF
backend or lower to a normal R_BPF_64_64 load followed by an ALU add,
which does not exercise arena_map_direct_value_addr() with the boundary
offset in the second ldimm64 slot.

A legacy test_verifier fixture can express the raw instruction directly,
but it needs arena map creation, mmap, and fixup plumbing in the legacy
runner. That is more intrusive than the small prog_tests raw-instruction
test.

Use the userspace raw-instruction test, following the existing selftests
pattern used for direct map-value pseudo loads, so insns[1].imm can be
set to arena_size precisely.

Assisted-by: ChatGPT:gpt-5.5
Signed-off-by: Woojin Ji <random6.xyz@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Cc: Emil Tsalapatis <emil@etsalapatis.com>
Cc: Junyoung Jang <graypanda.inzag@gmail.com>
Link: https://lore.kernel.org/r/20260612-arena-direct-value-v1-v4-1-b81b642f5277@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
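
The boundary this test targets can be stated as a one-line range check. The sketch below is an illustration only: the function names are hypothetical, and the inclusive comparison is an assumed example of an off-by-one, since the log does not show the form of the original kernel check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Off-by-one shape: accepts an offset exactly at the end of the arena. */
static bool off_ok_inclusive(uint64_t off, uint64_t max_entries,
			     uint64_t page_size)
{
	return off <= max_entries * page_size;
}

/* Correct shape: valid offsets are [0, arena_size), where
 * arena_size = max_entries * page_size. */
static bool off_ok_exclusive(uint64_t off, uint64_t max_entries,
			     uint64_t page_size)
{
	return off < max_entries * page_size;
}
```

For a two-page arena with 4096-byte pages, offset 8192 is one past the end: the inclusive check lets it through and the exclusive check rejects it, which is the case the regression test pins down with off == arena_size.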
10 days ago  rqspinlock: Fix order in raw_res_spin_(un)lock_irq to allow schedule
Gabriele Monaco [Wed, 10 Jun 2026 09:04:29 +0000 (11:04 +0200)] 
rqspinlock: Fix order in raw_res_spin_(un)lock_irq to allow schedule

raw_res_spin_unlock_irqrestore() calls raw_res_spin_unlock() and then
restores interrupts. This means preemption is enabled (as part of
raw_res_spin_unlock()) while interrupts are still disabled, so this
cannot trigger an actual preemption.
This is inconsistent with other spinlock implementations
(raw_spin_unlock_irqrestore() and bpf_res_spin_unlock_irqrestore()
itself).

Adjust the macro to ensure interrupts are enabled before enabling
preemption, allowing a reschedule at that point. Make the same
modification in the error path of raw_res_spin_lock_irqsave().

Fixes: 101acd2e78b1 ("rqspinlock: Add macros for rqspinlock usage")
Cc: stable@vger.kernel.org
Acked-by: Arnd Bergmann <arnd@arndb.de> # asm-generic
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/20260610090431.32427-1-gmonaco@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 days ago  Merge branch 'bpf-fix-setting-retval-to-eperm-for-cgroup-hooks-not-returning-errno'
Alexei Starovoitov [Sat, 13 Jun 2026 03:33:16 +0000 (20:33 -0700)] 
Merge branch 'bpf-fix-setting-retval-to-eperm-for-cgroup-hooks-not-returning-errno'

Xu Kuohai says:

====================
bpf: Fix setting retval to -EPERM for cgroup hooks not returning errno

This series fixes the issue reported by sashiko in [1]. The issue is that,
when a cgroup BPF program exits with 0, bpf_prog_run_array_cg() sets
the hook return value to -EPERM if it is not a valid errno. This is
correct for errno-based hooks, which return 0 on success and negative
errno on failure, but wrong for void and boolean LSM hooks. Boolean
LSM hooks should only return true or false, and void LSM hooks have
no return value at all.

Fix it by skipping setting -EPERM for hooks not returning errno.

[1] https://lore.kernel.org/bpf/20260605144232.95A141F00893@smtp.kernel.org/
====================

Link: https://patch.msgid.link/20260610201724.733943-1-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 days ago  selftests/bpf: Add retval test for bool and errno LSM cgroup hooks
Xu Kuohai [Wed, 10 Jun 2026 20:17:24 +0000 (20:17 +0000)] 
selftests/bpf: Add retval test for bool and errno LSM cgroup hooks

Add a test to check the return value when a BPF program exits with 0 for
a boolean and an errno LSM hook.

For each hook, two BPF programs are attached. The first program returns
0 without calling bpf_set_retval() to exercise the return value translation
logic, while the second program reads the retval via bpf_get_retval().

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20260610201724.733943-3-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 days ago  bpf: Fix setting retval to -EPERM for cgroup hooks not returning errno
Xu Kuohai [Wed, 10 Jun 2026 20:17:23 +0000 (20:17 +0000)] 
bpf: Fix setting retval to -EPERM for cgroup hooks not returning errno

When a cgroup BPF program exits with 0, bpf_prog_run_array_cg() sets
the hook return value to -EPERM if it is not a valid errno. This is
correct for errno-based hooks, which return 0 on success and negative
errno on failure, but wrong for boolean and void LSM hooks. Boolean
LSM hooks should only return true or false, and void LSM hooks have
no return value at all.

Fix it by skipping setting -EPERM for hooks not returning errno.

Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20260610201724.733943-2-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
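
The return value translation and its fix can be sketched as follows. This is an assumed model, not the code in bpf_prog_run_array_cg(): the enum, the helper names and the errno range check are hypothetical and only mirror the behaviour the commit message describes.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical classification of what a hook returns. */
enum hook_ret_kind { HOOK_RET_ERRNO, HOOK_RET_BOOL, HOOK_RET_VOID };

/* Assumed range check in the spirit of the kernel's IS_ERR_VALUE(). */
static bool is_valid_errno(int v)
{
	return v < 0 && v >= -4095;
}

/* Old shape: a program verdict of 0 forces -EPERM for every hook
 * unless retval already holds an errno. */
static int finalize_retval_old(int verdict, int retval)
{
	if (verdict == 0 && !is_valid_errno(retval))
		return -EPERM;
	return retval;
}

/* New shape: only errno-returning hooks get the -EPERM translation. */
static int finalize_retval_new(int verdict, int retval,
			       enum hook_ret_kind kind)
{
	if (kind != HOOK_RET_ERRNO)
		return retval;
	return finalize_retval_old(verdict, retval);
}
```

With a verdict of 0 and a retval of 0, the old shape hands -EPERM to a boolean hook, which a caller would read as "true". The new shape leaves the value untouched for boolean and void hooks and keeps the translation for errno hooks.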
10 days ago  firewire: core: Open-code topology list walk
Kaitao Cheng [Tue, 9 Jun 2026 06:13:35 +0000 (14:13 +0800)] 
firewire: core: Open-code topology list walk

A later change will make list_for_each_entry() cache the next element
before entering the loop body. for_each_fw_node() intentionally appends
newly discovered child nodes to the temporary walk list while the list is
being traversed.

Keep the loop open-coded so the next node is looked up only after
children have been appended. This preserves the current breadth-first
traversal semantics and prepares the code for the list iterator update.

Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
Link: https://lore.kernel.org/r/20260609061347.93688-3-kaitao.cheng@linux.dev
Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
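
Why caching the next element matters for this walk can be shown with a toy tree. The sketch uses a hypothetical singly linked walk list (struct tnode, append_children() and the two walk functions are invented for illustration) rather than the kernel's list_head, and the caller must pass an order array large enough for every node.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 2

struct tnode {
	struct tnode *next;		/* link in the temporary walk list */
	int id;
	int nchildren;
	struct tnode *children[MAX_CHILDREN];
};

/* Append every child of n to the end of the walk list. */
static struct tnode *append_children(struct tnode *tail, struct tnode *n)
{
	for (int i = 0; i < n->nchildren; i++) {
		n->children[i]->next = NULL;
		tail->next = n->children[i];
		tail = n->children[i];
	}
	return tail;
}

/* Open-coded walk: n->next is read only after the body has appended. */
static int walk_open_coded(struct tnode *root, int *order)
{
	struct tnode *tail = root;
	int count = 0;

	root->next = NULL;
	for (struct tnode *n = root; n; n = n->next) {
		order[count++] = n->id;
		tail = append_children(tail, n);
	}
	return count;
}

/* Walk that caches the next element before running the body. */
static int walk_cached_next(struct tnode *root, int *order)
{
	struct tnode *tail = root, *n = root, *next;
	int count = 0;

	root->next = NULL;
	while (n) {
		next = n->next;		/* cached before children are appended */
		order[count++] = n->id;
		tail = append_children(tail, n);
		n = next;
	}
	return count;
}
```

On a root with two children, one of which has a child of its own, the open-coded walk visits all four nodes breadth-first, while the cached-next walk stops after the root because the list had no successor when next was sampled.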
10 days ago  net: qrtr: fix 32-bit integer overflow in qrtr_endpoint_post()
Michael Bommarito [Thu, 11 Jun 2026 12:54:55 +0000 (08:54 -0400)] 
net: qrtr: fix 32-bit integer overflow in qrtr_endpoint_post()

qrtr_endpoint_post() validates an incoming packet with

	if (!size || len != ALIGN(size, 4) + hdrlen)
		goto err;

where size comes from the wire. On 32-bit, size_t is 32 bits and
ALIGN(size, 4) wraps to 0 for size >= 0xfffffffd, so the check
passes and skb_put_data(skb, data + hdrlen, size) writes past the
hdrlen-sized skb and oopses the kernel. 64-bit is unaffected.

This is the 32-bit residual of ad9d24c9429e2 ("net: qrtr: fix OOB
Read in qrtr_endpoint_post"), which fixed only the 64-bit case.

Reject any size that cannot fit the buffer before the ALIGN.

Fixes: ad9d24c9429e2 ("net: qrtr: fix OOB Read in qrtr_endpoint_post")
Cc: stable@vger.kernel.org
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260611125455.2352279-1-michael.bommarito@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
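
The wraparound is easy to reproduce with an explicit 32-bit type standing in for size_t on a 32-bit build. This is a userspace sketch: align4_u32() and the two check functions are hypothetical names, and the bounded check only illustrates "reject any size that cannot fit the buffer before the ALIGN", it is not the actual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same rounding as ALIGN(x, 4), done in 32-bit arithmetic. */
static uint32_t align4_u32(uint32_t x)
{
	return (x + 3u) & ~3u;
}

/* Original check: accept when len == ALIGN(size, 4) + hdrlen. */
static bool len_check_wrapping(uint32_t size, uint32_t len, uint32_t hdrlen)
{
	return size && len == align4_u32(size) + hdrlen;
}

/* Bounded check: refuse sizes larger than the payload area first, so
 * the rounding can no longer hide an oversized value. */
static bool len_check_bounded(uint32_t size, uint32_t len, uint32_t hdrlen)
{
	if (!size || len < hdrlen || size > len - hdrlen)
		return false;
	return len == align4_u32(size) + hdrlen;
}
```

With size = 0xfffffffd the rounded value wraps to 0, so a packet whose len equals hdrlen passes the original check even though the claimed payload is almost 4 GiB. The bounded variant rejects it and still accepts a well-formed packet such as size 5, hdrlen 32, len 40.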
10 days ago  net/mlx5: Check max_macs devlink param value against max capability
Dragos Tatulea [Thu, 11 Jun 2026 13:52:30 +0000 (16:52 +0300)] 
net/mlx5: Check max_macs devlink param value against max capability

The max_macs devlink param is checked against the FW max value only at
param register time (driver load) and inside the validate callback
(devlink param set). The stored DRIVERINIT value persists across FW
resets and devlink reloads without any further checks against the max.

If the FW link type changes from Ethernet to IB and a FW reset happens,
the MAX cap for log_max_current_uc_list will become zero, but the
previously stored max_macs value remains and is unconditionally
programmed into the HCA caps in handle_hca_cap(). FW will then return a
syndrome during SET_HCA_CAP:

 mlx5_cmd_out_err:839:(pid 3831): SET_HCA_CAP(0x109) op_mod(0x0) failed,
 status bad parameter(0x3), syndrome (0x537801), err(-22)
 set_hca_cap:907:(pid 3831): handle_hca_cap failed

This results in a failure to register the RDMA device.

This patch skips programming log_max_current_uc_list when the MAX
capability is 0 (in case of IB).

Fixes: 8680a60fc1fc ("net/mlx5: Let user configure max_macs generic param")
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20260611135230.534513-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 days agoMerge branch 'psp-add-support-for-dev-assoc-disassoc'
Jakub Kicinski [Sat, 13 Jun 2026 01:31:35 +0000 (18:31 -0700)] 
Merge branch 'psp-add-support-for-dev-assoc-disassoc'

Wei Wang says:

====================
psp: Add support for dev-assoc/disassoc

The main purpose of this feature is to associate virtual devices like
veth or netkit with a real PSP device, so we could provide PSP
functionality to the application running with virtual devices.

A typical deployment that works with this feature is as follows:
     Host Namespace:
     psp_dev_local  ←──physically linked──→ psp_dev_peer
  (PSP device)
       │
       │ BPF on psp_dev_local ingress: bpf_redirect_peer() to nk_guest
       │
  nk_host / veth_host
       │
       │ BPF on nk_host ingress: bpf_redirect_neigh() to psp_dev_local
       │
      Guest Namespace (netns):
       │
  nk_guest / veth_guest
  ★ PSP application runs here

      Remote Namespace (_netns):
  psp_dev_peer
  ★ PSP server application runs here

Note:
The general requirement for this feature to work:
For PSP to work correctly, the egress device at validate_xmit_skb()
time must have psp_dev matching the association's psd. Any device
stacking or traffic redirection that changes the egress device will
cause either:
1. TX validation failure (SKB_DROP_REASON_PSP_OUTPUT) - fail-safe
2. RX policy failure after tx-assoc - packets without PSP extension
   are rejected by receiver expecting encrypted traffic

Here are a few examples where this feature would not work:
- Bonding with load balancing in round-robin, XOR, 802.3ad mode across
  multiple PSP devices, or mixed PSP and non-PSP devices
- Bonding with active-backup mode might work without PSP migration for
  failover case.
- ipvlan/macvlan in bridge mode would not work, since packets are
  looped back locally without going through the PSP device.
====================

Link: https://patch.msgid.link/20260608233118.2694144-1-weibunny.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 days agoselftests/net: psp: add dev-get, no-nsid, and cleanup tests
Wei Wang [Mon, 8 Jun 2026 23:31:18 +0000 (16:31 -0700)] 
selftests/net: psp: add dev-get, no-nsid, and cleanup tests

Add the following 3 tests:

- _psp_dev_get_check_netkit_psp_assoc: verifies dev-get output in both
  host and guest namespaces, checking assoc-list, by-association flag,
  and nsid values
- _dev_assoc_no_nsid: tests dev-assoc and dev-disassoc without the nsid
  attribute, verifying ifindex lookup in the caller's namespace
- _psp_dev_assoc_cleanup_on_netkit_del: verifies that deleting the
  associated netkit interface properly cleans up the assoc-list, using
  a disposable netkit pair to avoid disturbing the shared environment

Signed-off-by: Wei Wang <weibunny@fb.com>
Link: https://patch.msgid.link/20260608233118.2694144-11-weibunny.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 days agoselftests/net: psp: add cross-namespace notification tests
Wei Wang [Mon, 8 Jun 2026 23:31:17 +0000 (16:31 -0700)] 
selftests/net: psp: add cross-namespace notification tests

Add tests that verify PSP notifications are delivered to listeners in
associated namespaces:

- _key_rotation_notify_multi_ns_netkit: triggers key rotation and
  verifies the notification is received in both main and guest namespaces
- _dev_change_notify_multi_ns_netkit: triggers dev_set and verifies the
  dev_change notification is received in both namespaces

Signed-off-by: Wei Wang <weibunny@fb.com>
Link: https://patch.msgid.link/20260608233118.2694144-10-weibunny.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 days agoselftests/net: psp: add dev-assoc data path test
Wei Wang [Mon, 8 Jun 2026 23:31:16 +0000 (16:31 -0700)] 
selftests/net: psp: add dev-assoc data path test

Add _assoc_check_list() test that associates nk_guest with the PSP
device and verifies the assoc-list is correctly populated.

Add _data_basic_send_netkit_psp_assoc() which tests PSP data send
through a netkit interface associated with a PSP device. The test
associates nk_guest with the PSP device, then sends PSP-encrypted
traffic from the guest namespace.

Signed-off-by: Wei Wang <weibunny@fb.com>
Link: https://patch.msgid.link/20260608233118.2694144-9-weibunny.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 days agoselftests/net: psp: support PSP in NetDrvContEnv infrastructure
Wei Wang [Mon, 8 Jun 2026 23:31:15 +0000 (16:31 -0700)] 
selftests/net: psp: support PSP in NetDrvContEnv infrastructure

Add infrastructure to support PSP tests across network namespaces
using NetDrvContEnv with netkit pairs. This enables testing PSP device
association, where a non-PSP-capable device (e.g. netkit) in a guest
namespace is associated with a real PSP device in the host namespace,
allowing the guest to perform PSP encryption/decryption through the
host's PSP hardware.

The topology is:
  Host NS:  psp_dev_local <---> nk_host
                |                  |
                |                  | (netkit pair)
                |                  |
  Remote NS: psp_dev_peer      Guest NS: nk_guest
             (responder)             (PSP tests)

env.py:
- nk_guest_ifindex is queried after moving the device into the guest
  namespace, so tests can use it directly for dev-assoc

psp.py:
- PSP device lookup supports container environments where the PSP
  device is on the physical interface, not the test interface
- Association helpers handle dev-assoc/dev-disassoc with defer-based
  cleanup to prevent state leaks on test assertion failures
- main() tries NetDrvContEnv with primary_rx_redirect and falls back
  to NetDrvEpEnv, so existing tests continue to work without the
  container environment

Signed-off-by: Wei Wang <weibunny@fb.com>
Link: https://patch.msgid.link/20260608233118.2694144-8-weibunny.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 days agoselftests/net: rename _nk_host_ifname to nk_host_ifname
Wei Wang [Mon, 8 Jun 2026 23:31:14 +0000 (16:31 -0700)] 
selftests/net: rename _nk_host_ifname to nk_host_ifname

Rename _nk_host_ifname to nk_host_ifname in NetDrvContEnv to make it
a public attribute, matching the nk_guest_ifname rename. Tests that
access the host-side netkit interface name (e.g. for cleanup after
deleting the netkit pair) no longer trigger pylint protected-access
warnings.

Signed-off-by: Wei Wang <weibunny@fb.com>
Link: https://patch.msgid.link/20260608233118.2694144-7-weibunny.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 days agoselftests/net: add _find_bpf_obj() to search hw/ for BPF objects
Wei Wang [Mon, 8 Jun 2026 23:31:13 +0000 (16:31 -0700)] 
selftests/net: add _find_bpf_obj() to search hw/ for BPF objects

Add _find_bpf_obj() helper to NetDrvContEnv that searches the test
directory first, then falls back to the hw/ subdirectory. This allows
tests outside drivers/net/hw/ (e.g. psp.py in drivers/net/) to find
BPF objects built in the hw/ directory.

Update _attach_bpf() and _attach_primary_rx_redirect_bpf() to use
_find_bpf_obj() for BPF object discovery.

Signed-off-by: Wei Wang <weibunny@fb.com>
Link: https://patch.msgid.link/20260608233118.2694144-6-weibunny.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 days agoselftests/net: psp: refactor test builders to use ksft_variants
Wei Wang [Mon, 8 Jun 2026 23:31:12 +0000 (16:31 -0700)] 
selftests/net: psp: refactor test builders to use ksft_variants

Replace the manual psp_ip_ver_test_builder() and ipver_test_builder()
functions with @ksft_variants decorators for data_basic_send and
data_mss_adjust. This is a pure refactor with no behavior change.

Signed-off-by: Wei Wang <weibunny@fb.com>
Link: https://patch.msgid.link/20260608233118.2694144-5-weibunny.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 days agopsp: add a new netdev event for dev unregister
Wei Wang [Mon, 8 Jun 2026 23:31:11 +0000 (16:31 -0700)] 
psp: add a new netdev event for dev unregister

Add a new netdev event for dev unregister and handle the removal of this
dev from psp->assoc_dev_list, upon the first dev-assoc operation.

Signed-off-by: Wei Wang <weibunny@fb.com>
Reviewed-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20260608233118.2694144-4-weibunny.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 days agopsp: add new netlink cmd for dev-assoc and dev-disassoc
Wei Wang [Mon, 8 Jun 2026 23:31:10 +0000 (16:31 -0700)] 
psp: add new netlink cmd for dev-assoc and dev-disassoc

The main purpose of this cmd is to be able to associate a
non-psp-capable device (e.g. veth or netkit) with a psp device.
One use case is creating a veth/netkit pair and assigning one end to a
netns, while leaving the other end in the default netns alongside a
real PSP device, e.g. netdevsim or a physical PSP-capable NIC.
With this command, we can associate the veth/netkit inside the netns
with the PSP device, so the virtual device can act as a PSP-capable
device to initiate PSP connections, with PSP encryption/decryption
performed on the real PSP device.

Signed-off-by: Wei Wang <weibunny@fb.com>
Reviewed-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20260608233118.2694144-3-weibunny.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 days agopsp: add admin/non-admin version of psp_device_get_locked
Wei Wang [Mon, 8 Jun 2026 23:31:09 +0000 (16:31 -0700)] 
psp: add admin/non-admin version of psp_device_get_locked

Introduce 2 versions of psp_device_get_locked:
1. psp_device_get_locked_admin(): This version is used for operations
   that change the status of the psd; it is currently used for
   dev-set and key-rotation.
2. psp_device_get_locked(): This is the non-admin version, which is
   used for broader user-issued operations including: dev-get, rx-assoc,
   tx-assoc, get-stats.

The following commit will implement both of the checks.

Signed-off-by: Wei Wang <weibunny@fb.com>
Reviewed-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20260608233118.2694144-2-weibunny.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10 days agoMerge branch 'bpf-fix-generic-devmap-egress-skb-sharing'
Alexei Starovoitov [Sat, 13 Jun 2026 01:21:01 +0000 (18:21 -0700)] 
Merge branch 'bpf-fix-generic-devmap-egress-skb-sharing'

Sun Jian says:

====================
bpf: Fix generic devmap egress skb sharing

Generic XDP devmap multi redirect can leave cloned skbs sharing packet
data. When a devmap egress program mutates packet data, another
destination sharing the same data may observe that mutation.

Fix this by making cloned skbs private before running the generic devmap
egress program. The private copy is made in dev_map_generic_redirect()
so dev_map_bpf_prog_run_skb() can keep returning the XDP action directly.

Add selftest coverage for the last-destination case, where the final
destination runs on the original skb while earlier destinations use
cloned skbs. The test records the source MAC observed by an earlier
destination and checks that it is neither the sentinel value left in the
result map nor the MAC written by the final destination.
---

v5:
- Move the skb_copy() check back to dev_map_generic_redirect() to keep
  dev_map_bpf_prog_run_skb() returning only the XDP action.
- Preserve mac_len after skb_copy().
- Use __be64 temporary values when updating mac_map from userspace.
- Initialize rx_mac with a sentinel in the last-destination test instead
  of relying on -ENOENT for ARRAY map lookups.
- Adjust the last-destination test topology so the checked earlier
  destination is not the ingress/source veth.
- Split the last-destination check into two assertions: one for store_mac_1
  updating rx_mac and one for detecting last-destination rewrite leakage.

v4: https://lore.kernel.org/bpf/20260611080850.536996-1-sun.jian.kdev@gmail.com/T/#mf830f03d362f33e0941d1b0e425169698fce76e5
- Preserve mac_len after skb_copy().
- Separate errno return from XDP action output in
  dev_map_bpf_prog_run_skb().
- Zero-initialize net_config in the new selftest.

v3: https://lore.kernel.org/bpf/20260611043317.512843-1-sun.jian.kdev@gmail.com/
- Split the kernel fix and selftest into separate patches.
- Move the private-copy logic into dev_map_bpf_prog_run_skb().
- Use deterministic DEVMAP_HASH keys in the last-destination selftest.
- Fix the Fixes tag.

v2: https://lore.kernel.org/bpf/08c35c70-a59e-4e0e-91db-22b5ec30b611@linux.dev/
- Move the private-copy step into dev_map_generic_redirect() so the
  last-destination path is covered as well.
- Use skb_copy() instead of skb_unshare() to keep caller ownership
  unchanged on allocation failure.
- Add a generic XDP last-destination selftest case.

v1: https://lore.kernel.org/bpf/CABFUUZFimdrZdq=NWi+N-0sJZWvMwY=f4iF6-3TVMS8=m07Zmw@mail.gmail.com/
====================

Link: https://patch.msgid.link/20260612114032.244616-1-sun.jian.kdev@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 days agoselftests/bpf: Cover generic devmap egress last-dst rewrite
Sun Jian [Fri, 12 Jun 2026 11:40:32 +0000 (19:40 +0800)] 
selftests/bpf: Cover generic devmap egress last-dst rewrite

Strengthen xdp_veth_egress to check that each destination observes the
MAC selected for its own egress ifindex, instead of only checking that
the observed MAC differs from a single magic value.

Add a generic XDP last-destination test where an earlier destination does
not have a devmap egress program while the final destination does. This
covers the case where the final destination runs on the original skb and
could otherwise rewrite packet data still shared with an earlier cloned
skb.

Use deterministic DEVMAP_HASH keys for the egress map so the intended
last destination is stable. Initialize the result map with a sentinel
value and check that store_mac_1 overwrites it before checking that the
earlier destination did not observe the MAC written by the final
destination.

Suggested-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com>
Link: https://lore.kernel.org/r/20260612114032.244616-3-sun.jian.kdev@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 days agobpf: Run generic devmap egress prog on private skb
Sun Jian [Fri, 12 Jun 2026 11:40:31 +0000 (19:40 +0800)] 
bpf: Run generic devmap egress prog on private skb

Generic XDP devmap multi redirect uses skb_clone() for intermediate
destinations and sends the last destination with the original skb. This
can leave multiple destinations sharing the same packet data.

This becomes visible after generic devmap egress-program support was
added: a devmap egress program may mutate packet data, and another
destination sharing the same data can observe that mutation.

Native XDP broadcast redirect does not have this issue because
xdpf_clone() copies the frame data for each destination. Generic XDP
should provide the same per-destination isolation before running a
devmap egress program.

Fix this by making cloned skbs private before running the generic devmap
egress program. Use skb_copy() instead of skb_unshare() so allocation
failure does not consume the skb and the existing caller error paths keep
their ownership semantics.
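
The difference between a clone that shares packet data and a private copy can be modelled in user space. This is a simplified sketch under stated assumptions, not kernel code: struct pkt and the pkt_* helpers are hypothetical stand-ins for an skb, skb_clone() and skb_copy(), and rewrite_first_byte() stands in for an egress program that mutates the packet.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, heavily simplified packet descriptor. */
struct pkt {
	unsigned char *data;
	size_t len;
};

/* Clone: a new descriptor pointing at the SAME data buffer. */
static struct pkt pkt_clone(const struct pkt *p)
{
	return *p;
}

/* Copy: a new descriptor with its own data buffer. On allocation failure
 * the source is left untouched and still owned by the caller. */
static int pkt_copy(const struct pkt *p, struct pkt *out)
{
	unsigned char *d = malloc(p->len ? p->len : 1);

	if (!d)
		return -1;
	memcpy(d, p->data, p->len);
	out->data = d;
	out->len = p->len;
	return 0;
}

/* Stand-in for an egress program that rewrites packet data. */
static void rewrite_first_byte(struct pkt *p, unsigned char v)
{
	if (p->len)
		p->data[0] = v;
}
```

After rewrite_first_byte() runs on the original, a descriptor obtained from pkt_clone() observes the new byte, while one obtained from pkt_copy() keeps the old value. That is the per-destination isolation the fix is after.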

Fixes: 2ea5eabaf04a ("bpf: devmap: Implement devmap prog execution for generic XDP")
Suggested-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com>
Link: https://lore.kernel.org/r/20260612114032.244616-2-sun.jian.kdev@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
11 days agoMerge branch 'net-dsa-microchip-remove-unnecessary-dsa_switch_ops-callbacks'
Jakub Kicinski [Sat, 13 Jun 2026 01:08:10 +0000 (18:08 -0700)] 
Merge branch 'net-dsa-microchip-remove-unnecessary-dsa_switch_ops-callbacks'

Bastien Curutchet says:

====================
net: dsa: microchip: remove unnecessary dsa_switch_ops callbacks

This series continues the rework of the KSZ driver initiated by two previous
series (see [1] & [2]).

The KSZ driver handles more than 20 switches split in several families.
This was previously handled through a common set of dsa_switch_ops
operations that used device-specific ksz_dev_ops callbacks. The two
previous series have split this common struct dsa_switch_ops into 5
to connect the ksz_dev_ops's implementations directly to the new
dsa_switch_ops.

This series continues in the same vein and removes the dsa_switch_ops
operations that aren't used.

On top of this on-going rework I added PTP and periodic output support for
the KSZ8463 (which was my first goal). There are still more than 20 patches
left for all this, so this series will be followed by three others. If you
want to see the full picture, you can check my GitHub tree ([3]).

FYI, I only have a KSZ8463 so, unfortunately, I can't test other switches.

The next series is going to move out of ksz_common.c the last remaining
functions that aren't truly common to all KSZ switches. The series after
that will add PTP support for the KSZ8463 and the final one will add
periodic output support for the KSZ8463.

[1]: https://lore.kernel.org/r/20260505-clean-ksz-driver-v1-0-05d70fa42461@bootlin.com
[2]: https://lore.kernel.org/r/20260521-clean-ksz-2nd-series-v3-0-75c38971c19a@bootlin.com
[3]: https://github.com/bastien-curutchet/linux/tree/ksz_rework
====================

Link: https://patch.msgid.link/20260608-clean-ksz-3rd-v2-0-6e61b7be23c4@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet: dsa: microchip: implement port_teardown only if needed
Bastien Curutchet (Schneider Electric) [Mon, 8 Jun 2026 14:10:13 +0000 (16:10 +0200)] 
net: dsa: microchip: implement port_teardown only if needed

The port_teardown() operation is optional. Yet, it is implemented by all
the KSZ switches through a common function that doesn't do anything for
the switches that aren't part of the ksz9477 family.

Remove the implementation from the switches that don't need it.
Implement instead a ksz9477-specific port_teardown.

Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20260608-clean-ksz-3rd-v2-10-6e61b7be23c4@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet: dsa: microchip: implement lan937x-specific MDIO registration
Bastien Curutchet (Schneider Electric) [Mon, 8 Jun 2026 14:10:12 +0000 (16:10 +0200)] 
net: dsa: microchip: implement lan937x-specific MDIO registration

All the switches use a common mdio_register() function that uses two
ksz_dev_ops callbacks (.mdio_bus_preinit() and .create_phy_addr_map())
to handle the lan937x specific case. These two callbacks are used only
at this place in the code.

Implement a new lan937x-specific MDIO registration function that uses
these two lan937x-specific functions. The lan937x bindings don't
have any 'interrupts' property so this lan937x_mdio_register() doesn't
call ksz_irq_phy_setup().
Expose the common ksz_*_mdio_{read/write} functions so they can be used
in lan937x.c.
Remove the callbacks from ksz_dev_ops.

Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20260608-clean-ksz-3rd-v2-9-6e61b7be23c4@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet: dsa: microchip: implement port_hsr_join for KSZ9477 only
Bastien Curutchet (Schneider Electric) [Mon, 8 Jun 2026 14:10:11 +0000 (16:10 +0200)] 
net: dsa: microchip: implement port_hsr_join for KSZ9477 only

All switches implement the optional .port_hsr_join operation while only
the KSZ9477 truly supports it.

Remove the common port_hsr_join implementation.
Replace it with a specific implementation for the KSZ9477 case.

Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20260608-clean-ksz-3rd-v2-8-6e61b7be23c4@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet: dsa: microchip: implement .{get/set}_wol only if needed
Bastien Curutchet (Schneider Electric) [Mon, 8 Jun 2026 14:10:10 +0000 (16:10 +0200)] 
net: dsa: microchip: implement .{get/set}_wol only if needed

All the KSZ switches use common {get/set}_wol operations while only the
ksz9477 and the ksz87xx families really support it. These operations are
optional so there is no point implementing them to return -EOPNOTSUPP.

Remove the {get/set}_wol callbacks from the switch operations for the
ksz88xx, the ksz8463 and the lan937x families.
Remove the family check from the common {get/set}_wol implementation.

Note that is_ksz9477() is only true for the KSZ9477, so this change will
also add WoL support for the other switches using the
ksz9477_switch_ops. I checked their datasheets: they implement the same
PME_WOL registers at the same addresses, so this should go fine.
Modify the ksz_wol_pre_shutdown() initial check to ensure consistency in
the WoL handling for these non-KSZ9477 switches using ksz9477_switch_ops.

Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20260608-clean-ksz-3rd-v2-7-6e61b7be23c4@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet: dsa: microchip: implement .support_eee() only if needed
Bastien Curutchet (Schneider Electric) [Mon, 8 Jun 2026 14:10:09 +0000 (16:10 +0200)] 
net: dsa: microchip: implement .support_eee() only if needed

The .support_eee() operation is optional. Yet, it is implemented by the
KSZ switches through a common function that reports false for every chip
except for KSZ8563, KSZ9563 and KSZ9893 from the KSZ9477 family.

Remove the implementation from the switches that don't support EEE.
Also remove .set_mac_eee() for them as .set_mac_eee() is gated by the
`support_eee` presence in the core.

Implement instead a ksz9477-specific support_eee for these three supported
switches.

Note that the comment /* KSZ879x/KSZ877x/KSZ876x Errata DS80000687C Module 2 */
is completely removed because it concerns the KSZ87xx family, which doesn't
support EEE at all.

Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20260608-clean-ksz-3rd-v2-6-6e61b7be23c4@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet: dsa: microchip: remove setup_rgmii_delay() KSZ operation
Bastien Curutchet (Schneider Electric) [Mon, 8 Jun 2026 14:10:08 +0000 (16:10 +0200)] 
net: dsa: microchip: remove setup_rgmii_delay() KSZ operation

The setup_rgmii_delay() operation is only used once during the common phylink
MAC configuration. Only the lan937x switch implements this
setup_rgmii_delay().

Remove the setup_rgmii_delay operation from ksz_dev_ops.
Implement a lan937x-specific phylink MAC configuration that does this
RGMII delay setup.
Export ksz_set_xmii since it's needed by the lan937x implementation.

Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20260608-clean-ksz-3rd-v2-5-6e61b7be23c4@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet: dsa: microchip: wrap the MAC configuration checks in a function
Bastien Curutchet (Schneider Electric) [Mon, 8 Jun 2026 14:10:07 +0000 (16:10 +0200)] 
net: dsa: microchip: wrap the MAC configuration checks in a function

The common .mac_config() implementation checks some conditions before
doing any register access. As this common implementation is about to be
split in the upcoming patch, these checks would lead to code
duplication.

Wrap all the checks in a need_config() function that returns true when
the driver really needs to access the switch registers to configure the
MAC.

Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20260608-clean-ksz-3rd-v2-4-6e61b7be23c4@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet: dsa: microchip: implement get_phy_flags only if needed
Bastien Curutchet (Schneider Electric) [Mon, 8 Jun 2026 14:10:06 +0000 (16:10 +0200)] 
net: dsa: microchip: implement get_phy_flags only if needed

The common ksz_get_phy_flags() is used by all the switches to implement
the optional .get_phy_flags DSA operation. It always returns 0 except
for KSZ88X3 switches where an erratum has to be handled.

Make ksz_get_phy_flags() ksz88xx-specific.
Remove the get_phy_flags implementation for the switches that don't need
it.

Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20260608-clean-ksz-3rd-v2-3-6e61b7be23c4@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet: dsa: microchip: remove VLAN operations for ksz8463
Vladimir Oltean [Mon, 8 Jun 2026 14:10:05 +0000 (16:10 +0200)] 
net: dsa: microchip: remove VLAN operations for ksz8463

KSZ8463 uses the common KSZ8 implementation for its VLAN operations.
This implementation returns -ENOTSUPP for the KSZ8463 case, which is
pointless.

Remove the VLAN operations from the ksz8463_switch_ops so the core can
directly return -ENOTSUPP.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20260608-clean-ksz-3rd-v2-2-6e61b7be23c4@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet: dsa: microchip: remove useless common cls_flower_{add/del} operations
Bastien Curutchet (Schneider Electric) [Mon, 8 Jun 2026 14:10:04 +0000 (16:10 +0200)] 
net: dsa: microchip: remove useless common cls_flower_{add/del} operations

All the KSZ switches share a common implementation of the
cls_flower_{add/del} operations. These common implementations return
ksz9477-specific implementations for the KSZ9477 family and -EOPNOTSUPP
for the others. -EOPNOTSUPP is already returned by the DSA core when
the operation isn't implemented.

Remove the common implementations.
Directly link the ksz9477_cls_flower_{add/del}() to the KSZ9477 callback.
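
The reasoning above relies on the core treating a missing callback as "not supported". A minimal user-space sketch of that pattern follows. The struct, the ops tables and the function names are hypothetical and only mirror the shape of an optional operation; they are not the DSA API.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical ops table with one optional callback. */
struct switch_ops {
	int (*cls_flower_add)(int port);
};

/* Hypothetical family-specific implementation. */
static int family_cls_flower_add(int port)
{
	return port >= 0 ? 0 : -EINVAL;
}

/* A family that supports the operation wires the callback directly... */
static const struct switch_ops full_ops = {
	.cls_flower_add = family_cls_flower_add,
};

/* ...and a family that does not simply leaves the pointer NULL. */
static const struct switch_ops minimal_ops = {
	.cls_flower_add = NULL,
};

/* The "core" side: a NULL callback yields -EOPNOTSUPP, so no driver-level
 * stub returning the same error is needed. */
static int core_cls_flower_add(const struct switch_ops *ops, int port)
{
	if (!ops->cls_flower_add)
		return -EOPNOTSUPP;
	return ops->cls_flower_add(port);
}
```

Calling core_cls_flower_add() with minimal_ops returns -EOPNOTSUPP without any driver code running, which is why a common wrapper that only returns that error adds nothing.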

Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20260608-clean-ksz-3rd-v2-1-6e61b7be23c4@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agoMerge branch 'net-bridge-take-care-of-p-flags-accesses'
Jakub Kicinski [Sat, 13 Jun 2026 01:03:48 +0000 (18:03 -0700)] 
Merge branch 'net-bridge-take-care-of-p-flags-accesses'

Eric Dumazet says:

====================
net: bridge: take care of p->flags accesses

(struct net_bridge_port)->flags can be read/written locklessly,
and thus can fire KCSAN warnings, or real bugs.

Prefer atomic operations (test_bit(), clear_bit(), set_bit())
and use READ_ONCE() for the remaining uses.
====================

Link: https://patch.msgid.link/20260611203453.3067462-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet: bridge: use atomic ops to read/change p->flags (III)
Eric Dumazet [Thu, 11 Jun 2026 20:34:53 +0000 (20:34 +0000)] 
net: bridge: use atomic ops to read/change p->flags (III)

Use test_bit(), clear_bit(), set_bit() in:

   net/bridge/br_multicast.c
   net/bridge/br_netlink.c
   net/bridge/br_stp.c
   net/bridge/br_stp_bpdu.c
   net/bridge/br_switchdev.c
   net/bridge/br_vlan_options.c

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260611203453.3067462-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet: bridge: use atomic ops to read/change p->flags (II)
Eric Dumazet [Thu, 11 Jun 2026 20:34:52 +0000 (20:34 +0000)] 
net: bridge: use atomic ops to read/change p->flags (II)

Use READ_ONCE(p->flags) in br_port_flag_is_set() to keep its ABI.

Use test_bit(), clear_bit(), set_bit() in:

   net/bridge/br_input.c
   net/bridge/br_mrp.c
   net/bridge/br_mrp_netlink.c

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260611203453.3067462-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet: bridge: use atomic ops to read/change p->flags (I)
Eric Dumazet [Thu, 11 Jun 2026 20:34:51 +0000 (20:34 +0000)] 
net: bridge: use atomic ops to read/change p->flags (I)

Use test_bit() in net/bridge/br_arp_nd_proxy.c,
net/bridge/br_fdb.c and net/bridge/br_forward.c.

Use READ_ONCE(p->flags) in br_recalculate_neigh_suppress_enabled()
as we test two bits at once.
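
As a rough user-space analogy, the two access styles can be sketched with C11 atomics. This is not the kernel implementation: struct port, the bit numbers and the port_* helpers are hypothetical, atomic_fetch_or()/atomic_fetch_and() stand in for set_bit()/clear_bit(), and a relaxed atomic load stands in for READ_ONCE().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bit numbers, for illustration only. */
enum { FLAG_A_BIT = 0, FLAG_B_BIT = 1 };

struct port {
	atomic_ulong flags;
};

/* Atomic read-modify-write: concurrent updates to other bits are not lost. */
static void port_set_bit(struct port *p, unsigned int nr)
{
	atomic_fetch_or(&p->flags, 1UL << nr);
}

static void port_clear_bit(struct port *p, unsigned int nr)
{
	atomic_fetch_and(&p->flags, ~(1UL << nr));
}

/* Single-bit test on one lockless load. */
static bool port_test_bit(struct port *p, unsigned int nr)
{
	unsigned long snap = atomic_load_explicit(&p->flags, memory_order_relaxed);

	return (snap & (1UL << nr)) != 0;
}

/* Testing two bits: take ONE snapshot and check both bits against it, so
 * both results come from the same value of flags. */
static bool port_any_of_two(struct port *p)
{
	unsigned long snap = atomic_load_explicit(&p->flags, memory_order_relaxed);

	return (snap & ((1UL << FLAG_A_BIT) | (1UL << FLAG_B_BIT))) != 0;
}
```

The single snapshot in port_any_of_two() is the point of using one load when two bits are tested together: two separate single-bit tests could observe two different values of the word.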

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260611203453.3067462-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agobridge: use atomic ops to read/change p->flags in br_netlink.c
Eric Dumazet [Thu, 11 Jun 2026 20:34:50 +0000 (20:34 +0000)] 
bridge: use atomic ops to read/change p->flags in br_netlink.c

Change net/bridge/br_netlink.c to use atomic operations
to read/change bits in p->flags.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260611203453.3067462-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agobridge: use atomic ops to read/change p->flags in sysfs
Eric Dumazet [Thu, 11 Jun 2026 20:34:49 +0000 (20:34 +0000)] 
bridge: use atomic ops to read/change p->flags in sysfs

Change net/bridge/br_sysfs_if.c to use atomic operations
to read/change bits in p->flags.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260611203453.3067462-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet/sched: sch_dualpi2: Add missing module alias
Victor Nogueira [Thu, 11 Jun 2026 20:58:49 +0000 (17:58 -0300)] 
net/sched: sch_dualpi2: Add missing module alias

When a qdisc is added by name, the kernel tries to autoload its module
via request_qdisc_module(), which calls:

request_module(NET_SCH_ALIAS_PREFIX "%s", name);

i.e. it asks modprobe to resolve the "net-sch-<kind>" alias (e.g.
"net-sch-dualpi2") rather than the module's file name. Since dualpi2
was shipped without this alias, the autoload fails:

tc qdisc add dev lo root handle 1: dualpi2
Error: Specified qdisc kind is unknown.

Fix this by adding the missing alias so the qdisc is autoloaded on demand
like the others.
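
To make the lookup concrete, the alias string that gets requested can be rebuilt in user space. This is only a sketch: qdisc_alias() is a hypothetical helper, and the prefix value is copied here on the assumption that it matches the "net-sch-<kind>" form quoted above. In the kernel, an alias of this form is normally declared with the MODULE_ALIAS_NET_SCH() helper; the diff itself is not shown here, so treat that as an assumption about how the fix is written.

```c
#include <stdio.h>
#include <string.h>

/* Assumed to match the prefix used for qdisc module aliases. */
#define NET_SCH_ALIAS_PREFIX "net-sch-"

/* Build the alias that would be handed to modprobe for a qdisc kind.
 * Returns 0 on success, -1 if the buffer is too small. */
static int qdisc_alias(char *buf, size_t len, const char *kind)
{
	int n = snprintf(buf, len, NET_SCH_ALIAS_PREFIX "%s", kind);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For kind "dualpi2" this yields "net-sch-dualpi2". Without a matching alias in the module, modprobe has nothing to resolve that string against, even though the module file itself exists.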

Fixes: 320d031ad6e4 ("sched: Struct definition and parsing of dualpi2 qdisc")
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Link: https://patch.msgid.link/20260611205849.3287640-1-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agodocs: networking: add guidance on what to push via extack
Jakub Kicinski [Thu, 11 Jun 2026 17:21:49 +0000 (10:21 -0700)] 
docs: networking: add guidance on what to push via extack

Every now and then someone tries to duplicate extack
messages to dmesg. Document our guidance against this.
Also indicate that system level faults should continue
to go to system logs. The high level thinking is to try
to distinguish between what's important to the user vs
system admin.

Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260611172149.1877704-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agoptp: ocp: add shutdown callback
Vadim Fedorenko [Thu, 11 Jun 2026 19:03:33 +0000 (19:03 +0000)] 
ptp: ocp: add shutdown callback

The shutdown callback was never implemented for this driver, but it's
needed because .remove() callback is never called during kexec/reboot
process. That leaves HW with some interrupts enabled and may cause
spurious interrupts while booting into a new kernel with kexec.
If it happens that I2C interrupt fires during kexec, the whole I2C bus
is disabled leaving TimeCard with no devlink communication. The same
happens if timestampers were enabled, leaving the card without
timestamper interrupts until full reboot cycle.

Implement .shutdown() callback with the same function as remove
callback.
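
A minimal userspace model of that idea, assuming a simplified ops table: the struct and function names below are hypothetical stand-ins, not the driver's real types. It only shows both callbacks pointing at the same quiesce routine.

```c
/* Simplified stand-in for device state; not the driver's real struct. */
struct card_state {
	int irq_enabled;
};

/* Simplified stand-in for a driver ops table. */
struct card_ops {
	void (*remove)(struct card_state *card);
	void (*shutdown)(struct card_state *card);
};

/* Quiesce the hardware: here that just means masking interrupts. */
static void card_remove(struct card_state *card)
{
	card->irq_enabled = 0;
}

/* Both teardown paths share one function, as the message describes. */
static const struct card_ops card_driver_ops = {
	.remove = card_remove,
	.shutdown = card_remove,
};
```

With this layout a kexec or reboot, which reaches only the shutdown path, leaves the device in the same quiet state as a regular unbind.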

Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20260611190333.787132-1-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agoMerge branch 'ipv6-honor-oif-when-choosing-nexthop-for-locally-generated-traffic'
Jakub Kicinski [Sat, 13 Jun 2026 00:53:51 +0000 (17:53 -0700)] 
Merge branch 'ipv6-honor-oif-when-choosing-nexthop-for-locally-generated-traffic'

Ido Schimmel says:

====================
ipv6: Honor oif when choosing nexthop for locally generated traffic

Patch #1 is a preparation patch following the comment from Sashiko on
v2. See details in the commit message.

Patch #2 aligns IPv6 with IPv4 and changes IPv6 route lookup to prefer a
nexthop whose nexthop device matches the specified oif.

Patch #3 adds a selftest.
====================

Link: https://patch.msgid.link/20260611154605.992528-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agoselftests: fib_tests: Add test cases for route lookup with oif
Ido Schimmel [Thu, 11 Jun 2026 15:46:05 +0000 (18:46 +0300)] 
selftests: fib_tests: Add test cases for route lookup with oif

Test that both address families respect the oif parameter when a
matching multipath route is found, regardless of the presence of a
source address.

Output without "ipv6: Select best matching nexthop object in
fib6_table_lookup()" and "ipv6: Honor oif when choosing nexthop for
locally generated traffic":

 # ./fib_tests.sh -t "ipv4_mpath_oif ipv4_mpath_oif_nh ipv4_mpath_oif_vrf ipv6_mpath_oif ipv6_mpath_oif_nh ipv6_mpath_oif_vrf"

 IPv4 multipath oif test
     TEST: IPv4 multipath via first nexthop                              [ OK ]
     TEST: IPv4 multipath via second nexthop                             [ OK ]
     TEST: IPv4 multipath via first nexthop with source address          [ OK ]
     TEST: IPv4 multipath via second nexthop with source address         [ OK ]

 IPv4 multipath oif with nexthop object test
     TEST: IPv4 multipath via first nexthop                              [ OK ]
     TEST: IPv4 multipath via second nexthop                             [ OK ]
     TEST: IPv4 multipath via first nexthop with source address          [ OK ]
     TEST: IPv4 multipath via second nexthop with source address         [ OK ]

 IPv4 multipath oif with VRF test
     TEST: IPv4 multipath via first nexthop                              [ OK ]
     TEST: IPv4 multipath via second nexthop                             [ OK ]
     TEST: IPv4 multipath via first nexthop with source address          [ OK ]
     TEST: IPv4 multipath via second nexthop with source address         [ OK ]

 IPv6 multipath oif test
     TEST: IPv6 multipath via first nexthop                              [ OK ]
     TEST: IPv6 multipath via second nexthop                             [ OK ]
     TEST: IPv6 multipath via first nexthop with source address          [FAIL]
     TEST: IPv6 multipath via second nexthop with source address         [FAIL]

 IPv6 multipath oif with nexthop object test
     TEST: IPv6 multipath via first nexthop                              [FAIL]
     TEST: IPv6 multipath via second nexthop                             [FAIL]
     TEST: IPv6 multipath via first nexthop with source address          [FAIL]
     TEST: IPv6 multipath via second nexthop with source address         [FAIL]

 IPv6 multipath oif with VRF test
     TEST: IPv6 multipath via first nexthop                              [ OK ]
     TEST: IPv6 multipath via second nexthop                             [ OK ]
     TEST: IPv6 multipath via first nexthop with source address          [FAIL]
     TEST: IPv6 multipath via second nexthop with source address         [FAIL]

 Tests passed:  16
 Tests failed:   8

Output with the patches:

 # ./fib_tests.sh -t "ipv4_mpath_oif ipv4_mpath_oif_nh ipv4_mpath_oif_vrf ipv6_mpath_oif ipv6_mpath_oif_nh ipv6_mpath_oif_vrf"

 IPv4 multipath oif test
     TEST: IPv4 multipath via first nexthop                              [ OK ]
     TEST: IPv4 multipath via second nexthop                             [ OK ]
     TEST: IPv4 multipath via first nexthop with source address          [ OK ]
     TEST: IPv4 multipath via second nexthop with source address         [ OK ]

 IPv4 multipath oif with nexthop object test
     TEST: IPv4 multipath via first nexthop                              [ OK ]
     TEST: IPv4 multipath via second nexthop                             [ OK ]
     TEST: IPv4 multipath via first nexthop with source address          [ OK ]
     TEST: IPv4 multipath via second nexthop with source address         [ OK ]

 IPv4 multipath oif with VRF test
     TEST: IPv4 multipath via first nexthop                              [ OK ]
     TEST: IPv4 multipath via second nexthop                             [ OK ]
     TEST: IPv4 multipath via first nexthop with source address          [ OK ]
     TEST: IPv4 multipath via second nexthop with source address         [ OK ]

 IPv6 multipath oif test
     TEST: IPv6 multipath via first nexthop                              [ OK ]
     TEST: IPv6 multipath via second nexthop                             [ OK ]
     TEST: IPv6 multipath via first nexthop with source address          [ OK ]
     TEST: IPv6 multipath via second nexthop with source address         [ OK ]

 IPv6 multipath oif with nexthop object test
     TEST: IPv6 multipath via first nexthop                              [ OK ]
     TEST: IPv6 multipath via second nexthop                             [ OK ]
     TEST: IPv6 multipath via first nexthop with source address          [ OK ]
     TEST: IPv6 multipath via second nexthop with source address         [ OK ]

 IPv6 multipath oif with VRF test
     TEST: IPv6 multipath via first nexthop                              [ OK ]
     TEST: IPv6 multipath via second nexthop                             [ OK ]
     TEST: IPv6 multipath via first nexthop with source address          [ OK ]
     TEST: IPv6 multipath via second nexthop with source address         [ OK ]

 Tests passed:  24
 Tests failed:   0

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260611154605.992528-4-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agoipv6: Honor oif when choosing nexthop for locally generated traffic
Ido Schimmel [Thu, 11 Jun 2026 15:46:04 +0000 (18:46 +0300)] 
ipv6: Honor oif when choosing nexthop for locally generated traffic

Commit 741a11d9e410 ("net: ipv6: Add RT6_LOOKUP_F_IFACE flag if oif is
set") made the kernel honor the oif parameter when specified as part of
output route lookup:

 # ip route add 2001:db8:1::/64 dev dummy1
 # ip route add ::/0 dev dummy2
 # ip route get 2001:db8:1::1 oif dummy2 fibmatch
 default dev dummy2 metric 1024 pref medium

Due to regression reports, the behavior was partially reverted in commit
d46a9d678e4c ("net: ipv6: Dont add RT6_LOOKUP_F_IFACE flag if saddr
set") to only honor the oif if source address is not specified:

 # ip route get 2001:db8:1::1 from 2001:db8:2::1 oif dummy2 fibmatch
 2001:db8:1::/64 dev dummy1 metric 1024 pref medium

That is, when source address is specified, the kernel will choose the
most specific route even if its nexthop device does not match the
specified oif.

This creates a problem for multipath routes. After looking up a route,
when source address is not specified, the kernel will choose a nexthop
whose nexthop device matches the specified oif:

 # sysctl -wq net.ipv6.conf.all.forwarding=1
 # ip route add 2001:db8:10::/64 nexthop via fe80::1 dev dummy1 nexthop via fe80::2 dev dummy2
 # for i in {1..100}; do ip route get 2001:db8:10::${i} oif dummy2; done | grep -o dummy[0-9] | sort | uniq -c
      100 dummy2

But will disregard the oif when source address is specified despite the
fact that a matching nexthop exists:

 # for i in {1..100}; do ip route get 2001:db8:10::${i} from 2001:db8:2::1 oif dummy2; done | grep -o dummy[0-9] | sort | uniq -c
      53 dummy1
      47 dummy2

This behavior differs from IPv4:

 # ip address add 192.0.2.1/32 dev lo
 # ip route add 198.51.100.0/24 nexthop via inet6 fe80::1 dev dummy1 nexthop via inet6 fe80::2 dev dummy2
 # for i in {1..100}; do ip route get 198.51.100.${i} from 192.0.2.1 oif dummy2; done | grep -o dummy[0-9] | sort | uniq -c
     100 dummy2

What happens is that fib6_table_lookup() returns a route with a matching
nexthop device (assuming it exists):

 # perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:10::${i} from 2001:db8:2::1 oif dummy2; done > /dev/null"
 # perf script | grep -o dummy[0-9] | sort | uniq -c
      100 dummy2

But it is later overwritten during path selection in fib6_select_path()
which instead chooses a nexthop according to the calculated hash.

Solve this by telling fib6_select_path() to skip path selection if we
have an oif match during output route lookup (iif being
LOOPBACK_IFINDEX).
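
The decision can be modelled in userspace roughly as follows. This is a sketch under stated assumptions: struct lookup and both helpers are hypothetical, and the real condition inside fib6_select_path() is not shown in this message. LOOPBACK_IFINDEX is given its usual value of 1.

```c
#include <stdbool.h>

#define LOOPBACK_IFINDEX 1

/* Hypothetical, flattened view of one route lookup. */
struct lookup {
	int iif;	/* LOOPBACK_IFINDEX for locally generated traffic */
	int oif;	/* requested output interface, 0 if unset */
	int nh_ifindex;	/* device of the nexthop the table lookup returned */
};

/* True when the lookup already produced a nexthop on the requested oif. */
static bool skip_path_selection(const struct lookup *l)
{
	return l->iif == LOOPBACK_IFINDEX && l->oif != 0 &&
	       l->oif == l->nh_ifindex;
}

/* Keep the oif match if there is one, else fall back to the hash pick. */
static int select_nexthop(const struct lookup *l, int hashed_ifindex)
{
	return skip_path_selection(l) ? l->nh_ifindex : hashed_ifindex;
}
```

Forwarded traffic (iif other than LOOPBACK_IFINDEX) and lookups without an oif still go through hash-based selection in this model.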

Behavior after the change:

 # sysctl -wq net.ipv6.conf.all.forwarding=1
 # ip route add 2001:db8:10::/64 nexthop via fe80::1 dev dummy1 nexthop via fe80::2 dev dummy2
 # for i in {1..100}; do ip route get 2001:db8:10::${i} from 2001:db8:2::1 oif dummy2; done | grep -o dummy[0-9] | sort | uniq -c
     100 dummy2

Note that enabling forwarding is only needed because we did not add
neighbor entries for the gateway addresses. When forwarding is disabled
and CONFIG_IPV6_ROUTER_PREF is not enabled in kernel config, the kernel
will treat non-existing neighbor entries as errors and perform
round-robin between the nexthops:

 # sysctl -wq net.ipv6.conf.all.forwarding=0
 # for i in {1..100}; do ip route get 2001:db8:10::${i} from 2001:db8:2::1 oif dummy2; done | grep -o dummy[0-9] | sort | uniq -c
      50 dummy1
      50 dummy2

Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260611154605.992528-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agoipv6: Select best matching nexthop object in fib6_table_lookup()
Ido Schimmel [Thu, 11 Jun 2026 15:46:03 +0000 (18:46 +0300)] 
ipv6: Select best matching nexthop object in fib6_table_lookup()

Currently, when using multipath routes without nexthop objects,
fib6_table_lookup() selects the nexthop with the highest score. This
means that when both a source address and an oif are specified, the
nexthop that is chosen is the one that matches in terms of oif:

 # sysctl -wq net.ipv6.conf.all.forwarding=1
 # ip address add 2001:db8:2::1/64 dev lo
 # ip route add 2001:db8:10::/64 nexthop via fe80::1 dev dummy1 nexthop via fe80::2 dev dummy2

 # perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:10::${i} from 2001:db8:2::1 oif dummy1; done > /dev/null"
 # perf script | grep -o dummy[0-9] | sort | uniq -c
     100 dummy1
 # perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:10::${i} from 2001:db8:2::1 oif dummy2; done > /dev/null"
 # perf script | grep -o dummy[0-9] | sort | uniq -c
     100 dummy2

When using nexthop objects, fib6_table_lookup() selects the first
matching nexthop and not necessarily the one with the highest score:

 # ip nexthop add id 1 via fe80::1 dev dummy1
 # ip nexthop add id 2 via fe80::2 dev dummy2
 # ip nexthop add id 3 group 1/2
 # ip route add 2001:db8:20::/64 nhid 3

 # perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:20::${i} from 2001:db8:2::1 oif dummy1; done > /dev/null"
 # perf script | grep -o dummy[0-9] | sort | uniq -c
     100 dummy1
 # perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:20::${i} from 2001:db8:2::1 oif dummy2; done > /dev/null"
 # perf script | grep -o dummy[0-9] | sort | uniq -c
     100 dummy1

This is not very significant right now because the nexthop is later
overwritten during path selection in fib6_select_path(). However, the
next patch is going to skip path selection when we have an oif match
during output route lookup.

As a preparation for this change, align the nexthop object behavior with
the legacy one and make sure that fib6_table_lookup() always selects the
best matching nexthop. Do that by always returning 0 from
rt6_nh_find_match() in order not to terminate the loop in
nexthop_for_each_fib6_nh() and storing in arg->nh the best matching
nexthop so far.
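
The "always return 0 and remember the best candidate" pattern can be sketched in plain C as below. All types and names here are hypothetical simplifications (the kernel's scoring and its struct layouts are not reproduced); only the control flow mirrors the description above.

```c
#include <stddef.h>

/* Simplified stand-in for a nexthop; the score field is an assumption. */
struct nh {
	const char *dev;
	int score;	/* higher is better, negative means "no match" */
};

struct match_arg {
	const struct nh *best;	/* best matching nexthop so far */
	int best_score;
};

/*
 * Callback: always return 0 so the iterator visits every nexthop, and
 * record the best match seen so far in the argument.
 */
static int find_best_match(const struct nh *nh, void *data)
{
	struct match_arg *arg = data;

	if (nh->score >= 0 && (!arg->best || nh->score > arg->best_score)) {
		arg->best = nh;
		arg->best_score = nh->score;
	}
	return 0;
}

/* Minimal iterator: stops early only if the callback returns non-zero. */
static int for_each_nh(const struct nh *nhs, size_t n,
		       int (*cb)(const struct nh *, void *), void *data)
{
	for (size_t i = 0; i < n; i++) {
		int ret = cb(&nhs[i], data);

		if (ret)
			return ret;
	}
	return 0;
}
```

A callback that returned non-zero on the first acceptable entry would stop at dummy1 in the example above even when dummy2 scores higher, which is the first-match behavior being replaced.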

Behavior after the change:

 # perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:20::${i} from 2001:db8:2::1 oif dummy1; done > /dev/null"
 # perf script | grep -o dummy[0-9] | sort | uniq -c
     100 dummy1
 # perf record -e fib6:fib6_table_lookup -- bash -c "for i in {1..100}; do ip route get 2001:db8:20::${i} from 2001:db8:2::1 oif dummy2; done > /dev/null"
 # perf script | grep -o dummy[0-9] | sort | uniq -c
     100 dummy2

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260611154605.992528-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonetconsole: clear cached dev_name on resume-window cleanup
Breno Leitao [Wed, 10 Jun 2026 14:26:04 +0000 (07:26 -0700)] 
netconsole: clear cached dev_name on resume-window cleanup

When process_resume_target() catches a device that was unregistered
while the target was off target_list, it calls do_netpoll_cleanup() to
release the reference but leaves the cached np.dev_name in place. The
other cleanup path, netconsole_process_cleanups_core(), already wipes
dev_name for MAC-bound targets because the name was only a cache of the
device that last carried the MAC and may no longer match.

The pattern is the same in both spots, so fold it into a small helper
netcons_release_dev() and route both call sites through it. This makes
the resume-window cleanup consistent with the notifier-driven one so a
later enable does not let netpoll_setup() pick a stale interface by name
when the user bound the target by MAC.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Andre Carvalho <asantostc@gmail.com>
Link: https://patch.msgid.link/20260610-netconsole_fix_more-v1-1-a18652c47cef@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet: ethernet: mtk_wed: fix loading WO firmware for MT7986
Zhi-Jun You [Thu, 11 Jun 2026 15:00:51 +0000 (23:00 +0800)] 
net: ethernet: mtk_wed: fix loading WO firmware for MT7986

MT7986 requires a different mask for second WO firmware.
Without this, WO would timeout after loading FW.

The correct mask was removed when adding WED for MT7988.
Add it back and add a WED version check to fix it.

This can be reproduced with a MT7986 + MT7916 board.

Fixes: e2f64db13aa1 ("net: ethernet: mtk_wed: introduce WED support for MT7988")
Signed-off-by: Zhi-Jun You <hujy652@gmail.com>
Link: https://patch.msgid.link/20260611150051.586-1-hujy652@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet: watchdog: fix refcount tracking races
Eric Dumazet [Thu, 11 Jun 2026 15:27:37 +0000 (15:27 +0000)] 
net: watchdog: fix refcount tracking races

Blamed commit converted the untracked dev_hold()/dev_put() calls
in the watchdog code to use the tracked dev_hold_track()/dev_put_track()
(which were later renamed/interfaced to netdev_hold() and netdev_put()).

By introducing dev->watchdog_dev_tracker to store the
reference tracking information without adding synchronization
between netdev_watchdog_up() and dev_watchdog(), it enabled the
race condition where this pointer could be overwritten or freed
concurrently, leading to the list corruption crash syzbot reported:

list_del corruption, ffff888114a18c00->next is NULL
 kernel BUG at lib/list_debug.c:52 !
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 1 UID: 0 PID: 91 Comm: kworker/u8:5 Not tainted syzkaller #0 PREEMPT(lazy)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
Workqueue: events_unbound linkwatch_event
 RIP: 0010:__list_del_entry_valid_or_report.cold+0x22/0x2a lib/list_debug.c:52
Call Trace:
 <TASK>
  __list_del_entry_valid include/linux/list.h:132 [inline]
  __list_del_entry include/linux/list.h:246 [inline]
  list_move_tail include/linux/list.h:341 [inline]
  ref_tracker_free+0x1a7/0x6c0 lib/ref_tracker.c:329
  netdev_tracker_free include/linux/netdevice.h:4491 [inline]
  netdev_put include/linux/netdevice.h:4508 [inline]
  netdev_put include/linux/netdevice.h:4504 [inline]
  netdev_watchdog_down net/sched/sch_generic.c:600 [inline]
  dev_deactivate_many+0x28c/0xfe0 net/sched/sch_generic.c:1363
  dev_deactivate+0x109/0x1d0 net/sched/sch_generic.c:1397
  linkwatch_do_dev net/core/link_watch.c:184 [inline]
  linkwatch_do_dev+0xd3/0x120 net/core/link_watch.c:166
  __linkwatch_run_queue+0x3a5/0x810 net/core/link_watch.c:240
  linkwatch_event+0x8f/0xc0 net/core/link_watch.c:314
  process_one_work+0xa0e/0x1980 kernel/workqueue.c:3314
  process_scheduled_works kernel/workqueue.c:3397 [inline]
  worker_thread+0x5ef/0xe50 kernel/workqueue.c:3478
  kthread+0x370/0x450 kernel/kthread.c:436
  ret_from_fork+0x69a/0xc80 arch/x86/kernel/process.c:158
  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

This patch has three coordinated parts:

1) Add dev->watchdog_lock and dev->watchdog_ref_held to serialize watchdog operations.

2) Remove netdev_watchdog_up() call from netif_carrier_on():
   This ensures netdev_watchdog_up() is only called from process/BH context
   (via linkwatch workqueue dev_activate()), allowing us to use
   spin_lock_bh() for synchronization.

3) Synchronize watchdog up and watchdog timer:
   Protect netdev_watchdog_up() with tx_global_lock and watchdog_lock.
   Only allocate a new tracker in netdev_watchdog_up() if one is
   not already present.
   In dev_watchdog(), ensure we don't release the tracker if the
   timer was rescheduled either by dev_watchdog() itself or concurrently
   by netdev_watchdog_up().

Fixes: f12bf6f3f942 ("net: watchdog: add net device refcount tracker")
Reported-by: syzbot+381d82bbf0253710b35d@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6a26b751.c25708ab.1b19ef.0013.GAE@google.com/T/#u
Tested-by: syzbot+3479efbc2821cb2a79f2@syzkaller.appspotmail.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260611152737.2580480-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agoselftests: iou-zcrx: defer listen() until after zcrx setup
Dragos Tatulea [Thu, 11 Jun 2026 16:03:41 +0000 (19:03 +0300)] 
selftests: iou-zcrx: defer listen() until after zcrx setup

The server binds the queues for zero-copy after listen(). If the client
does a connect() during this time it can fail with EHOSTUNREACH on
a cold system. This was encountered with the mlx5 driver, where binding
the queues through .ndo_queue_start() is a slow operation during which no
packets can be exchanged.

This change moves listen() after queue binding, when the test server is
fully operational.
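
The ordering can be sketched with plain sockets as follows. This is not the selftest's code: start_server(), the setup hook and record_setup() are hypothetical, IPv4 is used only for brevity, and the hook merely stands in for the slow queue binding.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical hook standing in for the slow zero-copy queue binding. */
typedef int (*setup_fn)(int fd);

/* Returns 1 if fd is a listening socket, 0 if not, -1 on error. */
static int is_listening(int fd)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, SOL_SOCKET, SO_ACCEPTCONN, &val, &len) < 0)
		return -1;
	return val ? 1 : 0;
}

/* Example hook: records whether the socket was listening during setup. */
static int seen_listening = -1;

static int record_setup(int fd)
{
	seen_listening = is_listening(fd);
	return 0;
}

/* bind() first, run the slow setup, and only then listen(). */
static int start_server(setup_fn setup)
{
	struct sockaddr_in addr;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = 0;	/* let the kernel pick a free port */

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    setup(fd) < 0 ||
	    listen(fd, 1) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

Because the socket is not yet listening while the hook runs, a client cannot get an accepted connection before the server side is ready to exchange packets.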

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Link: https://patch.msgid.link/20260611160341.3697227-2-dtatulea@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agoMerge branch 'net-mana-fix-error-path-issues-in-queue-setup'
Jakub Kicinski [Sat, 13 Jun 2026 00:26:16 +0000 (17:26 -0700)] 
Merge branch 'net-mana-fix-error-path-issues-in-queue-setup'

Aditya Garg says:

====================
net: mana: fix error-path issues in queue setup

Two error-path fixes in MANA queue setup, both surfaced during Sashiko
AI review of a recently upstreamed patch series.

Patch 1 initializes queue->id to INVALID_QUEUE_ID in
mana_gd_create_mana_wq_cq() so that a CQ creation failure before the
firmware id is assigned does not NULL gc->cq_table[0] and silently
break whichever real CQ owns that slot. This mirrors the existing
pattern in mana_gd_create_eq().

Patch 2 guards mana_destroy_txq()'s call to mana_destroy_wq_obj() with
an INVALID_MANA_HANDLE check, mirroring mana_destroy_rxq(). Without
it, TX setup failures lead to a firmware-rejected destroy of (u64)-1
and a spurious error in dmesg.
====================

Link: https://patch.msgid.link/20260608101345.2267320-1-gargaditya@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check
Aditya Garg [Mon, 8 Jun 2026 10:13:41 +0000 (03:13 -0700)] 
net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check

mana_create_txq() has several error paths (after mana_alloc_queues() or
mana_create_wq_obj() failure) where tx_qp[i].tx_object stays as the
INVALID_MANA_HANDLE sentinel set at allocation. mana_destroy_txq() then
unconditionally calls mana_destroy_wq_obj() with (u64)-1, which firmware
rejects and logs an error.

Mirror the RX-side pattern in mana_destroy_rxq() and skip the destroy
when the handle is still INVALID_MANA_HANDLE.
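
A small userspace model of the guard, assuming the sentinel is (u64)-1 as stated above. The function names are hypothetical and the counter only stands in for a request that would reach firmware.

```c
#include <stdint.h>

/* Sentinel from the message above: (u64)-1. */
#define INVALID_MANA_HANDLE ((uint64_t)-1)

/* Counts destroy requests that would be sent to firmware. */
static int destroy_requests;

/* Stand-in for the firmware call; it only counts invocations. */
static void destroy_wq_obj(uint64_t handle)
{
	(void)handle;
	destroy_requests++;
}

/* Skip the destroy when the object was never created. */
static void destroy_txq(uint64_t tx_object)
{
	if (tx_object != INVALID_MANA_HANDLE)
		destroy_wq_obj(tx_object);
}
```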

Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/20260608101345.2267320-3-gargaditya@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet: mana: initialize gdma queue id to INVALID_QUEUE_ID
Aditya Garg [Mon, 8 Jun 2026 10:13:40 +0000 (03:13 -0700)] 
net: mana: initialize gdma queue id to INVALID_QUEUE_ID

mana_gd_create_mana_wq_cq() leaves queue->id as 0 (from kzalloc_obj())
until mana_create_wq_obj() assigns the firmware-returned id. If creation
fails before that, cleanup calls mana_gd_destroy_cq() with id 0, NULLing
gc->cq_table[0] and silently breaking whichever real CQ owns that slot.

Initialize queue->id to INVALID_QUEUE_ID right after allocation, matching
mana_gd_create_eq(). The existing (id >= max_num_cqs) guard then
short-circuits cleanly.
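
The effect of the sentinel on the cleanup path can be modelled as below. The table size, the sentinel value (all ones in 32 bits) and the function name are assumptions for illustration; only the bounds check mirrors the guard mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_NUM_CQS 4			/* assumed small table for the demo */
#define INVALID_QUEUE_ID UINT32_MAX	/* assumed sentinel value */

static void *cq_table[MAX_NUM_CQS];

/*
 * Cleanup in the style described above: an id outside the table means
 * the queue was never registered, so nothing is cleared.
 */
static void destroy_cq(uint32_t id)
{
	if (id >= MAX_NUM_CQS)
		return;
	cq_table[id] = NULL;
}
```

With a zero-initialized id the same cleanup would clear slot 0, which may belong to an unrelated, live CQ.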

Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/20260608101345.2267320-2-gargaditya@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agoMerge branch 'net-mdio-realtek-rtl9300-add-rtl931x-support'
Jakub Kicinski [Sat, 13 Jun 2026 00:24:13 +0000 (17:24 -0700)] 
Merge branch 'net-mdio-realtek-rtl9300-add-rtl931x-support'

Markus Stockhausen says:

====================
net: mdio: realtek-rtl9300: Add RTL931x support

The Realtek Otto switch platform consists of four different series

- RTL838x aka maple   : 28 port 1G Switches
- RTL839x aka cypress : 52 port 1G Switches
- RTL930x aka longan  : 28 port 1G/2.5G/10G Switches
- RTL931x aka mango   : 56 port 1G/2.5G/10G Switches

This patch series adds support for the RTL931x devices. For this

- Enhance device tree binding.
- Implement final cleanups and enhancements for the driver.
- Add RTL931x coding.

Remark: Instead of this series it was planned to bring support for
hardware polling configuration first. It turns out that more testing
is needed - especially for the RTL83xx SoCs. Instead add the lineup
of the RTL931x devices, that are known to have no obvious bus and
polling issues (at least from testing and vendor SDK perspective).
====================

Link: https://patch.msgid.link/20260610194145.4153668-1-markus.stockhausen@gmx.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet: mdio: realtek-rtl9300: Add support for RTL931x
Markus Stockhausen [Wed, 10 Jun 2026 19:41:45 +0000 (21:41 +0200)] 
net: mdio: realtek-rtl9300: Add support for RTL931x

The MDIO driver has been prepared for multiple device support. Add all
required bits for the RTL931x (aka mango) series. This is straightforward
but some things are worth mentioning.

- In contrast to RTL930x the I/O register has the input/output fields
  swapped. Upper 16 bits are for read/outputs, and the lower 16 bits
  are for write/inputs.
- The supported "pages" are 8192 and thus the raw page is 8191
- The devices support up to 56 ports. Thus the MAX_PORTS definition
  is increased by this commit.
- There are multiple global SMI controller registers with a different
  layout from RTL930x devices. Therefore a separate setup_controller()
  callback is added.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
Link: https://patch.msgid.link/20260610194145.4153668-6-markus.stockhausen@gmx.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet: mdio: realtek-rtl9300: Add registers for high port count models
Markus Stockhausen [Wed, 10 Jun 2026 19:41:44 +0000 (21:41 +0200)] 
net: mdio: realtek-rtl9300: Add registers for high port count models

The high port count models of the Realtek Otto switches have additional
registers to instrument the MDIO controller. These are:

- High port mask: A bitfield that extends the already existing low port
  mask to select ports starting from 32.
- Broadcast: This takes the port number during reads on the RTL931x.
- Extended page: Some additional page info. The SDK does not give much
  information about this. Basically some fixed value must be written
  into it during access.
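
How a port number might map onto the two mask registers can be sketched as follows. The helper is hypothetical and assumes a plain 32/32 split with ports numbered from 0; the real register layout is not described in this message.

```c
#include <stdint.h>

/*
 * Hypothetical helper: split a port number (0-63) into the low and high
 * 32-bit port masks. Ports outside that range select nothing.
 */
static void port_to_masks(unsigned int port, uint32_t *low, uint32_t *high)
{
	*low = port < 32 ? UINT32_C(1) << port : 0;
	*high = (port >= 32 && port < 64) ? UINT32_C(1) << (port - 32) : 0;
}
```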

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
Link: https://patch.msgid.link/20260610194145.4153668-5-markus.stockhausen@gmx.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet: mdio: realtek-rtl9300: Make otto_emdio_read_cmd() generic
Markus Stockhausen [Wed, 10 Jun 2026 19:41:43 +0000 (21:41 +0200)] 
net: mdio: realtek-rtl9300: Make otto_emdio_read_cmd() generic

The otto_emdio_read_cmd() helper still uses RTL9300 specific properties.
It cannot be reused as is because the I/O register has different layouts
for the different SoCs. E.g.

- RTL930x: data in bits 31-16, data out bits 15-0
- RTL931x: data in bits 15-0, data out bits 31-16

Add a mask parameter to the function signature and fill it properly
in the callers. As the masks will always have bits set from constant
defines, there is no need for a consistency check.
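
A mask-driven field read could look roughly like the sketch below. The helper and the two mask names are hypothetical; the bit ranges are the ones listed above, and the zero-mask check exists only to keep this standalone demo safe.

```c
#include <stdint.h>

/* Hypothetical mask names; bit ranges as listed in the message above. */
#define DEMO_RTL930X_DATA_OUT_MASK UINT32_C(0x0000ffff)	/* bits 15-0 */
#define DEMO_RTL931X_DATA_OUT_MASK UINT32_C(0xffff0000)	/* bits 31-16 */

/*
 * Extract the field selected by mask from reg, shifted down so that the
 * lowest bit of the mask becomes bit 0 of the result.
 */
static uint32_t field_get32(uint32_t mask, uint32_t reg)
{
	uint32_t val = reg & mask;

	if (!mask)
		return 0;
	while (!(mask & 1)) {
		mask >>= 1;
		val >>= 1;
	}
	return val;
}
```

Passing the mask as a parameter lets one read helper serve both layouts, with each caller supplying the constant for its SoC.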

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
Link: https://patch.msgid.link/20260610194145.4153668-4-markus.stockhausen@gmx.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet: mdio: realtek-rtl9300: Add prefix to register field defines
Markus Stockhausen [Wed, 10 Jun 2026 19:41:42 +0000 (21:41 +0200)] 
net: mdio: realtek-rtl9300: Add prefix to register field defines

The current Realtek Otto MDIO driver has some define leftovers without
a SoC prefix. When adding new devices there will be an overlap for some
of them. Sort this out as follows:

- PHY_CTRL_CMD/PHY_CTRL_MMD_DEVAD/PHY_CTRL_MMD_REG are common for all
  series. Leave them as is but move them into a separate block.
- Add RTL9300 prefix to all other defines and adapt the callers.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
Link: https://patch.msgid.link/20260610194145.4153668-3-markus.stockhausen@gmx.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agodt-bindings: net: realtek,rtl9301-mdio: Add RTL931x series
Markus Stockhausen [Wed, 10 Jun 2026 19:41:41 +0000 (21:41 +0200)] 
dt-bindings: net: realtek,rtl9301-mdio: Add RTL931x series

The 10G Realtek Otto switches are divided into two series

- Longan: RTL930x up to 28 ports
- Mango : RTL931x up to 56 ports

The Mango based devices have 3 different SoCs RTL9311, RTL9312 and RTL9313.
The MDIO controller of these switches works like the existing RTL930x
logic but has different characteristics and different registers. Add new
compatibles in the device tree.

Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
Link: https://patch.msgid.link/20260610194145.4153668-2-markus.stockhausen@gmx.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agoMerge tag 'pinctrl-v7.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw...
Linus Torvalds [Sat, 13 Jun 2026 00:23:05 +0000 (17:23 -0700)] 
Merge tag 'pinctrl-v7.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

Pull pin control fixes from Linus Walleij:

 - Two fixes for the mcp23s08 driver.

 - Revert an earlier fix to the AMD pin controller that was all wrong. A
   proper fix is being developed.

* tag 'pinctrl-v7.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  Revert "pinctrl-amd: enable IRQ for WACF2200 touchscreen on Lenovo Yoga 7 14AGP11"
  pinctrl: mcp23s08: Read spi-present-mask as u8 not u32
  pinctrl: mcp23s08: Initialize mcp->dev and mcp->addr before regmap init

11 days agoMerge branch 'avoid-mistaken-parent-class-deactivation-during-peek'
Jakub Kicinski [Sat, 13 Jun 2026 00:20:55 +0000 (17:20 -0700)] 
Merge branch 'avoid-mistaken-parent-class-deactivation-during-peek'

Victor Nogueira says:

====================
Avoid mistaken parent class deactivation during peek

Several qdiscs (fq_codel, codel and dualpi2) may drop packets while
peeking at their queue. When that happens they call
qdisc_tree_reduce_backlog() to notify the parent of the backlog/qlen
change. The problem is that they do so *before* reincrementing the qlen
that peek had temporarily decremented.

If the qlen momentarily drops to zero while peek still has an skb to
return, qdisc_tree_reduce_backlog() ends up invoking the parent's
qlen_notify() callback even though the child is not actually empty. The
parent then deactivates the class, while the child still holds a packet.
For parents such as QFQ this desync corrupts the active class list and
leads to wild memory accesses and NULL pointer dereferences (see the
per-patch splats). For HFSC it might lead to stalls [1].

Fix all three qdiscs the same way: only call qdisc_tree_reduce_backlog()
once the qlen has been restored, so the parent never observes a
transient empty child during peek.
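
The ordering problem can be reproduced in a toy model. Everything below is a hypothetical simplification: the structs, the notify rule ("deactivate when the child reports qlen 0") and both peek variants are stand-ins, and callers must ensure qlen is at least dropped + 1.

```c
/* Toy child qdisc and parent class; not the kernel's structures. */
struct child {
	unsigned int qlen;
};

struct parent {
	int class_active;
};

/* Stand-in for the notification: an empty child deactivates the class. */
static void reduce_backlog(struct parent *p, const struct child *c)
{
	if (c->qlen == 0)
		p->class_active = 0;
}

/* Buggy order: notify while the peeked skb is still subtracted. */
static void peek_buggy(struct parent *p, struct child *c, unsigned int dropped)
{
	c->qlen -= dropped + 1;	/* dropped packets plus the peeked skb */
	if (dropped)
		reduce_backlog(p, c);
	c->qlen += 1;		/* peeked skb is still held by the child */
}

/* Fixed order: restore qlen first, then notify the parent. */
static void peek_fixed(struct parent *p, struct child *c, unsigned int dropped)
{
	c->qlen -= dropped + 1;
	c->qlen += 1;
	if (dropped)
		reduce_backlog(p, c);
}
```

Starting from two queued packets with one dropped during peek, the buggy variant deactivates the class even though one packet remains, while the fixed variant leaves it active.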

Patch 1 fixes this for fq_codel, patch 2 for codel, patch 3 for dualpi2
and patch 4 adds test cases for these 3 setups.

Note: Patch 1 is one of two fixes for the stall reported in [1]; the
companion fix is "net/sched: sch_hfsc: Don't make class passive twice",
sent separately.

Note2: A possible cleaner fix is to create a new helper function for peek
that only calls qdisc_tree_reduce_backlog after reincrementing the qlen.
This would be called from the 3 vulnerable qdiscs, however we thought this
might make it harder for backporting so, if people agree, we can submit
this cleaner version to net-next after this one is merged.

[1] https://lore.kernel.org/netdev/CAN2cbVe79oj0O9==m4+4x3v+O+qzRagA=2=wkrp9i9=CqYvyZA@mail.gmail.com/
====================

Link: https://patch.msgid.link/20260610192855.3121513-1-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agoselftests/tc-testing: Verify child qdisc will not mistakenly deactivate QFQ parent
Victor Nogueira [Wed, 10 Jun 2026 19:28:55 +0000 (16:28 -0300)] 
selftests/tc-testing: Verify child qdisc will not mistakenly deactivate QFQ parent

Create 3 test cases:
- Verify fq_codel won't mistakenly deactivate QFQ parent class during peek
- Verify codel won't mistakenly deactivate QFQ parent class during peek
- Verify dualpi2 won't mistakenly deactivate QFQ parent class during peek

Verify that these 3 qdiscs (fq_codel, codel, dualpi2) will not call
qdisc_tree_reduce_backlog with an incorrect qlen (0) during peek and
mistakenly deactivate a parent class.

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260610192855.3121513-5-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet/sched: sch_dualpi2: Do not call qdisc_tree_reduce_backlog during peek before...
Victor Nogueira [Wed, 10 Jun 2026 19:28:54 +0000 (16:28 -0300)] 
net/sched: sch_dualpi2: Do not call qdisc_tree_reduce_backlog during peek before restoring qlen

Whenever dualpi2 drops packets during peek, it calls
qdisc_tree_reduce_backlog. An issue arises because it calls
qdisc_tree_reduce_backlog before it reincrements the qlen. If qlen drops
to zero, but peek returns an skb, the parent's qlen_notify callback will be
executed even though dualpi2 still has 1 packet on the queue and, thus,
will mistakenly deactivate the parent's class, which leads to a null-ptr-deref:

[  101.427314][  T599] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000009: 0000 [#1] SMP KASAN NOPTI
[  101.427755][  T599] KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f]
[  101.428048][  T599] CPU: 2 UID: 0 PID: 599 Comm: ping Not tainted 7.1.0-rc5-00284-gbce53c430ed7 #102 PREEMPT(full)
[  101.428400][  T599] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  101.428608][  T599] RIP: 0010:qfq_dequeue (net/sched/sch_qfq.c:1150) sch_qfq
[  101.428821][  T599] Code: 00 fc ff df 80 3c 02 00 0f 85 46 0c 00 00 4c 8d 73 48 48 89 9d b8 02 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 2d 0c 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8b
[  101.429348][  T599] RSP: 0018:ffff8881110df4f0 EFLAGS: 00010216
[  101.429541][  T599] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: dffffc0000000000
[  101.429763][  T599] RDX: 0000000000000009 RSI: 00000024c0000000 RDI: ffff88811436c2b0
[  101.429985][  T599] RBP: ffff88811436c000 R08: ffff88811436c280 R09: 1ffff11021277523
[  101.430206][  T599] R10: 1ffff11021277526 R11: 1ffff11021277527 R12: 00000024c0000000
[  101.430423][  T599] R13: ffff88811436c2b8 R14: 0000000000000048 R15: 0000000020000000
[  101.430642][  T599] FS:  00007f61813e1c40(0000) GS:ffff8881691ef000(0000) knlGS:0000000000000000
[  101.430913][  T599] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  101.431100][  T599] CR2: 00005651650850a8 CR3: 000000010ca0b000 CR4: 0000000000750ef0
[  101.431320][  T599] PKRU: 55555554
[  101.431433][  T599] Call Trace:
[  101.431544][  T599]  <TASK>
[  101.431628][  T599]  __qdisc_run (net/sched/sch_generic.c:322 net/sched/sch_generic.c:427 net/sched/sch_generic.c:445)
[  101.431792][  T599]  ? dev_qdisc_enqueue (./include/trace/events/qdisc.h:49 (discriminator 22) net/core/dev.c:4176 (discriminator 22))
[  101.431941][  T599]  __dev_queue_xmit (./include/net/pkt_sched.h:120 ./include/net/pkt_sched.h:117 net/core/dev.c:4292 net/core/dev.c:4831)

Fix this by only calling qdisc_tree_reduce_backlog in peek after the
qlen is restored.

Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260610192855.3121513-4-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet/sched: sch_codel: Do not call qdisc_tree_reduce_backlog during peek before restor...
Victor Nogueira [Wed, 10 Jun 2026 19:28:53 +0000 (16:28 -0300)] 
net/sched: sch_codel: Do not call qdisc_tree_reduce_backlog during peek before restoring qlen

Whenever codel drops packets during peek, it calls
qdisc_tree_reduce_backlog. An issue arises because it calls
qdisc_tree_reduce_backlog before it reincrements the qlen. If qlen drops
to zero, but peek returns an skb, the parent's qlen_notify callback will
be executed even though codel still has 1 packet on the queue and, thus,
will mistakenly deactivate the parent's class causing issues like a wild
memory access when qfq has codel as a child:

[   36.339843][  T370] Oops: general protection fault, probably for non-canonical address 0xfbd59c0000000024: 0000 [#1] SMP KASAN NOPTI
[   36.340408][  T370] KASAN: maybe wild-memory-access in range [0xdead000000000120-0xdead000000000127]
[   36.340737][  T370] CPU: 2 UID: 0 PID: 370 Comm: tc Not tainted 7.1.0-rc5-00287-g66e13b626592 #87 PREEMPT(full)
[   36.341113][  T370] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   36.341357][  T370] RIP: 0010:qfq_deactivate_agg (include/linux/list.h:1029 (discriminator 2) include/linux/list.h:1043 (discriminator 2) net/sched/sch_qfq.c:1369 (discriminator 2) net/sched/sch_qfq.c:1395 (discriminator 2)) sch_qfq
[   36.342221][  T370] RSP: 0018:ffff8881100ef370 EFLAGS: 00010216
[   36.342422][  T370] RAX: 0000000000000000 RBX: ffff8881058a9568 RCX: dffffc0000000000
[   36.342664][  T370] RDX: 1ffff11021064dc3 RSI: ffff888108326e00 RDI: dffffc0000000000
[   36.342905][  T370] RBP: ffff8881058a8280 R08: dead000000000122 R09: 1bd5a00000000024
[   36.343140][  T370] R10: fffffbfff2940329 R11: fffffbfff2940329 R12: 0000000000000000
[   36.343383][  T370] R13: dead000000000100 R14: ffff8881058a9580 R15: ffff8881058a9578
[   36.343631][  T370] FS:  00007fc04b0ca780(0000) GS:ffff888184fef000(0000) knlGS:0000000000000000
[   36.343911][  T370] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   36.344116][  T370] CR2: 0000557c02c02000 CR3: 000000010e0ba000 CR4: 0000000000750ef0
[   36.344359][  T370] PKRU: 55555554
[   36.344481][  T370] Call Trace:
...
[   36.345054][  T370] qfq_reset_qdisc (net/sched/sch_qfq.c:357 net/sched/sch_qfq.c:1487) sch_qfq
[   36.345222][  T370]  qdisc_reset (net/sched/sch_generic.c:1057)
[   36.345503][  T370]  __qdisc_destroy (net/sched/sch_generic.c:1096)
[   36.345677][  T370]  qdisc_graft (net/sched/sch_api.c:1062 net/sched/sch_api.c:1053 net/sched/sch_api.c:1159)
[   36.346335][  T370]  tc_get_qdisc (net/sched/sch_api.c:1528 net/sched/sch_api.c:1556)

Fix this by only calling qdisc_tree_reduce_backlog in peek after the
qlen is restored.

Fixes: 342debc12183 ("codel: remove sch->q.qlen check before qdisc_tree_reduce_backlog()")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260610192855.3121513-3-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet/sched: sch_fq_codel: Do not call qdisc_tree_reduce_backlog during peek before...
Victor Nogueira [Wed, 10 Jun 2026 19:28:52 +0000 (16:28 -0300)] 
net/sched: sch_fq_codel: Do not call qdisc_tree_reduce_backlog during peek before restoring qlen

Whenever fq_codel drops packets during peek, it calls
qdisc_tree_reduce_backlog. An issue arises because it calls
qdisc_tree_reduce_backlog before it reincrements the qlen. If qlen drops
to zero, but peek returns an skb, the parent's qlen_notify callback will be
executed even though fq_codel still has 1 packet on the queue and, thus,
will mistakenly deactivate the parent's class causing issues like a recent
report [1] and a wild memory access in qfq:

[   29.371146][  T360] Oops: general protection fault, probably for non-canonical address 0xfbd59c0000000024: 0000 [#1] SMP KASAN NOPTI
[   29.371666][  T360] KASAN: maybe wild-memory-access in range [0xdead000000000120-0xdead000000000127]
[   29.371987][  T360] CPU: 6 UID: 0 PID: 360 Comm: tc Not tainted 7.1.0-rc5-00285-gc530e5b2dbc6-dirty #82 PREEMPT(full)
[   29.372384][  T360] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   29.372620][  T360] RIP: 0010:qfq_deactivate_agg (include/linux/list.h:1029 (discriminator 2) include/linux/list.h:1043 (discriminator 2) net/sched/sch_qfq.c:1369 (discriminator 2) net/sched/sch_qfq.c:1395 (discriminator 2)) sch_qfq
[   29.373544][  T360] RSP: 0018:ffff888102417370 EFLAGS: 00010216
[   29.373800][  T360] RAX: 0000000000000000 RBX: ffff88811224d568 RCX: dffffc0000000000
[   29.374079][  T360] RDX: 1ffff11021fe1543 RSI: ffff88810ff0aa00 RDI: dffffc0000000000
[   29.374368][  T360] RBP: ffff88811224c280 R08: dead000000000122 R09: 1bd5a00000000024
[   29.374649][  T360] R10: fffffbfff7940329 R11: fffffbfff7940329 R12: 0000000000000000
[   29.374926][  T360] R13: dead000000000100 R14: ffff88811224d580 R15: ffff88811224d578
[   29.375207][  T360] FS:  00007f5b794e5780(0000) GS:ffff88815d1e9000(0000) knlGS:0000000000000000
[   29.375545][  T360] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   29.375823][  T360] CR2: 000055ffb091f000 CR3: 000000010a305000 CR4: 0000000000750ef0
[   29.376103][  T360] PKRU: 55555554
[   29.376258][  T360] Call Trace:
[   29.376401][  T360]  <TASK>
...
[   29.376885][  T360] qfq_reset_qdisc (net/sched/sch_qfq.c:357 net/sched/sch_qfq.c:1487) sch_qfq
[   29.377074][  T360]  qdisc_reset (net/sched/sch_generic.c:1057)
[   29.377414][  T360]  __qdisc_destroy (net/sched/sch_generic.c:1096)
[   29.377600][  T360]  qdisc_graft (net/sched/sch_api.c:1062 net/sched/sch_api.c:1053 net/sched/sch_api.c:1159)
[   29.378593][  T360]  tc_get_qdisc (net/sched/sch_api.c:1528 net/sched/sch_api.c:1556)

Fix this by only calling qdisc_tree_reduce_backlog in peek after the
qlen is restored.

[1] http://lore.kernel.org/netdev/CAN2cbVe79oj0O9==m4+4x3v+O+qzRagA=2=wkrp9i9=CqYvyZA@mail.gmail.com/

Fixes: 342debc12183 ("codel: remove sch->q.qlen check before qdisc_tree_reduce_backlog()")
Reported-by: Anirudh Gupta <anirudhrudr@gmail.com>
Closes: https://lore.kernel.org/netdev/CAN2cbVe79oj0O9==m4+4x3v+O+qzRagA=2=wkrp9i9=CqYvyZA@mail.gmail.com/
Tested-by: Anirudh Gupta <anirudhrudr@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260610192855.3121513-2-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agoMerge branch 'ipv6-mcast-annotate-data-races-in-proc-net-igmp6'
Jakub Kicinski [Sat, 13 Jun 2026 00:12:13 +0000 (17:12 -0700)] 
Merge branch 'ipv6-mcast-annotate-data-races-in-proc-net-igmp6'

Yuyang Huang says:

====================
ipv6: mcast: annotate data races in /proc/net/igmp6

/proc/net/igmp6 walks IPv6 multicast memberships under RCU without
holding idev->mc_lock, taking a lockless snapshot of two fields that
writers update under the lock: mca_flags and mca_work.timer.expires.

Patch 1 adds WRITE_ONCE() to all mca_flags update sites and READ_ONCE()
to the procfs reader.  Patch 2 does the same for the timer.expires read
in the procfs path.
====================

Link: https://patch.msgid.link/20260609081113.7613-1-sigefriedhyy@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agoipv6: mcast: annotate igmp6 timer expiry race
Yuyang Huang [Tue, 9 Jun 2026 08:11:13 +0000 (17:11 +0900)] 
ipv6: mcast: annotate igmp6 timer expiry race

/proc/net/igmp6 walks IPv6 multicast memberships under RCU and reads
mca_work.timer.expires to print the remaining multicast timer. The
delayed-work timer can be updated concurrently.

Annotate the intentional lockless procfs snapshot with READ_ONCE().

Signed-off-by: Yuyang Huang <sigefriedhyy@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260609081113.7613-3-sigefriedhyy@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agoipv6: mcast: annotate data-races around mca_flags
Yuyang Huang [Tue, 9 Jun 2026 08:11:12 +0000 (17:11 +0900)] 
ipv6: mcast: annotate data-races around mca_flags

/proc/net/igmp6 walks IPv6 multicast memberships under RCU and
prints mca_flags without holding idev->mc_lock. The multicast paths
update the field while holding idev->mc_lock.

Annotate this intentional lockless snapshot with READ_ONCE() and the
matching writers with WRITE_ONCE().
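
As a rough userspace illustration of the annotation pattern, the sketch below uses simplified stand-ins for the kernel macros. TOY_READ_ONCE(), TOY_WRITE_ONCE(), struct toy_mca and both helpers are hypothetical names; the macros only handle scalar fields and rely on the GCC/Clang __typeof__ extension, so they are not the kernel definitions.

```c
#include <assert.h>

/* Simplified stand-ins (assumption: scalar fields only). The volatile
 * access asks the compiler to perform exactly one load or store.
 */
#define TOY_READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))
#define TOY_WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical stand-in for a multicast membership entry. */
struct toy_mca {
	unsigned int flags;
};

/* Writer side: in the kernel this would run with the lock held, so the
 * plain read of the old value on the right-hand side is serialized
 * against other writers.
 */
static void toy_set_flag(struct toy_mca *mc, unsigned int flag)
{
	TOY_WRITE_ONCE(mc->flags, mc->flags | flag);
}

/* Reader side: lockless snapshot, as a procfs-style reader would take. */
static unsigned int toy_snapshot_flags(struct toy_mca *mc)
{
	return TOY_READ_ONCE(mc->flags);
}
```

The reader takes a single snapshot and then works on the local copy, instead of re-reading the field while formatting its output.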

Signed-off-by: Yuyang Huang <sigefriedhyy@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260609081113.7613-2-sigefriedhyy@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agoMerge branch 'rxrpc-miscellaneous-fixes'
Jakub Kicinski [Fri, 12 Jun 2026 23:48:57 +0000 (16:48 -0700)] 
Merge branch 'rxrpc-miscellaneous-fixes'

David Howells says:

====================
rxrpc: Miscellaneous fixes

Here are some miscellaneous AF_RXRPC fixes:

 (1) Make sure rxrpc_verify_data() allocates a buffer, even if the DATA
     packet being looked at is zero length, to avoid potential NULL-pointer
     dereferences.

 (2) Don't move an OOB message (e.g. an RxGK CHALLENGE) off the receive
     queue onto the pending queue in recvmsg() if MSG_PEEK is specified.

 (3) Fix a potential UAF in rxgk_issue_challenge() in which a tracepoint
     refers to memory just freed by a different pointer.

 (4) Fix afs net namespace teardown to cancel the incoming call
     preallocation charger before we disable listening (which will delete
     the preallocation queue).

 (5) Fix rxrpc_kernel_charge_accept() to use the socket mutex to defend
     against listen(0)/shutdown simultaneously deleting the preallocation
     queue.
====================

Link: https://patch.msgid.link/20260609140911.838677-1-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agorxrpc: serialize kernel accept preallocation with socket teardown
Li Daming [Tue, 9 Jun 2026 14:09:09 +0000 (15:09 +0100)] 
rxrpc: serialize kernel accept preallocation with socket teardown

rxrpc_kernel_charge_accept() reads rx->backlog without any
socket/backlog synchronization and passes that raw pointer into
rxrpc_service_prealloc_one(). A concurrent rxrpc_discard_prealloc()
sets rx->backlog = NULL and frees the backlog rings, so a kernel
preallocation worker can keep using a freed struct rxrpc_backlog
while updating *_backlog_head/tail and array slots.

Serialize the state check and backlog lookup with the socket lock,
and reject kernel preallocation once teardown has disabled
listening or discarded the service backlog.
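
The serialization idea can be sketched in userspace with a pthread mutex. Everything here is an illustrative assumption: struct toy_sock, struct toy_backlog and the three helpers are hypothetical, and the boolean return value does not reflect the error codes used by the actual fix.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the socket and its preallocation backlog. */
struct toy_backlog {
	unsigned int charged;
};

struct toy_sock {
	pthread_mutex_t lock;
	bool listening;
	struct toy_backlog *backlog;	/* NULL once discarded */
};

/* Set up the lock and the backlog; returns false on allocation failure. */
static bool toy_sock_listen(struct toy_sock *rx)
{
	pthread_mutex_init(&rx->lock, NULL);
	rx->backlog = calloc(1, sizeof(*rx->backlog));
	rx->listening = rx->backlog != NULL;
	return rx->listening;
}

/* Preallocation path: the state check and the backlog lookup happen
 * under the same lock that teardown takes, so a freed backlog is never
 * touched.
 */
static bool toy_charge_accept(struct toy_sock *rx)
{
	bool charged = false;

	pthread_mutex_lock(&rx->lock);
	if (rx->listening && rx->backlog) {
		rx->backlog->charged++;
		charged = true;
	}
	pthread_mutex_unlock(&rx->lock);
	return charged;
}

/* Teardown path: disable listening and free the backlog under the lock. */
static void toy_discard_prealloc(struct toy_sock *rx)
{
	pthread_mutex_lock(&rx->lock);
	rx->listening = false;
	free(rx->backlog);
	rx->backlog = NULL;
	pthread_mutex_unlock(&rx->lock);
}
```

Once toy_discard_prealloc() has run, any later toy_charge_accept() call is rejected rather than dereferencing the freed backlog.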

Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Li Daming <d4n.for.sec@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
Link: https://patch.msgid.link/20260609140911.838677-6-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agoafs: Fix netns teardown to cancel the preallocation charger
David Howells [Tue, 9 Jun 2026 14:09:08 +0000 (15:09 +0100)] 
afs: Fix netns teardown to cancel the preallocation charger

Fix the teardown of an afs network namespace to make sure it cancels the
work item that keeps the preallocated rxrpc call/conn/peer queue charged
before incoming calls are disabled (i.e. listen 0).

Also, if net->live is false because the afs netns is being deleted, make
afs_charge_preallocation() skip charging and make afs_rx_new_call() avoid
requeuing the charger.

(This was found by AI review).

Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests")
Reported-by: Simon Horman <horms@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Li Daming <d4n.for.sec@gmail.com>
cc: Ren Wei <n05ec@lzu.edu.cn>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
Link: https://patch.msgid.link/20260609140911.838677-5-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agorxrpc: Fix UAF in rxgk_issue_challenge()
David Howells [Tue, 9 Jun 2026 14:09:07 +0000 (15:09 +0100)] 
rxrpc: Fix UAF in rxgk_issue_challenge()

Fix rxgk_issue_challenge() to free the page containing the challenge
content only after invoking the tracepoint, as the whdr passed to the
tracepoint points into that page.
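
The ordering rule is simply "last use before free". A minimal userspace sketch, assuming a hypothetical header type toy_hdr, a hypothetical toy_trace() that stands in for the tracepoint, and a malloc()ed buffer in place of the page:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical header living at the start of the buffer. */
struct toy_hdr {
	unsigned int serial;
};

/* Records what the "tracepoint" saw, so the effect can be checked. */
static unsigned int toy_traced_serial;

static void toy_trace(const struct toy_hdr *whdr)
{
	toy_traced_serial = whdr->serial;
}

/* Build a header inside a heap buffer, trace it, then free the buffer.
 * Returns 0 on success and -1 if the allocation fails.
 */
static int toy_issue_challenge(unsigned int serial)
{
	unsigned char *page = malloc(4096);
	struct toy_hdr *whdr;

	if (!page)
		return -1;

	whdr = (struct toy_hdr *)page;	/* whdr points into the buffer */
	whdr->serial = serial;

	/* ... the challenge would be transmitted here ... */

	toy_trace(whdr);	/* last use of whdr */
	free(page);		/* only now is it safe to release the buffer */
	return 0;
}
```

Swapping the last two calls would make toy_trace() read freed memory, which is the pattern the fix removes.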

Fixes: 9d1d2b59341f ("rxrpc: rxgk: Implement the yfs-rxgk security class (GSSAPI)")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
Link: https://patch.msgid.link/20260609140911.838677-4-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agorxrpc: Don't move a peeked OOB message onto the pending queue
Hyunwoo Kim [Tue, 9 Jun 2026 14:09:06 +0000 (15:09 +0100)] 
rxrpc: Don't move a peeked OOB message onto the pending queue

rxrpc_recvmsg_oob() takes a received oob message off recvmsg_oobq and,
if a response is needed, moves it onto the pending_oobq tree. However,
only the unlink from recvmsg_oobq is guarded by MSG_PEEK; the move onto
pending_oobq always runs.

As a result, reading a challenge with MSG_PEEK leaves the skb on
recvmsg_oobq while also adding it to pending_oobq. Since struct
sk_buff's rbnode shares storage with its next and prev pointers,
rb_insert_color() overwrites the list linkage, and the skb, which holds
a single reference, becomes reachable from both queues at once.
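
The storage sharing can be modelled with a plain union. The sketch below is an assumption-laden simplification: toy_list_links, toy_rb_links and toy_linkage are hypothetical types that only mimic the idea of list pointers and tree-node fields overlapping, and it assumes uintptr_t has the same size as a pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins: list linkage and tree linkage overlap. */
struct toy_list_links {
	void *next;
	void *prev;
};

struct toy_rb_links {
	uintptr_t parent_color;
	void *right;
	void *left;
};

union toy_linkage {
	struct toy_list_links list;
	struct toy_rb_links rb;
};

/* Link an element into a "list", then initialise it as a tree node, and
 * report whether the list pointers survived.
 */
static bool toy_tree_insert_clobbers_list(void)
{
	int neighbour = 0;
	union toy_linkage l;

	l.list.next = &neighbour;
	l.list.prev = &neighbour;

	/* What a tree insertion does to the node's own fields. */
	l.rb.parent_color = 0;
	l.rb.right = NULL;
	l.rb.left = NULL;

	return l.list.next != (void *)&neighbour ||
	       l.list.prev != (void *)&neighbour;
}
```

An element can therefore be on the list or in the tree, but not both at once.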

When the socket is closed both queues are drained in turn. While
draining recvmsg_oobq, __skb_unlink() follows the next and prev
pointers that rbnode has overwritten and writes to a bad address. Also,
as the skb holds a single reference but is freed from each queue, both
the skb and the connection reference it holds are released twice. This
leads to memory corruption and to a use-after-free caused by the
connection refcount underflow.

MSG_PEEK does not consume the message from the queue, so only unlink it
from recvmsg_oobq and then move it onto pending_oobq or free it when
the message is actually consumed.
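
A minimal model of the intended control flow, under the assumption that a message is in exactly one place at a time. The enum, struct toy_msg and toy_recv_oob() are hypothetical names and do not mirror the real rxrpc data structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Where the toy message currently lives. */
enum toy_where {
	TOY_ON_RECV_QUEUE,
	TOY_ON_PENDING_QUEUE,
	TOY_FREED,
};

struct toy_msg {
	enum toy_where where;
	bool needs_response;
};

/* Deliver a message to the caller. With peek set, the message is only
 * looked at and stays on the receive queue; otherwise it is unlinked
 * and then either parked for a response or freed.
 */
static void toy_recv_oob(struct toy_msg *m, bool peek)
{
	/* ... message contents would be copied to the caller here ... */

	if (peek)
		return;		/* not consumed: leave all linkage alone */

	m->where = m->needs_response ? TOY_ON_PENDING_QUEUE : TOY_FREED;
}
```

A peek followed by a normal read thus moves the message exactly once.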

Fixes: 5800b1cf3fd8 ("rxrpc: Allow CHALLENGEs to the passed to the app for a RESPONSE")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
Link: https://patch.msgid.link/20260609140911.838677-3-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agorxrpc: rxrpc_verify_data ensure rx_dec_buffer alloc
Jeffrey Altman [Tue, 9 Jun 2026 14:09:05 +0000 (15:09 +0100)] 
rxrpc: rxrpc_verify_data ensure rx_dec_buffer alloc

rxrpc_recvmsg_data() calls rxrpc_verify_data() whenever
rxrpc_call.rx_dec_buffer is unallocated and assumes that, upon
successful return, rx_dec_buffer has been allocated.
However, rxrpc_verify_data() does not request an allocation if
rxrpc_skb_priv.len is zero.

In addition, failure to allocate rx_dec_buffer will result in a
call to skb_copy_bits() with a NULL destination, which can
trigger a NULL pointer dereference.

To prevent these issues rxrpc_verify_data() is modified to
always attempt to allocate the rxrpc_call.rx_dec_buffer if it
is NULL.
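
The "allocate whenever the buffer is missing, regardless of length" rule can be sketched as follows. This is a userspace illustration only: struct toy_call, toy_verify_data() and the fixed TOY_DEC_BUFFER_SIZE are hypothetical, and the real buffer sizing is not described in this message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fixed size; the real sizing policy is an unknown here. */
#define TOY_DEC_BUFFER_SIZE 2048

struct toy_call {
	unsigned char *dec_buffer;	/* NULL until first needed */
};

/* Ensure the decode buffer exists before doing anything else, even for
 * a zero-length payload, and report allocation failure to the caller
 * instead of continuing with a NULL destination.
 */
static int toy_verify_data(struct toy_call *call, size_t len)
{
	if (!call->dec_buffer) {
		call->dec_buffer = malloc(TOY_DEC_BUFFER_SIZE);
		if (!call->dec_buffer)
			return -ENOMEM;
	}

	(void)len;	/* verification of `len` bytes would go here */
	return 0;
}
```

A caller that sees a zero return can then rely on dec_buffer being non-NULL.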

This issue was identified with the assistance of a private
sashiko instance.

Fixes: d2bc90cf6c75cb ("rxrpc: Fix DATA decrypt vs splice() by copying data to buffer in recvmsg")
Reported-by: Simon Horman <simon.horman@redhat.com>
Signed-off-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jiayuan Chen <jiayuan.chen@linux.dev>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
Link: https://patch.msgid.link/20260609140911.838677-2-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agoMerge branch 'net-remove-tls_toe'
Jakub Kicinski [Fri, 12 Jun 2026 23:43:14 +0000 (16:43 -0700)] 
Merge branch 'net-remove-tls_toe'

Sabrina Dubroca says:

====================
net: remove tls_toe

This series removes the tls_toe feature, its single user (chtls), and
cleans up the EXPORT_SYMBOL()s that no other module requires.

Driver changes only compile-tested.
====================

Link: https://patch.msgid.link/cover.1781165969.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agonet: remove some unused EXPORT_SYMBOL()s
Sabrina Dubroca [Thu, 11 Jun 2026 10:21:34 +0000 (12:21 +0200)] 
net: remove some unused EXPORT_SYMBOL()s

chtls was using a lot of symbols that no other module requires. Remove
those EXPORT_SYMBOL()s.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/d124db74f6f0838b652f0ee4b4530964f3cf8d49.1781165969.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 days agotls: remove tls_toe and the related driver
Sabrina Dubroca [Thu, 11 Jun 2026 10:21:33 +0000 (12:21 +0200)] 
tls: remove tls_toe and the related driver

The tls_toe feature and its single user (chelsio chtls) have been
unmaintained for multiple years. It also hooks into the core of the
TCP implementation, and bypasses most of the networking stack.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/1f30e73275c07bf879f547589872d0916025a52e.1781165969.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>