git.ipfire.org Git - thirdparty/linux.git/log
5 weeks ago net: ethernet: ti: am65-cpsw: Use also port number to identify timestamps
Sebastian Andrzej Siewior [Fri, 6 Mar 2026 14:44:39 +0000 (15:44 +0100)] 
net: ethernet: ti: am65-cpsw: Use also port number to identify timestamps

The driver uses packet-type (RX/TX), PTP-message type and PTP-sequence
number to identify a matching timestamp packet for a skb. If the same
PTP packet arrives on both ports (as in a PRP environment) then it is
not obvious which event belongs to which skb.

The event also contains the port number on which it was received.
Instead of masking it out, use it for matching.
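
As a rough illustration of the matching rule described above (not the
driver's actual code), the sketch below compares a timestamp event with
the identity of a queued skb. The names ts_key and ts_key_match() are
hypothetical; only the four compared fields come from this commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical key; field names are illustrative, not the driver's. */
struct ts_key {
	bool     is_tx;     /* packet type: RX or TX */
	uint8_t  msg_type;  /* PTP message type */
	uint16_t seq_id;    /* PTP sequence number */
	uint8_t  port;      /* port the event was seen on */
};

/*
 * Match on all four fields. Without `port`, the same PTP packet seen on
 * two ports (as in a PRP setup) would produce two identical keys.
 */
static bool ts_key_match(const struct ts_key *ev, const struct ts_key *skb)
{
	return ev->is_tx == skb->is_tx &&
	       ev->msg_type == skb->msg_type &&
	       ev->seq_id == skb->seq_id &&
	       ev->port == skb->port;
}
```

Two events that differ only in `port` no longer match the same skb.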

Tested-by: Chintan Vankar <c-vankar@ti.com>
Reviewed-by: Martin Kaistra <martin.kaistra@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20260306144439.cVwaaopR@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago net/sched: do not reset queues in graft operations
Eric Dumazet [Sat, 7 Mar 2026 16:34:30 +0000 (16:34 +0000)] 
net/sched: do not reset queues in graft operations

The following typical script is extremely disruptive,
because each graft operation calls dev_deactivate()
which resets all the queues of the device.

QPARAM="limit 100000 flow_limit 1000 buckets 4096"
TXQS=64
for ETH in eth1
do
 tc qd del dev $ETH root 2>/dev/null
 tc qd add dev $ETH root handle 1: mq
 for i in `seq 1 $TXQS`
 do
   slot=$( printf %x $(( i )) )
   tc qd add dev $ETH parent 1:$slot fq $QPARAM
 done
done

One can add "ip link set dev $ETH down/up" to reduce the disruption time:

QPARAM="limit 100000 flow_limit 1000 buckets 4096"
TXQS=64
for ETH in eth1
do
 ip link set dev $ETH down
 tc qd del dev $ETH root 2>/dev/null
 tc qd add dev $ETH root handle 1: mq
 for i in `seq 1 $TXQS`
 do
   slot=$( printf %x $(( i )) )
   tc qd add dev $ETH parent 1:$slot fq $QPARAM
 done
 ip link set dev $ETH up
done

Or we can add a @reset_needed flag to dev_deactivate() and
dev_deactivate_many().

This flag is set to true at device dismantle or linkwatch_do_dev(),
and to false for graft operations.

In the future, we might only stop one queue instead of the whole
device, i.e. call dev_deactivate_queue() instead of dev_deactivate().

I think the problem (quadratic behavior) was added in commit
2fb541c862c9 ("net: sch_generic: aviod concurrent reset and enqueue op
for lockless qdisc") but this does not look serious enough to deserve
risky backports.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260307163430.470644-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago tcp: avoid dst->ops->check() call in tcp_v{4,6}_do_rcv()
Eric Dumazet [Fri, 6 Mar 2026 15:43:22 +0000 (15:43 +0000)] 
tcp: avoid dst->ops->check() call in tcp_v{4,6}_do_rcv()

If incoming skb dst matches the socket cached one,
there is no need to call again dst->ops->check().

Network layer already validated the skb dst for us,
usually from tcp_v{4,6}_early_demux().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260306154322.1086539-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago net: rocker: kzalloc + kcalloc to kzalloc_flex
Rosen Penev [Fri, 6 Mar 2026 02:54:49 +0000 (18:54 -0800)] 
net: rocker: kzalloc + kcalloc to kzalloc_flex

Combining the allocations simplifies things, especially the free path.

Remove ofdpa_group_tbl_entry_free as a result. kfree is shorter.

Add __counted_by for extra runtime analysis.
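
A minimal userspace sketch of the single-allocation pattern, assuming a
hypothetical struct group_entry with a trailing flexible array. It uses
calloc() as a stand-in; the kernel helper and the __counted_by annotation
are not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical entry with a trailing flexible array. */
struct group_entry {
	uint32_t id;
	size_t   count;   /* number of elements in ids[] */
	uint32_t ids[];   /* flexible array member, sized at allocation */
};

/* One zeroed allocation covers header and array, so one free() releases both. */
static struct group_entry *group_entry_alloc(uint32_t id, size_t count)
{
	struct group_entry *e;

	if (count > (SIZE_MAX - sizeof(*e)) / sizeof(e->ids[0]))
		return NULL;	/* size computation would overflow */

	e = calloc(1, sizeof(*e) + count * sizeof(e->ids[0]));
	if (!e)
		return NULL;

	e->id = id;
	e->count = count;
	return e;
}
```

Because the array lives in the same allocation as its header, the free
path reduces to a single free(e).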

Signed-off-by: Rosen Penev <rosenp@gmail.com>
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20260306025449.12333-1-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago tcp: move tcp_v4_early_demux() to net/ipv4/ip_input.c
Eric Dumazet [Fri, 6 Mar 2026 13:11:30 +0000 (13:11 +0000)] 
tcp: move tcp_v4_early_demux() to net/ipv4/ip_input.c

tcp_v4_early_demux() has a single caller: ip_rcv_finish_core().

Move it to net/ipv4/ip_input.c and mark it static, for possible
compiler/linker optimizations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260306131130.654991-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago tcp: remove unused hash_size from struct tcp_out_options
Keita Morisaki [Sat, 7 Mar 2026 05:16:19 +0000 (14:16 +0900)] 
tcp: remove unused hash_size from struct tcp_out_options

hash_size is declared but never read. The MD5 path always uses a
fixed size of 16, and the TCP-AO path uses tcp_ao_maclen().

This closes a 7-byte hole and reduces the struct size from 96 to
88 bytes.
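
To show how a single-byte member can cost more than one byte, here is a
sketch with two hypothetical layouts. These are not the real
tcp_out_options fields, and the exact sizes depend on the ABI; the 96 and
88 byte figures above are from the commit, not from this toy struct.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "before" layout: a 1-byte member ahead of an 8-byte one. */
struct opts_before {
	uint64_t head;
	uint8_t  hash_size;   /* unused member */
	uint64_t tail;        /* alignment of this member creates the hole */
};

/* Same layout with the unused member removed. */
struct opts_after {
	uint64_t head;
	uint64_t tail;
};

/* Padding bytes between hash_size and the next member. */
static size_t hole_after_hash_size(void)
{
	return offsetof(struct opts_before, tail) -
	       (offsetof(struct opts_before, hash_size) + sizeof(uint8_t));
}
```

Removing the member saves its own byte plus the padding that followed it.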

Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Keita Morisaki <kmta1236@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260307051619.51685-1-kmta1236@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago net: Add SPDX ids to some source files
Tim Bird [Thu, 5 Mar 2026 00:47:22 +0000 (17:47 -0700)] 
net: Add SPDX ids to some source files

Add SPDX-License-Identifier lines to several source
files under the network sub-directory.  Work on files
in the core, dns_resolver, ipv4, ipv6 and
netfilter sub-dirs.  Remove boilerplate
and license reference text to avoid ambiguity.

Rusty Russell has expressed that his contributions
were intended to be GPL-2.0-or-later.

Signed-off-by: Tim Bird <tim.bird@sony.com>
Link: https://patch.msgid.link/20260305004724.87469-1-tim.bird@sony.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago Merge branch 'tools-ynl-convert-samples-into-selftests'
Jakub Kicinski [Tue, 10 Mar 2026 00:02:30 +0000 (17:02 -0700)] 
Merge branch 'tools-ynl-convert-samples-into-selftests'

Jakub Kicinski says:

====================
tools: ynl: convert samples into selftests

The "samples" were always poor man's tests, used to manually
confirm that C YNL works as expected. Since a proper tests/
directory now exists, move the samples and use the kselftest
harness to turn them into selftests outputting KTAP.
====================

Link: https://patch.msgid.link/20260307033630.1396085-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago tools: ynl: convert rt-route sample to selftest
Jakub Kicinski [Sat, 7 Mar 2026 03:36:30 +0000 (19:36 -0800)] 
tools: ynl: convert rt-route sample to selftest

Convert rt-route.c to use kselftest_harness.h with FIXTURE/TEST_F.
This is the last test to convert so clean up the Makefile.

Validate that the connected routes for 192.168.1.0/24 and
2001:db8::/64 appear in the dump.

Output:

  TAP version 13
  1..1
  # Starting 1 tests from 1 test cases.
  #  RUN           rt_route.dump ...
  # oif: nsim0            dst: 192.168.1.0/24
  # oif: lo               dst: ::1/128
  # oif: nsim0            dst: 2001:db8::1/128
  # oif: nsim0            dst: 2001:db8::/64
  # oif: nsim0            dst: fe80::/64
  # oif: nsim0            dst: ff00::/8
  #            OK  rt_route.dump
  ok 1 rt_route.dump
  # PASSED: 1 / 1 tests passed.
  # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Tested-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260307033630.1396085-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago tools: ynl: convert rt-addr sample to selftest
Jakub Kicinski [Sat, 7 Mar 2026 03:36:29 +0000 (19:36 -0800)] 
tools: ynl: convert rt-addr sample to selftest

Convert rt-addr.c to use kselftest_harness.h with FIXTURE/TEST_F.

Validate that the addresses configured by the wrapper (192.168.1.1
and 2001:db8::1) appear in the dump.

Output:

  TAP version 13
  1..1
  # Starting 1 tests from 1 test cases.
  #  RUN           rt_addr.dump ...
  #               lo: 127.0.0.1
  #            nsim0: 192.168.1.1
  #               lo: ::1
  #            nsim0: 2001:db8::1
  #            nsim0: fe80::7c66:c9ff:fe5f:bf01
  #            OK  rt_addr.dump
  ok 1 rt_addr.dump
  # PASSED: 1 / 1 tests passed.
  # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Tested-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260307033630.1396085-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago tools: ynl: convert ethtool sample to selftest
Jakub Kicinski [Sat, 7 Mar 2026 03:36:28 +0000 (19:36 -0800)] 
tools: ynl: convert ethtool sample to selftest

Convert ethtool.c to use kselftest_harness.h with FIXTURE/TEST_F.
Move ethtool from BINS to TEST_GEN_FILES and add ethtool.sh wrapper
which sets up a netdevsim device before running the test binary.

Output:

  TAP version 13
  1..2
  # Starting 2 tests from 1 test cases.
  #  RUN           ethtool.channels ...
  #    nsim0: combined 1
  #            OK  ethtool.channels
  ok 1 ethtool.channels
  #  RUN           ethtool.rings ...
  #    nsim0: rx 512 tx 512
  #            OK  ethtool.rings
  ok 2 ethtool.rings
  # PASSED: 2 / 2 tests passed.
  # Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Tested-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260307033630.1396085-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago tools: ynl: convert devlink sample to selftest
Jakub Kicinski [Sat, 7 Mar 2026 03:36:27 +0000 (19:36 -0800)] 
tools: ynl: convert devlink sample to selftest

Convert devlink.c to use kselftest_harness.h with FIXTURE/TEST_F.
Move devlink from BINS to TEST_GEN_FILES in the Makefile since
it's invoked via the devlink.sh wrapper which sets up netdevsim.

Output:

  TAP version 13
  1..2
  # Starting 2 tests from 1 test cases.
  #  RUN           devlink.dump ...
  # netdevsim/netdevsim1337
  #            OK  devlink.dump
  ok 1 devlink.dump
  #  RUN           devlink.info ...
  # netdevsim/netdevsim1337:
  #   driver: netdevsim
  #   running fw:
  #     fw.mgmt: 10.20.30
  #            OK  devlink.info
  ok 2 devlink.info
  # PASSED: 2 / 2 tests passed.
  # Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Tested-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260307033630.1396085-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago tools: ynl: add netdevsim wrapper library for YNL tests
Jakub Kicinski [Sat, 7 Mar 2026 03:36:26 +0000 (19:36 -0800)] 
tools: ynl: add netdevsim wrapper library for YNL tests

Some tests need netdevsim setup which is painful to do from C.

Add ynl_nsim_lib.sh, a shared library providing nsim_setup and
nsim_cleanup functions for tests that need a netdevsim device.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Tested-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260307033630.1396085-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago tools: ynl: convert tc and tc-filter-add samples to selftest
Jakub Kicinski [Sat, 7 Mar 2026 03:36:25 +0000 (19:36 -0800)] 
tools: ynl: convert tc and tc-filter-add samples to selftest

Convert tc.c and tc-filter-add.c to produce KTAP output with
kselftest_harness. Merge the two tests together. They both
test TC (one tests qdiscs, the other classifiers), so
they can easily live in a single selftest.

Make the test spawn a new netns, and run the operations on
lo to avoid onerous setup and cleanup.

  TAP version 13
  1..2
  # Starting 2 tests from 1 test cases.
  #  RUN           tc.qdisc ...
  #               lo: fq_codel  limit: 10240p target: 5ms new_flow_cnt: 0
  #            OK  tc.qdisc
  ok 1 tc.qdisc
  #  RUN           tc.flower ...
  # flower pref 1 proto: 0x8100
  # flower:
  #   vlan_id: 100
  #   vlan_prio: 5
  #   num_of_vlans: 3
  # action order: 1 vlan push id 200 protocol 0x8100 priority 0
  # action order: 2 vlan push id 300 protocol 0x8100 priority 0
  #            OK  tc.flower
  ok 2 tc.flower
  # PASSED: 2 / 2 tests passed.
  # Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Tested-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260307033630.1396085-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago tools: ynl: convert rt-link sample to selftest
Jakub Kicinski [Sat, 7 Mar 2026 03:36:24 +0000 (19:36 -0800)] 
tools: ynl: convert rt-link sample to selftest

Convert rt-link.c to use kselftest_harness.h with FIXTURE/TEST_F.
Move rt-link from BINS to TEST_GEN_PROGS.

Output:

  TAP version 13
  1..3
  # Starting 3 tests from 1 test cases.
  #  RUN           rt_link.dump ...
  #   1:          lo: mtu 65536
  #   2:          sit0: mtu  1480  kind sit
  #            OK  rt_link.dump
  ok 1 rt_link.dump
  #  RUN           rt_link.netkit ...
  #   4:          nk1: mtu  1500  kind netkit    primary 1  policy blackhole
  #            OK  rt_link.netkit
  ok 2 rt_link.netkit
  #  RUN           rt_link.netkit_err_msg ...
  #            OK  rt_link.netkit_err_msg
  ok 3 rt_link.netkit_err_msg
  # PASSED: 3 / 3 tests passed.
  # Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Tested-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260307033630.1396085-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago tools: ynl: convert ovs sample to selftest
Jakub Kicinski [Sat, 7 Mar 2026 03:36:23 +0000 (19:36 -0800)] 
tools: ynl: convert ovs sample to selftest

Convert ovs.c to produce KTAP output with kselftest_harness.
The single "crud" test creates a new OVS datapath, fetches it back
by name, then dumps all datapaths verifying the new one appears.

IIRC I added this test because ovs is a genetlink family but
has a family-specific fixed header.

  TAP version 13
  1..1
  # Starting 1 tests from 1 test cases.
  #  RUN           ovs.crud ...
  # get:
  # ynl-test(3): pid:0 cache:256
  # dump:
  # ynl-test(3): pid:0 cache:256
  #            OK  ovs.crud
  ok 1 ovs.crud
  # PASSED: 1 / 1 tests passed.
  # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Tested-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260307033630.1396085-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago tools: ynl: convert netdev sample to selftest
Jakub Kicinski [Sat, 7 Mar 2026 03:36:22 +0000 (19:36 -0800)] 
tools: ynl: convert netdev sample to selftest

Convert netdev.c to produce KTAP output with 3 tests:
- dev_dump: dump all netdev devices, skip if empty
- dev_get: query first device from dump by ifindex
- ntf_check: subscribe to "mgmt", create a veth via rt-link,
  verify netdev notification is received, then delete the veth

Remove stdin/scanf-based UI. Add rt-link dependency for the veth
notification test.

  TAP version 13
  1..3
  # Starting 3 tests from 1 test cases.
  #  RUN           netdev.dump ...
  #       lo[1] xdp-features (0): xdp-rx-metadata-features (0): xsk-fea...
  #     sit0[2] xdp-features (0): xdp-rx-metadata-features (0): xsk-fea...
  #            OK  netdev.dump
  ok 1 netdev.dump
  #  RUN           netdev.get ...
  #       lo[1] xdp-features (0): xdp-rx-metadata-features (0): xsk-fea...
  #            OK  netdev.get
  ok 2 netdev.get
  #  RUN           netdev.ntf_check ...
  #    veth0[7] xdp-features (0): xdp-rx-metadata-features (7): timesta...
  #            OK  netdev.ntf_check
  ok 3 netdev.ntf_check
  # PASSED: 3 / 3 tests passed.
  # Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Tested-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260307033630.1396085-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago tools: ynl: move samples to tests
Jakub Kicinski [Sat, 7 Mar 2026 03:36:21 +0000 (19:36 -0800)] 
tools: ynl: move samples to tests

The "samples" were always poor man's tests (used to manually
confirm that C YNL works).

Move all C sample programs from tools/net/ynl/samples/ to
tools/net/ynl/tests/, "merge" the Makefiles. The subsequent
changes will convert each sample into a proper KTAP selftest.

Since these are now tests rather than samples - default to
enabling asan. After all we're testing user space code here.

Sort the gitignore while at it; the page-pool entry was a leftover,
so delete it.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Tested-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260307033630.1396085-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago selftests: net: make ovs-dpctl.py fail when pyroute2 is unsupported
Aleksei Oladko [Fri, 6 Mar 2026 00:01:23 +0000 (00:01 +0000)] 
selftests: net: make ovs-dpctl.py fail when pyroute2 is unsupported

The pmtu.sh kselftest configures OVS using ovs-dpctl.py and falls back
to ovs-vsctl only when ovs-dpctl.py fails. However, ovs-dpctl.py exits
with a success status when the installed pyroute2 package version is
lower than 0.6, even though the OVS datapath is not configured.

As a result, pmtu.sh assumes that the setup was successful and
continues running the test, which later fails due to the missing
OVS configuration.

Fix the exit code handling in ovs-dpctl.py so that pmtu.sh can detect
that the setup did not complete successfully and fall back to
ovs-vsctl.

Signed-off-by: Aleksei Oladko <aleksey.oladko@virtuozzo.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20260306000127.519064-3-aleksey.oladko@virtuozzo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago net: usb: lan78xx: drop redundant device reference
Johan Hovold [Thu, 5 Mar 2026 10:50:06 +0000 (11:50 +0100)] 
net: usb: lan78xx: drop redundant device reference

Driver core holds a reference to the USB interface and its parent USB
device while the interface is bound to a driver and there is no need to
take additional references unless the structures are needed after
disconnect.

Drop the redundant device reference to reduce cargo culting, make it
easier to spot drivers where an extra reference is needed, and reduce
the risk of memory leaks when drivers fail to release it.

Signed-off-by: Johan Hovold <johan@kernel.org>
Link: https://patch.msgid.link/20260305105006.16415-1-johan@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago Merge branch 'net-ntb_netdev-add-multi-queue-support'
Jakub Kicinski [Sat, 7 Mar 2026 03:15:24 +0000 (19:15 -0800)] 
Merge branch 'net-ntb_netdev-add-multi-queue-support'

Koichiro Den says:

====================
net: ntb_netdev: Add Multi-queue support

ntb_netdev currently hard-codes a single NTB transport queue pair, which
means the datapath effectively runs as a single-queue netdev regardless
of available CPUs / parallel flows.

The longer-term motivation here is throughput scale-out: allow
ntb_netdev to grow beyond the single-QP bottleneck and make it possible
to spread TX/RX work across multiple queue pairs as link speeds and core
counts keep increasing.

Multi-queue also unlocks the standard networking knobs on top of it. In
particular, once the device exposes multiple TX queues, qdisc/tc can
steer flows/traffic classes into different queues (via
skb->queue_mapping), enabling per-flow/per-class scheduling and QoS in a
familiar way.

Usage
=====

  1. Ensure the NTB device you want to use has multiple Memory Windows.
  2. modprobe ntb_transport on both sides, if it's not built-in.
  3. modprobe ntb_netdev on both sides, if it's not built-in.
  4. Use ethtool -L to configure the desired number of queues.
     The default number of real (combined) queues is 1.

     e.g. ethtool -L eth0 combined 2 # to increase
          ethtool -L eth0 combined 1 # to reduce back to 1

  Note:
    * If the NTB device has only a single Memory Window, ethtool -L eth0
      combined N (N > 1) fails with:
      "netlink error: No space left on device".
    * ethtool -L can be executed while the net_device is up.

Compatibility
=============

  The default remains a single queue, so behavior is unchanged unless
  the user explicitly increases the number of queues.

Kernel base
===========

  ntb-next (latest as of 2026-03-06):
  commit 7b3302c687ca ("ntb_hw_amd: Fix incorrect debug message in link
                        disable path")

Testing / Results
=================

  Environment / command line:
    - 2x R-Car S4 Spider boards
      "Kernel base" (see above) + this series

  TCP:
    [RC] $ sudo iperf3 -s
    [EP] $ sudo iperf3 -Z -c ${SERVER_IP} -l 65480 -w 512M -P 4
  UDP:
    [RC] $ sudo iperf3 -s
    [EP] $ sudo iperf3 -ub0 -c ${SERVER_IP} -l 65480 -w 512M -P 4

  Without this series:
      TCP / UDP : 589 Mbps / 580 Mbps

  With this series (default single queue):
      TCP / UDP : 583 Mbps / 583 Mbps

  With this series + `ethtool -L eth0 combined 2`:
      TCP / UDP : 576 Mbps / 584 Mbps

  With this series + `ethtool -L eth0 combined 2` + [1], where flows are
  properly distributed across queues:
      TCP / UDP : 1.13 Gbps / 1.16 Gbps (re-measured with v3)

  The 575~590 Mbps variation is run-to-run variance, i.e. no measurable
  regression or improvement is observed with a single queue. The key
  point is scaling from ~600 Mbps to ~1.20 Gbps once flows are
  distributed across multiple queues.

  Note: On R-Car S4 Spider, only BAR2 is usable for ntb_transport MW.
  For testing, BAR2 was expanded from 1 MiB to 2 MiB and split into two
  Memory Windows. A follow-up series is planned to add split BAR support
  for vNTB. On platforms where multiple BARs can be used for the
  datapath, this series should allow >=2 queues without additional
  changes.

  [1] [PATCH v2 00/10] NTB: epf: Enable per-doorbell bit handling while keeping legacy offset
      https://lore.kernel.org/linux-pci/20260227084955.3184017-1-den@valinux.co.jp/
      (subject was accidentally incorrect in the original posting)
====================

Link: https://patch.msgid.link/20260305155639.1885517-1-den@valinux.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago net: ntb_netdev: Support ethtool channels for multi-queue
Koichiro Den [Thu, 5 Mar 2026 15:56:39 +0000 (00:56 +0900)] 
net: ntb_netdev: Support ethtool channels for multi-queue

Support dynamic queue pair addition/removal via ethtool channels.
Use the combined channel count to control the number of netdev TX/RX
queues, each corresponding to a ntb_transport queue pair.

When the number of queues is reduced, tear down and free the removed
ntb_transport queue pairs (not just deactivate them) so other
ntb_transport clients can reuse the freed resources.

When the number of queues is increased, create additional queue pairs up
to NTB_NETDEV_MAX_QUEUES (=64). The effective limit is determined by the
underlying ntb_transport implementation and NTB hardware resources (the
number of MWs), so set_channels may return -ENOSPC if no more QPs can be
allocated.
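
The resize rules above can be modelled roughly as follows. This is a
simplified sketch, not the driver's set_channels implementation: struct
qp_model, its avail_qps field and the all-or-nothing growth are
assumptions made for illustration, as is the -EINVAL return for
out-of-range requests.

```c
#include <errno.h>

#define MAX_QUEUES 64	/* mirrors NTB_NETDEV_MAX_QUEUES (=64) from the text */

/*
 * Hypothetical model: `avail_qps` stands in for what the transport and
 * hardware (number of MWs) can back; it is not a real ntb_transport field.
 */
struct qp_model {
	unsigned int cur;        /* queue pairs currently instantiated */
	unsigned int avail_qps;  /* queue pairs the transport can back */
};

static int model_set_channels(struct qp_model *m, unsigned int combined)
{
	if (combined == 0 || combined > MAX_QUEUES)
		return -EINVAL;

	/* Shrinking releases the removed queue pairs for other clients. */
	if (combined <= m->cur) {
		m->cur = combined;
		return 0;
	}

	/* Growing is bounded by what the transport can still provide. */
	if (combined > m->avail_qps)
		return -ENOSPC;

	m->cur = combined;
	return 0;
}
```

With avail_qps set to 1, any request above one queue fails with -ENOSPC
while the existing queue stays in place.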

Keep the default at one queue pair to preserve the previous behavior.

Signed-off-by: Koichiro Den <den@valinux.co.jp>
Link: https://patch.msgid.link/20260305155639.1885517-5-den@valinux.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago net: ntb_netdev: Factor out multi-queue helpers
Koichiro Den [Thu, 5 Mar 2026 15:56:38 +0000 (00:56 +0900)] 
net: ntb_netdev: Factor out multi-queue helpers

Implementing .set_channels will otherwise duplicate the same multi-queue
operations at multiple call sites. Factor out the following helpers:

  - ntb_netdev_update_carrier(): carrier is switched on when at least
                                 one QP link is up
  - ntb_netdev_queue_rx_drain(): drain and free all queued RX packets
                                 for one QP
  - ntb_netdev_queue_rx_fill():  prefill RX ring for one QP

No functional change.
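
The carrier rule listed for ntb_netdev_update_carrier() can be sketched
in plain C as below. The function name carrier_should_be_on() and the
bool-array representation of per-QP link state are assumptions for
illustration, not the driver's data structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Carrier is on when at least one queue pair reports link up. */
static bool carrier_should_be_on(const bool *qp_link_up, size_t num_qps)
{
	for (size_t i = 0; i < num_qps; i++) {
		if (qp_link_up[i])
			return true;
	}
	return false;
}
```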

Signed-off-by: Koichiro Den <den@valinux.co.jp>
Link: https://patch.msgid.link/20260305155639.1885517-4-den@valinux.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago net: ntb_netdev: Gate subqueue stop/wake by transport link
Koichiro Den [Thu, 5 Mar 2026 15:56:37 +0000 (00:56 +0900)] 
net: ntb_netdev: Gate subqueue stop/wake by transport link

When ntb_netdev is extended to multiple ntb_transport queue pairs, the
netdev carrier can be up as long as at least one QP link is up. In that
setup, a given QP may be link-down while the carrier remains on.

Make the link event handler start/stop the corresponding netdev TX
subqueue and drive carrier state based on whether any QP link is up.
Also guard subqueue wake/start points in the TX completion and timer
paths so a subqueue is not restarted while its QP link is down.

Stop all queues in ndo_open() and let the link event handler wake each
subqueue once ntb_transport link negotiation succeeds.

Signed-off-by: Koichiro Den <den@valinux.co.jp>
Link: https://patch.msgid.link/20260305155639.1885517-3-den@valinux.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago net: ntb_netdev: Introduce per-queue context
Koichiro Den [Thu, 5 Mar 2026 15:56:36 +0000 (00:56 +0900)] 
net: ntb_netdev: Introduce per-queue context

Prepare ntb_netdev for multi-queue operation by moving queue-pair state
out of struct ntb_netdev.

Introduce struct ntb_netdev_queue to carry the ntb_transport_qp pointer,
the per-QP TX timer and queue id. Pass this object as the callback
context and convert the RX/TX handlers and link event path accordingly.

The probe path allocates a fixed upper bound for netdev queues while
instantiating only a single ntb_transport queue pair, preserving the
previous behavior. Also store client_dev for future queue pair
creation/removal via the ntb_transport API.

Signed-off-by: Koichiro Den <den@valinux.co.jp>
Link: https://patch.msgid.link/20260305155639.1885517-2-den@valinux.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago net: spacemit: Remove unused buff_addr fields
Vivian Wang [Thu, 5 Mar 2026 07:00:29 +0000 (15:00 +0800)] 
net: spacemit: Remove unused buff_addr fields

These were never used. Just remove them.

No functional change intended.

Signed-off-by: Vivian Wang <wangruikang@iscas.ac.cn>
Link: https://patch.msgid.link/20260305-k1-ethernet-cleanup-buff_addr-v1-1-e978ef119231@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago Merge branch 'nfc-drop-redundant-usb-device-references'
Jakub Kicinski [Sat, 7 Mar 2026 02:57:46 +0000 (18:57 -0800)] 
Merge branch 'nfc-drop-redundant-usb-device-references'

Johan Hovold says:

====================
nfc: drop redundant USB device references

Driver core holds a reference to the USB interface and its parent USB
device while the interface is bound to a driver and there is no need to
take additional references unless the structures are needed after
disconnect.

Drop redundant device references to reduce cargo culting, make it easier
to spot drivers where an extra reference is needed, and reduce the risk
of memory leaks when drivers fail to release them.
====================

Link: https://patch.msgid.link/20260305111019.18030-1-johan@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago nfc: port100: drop redundant device reference
Johan Hovold [Thu, 5 Mar 2026 11:10:19 +0000 (12:10 +0100)] 
nfc: port100: drop redundant device reference

Driver core holds a reference to the USB interface and its parent USB
device while the interface is bound to a driver and there is no need to
take additional references unless the structures are needed after
disconnect.

Drop the redundant device reference to reduce cargo culting, make it
easier to spot drivers where an extra reference is needed, and reduce
the risk of memory leaks when drivers fail to release it.

Signed-off-by: Johan Hovold <johan@kernel.org>
Link: https://patch.msgid.link/20260305111019.18030-3-johan@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago nfc: pn533: drop redundant device reference
Johan Hovold [Thu, 5 Mar 2026 11:10:18 +0000 (12:10 +0100)] 
nfc: pn533: drop redundant device reference

Driver core holds a reference to the USB interface and its parent USB
device while the interface is bound to a driver and there is no need to
take additional references unless the structures are needed after
disconnect.

Drop the redundant device reference to reduce cargo culting, make it
easier to spot drivers where an extra reference is needed, and reduce
the risk of memory leaks when drivers fail to release it.

Signed-off-by: Johan Hovold <johan@kernel.org>
Link: https://patch.msgid.link/20260305111019.18030-2-johan@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago virtio-net: xsk: Support wakeup on RX side
Bui Quang Minh [Wed, 4 Mar 2026 15:43:17 +0000 (22:43 +0700)] 
virtio-net: xsk: Support wakeup on RX side

When XDP_USE_NEED_WAKEUP is used and the fill ring is empty so no buffer
is allocated on RX side, allow RX NAPI to be descheduled. This avoids
wasting CPU cycles on polling. Users will be notified and they need to
make a wakeup call after refilling the ring.

Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://patch.msgid.link/20260304154317.7506-1-minhquangbui99@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago selftests: net: forwarding: fix IPv6 address leak in cleanup
Aleksei Oladko [Thu, 5 Mar 2026 21:10:00 +0000 (21:10 +0000)] 
selftests: net: forwarding: fix IPv6 address leak in cleanup

Several forwarding tests (e.g., gre_multipath.sh) initialize both IPv4
and IPv6 addresses using simple_if_init, but only clean up IPv4
in simple_if_fini. This leaves stale IPv6 addresses on the interfaces,
which causes subsequent tests to fail when they encounter unexpected
address configuration.

The issue can be reproduced by running tests in sequence:
  # run_kselftest.sh -t net/forwarding:ipip_hier_gre.sh
  # run_kselftest.sh -t net/forwarding:min_max_mtu.sh
  TAP version 13
  1..1
  # timeout set to 0
  # selftests: net/forwarding: min_max_mtu.sh
  # TEST: ping                                                          [ OK ]
  # TEST: ping6                                                         [ OK ]
  # TEST: Test maximum MTU configuration                                [ OK ]
  # TEST: Test traffic, packet size is maximum MTU                      [FAIL]
  #       Ping6, packet size: 65487 succeeded, but should have failed
  # TEST: Test minimum MTU configuration                                [ OK ]
  # TEST: Test traffic, packet size is minimum MTU                      [ OK ]
  not ok 1 selftests: net/forwarding: min_max_mtu.sh # exit=1

Fix this by removing the unused IPv6 argument from simple_if_init in
tests that don't use IPv6 (gre_multipath.sh, ipip_lib.sh), and by
adding the missing IPv6 argument to simple_if_fini in tests that
use IPv6 (gre_multipath_nh.sh, gre_multipath_nh_res.sh).

Signed-off-by: Aleksei Oladko <aleksey.oladko@virtuozzo.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260305211000.515301-1-aleksey.oladko@virtuozzo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks ago net: annotate data races around sk->sk_prot
Jiayuan Chen [Wed, 4 Mar 2026 06:42:52 +0000 (14:42 +0800)] 
net: annotate data races around sk->sk_prot

inet_sendmsg() and inet_recvmsg() access sk->sk_prot without
lock_sock() or any other synchronization.

sock_replace_proto() (used by sockmap), TLS and MPTCP can change
sk->sk_prot under us, so these functions need READ_ONCE() to avoid
load tearing.

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260304064253.16955-1-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: phy: remove phy_attach
Heiner Kallweit [Wed, 4 Mar 2026 20:17:28 +0000 (21:17 +0100)] 
net: phy: remove phy_attach

378e6523ebb1 ("net: bcmgenet: remove unused platform code") removed
the last user of phy_attach(). So remove this function.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/8812176a-e319-4e9f-815d-99ea339df8b2@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agoinet_diag: report delayed ack timer information
Eric Dumazet [Thu, 5 Mar 2026 11:48:29 +0000 (11:48 +0000)] 
inet_diag: report delayed ack timer information

inet_sk_diag_fill() populates r->idiag_timer with the following
precedence order:

1 - Retransmit timer.
4 - Probe0 timer.
2 - Keepalive timer.

This patch adds a new value, last in the list, if other timers
are not active.

5 - Delayed ACK timer.

A corresponding iproute2 patch will follow to replace "unknown"
with "delack":

ESTAB 10     0   [2002:a05:6830:1f86::]:12875 [2002:a05:6830:1f85::]:50438

    timer:(unknown,003ms,0) ino:152178 sk:3004 cgroup:unreachable:189 <->

    skmem:(r1344,rb12780520,t0,tb262144,f2752,w0,o250,bl0,d0) ts usec_ts
    ...

Also add the following enum in uapi/linux/inet_diag.h
as suggested by David Ahern.

enum {
	IDIAG_TIMER_OFF,
	IDIAG_TIMER_ON,
	IDIAG_TIMER_KEEPALIVE,
	IDIAG_TIMER_TIMEWAIT,
	IDIAG_TIMER_PROBE0,
	IDIAG_TIMER_DELACK,
};

Neal Cardwell suggested to test for ICSK_ACK_TIMER:
inet_csk_clear_xmit_timer() does not call sk_stop_timer()
because INET_CSK_CLEAR_TIMERS is unset.
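
As an illustration only, the precedence order described above can be modelled
in a few lines of Python. The numeric values follow the order of the enum
quoted in this message; the function name and its boolean parameters are
hypothetical and do not mirror the kernel's data structures.

```python
from enum import IntEnum


class IdiagTimer(IntEnum):
    # Values follow the enum order quoted in the commit message.
    OFF = 0
    ON = 1          # retransmit timer
    KEEPALIVE = 2
    TIMEWAIT = 3
    PROBE0 = 4
    DELACK = 5


def pick_idiag_timer(retransmit=False, probe0=False,
                     keepalive=False, delack=False):
    """Return the timer code to report, checking timers in precedence order.

    Hypothetical helper: the real code inspects socket state, not booleans.
    """
    if retransmit:
        return IdiagTimer.ON
    if probe0:
        return IdiagTimer.PROBE0
    if keepalive:
        return IdiagTimer.KEEPALIVE
    if delack:
        # New in this patch: reported only when no other timer is active.
        return IdiagTimer.DELACK
    return IdiagTimer.OFF
```

In this model a pending delayed ACK is reported only when the retransmit,
probe0 and keepalive timers are all inactive.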

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260305114829.2163276-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agoMerge branch 'net-stmmac-mdio-related-cleanups'
Jakub Kicinski [Fri, 6 Mar 2026 23:39:12 +0000 (15:39 -0800)] 
Merge branch 'net-stmmac-mdio-related-cleanups'

Russell King says:

====================
net: stmmac: mdio related cleanups

The first four patches clean up the MDC clock divisor selection code,
turning the three different ways we choose a divisor into tabular form,
rather than doing the selection purely in code.

Convert MDIO to use field_prep() which allows a non-constant mask to be
used when preparing fields.

Then use u32 and the associated typed GENMASK for MDIO register field
definitions.

Finally, an extra couple of patches that use appropriate types in
struct mdio_bus_data.
====================

Link: https://patch.msgid.link/aald--qJquWGIvmO@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: stmmac: make pcs_mask and phy_mask u32
Russell King (Oracle) [Thu, 5 Mar 2026 10:43:02 +0000 (10:43 +0000)] 
net: stmmac: make pcs_mask and phy_mask u32

The PCS and PHY masks are passed to the mdio bus layer as phy_mask
to prevent bus addresses between 0 and 31 inclusive being scanned,
and this is declared as u32. Also declare these as u32 in stmmac
for type consistency.

Since this is a u32, use BIT_U32() rather than BIT() to generate
values for these fields.
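
A rough Python model of how such a 32-bit mask excludes bus addresses from
scanning is shown here. The helper names are hypothetical; the only
assumption taken from the text is that a set bit prevents the corresponding
address (0 to 31) from being scanned.

```python
def bit_u32(n: int) -> int:
    """Model of a 32-bit single-bit mask; rejects bit numbers outside 0..31."""
    if not 0 <= n <= 31:
        raise ValueError("bit number must be in the range 0..31")
    return 1 << n


def scanned_addresses(phy_mask: int) -> list:
    """Return the bus addresses that are still scanned for a given mask."""
    return [addr for addr in range(32) if not phy_mask & bit_u32(addr)]
```

For example, a mask built from bit_u32(0) | bit_u32(31) leaves addresses
1 to 30 to be scanned.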

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1vy6AY-0000000BtxJ-3smT@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: stmmac: mdio_bus_data->default_an_inband is boolean
Russell King (Oracle) [Thu, 5 Mar 2026 10:42:57 +0000 (10:42 +0000)] 
net: stmmac: mdio_bus_data->default_an_inband is boolean

default_an_inband is declared as an unsigned int, but is set to true/
false and is assigned to phylink_config's member of the same name
which is a bool. Declare this also as a bool for consistency.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1vy6AT-0000000BtxD-2qm7@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: stmmac: use GENMASK_U32() for mdio bitfields
Russell King (Oracle) [Thu, 5 Mar 2026 10:42:52 +0000 (10:42 +0000)] 
net: stmmac: use GENMASK_U32() for mdio bitfields

Rather than using hex numbers, use GENMASK() for mdio bitfields.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1vy6AO-0000000Btx7-2NDV@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: stmmac: use u32 for MDIO register field masks
Russell King (Oracle) [Thu, 5 Mar 2026 10:42:47 +0000 (10:42 +0000)] 
net: stmmac: use u32 for MDIO register field masks

MDIO registers are 32-bit, so use u32 to describe the masks for these
registers. Convert the GENMASK() initialisers to GENMASK_U32() for
type compatibility.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1vy6AJ-0000000Btx1-1teC@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: stmmac: mdio: convert field prep to use field_prep()
Russell King (Oracle) [Thu, 5 Mar 2026 10:42:42 +0000 (10:42 +0000)] 
net: stmmac: mdio: convert field prep to use field_prep()

Convert the MDIO field preparation to use field_prep(), which removes
the need to store separate mask and shifts. Also convert the clk_csr
value using __ffs() to do the shift as we need to detect overflows
for this.
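
The idea can be sketched in Python as follows. This is a model of the
technique (derive the shift from the lowest set bit of the mask), not the
kernel macro itself; the function names and the example mask are
illustrative assumptions.

```python
def lowest_set_bit(mask: int) -> int:
    """Index of the least significant set bit, the role __ffs() plays."""
    if mask == 0:
        raise ValueError("mask must be non-zero")
    return (mask & -mask).bit_length() - 1


def field_prep(mask: int, value: int) -> int:
    """Shift value into the field described by mask and truncate to it."""
    return (value << lowest_set_bit(mask)) & mask


def field_prep_checked(mask: int, value: int) -> int:
    """Like field_prep(), but report values that do not fit in the field."""
    shifted = value << lowest_set_bit(mask)
    if shifted & ~mask:
        raise OverflowError("value does not fit in the field")
    return shifted
```

Because the shift is derived from the mask, only the mask needs to be
stored. The checked variant shows why doing the shift explicitly is useful
when overflow has to be detected rather than silently truncated.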

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1vy6AE-0000000Btwv-1LM4@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: stmmac: mdio: simplify MDC clock divisor lookup
Russell King (Oracle) [Thu, 5 Mar 2026 10:42:37 +0000 (10:42 +0000)] 
net: stmmac: mdio: simplify MDC clock divisor lookup

As each lookup now iterates over each table in the same way, simplify
the code to select the table, and then walk that table.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1vy6A9-0000000Btwp-0lxY@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: stmmac: mdio: use same test for MDC clock divisor lookups
Russell King (Oracle) [Thu, 5 Mar 2026 10:42:32 +0000 (10:42 +0000)] 
net: stmmac: mdio: use same test for MDC clock divisor lookups

Use the same frequency test for all clk_csr value lookups (clock
rate > table rate). This has the side effect that the standard rate
table results in the divider being used for the maximum frequency
for the divider rather than the next higher divider. This still
allows MDC to meet the IEEE 802.3 specification, but at a rate closer
to 2.5MHz for these frequencies.
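
A minimal sketch of such a lookup, assuming a table of (upper rate limit,
divisor) pairs that is walked with the "clock rate > table rate" test. The
table layout and the function name are hypothetical; the limits and
divisors are taken from the standard CSR ranges (20-35MHz /16 up to
500-800MHz /324).

```python
# (upper limit of the range in Hz, MDC divisor); hypothetical table layout.
STANDARD_TABLE = [
    (35_000_000, 16),
    (60_000_000, 26),
    (100_000_000, 42),
    (150_000_000, 62),
    (250_000_000, 102),
    (300_000_000, 124),
    (500_000_000, 204),
    (800_000_000, 324),
]


def pick_divisor(clk_rate_hz: int, table=STANDARD_TABLE) -> int:
    """Return the divisor of the first entry the clock rate does not exceed."""
    for limit_hz, divisor in table:
        if clk_rate_hz > limit_hz:
            continue  # rate is above this range, try the next entry
        return divisor
    # Choice made for this sketch only; the driver may behave differently.
    raise ValueError("clock rate is above the highest table entry")
```

With this test a clock of exactly 300MHz selects /124 (about 2.42MHz)
rather than the next higher divisor, which keeps MDC closer to 2.5MHz.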

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1vy6A4-0000000Btwj-0ATB@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: stmmac: mdio: convert MDC clock divisor selection to tables
Russell King (Oracle) [Thu, 5 Mar 2026 10:42:26 +0000 (10:42 +0000)] 
net: stmmac: mdio: convert MDC clock divisor selection to tables

Convert the MDC clock divisor selection to tabular format.

Note that there is a change for 300MHz, but this is not a problem,
as the MDC clock remains within the useable ranges, which are:

STMMAC_CSR_500_800M  /324  1.54 - 2.47MHz
STMMAC_CSR_300_500M  /204  1.47 - 2.45MHz
STMMAC_CSR_250_300M  /124  2.02 - 2.42MHz
STMMAC_CSR_150_250M  /102  1.47 - 2.45MHz
STMMAC_CSR_100_150M  /62   1.61 - 2.42MHz
STMMAC_CSR_60_100M   /42   1.43 - 2.38MHz
STMMAC_CSR_35_60M    /26   1.35 - 2.31MHz
STMMAC_CSR_20_35M    /16   1.25 - 2.19MHz

Thus, with the change of divisor for exactly 300MHz, MDC temporarily
changes from 2.42MHz to 1.47MHz for the sake of consistency.

The databook does not specify whether the frequency limits for the
CSR divider are inclusive or exclusive.
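
The ranges in the table can be re-derived by dividing each clock limit by
its divisor. The short Python check below does that; the tuple layout and
function name are illustrative, and the 2.5MHz ceiling is the limit
referred to elsewhere in this series.

```python
# (name, low MHz, high MHz, divisor), copied from the table above.
CSR_RANGES = [
    ("STMMAC_CSR_500_800M", 500, 800, 324),
    ("STMMAC_CSR_300_500M", 300, 500, 204),
    ("STMMAC_CSR_250_300M", 250, 300, 124),
    ("STMMAC_CSR_150_250M", 150, 250, 102),
    ("STMMAC_CSR_100_150M", 100, 150, 62),
    ("STMMAC_CSR_60_100M", 60, 100, 42),
    ("STMMAC_CSR_35_60M", 35, 60, 26),
    ("STMMAC_CSR_20_35M", 20, 35, 16),
]

MDC_MAX_MHZ = 2.5


def mdc_range_mhz(low_mhz, high_mhz, divisor):
    """MDC frequency at the bottom and at the top of a CSR clock range."""
    return low_mhz / divisor, high_mhz / divisor


for name, low, high, div in CSR_RANGES:
    mdc_low, mdc_high = mdc_range_mhz(low, high, div)
    print(f"{name:<20} /{div:<4} {mdc_low:.2f} - {mdc_high:.2f}MHz")
```

For a clock of exactly 300MHz the two candidate divisors give 300/124
(about 2.42MHz) and 300/204 (about 1.47MHz), both below the ceiling.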

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1vy69y-0000000Btwd-3oq7@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agoselftests: drv-net: iou-zcrx: wait for memory cleanup of probe run
Dragos Tatulea [Thu, 5 Mar 2026 08:04:45 +0000 (10:04 +0200)] 
selftests: drv-net: iou-zcrx: wait for memory cleanup of probe run

The large chunks test does a probe run of iou-zcrx before it runs the
actual test. After the probe run finishes, the context will still exist
until the deferred io_uring teardown. When running iou-zcrx the second
time, io_uring_register_ifq() can return -EEXIST due to the existence of
the old context.

The fix is simple: wait for the context teardown using the new
mp_clear_wait() utility before running the second instance of iou-zcrx.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Link: https://patch.msgid.link/20260305080446.897628-2-dtatulea@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agoRevert "net: phy: improve mdiobus_stats_acct"
Heiner Kallweit [Thu, 5 Mar 2026 17:42:04 +0000 (18:42 +0100)] 
Revert "net: phy: improve mdiobus_stats_acct"

This reverts commit 1afccc5a201ec7c9023370958bae1312369b64da.

As reported by Marek the change causes a warning on non-PREEMPT_RT
32 bit systems.

Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/c3a1aba9-3fae-4c4b-bcb1-fb620fb7a309@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agodocs: netdev: refine netdevsim testing guidance
Jakub Kicinski [Wed, 4 Mar 2026 15:16:46 +0000 (07:16 -0800)] 
docs: netdev: refine netdevsim testing guidance

The library to create tests for both NIC HW and netdevsim has existed
for almost a year. The netdevsim-only tests we receive increasingly feel
like a waste; we should try to write tests that work on both netdevsim
and real HW. Refine the guidance accordingly.

Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260304151647.2770466-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agoMerge branch 'selftests-net-add-netkit-container-env-and-test'
Jakub Kicinski [Fri, 6 Mar 2026 21:11:21 +0000 (13:11 -0800)] 
Merge branch 'selftests-net-add-netkit-container-env-and-test'

David Wei says:

====================
selftests/net: add netkit container env and test

Add a new Python selftest env NetDrvContEnv that sets up a pair of
netkit netdevs, with one inside of a netns, and a bpf prog that forwards
skbs from NETIF to the netkit inside the netns.

  NETIF           = "eth0"
  LOCAL_V6        = "2001:db8:1::1"
  REMOTE_V6       = "2001:db8:1::2"
  LOCAL_PREFIX_V6 = "2001:db8:2::0/64"

          +-----------------------------+        +------------------------------+
  dst     | INIT NS                     |        | TEST NS                      |
  2001:   | +---------------+           |        |                              |
  db8:2::2| | NETIF         |           |  bpf   |                              |
      +---|>| 2001:db8:1::1 |           |redirect| +-------------------------+  |
      |   | |               |-----------|--------|>| Netkit                  |  |
      |   | +---------------+           | _peer  | | nk_guest                |  |
      |   | +-------------+ Netkit pair |        | | fe80::2/64              |  |
      |   | | Netkit      |.............|........|>| 2001:db8:2::2/64        |  |
      |   | | nk_host     |             |        | +-------------------------+  |
      |   | | fe80::1/64  |             |        |                              |
      |   | +-------------+             |        | route:                       |
      |   |                             |        |   default                    |
      |   | route:                      |        |     via fe80::1 dev nk_guest |
      |   |   2001:db8:2::2/128         |        +------------------------------+
      |   |     via fe80::2 dev nk_host |
      |   +-----------------------------+
      |
      |   +---------------+
      |   | REMOTE        |
      +---| 2001:db8:1::2 |
          +---------------+

I will use this series for queue leasing selftests. Include a basic ping
test in this series as demonstration.
====================

Link: https://patch.msgid.link/20260305181803.2912736-1-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agoselftests/net: Add netkit container ping test
David Wei [Thu, 5 Mar 2026 18:18:03 +0000 (10:18 -0800)] 
selftests/net: Add netkit container ping test

Add a basic ping test using NetDrvContEnv that sets up a netkit pair,
with one end in a netns. Use LOCAL_PREFIX_V6 and nk_forward BPF program
to ping from a remote host to the netkit in netns.

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/20260305181803.2912736-5-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agoselftests/net: Add env for container based tests
David Wei [Thu, 5 Mar 2026 18:18:02 +0000 (10:18 -0800)] 
selftests/net: Add env for container based tests

Add an env NetDrvContEnv for container based selftests. This automates
the setup of a netns, netkit pair with one inside the netns, and a BPF
program that forwards skbs from NETIF on the host into the container.

Currently only netkit is used, but other virtual netdevs e.g. veth can
be used too.

Expect netkit container datapath selftests to have a publicly routable
IP prefix to assign to netkit in a container, such that packets will
land on eth0. The BPF skb forward program will then forward such packets
from the host netns to the container netns.

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/20260305181803.2912736-4-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agoselftests/net: Export Netlink class via lib.py
David Wei [Thu, 5 Mar 2026 18:18:01 +0000 (10:18 -0800)] 
selftests/net: Export Netlink class via lib.py

Making rtnl newlink calls requires constants defined in Netlink class in
pyynl. Export it.

Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://patch.msgid.link/20260305181803.2912736-3-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agoselftests/net: Add bpf skb forwarding program
David Wei [Thu, 5 Mar 2026 18:18:00 +0000 (10:18 -0800)] 
selftests/net: Add bpf skb forwarding program

Add nk_forward.bpf.c, a BPF program that forwards skbs matching some IPv6
prefix received on eth0 ifindex to a specified netkit ifindex. This will
be needed by netkit container tests.
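
The forwarding decision can be illustrated in Python with the standard
ipaddress module. This models only the match described above (ingress
ifindex plus destination prefix); it is not the BPF program, the function
name is hypothetical, and the prefix is the documentation prefix used as
LOCAL_PREFIX_V6 in the cover letter of this series.

```python
import ipaddress

# Example prefix from the series cover letter (documentation address space).
LOCAL_PREFIX_V6 = ipaddress.ip_network("2001:db8:2::/64")


def should_forward(dst: str, ingress_ifindex: int, netif_ifindex: int,
                   prefix=LOCAL_PREFIX_V6) -> bool:
    """True if a packet arrived on NETIF and its destination is in prefix."""
    if ingress_ifindex != netif_ifindex:
        return False
    return ipaddress.ip_address(dst) in prefix
```

A packet for 2001:db8:2::2 arriving on NETIF matches and would be
redirected to the netkit device, while traffic for 2001:db8:1::1 does not
match and is left alone.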

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260305181803.2912736-2-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agotools: ynl: add uns-admin-perm to genetlink
Antonio Quartulli [Wed, 4 Mar 2026 14:10:09 +0000 (15:10 +0100)] 
tools: ynl: add uns-admin-perm to genetlink

GENL_UNS_ADMIN_PERM may be required by protocols using
the `genetlink` family, however, this flag is currently
only allowed in `genetlink-legacy`.

Add it to the list of possible values in genetlink.yaml too.

Cc: Simon Horman <horms@kernel.org>
Cc: Donald Hunter <donald.hunter@gmail.com>
Link: https://github.com/OpenVPN/ovpn-net-next/issues/33
Suggested-by: Ralf Lici <ralf@mandelbit.com>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20260304141020.23270-1-antonio@openvpn.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: airoha: Rely __field_prep for non-constant masks
Lorenzo Bianconi [Wed, 4 Mar 2026 10:56:47 +0000 (11:56 +0100)] 
net: airoha: Rely __field_prep for non-constant masks

Rely on the __field_prep macros to prepare register update values for
non-constant masks instead of open-coding them.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260304-airoha-__field_prep-v1-1-b185facc4e2f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agoMerge branch 'net-cadence-macb-add-ieee-802-3az-eee-support'
Jakub Kicinski [Fri, 6 Mar 2026 02:56:53 +0000 (18:56 -0800)] 
Merge branch 'net-cadence-macb-add-ieee-802-3az-eee-support'

Nicolai Buchwitz says:

====================
net: cadence: macb: add IEEE 802.3az EEE support

Add Energy Efficient Ethernet (IEEE 802.3az) support to the Cadence GEM
(macb) driver using phylink's managed EEE framework. The GEM MAC has
hardware LPI registers but no built-in idle timer, so the driver
implements software-managed TX LPI using a delayed_work timer while
delegating EEE negotiation and ethtool state to phylink.

The series is structured as follows:

  1. LPI statistics: Expose the four hardware EEE counters (RX/TX LPI
     transitions and time) through ethtool -S, accumulated in software
     since they are clear-on-read. Adds register offset definitions
     GEM_RXLPI/RXLPITIME/TXLPI/TXLPITIME (0x270-0x27c).

  2. TX LPI engine: Introduces GEM_TXLPIEN (NCR bit 19) and
     MACB_CAPS_EEE alongside the implementation that uses them.
     phylink mac_enable_tx_lpi / mac_disable_tx_lpi callbacks with a
     delayed_work-based idle timer. LPI entry is deferred 1 second
     after link-up per IEEE 802.3az. Wake before transmit with a
     conservative 50us PHY wake delay (IEEE 802.3az Tw_sys_tx).

  3. ethtool EEE ops: get_eee/set_eee delegating to phylink for PHY
     negotiation and timer management.

  4. RP1 enablement: Set MACB_CAPS_EEE for the Raspberry Pi 5's RP1
     southbridge (Cadence GEM_GXL rev 0x00070109 + BCM54213PE PHY).

  5. EyeQ5 enablement: Set MACB_CAPS_EEE for the Mobileye EyeQ5 GEM
     instance, verified with a hardware loopback by Théo Lebrun.

Tested on Raspberry Pi 5 (1000BASE-T, BCM54213PE PHY, 250ms LPI timer):

  iperf3 throughput (no regression):
    TCP TX: 937.8 Mbit/s (EEE on) vs 937.0 Mbit/s (EEE off)
    TCP RX: 936.5 Mbit/s both

  Latency (ping RTT, small expected increase from LPI wake):
    1s interval:  0.273 ms (EEE on) vs 0.181 ms (EEE off)
    10ms interval: 0.206 ms (EEE on) vs 0.168 ms (EEE off)
    flood ping:   0.200 ms (EEE on) vs 0.156 ms (EEE off)

  LPI counters (ethtool -S, 1s-interval ping, EEE on):
    tx_lpi_transitions: 112
    tx_lpi_time: 15574651

  Zero packet loss across all tests. Also verified with
  ethtool --show-eee / --set-eee and cable unplug/replug cycling.
====================

Link: https://patch.msgid.link/20260304105432.631186-1-nb@tipi-net.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: cadence: macb: enable EEE for Mobileye EyeQ5
Nicolai Buchwitz [Wed, 4 Mar 2026 10:54:32 +0000 (11:54 +0100)] 
net: cadence: macb: enable EEE for Mobileye EyeQ5

Set MACB_CAPS_EEE for the Mobileye EyeQ5 GEM instance. EEE has been
verified on EyeQ5 hardware using a loopback setup with ethtool
--show-eee confirming EEE active on both ends at 100baseT/Full and
1000baseT/Full.

Tested-by: Théo Lebrun <theo.lebrun@bootlin.com>
Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>
Link: https://patch.msgid.link/20260304105432.631186-6-nb@tipi-net.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: cadence: macb: enable EEE for Raspberry Pi RP1
Nicolai Buchwitz [Wed, 4 Mar 2026 10:54:31 +0000 (11:54 +0100)] 
net: cadence: macb: enable EEE for Raspberry Pi RP1

Set MACB_CAPS_EEE for the Raspberry Pi 5 RP1 southbridge
(Cadence GEM_GXL rev 0x00070109 paired with BCM54213PE PHY).

EEE has been verified on RP1 hardware: the LPI counter registers
at 0x270-0x27c return valid data, the TXLPIEN bit in NCR (bit 19)
controls LPI transmission correctly, and ethtool --show-eee reports
the negotiated state after link-up.

Other GEM variants that share the same LPI register layout (SAMA5D2,
SAME70, PIC32CZ) can be enabled by adding MACB_CAPS_EEE to their
respective config entries once tested.

Reviewed-by: Claudiu Beznea <claudiu.beznea@tuxon.dev>
Reviewed-by: Théo Lebrun <theo.lebrun@bootlin.com>
Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>
Link: https://patch.msgid.link/20260304105432.631186-5-nb@tipi-net.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: cadence: macb: add ethtool EEE support
Nicolai Buchwitz [Wed, 4 Mar 2026 10:54:30 +0000 (11:54 +0100)] 
net: cadence: macb: add ethtool EEE support

Implement get_eee and set_eee ethtool ops for GEM as simple passthroughs
to phylink_ethtool_get_eee() and phylink_ethtool_set_eee().

No MACB_CAPS_EEE guard is needed: phylink returns -EOPNOTSUPP from both
ops when mac_supports_eee is false, which is the case when
lpi_capabilities and lpi_interfaces are not populated. Those fields are
only set when MACB_CAPS_EEE is present (previous patch), so phylink
already handles the unsupported case correctly.

Reviewed-by: Claudiu Beznea <claudiu.beznea@tuxon.dev>
Reviewed-by: Théo Lebrun <theo.lebrun@bootlin.com>
Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>
Link: https://patch.msgid.link/20260304105432.631186-4-nb@tipi-net.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: cadence: macb: implement EEE TX LPI support
Nicolai Buchwitz [Wed, 4 Mar 2026 10:54:29 +0000 (11:54 +0100)] 
net: cadence: macb: implement EEE TX LPI support

The GEM MAC has hardware LPI registers (NCR bit 19: TXLPIEN) but no
built-in idle timer, so asserting TXLPIEN blocks all TX immediately
with no automatic wake. A software idle timer is required, as noted
in Microchip documentation (section 40.6.19): "It is best to use
firmware to control LPI."

Implement phylink managed EEE using the mac_enable_tx_lpi and
mac_disable_tx_lpi callbacks:

- macb_tx_lpi_set(): sets or clears TXLPIEN; requires bp->lock to be
  held by the caller (asserted with lockdep_assert_held). Returns bool
  indicating whether the register actually changed, avoiding redundant
  writes and unnecessary udelay on the xmit fast path.

- macb_tx_lpi_work_fn(): delayed_work handler that enters LPI if all
  TX queues are idle and EEE is still active. Takes bp->lock with
  irqsave before calling macb_tx_lpi_set().

- macb_tx_lpi_schedule(): arms the work timer using the LPI timer
  value provided by phylink (default 250 ms). Called from
  macb_tx_complete() after each TX drain so the idle countdown
  restarts whenever the ring goes quiet.

- macb_tx_lpi_wake(): called from macb_start_xmit() under bp->lock,
  immediately before TSTART. Returns early if eee_active is false to
  avoid a register read on the common path when EEE is disabled.
  Clears TXLPIEN and applies a 50 us udelay for PHY wake (IEEE
  802.3az Tw_sys_tx is 16.5 us for 1000BASE-T / 30 us for
  100BASE-TX; GEM has no hardware enforcement). Only delays when
  TXLPIEN was actually set. The delay is placed after tx_head is
  advanced so the work_fn's queue-idle check sees a non-empty ring
  and cannot race back into LPI before the frame is transmitted.

- mac_enable_tx_lpi: stores the timer and sets eee_active under
  bp->lock, then defers the first LPI entry by 1 second per IEEE
  802.3az section 22.7a.

- mac_disable_tx_lpi: cancels the work (sync, without the lock to
  avoid deadlock with the work_fn), then takes bp->lock to clear
  eee_active and deassert TXLPIEN.

Populate phylink_config lpi_interfaces (MII, GMII, RGMII variants)
and lpi_capabilities (MAC_100FD | MAC_1000FD) so phylink can
negotiate EEE with the PHY and call the callbacks appropriately.
Set lpi_timer_default to 250000 us and eee_enabled_default to true.
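
The interaction between the helpers described above can be pictured with a
toy Python state model. It is a sketch under simplifying assumptions: no
locking, no real timers, a single TX queue, and hypothetical class and
method names. Only the 50 us wake delay and the rule that the delay applies
solely when TXLPIEN was actually set are taken from the text.

```python
class TxLpiModel:
    """Toy model of software-managed TX LPI; not driver code."""

    WAKE_DELAY_US = 50

    def __init__(self):
        self.eee_active = False
        self.txlpien = False
        self.tx_pending = 0

    def lpi_set(self, enable: bool) -> bool:
        """Set or clear TXLPIEN; return True only if the bit changed."""
        if self.txlpien == enable:
            return False
        self.txlpien = enable
        return True

    def idle_timer_fired(self):
        """Delayed-work handler: enter LPI only if EEE is on and TX is idle."""
        if self.eee_active and self.tx_pending == 0:
            self.lpi_set(True)

    def start_xmit(self) -> int:
        """Queue one frame and return the wake delay to apply, in us."""
        self.tx_pending += 1  # ring is non-empty before the wake check
        if not self.eee_active:
            return 0
        return self.WAKE_DELAY_US if self.lpi_set(False) else 0

    def tx_complete(self):
        """One frame has been transmitted."""
        self.tx_pending -= 1

    def disable(self):
        """Model of the disable callback: clear eee_active and TXLPIEN."""
        self.eee_active = False
        self.lpi_set(False)
```

In this model the first transmit after an idle period pays the wake delay,
later transmits do not, and the idle handler cannot re-enter LPI while a
frame is still pending.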

Reviewed-by: Claudiu Beznea <claudiu.beznea@tuxon.dev>
Reviewed-by: Théo Lebrun <theo.lebrun@bootlin.com>
Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>
Link: https://patch.msgid.link/20260304105432.631186-3-nb@tipi-net.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: cadence: macb: add EEE LPI statistics counters
Nicolai Buchwitz [Wed, 4 Mar 2026 10:54:28 +0000 (11:54 +0100)] 
net: cadence: macb: add EEE LPI statistics counters

The GEM MAC provides four read-only, clear-on-read LPI statistics
registers at offsets 0x270-0x27c:

  GEM_RXLPI     (0x270): RX LPI transition count (16-bit)
  GEM_RXLPITIME (0x274): cumulative RX LPI time (24-bit)
  GEM_TXLPI     (0x278): TX LPI transition count (16-bit)
  GEM_TXLPITIME (0x27c): cumulative TX LPI time (24-bit)

Add register offset definitions, extend struct gem_stats with
corresponding u64 software accumulators, and register the four
counters in gem_statistics[] so they appear in ethtool -S output.
Because the hardware counters clear on read, the existing
macb_update_stats() path accumulates them into the u64 fields on
every stats poll, preventing loss between userspace reads.
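
A minimal Python model of the accumulation scheme, for illustration. The
class and function names are hypothetical, and the model deliberately
ignores what the hardware does when a counter overflows between polls.

```python
class ClearOnReadCounter:
    """Model of a hardware counter that resets to zero when it is read."""

    def __init__(self):
        self.value = 0

    def hw_increment(self, n: int = 1):
        """Simulate the hardware counting n events."""
        self.value += n

    def read(self) -> int:
        """Return the current count and clear it, as the register does."""
        value, self.value = self.value, 0
        return value


def update_stats(totals: dict, registers: dict) -> None:
    """Add each register's count since the last poll to its running total."""
    for name, reg in registers.items():
        totals[name] = totals.get(name, 0) + reg.read()
```

Each poll folds the events seen since the previous poll into the running
total, so nothing is lost even though every read clears the register.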

These registers are present on SAMA5D2, SAME70, PIC32CZ, and RP1
variants of the Cadence GEM IP and have been confirmed on RP1 via
devmem reads.

Reviewed-by: Claudiu Beznea <claudiu.beznea@tuxon.dev>
Reviewed-by: Théo Lebrun <theo.lebrun@bootlin.com>
Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>
Link: https://patch.msgid.link/20260304105432.631186-2-nb@tipi-net.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agotcp: Initialise ehash secrets during connect() and listen().
Kuniyuki Iwashima [Tue, 3 Mar 2026 23:54:16 +0000 (23:54 +0000)] 
tcp: Initialise ehash secrets during connect() and listen().

inet_ehashfn() and inet6_ehashfn() initialise random secrets
on the first call by net_get_random_once().

While the init part is patched out using static keys, with
CONFIG_STACKPROTECTOR_STRONG=y, this causes the compiler to
generate a stack canary due to an automatic variable,
unsigned long ___flags, in the DO_ONCE() macro being passed
to __do_once_start().

With FDO, this is visible in __inet_lookup_established() and
__inet6_lookup_established() too.

Let's initialise the secrets by get_random_sleepable_once()
in the slow paths: inet_hash() for listen(), and
inet_hash_connect() and inet6_hash_connect() for connect().

Note that IPv6 listener will initialise both IPv4 & IPv6 secrets
in inet_hash() for IPv4-mapped IPv6 address.

With the patch, the stack size is reduced by 16 bytes (___flags
 + a stack canary) and NOPs for the static key go away.

Before: __inet6_lookup_established()

       ...
       push   %rbx
       sub    $0x38,%rsp                # stack is 56 bytes
       mov    %edx,%ebx                 # sport
       mov    %gs:0x299419f(%rip),%rax  # load stack canary
       mov    %rax,0x30(%rsp)           #   and store it onto stack
       mov    0x440(%rdi),%r15          # net->ipv4.tcp_death_row.hashinfo
       nop
 32:   mov    %r8d,%ebp                 # hnum
       shl    $0x10,%ebp                # hnum << 16
       nop
 3d:   mov    0x70(%rsp),%r14d          # sdif
       or     %ebx,%ebp                 # INET_COMBINED_PORTS(sport, hnum)
       mov    0x11a8382(%rip),%eax      # inet6_ehashfn() ...

After: __inet6_lookup_established()

       ...
       push   %rbx
       sub    $0x28,%rsp                # stack is 40 bytes
       mov    0x60(%rsp),%ebp           # sdif
       mov    %r8d,%r14d                # hnum
       shl    $0x10,%r14d               # hnum << 16
       or     %edx,%r14d                # INET_COMBINED_PORTS(sport, hnum)
       mov    0x440(%rdi),%rax          # net->ipv4.tcp_death_row.hashinfo
       mov    0x1194f09(%rip),%r10d     # inet6_ehashfn() ...
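
The stack figures quoted above follow directly from the two "sub" operands.
A small Python check is given here; the 8-byte size of each slot is an
assumption based on the 64-bit registers in the listing.

```python
STACK_BEFORE = 0x38  # operand of "sub $0x38,%rsp" in the Before listing
STACK_AFTER = 0x28   # operand of "sub $0x28,%rsp" in the After listing

# Assumed slot sizes on a 64-bit build: unsigned long ___flags and the canary.
FLAGS_BYTES = 8
CANARY_BYTES = 8

saved = STACK_BEFORE - STACK_AFTER
print(f"before={STACK_BEFORE} after={STACK_AFTER} saved={saved} bytes")
```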

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260303235424.3877267-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agoMerge branch 'doc-netlink-expand-nftables-specification'
Jakub Kicinski [Fri, 6 Mar 2026 02:49:10 +0000 (18:49 -0800)] 
Merge branch 'doc-netlink-expand-nftables-specification'

Remy D. Farley says:

====================
doc/netlink: Expand nftables specification

Getting out some changes I've accumulated while making nftables work
with Rust netlink-bindings. Hopefully, this will be useful upstream.
====================

Link: https://patch.msgid.link/20260303195638.381642-1-one-d-wide@protonmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agodoc/netlink: nftables: Fill out operation attributes
Remy D. Farley [Tue, 3 Mar 2026 20:00:12 +0000 (20:00 +0000)] 
doc/netlink: nftables: Fill out operation attributes

Filled out operation attributes:
- newtable
- gettable
- deltable
- destroytable
- newchain
- getchain
- delchain
- destroychain
- newrule
- getrule
- getrule-reset
- delrule
- destroyrule
- newset
- getset
- delset
- destroyset
- newsetelem
- getsetelem
- getsetelem-reset
- delsetelem
- destroysetelem
- getgen
- newobj
- getobj
- delobj
- destroyobj
- newflowtable
- getflowtable
- delflowtable
- destroyflowtable

Signed-off-by: Remy D. Farley <one-d-wide@protonmail.com>
Link: https://patch.msgid.link/20260303195638.381642-6-one-d-wide@protonmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agodoc/netlink: nftables: Add sub-messages
Remy D. Farley [Tue, 3 Mar 2026 19:59:19 +0000 (19:59 +0000)] 
doc/netlink: nftables: Add sub-messages

New sub-messages:
- log
- match
- numgen
- range

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Remy D. Farley <one-d-wide@protonmail.com>
Link: https://patch.msgid.link/20260303195638.381642-5-one-d-wide@protonmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agodoc/netlink: nftables: Update attribute sets
Remy D. Farley [Tue, 3 Mar 2026 19:58:52 +0000 (19:58 +0000)] 
doc/netlink: nftables: Update attribute sets

New attribute sets:
- log-attrs
- numgen-attrs
- range-attrs
- compat-target-attrs
- compat-match-attrs
- compat-attrs

Added missing attributes:
- table-attrs (pad, owner)
- set-attrs (type, count)

Added missing checks:
- range-attrs
- expr-bitwise-attrs
- compat-target-attrs
- compat-match-attrs
- compat-attrs

Annotated doc comment or associated enum:
- batch-attrs
- verdict-attrs
- expr-payload-attrs

Fixed byte order:
- nft-counter-attrs
- expr-counter-attrs
- rule-compat-attrs

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Remy D. Farley <one-d-wide@protonmail.com>
Link: https://patch.msgid.link/20260303195638.381642-4-one-d-wide@protonmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agodoc/netlink: nftables: Add definitions
Remy D. Farley [Tue, 3 Mar 2026 19:58:13 +0000 (19:58 +0000)] 
doc/netlink: nftables: Add definitions

New enums/flags:
- payload-base
- range-ops
- registers
- numgen-types
- log-level
- log-flags

Added missing enumerations:
- bitwise-ops

Annotated doc comment or associated enum:
- bitwise-ops

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Remy D. Farley <one-d-wide@protonmail.com>
Link: https://patch.msgid.link/20260303195638.381642-3-one-d-wide@protonmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agodoc/netlink: netlink-raw: Add max check
Remy D. Farley [Tue, 3 Mar 2026 19:57:41 +0000 (19:57 +0000)] 
doc/netlink: netlink-raw: Add max check

Add definitions for max check and len-or-limit type, the same as in other
specifications.

Suggested-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Remy D. Farley <one-d-wide@protonmail.com>
Link: https://patch.msgid.link/20260303195638.381642-2-one-d-wide@protonmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agoMerge branch 'net-stmmac-qcom-ethqos-further-serdes-reorganisation'
Jakub Kicinski [Fri, 6 Mar 2026 02:43:08 +0000 (18:43 -0800)] 
Merge branch 'net-stmmac-qcom-ethqos-further-serdes-reorganisation'

Russell King says:

====================
net: stmmac: qcom-ethqos: further serdes reorganisation

This is part 2 of the qcom-ethqos series; part 1 and patch 2 of part 2
have now been merged.

This part of the series focuses on the generic PHY driver, but these
changes have dependencies on the ethernet driver, hence why
it will need to go via net-next. Furthermore, subsequent changes
depend on these patches.

The underlying ideas here are:

- get rid of the driver using phy_set_speed() with SPEED_1000 and
  SPEED_2500 which makes no sense for an ethernet SerDes due to the
  PCS 8B10B data encoding, which inflates the data rate at the SerDes
  compared to the MAC. This is replaced with phy_set_mode_ext().
- allow phy_power_on() / phy_set_mode*() to be called in any order.

Mohd has tested this series, although not in the resulting merge order.
====================

Link: https://patch.msgid.link/aacD3osfaZkLsGxm@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: stmmac: qcom-ethqos: remove phy_set_mode_ext() after phy_power_on()
Russell King (Oracle) [Tue, 3 Mar 2026 15:54:06 +0000 (15:54 +0000)] 
net: stmmac: qcom-ethqos: remove phy_set_mode_ext() after phy_power_on()

The call to phy_set_mode_ext() after phy_power_on() was a work-around
for the qcom-sgmii-eth SerDes driver that only re-enabled its clocks on
phy_power_on() but did not configure the PHY. Now that the SerDes driver
fully configures the SerDes at phy_power_on(), there is no need to call
phy_set_mode_ext() immediately afterwards.

This also means we no longer need to record the previous operating mode
of the driver - this is up to the SerDes driver. In any case, the only
thing that we care about is that the SerDes provides the necessary clocks to
the stmmac core to allow it to reset at this point. The actual mode is
irrelevant at this point as the correct mode will be configured in
ethqos_mac_finish_serdes() just before the network device is brought
online.

Reviewed-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vxS4U-0000000BQXy-1Q1v@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agophy: qcom-sgmii-eth: relax order of .power_on() vs .set_mode*()
Russell King (Oracle) [Tue, 3 Mar 2026 15:54:01 +0000 (15:54 +0000)] 
phy: qcom-sgmii-eth: relax order of .power_on() vs .set_mode*()

Allow any order of the .power_on() and .set_mode*() methods as per the
recent discussion. This means phy_power_on() with this SerDes will now
restore the previous setup without requiring a subsequent
phy_set_mode*() call.

Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Acked-by: Vinod Koul <vkoul@kernel.org>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vxS4P-0000000BQXs-0vGB@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agophy: qcom-sgmii-eth: remove qcom_dwmac_sgmii_phy_interface()
Russell King (Oracle) [Tue, 3 Mar 2026 15:53:56 +0000 (15:53 +0000)] 
phy: qcom-sgmii-eth: remove qcom_dwmac_sgmii_phy_interface()

Now that qcom_dwmac_sgmii_phy_interface() only serves to validate the
passed interface mode, combine it with qcom_dwmac_sgmii_phy_validate(),
and use qcom_dwmac_sgmii_phy_validate() to validate the mode in
qcom_dwmac_sgmii_phy_set_mode().

Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Acked-by: Vinod Koul <vkoul@kernel.org>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vxS4K-0000000BQXm-0OJL@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agophy: qcom-sgmii-eth: use PHY interface mode for SerDes settings
Russell King (Oracle) [Tue, 3 Mar 2026 15:53:50 +0000 (15:53 +0000)] 
phy: qcom-sgmii-eth: use PHY interface mode for SerDes settings

As established in the previous commit, using SPEED_1000 and SPEED_2500
does not make sense for a SerDes due to the PCS encoding that is used
over the SerDes link, which inflates the data rate at the SerDes. Thus,
the use of these constants in a SerDes driver is incorrect.

Since qcom-sgmii-eth no longer implements phy_set_speed(), but instead
uses the PHY interface mode passed via the .set_mode() method, convert
the driver to use the PHY interface mode internally to decide whether
to configure the SerDes for 1.25Gbps or 3.125Gbps mode.

Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Acked-by: Vinod Koul <vkoul@kernel.org>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vxS4E-0000000BQXg-46dJ@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agophy: qcom-sgmii-eth: remove .set_speed() implementation
Russell King (Oracle) [Tue, 3 Mar 2026 15:53:45 +0000 (15:53 +0000)] 
phy: qcom-sgmii-eth: remove .set_speed() implementation

Now that the qcom-ethqos driver has migrated to use phy_set_mode_ext()
rather than phy_set_speed() to configure the SerDes, the support for
phy_set_speed() is now obsolete. Remove support for this method.

Using the MAC speed for the SerDes is never correct due to the PCS
encoding. For SGMII and 2500BASE-X, the PCS uses 8B10B encoding, and
so:

  MAC rate * PCS output bits / PCS input bits = SerDes rate
   1000M   *       10        /       8        = 1250M
   2500M   *       10        /       8        = 3125M
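The arithmetic above can be written as a small helper. This is only an illustrative sketch: serdes_rate_mbps() is a hypothetical name, not a function in the qcom-sgmii-eth driver or the generic PHY API.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Line rate on the SerDes for a given MAC data rate and PCS block code.
 * in_bits/out_bits describe the code, e.g. 8 and 10 for 8B10B.
 * Hypothetical helper for illustration only.
 */
uint64_t serdes_rate_mbps(uint64_t mac_rate_mbps,
			  unsigned int in_bits, unsigned int out_bits)
{
	assert(in_bits != 0);
	/* MAC rate * PCS output bits / PCS input bits = SerDes rate */
	return mac_rate_mbps * out_bits / in_bits;
}
```

Calling it with (1000, 8, 10) and (2500, 8, 10) reproduces the two rows of the table, which is why a MAC speed constant such as SPEED_1000 does not describe the SerDes line rate.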

Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Acked-by: Vinod Koul <vkoul@kernel.org>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vxS49-0000000BQXa-3Zcg@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: stmmac: qcom-ethqos: convert to use phy_set_mode_ext()
Russell King (Oracle) [Tue, 3 Mar 2026 15:53:40 +0000 (15:53 +0000)] 
net: stmmac: qcom-ethqos: convert to use phy_set_mode_ext()

qcom-sgmii-eth now accepts the phy_set_mode*() calls to configure the
SerDes, taking a PHY interface mode rather than a speed. This allows
the elimination of the interface mode to speed conversion in
ethqos_mac_finish_serdes().

Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Link: https://patch.msgid.link/E1vxS44-0000000BQXU-38lG@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: stmmac: qcom-ethqos: move ethqos_set_serdes_speed()
Russell King (Oracle) [Tue, 3 Mar 2026 15:53:35 +0000 (15:53 +0000)] 
net: stmmac: qcom-ethqos: move ethqos_set_serdes_speed()

Combine ethqos_set_serdes_speed() with ethqos_mac_finish_serdes() to
simplify the code.

Reviewed-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vxS3z-0000000BQXO-2WpU@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agotcp: move tcp_v6_early_demux() to net/ipv6/ip6_input.c
Eric Dumazet [Wed, 4 Mar 2026 02:27:06 +0000 (02:27 +0000)] 
tcp: move tcp_v6_early_demux() to net/ipv6/ip6_input.c

tcp_v6_early_demux() has a single caller: ip6_rcv_finish_core().

Move it to net/ipv6/ip6_input.c and mark it static, for possible
compiler/linker optimizations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260304022706.1062459-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: mdio: xgene: Fix misleading err message in xgene mdio read
Alok Tiwari [Wed, 4 Mar 2026 19:57:36 +0000 (11:57 -0800)] 
net: mdio: xgene: Fix misleading err message in xgene mdio read

xgene_xfi_mdio_read() prints "write failed" when the MDIO management
interface remains busy and the read times out. Update the message to
"read failed" to match the operation.

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20260304195755.2468204-1-alok.a.tiwari@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agoocteontx2-af: make PF_FUNC comparison consistent in NIX XOFF handling
Alok Tiwari [Wed, 4 Mar 2026 19:39:48 +0000 (11:39 -0800)] 
octeontx2-af: make PF_FUNC comparison consistent in NIX XOFF handling

nix_smq_flush_enadis_xoff() compares PF_FUNC values with the FUNC bits
masked off, but one operand applied the mask before extracting PF_FUNC
via TXSCH_MAP_FUNC().

Apply RVU_PFVF_FUNC_MASK after TXSCH_MAP_FUNC() for the TL2 scheduler
queue operand, matching the existing handling of the other operand and
making the comparison consistent and clearer.

No functional change intended.

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://patch.msgid.link/20260304193950.2467391-1-alok.a.tiwari@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agotcp: shrink per-packet memset in __tcp_transmit_skb()
Keita Morisaki [Wed, 4 Mar 2026 11:15:17 +0000 (20:15 +0900)] 
tcp: shrink per-packet memset in __tcp_transmit_skb()

Use struct_group() to group the three fields in tcp_out_options that are
read unconditionally by tcp_options_write() and bpf_skops_write_hdr_opt()
(mss, bpf_opt_len, num_sack_blocks), then replace the full-struct memset
with a targeted memset of only that group.

struct tcp_out_options is 40 bytes without MPTCP and 96 bytes with
CONFIG_MPTCP=y (typical distro config). Every remaining field is either
assigned before first use by tcp_established_options()/tcp_syn_options(),
or gated behind its OPTION_* flag in tcp_options_write(). This memset
runs on every transmitted TCP packet, so shrinking it from 96 (or 40)
bytes to 4 bytes reduces per-packet overhead on the hot path.

Assembly comparison (x86-64, GCC 13, CONFIG_MPTCP=y):

  Before: rep stos zeroing 96 bytes (5 instructions, 12 8-byte stores)
  After:  movl $0x0 zeroing 4 bytes (1 instruction, 1 store)

Also add opts->options = 0 at the top of tcp_syn_options(), which
already used |= without a prior clear. tcp_established_options() already
clears opts->options at its top.
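A userspace sketch of the grouping technique, under stated assumptions: STRUCT_GROUP() below is a simplified stand-in for the kernel's struct_group() macro, and struct out_options is a hypothetical, cut-down layout rather than the real struct tcp_out_options. Only the three grouped field names come from the description above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Simplified stand-in for struct_group(): the members are reachable both
 * directly (opts->mss) and as one named sub-object (opts->cleared).
 */
#define STRUCT_GROUP(name, ...)			\
	union {					\
		struct { __VA_ARGS__ };		\
		struct { __VA_ARGS__ } name;	\
	}

/* Hypothetical layout; the fields outside the group are placeholders. */
struct out_options {
	STRUCT_GROUP(cleared,
		uint16_t mss;
		uint8_t bpf_opt_len;
		uint8_t num_sack_blocks;
	);
	uint16_t options;	/* assigned by the caller before use */
	uint32_t tsval;		/* only read when its option flag is set */
	uint32_t tsecr;
};

static_assert(sizeof(((struct out_options *)0)->cleared) == 4,
	      "the group is expected to span 4 bytes");

/* Zero only the fields that are read unconditionally. */
void out_options_prepare(struct out_options *opts)
{
	memset(&opts->cleared, 0, sizeof(opts->cleared));
}
```

The trade-off is that every field outside the group must be written before it is read, which is the property the paragraph above argues for the remaining fields.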

Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Keita Morisaki <kmta1236@gmail.com>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260304111517.2088694-1-kmta1236@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: phy: realtek: Add support for PHY LEDs on RTL8211F-VD
Kryštof Černý [Wed, 4 Mar 2026 12:03:10 +0000 (13:03 +0100)] 
net: phy: realtek: Add support for PHY LEDs on RTL8211F-VD

Realtek RTL8211F-VD has the same LED configuration
and registers as RTL8211F.
Use the existing LED related functions for this chip,
so it is possible to also use the netdev trigger.

Tested on ROCK Pi E.

Signed-off-by: Kryštof Černý <cleverline1mc@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260304-rtl8211fvd-add-leds-v2-1-d50bd8a50f08@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Jakub Kicinski [Thu, 26 Feb 2026 18:20:47 +0000 (10:20 -0800)] 
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-7.0-rc3).

No conflicts.

Adjacent changes:

net/netfilter/nft_set_rbtree.c
  fb7fb4016300 ("netfilter: nf_tables: clone set on flush only")
  3aea466a4399 ("netfilter: nft_set_rbtree: don't disable bh when acquiring tree lock")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge tag 'net-7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 5 Mar 2026 19:00:46 +0000 (11:00 -0800)] 
Merge tag 'net-7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from CAN, netfilter and wireless.

  Current release - new code bugs:

   - sched: cake: fixup cake_mq rate adjustment for diffserv config

   - wifi: fix missing ieee80211_eml_params member initialization

  Previous releases - regressions:

   - tcp: give up on stronger sk_rcvbuf checks (for now)

  Previous releases - always broken:

   - net: fix rcu_tasks stall in threaded busypoll

   - sched:
      - fq: clear q->band_pkt_count[] in fq_reset()
      - only allow act_ct to bind to clsact/ingress qdiscs and shared
        blocks

   - bridge: check relevant per-VLAN options in VLAN range grouping

   - xsk: fix fragment node deletion to prevent buffer leak

  Misc:

   - spring cleanup of inactive maintainers"

* tag 'net-7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (138 commits)
  xdp: produce a warning when calculated tailroom is negative
  net: enetc: use truesize as XDP RxQ info frag_size
  libeth, idpf: use truesize as XDP RxQ info frag_size
  i40e: use xdp.frame_sz as XDP RxQ info frag_size
  i40e: fix registering XDP RxQ info
  ice: change XDP RxQ frag_size from DMA write length to xdp.frame_sz
  ice: fix rxq info registering in mbuf packets
  xsk: introduce helper to determine rxq->frag_size
  xdp: use modulo operation to calculate XDP frag tailroom
  selftests/tc-testing: Add tests exercising act_ife metalist replace behaviour
  net/sched: act_ife: Fix metalist update behavior
  selftests: net: add test for IPv4 route with loopback IPv6 nexthop
  net: ipv6: fix panic when IPv4 route references loopback IPv6 nexthop
  net: vxlan: fix nd_tbl NULL dereference when IPv6 is disabled
  net: bridge: fix nd_tbl NULL dereference when IPv6 is disabled
  MAINTAINERS: remove Thomas Falcon from IBM ibmvnic
  MAINTAINERS: remove Claudiu Manoil and Alexandre Belloni from Ocelot switch
  MAINTAINERS: replace Taras Chornyi with Elad Nachman for Marvell Prestera
  MAINTAINERS: remove Jonathan Lemon from OpenCompute PTP
  MAINTAINERS: replace Clark Wang with Frank Li for Freescale FEC
  ...

6 weeks agoMerge tag 'trace-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace...
Linus Torvalds [Thu, 5 Mar 2026 16:05:05 +0000 (08:05 -0800)] 
Merge tag 'trace-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix thresh_return of function graph tracer

   The update to store data on the shadow stack removed the abuse of
   using the task recursion word as a way to keep track of what
   functions to ignore. The trace_graph_return() was updated to handle
   this, but when function_graph tracer is using a threshold (only trace
   functions that took longer than a specified time), it uses
   trace_graph_thresh_return() instead.

   This function was still incorrectly using the task struct recursion
   word causing the function graph tracer to permanently set all
   functions to "notrace"

 - Fix thresh_return nosleep accounting

   When the calltime was moved to the shadow stack storage instead of
   being on the fgraph descriptor, the calculation for the amount of
   sleep time was updated. The calculation was done in the
   trace_graph_thresh_return() function, which also called the
   trace_graph_return(), which did the calculation again, causing the
   time to be doubled.

   Remove the call to trace_graph_return() as what it needed to do
   wasn't that much, and just do the work in
   trace_graph_thresh_return().

 - Fix syscall trace event activation on boot up

   The syscall trace events are pseudo events attached to the
   raw_syscall tracepoints. When the first syscall event is enabled, it
   enables the raw_syscall tracepoint and doesn't need to do anything
   when a second syscall event is also enabled.

   When events are enabled via the kernel command line, syscall events
   are partially enabled as the enabling is called before rcu_init. This
   is done to allow early events to be enabled immediately. Because
   kernel command line events do not distinguish between different types
   of events, the syscall events are enabled here but are not fully
   functioning. After rcu_init, they are disabled and re-enabled so that
   they can be fully enabled.

   The problem is that this "disable-enable" is done one at a
   time. If more than one syscall event is specified on the command
   line, by disabling them one at a time, the counter never gets to
   zero, and the raw_syscall is not disabled and enabled, keeping the
   syscall events in their non-fully functional state.

   Instead, disable all events and re-enable them all, as that will
   ensure the raw_syscall event is also disabled and re-enabled.

 - Disable preemption in ftrace pid filtering

   The ftrace pid filtering attaches to the fork and exit tracepoints to
   add or remove pids that should be traced. They access variables
   protected by RCU (preemption disabled). Now that tracepoint callbacks
   are called with preemption enabled, this protection needs to be added
   explicitly, and not depend on the functions being called with
   preemption disabled.

 - Disable preemption in event pid filtering

   The event pid filtering needs the same preemption disabling guards as
   ftrace pid filtering.

 - Fix accounting of the memory mapped ring buffer on fork

   Memory mapping the ftrace ring buffer sets the vm_flags to DONTCOPY.
   But this does not prevent the application from calling
   madvise(MADV_DOFORK). This causes the mapping to be copied on
   fork. After the first task exits, the mapping is considered unmapped
   by everyone. But when the second task exits, the counter goes below
   zero and triggers a WARN_ON.

   Since nothing prevents two separate tasks from mmapping the ftrace
   ring buffer (although two mappings may mess each other up), there's
   no reason to stop the memory from being copied on fork.

   Update the vm_operations to have an ".open" handler to update the
   accounting and let the ring buffer know someone else has it mapped.

 - Add all ftrace headers in MAINTAINERS file

   The MAINTAINERS file only specifies include/linux/ftrace.h but misses
   ftrace_irq.h and ftrace_regs.h. Make the file use wildcards to get
   all *ftrace* files.
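The reference-count problem described in the "syscall trace event activation" item can be modelled in a few lines. This is a toy model under stated assumptions, not the tracing code: struct raw_tp and every function below are hypothetical names that mimic only the counting behaviour described in that item.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the shared raw_syscall tracepoint. */
struct raw_tp {
	int refcount;		/* number of enabled syscall events */
	bool functional;	/* registered after RCU was initialised? */
};

/* The first enabler registers the tracepoint; later ones only count. */
void event_enable(struct raw_tp *tp, bool rcu_ready)
{
	if (tp->refcount++ == 0)
		tp->functional = rcu_ready;
}

/* The last disabler unregisters the tracepoint. */
void event_disable(struct raw_tp *tp)
{
	assert(tp->refcount > 0);
	if (--tp->refcount == 0)
		tp->functional = false;
}

/* Early boot: n events enabled from the command line before rcu_init. */
void boot_enable(struct raw_tp *tp, int n)
{
	for (int i = 0; i < n; i++)
		event_enable(tp, false);
}

/* Old behaviour: bounce each event individually. */
void reenable_one_at_a_time(struct raw_tp *tp, int n)
{
	for (int i = 0; i < n; i++) {
		event_disable(tp);
		event_enable(tp, true);
	}
}

/* Fixed behaviour: disable everything first, then enable everything. */
void reenable_all(struct raw_tp *tp, int n)
{
	for (int i = 0; i < n; i++)
		event_disable(tp);
	for (int i = 0; i < n; i++)
		event_enable(tp, true);
}
```

With two events, reenable_one_at_a_time() never lets the count reach zero, so the model stays in its early, non-functional state; reenable_all() drops the count to zero first and the tracepoint is registered again in the functional state.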

* tag 'trace-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ftrace: Add MAINTAINERS entries for all ftrace headers
  tracing: Fix WARN_ON in tracing_buffers_mmap_close
  tracing: Disable preemption in the tracepoint callbacks handling filtered pids
  ftrace: Disable preemption in the tracepoint callbacks handling filtered pids
  tracing: Fix syscall events activation by ensuring refcount hits zero
  fgraph: Fix thresh_return nosleeptime double-adjust
  fgraph: Fix thresh_return clear per-task notrace

6 weeks agoMerge branch 'Address-XDP-frags-having-negative-tailroom'
Jakub Kicinski [Thu, 5 Mar 2026 16:02:27 +0000 (08:02 -0800)] 
Merge branch 'Address-XDP-frags-having-negative-tailroom'

Larysa Zaremba says:

====================
Address XDP frags having negative tailroom

Aside from the issue described below, tailroom calculation does not account
for pages being split between frags, e.g. in i40e, enetc and
AF_XDP ZC with smaller chunks. This series addresses the problem by
calculating modulo (skb_frag_off() % rxq->frag_size) in order to get
data offset within a smaller block of memory. Please note, xskxceiver
tail grow test passes without modulo e.g. in xdpdrv mode on i40e,
because there are not enough descriptors to get to flipped buffers.
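A minimal sketch of the modulo step, assuming a 4096-byte page split into two 2048-byte chunks. The function names and the numbers are illustrative only and are not taken from the patches.

```c
#include <assert.h>

/*
 * Offset of the data inside its own chunk, given the offset within the
 * page and the chunk (frag) size. Hypothetical helper for illustration.
 */
unsigned int offset_in_chunk(unsigned int page_off, unsigned int frag_size)
{
	assert(frag_size != 0);
	return page_off % frag_size;
}

/* Tailroom left in the chunk that holds the fragment. */
int tailroom_in_chunk(unsigned int frag_size, unsigned int len,
		      unsigned int page_off)
{
	return (int)frag_size - (int)len -
	       (int)offset_in_chunk(page_off, frag_size);
}
```

For a 1000-byte fragment that starts 256 bytes into the second chunk (page offset 2304), the chunk-relative offset is 256 and the tailroom is 792 bytes. Using the raw page offset instead would give 2048 - 1000 - 2304, which is negative even though room is available.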

Many ethernet drivers report xdp Rx queue frag size as being the same as
DMA write size. However, the only user of this field, namely
bpf_xdp_frags_increase_tail(), clearly expects a truesize.

Such difference leads to unspecific memory corruption issues under certain
circumstances, e.g. in ixgbevf maximum DMA write size is 3 KB, so when
running xskxceiver's XDP_ADJUST_TAIL_GROW_MULTI_BUFF, 6K packet fully uses
all DMA-writable space in 2 buffers. This would be fine, if only
rxq->frag_size was properly set to 4K, but value of 3K results in a
negative tailroom, because there is a non-zero page offset.

We are supposed to return -EINVAL and be done with it in such case,
but due to tailroom being stored as an unsigned int, it is reported to be
somewhere near UINT_MAX, resulting in a tail being grown, even if the
requested offset is too much (it is around 2K in the abovementioned test).
This later leads to all kinds of unspecific calltraces.

[ 7340.337579] xskxceiver[1440]: segfault at 1da718 ip 00007f4161aeac9d sp 00007f41615a6a00 error 6
[ 7340.338040] xskxceiver[1441]: segfault at 7f410000000b ip 00000000004042b5 sp 00007f415bffecf0 error 4
[ 7340.338179]  in libc.so.6[61c9d,7f4161aaf000+160000]
[ 7340.339230]  in xskxceiver[42b5,400000+69000]
[ 7340.340300]  likely on CPU 6 (core 0, socket 6)
[ 7340.340302] Code: ff ff 01 e9 f4 fe ff ff 0f 1f 44 00 00 4c 39 f0 74 73 31 c0 ba 01 00 00 00 f0 0f b1 17 0f 85 ba 00 00 00 49 8b 87 88 00 00 00 <4c> 89 70 08 eb cc 0f 1f 44 00 00 48 8d bd f0 fe ff ff 89 85 ec fe
[ 7340.340888]  likely on CPU 3 (core 0, socket 3)
[ 7340.345088] Code: 00 00 00 ba 00 00 00 00 be 00 00 00 00 89 c7 e8 31 ca ff ff 89 45 ec 8b 45 ec 85 c0 78 07 b8 00 00 00 00 eb 46 e8 0b c8 ff ff <8b> 00 83 f8 69 74 24 e8 ff c7 ff ff 8b 00 83 f8 0b 74 18 e8 f3 c7
[ 7340.404334] Oops: general protection fault, probably for non-canonical address 0x6d255010bdffc: 0000 [#1] SMP NOPTI
[ 7340.405972] CPU: 7 UID: 0 PID: 1439 Comm: xskxceiver Not tainted 6.19.0-rc1+ #21 PREEMPT(lazy)
[ 7340.408006] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-5.fc42 04/01/2014
[ 7340.409716] RIP: 0010:lookup_swap_cgroup_id+0x44/0x80
[ 7340.410455] Code: 83 f8 1c 73 39 48 ba ff ff ff ff ff ff ff 03 48 8b 04 c5 20 55 fa bd 48 21 d1 48 89 ca 83 e1 01 48 d1 ea c1 e1 04 48 8d 04 90 <8b> 00 48 83 c4 10 d3 e8 c3 cc cc cc cc 31 c0 e9 98 b7 dd 00 48 89
[ 7340.412787] RSP: 0018:ffffcc5c04f7f6d0 EFLAGS: 00010202
[ 7340.413494] RAX: 0006d255010bdffc RBX: ffff891f477895a8 RCX: 0000000000000010
[ 7340.414431] RDX: 0001c17e3fffffff RSI: 00fa070000000000 RDI: 000382fc7fffffff
[ 7340.415354] RBP: 00fa070000000000 R08: ffffcc5c04f7f8f8 R09: ffffcc5c04f7f7d0
[ 7340.416283] R10: ffff891f4c1a7000 R11: ffffcc5c04f7f9c8 R12: ffffcc5c04f7f7d0
[ 7340.417218] R13: 03ffffffffffffff R14: 00fa06fffffffe00 R15: ffff891f47789500
[ 7340.418229] FS:  0000000000000000(0000) GS:ffff891ffdfaa000(0000) knlGS:0000000000000000
[ 7340.419489] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7340.420286] CR2: 00007f415bfffd58 CR3: 0000000103f03002 CR4: 0000000000772ef0
[ 7340.421237] PKRU: 55555554
[ 7340.421623] Call Trace:
[ 7340.421987]  <TASK>
[ 7340.422309]  ? softleaf_from_pte+0x77/0xa0
[ 7340.422855]  swap_pte_batch+0xa7/0x290
[ 7340.423363]  zap_nonpresent_ptes.constprop.0.isra.0+0xd1/0x270
[ 7340.424102]  zap_pte_range+0x281/0x580
[ 7340.424607]  zap_pmd_range.isra.0+0xc9/0x240
[ 7340.425177]  unmap_page_range+0x24d/0x420
[ 7340.425714]  unmap_vmas+0xa1/0x180
[ 7340.426185]  exit_mmap+0xe1/0x3b0
[ 7340.426644]  __mmput+0x41/0x150
[ 7340.427098]  exit_mm+0xb1/0x110
[ 7340.427539]  do_exit+0x1b2/0x460
[ 7340.427992]  do_group_exit+0x2d/0xc0
[ 7340.428477]  get_signal+0x79d/0x7e0
[ 7340.428957]  arch_do_signal_or_restart+0x34/0x100
[ 7340.429571]  exit_to_user_mode_loop+0x8e/0x4c0
[ 7340.430159]  do_syscall_64+0x188/0x6b0
[ 7340.430672]  ? __do_sys_clone3+0xd9/0x120
[ 7340.431212]  ? switch_fpu_return+0x4e/0xd0
[ 7340.431761]  ? arch_exit_to_user_mode_prepare.isra.0+0xa1/0xc0
[ 7340.432498]  ? do_syscall_64+0xbb/0x6b0
[ 7340.433015]  ? __handle_mm_fault+0x445/0x690
[ 7340.433582]  ? count_memcg_events+0xd6/0x210
[ 7340.434151]  ? handle_mm_fault+0x212/0x340
[ 7340.434697]  ? do_user_addr_fault+0x2b4/0x7b0
[ 7340.435271]  ? clear_bhb_loop+0x30/0x80
[ 7340.435788]  ? clear_bhb_loop+0x30/0x80
[ 7340.436299]  ? clear_bhb_loop+0x30/0x80
[ 7340.436812]  ? clear_bhb_loop+0x30/0x80
[ 7340.437323]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 7340.437973] RIP: 0033:0x7f4161b14169
[ 7340.438468] Code: Unable to access opcode bytes at 0x7f4161b1413f.
[ 7340.439242] RSP: 002b:00007ffc6ebfa770 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[ 7340.440173] RAX: fffffffffffffe00 RBX: 00000000000005a1 RCX: 00007f4161b14169
[ 7340.441061] RDX: 00000000000005a1 RSI: 0000000000000109 RDI: 00007f415bfff990
[ 7340.441943] RBP: 00007ffc6ebfa7a0 R08: 0000000000000000 R09: 00000000ffffffff
[ 7340.442824] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 7340.443707] R13: 0000000000000000 R14: 00007f415bfff990 R15: 00007f415bfff6c0
[ 7340.444586]  </TASK>
[ 7340.444922] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit libnvdimm kvm_intel vfat fat kvm snd_pcm irqbypass rapl iTCO_wdt snd_timer intel_pmc_bxt iTCO_vendor_support snd ixgbevf virtio_net soundcore i2c_i801 pcspkr libeth_xdp net_failover i2c_smbus lpc_ich failover libeth virtio_balloon joydev 9p fuse loop zram lz4hc_compress lz4_compress 9pnet_virtio 9pnet netfs ghash_clmulni_intel serio_raw qemu_fw_cfg
[ 7340.449650] ---[ end trace 0000000000000000 ]---

The issue can be fixed in all in-tree drivers, but we cannot just trust OOT
drivers to not do this. Therefore, make tailroom a signed int and produce a
warning when it is negative to prevent such mistakes in the future.

The issue can also be easily reproduced with ice driver, by applying
the following diff to xskxceiver and enjoying a kernel panic in xdpdrv mode:

diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
index 5af28f359cfd..042d587fa7ef 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
@@ -2541,8 +2541,8 @@ int testapp_adjust_tail_grow_mb(struct test_spec *test)
 {
        test->mtu = MAX_ETH_JUMBO_SIZE;
        /* Grow by (frag_size - last_frag_Size) - 1 to stay inside the last fragment */
-       return testapp_adjust_tail(test, (XSK_UMEM__MAX_FRAME_SIZE / 2) - 1,
-                                  XSK_UMEM__LARGE_FRAME_SIZE * 2);
+       return testapp_adjust_tail(test, XSK_UMEM__MAX_FRAME_SIZE * 100,
+                                  6912);
 }

 int testapp_tx_queue_consumer(struct test_spec *test)

If we print out the values involved in the tailroom calculation:

tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);

4294967040 = 3456 - 3456 - 256
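The wrap-around in the line above can be reproduced in userspace. This is a sketch only: the two functions are hypothetical and just contrast unsigned and signed evaluation of the same expression, assuming a 32-bit unsigned int.

```c
#include <assert.h>

static_assert(sizeof(unsigned int) == 4,
	      "the example assumes a 32-bit unsigned int");

/* Unsigned arithmetic: an underflow wraps to a value near UINT_MAX. */
unsigned int tailroom_as_unsigned(unsigned int frag_size, unsigned int len,
				  unsigned int off)
{
	return frag_size - len - off;
}

/* Signed result: the same inputs yield a negative, detectable value. */
int tailroom_as_signed(unsigned int frag_size, unsigned int len,
		       unsigned int off)
{
	return (int)frag_size - (int)len - (int)off;
}
```

With frag_size 3456, fragment size 3456 and offset 256, the unsigned version returns 4294967040 while the signed version returns -256, which a caller can reject before growing the tail.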

I personally reproduced and verified the issue in ice and i40e,
aside from WiP ixgbevf implementation.
====================

Link: https://patch.msgid.link/20260305111253.2317394-1-larysa.zaremba@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoxdp: produce a warning when calculated tailroom is negative
Larysa Zaremba [Thu, 5 Mar 2026 11:12:50 +0000 (12:12 +0100)] 
xdp: produce a warning when calculated tailroom is negative

Many ethernet drivers report xdp Rx queue frag size as being the same as
DMA write size. However, the only user of this field, namely
bpf_xdp_frags_increase_tail(), clearly expects a truesize.

Such difference leads to unspecific memory corruption issues under certain
circumstances, e.g. in ixgbevf maximum DMA write size is 3 KB, so when
running xskxceiver's XDP_ADJUST_TAIL_GROW_MULTI_BUFF, 6K packet fully uses
all DMA-writable space in 2 buffers. This would be fine, if only
rxq->frag_size was properly set to 4K, but value of 3K results in a
negative tailroom, because there is a non-zero page offset.

We are supposed to return -EINVAL and be done with it in such case, but due
to tailroom being stored as an unsigned int, it is reported to be somewhere
near UINT_MAX, resulting in a tail being grown, even if the requested
offset is too much (it is around 2K in the abovementioned test). This later
leads to all kinds of unspecific calltraces.

[ 7340.337579] xskxceiver[1440]: segfault at 1da718 ip 00007f4161aeac9d sp 00007f41615a6a00 error 6
[ 7340.338040] xskxceiver[1441]: segfault at 7f410000000b ip 00000000004042b5 sp 00007f415bffecf0 error 4
[ 7340.338179]  in libc.so.6[61c9d,7f4161aaf000+160000]
[ 7340.339230]  in xskxceiver[42b5,400000+69000]
[ 7340.340300]  likely on CPU 6 (core 0, socket 6)
[ 7340.340302] Code: ff ff 01 e9 f4 fe ff ff 0f 1f 44 00 00 4c 39 f0 74 73 31 c0 ba 01 00 00 00 f0 0f b1 17 0f 85 ba 00 00 00 49 8b 87 88 00 00 00 <4c> 89 70 08 eb cc 0f 1f 44 00 00 48 8d bd f0 fe ff ff 89 85 ec fe
[ 7340.340888]  likely on CPU 3 (core 0, socket 3)
[ 7340.345088] Code: 00 00 00 ba 00 00 00 00 be 00 00 00 00 89 c7 e8 31 ca ff ff 89 45 ec 8b 45 ec 85 c0 78 07 b8 00 00 00 00 eb 46 e8 0b c8 ff ff <8b> 00 83 f8 69 74 24 e8 ff c7 ff ff 8b 00 83 f8 0b 74 18 e8 f3 c7
[ 7340.404334] Oops: general protection fault, probably for non-canonical address 0x6d255010bdffc: 0000 [#1] SMP NOPTI
[ 7340.405972] CPU: 7 UID: 0 PID: 1439 Comm: xskxceiver Not tainted 6.19.0-rc1+ #21 PREEMPT(lazy)
[ 7340.408006] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-5.fc42 04/01/2014
[ 7340.409716] RIP: 0010:lookup_swap_cgroup_id+0x44/0x80
[ 7340.410455] Code: 83 f8 1c 73 39 48 ba ff ff ff ff ff ff ff 03 48 8b 04 c5 20 55 fa bd 48 21 d1 48 89 ca 83 e1 01 48 d1 ea c1 e1 04 48 8d 04 90 <8b> 00 48 83 c4 10 d3 e8 c3 cc cc cc cc 31 c0 e9 98 b7 dd 00 48 89
[ 7340.412787] RSP: 0018:ffffcc5c04f7f6d0 EFLAGS: 00010202
[ 7340.413494] RAX: 0006d255010bdffc RBX: ffff891f477895a8 RCX: 0000000000000010
[ 7340.414431] RDX: 0001c17e3fffffff RSI: 00fa070000000000 RDI: 000382fc7fffffff
[ 7340.415354] RBP: 00fa070000000000 R08: ffffcc5c04f7f8f8 R09: ffffcc5c04f7f7d0
[ 7340.416283] R10: ffff891f4c1a7000 R11: ffffcc5c04f7f9c8 R12: ffffcc5c04f7f7d0
[ 7340.417218] R13: 03ffffffffffffff R14: 00fa06fffffffe00 R15: ffff891f47789500
[ 7340.418229] FS:  0000000000000000(0000) GS:ffff891ffdfaa000(0000) knlGS:0000000000000000
[ 7340.419489] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7340.420286] CR2: 00007f415bfffd58 CR3: 0000000103f03002 CR4: 0000000000772ef0
[ 7340.421237] PKRU: 55555554
[ 7340.421623] Call Trace:
[ 7340.421987]  <TASK>
[ 7340.422309]  ? softleaf_from_pte+0x77/0xa0
[ 7340.422855]  swap_pte_batch+0xa7/0x290
[ 7340.423363]  zap_nonpresent_ptes.constprop.0.isra.0+0xd1/0x270
[ 7340.424102]  zap_pte_range+0x281/0x580
[ 7340.424607]  zap_pmd_range.isra.0+0xc9/0x240
[ 7340.425177]  unmap_page_range+0x24d/0x420
[ 7340.425714]  unmap_vmas+0xa1/0x180
[ 7340.426185]  exit_mmap+0xe1/0x3b0
[ 7340.426644]  __mmput+0x41/0x150
[ 7340.427098]  exit_mm+0xb1/0x110
[ 7340.427539]  do_exit+0x1b2/0x460
[ 7340.427992]  do_group_exit+0x2d/0xc0
[ 7340.428477]  get_signal+0x79d/0x7e0
[ 7340.428957]  arch_do_signal_or_restart+0x34/0x100
[ 7340.429571]  exit_to_user_mode_loop+0x8e/0x4c0
[ 7340.430159]  do_syscall_64+0x188/0x6b0
[ 7340.430672]  ? __do_sys_clone3+0xd9/0x120
[ 7340.431212]  ? switch_fpu_return+0x4e/0xd0
[ 7340.431761]  ? arch_exit_to_user_mode_prepare.isra.0+0xa1/0xc0
[ 7340.432498]  ? do_syscall_64+0xbb/0x6b0
[ 7340.433015]  ? __handle_mm_fault+0x445/0x690
[ 7340.433582]  ? count_memcg_events+0xd6/0x210
[ 7340.434151]  ? handle_mm_fault+0x212/0x340
[ 7340.434697]  ? do_user_addr_fault+0x2b4/0x7b0
[ 7340.435271]  ? clear_bhb_loop+0x30/0x80
[ 7340.435788]  ? clear_bhb_loop+0x30/0x80
[ 7340.436299]  ? clear_bhb_loop+0x30/0x80
[ 7340.436812]  ? clear_bhb_loop+0x30/0x80
[ 7340.437323]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 7340.437973] RIP: 0033:0x7f4161b14169
[ 7340.438468] Code: Unable to access opcode bytes at 0x7f4161b1413f.
[ 7340.439242] RSP: 002b:00007ffc6ebfa770 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[ 7340.440173] RAX: fffffffffffffe00 RBX: 00000000000005a1 RCX: 00007f4161b14169
[ 7340.441061] RDX: 00000000000005a1 RSI: 0000000000000109 RDI: 00007f415bfff990
[ 7340.441943] RBP: 00007ffc6ebfa7a0 R08: 0000000000000000 R09: 00000000ffffffff
[ 7340.442824] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 7340.443707] R13: 0000000000000000 R14: 00007f415bfff990 R15: 00007f415bfff6c0
[ 7340.444586]  </TASK>
[ 7340.444922] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit libnvdimm kvm_intel vfat fat kvm snd_pcm irqbypass rapl iTCO_wdt snd_timer intel_pmc_bxt iTCO_vendor_support snd ixgbevf virtio_net soundcore i2c_i801 pcspkr libeth_xdp net_failover i2c_smbus lpc_ich failover libeth virtio_balloon joydev 9p fuse loop zram lz4hc_compress lz4_compress 9pnet_virtio 9pnet netfs ghash_clmulni_intel serio_raw qemu_fw_cfg
[ 7340.449650] ---[ end trace 0000000000000000 ]---

The issue can be fixed in all in-tree drivers, but we cannot just trust OOT
drivers to not do this. Therefore, make tailroom a signed int and produce a
warning when it is negative to prevent such mistakes in the future.
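
As a rough illustration of why the signedness matters, here is a minimal user-space sketch. It is not the kernel code: struct frag_desc, can_grow_unsigned() and can_grow_signed() are hypothetical names, and the fields only approximate what the tail-growing path looks at.

```c
#include <stdio.h>

/* Hypothetical stand-in for the values involved; not a kernel struct. */
struct frag_desc {
	unsigned int frag_size; /* value the driver registered in RxQ info */
	unsigned int len;       /* bytes of data already in the fragment */
	unsigned int off;       /* data offset inside the fragment's portion */
};

/*
 * Unsigned arithmetic: a tailroom that should be negative wraps around to
 * a huge value, so any growth request passes the bounds check.
 */
static int can_grow_unsigned(const struct frag_desc *f, unsigned int grow)
{
	unsigned int tailroom = f->frag_size - f->len - f->off;

	return grow <= tailroom;
}

/*
 * Signed arithmetic: the misconfiguration becomes visible, so the sketch
 * warns and refuses to grow.
 */
static int can_grow_signed(const struct frag_desc *f, unsigned int grow)
{
	int tailroom = (int)f->frag_size - (int)f->len - (int)f->off;

	if (tailroom < 0) {
		fprintf(stderr, "negative tailroom %d: frag_size too small\n",
			tailroom);
		return 0;
	}
	return grow <= (unsigned int)tailroom;
}
```

With frag_size = 2048, len = 3000 and off = 256, the unsigned variant accepts a growth request of 100000 bytes, while the signed variant rejects it and prints a warning.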

Fixes: bf25146a5595 ("bpf: add frags support to the bpf_xdp_adjust_tail() API")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://patch.msgid.link/20260305111253.2317394-10-larysa.zaremba@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago net: enetc: use truesize as XDP RxQ info frag_size
Larysa Zaremba [Thu, 5 Mar 2026 11:12:49 +0000 (12:12 +0100)] 
net: enetc: use truesize as XDP RxQ info frag_size

The only user of frag_size field in XDP RxQ info is
bpf_xdp_frags_increase_tail(). It clearly expects truesize instead of DMA
write size. Different assumptions in enetc driver configuration lead to
negative tailroom.

Set frag_size to the same value as frame_sz.

Fixes: 2768b2e2f7d2 ("net: enetc: register XDP RX queues with frag_size")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://patch.msgid.link/20260305111253.2317394-9-larysa.zaremba@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago libeth, idpf: use truesize as XDP RxQ info frag_size
Larysa Zaremba [Thu, 5 Mar 2026 11:12:48 +0000 (12:12 +0100)] 
libeth, idpf: use truesize as XDP RxQ info frag_size

The only user of frag_size field in XDP RxQ info is
bpf_xdp_frags_increase_tail(). It clearly expects whole buffer size instead
of DMA write size. Different assumptions in idpf driver configuration lead
to negative tailroom.

To make it worse, buffer sizes are not actually uniform in idpf when
splitq is enabled, as there are several buffer queues, so rxq->rx_buf_size
is meaningless in this case.

Use truesize of the first bufq in AF_XDP ZC, as there is only one. Disable
growing tail for regular splitq.

Fixes: ac8a861f632e ("idpf: prepare structures to support XDP")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://patch.msgid.link/20260305111253.2317394-8-larysa.zaremba@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago i40e: use xdp.frame_sz as XDP RxQ info frag_size
Larysa Zaremba [Thu, 5 Mar 2026 11:12:47 +0000 (12:12 +0100)] 
i40e: use xdp.frame_sz as XDP RxQ info frag_size

The only user of frag_size field in XDP RxQ info is
bpf_xdp_frags_increase_tail(). It clearly expects whole buffer size instead
of DMA write size. Different assumptions in i40e driver configuration lead
to negative tailroom.

Set frag_size to the same value as frame_sz in shared pages mode, use new
helper to set frag_size when AF_XDP ZC is active.

Fixes: a045d2f2d03d ("i40e: set xdp_rxq_info::frag_size")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://patch.msgid.link/20260305111253.2317394-7-larysa.zaremba@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago i40e: fix registering XDP RxQ info
Larysa Zaremba [Thu, 5 Mar 2026 11:12:46 +0000 (12:12 +0100)] 
i40e: fix registering XDP RxQ info

The current way of handling XDP RxQ info in i40e has a problem: frag_size
is not updated when the xsk_buff_pool is detached or when the MTU is
changed. This leads to tail growth always failing for multi-buffer packets.

Couple XDP RxQ info registering with buffer allocations and unregistering
with cleaning the ring.

Fixes: a045d2f2d03d ("i40e: set xdp_rxq_info::frag_size")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://patch.msgid.link/20260305111253.2317394-6-larysa.zaremba@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago ice: change XDP RxQ frag_size from DMA write length to xdp.frame_sz
Larysa Zaremba [Thu, 5 Mar 2026 11:12:45 +0000 (12:12 +0100)] 
ice: change XDP RxQ frag_size from DMA write length to xdp.frame_sz

The only user of frag_size field in XDP RxQ info is
bpf_xdp_frags_increase_tail(). It clearly expects whole buffer size instead
of DMA write size. Different assumptions in ice driver configuration lead
to negative tailroom.

This makes it possible to trigger a kernel panic by running the
XDP_ADJUST_TAIL_GROW_MULTI_BUFF xskxceiver test with the packet size changed
to 6912 and the requested offset set to a huge value, e.g.
XSK_UMEM__MAX_FRAME_SIZE * 100.

Due to other quirks of the ZC configuration in ice, panic is not observed
in ZC mode, but tailroom growing still fails when it should not.

Use fill queue buffer truesize instead of DMA write size in XDP RxQ info.
Fix ZC mode too by using the new helper.

Fixes: 2fba7dc5157b ("ice: Add support for XDP multi-buffer on Rx side")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://patch.msgid.link/20260305111253.2317394-5-larysa.zaremba@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago ice: fix rxq info registering in mbuf packets
Larysa Zaremba [Thu, 5 Mar 2026 11:12:44 +0000 (12:12 +0100)] 
ice: fix rxq info registering in mbuf packets

XDP RxQ info contains frag_size, which depends on the MTU. This makes the
old way of registering RxQ info before calculating new buffer sizes
invalid. Currently, it leads to frag_size being outdated, making it
sometimes impossible to grow tailroom in an mbuf packet. E.g. fragments are
actually 3K+, but frag_size is still set as if the MTU were 1500.

Always register new XDP RxQ info after reconfiguring memory pools.

Fixes: 2fba7dc5157b ("ice: Add support for XDP multi-buffer on Rx side")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://patch.msgid.link/20260305111253.2317394-4-larysa.zaremba@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago xsk: introduce helper to determine rxq->frag_size
Larysa Zaremba [Thu, 5 Mar 2026 11:12:43 +0000 (12:12 +0100)] 
xsk: introduce helper to determine rxq->frag_size

rxq->frag_size is basically a step between consecutive strictly aligned
frames. In ZC mode, chunk size fits exactly, but if chunks are unaligned,
there is no safe way to determine accessible space to grow tailroom.

Report frag_size to be zero, if chunks are unaligned, chunk_size otherwise.
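
The rule above can be sketched in a few lines of user-space C. The struct pool_sketch type and the rx_frag_step() name are hypothetical and only model the two properties mentioned in this message. The sketch also assumes that a frag_size of zero makes the tail-growing path refuse the request.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of an XSK buffer pool. */
struct pool_sketch {
	uint32_t chunk_size; /* size of one UMEM chunk */
	bool unaligned;      /* true when chunks may start at arbitrary offsets */
};

/*
 * Value to report as rxq->frag_size: the chunk size for aligned chunks,
 * zero when chunks are unaligned and no safe step can be determined.
 */
static uint32_t rx_frag_step(const struct pool_sketch *pool)
{
	return pool->unaligned ? 0 : pool->chunk_size;
}
```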

Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://patch.msgid.link/20260305111253.2317394-3-larysa.zaremba@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago xdp: use modulo operation to calculate XDP frag tailroom
Larysa Zaremba [Thu, 5 Mar 2026 11:12:42 +0000 (12:12 +0100)] 
xdp: use modulo operation to calculate XDP frag tailroom

The current formula for calculating XDP tailroom in mbuf packets works only
if each frag has its own page (i.e. if rxq->frag_size is PAGE_SIZE). This
defeats the purpose of the parameter and, if shared pages are used, silently
leads to a negative calculated tailroom on at least half of the frags.

There are not many drivers that set rxq->frag_size. Among them:
* i40e and enetc always split page uniformly between frags, use shared
  pages
* ice uses page_pool frags via libeth, those are power-of-2 and uniformly
  distributed across page
* idpf has variable frag_size with XDP on, so current API is not applicable
* mlx5, mtk and mvneta use PAGE_SIZE or 0 as frag_size for page_pool

As for AF_XDP ZC, only ice, i40e and idpf declare frag_size for it. Modulo
operation yields good results for aligned chunks, they are all power-of-2,
between 2K and PAGE_SIZE. Formula without modulo fails when chunk_size is
2K. Buffers in unaligned mode are not distributed uniformly, so modulo
operation would not work.

To accommodate unaligned buffers, we could define frag_size as
data + tailroom and hence not subtract the offset when calculating
tailroom, but this would necessitate more changes in the drivers.

Define rxq->frag_size as an even portion of a page that fully belongs to a
single frag. When calculating tailroom, locate the data start within such
portion by performing a modulo operation on page offset.
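
To make the difference concrete, the following user-space sketch compares the two calculations. It is only an illustration: struct frag_sketch, tailroom_old() and tailroom_modulo() are hypothetical names, and the caller is assumed to have rejected a frag_size of zero beforehand.

```c
/* Hypothetical fragment description; field names are illustrative. */
struct frag_sketch {
	unsigned int page_off; /* offset of the data start within the page */
	unsigned int len;      /* bytes of data in the fragment */
};

/* Formula without modulo: only correct when each frag owns a whole page. */
static int tailroom_old(unsigned int frag_size, const struct frag_sketch *f)
{
	return (int)frag_size - (int)f->len - (int)f->page_off;
}

/*
 * Formula with modulo: first locate the data start inside its
 * frag_size-sized portion of the page, then subtract.
 * frag_size must be nonzero.
 */
static int tailroom_modulo(unsigned int frag_size, const struct frag_sketch *f)
{
	return (int)frag_size - (int)f->len - (int)(f->page_off % frag_size);
}
```

For a 2K frag_size and a 1000-byte fragment whose data starts 256 bytes into the second half of a 4K page (page offset 2304), the old formula gives 2048 - 1000 - 2304 = -1256, while the modulo variant gives 2048 - 1000 - 256 = 792. For a fragment in the first half of the page both formulas agree.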

Fixes: bf25146a5595 ("bpf: add frags support to the bpf_xdp_adjust_tail() API")
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://patch.msgid.link/20260305111253.2317394-2-larysa.zaremba@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago selftests/tc-testing: Add tests exercising act_ife metalist replace behaviour
Victor Nogueira [Wed, 4 Mar 2026 14:06:03 +0000 (09:06 -0500)] 
selftests/tc-testing: Add tests exercising act_ife metalist replace behaviour

Add 2 test cases to exercise the fix in act_ife's internal metalist
behaviour.

- Update decode ife action into encode with tcindex metadata
- Update decode ife action into encode with multiple metadata

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260304140603.76500-2-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago net/sched: act_ife: Fix metalist update behavior
Jamal Hadi Salim [Wed, 4 Mar 2026 14:06:02 +0000 (09:06 -0500)] 
net/sched: act_ife: Fix metalist update behavior

Whenever an ife action replace changes the metalist, the current ife code
appends the new metadata instead of replacing the old data on the metalist.
Aside from being inappropriate behavior, this may lead to an unbounded
addition of metadata to the metalist, which might cause an out-of-bounds
error when running the encode op:

[  138.423369][    C1] ==================================================================
[  138.424317][    C1] BUG: KASAN: slab-out-of-bounds in ife_tlv_meta_encode (net/ife/ife.c:168)
[  138.424906][    C1] Write of size 4 at addr ffff8880077f4ffe by task ife_out_out_bou/255
[  138.425778][    C1] CPU: 1 UID: 0 PID: 255 Comm: ife_out_out_bou Not tainted 7.0.0-rc1-00169-gfbdfa8da05b6 #624 PREEMPT(full)
[  138.425795][    C1] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  138.425800][    C1] Call Trace:
[  138.425804][    C1]  <IRQ>
[  138.425808][    C1]  dump_stack_lvl (lib/dump_stack.c:122)
[  138.425828][    C1]  print_report (mm/kasan/report.c:379 mm/kasan/report.c:482)
[  138.425839][    C1]  ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221)
[  138.425844][    C1]  ? __virt_addr_valid (./arch/x86/include/asm/preempt.h:95 (discriminator 1) ./include/linux/rcupdate.h:975 (discriminator 1) ./include/linux/mmzone.h:2207 (discriminator 1) arch/x86/mm/physaddr.c:54 (discriminator 1))
[  138.425853][    C1]  ? ife_tlv_meta_encode (net/ife/ife.c:168)
[  138.425859][    C1]  kasan_report (mm/kasan/report.c:221 mm/kasan/report.c:597)
[  138.425868][    C1]  ? ife_tlv_meta_encode (net/ife/ife.c:168)
[  138.425878][    C1]  kasan_check_range (mm/kasan/generic.c:186 (discriminator 1) mm/kasan/generic.c:200 (discriminator 1))
[  138.425884][    C1]  __asan_memset (mm/kasan/shadow.c:84 (discriminator 2))
[  138.425889][    C1]  ife_tlv_meta_encode (net/ife/ife.c:168)
[  138.425893][    C1]  ? ife_tlv_meta_encode (net/ife/ife.c:171)
[  138.425898][    C1]  ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221)
[  138.425903][    C1]  ife_encode_meta_u16 (net/sched/act_ife.c:57)
[  138.425910][    C1]  ? __pfx_do_raw_spin_lock (kernel/locking/spinlock_debug.c:114)
[  138.425916][    C1]  ? __asan_memcpy (mm/kasan/shadow.c:105 (discriminator 3))
[  138.425921][    C1]  ? __pfx_ife_encode_meta_u16 (net/sched/act_ife.c:45)
[  138.425927][    C1]  ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221)
[  138.425931][    C1]  tcf_ife_act (net/sched/act_ife.c:847 net/sched/act_ife.c:879)

To solve this issue, fix the replace behavior by adding the metalist to
the ife rcu data structure.
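
The idea of tying the metalist to the replaceable parameter block can be sketched in user space as follows. This is not the act_ife code: all type and function names are hypothetical, the list is modelled as a small array, and a plain pointer swap with an immediate free stands in for the RCU pointer update and deferred free that kernel code would need.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_META 8 /* illustrative bound, not a kernel constant */

/* Hypothetical parameter block: the metadata list lives inside it. */
struct ife_params_sketch {
	int nmeta;
	int meta_ids[MAX_META];
};

struct ife_action_sketch {
	struct ife_params_sketch *params; /* an RCU pointer in kernel terms */
};

/*
 * Replace: build a fresh parameter block that holds only the new metadata,
 * publish it, then release the old block. Nothing is appended to the old
 * list, so the amount of metadata cannot grow across replaces.
 * Returns 0 on success, -1 on error.
 */
static int ife_replace(struct ife_action_sketch *act, const int *ids, int n)
{
	struct ife_params_sketch *new_p, *old_p;

	if (n < 0 || n > MAX_META)
		return -1;

	new_p = calloc(1, sizeof(*new_p));
	if (!new_p)
		return -1;

	new_p->nmeta = n;
	if (n > 0)
		memcpy(new_p->meta_ids, ids, (size_t)n * sizeof(*ids));

	old_p = act->params;
	act->params = new_p; /* stand-in for an RCU pointer update */
	free(old_p);         /* stand-in for a free deferred past readers */
	return 0;
}
```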

Fixes: aa9fd9a325d51 ("sched: act: ife: update parameters via rcu handling")
Reported-by: Ruitong Liu <cnitlrt@gmail.com>
Tested-by: Ruitong Liu <cnitlrt@gmail.com>
Co-developed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260304140603.76500-1-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago Merge branch 'net-ipv6-fix-panic-when-ipv4-route-references-loopback-ipv6-nexthop...
Jakub Kicinski [Thu, 5 Mar 2026 15:53:19 +0000 (07:53 -0800)] 
Merge branch 'net-ipv6-fix-panic-when-ipv4-route-references-loopback-ipv6-nexthop-and-add-selftest'

Jiayuan Chen says:

====================
net: ipv6: fix panic when IPv4 route references loopback IPv6 nexthop and add selftest

syzbot reported a kernel panic [1] when an IPv4 route references
a loopback IPv6 nexthop object:

BUG: unable to handle page fault for address: ffff8d069e7aa000
PF: supervisor read access in kernel mode
PF: error_code(0x0000) - not-present page
PGD 6aa01067 P4D 6aa01067 PUD 0
Oops: Oops: 0000 [#1] SMP PTI
CPU: 2 UID: 0 PID: 530 Comm: ping Not tainted 6.19.0+ #193 PREEMPT
RIP: 0010:ip_route_output_key_hash_rcu+0x578/0x9e0
RSP: 0018:ffffd2ffc1573918 EFLAGS: 00010286
RAX: ffff8d069e7aa000 RBX: ffffd2ffc1573988 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffd2ffc1573978 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d060d496000
R13: 0000000000000000 R14: ffff8d060399a600 R15: ffff8d06019a6ab8
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8d069e7aa000 CR3: 0000000106eb0001 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
 <TASK>
 ip_route_output_key_hash+0x86/0x1a0
 __ip4_datagram_connect+0x2b5/0x4e0
 udp_connect+0x2c/0x60
 inet_dgram_connect+0x88/0xd0
 __sys_connect_file+0x56/0x90
 __sys_connect+0xa8/0xe0
 __x64_sys_connect+0x18/0x30
 x64_sys_call+0xfb9/0x26e0
 do_syscall_64+0xd3/0x1510
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Reproduction:

    ip -6 nexthop add id 100 dev lo
    ip route add 172.20.20.0/24 nhid 100
    ping -c1 172.20.20.1     # kernel crash

Problem Description

When a standalone IPv6 nexthop object is created with a loopback device,
fib6_nh_init() misclassifies it as a reject route. Nexthop objects have
no destination prefix (fc_dst=::), so fib6_is_reject() always matches
any loopback nexthop. The reject path skips fib_nh_common_init(), leaving
nhc_pcpu_rth_output unallocated. When an IPv4 route later references
this nexthop and triggers a route lookup, __mkroute_output() calls
raw_cpu_ptr(nhc->nhc_pcpu_rth_output) on a NULL pointer, causing a page
fault.

The reject classification was designed for regular IPv6 routes to prevent
kernel routing loops, but nexthop objects should not be subject to this
check since they carry no destination information. Loop prevention is
handled separately when the route itself is created.
[1] https://syzkaller.appspot.com/bug?extid=334190e097a98a1b81bb
====================

Link: https://patch.msgid.link/20260304113817.294966-1-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago selftests: net: add test for IPv4 route with loopback IPv6 nexthop
Jiayuan Chen [Wed, 4 Mar 2026 11:38:14 +0000 (19:38 +0800)] 
selftests: net: add test for IPv4 route with loopback IPv6 nexthop

Add a regression test for a kernel panic that occurs when an IPv4 route
references an IPv6 nexthop object created on the loopback device.

The test creates an IPv6 nexthop on lo, binds an IPv4 route to it, then
triggers a route lookup via ping to verify the kernel does not crash.

  ./fib_nexthops.sh
  Tests passed: 249
  Tests failed:   0

Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260304113817.294966-3-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago net: ipv6: fix panic when IPv4 route references loopback IPv6 nexthop
Jiayuan Chen [Wed, 4 Mar 2026 11:38:13 +0000 (19:38 +0800)] 
net: ipv6: fix panic when IPv4 route references loopback IPv6 nexthop

When a standalone IPv6 nexthop object is created with a loopback device
(e.g., "ip -6 nexthop add id 100 dev lo"), fib6_nh_init() misclassifies
it as a reject route. This is because nexthop objects have no destination
prefix (fc_dst=::), causing fib6_is_reject() to match any loopback
nexthop. The reject path skips fib_nh_common_init(), leaving
nhc_pcpu_rth_output unallocated. If an IPv4 route later references this
nexthop, __mkroute_output() dereferences NULL nhc_pcpu_rth_output and
panics.

Simplify the check in fib6_nh_init() to only match explicit reject
routes (RTF_REJECT) instead of using fib6_is_reject(). The loopback
promotion heuristic in fib6_is_reject() is handled separately by
ip6_route_info_create_nh(). After this change, the three cases behave
as follows:

1. Explicit reject route ("ip -6 route add unreachable 2001:db8::/64"):
   RTF_REJECT is set, enters reject path, skips fib_nh_common_init().
   No behavior change.

2. Implicit loopback reject route ("ip -6 route add 2001:db8::/32 dev lo"):
   RTF_REJECT is not set, takes normal path, fib_nh_common_init() is
   called. ip6_route_info_create_nh() still promotes it to reject
   afterward. nhc_pcpu_rth_output is allocated but unused, which is
   harmless.

3. Standalone nexthop object ("ip -6 nexthop add id 100 dev lo"):
   RTF_REJECT is not set, takes normal path, fib_nh_common_init() is
   called. nhc_pcpu_rth_output is properly allocated, fixing the crash
   when IPv4 routes reference this nexthop.
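
The three cases can be modelled with a small user-space sketch. Everything here is a simplification under stated assumptions: the struct, the SK_RTF_REJECT stand-in flag and both functions are hypothetical, and the old loopback heuristic is reduced to "loopback device and destination is not a loopback address", ignoring any other flags the real fib6_is_reject() may consider.

```c
#include <stdbool.h>

#define SK_RTF_REJECT 0x1u /* stand-in flag; not the kernel's RTF_REJECT value */

/* Hypothetical, reduced view of the inputs to the decision. */
struct nh_cfg_sketch {
	unsigned int flags;        /* route flags requested by the user */
	bool dev_is_loopback;      /* nexthop device is lo */
	bool dst_is_loopback_addr; /* false for ::, i.e. for nexthop objects */
};

/* Old decision (simplified): explicit reject or the loopback heuristic. */
static bool takes_reject_path_old(const struct nh_cfg_sketch *c)
{
	return (c->flags & SK_RTF_REJECT) ||
	       (c->dev_is_loopback && !c->dst_is_loopback_addr);
}

/* New decision: only an explicit reject flag skips the common init. */
static bool takes_reject_path_new(const struct nh_cfg_sketch *c)
{
	return (c->flags & SK_RTF_REJECT) != 0;
}
```

In this model, case 1 takes the reject path under both decisions, while cases 2 and 3 take it only under the old one.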

Suggested-by: Ido Schimmel <idosch@nvidia.com>
Fixes: 493ced1ac47c ("ipv4: Allow routes to use nexthop objects")
Reported-by: syzbot+334190e097a98a1b81bb@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/698f8482.a70a0220.2c38d7.00ca.GAE@google.com/T/
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260304113817.294966-2-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago net: vxlan: fix nd_tbl NULL dereference when IPv6 is disabled
Fernando Fernandez Mancera [Wed, 4 Mar 2026 12:03:57 +0000 (13:03 +0100)] 
net: vxlan: fix nd_tbl NULL dereference when IPv6 is disabled

When booting with the 'ipv6.disable=1' parameter, the nd_tbl is never
initialized because inet6_init() exits before calling ndisc_init(), which
initializes it. If an IPv6 packet is injected into the interface,
route_shortcircuit() is called and a NULL pointer dereference happens in
neigh_lookup().

 BUG: kernel NULL pointer dereference, address: 0000000000000380
 Oops: Oops: 0000 [#1] SMP NOPTI
 [...]
 RIP: 0010:neigh_lookup+0x20/0x270
 [...]
 Call Trace:
  <TASK>
  vxlan_xmit+0x638/0x1ef0 [vxlan]
  dev_hard_start_xmit+0x9e/0x2e0
  __dev_queue_xmit+0xbee/0x14e0
  packet_sendmsg+0x116f/0x1930
  __sys_sendto+0x1f5/0x200
  __x64_sys_sendto+0x24/0x30
  do_syscall_64+0x12f/0x1590
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fix this by adding an early check on route_shortcircuit() when protocol
is ETH_P_IPV6. Note that ipv6_mod_enabled() cannot be used here because
VXLAN can be built-in even when IPv6 is built as a module.
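
The shape of such an early check can be sketched in user space. This message does not say what exactly the check tests, so the sketch assumes it looks at whether the neighbour table pointer was ever set up. The type, the SK_ constants' names and may_shortcircuit() are hypothetical, although 0x0800 and 0x86DD are the standard EtherType values for IPv4 and IPv6.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_ETH_P_IP   0x0800 /* EtherType for IPv4 */
#define SK_ETH_P_IPV6 0x86DD /* EtherType for IPv6 */

/* Hypothetical placeholder for a neighbour table. */
struct neigh_table_sketch {
	int unused;
};

/*
 * Decide whether the short-circuit path may run for this packet.
 * A NULL nd_tbl models "IPv6 disabled at boot, table never initialised".
 */
static bool may_shortcircuit(uint16_t proto,
			     const struct neigh_table_sketch *nd_tbl)
{
	if (proto == SK_ETH_P_IPV6 && nd_tbl == NULL)
		return false; /* bail out before any neighbour lookup */

	return proto == SK_ETH_P_IP || proto == SK_ETH_P_IPV6;
}
```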

Fixes: e15a00aafa4b ("vxlan: add ipv6 route short circuit support")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20260304120357.9778-2-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago net: bridge: fix nd_tbl NULL dereference when IPv6 is disabled
Fernando Fernandez Mancera [Wed, 4 Mar 2026 12:03:56 +0000 (13:03 +0100)] 
net: bridge: fix nd_tbl NULL dereference when IPv6 is disabled

When booting with the 'ipv6.disable=1' parameter, the nd_tbl is never
initialized because inet6_init() exits before calling ndisc_init(), which
initializes it. Then, if neigh_suppress is enabled and an ICMPv6
Neighbor Discovery packet reaches the bridge, br_do_suppress_nd() will
dereference ipv6_stub->nd_tbl which is NULL, passing it to
neigh_lookup(). This causes a kernel NULL pointer dereference.

 BUG: kernel NULL pointer dereference, address: 0000000000000268
 Oops: 0000 [#1] PREEMPT SMP NOPTI
 [...]
 RIP: 0010:neigh_lookup+0x16/0xe0
 [...]
 Call Trace:
  <IRQ>
  ? neigh_lookup+0x16/0xe0
  br_do_suppress_nd+0x160/0x290 [bridge]
  br_handle_frame_finish+0x500/0x620 [bridge]
  br_handle_frame+0x353/0x440 [bridge]
  __netif_receive_skb_core.constprop.0+0x298/0x1110
  __netif_receive_skb_one_core+0x3d/0xa0
  process_backlog+0xa0/0x140
  __napi_poll+0x2c/0x170
  net_rx_action+0x2c4/0x3a0
  handle_softirqs+0xd0/0x270
  do_softirq+0x3f/0x60

Fix this by replacing the IS_ENABLED(IPV6) check with ipv6_mod_enabled() in
the callers. In essence, this disables NS/NA suppression when IPv6 is
disabled.
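
The difference between a build-time and a runtime gate can be illustrated with a user-space sketch. None of the names below exist in the kernel: SK_CONFIG_IPV6 stands in for the build option, sk_ipv6_disabled_at_boot for the effect of 'ipv6.disable=1', and sk_ipv6_runtime_enabled() for a runtime query in the spirit of ipv6_mod_enabled().

```c
#include <stdbool.h>

#define SK_CONFIG_IPV6 1 /* stand-in for the build-time option */

/* Stand-in for the boot-time state set by ipv6.disable=1. */
static bool sk_ipv6_disabled_at_boot;

/* Stand-in for a runtime query: compiled in and not disabled at boot. */
static bool sk_ipv6_runtime_enabled(void)
{
	return SK_CONFIG_IPV6 && !sk_ipv6_disabled_at_boot;
}

/* Old gate: only asks whether IPv6 support was compiled in. */
static bool nd_suppress_allowed_old(bool neigh_suppress)
{
	return SK_CONFIG_IPV6 && neigh_suppress;
}

/* New gate: additionally requires IPv6 to be enabled at runtime. */
static bool nd_suppress_allowed_new(bool neigh_suppress)
{
	return sk_ipv6_runtime_enabled() && neigh_suppress;
}
```

With IPv6 compiled in but disabled at boot, the old gate still lets the suppression path run, and with it the lookup on the uninitialised table, while the new gate skips it.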

Fixes: ed842faeb2bd ("bridge: suppress nd pkts on BR_NEIGH_SUPPRESS ports")
Reported-by: Guruprasad C P <gurucp2005@gmail.com>
Closes: https://lore.kernel.org/netdev/CAHXs0ORzd62QOG-Fttqa2Cx_A_VFp=utE2H2VTX5nqfgs7LDxQ@mail.gmail.com/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260304120357.9778-1-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago Merge branch 'maintainers-annual-cleanup-of-inactive-maintainers'
Jakub Kicinski [Thu, 5 Mar 2026 15:35:45 +0000 (07:35 -0800)] 
Merge branch 'maintainers-annual-cleanup-of-inactive-maintainers'

Jakub Kicinski says:

====================
MAINTAINERS: annual cleanup of inactive maintainers

Annual cleanup of inactive maintainers under networking.
The goal is to make sure MAINTAINERS reflects reality for
code which is relatively actively changed (at least 70 commits
in the last 2 years or at least 120 commits in the last 5 years).

Those who either:
 - were the initial author / "upstreamer" of the driver; or
 - authored at least 1/3rd of the existing code base (per git blame); or
 - authored at least 25% of commits before becoming inactive
are moved to CREDITS.

The discovery of inactive maintainers was done using gitdm tools,
with a bunch of ad-hoc scripts on top to do the rest. I tried to
double check the results but this is mostly a scripted cleanup
so please report inaccuracies if any.
====================

Link: https://patch.msgid.link/20260303215339.2333548-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>