git.ipfire.org Git - thirdparty/kernel/linux.git/log

net: remove HIPPI support and RoadRunner HIPPI driver

HIPPI has not been relevant for over two decades. It was rapidly
eclipsed by Fibre Channel, and even when it was new, it was
confined to very high-end hardware. The HIPPI code has only
received tree-wide changes and fixes by inspection in the entire
Git history. Remove HIPPI support and the rrunner HIPPI driver,
and move the former maintainer to the CREDITS file. Keep the
include/uapi/linux/if_hippi.h header because it is used by the TUN
code, and to avoid breaking userspace, however unlikely that may be.

Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
Link: https://patch.msgid.link/20260119022451.22344-1-enelsonmoore@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: move tcp_rate_skb_delivered() to tcp_input.c

tcp_rate_skb_delivered() is only called from tcp_input.c.
Move it there and make it static.

Both gcc and clang are (auto)inlining it, TCP performance
is increased at a small space cost.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 3/0 up/down: 509/-187 (322)
Function                                     old     new   delta
tcp_sacktag_walk                            1682    1867    +185
tcp_ack                                     5230    5405    +175
tcp_shifted_skb                              437     586    +149
__pfx_tcp_rate_skb_delivered                  16       -     -16
tcp_rate_skb_delivered                       171       -    -171
Total: Before=22566192, After=22566514, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20260118123204.2315993-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'fix-typos-in-network-driver-code-comments'

Yicong Hui says:

====================
Fix typos in network driver code comments

Fix various minor typos and mispellings in 3 different driver
subdirectories in drivers/net/ethernet
====================

Link: https://patch.msgid.link/20260118121001.136806-1-yiconghui@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/xen-netback: Fix mispelling of "Software" as "Softare"

Fix misspelling of "software" as "softare" in xen-netback code comment.

Signed-off-by: Yicong Hui <yiconghui@gmail.com>
Link: https://patch.msgid.link/20260118121001.136806-4-yiconghui@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/micrel: Fix typos in micrel driver code comments

Fix various typos and misspellings in code comments in the
drivers/net/ethernet/micrel directory

Signed-off-by: Yicong Hui <yiconghui@gmail.com>
Link: https://patch.msgid.link/20260118121001.136806-3-yiconghui@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/benet: Fix typos in driver code comments

Fix various typos and misspellings in code comments in the
drivers/net/ethernet/emulex directory

Signed-off-by: Yicong Hui <yiconghui@gmail.com>
Link: https://patch.msgid.link/20260118121001.136806-2-yiconghui@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: simplify PHY fixup registration

Based on the fact that either bus_id-based matching or phy_uid-based
matching is used, the code can be simplified. PHY_ANY_ID and
PHY_ANY_UID are not needed. Ensure that phy_id_compare() is called
only if phy_uid_mask isn't zero, because a zero value would always
result in a match.
In addition change the return value type of phy_needs_fixup() to bool.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/e7394cc8-5895-4d02-a8fe-802345c7c547@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: macb: Replace open-coded device config retrieval with of_device_get_match_data()

Use of_device_get_match_data() to replace the open-coded method for
obtaining the device config.

Additionally, adjust the ordering of local variables to ensure
compatibility with RCS.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260117-macb-v1-1-f092092d8c91@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha_eth: increase max MTU to 9220 for DSA jumbo frames

The industry standard jumbo frame MTU is 9216 bytes. When using the DSA
subsystem, a 4-byte tag is added to each Ethernet frame.

Increase AIROHA_MAX_MTU to 9220 bytes (9216 + 4) so that users can set a
standard 9216-byte MTU on DSA ports.

The underlying hardware supports significantly larger frame sizes
(approximately 16K). However, the maximum MTU is limited to 9220 bytes
for now, as this is sufficient to support standard jumbo frames and does
not incur additional memory allocation overhead.

Signed-off-by: Sayantan Nandy <sayantann11@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260119073658.6216-1-sayantann11@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-pf: Remove unnecessary bounds check

active_fec is a 2-bit unsigned field, and thus can only have the values
0-3. So checking that it is less than 4 is unnecessary.

Simplify the code by dropping this check.

As it no longer fits well where it is, move FEC_MAX_INDEX to towards the
top of the file. And add the prefix OXT2. I believe this is more
idiomatic.

Flagged by Smatch as:
...//otx2_ethtool.c:1024 otx2_get_fecparam() warn: always true condition '(pfvf->linfo.fec < 4) => (0-3 < 4)'

No functional change intended.
Compile tested only.

Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Hariprasad Kelam <hkelam@marvell.com>
Link: https://patch.msgid.link/20260119-oob-v1-1-a4147e75e770@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: add kdoc for napi_consume_skb()

Looks like AI reviewers miss that napi_consume_skb() must have
a real budget passed to it. Let's see if adding a real kdoc will
help them figure this out.

Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20260119224140.1362729-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: usb: r8152: fix transmit queue timeout

When the TX queue length reaches the threshold, the netdev watchdog
immediately detects a TX queue timeout.

This patch updates the trans_start timestamp of the transmit queue
on every asynchronous USB URB submission along the transmit path,
ensuring that the network watchdog accurately reflects ongoing
transmission activity.

Signed-off-by: Mingj Ye <insyelu@gmail.com>
Reviewed-by: Hayes Wang <hayeswang@realtek.com>
Link: https://patch.msgid.link/20260120015949.84996-1-insyelu@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: drv-net: fix missing include in ncdevmem

Commit ca9d74eb5f6a ("uapi: add INT_MAX and INT_MIN constants")
recently removed some includes of limits.h in uAPI headers.
ncdevmem.c was depending on them:

  ncdevmem.c: In function ‘ethtool_add_flow’:
  ncdevmem.c:369:60: error: ‘INT_MAX’ undeclared (first use in this function)
  369 |         if (endptr == id_start || flow_id < 0 || flow_id > INT_MAX)
      |                                                            ^~~~~~~
  ncdevmem.c:77:1: note: ‘INT_MAX’ is defined in header ‘<limits.h>’; did you forget to ‘#include <limits.h>’?

Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20260120180319.1673271-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: fclone allocation small optimization

After skb allocation, initial skb->fclone value is 0 (SKB_FCLONE_UNAVAILABLE)

We can replace one RMW sequence with a single OR instruction.

movzbl 0x7e(%r13),%eax // skb->fclone = SKB_FCLONE_ORIG;
and    $0xf3,%al
or     $0x4,%al
mov    %al,0x7e(%r13)
->
or     $0x4,0x7e(%r13) // skb->fclone |= SKB_FCLONE_ORIG;

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260116164402.1872649-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'convert-the-micrel-bindings-to-dt-schema'

Stefan Eichenberger says:

====================
Convert the Micrel bindings to DT schema

Convert the device tree bindings for the Micrel PHYs and switches to DT
schema.
====================

Link: https://patch.msgid.link/20260116130948.79558-1-eichest@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: micrel: Convert micrel-ksz90x1.txt to DT schema

Convert the micrel-ksz90x1.txt to DT schema. Create a separate YAML file
for this PHY series. The old naming of ksz90x1 would be misleading in
this case, so rename it to gigabit, as it contains ksz9xx1 and lan8xxx
gigabit PHYs.

Signed-off-by: Stefan Eichenberger <stefan.eichenberger@toradex.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260116130948.79558-3-eichest@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: micrel: Convert to DT schema

Convert the devicetree bindings for the Micrel PHYs and switches to DT
schema.

Signed-off-by: Stefan Eichenberger <stefan.eichenberger@toradex.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20260116130948.79558-2-eichest@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: sparx5: do not require phys when RGMII is used

LAN969x has 2 dedicated RGMII ports, so regular SERDES lanes are not used
for RGMII.

So, lets not require phys to be defined when any of the rgmii phy-modes are
set.

Signed-off-by: Robert Marko <robert.marko@sartura.hr>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20260115114021.111324-11-robert.marko@sartura.hr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: drv-net: extend HW timestamp test with ioctl

Extend HW timestamp tests to check that ioctl interface is not broken
and configuration setups and requests are equal to netlink interface.
Some linter warnings are disabled because of ctypes classes.

Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20260116062121.1230184-2-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: remove legacy way to get/set HW timestamp config

With all drivers converted to use ndo_hwstamp callbacks the legacy way
can be removed, marking ioctl interface as deprecated.

Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20260116062121.1230184-1-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: split kmalloc_reserve() to allow inlining

kmalloc_reserve() is too big to be inlined.

Put the slow path in a new out-of-line function : kmalloc_pfmemalloc()

Then let kmalloc_reserve() set skb->pfmemalloc only when/if
the slow path is taken.

This makes __alloc_skb() faster :

- kmalloc_reserve() is now automatically inlined by both gcc and clang.
- No more expensive RMW (skb->pfmemalloc = pfmemalloc).
- No more expensive stack canary (for CONFIG_STACKPROTECTOR_STRONG=y).
- Removal of two prefetches that were coming too late for modern cpus.

Text size increase is quite small compared to the cpu savings (~0.7 %)

$ size net/core/skbuff.clang.before.o net/core/skbuff.clang.after.o
   text    data     bss     dec     hex filename
  72507    5897       0   78404   13244 net/core/skbuff.clang.before.o
  72681    5897       0   78578   132f2 net/core/skbuff.clang.after.o

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260116041359.181104-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'eth-fbnic-update-ipc-mailbox-support'

Mohsin Bashir says:

====================
eth: fbnic: Update IPC mailbox support

Update IPC mailbox support for fbnic to cater for several changes.
====================

Link: https://patch.msgid.link/20260115003353.4150771-1-mohsin.bashr@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eth: fbnic: Update RX mbox timeout value

While waiting for completions on read requests, driver is using
different timeout values for different messages. Make use of a single
timeout value.

Introduce a wrapper function to handle the wait, which also simplify
maintaining the 80 char line limit.

Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20260115003353.4150771-6-mohsin.bashr@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eth: fbnic: Remove retry support

The driver retries sensor read requests from firmware, but this is
unnecessary. A functioning firmware should respond to each request
within the timeout period. Remove the retry logic and set the timeout
to the sum of all retry timeouts.

Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20260115003353.4150771-5-mohsin.bashr@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eth: fbnic: Reuse RX mailbox pages

Currently, the RX mailbox frees and reallocates a page for each received
message. Since FW Rx messages are processed synchronously, and nothing
hold these pages (unlike skbs which we hand over to the stack), reuse
the pages and put them back on the Rx ring. Now that we ensure the ring
is always fully populated we don't have to worry about filling it up
after partial population during init, either. Update
fbnic_mbx_process_rx_msgs() to recycle pages after message processing.

Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20260115003353.4150771-4-mohsin.bashr@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eth: fbnic: Allocate all pages for RX mailbox

Now that memory is allocated with GFP_KERNEL, allocation failures
should be extremely rare. Ensure the FW communication ring is
always fully populated with free pages, and hard fail initialization
otherwise. This enables simplifications in next patches.

Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20260115003353.4150771-3-mohsin.bashr@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eth: fbnic: Use GFP_KERNEL to allocting mbx pages

Replace GFP_ATOMIC with GFP_KERNEL for mailbox RX page allocation. Since
interrupt handler is threaded GFP_KERNEL is a safe option to reduce
allocation failures.

Also remove __GFP_NOWARN so the kernel reports a warning on allocation
failure to aid debugging.

Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20260115003353.4150771-2-mohsin.bashr@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-queue-rx-buf-len-v9' of https://github.com/isilence/linux

Pavel Begunkov says:

====================
Add support for providers with large rx buffer

Many modern NICs support configurable receive buffer lengths, and zcrx and
memory providers can use buffers larger than 4K to improve performance.
When paired with hw-gro larger rx buffer sizes can drastically reduce
the number of buffers traversing the stack and save a lot of processing
time. It also allows to give to users larger contiguous chunks of data.

Single stream benchmarks showed up to ~30% CPU util improvement.
E.g. comparison for 4K vs 32K buffers using a 200Gbit NIC:

packets=23987040 (MB=2745098), rps=199559 (MB/s=22837)
CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
  0    1.53    0.00   27.78    2.72    1.31   66.45    0.22
packets=24078368 (MB=2755550), rps=200319 (MB/s=22924)
CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
  0    0.69    0.00    8.26   31.65    1.83   57.00    0.57

This series adds net infrastructure for memory providers configuring
the size and implements it for bnxt. It's an opt-in feature for drivers,
they should advertise support for the parameter in the qops and must check
if the hardware supports the given size. It's limited to memory providers
as it drastically simplifies implementation. It doesn't affect the fast
path zcrx uAPI, and the user exposed parameter is defined in zcrx terms,
which allows it to be flexible and adjusted in the future.

A liburing example can be found at [2]

full branch:
[1] https://github.com/isilence/linux.git zcrx/large-buffers-v8
Liburing example:
[2] https://github.com/isilence/liburing.git zcrx/rx-buf-len

* tag 'net-queue-rx-buf-len-v9' of https://github.com/isilence/linux:
  io_uring/zcrx: document area chunking parameter
  selftests: iou-zcrx: test large chunk sizes
  eth: bnxt: support qcfg provided rx page size
  eth: bnxt: adjust the fill level of agg queues with larger buffers
  eth: bnxt: store rx buffer size per queue
  net: pass queue rx page size from memory provider
  net: add bare bone queue configs
  net: reduce indent of struct netdev_queue_mgmt_ops members
  net: memzero mp params when closing a queue
====================

Link: https://patch.msgid.link/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Revert "Merge branch 'netkit-support-for-io_uring-zero-copy-and-af_xdp'"

This reverts commit 77b9c4a438fc66e2ab004c411056b3fb71a54f2c, reversing
changes made to 4515ec4ad58a37e70a9e1256c0b993958c9b7497:

931420a2fc36 ("selftests/net: Add netkit container tests")
ab771c938d9a ("selftests/net: Make NetDrvContEnv support queue leasing")
6be87fbb2776 ("selftests/net: Add env for container based tests")
61d99ce3dfc2 ("selftests/net: Add bpf skb forwarding program")
920da3634194 ("netkit: Add xsk support for af_xdp applications")
eef51113f8af ("netkit: Add netkit notifier to check for unregistering devices")
b5ef109d22d4 ("netkit: Implement rtnl_link_ops->alloc and ndo_queue_create")
b5c3fa4a0b16 ("netkit: Add single device mode for netkit")
0073d2fd679d ("xsk: Proxy pool management for leased queues")
1ecea95dd3b5 ("xsk: Extend xsk_rcv_check validation")
804bf334d08a ("net: Proxy netdev_queue_get_dma_dev for leased queues")
0caa9a8ddec3 ("net: Proxy net_mp_{open,close}_rxq for leased queues")
ff8889ff9107 ("net, ethtool: Disallow leased real rxqs to be resized")
9e2103f36110 ("net: Add lease info to queue-get response")
31127deddef4 ("net: Implement netdev_nl_queue_create_doit")
a5546e18f77c ("net: Add queue-create operation")

The series will conflict with io_uring work, and the code needs more
polish.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'netkit-support-for-io_uring-zero-copy-and-af_xdp'

Daniel Borkmann says:

====================
netkit: Support for io_uring zero-copy and AF_XDP

Containers use virtual netdevs to route traffic from a physical netdev
in the host namespace. They do not have access to the physical netdev
in the host and thus can't use memory providers or AF_XDP that require
reconfiguring/restarting queues in the physical netdev.

This patchset adds the concept of queue leasing to virtual netdevs that
allow containers to use memory providers and AF_XDP at native speed.
Leased queues are bound to a real queue in a physical netdev and act
as a proxy.

Memory providers and AF_XDP operations take an ifindex and queue id,
so containers would pass in an ifindex for a virtual netdev and a queue
id of a leased queue, which then gets proxied to the underlying real
queue.

We have implemented support for this concept in netkit and tested the
latter against Nvidia ConnectX-6 (mlx5) as well as Broadcom BCM957504
(bnxt_en) 100G NICs. For more details see the individual patches.
====================

Link: https://patch.msgid.link/20260115082603.219152-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests/net: Add netkit container tests

Add two tests using NetDrvContEnv. One basic test that sets up a netkit
pair, with one end in a netns. Use LOCAL_PREFIX_V6 and nk_forward BPF
program to ping from a remote host to the netkit in netns.

Second is a selftest for netkit queue leasing, using io_uring zero copy
test binary inside of a netns with netkit. This checks that memory
providers can be bound against virtual queues in a netkit within a
netns that are leasing from a physical netdev in the default netns.

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260115082603.219152-17-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests/net: Make NetDrvContEnv support queue leasing

Add a new parameter `lease` to NetDrvContEnv that sets up queue leasing
in the env.

The NETIF also has some ethtool parameters changed to support memory
provider tests. This is needed in NetDrvContEnv rather than individual
test cases since the cleanup to restore NETIF can't be done, until the
netns in the env is gone.

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260115082603.219152-16-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests/net: Add env for container based tests

Add an env NetDrvContEnv for container based selftests. This automates
the setup of a netns, netkit pair with one inside the netns, and a BPF
program that forwards skbs from the NETIF host inside the container.

Currently only netkit is used, but other virtual netdevs e.g. veth can
be used too.

Expect netkit container datapath selftests to have a publicly routable
IP prefix to assign to netkit in a container, such that packets will
land on eth0. The BPF skb forward program will then forward such packets
from the host netns to the container netns.

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260115082603.219152-15-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests/net: Add bpf skb forwarding program

Add nk_forward.bpf.c, a BPF program that forwards skbs matching some IPv6
prefix received on eth0 ifindex to a specified netkit ifindex. This will
be needed by netkit container tests.

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260115082603.219152-14-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netkit: Add xsk support for af_xdp applications

Enable support for AF_XDP applications to operate on a netkit device.
The goal is that AF_XDP applications can natively consume AF_XDP
from network namespaces. The use-case from Cilium side is to support
Kubernetes KubeVirt VMs through QEMU's AF_XDP backend. KubeVirt is a
virtual machine management add-on for Kubernetes which aims to provide
a common ground for virtualization. KubeVirt spawns the VMs inside
Kubernetes Pods which reside in their own network namespace just like
regular Pods.

Raw QEMU AF_XDP backend example with eth0 being a physical device with
16 queues where netkit is bound to the last queue (for multi-queue RSS
context can be used if supported by the driver):

  # ethtool -X eth0 start 0 equal 15
  # ethtool -X eth0 start 15 equal 1 context new
  # ethtool --config-ntuple eth0 flow-type ether \
            src 00:00:00:00:00:00 \
            src-mask ff:ff:ff:ff:ff:ff \
            dst $mac dst-mask 00:00:00:00:00:00 \
            proto 0 proto-mask 0xffff action 15
  [ ... setup BPF/XDP prog on eth0 to steer into shared xsk map ... ]
  # ip netns add foo
  # ip link add numrxqueues 2 nk type netkit single
  # ./pyynl/cli.py --spec ~/netlink/specs/netdev.yaml \
                   --do queue-create \
                   --json "{"ifindex": $(ifindex nk), "type": "rx", \
                            "lease": { "ifindex": $(ifindex eth0), \
                                       "queue": { "type": "rx", "id": 15 } } }"
  {'id': 1}
  # ip link set nk netns foo
  # ip netns exec foo ip link set lo up
  # ip netns exec foo ip link set nk up
  # ip netns exec foo qemu-system-x86_64 \
          -kernel $kernel \
          -drive file=${image_name},index=0,media=disk,format=raw \
          -append "root=/dev/sda rw console=ttyS0" \
          -cpu host \
          -m $memory \
          -enable-kvm \
          -device virtio-net-pci,netdev=net0,mac=$mac \
          -netdev af-xdp,ifname=nk,id=net0,mode=native,queues=1,start-queue=1,inhibit=on,map-path=$dir/xsks_map \
          -nographic

We have tested the above against a dual-port Nvidia ConnectX-6 (mlx5)
100G NIC with successful network connectivity out of QEMU. An earlier
iteration of this work was presented at LSF/MM/BPF [0] and more
recently at LPC [1].

For getting to a first starting point to connect all things with
KubeVirt, bind mounting the xsk map from Cilium into the VM launcher
Pod which acts as a regular Kubernetes Pod while not perfect, is not
a big problem given its out of reach from the application sitting
inside the VM (and some of the control plane aspects are baked in
the launcher Pod already), so the isolation barrier is still the VM.
Eventually the goal is to have a XDP/XSK redirect extension where
there is no need to have the xsk map, and the BPF program can just
derive the target xsk through the queue where traffic was received
on.

The exposure through netkit is because Cilium should not act as a
proxy handing out xsk sockets. Existing applications expect a netdev
from kernel side and should not need to rewrite just to implement
against a CNI's protocol. Also, all the memory should not be accounted
against Cilium but rather the application Pod itself which is consuming
AF_XDP. Further, on up/downgrades we expect the data plane to being
completely decoupled from the control plane; if Cilium would own the
sockets that would be disruptive. Another use-case which opens up and
is regularly asked from users would be to have DPDK applications on
top of AF_XDP in regular Kubernetes Pods.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://bpfconf.ebpf.io/bpfconf2025/bpfconf2025_material/lsfmmbpf_2025_netkit_borkmann.pdf
Link: https://lpc.events/event/19/contributions/2275/
Link: https://patch.msgid.link/20260115082603.219152-13-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netkit: Add netkit notifier to check for unregistering devices

Add a netdevice notifier in netkit to watch for NETDEV_UNREGISTER events.
If the target device is indeed NETREG_UNREGISTERING and previously leased
a queue to a netkit device, then collect the related netkit devices and
batch-unregister_netdevice_many() them.

If this would not be done, then the netkit device would hold a reference
on the physical device preventing it from going away. However, in case of
both io_uring zero-copy as well as AF_XDP this situation is handled
gracefully and the allocated resources are torn down.

In the case where mentioned infra is used through netkit, the applications
have a reference on netkit, and netkit in turn holds a reference on the
physical device. In order to have netkit release the reference on the
physical device, we need such watcher to then unregister the netkit ones.

This is generally quite similar to the dependency handling in case of
tunnels (e.g. vxlan bound to a underlying netdev) where the tunnel device
gets removed along with the physical device.

  # ip a
  [...]
  4: enp10s0f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000
      link/ether e8:eb:d3:a3:43:f6 brd ff:ff:ff:ff:ff:ff
      inet 10.0.0.2/24 scope global enp10s0f0np0
         valid_lft forever preferred_lft forever
  [...]
  8: nk@NONE: <BROADCAST,MULTICAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
      link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
  [...]

  # rmmod mlx5_ib
  # rmmod mlx5_core

  [  309.261822] mlx5_core 0000:0a:00.0 mlx5_0: Port: 1 Link DOWN
  [  344.235236] mlx5_core 0000:0a:00.1: E-Switch: Unload vfs: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
  [  344.246948] mlx5_core 0000:0a:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
  [  344.463754] mlx5_core 0000:0a:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
  [  344.770155] mlx5_core 0000:0a:00.1: E-Switch: cleanup
  [  345.345709] mlx5_core 0000:0a:00.0: E-Switch: Unload vfs: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
  [  345.357524] mlx5_core 0000:0a:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
  [  350.995989] mlx5_core 0000:0a:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
  [  351.574396] mlx5_core 0000:0a:00.0: E-Switch: cleanup

  # ip a
  [...]
  [ both enp10s0f0np0 and nk gone ]
  [...]

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260115082603.219152-12-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netkit: Implement rtnl_link_ops->alloc and ndo_queue_create

Implement rtnl_link_ops->alloc that allows the number of rx queues to be
set when netkit is created. By default, netkit has only a single rxq (and
single txq). The number of queues is deliberately not allowed to be changed
via ethtool -L and is fixed for the lifetime of a netkit instance.

For netkit device creation, numrxqueues with larger than one rxq can be
specified. These rxqs are leasable to real rxqs in physical netdevs:

  ip link add type netkit peer numrxqueues 64      # for device pair
  ip link add numrxqueues 64 type netkit single    # for single device

The limit of numrxqueues for netkit is currently set to 1024, which allows
leasing multiple real rxqs from physical netdevs.

The implementation of ndo_queue_create() adds a new rxq during the queue
lease operation. We allow to create queues either in single device mode
or for the case of dual device mode for the netkit peer device which gets
placed into the target network namespace. For dual device mode the lease
against the primary device does not make sense for the targeted use cases,
and therefore gets rejected.

We also need to add a lockdep class for netkit, such that lockdep does
not trip over us, similarly done as in commit 0bef512012b1 ("net: add
netdev_lockdep_set_classes() to virtual drivers").

This is also the last missing bit to netkit for supporting io_uring with
zero-copy mode [0]. Up until this point it was not possible to consume the
latter out of containers or Kubernetes Pods where applications are in their
own network namespace.

io_uring example with eth0 being a physical device with 16 queues where
netkit is bound to the last queue, iou-zcrx.c is binary from selftests.
Flow steering to that queue is based on the service VIP:port of the
server utilizing io_uring:

  # ethtool -X eth0 start 0 equal 15
  # ethtool -X eth0 start 15 equal 1 context new
  # ethtool --config-ntuple eth0 flow-type tcp4 dst-ip 1.2.3.4 dst-port 5000 action 15
  # ip netns add foo
  # ip link add type netkit peer numrxqueues 2
  # ./pyynl/cli.py --spec ~/netlink/specs/netdev.yaml \
                   --do queue-create \
                   --json "{"ifindex": $(ifindex nk0), "type": "rx", \
                            "lease": { "ifindex": $(ifindex eth0), \
                                       "queue": { "type": "rx", "id": 15 } } }"
  {'id': 1}
  # ip link set nk0 netns foo
  # ip link set nk1 up
  # ip netns exec foo ip link set lo up
  # ip netns exec foo ip link set nk0 up
  # ip netns exec foo ip addr add 1.2.3.4/32 dev nk0
  [ ... setup routing etc to get external traffic into the netns ... ]
  # ip netns exec foo ./iou-zcrx -s -p 5000 -i nk0 -q 1

Remote io_uring client:

  # ./iou-zcrx -c -h 1.2.3.4 -p 5000 -l 12840 -z 65536

We have tested the above against a Broadcom BCM957504 (bnxt_en) 100G NIC,
supporting TCP header/data split.

Similarly, this also works for devmem which we tested using ncdevmem:

  # ip netns exec foo ./ncdevmem -s 1.2.3.4 -l -p 5000 -f nk0 -t 1 -q 1

And on the remote client:

  # ./ncdevmem -s 1.2.3.4 -p 5000 -f eth0

For Cilium, the plan is to open up support for the various memory providers
for regular Kubernetes Pods when Cilium is configured with netkit datapath
mode.

Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://kernel-recipes.org/en/2024/schedule/efficient-zero-copy-networking-using-io_uring
Link: https://patch.msgid.link/20260115082603.219152-11-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netkit: Add single device mode for netkit

Add a single device mode for netkit instead of netkit pairs. The primary
target for the paired devices is to connect network namespaces, of course,
and support has been implemented in projects like Cilium [0]. For the rxq
leasing the plan is to support two main scenarios related to single device
mode:

* For the use-case of io_uring zero-copy, the control plane can either
  set up a netkit pair where the peer device can perform rxq leasing which
  is then tied to the lifetime of the peer device, or the control plane
  can use a regular netkit pair to connect the hostns to a Pod/container
  and dynamically add/remove rxq leasing through a single device without
  having to interrupt the device pair. In the case of io_uring, the memory
  pool is used as skb non-linear pages, and thus the skb will go its way
  through the regular stack into netkit. Things like the netkit policy when
  no BPF is attached or skb scrubbing etc apply as-is in case the paired
  devices are used, or if the backend memory is tied to the single device
  and traffic goes through a paired device.

* For the use-case of AF_XDP, the control plane needs to use netkit in the
  single device mode. The single device mode currently enforces only a
  pass policy when no BPF is attached, and does not yet support BPF link
  attachments for AF_XDP. skbs sent to that device get dropped at the
  moment. Given AF_XDP operates at a lower layer of the stack tying this
  to the netkit pair did not make sense. In future, the plan is to allow
  BPF at the XDP layer which can: i) process traffic coming from the AF_XDP
  application (e.g. QEMU with AF_XDP backend) to filter egress traffic or
  to push selected egress traffic up to the single netkit device to the
  local stack (e.g. DHCP requests), and ii) vice-versa skbs sent to the
  single netkit into the AF_XDP application (e.g. DHCP replies). Also,
  the control-plane can dynamically manage rxq leasing for the single
  netkit device without having to interrupt (e.g. down/up cycle) the main
  netkit pair for the Pod which has traffic going in and out.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Jordan Rife <jordan@jrife.io>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://docs.cilium.io/en/stable/operations/performance/tuning/#netkit-device-mode
Link: https://patch.msgid.link/20260115082603.219152-10-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

xsk: Proxy pool management for leased queues

Similarly to the net_mp_{open,close}_rxq handling for leased queues, proxy
the xsk_{reg,clear}_pool_at_qid via netif_get_rx_queue_lease_locked such
that in case a virtual netdev picked a leased rxq, the request gets through
to the real rxq in the physical netdev. The proxying is only relevant for
queue_id < dev->real_num_rx_queues since right now its only supported for
rxqs.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260115082603.219152-9-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

xsk: Extend xsk_rcv_check validation

xsk_rcv_check tests for inbound packets to see whether they match
the bound AF_XDP socket. Refactor the test into a small helper
xsk_dev_queue_valid and move the validation against xs->dev and
xs->queue_id there.

The fast-path case stays in place and allows for quick return in
xsk_dev_queue_valid. If it fails, the validation is extended to
check whether the AF_XDP socket is bound against a leased queue,
and if the case then the test is redone.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260115082603.219152-8-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: Proxy netdev_queue_get_dma_dev for leased queues

Extend netdev_queue_get_dma_dev to return the physical device of the
real rxq for DMA in case the queue was leased. This allows memory
providers like io_uring zero-copy or devmem to bind to the physically
leased rxq via virtual devices such as netkit.

Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260115082603.219152-7-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: Proxy net_mp_{open,close}_rxq for leased queues

When a process in a container wants to setup a memory provider, it will
use the virtual netdev and a leased rxq, and call net_mp_{open,close}_rxq
to try and restart the queue. At this point, proxy the queue restart on
the real rxq in the physical netdev.

For memory providers (io_uring zero-copy rx and devmem), it causes the
real rxq in the physical netdev to be filled from a memory provider that
has DMA mapped memory from a process within a container.

Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260115082603.219152-6-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net, ethtool: Disallow leased real rxqs to be resized

Similar to AF_XDP, do not allow queues in a physical netdev to be
resized by ethtool -L when they are leased.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260115082603.219152-5-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: Add lease info to queue-get response

Populate nested lease info to the queue-get response that returns the
ifindex, queue id with type and optionally netns id if the device
resides in a different netns.

Example with ynl client:

  # ip a
  [...]
  4: enp10s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp/id:24 qdisc mq state UP group default qlen 1000
    link/ether e8:eb:d3:a3:43:f6 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.2/24 scope global enp10s0f0np0
       valid_lft forever preferred_lft forever
    inet6 fe80::eaeb:d3ff:fea3:43f6/64 scope link proto kernel_ll
       valid_lft forever preferred_lft forever
  [...]

  # ethtool -i enp10s0f0np0
  driver: mlx5_core
  [...]

  # ./pyynl/cli.py \
      --spec ~/netlink/specs/netdev.yaml \
      --do queue-get \
      --json '{"ifindex": 4, "id": 15, "type": "rx"}'
  {'id': 15,
   'ifindex': 4,
   'lease': {'ifindex': 8, 'netns-id': 0, 'queue': {'id': 1, 'type': 'rx'}},
   'napi-id': 8227,
   'type': 'rx',
   'xsk': {}}

  # ip netns list
  foo (id: 0)

  # ip netns exec foo ip a
  [...]
  8: nk@NONE: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
      link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
      inet6 fe80::200:ff:fe00:0/64 scope link proto kernel_ll
         valid_lft forever preferred_lft forever
  [...]

  # ip netns exec foo ethtool -i nk
  driver: netkit
  [...]

  # ip netns exec foo ls /sys/class/net/nk/queues/
  rx-0  rx-1  tx-0

  # ip netns exec foo ./pyynl/cli.py \
      --spec ~/netlink/specs/netdev.yaml \
      --do queue-get \
      --json '{"ifindex": 8, "id": 1, "type": "rx"}'
  {'id': 1, 'ifindex': 8, 'type': 'rx'}

Note that the caller of netdev_nl_queue_fill_one() holds the netdevice
lock. For the queue-get we do not lock both devices. When queues get
{un,}leased, both devices are locked, thus if __netif_get_rx_queue_peer()
returns true, the peer pointer points to a valid device. The netns-id
is fetched via peernet2id_alloc() similarly as done in OVS.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260115082603.219152-4-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: Implement netdev_nl_queue_create_doit

Implement netdev_nl_queue_create_doit which creates a new rx queue in a
virtual netdev and then leases it to a rx queue in a physical netdev.

Example with ynl client:

  # ./pyynl/cli.py \
      --spec ~/netlink/specs/netdev.yaml \
      --do queue-create \
      --json '{"ifindex": 8, "type": "rx", "lease": {"ifindex": 4, "queue": {"type": "rx", "id": 15}}}'
  {'id': 1}

Note that the netdevice locking order is always from the virtual to
the physical device.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260115082603.219152-3-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: Add queue-create operation

Add a ynl netdev family operation called queue-create that creates a
new queue on a netdevice:

      name: queue-create
      attribute-set: queue
      flags: [admin-perm]
      do:
        request:
          attributes:
            - ifindex
            - type
            - lease
        reply: &queue-create-op
          attributes:
            - id

This is a generic operation such that it can be extended for various
use cases in future. Right now it is mandatory to specify ifindex,
the queue type which is enforced to rx and a lease. The newly created
queue id is returned to the caller.

A queue from a virtual device can have a lease which refers to another
queue from a physical device. This is useful for memory providers
and AF_XDP operations which take an ifindex and queue id to allow
applications to bind against virtual devices in containers. The lease
couples both queues together and allows to proxy the operations from
a virtual device in a container to the physical device.

In future, the nested lease attribute can be lifted and made optional
for other use-cases such as dynamic queue creation for physical
netdevs. The lack of lease and the specification of the physical
device as an ifindex will imply that we need a real queue to be
allocated. Similarly, the queue type enforcement to rx can then be
lifted as well to support tx.

An early implementation had only driver-specific integration [0], but
in order for other virtual devices to reuse, it makes sense to have
this as a generic API in core net.

For leasing queues, the virtual netdev must have real_num_rx_queue
less than num_rx_queues at the time of calling queue-create. The
queue-type must be rx as only rx queues are supported for leasing
for now. We also enforce that the queue-create ifindex must point
to a virtual device, and that the nested lease attribute's ifindex
must point to a physical device. The nested lease attribute set
contains a netns-id attribute which is currently only intended for
dumping as part of the queue-get operation. Also, it is modeled as
an s32 type similarly as done elsewhere in the stack.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://bpfconf.ebpf.io/bpfconf2025/bpfconf2025_material/lsfmmbpf_2025_netkit_borkmann.pdf
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260115082603.219152-2-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-hinic3-pf-initialization'

Fan Gong says:

====================
net: hinic3: PF initialization

This is [1/3] part of hinic3 Ethernet driver second submission.
With this patch hinic3 becomes a complete Ethernet driver with
pf and vf.

The driver parts contained in this patch:
Add support for PF framework based on the VF code.
Add PF management interfaces to communicate with HW.
Add 8 netdev ops to configure NIC features.
Support mac filter to unicast and multicast.
Add HW event handler to manage port and link status.

V01: https://lore.kernel.org/netdev/cover.1760502478.git.zhuyikai1@h-partners.com/
V02: https://lore.kernel.org/netdev/cover.1760685059.git.zhuyikai1@h-partners.com/
V03: https://lore.kernel.org/netdev/cover.1761362580.git.zhuyikai1@h-partners.com/
V04: https://lore.kernel.org/netdev/cover.1761711549.git.zhuyikai1@h-partners.com/
V05: https://lore.kernel.org/netdev/cover.1762414088.git.zhuyikai1@h-partners.com/
V06: https://lore.kernel.org/netdev/cover.1762581665.git.zhuyikai1@h-partners.com/
V07: https://lore.kernel.org/netdev/cover.1763555878.git.zhuyikai1@h-partners.com/
V08: https://lore.kernel.org/netdev/cover.1767495881.git.zhuyikai1@h-partners.com/
V09: https://lore.kernel.org/netdev/cover.1767707500.git.zhuyikai1@h-partners.com/
V10: https://lore.kernel.org/netdev/cover.1767861236.git.zhuyikai1@h-partners.com/
====================

Link: https://patch.msgid.link/cover.1768375903.git.zhuyikai1@h-partners.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

hinic3: Add HW event handler

Add HINIC3_INIT_UP flags to trace netdev open status.
Add port module event handler.
Add link status event type(FAULT, PCIE link down, heart lost, mgmt
watchdog).

Co-developed-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Fan Gong <gongfan1@huawei.com>
Link: https://patch.msgid.link/53a2b928136998f740d597bbd45ca1740b95538f.1768375903.git.zhuyikai1@h-partners.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

hinic3: Add mac filter ops

Add ops to support unicast and multicast to filter mac address in
packets sending and receiving.

Co-developed-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Fan Gong <gongfan1@huawei.com>
Link: https://patch.msgid.link/9ea618ad5d6c404e222e9114e08e913a73177705.1768375903.git.zhuyikai1@h-partners.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

hinic3: Add adaptive IRQ coalescing with DIM

DIM offers a way to adjust the coalescing settings based
on load. As hinic3 rx and tx share interrupts, we only need
to base dim on rx stats.

Co-developed-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Fan Gong <gongfan1@huawei.com>
Link: https://patch.msgid.link/af96c20a836800a5972a09cdaf520029d976ad48.1768375903.git.zhuyikai1@h-partners.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

hinic3: Add .ndo_vlan_rx_add/kill_vid and .ndo_validate_addr

Implement following callback function:
.ndo_vlan_rx_add_vid
.ndo_vlan_rx_kill_vid
.ndo_validate_addr

Co-developed-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Fan Gong <gongfan1@huawei.com>
Link: https://patch.msgid.link/70be187afcaa7b38981d114c088ffdc2cba0b4f1.1768375903.git.zhuyikai1@h-partners.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

hinic3: Add .ndo_features_check

As we cannot solve packets with multiple stacked vlan, so we use
.ndo_features_check to check for these packets and return a smaller
feature without offload features.

Co-developed-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Fan Gong <gongfan1@huawei.com>
Link: https://patch.msgid.link/3879b20b7ffa20106a3f8f56dbf2d5eb389f260a.1768375903.git.zhuyikai1@h-partners.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

hinic3: Add .ndo_set_features and .ndo_fix_features

Implement following callback function:
.ndo_set_features
.ndo_fix_features

The .ndo_set_features function includes five features: rx_csum,
tso, lro, rx_cvlan and vlan_filter.
Add these new features in netdev_feature_init.

Co-developed-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Fan Gong <gongfan1@huawei.com>
Link: https://patch.msgid.link/682734a08fde421413048bf70057dafe3cbe8497.1768375903.git.zhuyikai1@h-partners.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

hinic3: Add .ndo_tx_timeout and .ndo_get_stats64

Implement following callback function:
.ndo_tx_timeout
.ndo_get_stats64

Use a work queue to trace tx_timeout callback and dump necessary
debug information.

Co-developed-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Fan Gong <gongfan1@huawei.com>
Link: https://patch.msgid.link/ec34d2ff9b142e1e142e47700714533baf7e659c.1768375903.git.zhuyikai1@h-partners.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

hinic3: Add PF management interfaces

Add management and communication pathways between PF and HW.

Co-developed-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Fan Gong <gongfan1@huawei.com>
Link: https://patch.msgid.link/447e72b1c5f255ea5d79ecf96c8dac58011e604d.1768375903.git.zhuyikai1@h-partners.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

hinic3: Add PF framework

Add support for PF framework based on the VF code.

Co-developed-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Fan Gong <gongfan1@huawei.com>
Link: https://patch.msgid.link/4796d6076bbad0f096eb10261ab3c7d989c5f7b7.1768375903.git.zhuyikai1@h-partners.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-mlx5e-save-per-channel-async-icosq-in-default'

Tariq Toukan says:

====================
net/mlx5e: Save per-channel async ICOSQ in default

This series by William reduces the default number of SQs in a channel
from 3 down to 2, by not creating the async ICOSQ (asynchronous
internal-communication-operations send-queue).

This significantly improves the latency of channel configuration
operations, like interface up (create channels), interface down (destroy
channels), and channels reconfiguration (create new set, destroy old
one).

This reduces the per-channel memory usage, saves hardware resources, in
addition to the improved latency.

This significantly speeds up the setup/config stage on systems with high
number of channels or many netdevs, in particular systems with hundreds
or K's of SFs.

The two remaining default SQs per channel after this series:
1 TXQ SQ (for traffic), and 1 ICOSQ (for internal communication
operations with the device).

Perf numbers:
NIC: Connect-X7.
Test: Latency of interface up + down operations.

Measured 20% speedup.
Saving ~0.36 sec for 248 channels (~1.45 msec per channel).
====================

Link: https://patch.msgid.link/1768376800-1607672-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Conditionally create async ICOSQ

The async ICOSQ is only required by TLS RX (for re-sync flow) and XSK
TX. Create it only when these features are enabled instead of always
allocating it. This reduces per-channel memory usage, saves hardware
resources, improves latency, and decreases the default number of SQs
(from 3 to 2) and CQs (from 4 to 3). It also speeds up channel
open/close operations for a netdev when async ICOSQ is not needed.

Currently when TLS RX is enabled, there is no channel reset triggered.
As a result, async ICOSQ allocation is not triggered, causing a NULL
pointer crash. One solution is to do channel reset every time when
toggling TLS RX. However, it's not straightforward as the offload
state matters only on connection creation, and can go on beyond the
channels reset.

Instead, introduce a new field 'ktls_rx_was_enabled': if TLS RX is
enabled for the first time: reset channels, create async ICOSQ, set
the field. From that point on, no need to reset channels for any TLS
RX enable/disable. Async ICOSQ will always be needed.

For XSK TX, async ICOSQ is used in wakeup control and is guaranteed
to have async ICOSQ allocated.

This improves the latency of interface up/down operations when it
applies.

Perf numbers:
NIC: Connect-X7.
Test: Latency of interface up + down operations.

Measured 20% speedup.
Saving ~0.36 sec for 248 channels (~1.45 msec per channel).

Signed-off-by: William Tu <witu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1768376800-1607672-5-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Move async ICOSQ to dynamic allocation

Dynamically allocate async ICOSQ. ICO (Internal Communication
Operations) is for driver to communicate with the HW, and it's
not used for traffic. Currently mlx5 driver has sync and async
ICO send queues. The async ICOSQ means that it's not necessarily
under NAPI context protection. The patch is in preparation for
the later patch to detect its usage and enable it when necessary.

Signed-off-by: William Tu <witu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1768376800-1607672-4-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Use regular ICOSQ for triggering NAPI

Before the cited commit, ICOSQ is used to post NOP WQE to trigger
hardware interrupt and start NAPI, but this mechanism suffers from
a race condition: mlx5e_alloc_rx_mpwqe may post UMR WQEs to ICOSQ
_before_ NOP WQE is posted. The cited commit fixes the issue by
replacing ICOSQ with async ICOSQ, as a new way to post the NOP WQE
to trigger the hardware interrupt and NAPI.

The patch changes it back by replacing async ICOSQ with regular
ICOSQ, for the purpose of saving memory in later patches, and solves
the issue by adding a new SQ state, MLX5E_SQ_STATE_LOCK_NEEDED
for syncing the start of NAPI.

What it does:
- Switch trigger path from async ICOSQ to regular ICOSQ to reduce
  need for async SQ.
- Introduce MLX5E_SQ_STATE_LOCK_NEEDED and mlx5e_icosq_sync_lock(),
  unlock() to prevent the race where UMR WQEs could be posted before
  the NOP WQE used to trigger NAPI.
- Use synchronize_net() once per trigger cycle to quiesce in-flight
  softirqs before serializing the NOP WQE and any UMR postings via
  the ICOSQ lock.
- Wrap ICOSQ UMR posting in en_rx.c and xsk/rx.c with the new
  conditional lock.

The conditional locking approach is critical for performance: always
locking would impose unnecessary overhead. Synchronization is not needed
between regular NAPI cycles once the channel is activated and running.
The lock is only required to protect against the race during channel
activation—specifically, when the very first NOP WQE is posted to trigger
NAPI. After that initial trigger, normal NAPI polling handles subsequent
work without contention. The MLX5E_SQ_STATE_LOCK_NEEDED flag ensures we
pay the synchronization cost only when necessary.

Signed-off-by: William Tu <witu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1768376800-1607672-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Move async ICOSQ lock into ICOSQ struct

Move the async_icosq spinlock from the mlx5e_channel structure into
the mlx5e_icosq structure itself for better encapsulation and for
later patch to also use it for other icosq use cases.

Changes:
- Add spinlock_t lock field to struct mlx5e_icosq
- Remove async_icosq_lock field from struct mlx5e_channel
- Initialize the new lock in mlx5e_open_icosq()
- Update all lock usage in ktls_rx.c and en_main.c to use sq->lock
instead of c->async_icosq_lock

Signed-off-by: William Tu <witu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1768376800-1607672-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeon_ep: reset firmware ready status

Add support to reset firmware ready status
when the driver is removed(either in unload
or unbind)

Signed-off-by: Sathesh Edara <sedara@marvell.com>
Signed-off-by: Shinas Rasheed <srasheed@marvell.com>
Signed-off-by: Vimlesh Kumar <vimleshk@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260115092048.870237-1-vimleshk@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-thunderbolt-various-improvements'

Mika Westerberg says:

====================
net: thunderbolt: Various improvements

This series improves the Thunderbolt networking driver so that it should
work with the bonding driver.

The discussion that started this patch series can be read below:

https://lore.kernel.org/netdev/CAFJzfF9N4Hak23sc-zh0jMobbkjK7rg4odhic1DQ1cC+=MoQoA@mail.gmail.com/

v2: https://lore.kernel.org/20260109122606.3586895-1-mika.westerberg@linux.intel.com
v1: https://lore.kernel.org/20251127131521.2580237-1-mika.westerberg@linux.intel.com
====================

Link: https://patch.msgid.link/20260115115646.328898-1-mika.westerberg@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: thunderbolt: Allow reading link settings

In order to use Thunderbolt networking as part of bonding device it
needs to support ->get_link_ksettings() ethtool operation, so that the
bonding driver can read the link speed and the related attributes. Add
support for this to the driver.

Signed-off-by: Ian MacDonald <ian@netstatz.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Link: https://patch.msgid.link/20260115115646.328898-5-mika.westerberg@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bonding: 3ad: Add support for SPEED_80000

Add support for ethtool SPEED_80000. This is needed to allow
Thunderbolt/USB4 networking driver to be used with the bonding driver.

Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260115115646.328898-4-mika.westerberg@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethtool: Add support for 80Gbps speed

USB4 v2 link used in peer-to-peer networking is symmetric 80Gbps so in
order to support reading this link speed, add support for it to ethtool.

Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260115115646.328898-3-mika.westerberg@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: thunderbolt: Allow changing MAC address of the device

The MAC address we use is based on a suggestion in the USB4 Inter-domain
spec but it is not really used in the USB4NET protocol. It is more
targeted for the upper layers of the network stack. There is no reason
why it should not be changed by the userspace for example if needed for
bonding.

Reported-by: Ian MacDonald <ian@netstatz.com>
Closes: https://lore.kernel.org/netdev/CAFJzfF9N4Hak23sc-zh0jMobbkjK7rg4odhic1DQ1cC+=MoQoA@mail.gmail.com/
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Link: https://patch.msgid.link/20260115115646.328898-2-mika.westerberg@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'dpll-support-mode-switching'

Ivan Vecera says:

====================
dpll: support mode switching

This series adds support for switching the working mode (automatic vs
manual) of a DPLL device via netlink.

Currently, the DPLL subsystem allows userspace to retrieve the current
working mode but lacks the mechanism to configure it. Userspace is also
unaware of which modes a specific device actually supports, as it
currently assumes only the active mode is supported.

The series addresses these limitations by:
1. Introducing .supported_modes_get() callback to allow drivers to report
   all modes capable of running on the device.
2. Introducing .mode_set() callback and updating the netlink policy
   to allow userspace to request a mode change.
3. Implementing these callbacks in the zl3073x driver, enabling dynamic
   switching between automatic and manual modes.
====================

Link: https://patch.msgid.link/20260114122726.120303-1-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpll: zl3073x: Implement device mode setting support

Add support for .supported_modes_get() and .mode_set() callbacks
to enable switching between manual and automatic modes via netlink.

Implement .supported_modes_get() to report available modes based
on the current hardware configuration:

* manual mode is always supported
* automatic mode is supported unless the dpll channel is configured
  in NCO (Numerically Controlled Oscillator) mode

Implement .mode_set() to handle the specific logic required when
transitioning between modes:

1) Transition to manual:
* If a valid reference is currently active, switch the hardware
  to ref-lock mode (force lock to that reference).
* If no reference is valid and the DPLL is unlocked, switch to freerun.
* Otherwise, switch to Holdover.

2) Transition to automatic:
* If the currently selected reference pin was previously marked
  as non-selectable (likely during a previous manual forcing
  operation), restore its priority and selectability in the hardware.
* Switch the hardware to Automatic selection mode.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Prathosh Satish <Prathosh.Satish@microchip.com>
Link: https://patch.msgid.link/20260114122726.120303-4-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpll: add dpll_device op to set working mode

Currently, userspace can retrieve the DPLL working mode but cannot
configure it. This prevents changing the device operation, such as
switching from manual to automatic mode and vice versa.

Add a new callback .mode_set() to struct dpll_device_ops. Extend
the netlink policy and device-set command handling to process
the DPLL_A_MODE attribute. Update the netlink YAML specification
to include the mode attribute in the device-set operation.

Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Link: https://patch.msgid.link/20260114122726.120303-3-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpll: add dpll_device op to get supported modes

Currently, the DPLL subsystem assumes that the only supported mode is
the one currently active on the device. When dpll_msg_add_mode_supported()
is called, it relies on ops->mode_get() and reports that single mode
to userspace. This prevents users from discovering other modes the device
might be capable of.

Add a new callback .supported_modes_get() to struct dpll_device_ops. This
allows drivers to populate a bitmap indicating all modes supported by
the hardware.

Update dpll_msg_add_mode_supported() to utilize this new callback:

* if ops->supported_modes_get is defined, use it to retrieve the full
  bitmap of supported modes.
* if not defined, fall back to the existing behavior: retrieve
  the current mode via ops->mode_get and set the corresponding bit
  in the bitmap.

Finally, iterate over the bitmap and add a DPLL_A_MODE_SUPPORTED netlink
attribute for every set bit, accurately reporting the device's capabilities
to userspace.

Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Link: https://patch.msgid.link/20260114122726.120303-2-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: csum: Fix printk format in recv_get_packet_csum_status()

Following warning is encountered when building selftests on powerpc/32.

  CC       csum
csum.c: In function 'recv_get_packet_csum_status':
csum.c:710:50: warning: format '%lu' expects argument of type 'long unsigned int', but argument 4 has type 'size_t' {aka 'unsigned int'} [-Wformat=]
  710 |                         error(1, 0, "cmsg: len=%lu expected=%lu",
      |                                                ~~^
      |                                                  |
      |                                                  long unsigned int
      |                                                %u
  711 |                               cm->cmsg_len, CMSG_LEN(sizeof(struct tpacket_auxdata)));
      |                               ~~~~~~~~~~~~
      |                                 |
      |                                 size_t {aka unsigned int}
csum.c:710:63: warning: format '%lu' expects argument of type 'long unsigned int', but argument 5 has type 'unsigned int' [-Wformat=]
  710 |                         error(1, 0, "cmsg: len=%lu expected=%lu",
      |                                                             ~~^
      |                                                               |
      |                                                               long unsigned int
      |                                                             %u

cm->cmsg_len has type __kernel_size_t and CMSG() macro has the type
returned by sizeof() which is size_t.

size_t is 'unsigned int' on some platforms and 'unsigned long' on
other ones so use %zu instead of %lu.

The code in question was introduced by
commit 91a7de85600d ("selftests/net: add csum offload test").

Signed-off-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/8b69b40826553c1dd500d9d25e45883744f3f348.1768556791.git.chleroy@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'dsa-mxl-gsw1xx-support-r-g-mii-slew-rate-configuration'

Alexander Sverdlin says:

====================
dsa: mxl-gsw1xx: Support R(G)MII slew rate configuration

Maxlinear GSW1xx switches offer slew rate configuration bits for R(G)MII
interface. The default state of the configuration bits is "normal", while
"slow" can be used to reduce the radiated emissions. Add the support for
the latter option into the driver as well as the new DT bindings.
====================

Link: https://patch.msgid.link/20260114104509.618984-1-alexander.sverdlin@siemens.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: mxl-gsw1xx: Support R(G)MII slew rate configuration

Support newly introduced maxlinear,slew-rate-txc and
maxlinear,slew-rate-txd device tree properties to configure R(G)MII
interface pins' slew rate. It might be used to reduce the radiated
emissions.

Reviewed-by: Daniel Golle <daniel@makrotopia.org>
Tested-by: Daniel Golle <daniel@makrotopia.org>
Signed-off-by: Alexander Sverdlin <alexander.sverdlin@siemens.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://patch.msgid.link/20260114104509.618984-3-alexander.sverdlin@siemens.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: dsa: lantiq,gswip: add MaxLinear R(G)MII slew rate

Add new maxlinear,slew-rate-txc and maxlinear,slew-rate-txd uint32
properties. The properties are only applicable for ports in R(G)MII mode
and allow for slew rate reduction in comparison to "normal" default
configuration with the purpose to reduce radiated emissions.

Signed-off-by: Alexander Sverdlin <alexander.sverdlin@siemens.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20260114104509.618984-2-alexander.sverdlin@siemens.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'ipv6-more-data-race-annotations'

Eric Dumazet says:

====================
ipv6: more data-race annotations

Inspired by one unrelated syzbot report.

This series adds missing (and boring) data-race annotations in IPv6.

Only the first patch adds sysctl_ipv6_flowlabel group
to speedup ip6_make_flowlabel() a bit.
====================

Link: https://patch.msgid.link/20260115094141.3124990-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: annotate data-races in net/ipv6/route.c

sysctls are read while their values can change,
add READ_ONCE() annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260115094141.3124990-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: exthdrs: annotate data-race over multiple sysctl

Following four sysctls can change under us, add missing READ_ONCE().

- ipv6.sysctl.max_dst_opts_len
- ipv6.sysctl.max_dst_opts_cnt
- ipv6.sysctl.max_hbh_opts_len
- ipv6.sysctl.max_hbh_opts_cnt

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260115094141.3124990-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: annotate data-races around sysctl.ip6_rt_gc_interval

Add READ_ONCE() on lockless reads of net->ipv6.sysctl.ip6_rt_gc_interval

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260115094141.3124990-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: annotate data-races over sysctl.flowlabel_reflect

Add missing READ_ONCE() when reading ipv6.sysctl.flowlabel_reflect,
as its value can be changed under us.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260115094141.3124990-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: annotate data-races in ip6_multipath_hash_{policy,fields}()

Add missing READ_ONCE() when reading sysctl values.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260115094141.3124990-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: annotate date-race in ipv6_can_nonlocal_bind()

Add a missing READ_ONCE(), and add const qualifiers to the two parameters.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260115094141.3124990-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: annotate data-races from ip6_make_flowlabel()

Use READ_ONCE() to read sysctl values in ip6_make_flowlabel()
and ip6_make_flowlabel()

Add a const qualifier to 'struct net' parameters.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260115094141.3124990-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: add sysctl_ipv6_flowlabel group

Group together following struct netns_sysctl_ipv6 fields:

- flowlabel_consistency
- auto_flowlabels
- flowlabel_state_ranges

After this patch, ip6_make_flowlabel() uses a single cache line to fetch
auto_flowlabels and flowlabel_state_ranges (instead of two before the patch).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260115094141.3124990-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-convert-drivers-to-get_rx_ring_count-part-2'

Breno Leitao says:

====================
net: convert drivers to .get_rx_ring_count (part 2)

Commit 84eaf4359c36 ("net: ethtool: add get_rx_ring_count callback to
optimize RX ring queries") added specific support for GRXRINGS callback,
simplifying .get_rxnfc.

Remove the handling of GRXRINGS in .get_rxnfc() by moving it to the new
.get_rx_ring_count().

This simplifies the RX ring count retrieval and aligns the following
drivers with the new ethtool API for querying RX ring parameters.
  * engleder/tsnep
  * mediatek
  * amazon/ena
  * microchip/lan743x
  * amd/xgbe
  * chelsio/cxgb4
  * wangxun/txgbe
  * cadence/macb

All of these change were compile-tested only.
====================

Link: https://patch.msgid.link/20260115-grxring_big_v2-v1-0-b3e1b58bced5@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: txgbe: convert to use .get_rx_ring_count

Use the newly introduced .get_rx_ring_count ethtool ops callback instead
of handling ETHTOOL_GRXRINGS directly in .get_rxnfc().

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260115-grxring_big_v2-v1-9-b3e1b58bced5@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: macb: convert to use .get_rx_ring_count

Use the newly introduced .get_rx_ring_count ethtool ops callback instead
of handling ETHTOOL_GRXRINGS directly in .get_rxnfc().

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260115-grxring_big_v2-v1-8-b3e1b58bced5@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: cxgb4: convert to use .get_rx_ring_count

Use the newly introduced .get_rx_ring_count ethtool ops callback instead
of handling ETHTOOL_GRXRINGS directly in .get_rxnfc().

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260115-grxring_big_v2-v1-7-b3e1b58bced5@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: xgbe: convert to use .get_rx_ring_count

Use the newly introduced .get_rx_ring_count ethtool ops callback instead
of handling ETHTOOL_GRXRINGS directly in .get_rxnfc().

Since ETHTOOL_GRXRINGS was the only command handled by xgbe_get_rxnfc(),
remove the function entirely.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260115-grxring_big_v2-v1-6-b3e1b58bced5@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: lan743x: convert to use .get_rx_ring_count

Use the newly introduced .get_rx_ring_count ethtool ops callback instead
of handling ETHTOOL_GRXRINGS directly in .get_rxnfc().

Since ETHTOOL_GRXRINGS was the only command handled by
lan743x_ethtool_get_rxnfc(), remove the function entirely.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260115-grxring_big_v2-v1-5-b3e1b58bced5@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ena: convert to use .get_rx_ring_count

Use the newly introduced .get_rx_ring_count ethtool ops callback instead
of handling ETHTOOL_GRXRINGS directly in .get_rxnfc().

Since ETHTOOL_GRXRINGS was the only useful command handled by
ena_get_rxnfc(), remove the function entirely.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Arthur Kiyanovski <akiyano@amazon.com>
Link: https://patch.msgid.link/20260115-grxring_big_v2-v1-4-b3e1b58bced5@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mediatek: convert to use .get_rx_ring_count

Use the newly introduced .get_rx_ring_count ethtool ops callback instead
of handling ETHTOOL_GRXRINGS directly in .get_rxnfc().

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260115-grxring_big_v2-v1-3-b3e1b58bced5@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: tsnep: convert to use .get_rx_ring_count

Use the newly introduced .get_rx_ring_count ethtool ops callback instead
of handling ETHTOOL_GRXRINGS directly in .get_rxnfc().

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260115-grxring_big_v2-v1-2-b3e1b58bced5@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'selftests-net-improve-error-handling-in-passive-tfo-test'

Yohei Kojima says:

====================
selftests: net: improve error handling in passive TFO test

This series improves error handling in the passive TFO test by (1)
fixing a broken behavior when the child processes failed (or timed out),
and (2) adding more error handlng code in the test program.

The first patch fixes the behavior that the test didn't report failure
even if the server or the client process exited with non-zero status.
The second patch adds error handling code in the test program to improve
reliability of the test.
====================

Link: https://patch.msgid.link/cover.1768312014.git.yk@y-koj.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: improve error handling in passive TFO test

Improve the error handling in passive TFO test to check the return value
from sendto(), and to fail if read() or fprintf() failed.

Signed-off-by: Yohei Kojima <yk@y-koj.net>
Link: https://patch.msgid.link/24707c8133f7095c0e5a94afa69e75c3a80bf6e7.1768312014.git.yk@y-koj.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: fix passive TFO test to fail if child processes failed

Improve the passive TFO test to report failure if the server or the
client timed out or exited with non-zero status.

Before this commit, TFO test didn't fail even if exit(EXIT_FAILURE) is
added to the first line of the run_server() and run_client() functions.

Signed-off-by: Yohei Kojima <yk@y-koj.net>
Link: https://patch.msgid.link/214d399caec2e5de7738ced5736829915d507e4e.1768312014.git.yk@y-koj.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-phy-realtek-simplify-and-reunify-c22-c45-drivers'

Daniel Golle says:

====================
net: phy: realtek: simplify and reunify C22/C45 drivers

The RTL8221B PHY variants (VB-CG and VM-CG) were previously split into
separate C22 and C45 driver instances to support copper SFP modules
using the RollBall MDIO-over-I2C protocol, which only supports Clause-45
access. However, this split created significant code duplication and
complexity.

Commit 8af2136e77989 ("net: phy: realtek: add helper
RTL822X_VND2_C22_REG") exposed that RealTek PHYs map all standard
Clause-22 registers into MDIO_MMD_VEND2 at offset 0xa400.

With commit 1850ec20d6e71 ("net: phy: realtek: use paged access for
MDIO_MMD_VEND2 in C22 mode") it is now possible to access all MMD
registers transparently, regardless of whether the PHY is accessed via
C22 or C45 MDIO.

Further improve the translation logic for this register mapping, so a
single unified driver works efficiently with both access methods,
reducing code duplication.

The series also includes cleanup to remove unnecessary paged operations
on registers that aren't actually affected by page selection.

Testing was done on RTL8211F and RTL8221B-VB-CG (the latter in both
C22 and C45 modes).
====================

Link: https://patch.msgid.link/cover.1768275364.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: realtek: simplify bogus paged operations

Only registers 0x10~0x17 are affected by the value in the page
selection register 0x1f. Hence there is no point in using paged
operations when accessing any other registers.
Simplify the driver by using the normal phy_read and phy_write
operations for registers which are anyway not affected by paging.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/0c5cbb66ce3e72a011d76f8c3d61ebcac44483bb.1768275364.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: realtek: demystify PHYSR register location

Turns out that register address RTL_VND2_PHYSR (0xa434) maps to
Clause-22 register MII_RESV2. Use that to get rid of yet another magic
number, and rename access macros accordingly.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/6ed246e0aa3ca8038d2fa432d51518959fb89b6b.1768275364.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: realtek: reunify C22 and C45 drivers

Reunify the split C22/C45 drivers for the RTL8221B-VB-CG 2.5Gbps and
RTL8221B-VM-CG 2.5Gbps PHYs back into a single driver.

This is possible now by using all the driver operations previously used
by the C45 driver, as transparent access to all MMDs including
MDIO_MMD_VEND2 is now possible also over Clause-22 MDIO.

The unified driver will still only use Clause-45 access on any Clause-45
capable busses while still working fine on Clause-22 busses.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/bffcb85fdc20e07056976962d3caaa1be5d0ddb0.1768275364.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>