]> git.ipfire.org Git - thirdparty/kernel/linux.git/log
thirdparty/kernel/linux.git
7 weeks agonetkit: Add netkit notifier to check for unregistering devices
Daniel Borkmann [Thu, 2 Apr 2026 23:10:29 +0000 (01:10 +0200)] 
netkit: Add netkit notifier to check for unregistering devices

Add a netdevice notifier in netkit to watch for NETDEV_UNREGISTER events.
If the target device is indeed NETREG_UNREGISTERING and previously leased
a queue to a netkit device, then collect the related netkit devices and
batch-unregister_netdevice_many() them.

If this were not done, then the netkit device would hold a reference on
the physical device preventing it from going away. However, in case of
both io_uring zero-copy as well as AF_XDP this situation is handled
gracefully and the allocated resources are torn down.

In the case where mentioned infra is used through netkit, the applications
have a reference on netkit, and netkit in turn holds a reference on the
physical device. In order to have netkit release the reference on the
physical device, we need such watcher to then unregister the netkit ones.

This is generally quite similar to the dependency handling in case of
tunnels (e.g. vxlan bound to a underlying netdev) where the tunnel device
gets removed along with the physical device.

  # ip a
  [...]
  4: enp10s0f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000
      link/ether e8:eb:d3:a3:43:f6 brd ff:ff:ff:ff:ff:ff
      inet 10.0.0.2/24 scope global enp10s0f0np0
         valid_lft forever preferred_lft forever
  [...]
  8: nk@NONE: <BROADCAST,MULTICAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
      link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
  [...]

  # rmmod mlx5_ib
  # rmmod mlx5_core
  [...]
  [  309.261822] mlx5_core 0000:0a:00.0 mlx5_0: Port: 1 Link DOWN
  [  344.235236] mlx5_core 0000:0a:00.1: E-Switch: Unload vfs: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
  [  344.246948] mlx5_core 0000:0a:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
  [  344.463754] mlx5_core 0000:0a:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
  [  344.770155] mlx5_core 0000:0a:00.1: E-Switch: cleanup
  [...]

  # ip a
  [...]
  [ both enp10s0f0np0 and nk gone ]
  [...]

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-13-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonetkit: Implement rtnl_link_ops->alloc and ndo_queue_create
David Wei [Thu, 2 Apr 2026 23:10:28 +0000 (01:10 +0200)] 
netkit: Implement rtnl_link_ops->alloc and ndo_queue_create

Implement rtnl_link_ops->alloc that allows the number of rx queues to be
set when netkit is created. By default, netkit has only a single rxq (and
single txq). The number of queues is deliberately not allowed to be changed
via ethtool -L and is fixed for the lifetime of a netkit instance.

For netkit device creation, numrxqueues with larger than one rxq can be
specified. These rxqs are leasable to real rxqs in physical netdevs:

  ip link add type netkit peer numrxqueues 64      # for device pair
  ip link add numrxqueues 64 type netkit single    # for single device

The limit of numrxqueues for netkit is currently set to 1024, which allows
leasing multiple real rxqs from physical netdevs.

The implementation of ndo_queue_create() adds a new rxq during the queue
lease operation. We allow to create queues either in single device mode
or for the case of dual device mode for the netkit peer device which gets
placed into the target network namespace. For dual device mode the lease
against the primary device does not make sense for the targeted use cases,
and therefore gets rejected.

We also need to add a lockdep class for netkit, such that lockdep does
not trip over us, similarly done as in commit 0bef512012b1 ("net: add
netdev_lockdep_set_classes() to virtual drivers").

This is also the last missing bit to netkit for supporting io_uring with
zero-copy mode [0]. Up until this point it was not possible to consume the
latter out of containers or Kubernetes Pods where applications are in their
own network namespace.

io_uring example with eth0 being a physical device with 16 queues where
netkit is bound to the last queue, iou-zcrx.c is binary from selftests;
ethtool configuration (tcp-data-split, hds_thresh, RSS, flow steering)
is done on the physical device by the control plane; here, flow steering
to that queue is based on the service VIP:port of the server utilizing
io_uring:

  # ethtool -X eth0 start 0 equal 15
  # ethtool -X eth0 start 15 equal 1 context new
  # ethtool --config-ntuple eth0 flow-type tcp4 dst-ip 1.2.3.4 dst-port 5000 action 15
  # ip netns add foo
  # ip link add type netkit peer numrxqueues 2
  # ynl --family netdev --output-json --do queue-create \
        --json "{"ifindex": $(ifindex nk0), "type": "rx", \
                 "lease": { "ifindex": $(ifindex eth0), \
                            "queue": { "type": "rx", "id": 15 } } }"
  {'id': 1}
  # ip link set nk0 netns foo
  # ip link set nk1 up
  # ip netns exec foo ip link set lo up
  # ip netns exec foo ip link set nk0 up
  # ip netns exec foo ip addr add 1.2.3.4/32 dev nk0
  [ ... setup routing etc to get external traffic into the netns ... ]
  # ip netns exec foo ./iou-zcrx -s -p 5000 -i nk0 -q 1

Remote io_uring client:

  # ./iou-zcrx -c -h 1.2.3.4 -p 5000 -l 12840 -z 65536

We have tested the above against a Broadcom BCM957504 (bnxt_en) 100G NIC,
supporting TCP header/data split.

Similarly, this also works for devmem which we tested using ncdevmem:

  # ip netns exec foo ./ncdevmem -s 1.2.3.4 -l -p 5000 -f nk0 -t 1 -q 1

And on the remote client:

  # ./ncdevmem -s 1.2.3.4 -p 5000 -f eth0

For Cilium, the plan is to open up support for the various memory providers
for regular Kubernetes Pods when Cilium is configured with netkit datapath
mode.

Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://kernel-recipes.org/en/2024/schedule/efficient-zero-copy-networking-using-io_uring
Link: https://patch.msgid.link/20260402231031.447597-12-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonetkit: Add single device mode for netkit
Daniel Borkmann [Thu, 2 Apr 2026 23:10:27 +0000 (01:10 +0200)] 
netkit: Add single device mode for netkit

Add a single device mode for netkit instead of netkit pairs. The primary
target for the paired devices is to connect network namespaces, of course,
and support has been implemented in projects like Cilium [0]. For the rxq
leasing the plan is to support two main scenarios related to single device
mode:

* For the use-case of io_uring zero-copy, the control plane can either
  set up a netkit pair where the peer device can perform rxq leasing which
  is then tied to the lifetime of the peer device, or the control plane
  can use a regular netkit pair to connect the hostns to a Pod/container
  and dynamically add/remove rxq leasing through a single device without
  having to interrupt the device pair. In the case of io_uring, the memory
  pool is used as skb non-linear pages, and thus the skb will go its way
  through the regular stack into netkit. Things like the netkit policy when
  no BPF is attached or skb scrubbing etc apply as-is in case the paired
  devices are used, or if the backend memory is tied to the single device
  and traffic goes through a paired device.

* For the use-case of AF_XDP, the control plane needs to use netkit in the
  single device mode. The single device mode currently enforces only a
  pass policy when no BPF is attached, and does not yet support BPF link
  attachments for AF_XDP. skbs sent to that device get dropped at the
  moment. Given AF_XDP operates at a lower layer of the stack tying this
  to the netkit pair did not make sense. In future, the plan is to allow
  BPF at the XDP layer which can: i) process traffic coming from the AF_XDP
  application (e.g. QEMU with AF_XDP backend) to filter egress traffic or
  to push selected egress traffic up to the single netkit device to the
  local stack (e.g. DHCP requests), and ii) vice-versa skbs sent to the
  single netkit into the AF_XDP application (e.g. DHCP replies). Also,
  the control-plane can dynamically manage rxq leasing for the single
  netkit device without having to interrupt (e.g. down/up cycle) the main
  netkit pair for the Pod which has traffic going in and out.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Jordan Rife <jordan@jrife.io>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://docs.cilium.io/en/stable/operations/performance/tuning/#netkit-device-mode
Link: https://patch.msgid.link/20260402231031.447597-11-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoxsk: Proxy pool management for leased queues
Daniel Borkmann [Thu, 2 Apr 2026 23:10:26 +0000 (01:10 +0200)] 
xsk: Proxy pool management for leased queues

Similarly to the netif_mp_{open,close}_rxq handling for leased queues, proxy
the xsk_{reg,clear}_pool_at_qid via netif_get_rx_queue_lease_locked such
that in case a virtual netdev picked a leased rxq, the request gets through
to the real rxq in the physical netdev. The proxying is only relevant for
queue_id < dev->real_num_rx_queues since right now it's only supported for
rxqs.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-10-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoxsk: Extend xsk_rcv_check validation
Daniel Borkmann [Thu, 2 Apr 2026 23:10:25 +0000 (01:10 +0200)] 
xsk: Extend xsk_rcv_check validation

xsk_rcv_check tests for inbound packets to see whether they match
the bound AF_XDP socket. Refactor the test into a small helper
xsk_dev_queue_valid and move the validation against xs->dev and
xs->queue_id there.

The fast-path case stays in place and allows for quick return in
xsk_dev_queue_valid. If it fails, the validation is extended to
check whether the AF_XDP socket is bound against a leased queue,
and if so, the test is redone.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-9-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: Proxy netdev_queue_get_dma_dev for leased queues
David Wei [Thu, 2 Apr 2026 23:10:24 +0000 (01:10 +0200)] 
net: Proxy netdev_queue_get_dma_dev for leased queues

Extend netdev_queue_get_dma_dev to return the physical device of the
real rxq for DMA in case the queue was leased. This allows memory
providers like io_uring zero-copy or devmem to bind to the physically
leased rxq via virtual devices such as netkit.

Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-8-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: Proxy netif_mp_{open,close}_rxq for leased queues
David Wei [Thu, 2 Apr 2026 23:10:23 +0000 (01:10 +0200)] 
net: Proxy netif_mp_{open,close}_rxq for leased queues

When a process in a container wants to setup a memory provider, it will
use the virtual netdev and a leased rxq, and call netif_mp_{open,close}_rxq
to try and restart the queue. At this point, proxy the queue restart on
the real rxq in the physical netdev.

For memory providers (io_uring zero-copy rx and devmem), it causes the
real rxq in the physical netdev to be filled from a memory provider that
has DMA mapped memory from a process within a container.

Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-7-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: Slightly simplify net_mp_{open,close}_rxq
Daniel Borkmann [Thu, 2 Apr 2026 23:10:22 +0000 (01:10 +0200)] 
net: Slightly simplify net_mp_{open,close}_rxq

net_mp_open_rxq is currently not used in the tree as all callers are
using __net_mp_open_rxq directly, and net_mp_close_rxq is only used
once while all other locations use __net_mp_close_rxq.

Consolidate into a single API, netif_mp_{open,close}_rxq, using the
netif_ prefix to indicate that the caller is responsible for locking.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-6-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet, ethtool: Disallow leased real rxqs to be resized
Daniel Borkmann [Thu, 2 Apr 2026 23:10:21 +0000 (01:10 +0200)] 
net, ethtool: Disallow leased real rxqs to be resized

Similar to AF_XDP, do not allow queues in a physical netdev to be resized
by ethtool -L when they are leased. Cover channel resize paths (both
netlink and ioctl) to reject resizing when the queues would be affected.

Given we need to have different checks for RX vs TX, detangle the code into
a two-loop version rather than the range of new_combined + min(new_rx, new_tx)
to old_combined + max(old_rx, old_tx).

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-5-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: Add lease info to queue-get response
Daniel Borkmann [Thu, 2 Apr 2026 23:10:20 +0000 (01:10 +0200)] 
net: Add lease info to queue-get response

Populate nested lease info to the queue-get response that returns the
ifindex, queue id with type and optionally netns id if the device
resides in a different netns.

Example with ynl client when using AF_XDP via queue leasing:

  # ip a
  [...]
  4: enp10s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp/id:24 qdisc mq state UP group default qlen 1000
    link/ether e8:eb:d3:a3:43:f6 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.2/24 scope global enp10s0f0np0
       valid_lft forever preferred_lft forever
    inet6 fe80::eaeb:d3ff:fea3:43f6/64 scope link proto kernel_ll
       valid_lft forever preferred_lft forever
  [...]

  # ethtool -i enp10s0f0np0
  driver: mlx5_core
  [...]

  # ynl --family netdev --output-json --do queue-get \
        --json '{"ifindex": 4, "id": 15, "type": "rx"}'
  {'id': 15,
   'ifindex': 4,
   'lease': {'ifindex': 8, 'netns-id': 0, 'queue': {'id': 1, 'type': 'rx'}},
   'napi-id': 8227,
   'type': 'rx',
   'xsk': {}}

  # ip netns list
  foo (id: 0)

  # ip netns exec foo ip a
  [...]
  8: nk@NONE: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
      link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
      inet6 fe80::200:ff:fe00:0/64 scope link proto kernel_ll
         valid_lft forever preferred_lft forever
  [...]

  # ip netns exec foo ethtool -i nk
  driver: netkit
  [...]

  # ip netns exec foo ls /sys/class/net/nk/queues/
  rx-0  rx-1  tx-0

  # ip netns exec foo ynl --family netdev --output-json --do queue-get \
        --json '{"ifindex": 8, "id": 1, "type": "rx"}'
  {"id": 1, "type": "rx", "ifindex": 8, "xsk": {}}

Note that the caller of netdev_nl_queue_fill_one() holds the netdevice
lock. For the queue-get we do not lock both devices. When queues get
{un,}leased, both devices are locked, thus if __netif_get_rx_queue_lease()
returns a lease pointer, it points to a valid device. The netns-id is
fetched via peernet2id_alloc() similarly as done in OVS.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-4-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: Implement netdev_nl_queue_create_doit
Daniel Borkmann [Thu, 2 Apr 2026 23:10:19 +0000 (01:10 +0200)] 
net: Implement netdev_nl_queue_create_doit

Implement netdev_nl_queue_create_doit which creates a new rx queue in a
virtual netdev and then leases it to a rx queue in a physical netdev.

Example with ynl client:

  # ynl --family netdev --output-json --do queue-create \
        --json '{"ifindex": 8, "type": "rx", "lease": {"ifindex": 4, "queue": {"type": "rx", "id": 15}}}'
  {'id': 1}

Note that the netdevice locking order is always from the virtual to
the physical device.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402231031.447597-3-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: Add queue-create operation
Daniel Borkmann [Thu, 2 Apr 2026 23:10:18 +0000 (01:10 +0200)] 
net: Add queue-create operation

Add a ynl netdev family operation called queue-create that creates a
new queue on a netdevice:

      name: queue-create
      attribute-set: queue
      flags: [admin-perm]
      do:
        request:
          attributes:
            - ifindex
            - type
            - lease
        reply: &queue-create-op
          attributes:
            - id

This is a generic operation such that it can be extended for various
use cases in future. Right now it is mandatory to specify ifindex,
the queue type which is enforced to rx and a lease. The newly created
queue id is returned to the caller.

A queue from a virtual device can have a lease which refers to another
queue from a physical device. This is useful for memory providers
and AF_XDP operations which take an ifindex and queue id to allow
applications to bind against virtual devices in containers. The lease
couples both queues together and allows to proxy the operations from
a virtual device in a container to the physical device.

In future, the nested lease attribute can be lifted and made optional
for other use-cases such as dynamic queue creation for physical
netdevs. The lack of lease and the specification of the physical
device as an ifindex will imply that we need a real queue to be
allocated. Similarly, the queue type enforcement to rx can then be
lifted as well to support tx.

An early implementation had only driver-specific integration [0], but
in order for other virtual devices to reuse, it makes sense to have
this as a generic API in core net.

For leasing queues, the virtual netdev must have real_num_rx_queues
less than num_rx_queues at the time of calling queue-create. The
queue-type must be rx as only rx queues are supported for leasing
for now. We also enforce that the queue-create ifindex must point
to a virtual device, and that the nested lease attribute's ifindex
must point to a physical device. The nested lease attribute set
contains a netns-id attribute which is optional and can specify a
netns-id relative to the caller's netns. It requires cap_net_admin
and if the netns-id attribute is not specified, the lease ifindex
will be retrieved from the current netns. Also, it is modeled as
an s32 type similarly as done elsewhere in the stack.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://bpfconf.ebpf.io/bpfconf2025/bpfconf2025_material/lsfmmbpf_2025_netkit_borkmann.pdf
Link: https://patch.msgid.link/20260402231031.447597-2-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoMerge branch 'r8152-add-support-for-the-rtl8157-5gbit-usb-ethernet-chip'
Paolo Abeni [Thu, 9 Apr 2026 10:16:47 +0000 (12:16 +0200)] 
Merge branch 'r8152-add-support-for-the-rtl8157-5gbit-usb-ethernet-chip'

Birger Koblitz says:

====================
r8152: Add support for the RTL8157 5Gbit USB Ethernet chip

Add support for the RTL8157, which is a 5GBit USB-Ethernet adapter
chip in the RTL815x family of chips.

The RTL8157 uses a different frame descriptor format, and different
SRAM/ADV access methods, plus offers 5GBit/s Ethernet, so support for these
features is added in addition to chip initialization and configuration.

The module was tested with an OEM RTL8157 USB adapter:
[25758.328238] usb 4-1: new SuperSpeed Plus Gen 2x1 USB device number 2 using xhci_hcd
[25758.345565] usb 4-1: New USB device found, idVendor=0bda, idProduct=8157, bcdDevice=30.00
[25758.345585] usb 4-1: New USB device strings: Mfr=1, Product=2, SerialNumber=7
[25758.345593] usb 4-1: Product: USB 10/100/1G/2.5G/5G LAN
[25758.345599] usb 4-1: Manufacturer: Realtek
[25758.345605] usb 4-1: SerialNumber: 000300E04C68xxxx
[25758.534241] r8152-cfgselector 4-1: reset SuperSpeed Plus Gen 2x1 USB device number 2 using xhci_hcd
[25758.603511] r8152 4-1:1.0: skip request firmware
[25758.653351] r8152 4-1:1.0 eth0: v1.12.13
[25758.689271] r8152 4-1:1.0 enx00e04c68xxxx: renamed from eth0
[25763.271682] r8152 4-1:1.0 enx00e04c68xxxx: carrier on

The RTL8157 adapter was tested against an AQC107 PCIe-card supporting
10GBit/s and an RTL8126 5Gbit PCIe-card supporting 5GBit/s for
performance, link speed and EEE negotiation. Using USB3.2 Gen 1 with
the RTL8157 USB adapter and running iperf3 against the AQC107 PCIe
card resulted in 3.47 Gbits/sec, whereas using USB3.2 Gen2 resulted
in 4.70 Gbits/sec, speeds against the RTL8126-card were the same.

As the code integrates the RTL8157-specific code with existing RTL8156 code
in order to improve code maintainability (instead of adding RTL8157-specific
functions duplicaing most of the RTL8156 code), regression tests were done
with an Edimax EU-4307 V1.0 USB-Ethernet adapter with RTL8156.

The code is based on the out-of-tree r8152 driver published by Realtek under
the GPL.

This patch is on top of linux-next as the code re-uses the 2.5 Gbit EEE
recently added in r8152.c.

Signed-off-by: Birger Koblitz <mail@birger-koblitz.de>
====================

Link: https://patch.msgid.link/20260404-rtl8157_next-v7-0-039121318f23@birger-koblitz.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agor8152: Add support for the RTL8157 hardware
Birger Koblitz [Sat, 4 Apr 2026 07:57:43 +0000 (09:57 +0200)] 
r8152: Add support for the RTL8157 hardware

The RTL8157 uses a different packet descriptor format compared to the
previous generation of chips. Add support for this format by adding a
descriptor format structure into the r8152 structure and corresponding
desc_ops functions which abstract the vlan-tag, tx/rx len and
tx/rx checksum algorithms.

Also, add support for the ADV indirect access interface of the RTL8157
and PHY setup.

For initialization of the RTL8157, combine the existing RTL8156B and
RTL8156 init functions and add RTL8157-specific functinality in order
to improve code readability and maintainability.
r8156_init() is now called with RTL_VER_10 and RTL_VER_11 for the RTL8156,
with RTL_VER_12, RTL_VER_13 and RTL_VER_15 for the RTL8156B and with
RTL_VER_16 for the RTL8157 and checks the version for chip-specific code.
Also add USB power control functions for the RTL8157.

Add support for the USB device ID of Realtek RTL8157-based adapters. Detect
the RTL8157 as RTL_VER_16 and set it up.

Signed-off-by: Birger Koblitz <mail@birger-koblitz.de>
Link: https://patch.msgid.link/20260404-rtl8157_next-v7-2-039121318f23@birger-koblitz.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agor8152: Add support for 5Gbit Link Speeds and EEE
Birger Koblitz [Sat, 4 Apr 2026 07:57:42 +0000 (09:57 +0200)] 
r8152: Add support for 5Gbit Link Speeds and EEE

The RTL8157 supports 5GBit Link speeds. Add support for this speed
in the setup and setting/getting through ethtool. Also add 5GBit EEE.
Add functionality for setup and ethtool get/set methods.

Signed-off-by: Birger Koblitz <mail@birger-koblitz.de>
Link: https://patch.msgid.link/20260404-rtl8157_next-v7-1-039121318f23@birger-koblitz.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agoMerge branch 'devlink-add-per-port-resource-support'
Jakub Kicinski [Thu, 9 Apr 2026 02:55:43 +0000 (19:55 -0700)] 
Merge branch 'devlink-add-per-port-resource-support'

Tariq Toukan says:

====================
devlink: add per-port resource support

This series by Or adds devlink per-port resource support:

Currently, devlink resources are only available at the device level.
However, some resources are inherently per-port, such as the maximum
number of subfunctions (SFs) that can be created on a specific PF port.
This limitation prevents user space from obtaining accurate per-port
capacity information.
This series adds infrastructure for per-port resources in devlink core
and implements it in the mlx5 driver to expose the max_SFs resource
on PF devlink ports.

Patch #1  refactors resource functions to be generic
Patch #2  adds port-level resource registration infrastructure
Patch #3  registers SF resource on PF port representor in mlx5
Patch #4  adds devlink port resource registration to netdevsim for testing
Patch #5  adds dump support for device-level resources
Patch #6  includes port resources in the resource dump dumpit path
Patch #7  adds port-specific option to resource dump doit path
Patch #8  adds selftest for devlink port resource doit
Patch #9  documents port-level resources and full dump
Patch #10 adds resource scope filtering to resource dump
Patch #11 adds selftest for resource dump and scope filter
Patch #12 documents resource scope filtering
====================

Link: https://patch.msgid.link/20260407194107.148063-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agodevlink: Document resource scope filtering
Or Har-Toov [Tue, 7 Apr 2026 19:41:07 +0000 (22:41 +0300)] 
devlink: Document resource scope filtering

Document the scope parameter for devlink resource show, which allows
filtering the dump to device-level or port-level resources only.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260407194107.148063-13-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoselftest: netdevsim: Add resource dump and scope filter test
Or Har-Toov [Tue, 7 Apr 2026 19:41:06 +0000 (22:41 +0300)] 
selftest: netdevsim: Add resource dump and scope filter test

Add resource_dump_test() which verifies dumping resources for all
devices and ports, and tests that scope=dev returns only device-level
resources and scope=port returns only port resources.

Skip if userspace does not support the scope parameter.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260407194107.148063-12-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agodevlink: Add resource scope filtering to resource dump
Or Har-Toov [Tue, 7 Apr 2026 19:41:05 +0000 (22:41 +0300)] 
devlink: Add resource scope filtering to resource dump

Allow filtering the resource dump to device-level or port-level
resources using the 'scope' option.

Example - dump only device-level resources:

  $ devlink resource show scope dev
  pci/0000:03:00.0:
    name max_local_SFs size 128 unit entry dpipe_tables none
    name max_external_SFs size 128 unit entry dpipe_tables none
  pci/0000:03:00.1:
    name max_local_SFs size 128 unit entry dpipe_tables none
    name max_external_SFs size 128 unit entry dpipe_tables none

Example - dump only port-level resources:

  $ devlink resource show scope port
  pci/0000:03:00.0/196608:
    name max_SFs size 128 unit entry dpipe_tables none
  pci/0000:03:00.0/196609:
    name max_SFs size 128 unit entry dpipe_tables none
  pci/0000:03:00.1/196708:
    name max_SFs size 128 unit entry dpipe_tables none
  pci/0000:03:00.1/196709:
    name max_SFs size 128 unit entry dpipe_tables none

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260407194107.148063-11-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agodevlink: Document port-level resources and full dump
Or Har-Toov [Tue, 7 Apr 2026 19:41:04 +0000 (22:41 +0300)] 
devlink: Document port-level resources and full dump

Document the port-level resource support and the option to dump all
resources, including both device-level and port-level entries.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260407194107.148063-10-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoselftest: netdevsim: Add devlink port resource doit test
Or Har-Toov [Tue, 7 Apr 2026 19:41:03 +0000 (22:41 +0300)] 
selftest: netdevsim: Add devlink port resource doit test

Tests that querying a specific port handle returns the expected
resource name and size.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260407194107.148063-9-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agodevlink: Add port-specific option to resource dump doit
Or Har-Toov [Tue, 7 Apr 2026 19:41:02 +0000 (22:41 +0300)] 
devlink: Add port-specific option to resource dump doit

Allow querying devlink resources per-port via the resource-dump doit
handler. When a port-index attribute is provided, only that port's
resources are returned. When no port-index is given, only device-level
resources are returned, preserving backward compatibility.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260407194107.148063-8-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agodevlink: Include port resources in resource dump dumpit
Or Har-Toov [Tue, 7 Apr 2026 19:41:01 +0000 (22:41 +0300)] 
devlink: Include port resources in resource dump dumpit

Allow querying devlink resources per-port via the resource-dump dumpit
handler. Both device-level and all ports resources are included in the
reply.

For example:

$ devlink resource show
pci/0000:03:00.0:
  name local_max_SFs size 508 unit entry
  name external_max_SFs size 508 unit entry
pci/0000:03:00.0/196608:
  name max_SFs size 20 unit entry
pci/0000:03:00.1:
  name local_max_SFs size 508 unit entry
  name external_max_SFs size 508 unit entry
pci/0000:03:00.1/262144:
  name max_SFs size 20 unit entry

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260407194107.148063-7-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agodevlink: Add dump support for device-level resources
Or Har-Toov [Tue, 7 Apr 2026 19:41:00 +0000 (22:41 +0300)] 
devlink: Add dump support for device-level resources

Add dumpit handler for resource-dump command to iterate over all devlink
devices and show their resources.

  $ devlink resource show
  pci/0000:08:00.0:
    name local_max_SFs size 508 unit entry
    name external_max_SFs size 508 unit entry
  pci/0000:08:00.1:
    name local_max_SFs size 508 unit entry
    name external_max_SFs size 508 unit entry

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260407194107.148063-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonetdevsim: Add devlink port resource registration
Or Har-Toov [Tue, 7 Apr 2026 19:40:59 +0000 (22:40 +0300)] 
netdevsim: Add devlink port resource registration

Register port-level resources for netdevsim ports to enable testing
of the port resource infrastructure.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260407194107.148063-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet/mlx5: Register SF resource on PF port representor
Or Har-Toov [Tue, 7 Apr 2026 19:40:58 +0000 (22:40 +0300)] 
net/mlx5: Register SF resource on PF port representor

The device-level "resource show" displays max_local_SFs and
max_external_SFs without indicating which port each resource belongs
to. Users cannot determine the controller number and pfnum associated
with each SF pool.

Register max_SFs resource on the host PF representor port to expose
per-port SF limits. Users can correlate the port resource with the
controller number and pfnum shown in 'devlink port show'.

Future patches will introduce an ECPF that manages multiple PFs,
where each PF has its own SF pool.

Example usage:

  $ devlink resource show pci/0000:03:00.0/196608
  pci/0000:03:00.0/196608:
    name max_SFs size 20 unit entry

  $ devlink port show pci/0000:03:00.0/196608
  pci/0000:03:00.0/196608: type eth netdev pf0hpf flavour pcipf
    controller 1 pfnum 0 external true splittable false
    function:
      hw_addr b8:3f:d2:e1:8f:dc roce enable max_io_eqs 120

We can create up to 20 SFs over devlink port pci/0000:03:00.0/196608,
with pfnum 0 and controller 1.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260407194107.148063-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agodevlink: Add port-level resource registration infrastructure
Or Har-Toov [Tue, 7 Apr 2026 19:40:57 +0000 (22:40 +0300)] 
devlink: Add port-level resource registration infrastructure

The current devlink resource infrastructure supports only device-level
resources. Some hardware resources are associated with specific ports
rather than the entire device, and today we have no way to show resource
per-port.

Add support for registering resources at the port level.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260407194107.148063-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agodevlink: Refactor resource functions to be generic
Or Har-Toov [Tue, 7 Apr 2026 19:40:56 +0000 (22:40 +0300)] 
devlink: Refactor resource functions to be generic

Currently the resource functions take devlink pointer as parameter
and take the resource list from there.
Allow resource functions to work with other resource lists that will
be added in next patches and not only with the devlink's resource list.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260407194107.148063-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoselftests/drivers/net: Add an xdp test to xdp.py
Leon Hwang [Mon, 6 Apr 2026 07:26:54 +0000 (15:26 +0800)] 
selftests/drivers/net: Add an xdp test to xdp.py

In "bpf: Disallow freplace on XDP with mismatched xdp_has_frags values" [1],
this XDP test is suggested to add to xdp.py.

1. Verify the failure of updating frag-capable prog with non-frag-capable
   prog, when the frag-capable prog attaches to mtu=9k driver.

The test has been verified against Mellanox CX6 and Intel 82599ES NICs.

With dropping other tests, here is the test log.

 # ethtool -i eth0
 driver: mlx5_core
 version: 6.19.0-061900-generic

 # NETIF=eth0 python3 xdp.py
 TAP version 13
 1..1
 ok 1 xdp.test_xdp_native_update_mb_to_sb
 # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0

 # ethtool -i eth0
 driver: ixgbe
 version: 6.19.0-061900-generic

 # NETIF=eth0 python3 xdp.py
 TAP version 13
 1..1
 # CMD: ip  link set dev eth0 xdpdrv obj /path/to/tools/testing/selftests/net/lib/xdp_dummy.bpf.o sec xdp.frags
 #   EXIT: 2
 #   STDERR: RTNETLINK answers: Invalid argument
 ok 1 xdp.test_xdp_native_update_mb_to_sb # SKIP device does not support multi-buffer XDP
 # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:1 error:0

Signed-off-by: Leon Hwang <leon.huangfu@shopee.com>
Link: https://patch.msgid.link/20260406072655.368173-1-leon.huangfu@shopee.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoMerge branch 'dsa_loop-and-platform_data-cleanups'
Jakub Kicinski [Thu, 9 Apr 2026 02:38:55 +0000 (19:38 -0700)] 
Merge branch 'dsa_loop-and-platform_data-cleanups'

Vladimir Oltean says:

====================
dsa_loop and platform_data cleanups

While working to add some new features to dsa_loop, I gathered a number
of cleanup patches. They mostly remove some data structures that became
unused after the multi-switch platforms were migrated to the modern DT
bindings.
====================

Link: https://patch.msgid.link/20260406212158.721806-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: dsa: eliminate <linux/dsa/loop.h>
Vladimir Oltean [Mon, 6 Apr 2026 21:21:58 +0000 (00:21 +0300)] 
net: dsa: eliminate <linux/dsa/loop.h>

There is no reason at all to export these data types to the global
include directory.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20260406212158.721806-5-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: dsa: remove unused platform_data definitions
Vladimir Oltean [Mon, 6 Apr 2026 21:21:57 +0000 (00:21 +0300)] 
net: dsa: remove unused platform_data definitions

Pretty self-explanatory, nobody needs these.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20260406212158.721806-4-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: dsa: clean up struct dsa_chip_data
Vladimir Oltean [Mon, 6 Apr 2026 21:21:56 +0000 (00:21 +0300)] 
net: dsa: clean up struct dsa_chip_data

This has accumulated some fields which are no longer parsed by the core
or set by any driver. Remove them.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20260406212158.721806-3-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: dsa: remove struct platform_data
Vladimir Oltean [Mon, 6 Apr 2026 21:21:55 +0000 (00:21 +0300)] 
net: dsa: remove struct platform_data

This is not used anywhere in the kernel.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20260406212158.721806-2-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoMerge branch 'mptcp-autotune-related-improvement'
Jakub Kicinski [Thu, 9 Apr 2026 02:32:05 +0000 (19:32 -0700)] 
Merge branch 'mptcp-autotune-related-improvement'

Matthieu Baerts says:

====================
mptcp: autotune related improvement

Here are two patches from Paolo that have been crafted a couple of
months ago, but needed more validation because they were indirectly
causing instabilities in the sefltests. The root cause has been fixed in
'net' recently in commit 8c09412e584d ("selftests: mptcp: more stable
simult_flows tests").

These patches refactor the receive space and RTT estimator, overall
making DRS more correct while avoiding receive buffer drifting to
tcp_rmem[2], which in turn makes the throughput more stable and less
bursty, especially with high bandwidth and low delay environments.

Note that the first patch addresses a very old issue. 'net-next' is
targeted because the change is quite invasive and based on a recent
backlog refactor. The 'Fixes' tag is then there more as a FYI, because
backporting this patch will quickly be blocked due to large conflicts.
====================

Link: https://patch.msgid.link/20260407-net-next-mptcp-reduce-rbuf-v2-0-0d1d135bf6f6@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agomptcp: add receive queue awareness in tcp_rcv_space_adjust()
Paolo Abeni [Tue, 7 Apr 2026 08:45:18 +0000 (10:45 +0200)] 
mptcp: add receive queue awareness in tcp_rcv_space_adjust()

This is the MPTCP counter-part of commit ea33537d8292 ("tcp: add receive
queue awareness in tcp_rcv_space_adjust()").

Prior to this commit:

  ESTAB 33165568 0      192.168.255.2:5201 192.168.255.1:53380 \
        skmem:(r33076416,rb33554432,t0,tb91136,f448,w0,o0,bl0,d0)

After:

  ESTAB 3279168 0      192.168.255.2:5201 192.168.255.1]:53042 \
        skmem:(r3190912,rb3719956,t0,tb91136,f1536,w0,o0,bl0,d0)

Same throughput.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260407-net-next-mptcp-reduce-rbuf-v2-2-0d1d135bf6f6@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agomptcp: better mptcp-level RTT estimator
Paolo Abeni [Tue, 7 Apr 2026 08:45:17 +0000 (10:45 +0200)] 
mptcp: better mptcp-level RTT estimator

The current MPTCP-level RTT estimator has several issues. On high speed
links, the MPTCP-level receive buffer auto-tuning happens with a
frequency well above the TCP-level's one. That in turn can cause
excessive/unneeded receive buffer increase.

On such links, the initial rtt_us value is considerably higher than the
actual delay, and the current mptcp_rcv_space_adjust() updates
msk->rcvq_space.rtt_us with a period equal to the such field previous
value. If the initial rtt_us is 40ms, its first update will happen after
40ms, even if the subflows see actual RTT orders of magnitude lower.

Additionally:
- setting the msk RTT to the maximum among all the subflows RTTs makes
  DRS constantly overshooting the rcvbuf size when a subflow has
  considerable higher latency than the other(s).

- during unidirectional bulk transfers with multiple active subflows,
  the TCP-level RTT estimator occasionally sees considerably higher
  value than the real link delay, i.e. when the packet scheduler reacts
  to an incoming ACK on given subflow pushing data on a different
  subflow.

- currently inactive but still open subflows (i.e. switched to backup
  mode) are always considered when computing the msk-level RTT.

Address the all the issues above with a more accurate RTT estimation
strategy: the MPTCP-level RTT is set to the minimum of all the subflows
actually feeding data into the MPTCP receive buffer, using a small
sliding window.

While at it, also use EWMA to compute the msk-level scaling_ratio, to
that MPTCP can avoid traversing the subflow list is
mptcp_rcv_space_adjust().

Use some care to avoid updating msk and ssk level fields too often.

Fixes: a6b118febbab ("mptcp: add receive buffer auto-tuning")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260407-net-next-mptcp-reduce-rbuf-v2-1-0d1d135bf6f6@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: initialize sk_rx_queue_mapping in sk_clone()
Jiayuan Chen [Tue, 7 Apr 2026 08:42:18 +0000 (16:42 +0800)] 
net: initialize sk_rx_queue_mapping in sk_clone()

sk_clone() initializes sk_tx_queue_mapping via sk_tx_queue_clear()
but does not initialize sk_rx_queue_mapping. Since this field is in
the sk_dontcopy region, it is neither copied from the parent socket
by sock_copy() nor zeroed by sk_prot_alloc() (called without
__GFP_ZERO from sk_clone).

Commit 03cfda4fa6ea ("tcp: fix another uninit-value
(sk_rx_queue_mapping)") attempted to fix this by introducing
sk_mark_napi_id_set() with force_set=true in tcp_child_process().
However, sk_mark_napi_id_set() -> sk_rx_queue_set() only writes
when skb_rx_queue_recorded(skb) is true. If the 3-way handshake
ACK arrives through a device that does not record rx_queue (e.g.
loopback or veth), sk_rx_queue_mapping remains uninitialized.

When a subsequent data packet arrives with a recorded rx_queue,
sk_mark_napi_id() -> sk_rx_queue_update() reads the uninitialized
field for comparison (force_set=false path), triggering KMSAN.

This was reproduced by establishing a TCP connection over loopback
(which does not call skb_record_rx_queue), then attaching a BPF TC
program on lo ingress to set skb->queue_mapping on data packets:

BUG: KMSAN: uninit-value in tcp_v4_do_rcv (net/ipv4/tcp_ipv4.c:1875)
 tcp_v4_do_rcv (net/ipv4/tcp_ipv4.c:1875)
 tcp_v4_rcv (net/ipv4/tcp_ipv4.c:2287)
 ip_protocol_deliver_rcu (net/ipv4/ip_input.c:207)
 ip_local_deliver_finish (net/ipv4/ip_input.c:242)
 ip_local_deliver (net/ipv4/ip_input.c:262)
 ip_rcv (net/ipv4/ip_input.c:573)
 __netif_receive_skb (net/core/dev.c:6294)
 process_backlog (net/core/dev.c:6646)
 __napi_poll (net/core/dev.c:7710)
 net_rx_action (net/core/dev.c:7929)
 handle_softirqs (kernel/softirq.c:623)
 do_softirq (kernel/softirq.c:523)
 __local_bh_enable_ip (kernel/softirq.c:?)
 __dev_queue_xmit (net/core/dev.c:?)
 ip_finish_output2 (net/ipv4/ip_output.c:237)
 ip_output (net/ipv4/ip_output.c:438)
 __ip_queue_xmit (net/ipv4/ip_output.c:534)
 __tcp_transmit_skb (net/ipv4/tcp_output.c:1693)
 tcp_write_xmit (net/ipv4/tcp_output.c:3064)
 tcp_sendmsg_locked (net/ipv4/tcp.c:?)
 tcp_sendmsg (net/ipv4/tcp.c:1465)
 inet_sendmsg (net/ipv4/af_inet.c:865)
 sock_write_iter (net/socket.c:1195)
 vfs_write (fs/read_write.c:688)
 ...
Uninit was created at:
 kmem_cache_alloc_noprof (mm/slub.c:4873)
 sk_prot_alloc (net/core/sock.c:2239)
 sk_alloc (net/core/sock.c:2301)
 inet_create (net/ipv4/af_inet.c:334)
 __sock_create (net/socket.c:1605)
 __sys_socket (net/socket.c:1747)

Fix this at the root by adding sk_rx_queue_clear() alongside
sk_tx_queue_clear() in sk_clone().

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260407084219.95718-1-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoselftests: forwarding: lib: rewrite processing of command line arguments
Ioana Ciornei [Tue, 7 Apr 2026 10:20:58 +0000 (13:20 +0300)] 
selftests: forwarding: lib: rewrite processing of command line arguments

The piece of code which processes the command line arguments and
populates NETIFS based on them is really unobvious. Rewrite it so that
the intention is clear and the code is easy to follow.

Suggested-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20260407102058.867279-1-ioana.ciornei@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: bcmasp: Switch to page pool for RX path
Florian Fainelli [Wed, 8 Apr 2026 00:18:13 +0000 (17:18 -0700)] 
net: bcmasp: Switch to page pool for RX path

This shows an improvement of 1.9% in reducing the CPU cycles and data
cache misses.

Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Justin Chen <justin.chen@broadcom.com>
Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Link: https://patch.msgid.link/20260408001813.635679-1-florian.fainelli@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: dropreason: add MACVLAN_BROADCAST_BACKLOG and IPVLAN_MULTICAST_BACKLOG
Eric Dumazet [Tue, 7 Apr 2026 15:07:10 +0000 (15:07 +0000)] 
net: dropreason: add MACVLAN_BROADCAST_BACKLOG and IPVLAN_MULTICAST_BACKLOG

ipvlan and macvlan use queues to process broadcast/multicast packets
from a work queue.

Under attack these queues can drop packets.

Add MACVLAN_BROADCAST_BACKLOG drop_reason for macvlan broadcast queue.

Add IPVLAN_MULTICAST_BACKLOG drop_reason for ipvlan multicast queue.

Use different reasons as some deployments use both ipvlan and macvlan.

Also change ipvlan_rcv_frame() to use SKB_DROP_REASON_DEV_READY
when the device is not UP.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260407150710.1640747-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agocodel: annotate data-races in codel_dump_stats()
Eric Dumazet [Tue, 7 Apr 2026 14:30:53 +0000 (14:30 +0000)] 
codel: annotate data-races in codel_dump_stats()

codel_dump_stats() only runs with RTNL held,
reading fields that can be changed in qdisc fast path.

Add READ_ONCE()/WRITE_ONCE() annotations.

Alternative would be to acquire the qdisc spinlock, but our long-term
goal is to make qdisc dump operations lockless as much as we can.

tc_codel_xstats fields don't need to be latched atomically,
otherwise this bug would have been caught earlier.

No change in kernel size:

$ scripts/bloat-o-meter -t vmlinux.0 vmlinux
add/remove: 0/0 grow/shrink: 1/1 up/down: 3/-1 (2)
Function                                     old     new   delta
codel_qdisc_dequeue                         2462    2465      +3
codel_dump_stats                             250     249      -1
Total: Before=29739919, After=29739921, chg +0.00%

Fixes: 76e3cc126bb2 ("codel: Controlled Delay AQM")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260407143053.1570620-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: phy: realtek: get rid of magic numbers in rtl8201_config_intr()
Aleksander Jan Bajkowski [Mon, 6 Apr 2026 20:12:12 +0000 (22:12 +0200)] 
net: phy: realtek: get rid of magic numbers in rtl8201_config_intr()

Replace the magic numbers with defines. Register names were obtained from
publicly available documentation[1]. This should make it clear what's going
on in the code.

1. RTL8201F/RTL8201FL/RTL8201FN Rev. 1.4 Datasheet
Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Reviewed-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Nicolai Buchwitz nb@tipi-net.de
Link: https://patch.msgid.link/20260406201222.1043396-1-olek2@wp.pl
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agobonding: remove unused bond_is_first_slave and bond_is_last_slave macros
Xiang Mei [Sat, 4 Apr 2026 22:04:12 +0000 (15:04 -0700)] 
bonding: remove unused bond_is_first_slave and bond_is_last_slave macros

Since commit 2884bf72fb8f ("net: bonding: fix use-after-free in
bond_xmit_broadcast()"), bond_is_last_slave() was only used in
bond_xmit_broadcast().  After the recent fix replaced that usage with
a simple index comparison, bond_is_last_slave() has no remaining
callers.  bond_is_first_slave() likewise has no callers.

Remove both unused macros.

Signed-off-by: Xiang Mei <xmei5@asu.edu>
Link: https://patch.msgid.link/20260404220412.444753-1-xmei5@asu.edu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agodocs: netdev: improve wording of reviewer guidance
Jakub Kicinski [Mon, 6 Apr 2026 17:53:34 +0000 (10:53 -0700)] 
docs: netdev: improve wording of reviewer guidance

Reword the reviewer guidance based on behavior we see on the list.
Steer folks:
 - towards sending tags
 - away from process issues.

Reviewed-by: Joe Damato <joe@dama.to>
Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Link: https://patch.msgid.link/20260406175334.3153451-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoMerge tag 'nf-next-26-04-08' of https://git.kernel.org/pub/scm/linux/kernel/git/netfi...
Jakub Kicinski [Thu, 9 Apr 2026 01:58:08 +0000 (18:58 -0700)] 
Merge tag 'nf-next-26-04-08' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Florian Westphal says:

====================
netfilter: updates for net-next

1) Fix ancient sparse warnings in nf conntrack nat modules, from
   Sun Jian.

2) Fix typo in enum description, from Jelle van der Waa.

3) remove redundant refetch of netns pointer in nf_conntrack_sip.

4) add a deprecation warning for dccp match.
   We can extend the deadline later if needed, but plan atm is to
   remove the feature.

5) remove nf_conntrack_h323 debug code that can read out-of-bounds
   with malformed messages. This code was commented out, but better
   remove this.

6+7) add more netlink policy validations in netfilter.
   This could theoretically cause issues when a client sends e.g.
   unsupported feature flags that were previously ignored, so we
   may have to relax some changes. For now, try to be stricter and
   reject upfront.

8+9) minor code cleanup in nft_set_pipapo (an nftables set backend).

10) Add nftables matching support fro double-tagged vlan and pppoe
    frames, from Pablo Neira Ayuso.

11) Fix up indentation of debug messages in nf_conntrack_h323 conntrack
    helper, from David Laight.

12) Add a helper to iterate to next flow action and bail out if the
    maximum number of actions is reached, also from Pablo.

* tag 'nf-next-26-04-08' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nf_tables_offload: add nft_flow_action_entry_next() and use it
  netfilter: nf_conntrack_h323: Correct indentation when H323_TRACE defined
  netfilter: nft_meta: add double-tagged vlan and pppoe support
  netfilter: nft_set_pipapo_avx2: remove redundant loop in lookup_slow
  netfilter: nft_set_pipapo: increment data in one step
  netfilter: nf_tables: add netlink policy based cap on registers
  netfilter: add more netlink-based policy range checks
  netfilter: nf_conntrack_h323: remove unreliable debug code in decode_octstr
  netfilter: add deprecation warning for dccp support
  netfilter: nf_conntrack_sip: remove net variable shadowing
  netfilter: nf_tables: Fix typo in enum description
  netfilter: use function typedefs for __rcu NAT helper hook pointers
====================

Link: https://patch.msgid.link/20260408060419.25258-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoMerge tag 'ipsec-next-2026-04-08' of git://git.kernel.org/pub/scm/linux/kernel/git...
Jakub Kicinski [Thu, 9 Apr 2026 01:51:54 +0000 (18:51 -0700)] 
Merge tag 'ipsec-next-2026-04-08' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next

Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2026-04-08

1) Update outdated comment in xfrm_dst_check().
   From kexinsun.

2) Drop support for HMAC-RIPEMD-160 from IPsec.
   From Eric Biggers.

* tag 'ipsec-next-2026-04-08' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
  xfrm: Drop support for HMAC-RIPEMD-160
  xfrm: update outdated comment
====================

Link: https://patch.msgid.link/20260408094258.148555-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonetfilter: nf_tables_offload: add nft_flow_action_entry_next() and use it
Pablo Neira Ayuso [Mon, 30 Mar 2026 09:04:02 +0000 (11:04 +0200)] 
netfilter: nf_tables_offload: add nft_flow_action_entry_next() and use it

Add a new helper function to retrieve the next action entry in flow
rule, check if the maximum number of actions is reached, bail out in
such case.

Replace existing opencoded iteration on the action array by this
helper function.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
8 weeks agonetfilter: nf_conntrack_h323: Correct indentation when H323_TRACE defined
David Laight [Thu, 26 Mar 2026 20:18:19 +0000 (20:18 +0000)] 
netfilter: nf_conntrack_h323: Correct indentation when H323_TRACE defined

The trace lines are indented using PRINT("%*.s", xx, " ").
Userspace will treat this as "%*.0s" and will output no characters
when 'xx' is zero, the kernel treats it as "%*s" and will output
a single ' ' - which is probably what is intended.

Change all the formats to "%*s" removing the default precision.
This gives a single space indent when level is zero.

Signed-off-by: David Laight <david.laight.linux@gmail.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
8 weeks agonetfilter: nft_meta: add double-tagged vlan and pppoe support
Pablo Neira Ayuso [Sun, 22 Mar 2026 22:51:47 +0000 (23:51 +0100)] 
netfilter: nft_meta: add double-tagged vlan and pppoe support

Currently:

  add rule netdev x y ip saddr 1.1.1.1

does not work with neither double-tagged vlan nor pppoe packets. This is
because the network and transport header offset are not pointing to the
IP and transport protocol headers in the stack.

This patch expands NFT_META_PROTOCOL and NFT_META_L4PROTO to parse
double-tagged vlan and pppoe packets so matching network and transport
header fields becomes possible with the existing userspace generated
bytecode. Note that this parser only supports double-tagged vlan which
is composed of vlan offload + vlan header in the skb payload area for
simplicity.

NFT_META_PROTOCOL is used by bridge and netdev family as an implicit
dependency in the bytecode to match on network header fields.
Similarly, there is also NFT_META_L4PROTO, which is also used as an
implicit dependency when matching on the transport protocol header
fields.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
8 weeks agonetfilter: nft_set_pipapo_avx2: remove redundant loop in lookup_slow
Florian Westphal [Tue, 17 Mar 2026 14:09:18 +0000 (15:09 +0100)] 
netfilter: nft_set_pipapo_avx2: remove redundant loop in lookup_slow

nft_pipapo_avx2_lookup_slow will never be used in reality, because the
common sizes are handled by avx2 optimized versions.

However, nft_pipapo_avx2_lookup_slow loops over the data just like the
avx2 functions.  However, _slow doesn't need to do that.

As-is, first loop sets all the right result bits and the next iterations
boil down to 'x = x & x'.  Remove the loop.

Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
8 weeks agonetfilter: nft_set_pipapo: increment data in one step
Florian Westphal [Tue, 17 Mar 2026 14:50:08 +0000 (15:50 +0100)] 
netfilter: nft_set_pipapo: increment data in one step

Since commit e807b13cb3e3 ("nft_set_pipapo: Generalise group size for buckets")
there is no longer a need to increment the data pointer in two steps.
Switch to a single invocation of NFT_PIPAPO_GROUPS_PADDED_SIZE() helper,
like the avx2 implementation.

[ Stefano: Improve commit message ]

Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
8 weeks agonetfilter: nf_tables: add netlink policy based cap on registers
Florian Westphal [Fri, 13 Mar 2026 12:12:30 +0000 (13:12 +0100)] 
netfilter: nf_tables: add netlink policy based cap on registers

Should have no effect in practice; all of these use the
nft_parse_register_load/store apis which is mandatory anyway due
to the need to further validate the register load/store, e.g.
that the size argument doesn't result in out-of-bounds load/store.

OTOH this is a simple method to reject obviously wrong input
at earlier stage.

Signed-off-by: Florian Westphal <fw@strlen.de>
8 weeks agonetfilter: add more netlink-based policy range checks
Florian Westphal [Mon, 9 Mar 2026 23:26:46 +0000 (00:26 +0100)] 
netfilter: add more netlink-based policy range checks

These spots either already check the attribute range manually
before use or the consuming functions tolerate unexpected values.

Nevertheless, add more range checks via netlink policy so we gain
more users and avoid possible re-use in other places that might
not have the required manual checks.  This also improves error
reporting: netlink core can generate extack errors.

Signed-off-by: Florian Westphal <fw@strlen.de>
8 weeks agonetfilter: nf_conntrack_h323: remove unreliable debug code in decode_octstr
Florian Westphal [Thu, 12 Mar 2026 17:53:41 +0000 (18:53 +0100)] 
netfilter: nf_conntrack_h323: remove unreliable debug code in decode_octstr

The debug code (not enabled in any build) reads up to 6 octets of
the inpt buffer, but does so without bound checks.  Zap this.

Signed-off-by: Florian Westphal <fw@strlen.de>
8 weeks agonetfilter: add deprecation warning for dccp support
Florian Westphal [Wed, 11 Mar 2026 09:53:15 +0000 (10:53 +0100)] 
netfilter: add deprecation warning for dccp support

Add a deprecation warning for the xt_dccp match and the
nft exthdr code.

Signed-off-by: Florian Westphal <fw@strlen.de>
8 weeks agonetfilter: nf_conntrack_sip: remove net variable shadowing
Florian Westphal [Tue, 10 Mar 2026 15:05:51 +0000 (16:05 +0100)] 
netfilter: nf_conntrack_sip: remove net variable shadowing

net is already set, derived from nf_conn.
I don't see how the device could be living in a different netns
than the conntrack entry.

Remove the extra variable and re-use existing one.

Signed-off-by: Florian Westphal <fw@strlen.de>
8 weeks agonetfilter: nf_tables: Fix typo in enum description
Jelle van der Waa [Mon, 9 Mar 2026 20:29:33 +0000 (21:29 +0100)] 
netfilter: nf_tables: Fix typo in enum description

Fix the spelling of "options".

Signed-off-by: Jelle van der Waa <jelle@vdwaa.nl>
Signed-off-by: Florian Westphal <fw@strlen.de>
8 weeks agonetfilter: use function typedefs for __rcu NAT helper hook pointers
Sun Jian [Tue, 3 Mar 2026 10:15:25 +0000 (18:15 +0800)] 
netfilter: use function typedefs for __rcu NAT helper hook pointers

After commit 07919126ecfc ("netfilter: annotate NAT helper hook pointers
with __rcu"), sparse can warn about type/address-space mismatches when
RCU-dereferencing NAT helper hook function pointers.

The hooks are __rcu-annotated and accessed via rcu_dereference(), but the
combination of complex function pointer declarators and the WRITE_ONCE()
machinery used by RCU_INIT_POINTER()/rcu_assign_pointer() can confuse
sparse and trigger false positives.

Introduce typedefs for the NAT helper function types, so __rcu applies to
a simple "fn_t __rcu *" pointer form. Also replace local typeof(hook)
variables with "fn_t *" to avoid propagating __rcu address space into
temporaries.

No functional change intended.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202603022359.3dGE9fwI-lkp@intel.com/
Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
8 weeks agoMerge branch 'net-pull-gso-packet-headers-in-core-stack'
Jakub Kicinski [Wed, 8 Apr 2026 02:02:18 +0000 (19:02 -0700)] 
Merge branch 'net-pull-gso-packet-headers-in-core-stack'

Eric Dumazet says:

====================
net: pull gso packet headers in core stack

Most ndo_start_xmit() methods expects headers of gso packets
to be already in skb->head.

net/core/tso.c users are particularly at risk, because tso_build_hdr()
does a memcpy(hdr, skb->data, hdr_len);

qdisc_pkt_len_segs_init() already does a dissection of gso packets.

Use pskb_may_pull() instead of skb_header_pointer() to make
sure drivers do not have to reimplement this.

First patch is a small cleanup to ease second patch review.
====================

Link: https://patch.msgid.link/20260403221540.3297753-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: pull headers in qdisc_pkt_len_segs_init()
Eric Dumazet [Fri, 3 Apr 2026 22:15:40 +0000 (22:15 +0000)] 
net: pull headers in qdisc_pkt_len_segs_init()

Most ndo_start_xmit() methods expects headers of gso packets
to be already in skb->head.

net/core/tso.c users are particularly at risk, because tso_build_hdr()
does a memcpy(hdr, skb->data, hdr_len);

qdisc_pkt_len_segs_init() already does a dissection of gso packets.

Use pskb_may_pull() instead of skb_header_pointer() to make
sure drivers do not have to reimplement this.

Some malicious packets could be fed, detect them so that we can
drop them sooner with a new SKB_DROP_REASON_SKB_BAD_GSO drop_reason.

Fixes: e876f208af18 ("net: Add a software TSO helper API")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260403221540.3297753-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: qdisc_pkt_len_segs_init() cleanup
Eric Dumazet [Fri, 3 Apr 2026 22:15:39 +0000 (22:15 +0000)] 
net: qdisc_pkt_len_segs_init() cleanup

Reduce indentation level by returning early if the transport header
was not set.

Add an unlikely() clause as this is not the common case.

No functional change.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260403221540.3297753-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoselftests: drv-net: adjust to socat changes
Jakub Kicinski [Sat, 4 Apr 2026 23:01:03 +0000 (16:01 -0700)] 
selftests: drv-net: adjust to socat changes

socat v1.8.1.0 now defaults to shut-null, it sends an extra
0-length UDP packet when sender disconnects. This breaks
our tests which expect the exact packet sequence.

Add shut-none which was the old default where necessary.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Joe Damato <joe@dama.to>
Reviewed-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260404230103.2719103-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: hsr: emit notification for PRP slave2 changed hw addr on port deletion
Fernando Fernandez Mancera [Fri, 3 Apr 2026 12:39:29 +0000 (14:39 +0200)] 
net: hsr: emit notification for PRP slave2 changed hw addr on port deletion

On PRP protocol, when deleting the port the MAC address change
notification was missing. In addition to that, make sure to only perform
the MAC address change on slave2 deletion and PRP protocol as the
operation isn't necessary for HSR nor slave1.

Note that the eth_hw_addr_set() is correct on PRP context as the slaves
are either in promiscuous mode or forward offload enabled.

Reported-by: Luka Gejak <luka.gejak@linux.dev>
Closes: https://lore.kernel.org/netdev/DHFCZEM93FTT.1RWFBIE32K7OT@linux.dev/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Felix Maurer <fmaurer@redhat.com>
Link: https://patch.msgid.link/20260403123928.4249-2-fmancera@suse.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agoMerge branch 'net-mlx5e-xdp-add-support-for-multi-packet-per-page'
Paolo Abeni [Tue, 7 Apr 2026 11:34:08 +0000 (13:34 +0200)] 
Merge branch 'net-mlx5e-xdp-add-support-for-multi-packet-per-page'

Tariq Toukan says:

====================
net/mlx5e: XDP, Add support for multi-packet per page

This series removes the limitation of having one packet per page in XDP
mode. This has the following implications:

- XDP in Striding RQ mode can now be used on 64K page systems.

- XDP in Legacy RQ mode was using a single packet per page which on 64K
  page systems is quite inefficient. The improvement can be observed
  with an XDP_DROP test when running in Legacy RQ mode on a ARM
  Neoverse-N1 system with a 64K page size:
  +-----------------------------------------------+
  | MTU  | baseline   | this change | improvement |
  |------+------------+-------------+-------------|
  | 1500 | 15.55 Mpps | 18.99 Mpps  | 22.0 %      |
  | 9000 | 15.53 Mpps | 18.24 Mpps  | 17.5 %      |
  +-----------------------------------------------+

After lifting this limitation, the series switches to using fragments
for the side page in non-linear mode. This small improvement is at most
visible for XDP_DROP tests with small 64B packets and a large enough MTU
for Striding RQ to be in non-linear mode:
+----------------------------------------------------------------------+
| System               | MTU  | baseline   | this change | improvement |
|----------------------+------+------------+-------------+-------------|
| 4K page x86_64 [1]   | 9000 | 26.30 Mpps | 30.45 Mpps  | 15.80 %     |
| 64K page aarch64 [2] | 9000 | 15.27 Mpps | 20.10 Mpps  | 31.62 %     |
+----------------------------------------------------------------------+

This series does not cover the xsk (AF_XDP) paths for 64K page systems.

[1] https://lore.kernel.org/all/20260324024235.929875-1-kuba@kernel.org/
====================

Link: https://patch.msgid.link/20260403090927.139042-1-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet/mlx5e: XDP, Use page fragments for linear data in multibuf-mode
Dragos Tatulea [Fri, 3 Apr 2026 09:09:27 +0000 (12:09 +0300)] 
net/mlx5e: XDP, Use page fragments for linear data in multibuf-mode

Currently in XDP multi-buffer mode for striding rq a whole page is
allocated for the linear part of the XDP buffer. This is wasteful,
especially on systems with larger page sizes.

This change splits the page into fixed sized fragments. The page is
replenished when the maximum number of allowed fragments is reached.
When a fragment is not used, it will be simply recycled on next packet.
This is great for XDP_DROP as the fragment can be recycled for the next
packet. In the most extreme case (XDP_DROP everything), there will be 0
fragments used => only one linear page allocation for the lifetime of
the XDP program.

The previous page_pool size increase was too conservative (doubling the
size) and now there are much fewer allocations (1/8 for a 4K page). So
drop the page_pool size extension altogether when the linear side page
is used.

This small improvement is at most visible for XDP_DROP tests with small
64B packets and a large enough MTU for Striding RQ to be in non-linear
mode:
+----------------------------------------------------------------------+
| System               | MTU  | baseline   | this change | improvement |
|----------------------+------+------------+-------------+-------------|
| 4K page x86_64 [1]   | 9000 | 26.30 Mpps | 30.45 Mpps  | 15.80 %     |
| 64K page aarch64 [2] | 9000 | 15.27 Mpps | 20.10 Mpps  | 31.62 %     |
+----------------------------------------------------------------------+

[1] Intel Xeon Platinum 8580
[2] ARM Neoverse-N1

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260403090927.139042-6-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet/mlx5e: XDP, Use a single linear page per rq
Dragos Tatulea [Fri, 3 Apr 2026 09:09:26 +0000 (12:09 +0300)] 
net/mlx5e: XDP, Use a single linear page per rq

Currently in striding rq there is one mlx5e_frag_page member per WQE for
the linear page. This linear page is used only in XDP multi-buffer mode.
This is wasteful because only one linear page is needed per rq: the page
gets refreshed on every packet, regardless of WQE. Furthermore, it is
not needed in other modes (non-XDP, XDP single-buffer).

This change moves the linear page into its own structure (struct
mlx5_mpw_linear_info) and allocates it only when necessary.

A special structure is created because an upcoming patch will extend
this structure to support fragmentation of the linear page.

This patch has no functional changes.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260403090927.139042-5-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet/mlx5e: XDP, Remove stride size limitation
Dragos Tatulea [Fri, 3 Apr 2026 09:09:25 +0000 (12:09 +0300)] 
net/mlx5e: XDP, Remove stride size limitation

Currently XDP mode always uses PAGE_SIZE strides. This limitation
existed because page fragment counting was not implemented when XDP was
added. Furthermore, due to this limitation there were other issues as
well on system with larger pages (e.g. 64K):

- XDP for Striding RQ was effectively disabled on such systems.

- Legacy RQ allows the configuration but uses a fixed scheme of one XDP
  buffer per page which is inefficient.

As fragment counting was added during the driver conversion to
page_pool and the support for XDP multi-buffer, it is now possible
to remove this stride size limitation. This patch does just that.

Now it is possible to use XDP on systems with higher page sizes (e.g.
64K):

- For Striding RQ, loading the program is no longer blocked.
  Although a 64K page can fit any packet, MTUs that result in
  stride > 8K will still make the RQ in non-linear mode. That's
  because the HW doesn't support a higher than 8K stride.

- For Legacy RQ, the stride size was PAGE_SIZE which was very
  inefficient. Now the stride size will be calculated relative to MTU.
  Legacy RQ will always be in linear mode for larger system pages.

  This can be observed with an XDP_DROP test [1] when running
  in Legacy RQ mode on a ARM Neoverse-N1 system with a 64K
  page size:
  +-----------------------------------------------+
  | MTU  | baseline   | this change | improvement |
  |------+------------+-------------+-------------|
  | 1500 | 15.55 Mpps | 18.99 Mpps  | 22.0 %      |
  | 9000 | 15.53 Mpps | 18.24 Mpps  | 17.5 %      |
  +-----------------------------------------------+

There are performance benefits for Striding RQ mode as well:

- Striding RQ non-linear mode now uses 256B strides, just like
  non-XDP mode.

- Striding RQ linear mode can now fit a number of XDP buffers per page
  that is relative to the MTU size. That means that on 4K page systems
  and a small enough MTU, 2 XDP buffers can fit in one page.

The above benefits for Striding RQ can be observed with an
XDP_DROP test [1] when running on a 4K page x86_64 system
(Intel Xeon Platinum 8580):
  +-----------------------------------------------+
  | MTU  | baseline   | this change | improvement |
  |------+------------+-------------+-------------|
  | 1000 | 28.36 Mpps | 33.98 Mpps  | 19.82 %     |
  | 9000 | 20.76 Mpps | 26.30 Mpps  | 26.70 %     |
  +-----------------------------------------------+

[1] Test description:
- xdp-bench with XDP_DROP
- RX: single queue
- TX: sends 64B packets to saturate CPU on RX side

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260403090927.139042-4-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet/mlx5e: XDP, Improve dma address calculation of linear part for XDP_TX
Dragos Tatulea [Fri, 3 Apr 2026 09:09:24 +0000 (12:09 +0300)] 
net/mlx5e: XDP, Improve dma address calculation of linear part for XDP_TX

When calculating the dma address of the linear part of an XDP frame, the
formula assumes that there is a single XDP buffer per page. Extend the
formula to allow multiple XDP buffers per page by calculating the data
offset in the page.

This is a preparation for the upcoming removal of a single XDP buffer
per page limitation when the formula will no longer be correct.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260403090927.139042-3-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet/mlx5e: XSK, Increase size for chunk_size param
Dragos Tatulea [Fri, 3 Apr 2026 09:09:23 +0000 (12:09 +0300)] 
net/mlx5e: XSK, Increase size for chunk_size param

When 64K pages are used, chunk_size can take the 64K value
which doesn't fit in u16. This results in overflows that
are detected in mlx5e_mpwrq_log_wqe_sz().

Increase the type to u32 to fix this.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260403090927.139042-2-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agoselftests: net: add tests for PPP
Qingfang Deng [Fri, 3 Apr 2026 03:48:47 +0000 (11:48 +0800)] 
selftests: net: add tests for PPP

Add ping and iperf3 tests for ppp_async.c and pppoe.c.

Signed-off-by: Qingfang Deng <qingfang.deng@linux.dev>
Link: https://patch.msgid.link/20260403034908.30017-1-qingfang.deng@linux.dev
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agoxfrm: Drop support for HMAC-RIPEMD-160
Eric Biggers [Sun, 5 Apr 2026 01:15:13 +0000 (18:15 -0700)] 
xfrm: Drop support for HMAC-RIPEMD-160

Drop support for HMAC-RIPEMD-160 from IPsec to reduce the UAPI surface
and simplify future maintenance.  It's almost certainly unused.

RIPEMD-160 received some attention in the early 2000s when SHA-* weren't
quite as well established.  But it never received much adoption outside
of certain niches such as Bitcoin.

It's actually unclear that Linux + IPsec + HMAC-RIPEMD-160 has *ever*
been used, even historically.  When support for it was added in 2003, it
was done so in a "cleanup" commit without any justification [1].  It
didn't actually work until someone happened to fix it 5 years later [2].
That person didn't use or test it either [3].  Finally, also note that
"hmac(rmd160)" is by far the slowest of the algorithms in aalg_list[].

Of course, today IPsec is usually used with an AEAD, such as AES-GCM.
But even for IPsec users still using a dedicated auth algorithm, they
almost certainly aren't using, and shouldn't use, HMAC-RIPEMD-160.

Thus, let's just drop support for it.  Note: no kconfig update is
needed, since CRYPTO_RMD160 wasn't actually being selected anyway.

References:
  [1] linux-history commit d462985fc1941a47
      ("[IPSEC]: Clean up key manager algorithm handling.")
  [2] linux commit a13366c632132bb9
      ("xfrm: xfrm_algo: correct usage of RIPEMD-160")
  [3] https://lore.kernel.org/all/1212340578-15574-1-git-send-email-rueegsegger@swiss-it.ch

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
8 weeks agoMerge branch 'mptcp-support-msg_eor-and-small-cleanups'
Jakub Kicinski [Tue, 7 Apr 2026 02:14:30 +0000 (19:14 -0700)] 
Merge branch 'mptcp-support-msg_eor-and-small-cleanups'

Matthieu Baerts says:

====================
mptcp: support MSG_EOR and small cleanups

This series contains various unrelated patches:

- Patches 1 & 2: support MSG_EOR instead of ignoring it.

- Patch 3: avoid duplicated code in TCP and MPTCP by using a new helper.

- Patch 4: adapt test to reproduce bug and increase code coverage.
====================

Link: https://patch.msgid.link/20260403-net-next-mptcp-msg_eor-misc-v1-0-b0b33bea3fed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoselftests: mptcp: join: recreate signal endp with same ID
Matthieu Baerts (NGI0) [Fri, 3 Apr 2026 11:29:31 +0000 (13:29 +0200)] 
selftests: mptcp: join: recreate signal endp with same ID

In this "delete re-add signal" MPTCP Join subtest, the endpoint linked
to the initial subflow is removed, but readded once with different ID.

It appears that there was an issue when reusing the same ID, recently
fixed by commit d191101dee25 ("mptcp: pm: in-kernel: always set ID as
avail when rm endp"). The test then now reuses the same ID the first
time, but continue to use another one (88) the second time.

This should then cover more cases.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/615
Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260403-net-next-mptcp-msg_eor-misc-v1-5-b0b33bea3fed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agotcp: add recv_should_stop helper
Geliang Tang [Fri, 3 Apr 2026 11:29:29 +0000 (13:29 +0200)] 
tcp: add recv_should_stop helper

Factor out a new helper tcp_recv_should_stop() from tcp_recvmsg_locked()
and tcp_splice_read() to check whether to stop receiving. And use this
helper in mptcp_recvmsg() and mptcp_splice_read() to reduce redundant code.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260403-net-next-mptcp-msg_eor-misc-v1-3-b0b33bea3fed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agomptcp: preserve MSG_EOR semantics in sendmsg path
Gang Yan [Fri, 3 Apr 2026 11:29:28 +0000 (13:29 +0200)] 
mptcp: preserve MSG_EOR semantics in sendmsg path

Extend MPTCP's sendmsg handling to recognize and honor the MSG_EOR flag,
which marks the end of a record for application-level message boundaries.

Data fragments tagged with MSG_EOR are explicitly marked in the
mptcp_data_frag structure and skb context to prevent unintended
coalescing with subsequent data chunks. This ensures the intent of
applications using MSG_EOR is preserved across MPTCP subflows,
maintaining consistent message segmentation behavior.

Signed-off-by: Gang Yan <yangang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260403-net-next-mptcp-msg_eor-misc-v1-2-b0b33bea3fed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agomptcp: reduce 'overhead' from u16 to u8
Gang Yan [Fri, 3 Apr 2026 11:29:27 +0000 (13:29 +0200)] 
mptcp: reduce 'overhead' from u16 to u8

The 'overhead' in struct mptcp_data_frag can safely use u8, as it
represents 'alignment + sizeof(mptcp_data_frag)'. With a maximum
alignment of 7('ALIGN(1, sizeof(long)) - 1'), the overhead is at most
47, well below U8_MAX and validated with BUILD_BUG_ON().

This patch also adds a field named 'unused' for further extensions.

Signed-off-by: Gang Yan <yangang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260403-net-next-mptcp-msg_eor-misc-v1-1-b0b33bea3fed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agodpaa2: avoid linking objects into multiple modules
Arnd Bergmann [Thu, 2 Apr 2026 18:46:55 +0000 (20:46 +0200)] 
dpaa2: avoid linking objects into multiple modules

Each object file contains information about which module it gets linked
into, so linking the same file into multiple modules now causes a warning:

scripts/Makefile.build:254: drivers/net/ethernet/freescale/dpaa2/Makefile: dpaa2-mac.o is added to multiple modules: fsl-dpaa2-eth fsl-dpaa2-switch
scripts/Makefile.build:254: drivers/net/ethernet/freescale/dpaa2/Makefile: dpmac.o is added to multiple modules: fsl-dpaa2-eth fsl-dpaa2-switch

Change the way that dpaa2 is built by moving the two common files into a
separate module with exported symbols instead.

Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20260402184726.3746487-3-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: ethernet: ti-cpsw: fix linking built-in code to modules
Arnd Bergmann [Thu, 2 Apr 2026 18:46:54 +0000 (20:46 +0200)] 
net: ethernet: ti-cpsw: fix linking built-in code to modules

There are six variants of the cpsw driver, sharing various parts of
the code: davinci-emac, cpsw, cpsw-switchdev, netcp, netcp_ethss and
am65-cpsw-nuss.

I noticed that this means some files can be linked into more than
one loadable module, or even part of vmlinux but also linked into
a loadable module, both of which mess up assumptions of the build
system, and causes warnings:

scripts/Makefile.build:279: cpsw_ale.o is added to multiple modules: ti-am65-cpsw-nuss ti_cpsw ti_cpsw_new
scripts/Makefile.build:279: cpsw_priv.o is added to multiple modules: ti_cpsw ti_cpsw_new
scripts/Makefile.build:279: cpsw_sl.o is added to multiple modules: ti-am65-cpsw-nuss ti_cpsw ti_cpsw_new
scripts/Makefile.build:279: cpsw_ethtool.o is added to multiple modules: ti_cpsw ti_cpsw_new
scripts/Makefile.build:279: davinci_cpdma.o is added to multiple modules: ti_cpsw ti_cpsw_new ti_davinci_emac

Change this back to having separate modules for each portion that
can be linked standalone, exporting symbols as needed:

 - ti-cpsw-common.ko now contains both cpsw-common.o and
   davinci_cpdma.o as they are always used together

 - ti-cpsw-priv.ko contains cpsw_priv.o, cpsw_sl.o and cpsw_ethtool.o,
   which are the core of the cpsw and cpsw-new drivers.

 - ti-cpsw-sl.ko contains the cpsw-sl.o object and is used on
   ti-am65-cpsw-nuss.ko in addition to the two other cpsw variants.

 - ti-cpsw-ale.o is the one standalone module that is used by all
   except davinci_emac.

Each of these will be built-in if any of its users are built-in, otherwise
it's a loadable module if there is at least one module using it. I did
not bring back the separate Kconfig symbols for this, but just handle
it using Makefile logic.

Note: ideally this is something that Kbuild complains about, but usually
we just notice when something using THIS_MODULE misbehaves in a way that
a user notices.

Fixes: 99f6297182729 ("net: ethernet: ti: cpsw: drop TI_DAVINCI_CPDMA config option")
Link: https://lore.kernel.org/lkml/20240417084400.3034104-1-arnd@kernel.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20260402184726.3746487-2-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: ethernet: ti-cpsw:: rename soft_reset() function
Arnd Bergmann [Thu, 2 Apr 2026 18:46:53 +0000 (20:46 +0200)] 
net: ethernet: ti-cpsw:: rename soft_reset() function

While looking at the glob symbols shared between the cpsw drivers,
I noticed that soft_reset() is the only one that is missing a proper
namespace prefix, and will pollute the kernel namespace, so rename
it to be consistent with the other symbols.

Reviewed-by: Alexander Sverdlin <alexander.sverdlin@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20260402184726.3746487-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoeth: remove the driver for acenic / tigon1&2
Jakub Kicinski [Fri, 3 Apr 2026 22:05:01 +0000 (15:05 -0700)] 
eth: remove the driver for acenic / tigon1&2

The entire git history for this driver looks like tree-wide
and automated cleanups. There's even more coming now with
AI, so let's try to delete it instead.

Acked-by: Jes Sorensen <jes@trained-monkey.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/20260403220501.2263835-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: macb: Use netif_napi_add_tx() instead of netif_napi_add() for TX NAPI
Kevin Hao [Fri, 3 Apr 2026 14:23:39 +0000 (22:23 +0800)] 
net: macb: Use netif_napi_add_tx() instead of netif_napi_add() for TX NAPI

The TX NAPI should be registered via netif_napi_add_tx() to avoid
unnecessarily polluting the napi_hash table.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
Link: https://patch.msgid.link/20260403-macb-napi-tx-v1-1-08126a60c65e@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoMerge branch 'nfc-support-for-five-qualcomm-sdm845-phones'
Jakub Kicinski [Tue, 7 Apr 2026 01:50:51 +0000 (18:50 -0700)] 
Merge branch 'nfc-support-for-five-qualcomm-sdm845-phones'

David Heidelberg says:

====================
NFC support for five Qualcomm SDM845 phones

- OnePlus 6 / 6T
 - Pixel 3 / 3 XL
 - SHIFT 6MQ

Verified with NFC card using neard:

systemctl enable --now neard
nfctool --device nfc0 -1
nfctool -d nfc0 -p
gdbus introspect --system --dest org.neard --object-path /org/neard/nfc0/tag0/record0

or use gNFC:
  https://gitlab.gnome.org/dh/gnfc/

successfully detecting and reading a tag.
====================

Link: https://patch.msgid.link/20260403-oneplus-nfc-v3-0-fbdce57d63c1@ixit.cz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agodt-bindings: nfc: nxp,nci: Document PN557 compatible
David Heidelberg [Fri, 3 Apr 2026 13:58:46 +0000 (15:58 +0200)] 
dt-bindings: nfc: nxp,nci: Document PN557 compatible

The PN557 uses the same hardware as the PN553 but ships with
firmware compliant with NCI 2.0.

Document PN557 as a compatible device.

Signed-off-by: David Heidelberg <david@ixit.cz>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Link: https://patch.msgid.link/20260403-oneplus-nfc-v3-1-fbdce57d63c1@ixit.cz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoip6_tunnel: use generic for_each_ip_tunnel_rcu macro
Yue Haibing [Fri, 3 Apr 2026 08:46:19 +0000 (16:46 +0800)] 
ip6_tunnel: use generic for_each_ip_tunnel_rcu macro

Remove the locally defined for_each_ip6_tunnel_rcu macro and use
the generic for_each_ip_tunnel_rcu from linux/if_tunnel.h instead.

This eliminates code duplication and ensures consistency across
the kernel tunnel implementations.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260403084619.4107978-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: advance skb_defer_disable_key check in napi_consume_skb
Jason Xing [Thu, 2 Apr 2026 03:41:14 +0000 (11:41 +0800)] 
net: advance skb_defer_disable_key check in napi_consume_skb

When net.core.skb_defer_max is adjusted to zero, napi_consume_skb()
shouldn't go into that deeper in skb_attempt_defer_free() because it adds
an additional pair of local_bh_enable/disable() which is evidently not
needed. Advancing the check of the static key saves more cycles and
benefits non defer case.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Link: https://patch.msgid.link/20260402034114.65766-1-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoMerge branch 'net-dsa-mxl862xx-add-support-for-bridge-offloading'
Jakub Kicinski [Tue, 7 Apr 2026 01:30:35 +0000 (18:30 -0700)] 
Merge branch 'net-dsa-mxl862xx-add-support-for-bridge-offloading'

Daniel Golle says:

====================
net: dsa: mxl862xx: add support for bridge offloading

As a next step to complete the mxl862xx DSA driver, add support for
offloading forwarding between bridged ports to the switch hardware.

This works pretty much without any big surprises, apart from two
subtleties:
 * per-port control over flooding behavior has to be implemented by
   (ab)using a 0-rate QoS meters as stopper in lack of any better
   option.
 * STP state transition unconditionally enables learning on a port
   even if it was previously explicitely disabled (a firmware bug)

Note that as the driver is still lacking all VLAN features (which
are going to be added next), at this point some of the
bridge_vlan_aware.sh tests are failing after applying this series.

This is expected and cannot be avoided without implementing
port_vlan_filtering + port_vlan_add/del. And adding both bridge and
VLAN offloading at the same time would be too much for anyone to
review, so VLAN support is going to be submitted in a follow-up
series immediately after this series has been accepted.

All other relevant selftests (including bridge_vlan_unaware.sh) are
still passing.

Inspired by the comments received from Paolo Abeni as reply to v5
the driver now no longer caches bridge port membership in the
driver, but instead imports an existing helper from yt921x.c to dsa.h
in order to allow the driver to easily iterate over bridge members.
The mapping between DSA bridge num and firmware bridge ID is done
using a simple fixed-size array in mxl862xx_priv.
====================

Link: https://patch.msgid.link/cover.1775049897.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: dsa: mxl862xx: implement bridge offloading
Daniel Golle [Wed, 1 Apr 2026 13:35:01 +0000 (14:35 +0100)] 
net: dsa: mxl862xx: implement bridge offloading

Implement joining and leaving bridges as well as add, delete and dump
operations on isolated FDBs, port MDB membership management, and
setting a port's STP state.

The switch supports a maximum of 63 bridges, however, up to 12 may
be used as "single-port bridges" to isolate standalone ports.
Allowing up to 48 bridges to be offloaded seems more than enough on
that hardware, hence that is set as max_num_bridges.

A total of 128 bridge ports are supported in the bridge portmap, and
virtual bridge ports have to be used eg. for link-aggregation, hence
potentially exceeding the number of hardware ports.

The firmware-assigned bridge identifier (FID) for each offloaded bridge
is stored in an array used to map DSA bridge num to firmware bridge ID,
avoiding the need for a driver-private bridge tracking structure.
Bridge member portmaps are rebuilt on join/leave using
dsa_switch_for_each_bridge_member().

As there are now more users of the BRIDGEPORT_CONFIG_SET API and the
state of each port is cached locally, introduce a helper function
mxl862xx_set_bridge_port(struct dsa_switch *ds, int port) which
applies the cached per-port state to hardware. For standalone user
ports (dp->bridge == NULL), it additionally resets the port to
single-port bridge state: CPU-only portmap, learning and flooding
disabled. The CPU port path sets its state explicitly before calling
this helper and is therefore not affected by the reset.

Note that MASK_VLAN_BASED_MAC_LEARNING is intentionally absent from
the firmware write mask. After mxl862xx_reset(), the firmware
initialises all VLAN-based MAC learning fields to 0 (disabled), so
SVL is the active mode by default without having to set it explicitly.

Note that there is no convenient way to control flooding on per-port
level, so the driver is using a 0-rate QoS meter setup as a stopper in
lack of any better option. In order to be perfect the firmware-enforced
minimum bucket size is bypassed by directly writing 0s to the relevant
registers -- without that at least one 64-byte packet could still
pass before the meter would change from 'yellow' into 'red' state.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/dd079180e2098e5f9626fcd149b9bad9a1b5a1b2.1775049897.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agodsa: tag_mxl862xx: set dsa_default_offload_fwd_mark()
Daniel Golle [Wed, 1 Apr 2026 13:34:52 +0000 (14:34 +0100)] 
dsa: tag_mxl862xx: set dsa_default_offload_fwd_mark()

The MxL862xx offloads bridge forwarding in hardware, so set
dsa_default_offload_fwd_mark() to avoid duplicate forwarding of
packets of (eg. flooded) frames arriving at the CPU port.

Link-local frames are directly trapped to the CPU port only, so don't
set dsa_default_offload_fwd_mark() on those.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/e1161c90894ddc519c57dc0224b3a0f6bfa1d2d6.1775049897.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: dsa: add bridge member iteration macro
Daniel Golle [Wed, 1 Apr 2026 13:34:42 +0000 (14:34 +0100)] 
net: dsa: add bridge member iteration macro

Drivers that offload bridges need to iterate over the ports that are
members of a given bridge, for example to rebuild per-port forwarding
bitmaps when membership changes. Currently drivers typically open-code
this by combining dsa_switch_for_each_user_port() with a
dsa_port_offloads_bridge_dev() check, or cache bridge membership
within the driver.

Add dsa_switch_for_each_bridge_member() macro to express this pattern
directly, and use it for the existing dsa_bridge_ports() inline
helper.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/e7136aaa26773f39e805a00fe4ecf13cd2b83fc0.1775049897.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: dsa: move dsa_bridge_ports() helper to dsa.h
Daniel Golle [Wed, 1 Apr 2026 13:34:30 +0000 (14:34 +0100)] 
net: dsa: move dsa_bridge_ports() helper to dsa.h

The yt921x driver contains a helper to create a bitmap of ports
which are members of a bridge.

Move the helper as static inline function into dsa.h, so other driver
can make use of it as well.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/4f8bbfce3e4e3a02064fc4dc366263136c6e0383.1775049897.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agovsock: avoid timeout for non-blocking accept() with empty backlog
Laurence Rowe [Thu, 2 Apr 2026 20:49:18 +0000 (13:49 -0700)] 
vsock: avoid timeout for non-blocking accept() with empty backlog

A common pattern in epoll network servers is to eagerly accept all
pending connections from the non-blocking listening socket after
epoll_wait indicates the socket is ready by calling accept in a loop
until EAGAIN is returned indicating that the backlog is empty.

Scheduling a timeout for a non-blocking accept with an empty backlog
meant AF_VSOCK sockets used by epoll network servers incurred hundreds
of microseconds of additional latency per accept loop compared to
AF_INET or AF_UNIX sockets.

Signed-off-by: Laurence Rowe <laurencerowe@gmail.com>
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20260402204918.130395-1-laurencerowe@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agopsp: add missing device stats to get-stats reply attributes
Daniel Zahka [Fri, 3 Apr 2026 10:26:31 +0000 (03:26 -0700)] 
psp: add missing device stats to get-stats reply attributes

Commit f05d26198cf2 ("psp: add stats from psp spec to driver facing
api") added device statistics (rx-packets, rx-bytes, rx-auth-fail,
rx-error, rx-bad, tx-packets, tx-bytes, tx-error) to the stats
attribute-set but did not add them to the get-stats operation reply
attributes. The kernel reports these attributes in the reply, so
list them in the spec to match.

Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260403-psp-yaml-fix-v1-1-dacee0663903@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: mctp: defer creation of dst after source-address check
Jeremy Kerr [Fri, 3 Apr 2026 02:24:51 +0000 (10:24 +0800)] 
net: mctp: defer creation of dst after source-address check

Sashiko reports:

> mctp_dst_from_route() increments the device reference count by calling
> mctp_dev_hold(). When a valid route is found and dst is NULL, the
> structure copy is bypassed and rc is set to 0.

Instead of optimistically creating a dst from the final route (then
releasing it if the saddr is invalid), perform the saddr check first.

This means we don't have an unuecessary hold/release on the dev, which
could leak if the dst pointer is NULL. No caller passes a NULL dst at
present though (so the leak is not possible), but this is an intended
use of mctp_dst_from_route().

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260403-dev-mctp-dst-defer-v1-1-9c2c55faf9e9@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: mctp: tests: use actual address when creating dev with addr
Jeremy Kerr [Fri, 3 Apr 2026 02:21:04 +0000 (10:21 +0800)] 
net: mctp: tests: use actual address when creating dev with addr

Sashiko reports:

> This isn't a bug in the core networking code, but the addr parameter
> appears to be ignored here.

In mctp_test_create_dev_with_addr(), we are ignoring the addr argument
and just using `8`. Use the passed address instead.

All invocations use 8 anyway, so no effective change at present.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Reviewed-by: Simon Horman <horms@verge.net.au>
Link: https://patch.msgid.link/20260403-dev-mctp-fix-test-addr-v1-1-b7fa789cdd9b@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoselftests: net: py: color the basics in the output
Jakub Kicinski [Thu, 2 Apr 2026 21:54:44 +0000 (14:54 -0700)] 
selftests: net: py: color the basics in the output

Sometimes it's hard to spot the ok / not ok lines in the output.
This is especially true for the GRO tests which retries a lot
so there's a wall of non-fatal output printed.

Try to color the crucial lines green / red / yellow when running
in a terminal.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260402215444.1589893-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agoMerge branch 'dpll-add-frequency-monitoring-feature'
Jakub Kicinski [Fri, 3 Apr 2026 23:48:03 +0000 (16:48 -0700)] 
Merge branch 'dpll-add-frequency-monitoring-feature'

Ivan Vecera says:

====================
dpll: add frequency monitoring feature

This series adds support for monitoring the measured input frequency
of DPLL input pins via the DPLL netlink interface.

Some DPLL devices can measure the actual frequency being received on
input pins. The approach mirrors the existing phase-offset-monitor
feature: a device-level attribute (DPLL_A_FREQUENCY_MONITOR) enables
or disables monitoring, and a per-pin attribute
(DPLL_A_PIN_MEASURED_FREQUENCY) exposes the measured frequency in
millihertz (mHz) when monitoring is enabled.

Patch 1 adds the new attributes to the DPLL netlink spec (dpll.yaml),
the DPLL_PIN_MEASURED_FREQUENCY_DIVIDER constant, regenerates the
auto-generated UAPI header and netlink policy, and updates
Documentation/driver-api/dpll.rst.

Patch 2 adds the callback operations (freq_monitor_get/set for
devices, measured_freq_get for pins) and the corresponding netlink
GET/SET handlers in the DPLL core. The core only invokes
measured_freq_get when the frequency monitor is enabled on the parent
device. The freq_monitor_get callback is required when measured_freq_get
is provided.

Patch 3 implements the feature in the ZL3073x driver by extracting
a common measurement latch helper from the existing FFO update path,
adding a frequency measurement function, and wiring up the new
callbacks.
====================

Link: https://patch.msgid.link/20260402184057.1890514-1-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agodpll: zl3073x: implement frequency monitoring
Ivan Vecera [Thu, 2 Apr 2026 18:40:57 +0000 (20:40 +0200)] 
dpll: zl3073x: implement frequency monitoring

Extract common measurement latch logic from zl3073x_ref_ffo_update()
into a new zl3073x_ref_freq_meas_latch() helper and add
zl3073x_ref_freq_meas_update() that uses it to latch and read absolute
input reference frequencies in Hz.

Add meas_freq field to struct zl3073x_ref and the corresponding
zl3073x_ref_meas_freq_get() accessor. The measured frequencies are
updated periodically alongside the existing FFO measurements.

Add freq_monitor boolean to struct zl3073x_dpll and implement the
freq_monitor_set/get device callbacks to enable/disable frequency
monitoring via the DPLL netlink interface.

Implement measured_freq_get pin callback for input pins that returns the
measured input frequency in mHz.

Reviewed-by: Petr Oros <poros@redhat.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Link: https://patch.msgid.link/20260402184057.1890514-4-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agodpll: add frequency monitoring callback ops
Ivan Vecera [Thu, 2 Apr 2026 18:40:56 +0000 (20:40 +0200)] 
dpll: add frequency monitoring callback ops

Add new callback operations for a dpll device:
- freq_monitor_get(..) - to obtain current state of frequency monitor
  feature from dpll device,
- freq_monitor_set(..) - to allow feature configuration.

Add new callback operation for a dpll pin:
- measured_freq_get(..) - to obtain the measured frequency in mHz.

Obtain the feature state value using the get callback and provide it to
the user if the device driver implements callbacks. The measured_freq_get
pin callback is only invoked when the frequency monitor is enabled.
The freq_monitor_get device callback is required when measured_freq_get
is provided by the driver.

Execute the set callback upon user requests.

Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Link: https://patch.msgid.link/20260402184057.1890514-3-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2 months agodpll: add frequency monitoring to netlink spec
Ivan Vecera [Thu, 2 Apr 2026 18:40:55 +0000 (20:40 +0200)] 
dpll: add frequency monitoring to netlink spec

Add DPLL_A_FREQUENCY_MONITOR device attribute to allow control over
the frequency monitor feature. The attribute uses the existing
dpll_feature_state enum (enable/disable) and is present in both
device-get reply and device-set request.

Add DPLL_A_PIN_MEASURED_FREQUENCY pin attribute to expose the measured
input frequency in millihertz (mHz). The attribute is present in the
pin-get reply. Add DPLL_PIN_MEASURED_FREQUENCY_DIVIDER constant to
allow userspace to extract integer and fractional parts.

Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Link: https://patch.msgid.link/20260402184057.1890514-2-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>