]> git.ipfire.org Git - thirdparty/linux.git/log
thirdparty/linux.git
3 months agohamradio: use netdev_lockdep_set_classes() helper
Eric Dumazet [Fri, 7 Mar 2025 16:03:58 +0000 (16:03 +0000)] 
hamradio: use netdev_lockdep_set_classes() helper

It is time to use netdev_lockdep_set_classes() in bpqether.c

List of related commits:

0bef512012b1 ("net: add netdev_lockdep_set_classes() to virtual drivers")
c74e1039912e ("net: bridge: use netdev_lockdep_set_classes()")
9a3c93af5491 ("vlan: use netdev_lockdep_set_classes()")
0d7dd798fd89 ("net: ipvlan: call netdev_lockdep_set_classes()")
24ffd752007f ("net: macvlan: call netdev_lockdep_set_classes()")
78e7a2ae8727 ("net: vrf: call netdev_lockdep_set_classes()")
d3fff6c443fe ("net: add netdev_lockdep_set_classes() helper")

syzbot reported:

WARNING: possible recursive locking detected
6.14.0-rc5-syzkaller-01064-g2525e16a2bae #0 Not tainted

dhcpcd/5501 is trying to acquire lock:
 ffff8880797e2d28 (&dev->lock){+.+.}-{4:4}, at: netdev_lock include/linux/netdevice.h:2765 [inline]
 ffff8880797e2d28 (&dev->lock){+.+.}-{4:4}, at: register_netdevice+0x12d8/0x1b70 net/core/dev.c:11008

but task is already holding lock:
 ffff88802e530d28 (&dev->lock){+.+.}-{4:4}, at: netdev_lock include/linux/netdevice.h:2765 [inline]
 ffff88802e530d28 (&dev->lock){+.+.}-{4:4}, at: netdev_lock_ops include/linux/netdevice.h:2804 [inline]
 ffff88802e530d28 (&dev->lock){+.+.}-{4:4}, at: dev_change_flags+0x120/0x270 net/core/dev_api.c:65

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&dev->lock);
  lock(&dev->lock);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

2 locks held by dhcpcd/5501:
  #0: ffffffff8fed6848 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_net_lock include/linux/rtnetlink.h:130 [inline]
  #0: ffffffff8fed6848 (rtnl_mutex){+.+.}-{4:4}, at: devinet_ioctl+0x34c/0x1d80 net/ipv4/devinet.c:1121
  #1: ffff88802e530d28 (&dev->lock){+.+.}-{4:4}, at: netdev_lock include/linux/netdevice.h:2765 [inline]
  #1: ffff88802e530d28 (&dev->lock){+.+.}-{4:4}, at: netdev_lock_ops include/linux/netdevice.h:2804 [inline]
  #1: ffff88802e530d28 (&dev->lock){+.+.}-{4:4}, at: dev_change_flags+0x120/0x270 net/core/dev_api.c:65

stack backtrace:
CPU: 1 UID: 0 PID: 5501 Comm: dhcpcd Not tainted 6.14.0-rc5-syzkaller-01064-g2525e16a2bae #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025
Call Trace:
 <TASK>
  __dump_stack lib/dump_stack.c:94 [inline]
  dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
  print_deadlock_bug+0x483/0x620 kernel/locking/lockdep.c:3039
  check_deadlock kernel/locking/lockdep.c:3091 [inline]
  validate_chain+0x15e2/0x5920 kernel/locking/lockdep.c:3893
  __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5228
  lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5851
  __mutex_lock_common kernel/locking/mutex.c:585 [inline]
  __mutex_lock+0x19c/0x1010 kernel/locking/mutex.c:730
  netdev_lock include/linux/netdevice.h:2765 [inline]
  register_netdevice+0x12d8/0x1b70 net/core/dev.c:11008
  bpq_new_device drivers/net/hamradio/bpqether.c:499 [inline]
  bpq_device_event+0x4b1/0x8d0 drivers/net/hamradio/bpqether.c:542
  notifier_call_chain+0x1a5/0x3f0 kernel/notifier.c:85
 __dev_notify_flags+0x207/0x400
  netif_change_flags+0xf0/0x1a0 net/core/dev.c:9442
  dev_change_flags+0x146/0x270 net/core/dev_api.c:66
  devinet_ioctl+0xea2/0x1d80 net/ipv4/devinet.c:1200
  inet_ioctl+0x3d7/0x4f0 net/ipv4/af_inet.c:1001
  sock_do_ioctl+0x158/0x460 net/socket.c:1190
  sock_ioctl+0x626/0x8e0 net/socket.c:1309
  vfs_ioctl fs/ioctl.c:51 [inline]
  __do_sys_ioctl fs/ioctl.c:906 [inline]
  __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
  do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fixes: 7e4d784f5810 ("net: hold netdev instance lock during rtnetlink operations")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250307160358.3153859-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoudp: expand SKB_DROP_REASON_UDP_CSUM use
Eric Dumazet [Fri, 7 Mar 2025 10:20:02 +0000 (10:20 +0000)] 
udp: expand SKB_DROP_REASON_UDP_CSUM use

SKB_DROP_REASON_UDP_CSUM can be used in four locations
when dropping a packet because of a wrong UDP checksum.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250307102002.2095238-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonetpoll: Optimize skb refilling on critical path
Breno Leitao [Tue, 4 Mar 2025 15:50:41 +0000 (07:50 -0800)] 
netpoll: Optimize skb refilling on critical path

netpoll tries to refill the skb queue on every packet send, independently
if packets are being consumed from the pool or not. This was
particularly problematic while being called from printk(), where the
operation would be done while holding the console lock.

Introduce a more intelligent approach to skb queue management. Instead
of constantly attempting to refill the queue, the system now defers
refilling to a work queue and only triggers the workqueue when a buffer
is actually dequeued. This change significantly reduces operations with
the lock held.

Add a work_struct to the netpoll structure for asynchronous refilling,
updating find_skb() to schedule refill work only when necessary (skb is
dequeued).

These changes have demonstrated a 15% reduction in time spent during
netpoll_send_msg operations, especially when no SKBs are not consumed
from consumed from pool.

When SKBs are being dequeued, the improvement is even better, around
70%, mainly because refilling the SKB pool is now happening outside of
the critical patch (with console_owner lock held).

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250304-netpoll_refill_v2-v1-1-06e2916a4642@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'net-phy-tja11xx-add-support-for-tja1102s'
Jakub Kicinski [Sat, 8 Mar 2025 03:51:08 +0000 (19:51 -0800)] 
Merge branch 'net-phy-tja11xx-add-support-for-tja1102s'

Dimitri Fedrau via says:

====================
net: phy: tja11xx: add support for TJA1102S

- add support for TJA1102S
- enable PHY in sleep mode for TJA1102S

v1: https://lore.kernel.org/20250303-tja1102s-support-v1-0-180e945396e0@liebherr.com
====================

Link: https://patch.msgid.link/20250304-tja1102s-support-v2-0-cd3e61ab920f@liebherr.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phy: tja11xx: enable PHY in sleep mode for TJA1102S
Dimitri Fedrau [Tue, 4 Mar 2025 18:37:27 +0000 (19:37 +0100)] 
net: phy: tja11xx: enable PHY in sleep mode for TJA1102S

Due to pin strapping the PHY maybe disabled per default. TJA1102 devices
can be enabled by setting the PHY_EN bit. Support is provided for TJA1102S
devices but can be easily added for TJA1102 too.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Dimitri Fedrau <dimitri.fedrau@liebherr.com>
Link: https://patch.msgid.link/20250304-tja1102s-support-v2-2-cd3e61ab920f@liebherr.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phy: tja11xx: add support for TJA1102S
Dimitri Fedrau [Tue, 4 Mar 2025 18:37:26 +0000 (19:37 +0100)] 
net: phy: tja11xx: add support for TJA1102S

NXPs TJA1102S is a single PHY version of the TJA1102 in which one of the
PHYs is disabled.

Signed-off-by: Dimitri Fedrau <dimitri.fedrau@liebherr.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250304-tja1102s-support-v2-1-cd3e61ab920f@liebherr.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: ethernet: Remove accidental duplication in Kconfig file
Lukas Bulwahn [Thu, 6 Mar 2025 09:47:53 +0000 (10:47 +0100)] 
net: ethernet: Remove accidental duplication in Kconfig file

Commit fb3dda82fd38 ("net: airoha: Move airoha_eth driver in a dedicated
folder") accidentally added the line:

  source "drivers/net/ethernet/mellanox/Kconfig"

in drivers/net/ethernet/Kconfig, so that this line is duplicated in that
file.

Remove this accidental duplication.

Fixes: fb3dda82fd38 ("net: airoha: Move airoha_eth driver in a dedicated folder")
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20250306094753.63806-1-lukas.bulwahn@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMAINTAINERS: adjust entry in AIROHA ETHERNET DRIVER
Lukas Bulwahn [Thu, 6 Mar 2025 09:46:36 +0000 (10:46 +0100)] 
MAINTAINERS: adjust entry in AIROHA ETHERNET DRIVER

Commit fb3dda82fd38 ("net: airoha: Move airoha_eth driver in a dedicated
folder") moves the driver to drivers/net/ethernet/airoha/, but misses to
adjust the AIROHA ETHERNET DRIVER section in MAINTAINERS. Hence,
./scripts/get_maintainer.pl --self-test=patterns complains about a broken
reference.

Adjust the file entry to the dedicated folder for this driver.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20250306094636.63709-1-lukas.bulwahn@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: airoha: Fix dev->dsa_ptr check in airoha_get_dsa_tag()
Lorenzo Bianconi [Thu, 6 Mar 2025 10:52:20 +0000 (11:52 +0100)] 
net: airoha: Fix dev->dsa_ptr check in airoha_get_dsa_tag()

Fix the following warning reported by Smatch static checker in
airoha_get_dsa_tag routine:

drivers/net/ethernet/airoha/airoha_eth.c:1722 airoha_get_dsa_tag()
warn: 'dp' isn't an ERR_PTR

dev->dsa_ptr can't be set to an error pointer, it can just be NULL.
Remove this check since it is already performed in netdev_uses_dsa().

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/netdev/Z8l3E0lGOcrel07C@lore-desk/T/#m54adc113fcdd8c5e6c5f65ffd60d8e8b1d483d90
Fixes: af3cf757d5c9 ("net: airoha: Move DSA tag in DMA descriptor")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250306-airoha-flowtable-fixes-v1-1-68d3c1296cdd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'tcp-ulp-diag-expose-more-to-non-net-admin-users'
Jakub Kicinski [Sat, 8 Mar 2025 03:39:57 +0000 (19:39 -0800)] 
Merge branch 'tcp-ulp-diag-expose-more-to-non-net-admin-users'

Matthieu Baerts says:

====================
tcp: ulp: diag: expose more to non net admin users

Since its introduction in commit 61723b393292 ("tcp: ulp: add functions
to dump ulp-specific information"), the ULP diag info have been exported
only to users with CAP_NET_ADMIN capability.

Not everything is sensitive, and some info can be exported to all users
in order to ease the debugging from the userspace side without requiring
additional capabilities.

First, the ULP name can be easily exported. Then more depending on each
layer:

 - On kTLS side, it looks like everything can be exported to all users:
   version, cipher type, tx/rx user config type, plus some flags.

 - On MPTCP side, everything but the sequence numbers are exported to
   all non net admin users, similar to TCP.
====================

Link: https://patch.msgid.link/20250306-net-next-tcp-ulp-diag-net-admin-v1-0-06afdd860fc9@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agotcp: ulp: diag: more info without CAP_NET_ADMIN
Matthieu Baerts (NGI0) [Thu, 6 Mar 2025 11:29:28 +0000 (12:29 +0100)] 
tcp: ulp: diag: more info without CAP_NET_ADMIN

When introduced in commit 61723b393292 ("tcp: ulp: add functions to dump
ulp-specific information"), the whole ULP diag info has been exported
only if the requester had CAP_NET_ADMIN.

It looks like not everything is sensitive, and some info can be exported
to all users in order to ease the debugging from the userspace side
without requiring additional capabilities. Each layer should then decide
what can be exposed to everybody. The 'net_admin' boolean is then passed
to the different layers.

On kTLS side, it looks like there is nothing sensitive there: version,
cipher type, tx/rx user config type, plus some flags. So, only some
metadata about the configuration, no cryptographic info like keys, etc.
Then, everything can be exported to all users.

On MPTCP side, that's different. The MPTCP-related sequence numbers per
subflow should certainly not be exposed to everybody. For example, the
DSS mapping and ssn_offset would give all users on the system access to
narrow ranges of values for the subflow TCP sequence numbers and
MPTCP-level DSNs, and then ease packet injection. The TCP diag interface
doesn't expose the TCP sequence numbers for TCP sockets, so best to do
the same here. The rest -- token, IDs, flags -- can be exported to
everybody.

Acked-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250306-net-next-tcp-ulp-diag-net-admin-v1-2-06afdd860fc9@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agotcp: ulp: diag: always print the name if any
Matthieu Baerts (NGI0) [Thu, 6 Mar 2025 11:29:27 +0000 (12:29 +0100)] 
tcp: ulp: diag: always print the name if any

Since its introduction in commit 61723b393292 ("tcp: ulp: add functions
to dump ulp-specific information"), the ULP diag info have been exported
only if the requester had CAP_NET_ADMIN.

At least the ULP name can be exported without CAP_NET_ADMIN. This will
already help identifying which layer is being used, e.g. which TCP
connections are in fact MPTCP subflow.

Acked-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250306-net-next-tcp-ulp-diag-net-admin-v1-1-06afdd860fc9@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'eth-fbnic-support-ring-size-configuration'
Jakub Kicinski [Sat, 8 Mar 2025 03:37:39 +0000 (19:37 -0800)] 
Merge branch 'eth-fbnic-support-ring-size-configuration'

Jakub Kicinski says:

====================
eth: fbnic: support ring size configuration

Support ethtool -g / -G and a couple other small tweaks.
====================

Link: https://patch.msgid.link/20250306145150.1757263-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoeth: fbnic: support ring size configuration
Jakub Kicinski [Thu, 6 Mar 2025 14:51:50 +0000 (06:51 -0800)] 
eth: fbnic: support ring size configuration

Support ethtool -g / -G. Leverage the code added for -l / -L
to alloc / stop / start / free.

Check parameters against HW min/max but also our own min/max.
Min HW queue is 16 entries, we can't deal with TWQs that small
because of the queue waking logic. Add similar contraint on RCQ
for symmetry.

We need 3 sizes on Rx, as the NIC does header-data split two separate
buffer pools:
  (1) head page ring    - how many empty pages we post for headers
  (2) payload page ring - how many empty pages we post for payloads
  (3) completion ring   - where NIC produces the Rx descriptors

Acked-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250306145150.1757263-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoeth: fbnic: fix typo in compile assert
Jakub Kicinski [Thu, 6 Mar 2025 14:51:49 +0000 (06:51 -0800)] 
eth: fbnic: fix typo in compile assert

We should be validating the Rx count on the Rx struct,
not the Tx struct. There is no real change here, rx_stats
and tx_stats are instances of the same struct.

Acked-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250306145150.1757263-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoeth: fbnic: link NAPIs to page pools
Jakub Kicinski [Thu, 6 Mar 2025 14:51:48 +0000 (06:51 -0800)] 
eth: fbnic: link NAPIs to page pools

The lifetime of page pools is tied to NAPI instances,
and they are destroyed before NAPI is deleted.
It's safe to link them up.

Acked-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250306145150.1757263-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'net-bcmgenet-revise-suspend-resume'
Jakub Kicinski [Sat, 8 Mar 2025 03:33:50 +0000 (19:33 -0800)] 
Merge branch 'net-bcmgenet-revise-suspend-resume'

Doug Berger says:

====================
net: bcmgenet: revise suspend/resume

This commit set updates the GENET driver to reduce the delay to
resume the ethernet link when the Wake on Lan features are used.

In addition, the encoding of hardware versioning and features is
revised to avoid some redundancy and improve readability as well
as remove a warning that occurred for the BCM7712 device which
updated the device major version while maintaining compatibility
with the driver.

The assignment of hardware descriptor rings was modified to
simplify programming and to allow support for the hardware
RX_CLS_FLOW_DISC filter action.
====================

Link: https://patch.msgid.link/20250306192643.2383632-1-opendmb@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: bcmgenet: revise suspend/resume
Doug Berger [Thu, 6 Mar 2025 19:26:42 +0000 (11:26 -0800)] 
net: bcmgenet: revise suspend/resume

If the network interface is configured for Wake-on-LAN we should
avoid bringing the interface down and up since it slows the time
to reestablish network traffic on resume.

Redundant calls to phy_suspend() and phy_resume() are removed
since they are already invoked from within phy_stop() and
phy_start() called from bcmgenet_netif_stop() and
bcmgenet_netif_start().

Signed-off-by: Doug Berger <opendmb@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250306192643.2383632-15-opendmb@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: bcmgenet: allow return of power up status
Doug Berger [Thu, 6 Mar 2025 19:26:41 +0000 (11:26 -0800)] 
net: bcmgenet: allow return of power up status

It is possible for a WoL power up to fail due to the GENET being
reset while in the suspend state. Allow these failures to be
returned as error codes to allow different recovery behavior
when necessary.

Signed-off-by: Doug Berger <opendmb@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250306192643.2383632-14-opendmb@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: bcmgenet: move bcmgenet_power_up into resume_noirq
Doug Berger [Thu, 6 Mar 2025 19:26:40 +0000 (11:26 -0800)] 
net: bcmgenet: move bcmgenet_power_up into resume_noirq

The bcmgenet_power_up() function is moved from the resume method
to the resume_noirq method for symmetry with the suspend_noirq
method. This allows the wol_active flag to be removed.

The UMAC_IRQ_WAKE_EVENT interrupts that can be unmasked by the
bcmgenet_wol_power_down_cfg() function are now re-masked by the
bcmgenet_wol_power_up_cfg() function at the resume_noirq level
as well.

Signed-off-by: Doug Berger <opendmb@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250306192643.2383632-13-opendmb@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: bcmgenet: support reclaiming unsent Tx packets
Doug Berger [Thu, 6 Mar 2025 19:26:39 +0000 (11:26 -0800)] 
net: bcmgenet: support reclaiming unsent Tx packets

When disabling the transmitter any outstanding packets can now
be reclaimed by bcmgenet_tx_reclaim_all() rather than by the
bcmgenet_fini_dma() function.

Signed-off-by: Doug Berger <opendmb@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250306192643.2383632-12-opendmb@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: bcmgenet: introduce bcmgenet_[r|t]dma_disable
Doug Berger [Thu, 6 Mar 2025 19:26:38 +0000 (11:26 -0800)] 
net: bcmgenet: introduce bcmgenet_[r|t]dma_disable

The bcmgenet_rdma_disable and bcmgenet_tdma_disable functions
are introduced to provide a common method for disabling each
dma and the code is simplified.

Signed-off-by: Doug Berger <opendmb@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250306192643.2383632-11-opendmb@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: bcmgenet: consolidate dma initialization
Doug Berger [Thu, 6 Mar 2025 19:26:37 +0000 (11:26 -0800)] 
net: bcmgenet: consolidate dma initialization

The functions bcmgenet_dma_disable and bcmgenet_enable_dma are
only used as part of dma initialization. Their functionality is
moved inside bcmgenet_init_dma and the functions are removed.

Since the dma is always disabled inside of bcmgenet_init_dma,
the initialization functions bcmgenet_init_rx_queues and
bcmgenet_init_tx_queues no longer need to attempt to manage its
state.

Signed-off-by: Doug Berger <opendmb@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250306192643.2383632-10-opendmb@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: bcmgenet: remove dma_ctrl argument
Doug Berger [Thu, 6 Mar 2025 19:26:36 +0000 (11:26 -0800)] 
net: bcmgenet: remove dma_ctrl argument

Since the individual queues manage their own DMA enables there
is no need to return dma_ctrl from bcmgenet_dma_disable() and
pass it back to bcmgenet_enable_dma().

Signed-off-by: Doug Berger <opendmb@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250306192643.2383632-9-opendmb@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: bcmgenet: add support for RX_CLS_FLOW_DISC
Doug Berger [Thu, 6 Mar 2025 19:26:35 +0000 (11:26 -0800)] 
net: bcmgenet: add support for RX_CLS_FLOW_DISC

Now that the DESC_INDEX ring descriptor is no longer used we can
enable hardware discarding of flows by routing them to a queue
that is not enabled.

Signed-off-by: Doug Berger <opendmb@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250306192643.2383632-8-opendmb@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: bcmgenet: move DESC_INDEX flow to ring 0
Doug Berger [Thu, 6 Mar 2025 19:26:34 +0000 (11:26 -0800)] 
net: bcmgenet: move DESC_INDEX flow to ring 0

The default transmit and receive packet handling is moved from
the DESC_INDEX (i.e. 16) descriptor rings to the Ring 0 queues.
This saves a fair amount of special case code by unifying the
handling.

A default dummy filter is enabled in the Hardware Filter Block
to route default receive packets to Ring 0.

Signed-off-by: Doug Berger <opendmb@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250306192643.2383632-7-opendmb@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: bcmgenet: extend bcmgenet_hfb_* API
Doug Berger [Thu, 6 Mar 2025 19:26:33 +0000 (11:26 -0800)] 
net: bcmgenet: extend bcmgenet_hfb_* API

Extend the bcmgenet_hfb_* API to allow initialization and
programming of the Hardware Filter Block on GENET v1 and
GENET v2 hardware. Programming of ethtool flows is still
not supported on this older hardware.

Signed-off-by: Doug Berger <opendmb@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250306192643.2383632-6-opendmb@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: bcmgenet: BCM7712 is GENETv5 compatible
Doug Berger [Thu, 6 Mar 2025 19:26:32 +0000 (11:26 -0800)] 
net: bcmgenet: BCM7712 is GENETv5 compatible

The major revision of the GENET core in the BCM7712 SoC was bumped
to 7 but it is compatible with the GENETv5 implementation. This
commit maps the version accordingly to avoid a warning.

Signed-off-by: Doug Berger <opendmb@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250306192643.2383632-5-opendmb@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: bcmgenet: move feature flags to bcmgenet_priv
Doug Berger [Thu, 6 Mar 2025 19:26:31 +0000 (11:26 -0800)] 
net: bcmgenet: move feature flags to bcmgenet_priv

The feature flags are moved and consolidated to the primary
private driver structure and are now initialized from the
platform device data rather than the hardware parameters to
allow finer control over which platforms use which features.

Signed-off-by: Doug Berger <opendmb@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250306192643.2383632-4-opendmb@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: bcmgenet: add bcmgenet_has_* helpers
Doug Berger [Thu, 6 Mar 2025 19:26:30 +0000 (11:26 -0800)] 
net: bcmgenet: add bcmgenet_has_* helpers

Introduce helper functions to indicate whether the driver should
make use of a particular feature that it supports. These helpers
abstract the implementation of how the feature availability is
encoded.

Signed-off-by: Doug Berger <opendmb@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250306192643.2383632-3-opendmb@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: bcmgenet: bcmgenet_hw_params clean up
Doug Berger [Thu, 6 Mar 2025 19:26:29 +0000 (11:26 -0800)] 
net: bcmgenet: bcmgenet_hw_params clean up

The entries of the bcmgenet_hw_params array are broken out to
remove unused and duplicate entries and are made read only since
they should not change for a specific version of the GENET
hardware.

Signed-off-by: Doug Berger <opendmb@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250306192643.2383632-2-opendmb@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: remove write-only priv->speed
Russell King (Oracle) [Fri, 7 Mar 2025 00:11:29 +0000 (00:11 +0000)] 
net: stmmac: remove write-only priv->speed

priv->speed is only ever written to in two locations, but never
read. Therefore, it serves no useful purpose. Remove this unnecessary
struct member.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1tqLJJ-005aQm-Mv@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agogve: convert to use netmem for DQO RDA mode
Harshitha Ramamurthy [Fri, 7 Mar 2025 00:39:05 +0000 (00:39 +0000)] 
gve: convert to use netmem for DQO RDA mode

To add netmem support to the gve driver, add a union
to the struct gve_rx_slot_page_info. netmem_ref is used for
DQO queue format's raw DMA addressing(RDA) mode. The struct
page is retained for other usecases.

Then, switch to using relevant netmem helper functions for
page pool and skb frag management.

Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Harshitha Ramamurthy <hramamurthy@google.com>
Link: https://patch.msgid.link/20250307003905.601175-1-hramamurthy@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: ethtool: use correct device pointer in ethnl_default_dump_one()
Eric Dumazet [Fri, 7 Mar 2025 08:35:44 +0000 (08:35 +0000)] 
net: ethtool: use correct device pointer in ethnl_default_dump_one()

ethnl_default_dump_one() operates on the device provided in its @dev
parameter, not from ctx->req_info->dev.

syzbot reported:

Oops: general protection fault, probably for non-canonical address 0xdffffc0000000197: 0000 [#1] PREEMPT SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000cb8-0x0000000000000cbf]
 RIP: 0010:netdev_need_ops_lock include/linux/netdevice.h:2792 [inline]
 RIP: 0010:netdev_lock_ops include/linux/netdevice.h:2803 [inline]
 RIP: 0010:ethnl_default_dump_one net/ethtool/netlink.c:557 [inline]
 RIP: 0010:ethnl_default_dumpit+0x447/0xd40 net/ethtool/netlink.c:593
Call Trace:
 <TASK>
  genl_dumpit+0x10d/0x1b0 net/netlink/genetlink.c:1027
  netlink_dump+0x64d/0xe10 net/netlink/af_netlink.c:2309
  __netlink_dump_start+0x5a2/0x790 net/netlink/af_netlink.c:2424
  genl_family_rcv_msg_dumpit net/netlink/genetlink.c:1076 [inline]
  genl_family_rcv_msg net/netlink/genetlink.c:1192 [inline]
  genl_rcv_msg+0x894/0xec0 net/netlink/genetlink.c:1210
  netlink_rcv_skb+0x206/0x480 net/netlink/af_netlink.c:2534
  genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219
  netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline]
  netlink_unicast+0x7f6/0x990 net/netlink/af_netlink.c:1339
  netlink_sendmsg+0x8de/0xcb0 net/netlink/af_netlink.c:1883
  sock_sendmsg_nosec net/socket.c:709 [inline]
  __sock_sendmsg+0x221/0x270 net/socket.c:724
  ____sys_sendmsg+0x53a/0x860 net/socket.c:2564
  ___sys_sendmsg net/socket.c:2618 [inline]
  __sys_sendmsg+0x269/0x350 net/socket.c:2650

Fixes: 2bcf4772e45a ("net: ethtool: try to protect all callback with netdev instance lock")
Reported-by: syzbot+3da2442641f0c6a705a2@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/67caaf5e.050a0220.15b4b9.007a.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250307083544.1659135-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agobpf: fix a possible NULL deref in bpf_map_offload_map_alloc()
Eric Dumazet [Fri, 7 Mar 2025 07:43:02 +0000 (07:43 +0000)] 
bpf: fix a possible NULL deref in bpf_map_offload_map_alloc()

Call bpf_dev_offload_check() before netdev_lock_ops().

This is needed if attr->map_ifindex is not valid.

Oops: general protection fault, probably for non-canonical address 0xdffffc0000000197: 0000 [#1] PREEMPT SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000cb8-0x0000000000000cbf]
 RIP: 0010:netdev_need_ops_lock include/linux/netdevice.h:2792 [inline]
 RIP: 0010:netdev_lock_ops include/linux/netdevice.h:2803 [inline]
 RIP: 0010:bpf_map_offload_map_alloc+0x19a/0x910 kernel/bpf/offload.c:533
Call Trace:
 <TASK>
  map_create+0x946/0x11c0 kernel/bpf/syscall.c:1455
  __sys_bpf+0x6d3/0x820 kernel/bpf/syscall.c:5777
  __do_sys_bpf kernel/bpf/syscall.c:5902 [inline]
  __se_sys_bpf kernel/bpf/syscall.c:5900 [inline]
  __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5900
  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
  do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83

Fixes: 97246d6d21c2 ("net: hold netdev instance lock during ndo_bpf")
Reported-by: syzbot+0c7bfd8cf3aecec92708@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/67caa2b1.050a0220.15b4b9.0077.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250307074303.1497911-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf...
Jakub Kicinski [Sat, 8 Mar 2025 03:08:49 +0000 (19:08 -0800)] 
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-03-06

We've added 6 non-merge commits during the last 13 day(s) which contain
a total of 6 files changed, 230 insertions(+), 56 deletions(-).

The main changes are:

1) Add XDP metadata support for tun driver, from Marcus Wichelmann.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  selftests/bpf: Fix file descriptor assertion in open_tuntap helper
  selftests/bpf: Add test for XDP metadata support in tun driver
  selftests/bpf: Refactor xdp_context_functional test and bpf program
  selftests/bpf: Move open_tuntap to network helpers
  net: tun: Enable transfer of XDP metadata to skb
  net: tun: Enable XDP metadata support
====================

Link: https://patch.msgid.link/20250307055335.441298-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests/net: add proc_net_pktgen to .gitignore
Willem de Bruijn [Fri, 7 Mar 2025 03:13:44 +0000 (22:13 -0500)] 
selftests/net: add proc_net_pktgen to .gitignore

Ensure git doesn't pick up this new target.

Fixes: 03544faad761 ("selftest: net: add proc_net_pktgen")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250307031356.368350-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'riscv-sophgo-add-ethernet-support-for-sg2044'
Jakub Kicinski [Sat, 8 Mar 2025 03:06:38 +0000 (19:06 -0800)] 
Merge branch 'riscv-sophgo-add-ethernet-support-for-sg2044'

Inochi Amaoto says:

====================
riscv: sophgo: Add ethernet support for SG2044

The ethernet controller of SG2044 is Synopsys DesignWare IP with
custom clock. Add glue layer for it.

v6: https://lore.kernel.org/20250305063920.803601-1-inochiama@gmail.com
v5: https://lore.kernel.org/20250216123953.1252523-1-inochiama@gmail.com
v4: https://lore.kernel.org/20250209013054.816580-1-inochiama@gmail.com
v3: https://lore.kernel.org/20241223005843.483805-1-inochiama@gmail.com
RFC: https://lore.kernel.org/20241101014327.513732-1-inochiama@gmail.com
v2: https://lore.kernel.org/20241025011000.244350-1-inochiama@gmail.com
v1: https://lore.kernel.org/20241021103617.653386-1-inochiama@gmail.com
====================

Link: https://patch.msgid.link/20250307011623.440792-1-inochiama@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: Add glue layer for Sophgo SG2044 SoC
Inochi Amaoto [Fri, 7 Mar 2025 01:16:17 +0000 (09:16 +0800)] 
net: stmmac: Add glue layer for Sophgo SG2044 SoC

Adds Sophgo dwmac driver support on the Sophgo SG2044 SoC.

Signed-off-by: Inochi Amaoto <inochiama@gmail.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20250307011623.440792-5-inochiama@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: platform: Add snps,dwmac-5.30a IP compatible string
Inochi Amaoto [Fri, 7 Mar 2025 01:16:16 +0000 (09:16 +0800)] 
net: stmmac: platform: Add snps,dwmac-5.30a IP compatible string

Add "snps,dwmac-5.30a" compatible string for 5.30a version that can avoid
to define some platform data in the glue layer.

Signed-off-by: Inochi Amaoto <inochiama@gmail.com>
Reviewed-by: Romain Gantois <romain.gantois@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20250307011623.440792-4-inochiama@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: platform: Group GMAC4 compatible check
Inochi Amaoto [Fri, 7 Mar 2025 01:16:15 +0000 (09:16 +0800)] 
net: stmmac: platform: Group GMAC4 compatible check

Use of_device_compatible_match to group existing compatible
check of GMAC4 device.

Signed-off-by: Inochi Amaoto <inochiama@gmail.com>
Reviewed-by: Romain Gantois <romain.gantois@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20250307011623.440792-3-inochiama@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agodt-bindings: net: Add support for Sophgo SG2044 dwmac
Inochi Amaoto [Fri, 7 Mar 2025 01:16:14 +0000 (09:16 +0800)] 
dt-bindings: net: Add support for Sophgo SG2044 dwmac

The GMAC IP on SG2044 is almost a standard Synopsys DesignWare
MAC (version 5.30a) with some extra clock.

Add necessary compatible string for this device.

Signed-off-by: Inochi Amaoto <inochiama@gmail.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250307011623.440792-2-inochiama@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phylink: Remove unused phylink_init_eee
Dr. David Alan Gilbert [Thu, 6 Mar 2025 18:45:34 +0000 (18:45 +0000)] 
net: phylink: Remove unused phylink_init_eee

phylink_init_eee() is currently unused.

It was last added in 2019 by
commit 86e58135bc4a ("net: phylink: add phylink_init_eee() helper")
but it didn't actually wire a use up.

It had previous been removed in 2017 by
commit 939eae25d9a5 ("phylink: remove phylink_init_eee()").

Remove it again.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20250306184534.246152-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'mlx5-misc-enhancements-2025-03-04'
Jakub Kicinski [Fri, 7 Mar 2025 01:53:37 +0000 (17:53 -0800)] 
Merge branch 'mlx5-misc-enhancements-2025-03-04'

Tariq Toukan says:

====================
mlx5 misc enhancements 2025-03-04

This series introduces enhancements to the mlx5 core and Eth drivers.

Patches 1-3 by Shahar introduce support for configuring lanes alongside
speed when autonegotiation is disabled. The combination of speed and the
number of lanes corresponds to a specific link mode (in the extended
mask typically used in newer hardware), allowing the user to select a
precise link mode when autonegotiation is off, instead of just choosing
the speed.

Patch 4 by Amir extends the multi-port LAG support.

Patches 5-6 by Leon enhance IPsec matching logic.

v1: https://lore.kernel.org/20250226114752.104838-1-tariqt@nvidia.com
====================

Link: https://patch.msgid.link/20250304160620.417580-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5e: Properly match IPsec subnet addresses
Leon Romanovsky [Tue, 4 Mar 2025 16:06:20 +0000 (18:06 +0200)] 
net/mlx5e: Properly match IPsec subnet addresses

Existing match criteria didn't allow to match whole subnet and
only by specific addresses only. This caused to tunnel mode do not
forward such traffic through relevant SA.

In tunnel mode, policies look like this:
src 192.169.0.0/16 dst 192.169.0.0/16
        dir out priority 383615 ptype main
        tmpl src 192.169.101.2 dst 192.169.101.1
                proto esp spi 0xc5141c18 reqid 1 mode tunnel
        crypto offload parameters: dev eth2 mode packet

In this case, the XFRM core code handled all subnet calculations and
forwarded network address to the drivers e.g. 192.169.0.0.

For mlx5 devices, there is a need to set relevant prefix e.g. 0xFFFF00
to perform flow steering match operation.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250304160620.417580-7-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5e: Separate address related variables to be in struct
Leon Romanovsky [Tue, 4 Mar 2025 16:06:19 +0000 (18:06 +0200)] 
net/mlx5e: Separate address related variables to be in struct

Prepare the code to addition of prefix handling logic which is needed
to support matching logic based on source and/or destination network
prefixes.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250304160620.417580-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5: Lag, Enable Multiport E-Switch offloads on 8 ports LAG
Amir Tzin [Tue, 4 Mar 2025 16:06:18 +0000 (18:06 +0200)] 
net/mlx5: Lag, Enable Multiport E-Switch offloads on 8 ports LAG

Patch [1] added mlx5 driver support for 8 ports HCAs which are available
since ConnectX-8. Now that Multiport E-Switch is tested, we can enable
it by removing flag MLX5_LAG_MPESW_OFFLOADS_SUPPORTED_PORTS.

[1]
commit e0e6adfe8c20 ("net/mlx5: Enable 8 ports LAG")

Signed-off-by: Amir Tzin <amirtz@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250304160620.417580-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5e: Enable lanes configuration when auto-negotiation is off
Shahar Shitrit [Tue, 4 Mar 2025 16:06:17 +0000 (18:06 +0200)] 
net/mlx5e: Enable lanes configuration when auto-negotiation is off

Currently, when auto-negotiation is disabled, the driver retrieves
the speed and converts it into all link modes that correspond to
that speed. With this patch, we add the ability to set the number
of lanes, so that the combination of speed and lanes corresponds to
exactly one specific link mode for the extended bit map.

For the legacy bit map the driver sets all link modes correspond to
speed and lanes.

This change provides users with the option to set a specific link
mode, rather than enabling all link modes associated with a given
speed when auto-negotiation is off.

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250304160620.417580-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5: Refactor link speed handling with mlx5_link_info struct
Shahar Shitrit [Tue, 4 Mar 2025 16:06:16 +0000 (18:06 +0200)] 
net/mlx5: Refactor link speed handling with mlx5_link_info struct

Introduce struct mlx5_link_info with a speed field and change the
types of mlx5e_link_speed and mlx5e_ext_link_speed from arrays of
u32 to arrays of struct mlx5_link_info. These arrays are renamed
to mlx5e_link_info and mlx5e_ext_link_info, respectively.

This change prepares for a future patch that will introduce a lanes
field in struct mlx5_link_info and add lanes mapping alongside the
speed for each link mode in the two arrays.

Additionally, rename function mlx5_port_speed2linkmodes() to
mlx5_port_info2linkmodes() and function mlx5_port_ptys2speed()
to mlx5_port_ptys2info() and update the speed parameter/return
type to struct mlx5_link_info, in preparation for the upcoming
patch where these functions will also utilize the lanes field.

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250304160620.417580-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5: Relocate function declarations from port.h to mlx5_core.h
Shahar Shitrit [Tue, 4 Mar 2025 16:06:15 +0000 (18:06 +0200)] 
net/mlx5: Relocate function declarations from port.h to mlx5_core.h

The port header is a general file under include, yet it
contains declarations for functions that are either not
exported or exported but not used outside the mlx5_core
driver.

To enhance code organization, we move these declarations
to mlx5_core.h, where they are more appropriately scoped.

This refactor removes unnecessary exported symbols and
prevents unexported functions from being inadvertently
referenced outside of the mlx5_core driver.

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250304160620.417580-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'add-perout-configuration-support-in-iep-driver'
Jakub Kicinski [Fri, 7 Mar 2025 01:47:05 +0000 (17:47 -0800)] 
Merge branch 'add-perout-configuration-support-in-iep-driver'

Meghana Malladi says:

====================
Add perout configuration support in IEP driver

IEP driver supported both perout and pps signal generation
but perout feature is faulty with half-cooked support
due to some missing configuration. Hence perout feature is
removed as a bug fix. This patch series adds back this feature
which configures perout signal based on the arguments passed
by the perout request.

This patch series is continuation to the bug fix:
https://lore.kernel.org/20250227092441.1848419-1-m-malladi@ti.com
as suggested by Jakub Kicinski and Jacob Keller:
https://lore.kernel.org/20250220172410.025b96d6@kernel.org

v3: https://lore.kernel.org/20250303135124.632845-1-m-malladi@ti.com
====================

Link: https://patch.msgid.link/20250304105753.1552159-1-m-malladi@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: ti: icss-iep: Add phase offset configuration for perout signal
Meghana Malladi [Tue, 4 Mar 2025 10:57:53 +0000 (16:27 +0530)] 
net: ti: icss-iep: Add phase offset configuration for perout signal

icss_iep_perout_enable_hw() is a common function for generating
both pps and perout signals. When enabling pps, the application needs
to only pass enable/disable argument, whereas for perout it supports
different flags to configure the signal.

In case the app passes a valid phase offset value, the signal should
start toggling after that phase offset, else start immediately or
as soon as possible. ICSS_IEP_SYNC_START_REG register take number of
clock cycles to wait before starting the signal after activation time.
Set appropriate value to this register to support phase offset.

Signed-off-by: Meghana Malladi <m-malladi@ti.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20250304105753.1552159-3-m-malladi@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: ti: icss-iep: Add pwidth configuration for perout signal
Meghana Malladi [Tue, 4 Mar 2025 10:57:52 +0000 (16:27 +0530)] 
net: ti: icss-iep: Add pwidth configuration for perout signal

icss_iep_perout_enable_hw() is a common function for generating
both pps and perout signals. When enabling pps, the application needs
to only pass enable/disable argument, whereas for perout it supports
different flags to configure the signal.

But icss_iep_perout_enable_hw() function is missing to hook the
configuration params passed by the app, causing perout to behave
same a pps (except being able to configure the period). As duty cycle
is also one feature which can configured for perout, incorporate this
in the function to get the expected signal.

Signed-off-by: Meghana Malladi <m-malladi@ti.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20250304105753.1552159-2-m-malladi@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests: openvswitch: don't hardcode the drop reason subsys
Jakub Kicinski [Tue, 4 Mar 2025 18:06:15 +0000 (10:06 -0800)] 
selftests: openvswitch: don't hardcode the drop reason subsys

WiFi removed one of their subsys entries from drop reasons, in
commit 286e69677065 ("wifi: mac80211: Drop cooked monitor support")
SKB_DROP_REASON_SUBSYS_OPENVSWITCH is now 2 not 3.
The drop reasons are not uAPI, read the correct value
from debug info.

We need to enable vmlinux BTF, otherwise pahole needs
a few GB of memory to decode the enum name.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Acked-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20250304180615.945945-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: airoha: Enable TSO/Scatter Gather for LAN port
Lorenzo Bianconi [Tue, 4 Mar 2025 15:46:40 +0000 (16:46 +0100)] 
net: airoha: Enable TSO/Scatter Gather for LAN port

Set net_device vlan_features in order to enable TSO and Scatter Gather
for DSA user ports.

Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250304-lan-enable-tso-v1-1-b398eb9976ba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: airoha: Fix lan4 support in airoha_qdma_get_gdm_port()
Lorenzo Bianconi [Tue, 4 Mar 2025 14:38:05 +0000 (15:38 +0100)] 
net: airoha: Fix lan4 support in airoha_qdma_get_gdm_port()

EN7581 SoC supports lan{1,4} ports on MT7530 DSA switch. Fix lan4
reported value in airoha_qdma_get_gdm_port routine.

Fixes: 23020f0493270 ("net: airoha: Introduce ethernet support for EN7581 SoC")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250304-airoha-eth-fix-lan4-v1-1-832417da4bb5@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'increase-maximum-mtu-to-9k-for-airoha-en7581-soc'
Jakub Kicinski [Fri, 7 Mar 2025 00:41:15 +0000 (16:41 -0800)] 
Merge branch 'increase-maximum-mtu-to-9k-for-airoha-en7581-soc'

Lorenzo Bianconi says:

====================
Increase maximum MTU to 9k for Airoha EN7581 SoC

EN7581 SoC supports 9k maximum MTU.
Enable the reception of Scatter-Gather (SG) frames for Airoha EN7581.
Introduce airoha_dev_change_mtu callback.
====================

Link: https://patch.msgid.link/20250304-airoha-eth-rx-sg-v1-0-283ebc61120e@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: airoha: Increase max mtu to 9k
Lorenzo Bianconi [Tue, 4 Mar 2025 14:21:11 +0000 (15:21 +0100)] 
net: airoha: Increase max mtu to 9k

EN7581 SoC supports 9k maximum MTU.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250304-airoha-eth-rx-sg-v1-4-283ebc61120e@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: airoha: Introduce airoha_dev_change_mtu callback
Lorenzo Bianconi [Tue, 4 Mar 2025 14:21:10 +0000 (15:21 +0100)] 
net: airoha: Introduce airoha_dev_change_mtu callback

Add airoha_dev_change_mtu callback to update the MTU of a running
device.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250304-airoha-eth-rx-sg-v1-3-283ebc61120e@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: airoha: Enable Rx Scatter-Gather
Lorenzo Bianconi [Tue, 4 Mar 2025 14:21:09 +0000 (15:21 +0100)] 
net: airoha: Enable Rx Scatter-Gather

EN7581 SoC can receive 9k frames. Enable the reception of Scatter-Gather
(SG) frames.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250304-airoha-eth-rx-sg-v1-2-283ebc61120e@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: airoha: Move min/max packet len configuration in airoha_dev_open()
Lorenzo Bianconi [Tue, 4 Mar 2025 14:21:08 +0000 (15:21 +0100)] 
net: airoha: Move min/max packet len configuration in airoha_dev_open()

In order to align max allowed packet size to the configured mtu, move
REG_GDM_LEN_CFG configuration in airoha_dev_open routine.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250304-airoha-eth-rx-sg-v1-1-283ebc61120e@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: simplify phylink_suspend() and phylink_resume() calls
Russell King (Oracle) [Tue, 4 Mar 2025 11:21:27 +0000 (11:21 +0000)] 
net: stmmac: simplify phylink_suspend() and phylink_resume() calls

Currently, the calls to phylink's suspend and resume functions are
inside overly complex tests, and boil down to:

if (device_may_wakeup(priv->device) && priv->plat->pmt) {
call phylink
} else {
call phylink and
if (device_may_wakeup(priv->device))
do something else
}

This results in phylink always being called, possibly with differing
arguments for phylink_suspend().

Simplify this code, noting that each site is slightly different due to
the order in which phylink is called and the "something else".

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/E1tpQL1-005St4-Hn@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: avoid shadowing global buf_sz
Russell King (Oracle) [Wed, 5 Mar 2025 17:54:16 +0000 (17:54 +0000)] 
net: stmmac: avoid shadowing global buf_sz

stmmac_rx() declares a local variable named "buf_sz" but there is also
a global variable for a module parameter which is called the same. To
avoid confusion, rename the local variable.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Furong Xu <0x1207@gmail.com>
Link: https://patch.msgid.link/E1tpswi-005U6C-Py@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests: net: bpf_offload: add 'libbpf_global' to ignored maps
Jakub Kicinski [Tue, 4 Mar 2025 23:32:04 +0000 (15:32 -0800)] 
selftests: net: bpf_offload: add 'libbpf_global' to ignored maps

After installing pahole on the CI image we have a new map created
by libbpf. Ignore it otherwise we see:

  Exception: Time out waiting for map counts to stabilize want 2, have 3

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250304233204.1139251-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests: net: fix error message in bpf_offload
Jakub Kicinski [Tue, 4 Mar 2025 23:32:03 +0000 (15:32 -0800)] 
selftests: net: fix error message in bpf_offload

We hit a following exception on timeout, nmaps is never set:

    Test bpftool bound info reporting (own ns)...
    Traceback (most recent call last):
      File "/home/virtme/testing-1/tools/testing/selftests/net/./bpf_offload.py", line 1128, in <module>
        check_dev_info(False, "")
      File "/home/virtme/testing-1/tools/testing/selftests/net/./bpf_offload.py", line 583, in check_dev_info
        maps = bpftool_map_list_wait(expected=2, ns=ns)
      File "/home/virtme/testing-1/tools/testing/selftests/net/./bpf_offload.py", line 215, in bpftool_map_list_wait
        raise Exception("Time out waiting for map counts to stabilize want %d, have %d" % (expected, nmaps))
    NameError: name 'nmaps' is not defined

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250304233204.1139251-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agotcp: clamp window like before the cleanup
Matthieu Baerts (NGI0) [Wed, 5 Mar 2025 14:49:48 +0000 (15:49 +0100)] 
tcp: clamp window like before the cleanup

A recent cleanup changed the behaviour of tcp_set_window_clamp(). This
looks unintentional, and affects MPTCP selftests, e.g. some tests
re-establishing a connection after a disconnect are now unstable.

Before the cleanup, this operation was done:

  new_rcv_ssthresh = min(tp->rcv_wnd, new_window_clamp);
  tp->rcv_ssthresh = max(new_rcv_ssthresh, tp->rcv_ssthresh);

The cleanup used the 'clamp' macro which takes 3 arguments -- value,
lowest, and highest -- and returns a value between the lowest and the
highest allowable values. This then assumes ...

  lowest (rcv_ssthresh) <= highest (rcv_wnd)

... which doesn't seem to be always the case here according to the MPTCP
selftests, even when running them without MPTCP, but only TCP.

For example, when we have ...

  rcv_wnd < rcv_ssthresh < new_rcv_ssthresh

... before the cleanup, the rcv_ssthresh was not changed, while after
the cleanup, it is lowered down to rcv_wnd (highest).

During a simple test with TCP, here are the values I observed:

  new_window_clamp (val)  rcv_ssthresh (lo)  rcv_wnd (hi)
      117760   (out)         65495         <  65536
      128512   (out)         109595        >  80256  => lo > hi
      1184975  (out)         328987        <  329088

      113664   (out)         65483         <  65536
      117760   (out)         110968        <  110976
      129024   (out)         116527        >  109696 => lo > hi

Here, we can see that it is not that rare to have rcv_ssthresh (lo)
higher than rcv_wnd (hi), so having a different behaviour when the
clamp() macro is used, even without MPTCP.

Note: new_window_clamp is always out of range (rcv_ssthresh < rcv_wnd)
here, which seems to be generally the case in my tests with small
connections.

I then suggests reverting this part, not to change the behaviour.

Fixes: 863a952eb79a ("tcp: tcp_set_window_clamp() cleanup")
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/551
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250305-net-next-fix-tcp-win-clamp-v1-1-12afb705d34e@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: mostly remove "buf_sz"
Russell King (Oracle) [Wed, 5 Mar 2025 17:54:21 +0000 (17:54 +0000)] 
net: stmmac: mostly remove "buf_sz"

The "buf_sz" parameter is not used in the stmmac driver - there is one
place where the value of buf_sz is validated, and two places where it
is written. It is otherwise unused.

Remove these accesses. However, leave the module parameter in place as
removing it could cause module load to fail, breaking user setups.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Furong Xu <0x1207@gmail.com>
Link: https://patch.msgid.link/E1tpswn-005U6I-TU@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoptp: ocp: Remove redundant check in _signal_summary_show
Ivan Abramov [Wed, 5 Mar 2025 09:25:20 +0000 (12:25 +0300)] 
ptp: ocp: Remove redundant check in _signal_summary_show

In the function _signal_summary_show(), there is a NULL-check for
&bp->signal[nr], which cannot actually be NULL.

Therefore, this redundant check can be removed.

Signed-off-by: Ivan Abramov <i.abramov@mt-integration.ru>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250305092520.25817-1-i.abramov@mt-integration.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'net-stmmac-dwc-qos-add-fsd-eqos-support'
Jakub Kicinski [Thu, 6 Mar 2025 23:30:36 +0000 (15:30 -0800)] 
Merge branch 'net-stmmac-dwc-qos-add-fsd-eqos-support'

Swathi says:

====================
net: stmmac: dwc-qos: Add FSD EQoS support

FSD platform has two instances of EQoS IP, one is in FSYS0 block and
another one is in PERIC block. This patch series add required DT binding
and platform driver specific changes for the same.
====================

Link: https://patch.msgid.link/20250305091246.106626-1-swathi.ks@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: dwc-qos: Add FSD EQoS support
Swathi K S [Wed, 5 Mar 2025 09:12:46 +0000 (14:42 +0530)] 
net: stmmac: dwc-qos: Add FSD EQoS support

The FSD SoC contains two instance of the Synopsys DWC ethernet QOS IP core.
The binding that it uses is slightly different from existing ones because
of the integration (clocks, resets).

Signed-off-by: Swathi K S <swathi.ks@samsung.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20250305091246.106626-3-swathi.ks@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agodt-bindings: net: Add FSD EQoS device tree bindings
Swathi K S [Wed, 5 Mar 2025 09:12:45 +0000 (14:42 +0530)] 
dt-bindings: net: Add FSD EQoS device tree bindings

Add FSD Ethernet compatible in Synopsys dt-bindings document. Add FSD
Ethernet YAML schema to enable the DT validation.

Signed-off-by: Pankaj Dubey <pankaj.dubey@samsung.com>
Signed-off-by: Ravi Patel <ravi.patel@samsung.com>
Signed-off-by: Swathi K S <swathi.ks@samsung.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250305091246.106626-2-swathi.ks@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'tcp-even-faster-connect-under-stress'
Jakub Kicinski [Thu, 6 Mar 2025 23:26:07 +0000 (15:26 -0800)] 
Merge branch 'tcp-even-faster-connect-under-stress'

Eric Dumazet says:

====================
tcp: even faster connect() under stress

This is a followup on the prior series, "tcp: scale connect() under pressure"

Now spinlocks are no longer in the picture, we see a very high cost
of the inet6_ehashfn() function.

In this series (of 2), I change how lport contributes to inet6_ehashfn()
to ensure better cache locality and call inet6_ehashfn()
only once per connect() system call.

This brings an additional 229 % increase of performance
for "neper/tcp_crr -6 -T 200 -F 30000" stress test,
while greatly improving latency metrics.

Before:
  latency_min=0.014131929
  latency_max=17.895073144
  latency_mean=0.505675853
  latency_stddev=2.125164772
  num_samples=307884
  throughput=139866.80

After:
  latency_min=0.003041375
  latency_max=7.056589232
  latency_mean=0.141075048
  latency_stddev=0.526900516
  num_samples=312996
  throughput=320677.21
====================

Link: https://patch.msgid.link/20250305034550.879255-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoinet: call inet6_ehashfn() once from inet6_hash_connect()
Eric Dumazet [Wed, 5 Mar 2025 03:45:50 +0000 (03:45 +0000)] 
inet: call inet6_ehashfn() once from inet6_hash_connect()

inet6_ehashfn() being called from __inet6_check_established()
has a big impact on performance, as shown in the Tested section.

After prior patch, we can compute the hash for port 0
from inet6_hash_connect(), and derive each hash in
__inet_hash_connect() from this initial hash:

hash(saddr, lport, daddr, dport) == hash(saddr, 0, daddr, dport) + lport

Apply the same principle for __inet_check_established(),
although inet_ehashfn() has a smaller cost.

Tested:

Server: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog
Client: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server

Before this patch:

  utime_start=0.286131
  utime_end=4.378886
  stime_start=11.952556
  stime_end=1991.655533
  num_transactions=1446830
  latency_min=0.001061085
  latency_max=12.075275028
  latency_mean=0.376375302
  latency_stddev=1.361969596
  num_samples=306383
  throughput=151866.56

perf top:

 50.01%  [kernel]       [k] __inet6_check_established
 20.65%  [kernel]       [k] __inet_hash_connect
 15.81%  [kernel]       [k] inet6_ehashfn
  2.92%  [kernel]       [k] rcu_all_qs
  2.34%  [kernel]       [k] __cond_resched
  0.50%  [kernel]       [k] _raw_spin_lock
  0.34%  [kernel]       [k] sched_balance_trigger
  0.24%  [kernel]       [k] queued_spin_lock_slowpath

After this patch:

  utime_start=0.315047
  utime_end=9.257617
  stime_start=7.041489
  stime_end=1923.688387
  num_transactions=3057968
  latency_min=0.003041375
  latency_max=7.056589232
  latency_mean=0.141075048    # Better latency metrics
  latency_stddev=0.526900516
  num_samples=312996
  throughput=320677.21        # 111 % increase, and 229 % for the series

perf top: inet6_ehashfn is no longer seen.

 39.67%  [kernel]       [k] __inet_hash_connect
 37.06%  [kernel]       [k] __inet6_check_established
  4.79%  [kernel]       [k] rcu_all_qs
  3.82%  [kernel]       [k] __cond_resched
  1.76%  [kernel]       [k] sched_balance_domains
  0.82%  [kernel]       [k] _raw_spin_lock
  0.81%  [kernel]       [k] sched_balance_rq
  0.81%  [kernel]       [k] sched_balance_trigger
  0.76%  [kernel]       [k] queued_spin_lock_slowpath

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250305034550.879255-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoinet: change lport contribution to inet_ehashfn() and inet6_ehashfn()
Eric Dumazet [Wed, 5 Mar 2025 03:45:49 +0000 (03:45 +0000)] 
inet: change lport contribution to inet_ehashfn() and inet6_ehashfn()

In order to speedup __inet_hash_connect(), we want to ensure hash values
for <source address, port X, destination address, destination port>
are not randomly spread, but monotonically increasing.

Goal is to allow __inet_hash_connect() to derive the hash value
of a candidate 4-tuple with a single addition in the following
patch in the series.

Given :
  hash_0 = inet_ehashfn(saddr, 0, daddr, dport)
  hash_sport = inet_ehashfn(saddr, sport, daddr, dport)

Then (hash_sport == hash_0 + sport) for all sport values.

As far as I know, there is no security implication with this change.

After this patch, when __inet_hash_connect() has to try XXXX candidates,
the hash table buckets are contiguous and packed, allowing
a better use of cpu caches and hardware prefetchers.

Tested:

Server: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog
Client: ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server

Before this patch:

  utime_start=0.271607
  utime_end=3.847111
  stime_start=18.407684
  stime_end=1997.485557
  num_transactions=1350742
  latency_min=0.014131929
  latency_max=17.895073144
  latency_mean=0.505675853
  latency_stddev=2.125164772
  num_samples=307884
  throughput=139866.80

perf top on client:

 56.86%  [kernel]       [k] __inet6_check_established
 17.96%  [kernel]       [k] __inet_hash_connect
 13.88%  [kernel]       [k] inet6_ehashfn
  2.52%  [kernel]       [k] rcu_all_qs
  2.01%  [kernel]       [k] __cond_resched
  0.41%  [kernel]       [k] _raw_spin_lock

After this patch:

  utime_start=0.286131
  utime_end=4.378886
  stime_start=11.952556
  stime_end=1991.655533
  num_transactions=1446830
  latency_min=0.001061085
  latency_max=12.075275028
  latency_mean=0.376375302
  latency_stddev=1.361969596
  num_samples=306383
  throughput=151866.56

perf top:

 50.01%  [kernel]       [k] __inet6_check_established
 20.65%  [kernel]       [k] __inet_hash_connect
 15.81%  [kernel]       [k] inet6_ehashfn
  2.92%  [kernel]       [k] rcu_all_qs
  2.34%  [kernel]       [k] __cond_resched
  0.50%  [kernel]       [k] _raw_spin_lock
  0.34%  [kernel]       [k] sched_balance_trigger
  0.24%  [kernel]       [k] queued_spin_lock_slowpath

There is indeed an increase of throughput and reduction of latency.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250305034550.879255-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agotcp: bring back NUMA dispersion in inet_ehash_locks_alloc()
Eric Dumazet [Wed, 5 Mar 2025 13:05:50 +0000 (13:05 +0000)] 
tcp: bring back NUMA dispersion in inet_ehash_locks_alloc()

We have platforms with 6 NUMA nodes and 480 cpus.

inet_ehash_locks_alloc() currently allocates a single 64KB page
to hold all ehash spinlocks. This adds more pressure on a single node.

Change inet_ehash_locks_alloc() to use vmalloc() to spread
the spinlocks on all online nodes, driven by NUMA policies.

At boot time, NUMA policy is interleave=all, meaning that
tcp_hashinfo.ehash_locks gets hash dispersion on all nodes.

Tested:

lack5:~# grep inet_ehash_locks_alloc /proc/vmallocinfo
0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2

lack5:~# echo 8192 >/proc/sys/net/ipv4/tcp_child_ehash_entries
lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
0x000000004e99d30c-0x00000000763f3279   36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=1 N1=2 N2=2 N3=1 N4=1 N5=1
0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2

lack5:~# numactl --interleave=0,5 unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
0x00000000fd73a33e-0x0000000004b9a177   36864 inet_ehash_locks_alloc+0x90/0x100 pages=8 vmalloc N0=4 N5=4
0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2

lack5:~# echo 1024 >/proc/sys/net/ipv4/tcp_child_ehash_entries
lack5:~# numactl --interleave=all unshare -n bash -c "grep inet_ehash_locks_alloc /proc/vmallocinfo"
0x00000000db07d7a2-0x00000000ad697d29    8192 inet_ehash_locks_alloc+0x90/0x100 pages=1 vmalloc N2=1
0x00000000d9aec4d1-0x00000000a828b652   69632 inet_ehash_locks_alloc+0x90/0x100 pages=16 vmalloc N0=2 N1=3 N2=3 N3=3 N4=3 N5=2

Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250305130550.1865988-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Jakub Kicinski [Thu, 6 Mar 2025 21:01:27 +0000 (13:01 -0800)] 
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-6.14-rc6).

Conflicts:

net/ethtool/cabletest.c
  2bcf4772e45a ("net: ethtool: try to protect all callback with netdev instance lock")
  637399bf7e77 ("net: ethtool: netlink: Allow NULL nlattrs when getting a phy_device")

No Adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'net-hold-netdev-instance-lock-during-ndo-operations'
Jakub Kicinski [Thu, 6 Mar 2025 20:59:47 +0000 (12:59 -0800)] 
Merge branch 'net-hold-netdev-instance-lock-during-ndo-operations'

Stanislav Fomichev says:

====================
net: Hold netdev instance lock during ndo operations

As the gradual purging of rtnl continues, start grabbing netdev
instance lock in more places so we can get to the state where
most paths are working without rtnl. Start with requiring the
drivers that use shaper api (and later queue mgmt api) to work
with both rtnl and netdev instance lock. Eventually we might
attempt to drop rtnl. This mostly affects iavf, gve, bnxt and
netdev sim (as the drivers that implement shaper/queue mgmt)
so those drivers are converted in the process.

call_netdevice_notifiers locking is very inconsistent and might need
a separate follow up. Some notified events are covered by the
instance lock, some are not, which might complicate the driver
expectations.

Reviewed-by: Eric Dumazet <edumazet@google.com>
====================

Link: https://patch.msgid.link/20250305163732.2766420-1-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoeth: bnxt: remove most dependencies on RTNL
Stanislav Fomichev [Wed, 5 Mar 2025 16:37:32 +0000 (08:37 -0800)] 
eth: bnxt: remove most dependencies on RTNL

Only devlink and sriov paths are grabbing rtnl explicitly. The rest is
covered by netdev instance lock which the core now grabs, so there is
no need to manage rtnl in most places anymore.

On the core side we can now try to drop rtnl in some places
(do_setlink for example) for the drivers that signal non-rtnl
mode (TBD).

Boot-tested and with `ethtool -L eth1 combined 24` to trigger reset.

Cc: Saeed Mahameed <saeed@kernel.org>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250305163732.2766420-15-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agodocs: net: document new locking reality
Stanislav Fomichev [Wed, 5 Mar 2025 16:37:31 +0000 (08:37 -0800)] 
docs: net: document new locking reality

Also clarify ndo_get_stats (that read and write paths can run
concurrently) and mention only RCU.

Cc: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250305163732.2766420-14-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: add option to request netdev instance lock
Stanislav Fomichev [Wed, 5 Mar 2025 16:37:30 +0000 (08:37 -0800)] 
net: add option to request netdev instance lock

Currently only the drivers that implement shaper or queue APIs
are grabbing instance lock. Add an explicit opt-in for the
drivers that want to grab the lock without implementing the above
APIs.

There is a 3-byte hole after @up, use it:

        /* --- cacheline 47 boundary (3008 bytes) --- */
        u32                        napi_defer_hard_irqs; /*  3008     4 */
        bool                       up;                   /*  3012     1 */

        /* XXX 3 bytes hole, try to pack */

        struct mutex               lock;                 /*  3016   144 */

        /* XXX last struct has 1 hole */

Cc: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250305163732.2766420-13-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: replace dev_addr_sem with netdev instance lock
Stanislav Fomichev [Wed, 5 Mar 2025 16:37:29 +0000 (08:37 -0800)] 
net: replace dev_addr_sem with netdev instance lock

Lockdep reports possible circular dependency in [0]. Instead of
fixing the ordering, replace global dev_addr_sem with netdev
instance lock. Most of the paths that set/get mac are RTNL
protected. Two places where it's not, convert to explicit
locking:
- sysfs address_show
- dev_get_mac_address via dev_ioctl

0: https://netdev-3.bots.linux.dev/vmksft-forwarding-dbg/results/993321/24-router-bridge-1d-lag-sh/stderr

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250305163732.2766420-12-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: ethtool: try to protect all callback with netdev instance lock
Jakub Kicinski [Wed, 5 Mar 2025 16:37:28 +0000 (08:37 -0800)] 
net: ethtool: try to protect all callback with netdev instance lock

Protect all ethtool callbacks and PHY related state with the netdev
instance lock, for drivers which want / need to have their ops
instance-locked. Basically take the lock everywhere we take rtnl_lock.
It was tempting to take the lock in ethnl_ops_begin(), but turns
out we actually nest those calls (when generating notifications).

Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Cc: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250305163732.2766420-11-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: hold netdev instance lock during ndo_bpf
Stanislav Fomichev [Wed, 5 Mar 2025 16:37:27 +0000 (08:37 -0800)] 
net: hold netdev instance lock during ndo_bpf

Cover the paths that come via bpf system call and XSK bind.

Cc: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250305163732.2766420-10-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: hold netdev instance lock during sysfs operations
Stanislav Fomichev [Wed, 5 Mar 2025 16:37:26 +0000 (08:37 -0800)] 
net: hold netdev instance lock during sysfs operations

Most of them are already covered by the converted dev_xxx APIs.
Add the locking wrappers for the remaining ones.

Cc: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250305163732.2766420-9-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: hold netdev instance lock during ioctl operations
Stanislav Fomichev [Wed, 5 Mar 2025 16:37:25 +0000 (08:37 -0800)] 
net: hold netdev instance lock during ioctl operations

Convert all ndo_eth_ioctl invocations to dev_eth_ioctl which does the
locking. Reflow some of the dev_siocxxx to drop else clause.

Cc: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250305163732.2766420-8-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: hold netdev instance lock during rtnetlink operations
Stanislav Fomichev [Wed, 5 Mar 2025 16:37:24 +0000 (08:37 -0800)] 
net: hold netdev instance lock during rtnetlink operations

To preserve the atomicity, hold the lock while applying multiple
attributes. The major issue with a full conversion to the instance
lock are software nesting devices (bonding/team/vrf/etc). Those
devices call into the core stack for their lower (potentially
real hw) devices. To avoid explicitly wrapping all those places
into instance lock/unlock, introduce new API boundaries:

- (some) existing dev_xxx calls are now considered "external"
  (to drivers) APIs and they transparently grab the instance
  lock if needed (dev_api.c)
- new netif_xxx calls are internal core stack API (naming is
  sketchy, I've tried netdev_xxx_locked per Jakub's suggestion,
  but it feels a bit verbose; but happy to get back to this
  naming scheme if this is the preference)

This avoids touching most of the existing ioctl/sysfs/drivers paths.

Note the special handling of ndo_xxx_slave operations: I exploit
the fact that none of the drivers that call these functions
need/use instance lock. At the same time, they use dev_xxx
APIs, so the lower device has to be unlocked.

Changes in unregister_netdevice_many_notify (to protect dev->state
with instance lock) trigger lockdep - the loop over close_list
(mostly from cleanup_net) introduces spurious ordering issues.
netdev_lock_cmp_fn has a justification on why it's ok to suppress
for now.

Cc: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250305163732.2766420-7-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: hold netdev instance lock during queue operations
Stanislav Fomichev [Wed, 5 Mar 2025 16:37:23 +0000 (08:37 -0800)] 
net: hold netdev instance lock during queue operations

For the drivers that use queue management API, switch to the mode where
core stack holds the netdev instance lock. This affects the following
drivers:
- bnxt
- gve
- netdevsim

Originally I locked only start/stop, but switched to holding the
lock over all iterations to make them look atomic to the device
(feels like it should be easier to reason about).

Reviewed-by: Eric Dumazet <edumazet@google.com>
Cc: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250305163732.2766420-6-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: hold netdev instance lock during qdisc ndo_setup_tc
Stanislav Fomichev [Wed, 5 Mar 2025 16:37:22 +0000 (08:37 -0800)] 
net: hold netdev instance lock during qdisc ndo_setup_tc

Qdisc operations that can lead to ndo_setup_tc might need
to have an instance lock. Add netdev_lock_ops/netdev_unlock_ops
invocations for all psched_rtnl_msg_handlers operations.

Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Saeed Mahameed <saeed@kernel.org>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250305163732.2766420-5-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: sched: wrap doit/dumpit methods
Stanislav Fomichev [Wed, 5 Mar 2025 16:37:21 +0000 (08:37 -0800)] 
net: sched: wrap doit/dumpit methods

In preparation for grabbing netdev instance lock around qdisc
operations, introduce tc_xxx wrappers that lookup netdev
and call respective __tc_xxx helper to do the actual work.
No functional changes.

Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Saeed Mahameed <saeed@kernel.org>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250305163732.2766420-4-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: hold netdev instance lock during nft ndo_setup_tc
Stanislav Fomichev [Wed, 5 Mar 2025 16:37:20 +0000 (08:37 -0800)] 
net: hold netdev instance lock during nft ndo_setup_tc

Introduce new dev_setup_tc for nft ndo_setup_tc paths.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Cc: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250305163732.2766420-3-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: hold netdev instance lock during ndo_open/ndo_stop
Stanislav Fomichev [Wed, 5 Mar 2025 16:37:19 +0000 (08:37 -0800)] 
net: hold netdev instance lock during ndo_open/ndo_stop

For the drivers that use shaper API, switch to the mode where
core stack holds the netdev lock. This affects two drivers:

* iavf - already grabs netdev lock in ndo_open/ndo_stop, so mostly
         remove these
* netdevsim - switch to _locked APIs to avoid deadlock

iavf_close diff is a bit confusing, the existing call looks like this:
  iavf_close() {
    netdev_lock()
    ..
    netdev_unlock()
    wait_event_timeout(down_waitqueue)
  }

I change it to the following:
  netdev_lock()
  iavf_close() {
    ..
    netdev_unlock()
    wait_event_timeout(down_waitqueue)
    netdev_lock() // reusing this lock call
  }
  netdev_unlock()

Since I'm reusing existing netdev_lock call, so it looks like I only
add netdev_unlock.

Cc: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250305163732.2766420-2-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'xdp-metadata-support-for-tun-driver'
Martin KaFai Lau [Thu, 6 Mar 2025 20:31:08 +0000 (12:31 -0800)] 
Merge branch 'xdp-metadata-support-for-tun-driver'

Marcus Wichelmann says:

====================
XDP metadata support for tun driver

Hi all,

this v5 of the patch series is very similar to v4, but rebased onto the
bpf-next/net branch instead of bpf-next/master.
Because the commit c047e0e0e435 ("selftests/bpf: Optionally open a
dedicated namespace to run test in it") is not yet included in this branch,
I changed the xdp_context_tuntap test to manually create a namespace to run
the test in.

Not so successful pipeline: https://github.com/kernel-patches/bpf/actions/runs/13682405154

The CI pipeline failed because of veristat changes in seemingly unrelated
eBPF programs. I don't think this has to do with the changes from this
patch series, but if it does, please let me know what I may have to do
different to make the CI pass.
====================

Link: https://patch.msgid.link/20250305213438.3863922-1-marcus.wichelmann@hetzner-cloud.de
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
3 months agoselftests/bpf: Fix file descriptor assertion in open_tuntap helper
Marcus Wichelmann [Wed, 5 Mar 2025 21:34:38 +0000 (21:34 +0000)] 
selftests/bpf: Fix file descriptor assertion in open_tuntap helper

The open_tuntap helper function uses open() to get a file descriptor for
/dev/net/tun.

The open(2) manpage writes this about its return value:

  On success, open(), openat(), and creat() return the new file
  descriptor (a nonnegative integer).  On error, -1 is returned and
  errno is set to indicate the error.

This means that the fd > 0 assertion in the open_tuntap helper is
incorrect and should rather check for fd >= 0.

When running the BPF selftests locally, this incorrect assertion was not
an issue, but the BPF kernel-patches CI failed because of this:

  open_tuntap:FAIL:open(/dev/net/tun) unexpected open(/dev/net/tun):
  actual 0 <= expected 0

Signed-off-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250305213438.3863922-7-marcus.wichelmann@hetzner-cloud.de
3 months agoselftests/bpf: Add test for XDP metadata support in tun driver
Marcus Wichelmann [Wed, 5 Mar 2025 21:34:37 +0000 (21:34 +0000)] 
selftests/bpf: Add test for XDP metadata support in tun driver

Add a selftest that creates a tap device, attaches XDP and TC programs,
writes a packet with a test payload into the tap device and checks the
test result. This test ensures that the XDP metadata support in the tun
driver is enabled and that the metadata size is correctly passed to the
skb.

See the previous commit ("selftests/bpf: refactor xdp_context_functional
test and bpf program") for details about the test design.

The test runs in its own network namespace. This provides some extra
safety against conflicting interface names.

Signed-off-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250305213438.3863922-6-marcus.wichelmann@hetzner-cloud.de
3 months agoselftests/bpf: Refactor xdp_context_functional test and bpf program
Marcus Wichelmann [Wed, 5 Mar 2025 21:34:36 +0000 (21:34 +0000)] 
selftests/bpf: Refactor xdp_context_functional test and bpf program

The existing XDP metadata test works by creating a veth pair and
attaching XDP & TC programs that drop the packet when the condition of
the test isn't fulfilled. The test then pings through the veth pair and
succeeds when the ping comes through.

While this test works great for a veth pair, it is hard to replicate for
tap devices to test the XDP metadata support of them. A similar test for
the tun driver would either involve logic to reply to the ping request,
or would have to capture the packet to check if it was dropped or not.

To make the testing of other drivers easier while still maximizing code
reuse, this commit refactors the existing xdp_context_functional test to
use a test_result map. Instead of conditionally passing or dropping the
packet, the TC program is changed to copy the received metadata into the
value of that single-entry array map. Tests can then verify that the map
value matches the expectation.

This testing logic is easy to adapt to other network drivers as the only
remaining requirement is that there is some way to send a custom
Ethernet packet through it that triggers the XDP & TC programs.

The Ethernet header of that custom packet is all-zero, because it is not
required to be valid for the test to work. The zero ethertype also helps
to filter out packets that are not related to the test and would
otherwise interfere with it.

The payload of the Ethernet packet is used as the test data that is
expected to be passed as metadata from the XDP to the TC program and
written to the map. It has a fixed size of 32 bytes which is a
reasonable size that should be supported by both drivers. Additional
packet headers are not necessary for the test and were therefore skipped
to keep the testing code short.

This new testing methodology no longer requires the veth interfaces to
have IP addresses assigned, therefore these were removed.

Signed-off-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250305213438.3863922-5-marcus.wichelmann@hetzner-cloud.de
3 months agoselftests/bpf: Move open_tuntap to network helpers
Marcus Wichelmann [Wed, 5 Mar 2025 21:34:35 +0000 (21:34 +0000)] 
selftests/bpf: Move open_tuntap to network helpers

To test the XDP metadata functionality of the tun driver, it's necessary
to create a new tap device first. A helper function for this already
exists in lwt_helpers.h. Move it to the common network helpers header,
so it can be reused in other tests.

Signed-off-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250305213438.3863922-4-marcus.wichelmann@hetzner-cloud.de
3 months agonet: tun: Enable transfer of XDP metadata to skb
Marcus Wichelmann [Wed, 5 Mar 2025 21:34:34 +0000 (21:34 +0000)] 
net: tun: Enable transfer of XDP metadata to skb

When the XDP metadata area was used, it is expected that the same
metadata can also be accessed from TC, as can be read in the description
of the bpf_xdp_adjust_meta helper function. In the tun driver, this was
not yet implemented.

To make this work, the skb that is being built on XDP_PASS should know
of the current size of the metadata area. This is ensured by adding
calls to skb_metadata_set. For the tun_xdp_one code path, an additional
check is necessary to handle the case where the externally initialized
xdp_buff has no metadata support (xdp->data_meta == xdp->data + 1).

More information about this feature can be found in the commit message
of commit de8f3a83b0a0 ("bpf: add meta pointer for direct access").

Signed-off-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20250305213438.3863922-3-marcus.wichelmann@hetzner-cloud.de
3 months agonet: tun: Enable XDP metadata support
Marcus Wichelmann [Wed, 5 Mar 2025 21:34:33 +0000 (21:34 +0000)] 
net: tun: Enable XDP metadata support

Enable the support for the bpf_xdp_adjust_meta helper function for XDP
buffers initialized by the tun driver. This allows to reserve a metadata
area that is useful to pass any information from one XDP program to
another one, for example when using tail-calls.

Whether this helper function can be used in an XDP program depends on
how the xdp_buff was initialized. Most net drivers initialize the
xdp_buff in a way, that allows bpf_xdp_adjust_meta to be used. In case
of the tun driver, this is currently not the case.

There are two code paths in the tun driver that lead to a
bpf_prog_run_xdp and where metadata support should be enabled:

1. tun_build_skb, which is called by tun_get_user and is used when
   writing packets from userspace into the device. In this case, the
   xdp_buff created in tun_build_skb has no support for
   bpf_xdp_adjust_meta and calls of that helper function result in
   ENOTSUPP.

   For this code path, it's sufficient to set the meta_valid argument of
   the xdp_prepare_buff call. The reserved headroom is large enough
   already.

2. tun_xdp_one, which is called by tun_sendmsg which again is called by
   other drivers (e.g. vhost_net). When the TUN_MSG_PTR mode is used,
   another driver may pass a batch of xdp_buffs to the tun driver. In
   this case, that other driver is the one initializing the xdp_buff.

   See commit 043d222f93ab ("tuntap: accept an array of XDP buffs
   through sendmsg()") for details.

   For now, the vhost_net driver is the only one using TUN_MSG_PTR and
   it already initializes the xdp_buffs with metadata support and
   sufficient headroom. But the tun driver disables it again, so the
   xdp_set_data_meta_invalid call has to be removed.

Signed-off-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20250305213438.3863922-2-marcus.wichelmann@hetzner-cloud.de
3 months agoMerge tag 'net-6.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 6 Mar 2025 19:34:54 +0000 (09:34 -1000)] 
Merge tag 'net-6.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from bluetooth and wireless.

  Current release - new code bugs:

   - wifi: nl80211: disable multi-link reconfiguration

  Previous releases - regressions:

   - gso: fix ownership in __udp_gso_segment

   - wifi: iwlwifi:
      - fix A-MSDU TSO preparation
      - free pages allocated when failing to build A-MSDU

   - ipv6: fix dst ref loop in ila lwtunnel

   - mptcp: fix 'scheduling while atomic' in
     mptcp_pm_nl_append_new_local_addr

   - bluetooth: add check for mgmt_alloc_skb() in
     mgmt_device_connected()

   - ethtool: allow NULL nlattrs when getting a phy_device

   - eth: be2net: fix sleeping while atomic bugs in
     be_ndo_bridge_getlink

  Previous releases - always broken:

   - core: support TCP GSO case for a few missing flags

   - wifi: mac80211:
      - fix vendor-specific inheritance
      - cleanup sta TXQs on flush

   - llc: do not use skb_get() before dev_queue_xmit()

   - eth: ipa: nable checksum for IPA_ENDPOINT_AP_MODEM_{RX,TX}
     for v4.7"

* tag 'net-6.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (41 commits)
  net: ipv6: fix missing dst ref drop in ila lwtunnel
  net: ipv6: fix dst ref loop in ila lwtunnel
  mctp i3c: handle NULL header address
  net: dsa: mt7530: Fix traffic flooding for MMIO devices
  net-timestamp: support TCP GSO case for a few missing flags
  vlan: enforce underlying device type
  mptcp: fix 'scheduling while atomic' in mptcp_pm_nl_append_new_local_addr
  net: ethtool: netlink: Allow NULL nlattrs when getting a phy_device
  ppp: Fix KMSAN uninit-value warning with bpf
  net: ipa: Enable checksum for IPA_ENDPOINT_AP_MODEM_{RX,TX} for v4.7
  net: ipa: Fix QSB data for v4.7
  net: ipa: Fix v4.7 resource group names
  net: hns3: make sure ptp clock is unregister and freed if hclge_ptp_get_cycle returns an error
  wifi: nl80211: disable multi-link reconfiguration
  net: dsa: rtl8366rb: don't prompt users for LED control
  be2net: fix sleeping while atomic bugs in be_ndo_bridge_getlink
  llc: do not use skb_get() before dev_queue_xmit()
  wifi: cfg80211: regulatory: improve invalid hints checking
  caif_virtio: fix wrong pointer check in cfv_probe()
  net: gso: fix ownership in __udp_gso_segment
  ...

3 months agoMerge tag 'v6.14-rc5-smb3-fixes' of git://git.samba.org/ksmbd
Linus Torvalds [Thu, 6 Mar 2025 19:19:15 +0000 (09:19 -1000)] 
Merge tag 'v6.14-rc5-smb3-fixes' of git://git.samba.org/ksmbd

Pull smb fixes from Steve French:
 "Five SMB server fixes, two related client fixes, and minor MAINTAINERS
  update:

   - Two SMB3 lock fixes fixes (including use after free and bug on fix)

   - Fix to race condition that can happen in processing IPC responses

   - Four ACL related fixes: one related to endianness of num_aces, and
     two related fixes to the checks for num_aces (for both client and
     server), and one fixing missing check for num_subauths which can
     cause memory corruption

   - And minor update to email addresses in MAINTAINERS file"

* tag 'v6.14-rc5-smb3-fixes' of git://git.samba.org/ksmbd:
  cifs: fix incorrect validation for num_aces field of smb_acl
  ksmbd: fix incorrect validation for num_aces field of smb_acl
  smb: common: change the data type of num_aces to le16
  ksmbd: fix bug on trap in smb2_lock
  ksmbd: fix use-after-free in smb2_lock
  ksmbd: fix type confusion via race condition when using ipc_msg_send_request
  ksmbd: fix out-of-bounds in parse_sec_desc()
  MAINTAINERS: update email address in cifs and ksmbd entry