]> git.ipfire.org Git - thirdparty/kernel/linux.git/log
thirdparty/kernel/linux.git
3 months agoselftests: net: Add test cases for link and peer netns
Xiao Liang [Wed, 19 Feb 2025 12:50:39 +0000 (20:50 +0800)] 
selftests: net: Add test cases for link and peer netns

- Add test for creating link in another netns when a link of the same
   name and ifindex exists in current netns.
 - Add test to verify that link is created in target netns directly -
   no link new/del events should be generated in link netns or current
   netns.
 - Add test cases to verify that link-netns is set as expected for
   various drivers and combination of namespace-related parameters.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Link: https://patch.msgid.link/20250219125039.18024-14-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests: net: Add python context manager for netns entering
Xiao Liang [Wed, 19 Feb 2025 12:50:38 +0000 (20:50 +0800)] 
selftests: net: Add python context manager for netns entering

Change netns of current thread and switch back on context exit.
For example:

    with NetNSEnter("ns1"):
        ip("link add dummy0 type dummy")

The command be executed in netns "ns1".

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Link: https://patch.msgid.link/20250219125039.18024-13-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agortnetlink: Create link directly in target net namespace
Xiao Liang [Wed, 19 Feb 2025 12:50:37 +0000 (20:50 +0800)] 
rtnetlink: Create link directly in target net namespace

Make rtnl_newlink_create() create device in target namespace directly.
Avoid extra netns change when link netns is provided.

Device drivers has been converted to be aware of link netns, that is not
assuming device netns is and link netns is the same when ops->newlink()
is called.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-12-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agortnetlink: Remove "net" from newlink params
Xiao Liang [Wed, 19 Feb 2025 12:50:36 +0000 (20:50 +0800)] 
rtnetlink: Remove "net" from newlink params

Now that devices have been converted to use the specific netns instead
of ambiguous "net", let's remove it from newlink parameters.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-11-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: xfrm: Use link netns in newlink() of rtnl_link_ops
Xiao Liang [Wed, 19 Feb 2025 12:50:35 +0000 (20:50 +0800)] 
net: xfrm: Use link netns in newlink() of rtnl_link_ops

When link_net is set, use it as link netns instead of dev_net(). This
prepares for rtnetlink core to create device in target netns directly,
in which case the two namespaces may be different.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-10-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: ipv6: Use link netns in newlink() of rtnl_link_ops
Xiao Liang [Wed, 19 Feb 2025 12:50:34 +0000 (20:50 +0800)] 
net: ipv6: Use link netns in newlink() of rtnl_link_ops

When link_net is set, use it as link netns instead of dev_net(). This
prepares for rtnetlink core to create device in target netns directly,
in which case the two namespaces may be different.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-9-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: ipv6: Init tunnel link-netns before registering dev
Xiao Liang [Wed, 19 Feb 2025 12:50:33 +0000 (20:50 +0800)] 
net: ipv6: Init tunnel link-netns before registering dev

Currently some IPv6 tunnel drivers set tnl->net to dev_net(dev) in
ndo_init(), which is called in register_netdevice(). However, it lacks
the context of link-netns when we enable cross-net tunnels at device
registration time.

Let's move the init of tunnel link-netns before register_netdevice().

ip6_gre has already initialized netns, so just remove the redundant
assignment.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-8-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: ip_tunnel: Use link netns in newlink() of rtnl_link_ops
Xiao Liang [Wed, 19 Feb 2025 12:50:32 +0000 (20:50 +0800)] 
net: ip_tunnel: Use link netns in newlink() of rtnl_link_ops

When link_net is set, use it as link netns instead of dev_net(). This
prepares for rtnetlink core to create device in target netns directly,
in which case the two namespaces may be different.

Convert common ip_tunnel_newlink() to accept an extra link netns
argument.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-7-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: ip_tunnel: Don't set tunnel->net in ip_tunnel_init()
Xiao Liang [Wed, 19 Feb 2025 12:50:31 +0000 (20:50 +0800)] 
net: ip_tunnel: Don't set tunnel->net in ip_tunnel_init()

ip_tunnel_init() is called from register_netdevice(). In all code paths
reaching here, tunnel->net should already have been set (either in
ip_tunnel_newlink() or __ip_tunnel_create()). So don't set it again.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-6-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoieee802154: 6lowpan: Validate link netns in newlink() of rtnl_link_ops
Xiao Liang [Wed, 19 Feb 2025 12:50:30 +0000 (20:50 +0800)] 
ieee802154: 6lowpan: Validate link netns in newlink() of rtnl_link_ops

Device denoted by IFLA_LINK is in link_net (IFLA_LINK_NETNSID) or
source netns by design, but 6lowpan uses dev_net.

Note dev->netns_local is set to true and currently link_net is
implemented via a netns change. These together effectively reject
IFLA_LINK_NETNSID.

This patch adds a validation to ensure link_net is either NULL or
identical to dev_net. Thus it would be fine to continue using dev_net
when rtnetlink core begins to create devices directly in target netns.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-5-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: Use link/peer netns in newlink() of rtnl_link_ops
Xiao Liang [Wed, 19 Feb 2025 12:50:29 +0000 (20:50 +0800)] 
net: Use link/peer netns in newlink() of rtnl_link_ops

Add two helper functions - rtnl_newlink_link_net() and
rtnl_newlink_peer_net() for netns fallback logic. Peer netns falls back
to link netns, and link netns falls back to source netns.

Convert the use of params->net in netdevice drivers to one of the helper
functions for clarity.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-4-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agortnetlink: Pack newlink() params into struct
Xiao Liang [Wed, 19 Feb 2025 12:50:28 +0000 (20:50 +0800)] 
rtnetlink: Pack newlink() params into struct

There are 4 net namespaces involved when creating links:

 - source netns - where the netlink socket resides,
 - target netns - where to put the device being created,
 - link netns - netns associated with the device (backend),
 - peer netns - netns of peer device.

Currently, two nets are passed to newlink() callback - "src_net"
parameter and "dev_net" (implicitly in net_device). They are set as
follows, depending on netlink attributes in the request.

 +------------+-------------------+---------+---------+
 | peer netns | IFLA_LINK_NETNSID | src_net | dev_net |
 +------------+-------------------+---------+---------+
 |            | absent            | source  | target  |
 | absent     +-------------------+---------+---------+
 |            | present           | link    | link    |
 +------------+-------------------+---------+---------+
 |            | absent            | peer    | target  |
 | present    +-------------------+---------+---------+
 |            | present           | peer    | link    |
 +------------+-------------------+---------+---------+

When IFLA_LINK_NETNSID is present, the device is created in link netns
first and then moved to target netns. This has some side effects,
including extra ifindex allocation, ifname validation and link events.
These could be avoided if we create it in target netns from
the beginning.

On the other hand, the meaning of src_net parameter is ambiguous. It
varies depending on how parameters are passed. It is the effective
link (or peer netns) by design, but some drivers ignore it and use
dev_net instead.

To provide more netns context for drivers, this patch packs existing
newlink() parameters, along with the source netns, link netns and peer
netns, into a struct. The old "src_net" is renamed to "net" to avoid
confusion with real source netns, and will be deprecated later. The use
of src_net are converted to params->net trivially.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-3-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agortnetlink: Lookup device in target netns when creating link
Xiao Liang [Wed, 19 Feb 2025 12:50:27 +0000 (20:50 +0800)] 
rtnetlink: Lookup device in target netns when creating link

When creating link, lookup for existing device in target net namespace
instead of current one.
For example, two links created by:

  # ip link add dummy1 type dummy
  # ip link add netns ns1 dummy1 type dummy

should have no conflict since they are in different namespaces.

Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250219125039.18024-2-shaw.leon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'dt-bindings-net-realtek-rtl9301-switch'
Jakub Kicinski [Fri, 21 Feb 2025 23:08:05 +0000 (15:08 -0800)] 
Merge branch 'dt-bindings-net-realtek-rtl9301-switch'

Chris Packham says:

====================
dt-bindings: net: realtek,rtl9301-switch (schema part)

This is my attempt at trying to sort out the mess I've created with the RTL9300
switch dt-bindings. Some context is available on [1] and [2].

The first patch just moves the binding from mfd/ to net/ (with an adjustment of
the internal path name). The next two patches are successors to patches already
sent as part of the series [3].

[1] - https://lore.kernel.org/lkml/20250204-eccentric-deer-of-felicity-02b7ee@krzk-bin/
[2] - https://lore.kernel.org/lkml/4e3c5d83-d215-4eff-bf02-6d420592df8f@alliedtelesis.co.nz/
[3] - https://lore.kernel.org/lkml/20250204030249.1965444-1-chris.packham@alliedtelesis.co.nz/
====================

Link: https://patch.msgid.link/20250218195216.1034220-1-chris.packham@alliedtelesis.co.nz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agodt-bindings: net: Add Realtek MDIO controller
Chris Packham [Tue, 18 Feb 2025 19:52:14 +0000 (08:52 +1300)] 
dt-bindings: net: Add Realtek MDIO controller

Add dtschema for the MDIO controller found in the RTL9300 Ethernet
switch. The controller is slightly unusual in that direct MDIO
communication is not possible. We model the MDIO controller with the
MDIO buses as child nodes and the PHYs as children of the buses. The
mapping of switch port number to MDIO bus/addr requires the
ethernet-ports sibling to provide the mapping via the phy-handle
property.

Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20250218195216.1034220-4-chris.packham@alliedtelesis.co.nz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agodt-bindings: net: Add switch ports and interrupts to RTL9300
Chris Packham [Tue, 18 Feb 2025 19:52:13 +0000 (08:52 +1300)] 
dt-bindings: net: Add switch ports and interrupts to RTL9300

Add bindings for the ethernet-switch and interrupt properties for the
RTL9300.

Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20250218195216.1034220-3-chris.packham@alliedtelesis.co.nz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agodt-bindings: net: Move realtek,rtl9301-switch to net
Chris Packham [Tue, 18 Feb 2025 19:52:12 +0000 (08:52 +1300)] 
dt-bindings: net: Move realtek,rtl9301-switch to net

Initially realtek,rtl9301-switch was placed under mfd/ because it had
some non-switch related blocks (specifically i2c and reset) but with a
bit more review it has become apparent that this was wrong and the
binding should live under net/.

Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Acked-by: Lee Jones <lee@kernel.org>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20250218195216.1034220-2-chris.packham@alliedtelesis.co.nz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: sfp: add quirk for 2.5G OEM BX SFP
Birger Koblitz [Tue, 18 Feb 2025 17:59:40 +0000 (18:59 +0100)] 
net: sfp: add quirk for 2.5G OEM BX SFP

The OEM SFP-2.5G-BX10-D/U SFP module pair is meant to operate with
2500Base-X. However, in their EEPROM they incorrectly specify:
Transceiver codes   : 0x00 0x12 0x00 0x00 0x12 0x00 0x01 0x05 0x00
BR, Nominal         : 2500MBd

Use sfp_quirk_2500basex for this module to allow 2500Base-X mode anyway.
Tested on BananaPi R3.

Signed-off-by: Birger Koblitz <mail@birger-koblitz.de>
Reviewed-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/20250218-b4-lkmsub-v1-1-1e51dcabed90@birger-koblitz.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phy: remove unused feature array declarations
Heiner Kallweit [Wed, 19 Feb 2025 20:15:05 +0000 (21:15 +0100)] 
net: phy: remove unused feature array declarations

After 12d5151be010 ("net: phy: remove leftovers from switch to linkmode
bitmaps") the following declarations are unused and can be removed too.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/b2883c75-4108-48f2-ab73-e81647262bc2@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'selftests-drv-net-improve-the-queue-test-for-xsk'
Jakub Kicinski [Fri, 21 Feb 2025 01:57:11 +0000 (17:57 -0800)] 
Merge branch 'selftests-drv-net-improve-the-queue-test-for-xsk'

Jakub Kicinski says:

====================
selftests: drv-net: improve the queue test for XSK

We see some flakes in the the XSK test:

   Exception| Traceback (most recent call last):
   Exception|   File "/home/virtme/testing-18/tools/testing/selftests/net/lib/py/ksft.py", line 218, in ksft_run
   Exception|     case(*args)
   Exception|   File "/home/virtme/testing-18/tools/testing/selftests/drivers/net/./queues.py", line 53, in check_xdp
   Exception|     ksft_eq(q['xsk'], {})
   Exception| KeyError: 'xsk'

I think it's because the method of running the helper in the background
is racy. Add more solid infra for waiting for a background helper to be
initialized.

v1: https://lore.kernel.org/20250218195048.74692-1-kuba@kernel.org
====================

Link: https://patch.msgid.link/20250219234956.520599-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests: drv-net: rename queues check_xdp to check_xsk
Jakub Kicinski [Wed, 19 Feb 2025 23:49:56 +0000 (15:49 -0800)] 
selftests: drv-net: rename queues check_xdp to check_xsk

The test is for AF_XDP, we refer to AF_XDP as XSK.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Tested-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250219234956.520599-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests: drv-net: improve the use of ksft helpers in XSK queue test
Jakub Kicinski [Wed, 19 Feb 2025 23:49:55 +0000 (15:49 -0800)] 
selftests: drv-net: improve the use of ksft helpers in XSK queue test

Avoid exceptions when xsk attr is not present, and add a proper ksft
helper for "not in" condition.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Tested-by: Kurt Kanzenbach <kurt@linutronix.de>
Tested-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250219234956.520599-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests: drv-net: add a way to wait for a local process
Jakub Kicinski [Wed, 19 Feb 2025 23:49:54 +0000 (15:49 -0800)] 
selftests: drv-net: add a way to wait for a local process

We use wait_port_listen() extensively to wait for a process
we spawned to be ready. Not all processes will open listening
sockets. Add a method of explicitly waiting for a child to
be ready. Pass a FD to the spawned process and wait for it
to write a message to us. FD number is passed via KSFT_READY_FD
env variable.

Similarly use KSFT_WAIT_FD to let the child process for a sign
that we are done and child should exit. Sending a signal to
a child with shell=True can get tricky.

Make use of this method in the queues test to make it less flaky.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Acked-by: Joe Damato <jdamato@fastly.com>
Tested-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250219234956.520599-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests: drv-net: probe for AF_XDP sockets more explicitly
Jakub Kicinski [Wed, 19 Feb 2025 23:49:53 +0000 (15:49 -0800)] 
selftests: drv-net: probe for AF_XDP sockets more explicitly

Separate the support check from socket binding for easier refactoring.
Use: ./helper - - just to probe if we can open the socket.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Tested-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250219234956.520599-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests: drv-net: add missing new line in xdp_helper
Jakub Kicinski [Wed, 19 Feb 2025 23:49:52 +0000 (15:49 -0800)] 
selftests: drv-net: add missing new line in xdp_helper

Kurt and Joe report missing new line at the end of Usage.

Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Tested-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250219234956.520599-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests: drv-net: use cfg.rpath() in netlink xsk attr test
Jakub Kicinski [Wed, 19 Feb 2025 23:49:51 +0000 (15:49 -0800)] 
selftests: drv-net: use cfg.rpath() in netlink xsk attr test

The cfg.rpath() helper was been recently added to make formatting
paths for helper binaries easier.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Tested-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250219234956.520599-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests: drv-net: add a warning for bkg + shell + terminate
Jakub Kicinski [Wed, 19 Feb 2025 23:49:50 +0000 (15:49 -0800)] 
selftests: drv-net: add a warning for bkg + shell + terminate

Joe Damato reports that some shells will fork before running
the command when python does "sh -c $cmd", while bash on my
machine does an exec of $cmd directly.

This will have implications for our ability to terminate
the child process on various configurations of bash and
other shells. Warn about using

bkg(... shell=True, termininate=True)

most background commands can hopefully exit cleanly (exit_wait).

Link: https://lore.kernel.org/Z7Yld21sv_Ip3gQx@LQ3V64L9R2
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Acked-by: Joe Damato <jdamato@fastly.com>
Tested-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250219234956.520599-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoocteontx2: hide unused label
Arnd Bergmann [Wed, 19 Feb 2025 16:21:14 +0000 (17:21 +0100)] 
octeontx2: hide unused label

A previous patch introduces a build-time warning when CONFIG_DCB
is disabled:

drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c: In function 'otx2_probe':
drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c:3217:1: error: label 'err_free_zc_bmap' defined but not used [-Werror=unused-label]
drivers/net/ethernet/marvell/octeontx2/nic/otx2_vf.c: In function 'otx2vf_probe':
drivers/net/ethernet/marvell/octeontx2/nic/otx2_vf.c:740:1: error: label 'err_free_zc_bmap' defined but not used [-Werror=unused-label]

Add the same #ifdef check around it.

Fixes: efabce290151 ("octeontx2-pf: AF_XDP zero copy receive support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Suman Ghosh <sumang@marvell.com>
Link: https://patch.msgid.link/20250219162239.1376865-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: phy: qt2025: Fix hardware revision check comment
Charalampos Mitrodimas [Wed, 19 Feb 2025 12:41:55 +0000 (12:41 +0000)] 
net: phy: qt2025: Fix hardware revision check comment

Correct the hardware revision check comment in the QT2025 driver. The
revision value was documented as 0x3b instead of the correct 0xb3,
which matches the actual comparison logic in the code.

Reviewed-by: FUJITA Tomonori <fujita.tomonori@gmail.com>
Signed-off-by: Charalampos Mitrodimas <charmitro@posteo.net>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Trevor Gross <tmgross@umich.edu>
Link: https://patch.msgid.link/20250219-qt2025-comment-fix-v2-1-029f67696516@posteo.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'mlx5-misc-enhancements-2025-02-19'
Jakub Kicinski [Fri, 21 Feb 2025 01:36:11 +0000 (17:36 -0800)] 
Merge branch 'mlx5-misc-enhancements-2025-02-19'

Tariq Toukan says:

====================
mlx5 misc enhancements 2025-02-19

This small series enhances the mlx5 ethtool link speed code (no
functional change), in addition to a Kconfig description enhancement.
====================

Link: https://patch.msgid.link/20250219114112.403808-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5e: Separate extended link modes request from link modes type selection
Shahar Shitrit [Wed, 19 Feb 2025 11:41:12 +0000 (13:41 +0200)] 
net/mlx5e: Separate extended link modes request from link modes type selection

The function ext_requested() serves two distinct purposes: it checks
if extended link modes were requested, and it selects whether to use
extended or legacy link modes.

This change separates these two purposes. Now, ext_link_mode_requested()
is used directly for checking if extended link modes are requested,
while the selection of extended modes is handled independently based on
the autonegotiation status.

By making this distinction, the logic for determining whether to select
extended or legacy link modes is clearer.

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250219114112.403808-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5e: Change eth_proto parameter naming
Shahar Shitrit [Wed, 19 Feb 2025 11:41:11 +0000 (13:41 +0200)] 
net/mlx5e: Change eth_proto parameter naming

eth_proto_cap parameter represents the supported link modes,
while eth_proto_admin refers to the configured ones.

The function get_advertising() retrieves the configured link
modes, thus we update its parameter name to eth_proto_admin.

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250219114112.403808-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5e: Introduce ptys2ethtool_process_link()
Shahar Shitrit [Wed, 19 Feb 2025 11:41:10 +0000 (13:41 +0200)] 
net/mlx5e: Introduce ptys2ethtool_process_link()

The functions ptys2ethtool_supported_link(), ptys2ethtool_adver_link()
share the same code, thus, in order to remove code duplication we
introduce a new function ptys2ethtool_process_link() to handle the
processing of both supported and advertised link modes.

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250219114112.403808-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5e: Refactor ptys2ethtool_adver_link()
Shahar Shitrit [Wed, 19 Feb 2025 11:41:09 +0000 (13:41 +0200)] 
net/mlx5e: Refactor ptys2ethtool_adver_link()

The function ptys2ethtool_adver_link() contains duplicated code that
is found in mlx5e_ethtool_get_speed_arr().

To eliminate this redundancy, we update mlx5e_ethtool_get_speed_arr()
to select the appropriate table based on the ext argument passed by
the caller, rather than querying the supported mode locally.

This allows us to replace the current logic in ptys2ethtool_adver_link()
with a call to mlx5e_ethtool_get_speed_arr().

This adjustment aligns with the ptys2ethtool_supported_link() function
and prepares for an upcoming patch that reduces code duplication.

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250219114112.403808-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet/mlx5: Bridge, correct config option description
Cosmin Ratiu [Wed, 19 Feb 2025 11:41:08 +0000 (13:41 +0200)] 
net/mlx5: Bridge, correct config option description

The implication of the previous help text was that without this option
enabled, representor devices couldn't be added to a bridge device, while
in fact that was possible, just that rules didn't get offloaded to hw.

This commit clarifies the help text.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250219114112.403808-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoneighbour: Replace kvzalloc() with kzalloc() when GFP_ATOMIC is specified
Kohei Enju [Wed, 19 Feb 2025 10:22:27 +0000 (19:22 +0900)] 
neighbour: Replace kvzalloc() with kzalloc() when GFP_ATOMIC is specified

kzalloc() uses page allocator when size is larger than
KMALLOC_MAX_CACHE_SIZE, so the intention of commit ab101c553bc1
("neighbour: use kvzalloc()/kvfree()") can be achieved by using kzalloc().

When using GFP_ATOMIC, kvzalloc() only tries the kmalloc path,
since the vmalloc path does not support the flag.
In this case, kvzalloc() is equivalent to kzalloc() in that neither try
the vmalloc path, so this replacement brings no functional change.
This is primarily a cleanup change, as the original code functions
correctly.

This patch replaces kvzalloc() introduced by commit 41b3caa7c076
("neighbour: Add hlist_node to struct neighbour"), which is called in
the same context and with the same gfp flag as the aforementioned commit
ab101c553bc1 ("neighbour: use kvzalloc()/kvfree()").

Signed-off-by: Kohei Enju <enjuk@amazon.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://patch.msgid.link/20250219102227.72488-1-enjuk@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'some-pktgen-fixes-improvments-part-i'
Jakub Kicinski [Fri, 21 Feb 2025 01:24:58 +0000 (17:24 -0800)] 
Merge branch 'some-pktgen-fixes-improvments-part-i'

Peter Seiderer says:

====================
Some pktgen fixes/improvments (part I)

While taking a look at '[PATCH net] pktgen: Avoid out-of-range in
get_imix_entries' ([1]) and '[PATCH net v2] pktgen: Avoid out-of-bounds
access in get_imix_entries' ([2], [3]) and doing some tests and code review
I detected that the /proc/net/pktgen/... parsing logic does not honour the
user given buffer bounds (resulting in out-of-bounds access).

This can be observed e.g. by the following simple test (sometimes the
old/'longer' previous value is re-read from the buffer):

        $ echo add_device lo@0 > /proc/net/pktgen/kpktgend_0

        $ echo "min_pkt_size 12345" > /proc/net/pktgen/lo\@0 && grep min_pkt_size /proc/net/pktgen/lo\@0
Params: count 1000  min_pkt_size: 12345  max_pkt_size: 0
Result: OK: min_pkt_size=12345

        $ echo -n "min_pkt_size 123" > /proc/net/pktgen/lo\@0 && grep min_pkt_size /proc/net/pktgen/lo\@0
Params: count 1000  min_pkt_size: 12345  max_pkt_size: 0
Result: OK: min_pkt_size=12345

        $ echo "min_pkt_size 123" > /proc/net/pktgen/lo\@0 && grep min_pkt_size /proc/net/pktgen/lo\@0
Params: count 1000  min_pkt_size: 123  max_pkt_size: 0
Result: OK: min_pkt_size=123

So fix the out-of-bounds access (and some minor findings) and add a simple
proc_net_pktgen selftest...

[1] https://lore.kernel.org/netdev/20241006221221.3744995-1-artem.chernyshev@red-soft.ru/
[2] https://lore.kernel.org/netdev/20250109083039.14004-1-pchelkin@ispras.ru/
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=76201b5979768500bca362871db66d77cb4c225e
====================

Link: https://patch.msgid.link/20250219084527.20488-1-ps.report@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: pktgen: fix access outside of user given buffer in pktgen_thread_write()
Peter Seiderer [Wed, 19 Feb 2025 08:45:27 +0000 (09:45 +0100)] 
net: pktgen: fix access outside of user given buffer in pktgen_thread_write()

Honour the user given buffer size for the strn_len() calls (otherwise
strn_len() will access memory outside of the user given buffer).

Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250219084527.20488-8-ps.report@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: pktgen: fix ctrl interface command parsing
Peter Seiderer [Wed, 19 Feb 2025 08:45:26 +0000 (09:45 +0100)] 
net: pktgen: fix ctrl interface command parsing

Enable command writing without trailing '\n':

- the good case

$ echo "reset" > /proc/net/pktgen/pgctrl

- the bad case (before the patch)

$ echo -n "reset" > /proc/net/pktgen/pgctrl
-bash: echo: write error: Invalid argument

- with patch applied

$ echo -n "reset" > /proc/net/pktgen/pgctrl

Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250219084527.20488-7-ps.report@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: pktgen: fix 'ratep 0' error handling (return -EINVAL)
Peter Seiderer [Wed, 19 Feb 2025 08:45:25 +0000 (09:45 +0100)] 
net: pktgen: fix 'ratep 0' error handling (return -EINVAL)

Given an invalid 'ratep' command e.g. 'ratep 0' the return value is '1',
leading to the following misleading output:

- the good case

$ echo "ratep 100" > /proc/net/pktgen/lo\@0
$ grep "Result:" /proc/net/pktgen/lo\@0
Result: OK: ratep=100

- the bad case (before the patch)

$ echo "ratep 0" > /proc/net/pktgen/lo\@0"
-bash: echo: write error: Invalid argument
$ grep "Result:" /proc/net/pktgen/lo\@0
Result: No such parameter "atep"

- with patch applied

$ echo "ratep 0" > /proc/net/pktgen/lo\@0
-bash: echo: write error: Invalid argument
$ grep "Result:" /proc/net/pktgen/lo\@0
Result: Idle

Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250219084527.20488-6-ps.report@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: pktgen: fix 'rate 0' error handling (return -EINVAL)
Peter Seiderer [Wed, 19 Feb 2025 08:45:24 +0000 (09:45 +0100)] 
net: pktgen: fix 'rate 0' error handling (return -EINVAL)

Given an invalid 'rate' command e.g. 'rate 0' the return value is '1',
leading to the following misleading output:

- the good case

$ echo "rate 100" > /proc/net/pktgen/lo\@0
$ grep "Result:" /proc/net/pktgen/lo\@0
Result: OK: rate=100

- the bad case (before the patch)

$ echo "rate 0" > /proc/net/pktgen/lo\@0"
-bash: echo: write error: Invalid argument
$ grep "Result:" /proc/net/pktgen/lo\@0
Result: No such parameter "ate"

- with patch applied

$ echo "rate 0" > /proc/net/pktgen/lo\@0
-bash: echo: write error: Invalid argument
$ grep "Result:" /proc/net/pktgen/lo\@0
Result: Idle

Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250219084527.20488-5-ps.report@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: pktgen: fix hex32_arg parsing for short reads
Peter Seiderer [Wed, 19 Feb 2025 08:45:23 +0000 (09:45 +0100)] 
net: pktgen: fix hex32_arg parsing for short reads

Fix hex32_arg parsing for short reads (here 7 hex digits instead of the
expected 8), shift result only on successful input parsing.

- before the patch

$ echo "mpls 0000123" > /proc/net/pktgen/lo\@0
$ grep mpls /proc/net/pktgen/lo\@0
     mpls: 00001230
Result: OK: mpls=00001230

- with patch applied

$ echo "mpls 0000123" > /proc/net/pktgen/lo\@0
$ grep mpls /proc/net/pktgen/lo\@0
     mpls: 00000123
Result: OK: mpls=00000123

Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250219084527.20488-4-ps.report@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: pktgen: enable 'param=value' parsing
Peter Seiderer [Wed, 19 Feb 2025 08:45:22 +0000 (09:45 +0100)] 
net: pktgen: enable 'param=value' parsing

Enable more flexible parameters syntax, allowing 'param=value' in
addition to the already supported 'param value' pattern (additional
this gives the skipping '=' in count_trail_chars() a purpose).

Tested with:

$ echo "min_pkt_size 999" > /proc/net/pktgen/lo\@0
$ echo "min_pkt_size=999" > /proc/net/pktgen/lo\@0
$ echo "min_pkt_size =999" > /proc/net/pktgen/lo\@0
$ echo "min_pkt_size= 999" > /proc/net/pktgen/lo\@0
$ echo "min_pkt_size = 999" > /proc/net/pktgen/lo\@0

Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250219084527.20488-3-ps.report@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: pktgen: replace ENOTSUPP with EOPNOTSUPP
Peter Seiderer [Wed, 19 Feb 2025 08:45:21 +0000 (09:45 +0100)] 
net: pktgen: replace ENOTSUPP with EOPNOTSUPP

Replace ENOTSUPP with EOPNOTSUPP, fixes checkpatch hint

  WARNING: ENOTSUPP is not a SUSV4 error code, prefer EOPNOTSUPP

and e.g.

  $ echo "clone_skb 1" > /proc/net/pktgen/lo\@0
  -bash: echo: write error: Unknown error 524

Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250219084527.20488-2-ps.report@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoaf_unix: Fix undefined 'other' error
Purva Yeshi [Tue, 18 Feb 2025 14:10:45 +0000 (19:40 +0530)] 
af_unix: Fix undefined 'other' error

Fix an issue with the sparse static analysis tool where an
"undefined 'other'" error occurs due to `__releases(&unix_sk(other)->lock)`
being placed before 'other' is in scope.

Remove the `__releases()` annotation from the `unix_wait_for_peer()`
function to eliminate the sparse error. The annotation references `other`
before it is declared, leading to a false positive error during static
analysis.

Since AF_UNIX does not use sparse annotations, this annotation is
unnecessary and does not impact functionality.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Purva Yeshi <purvayeshi550@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250218141045.38947-1-purvayeshi550@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoeth: fbnic: Add ethtool support for IRQ coalescing
Mohsin Bashir [Tue, 18 Feb 2025 02:35:20 +0000 (18:35 -0800)] 
eth: fbnic: Add ethtool support for IRQ coalescing

Add ethtool support to configure the IRQ coalescing behavior. Support
separate timers for Rx and Tx for time based coalescing. For frame based
configuration, currently we only support the Rx side.

The hardware allows configuration of descriptor count instead of frame
count requiring conversion between the two. We assume 2 descriptors
per frame, one for the metadata and one for the data segment.

When rx-frames are not configured, we set the RX descriptor count to
half the ring size as a fail safe.

Default configuration:
ethtool -c eth0 | grep -E "rx-usecs:|tx-usecs:|rx-frames:"
rx-usecs:       30
rx-frames:      0
tx-usecs:       35

IRQ rate test:
With single iperf flow we monitor IRQ rate while changing the tx-usesc and
rx-usecs to high and low values.

ethtool -C eth0 rx-frames 8192 rx-usecs 150 tx-usecs 150
irq/sec   13k
irq/sec   14k
irq/sec   14k

ethtool -C eth0 rx-frames 8192 rx-usecs 10 tx-usecs 10
irq/sec  27k
irq/sec  28k
irq/sec  28k

Validating the use of extack:
ethtool -C eth0 rx-frames 16384
netlink error: fbnic: rx_frames is above device max
netlink error: Invalid argument

Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20250218023520.2038010-1-mohsin.bashr@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'support-ptp-clock-for-wangxun-nics'
Jakub Kicinski [Thu, 20 Feb 2025 22:59:39 +0000 (14:59 -0800)] 
Merge branch 'support-ptp-clock-for-wangxun-nics'

Jiawen Wu says:

====================
Support PTP clock for Wangxun NICs

Implement support for PTP clock on Wangxun NICs.

v7: https://lore.kernel.org/20250213083041.78917-1-jiawenwu@trustnetic.com
v6: https://lore.kernel.org/20250208031348.4368-1-jiawenwu@trustnetic.com
v5: https://lore.kernel.org/20250117062051.2257073-1-jiawenwu@trustnetic.com
v4: https://lore.kernel.org/20250114084425.2203428-1-jiawenwu@trustnetic.com
v3: https://lore.kernel.org/20250110031716.2120642-1-jiawenwu@trustnetic.com
v2: https://lore.kernel.org/20250106084506.2042912-1-jiawenwu@trustnetic.com
v1: https://lore.kernel.org/20250102103026.1982137-1-jiawenwu@trustnetic.com
====================

Link: https://patch.msgid.link/20250218023432.146536-1-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: ngbe: Add support for 1PPS and TOD
Jiawen Wu [Tue, 18 Feb 2025 02:34:32 +0000 (10:34 +0800)] 
net: ngbe: Add support for 1PPS and TOD

Implement support for generating a 1pps output signal on SDP0.
And support custom firmware to output TOD.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Link: https://patch.msgid.link/20250218023432.146536-5-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: wangxun: Add periodic checks for overflow and errors
Jiawen Wu [Tue, 18 Feb 2025 02:34:31 +0000 (10:34 +0800)] 
net: wangxun: Add periodic checks for overflow and errors

Implement watchdog task to detect SYSTIME overflow and error cases of
Rx/Tx timestamp.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250218023432.146536-4-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: wangxun: Support to get ts info
Jiawen Wu [Tue, 18 Feb 2025 02:34:30 +0000 (10:34 +0800)] 
net: wangxun: Support to get ts info

Implement the function get_ts_info and get_ts_stats in ethtool_ops to
get the HW capabilities and statistics for timestamping.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250218023432.146536-3-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: wangxun: Add support for PTP clock
Jiawen Wu [Tue, 18 Feb 2025 02:34:29 +0000 (10:34 +0800)] 
net: wangxun: Add support for PTP clock

Implement support for PTP clock on Wangxun NICs.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250218023432.146536-2-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agotun: Pad virtio headers
Akihiko Odaki [Sat, 15 Feb 2025 06:04:50 +0000 (15:04 +0900)] 
tun: Pad virtio headers

tun simply advances iov_iter when it needs to pad virtio header,
which leaves the garbage in the buffer as is. This will become
especially problematic when tun starts to allow enabling the hash
reporting feature; even if the feature is enabled, the packet may lack a
hash value and may contain a hole in the virtio header because the
packet arrived before the feature gets enabled or does not contain the
header fields to be hashed. If the hole is not filled with zero, it is
impossible to tell if the packet lacks a hash value.

In theory, a user of tun can fill the buffer with zero before calling
read() to avoid such a problem, but leaving the garbage in the buffer is
awkward anyway so replace advancing the iterator with writing zeros.

A user might have initialized the buffer to some non-zero value,
expecting tun to skip writing it. As this was never a documented
feature, this seems unlikely.

The overhead of filling the hole in the header is negligible when the
header size is specified according to the specification as doing so will
not make another cache line dirty under a reasonable assumption. Below
is a proof of this statement:

The first 10 bytes of the header is always written and tun also writes
the packet itself immediately after the packet unless the packet is
empty. This makes a hole between these writes whose size is: sz - 10
where sz is the specified header size.

Therefore, we will never make another cache line dirty when:
sz < L1_CACHE_BYTES + 10
where L1_CACHE_BYTES is the cache line size. Assuming
L1_CACHE_BYTES >= 16, this inequation holds when: sz < 26.

sz <= 20 according to the current specification so we even have a
margin of 5 bytes in case that the header size grows in a future version
of the specification.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Link: https://patch.msgid.link/20250215-buffers-v2-1-1fbc6aaf8ad6@daynix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonetdevsim: call napi_schedule from a timer context
Breno Leitao [Wed, 19 Feb 2025 16:41:20 +0000 (08:41 -0800)] 
netdevsim: call napi_schedule from a timer context

The netdevsim driver was experiencing NOHZ tick-stop errors during packet
transmission due to pending softirq work when calling napi_schedule().
This issue was observed when running the netconsole selftest, which
triggered the following error message:

  NOHZ tick-stop error: local softirq work is pending, handler #08!!!

To fix this issue, introduce a timer that schedules napi_schedule()
from a timer context instead of calling it directly from the TX path.

Create an hrtimer for each queue and kick it from the TX path,
which then schedules napi_schedule() from the timer context.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20250219-netdevsim-v3-1-811e2b8abc4c@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'flexible-array-for-ip-tunnel-options'
Jakub Kicinski [Thu, 20 Feb 2025 21:17:21 +0000 (13:17 -0800)] 
Merge branch 'flexible-array-for-ip-tunnel-options'

Gal Pressman says:

====================
Flexible array for ip tunnel options

Remove the hidden assumption that options are allocated at the end of
the struct, and teach the compiler about them using a flexible array.

First patch is converting hard-coded 'info + 1' to use ip_tunnel_info()
helper.
Second patch adds the 'options' flexible array and changes the helper to
use it.

v4: https://lore.kernel.org/20250217202503.265318-1-gal@nvidia.com
v3: https://lore.kernel.org/20250212140953.107533-1-gal@nvidia.com
v2: https://lore.kernel.org/20250209101853.15828-1-gal@nvidia.com
====================

Link: https://patch.msgid.link/20250219143256.370277-1-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: Add options as a flexible array to struct ip_tunnel_info
Gal Pressman [Wed, 19 Feb 2025 14:32:56 +0000 (16:32 +0200)] 
net: Add options as a flexible array to struct ip_tunnel_info

Remove the hidden assumption that options are allocated at the end of
the struct, and teach the compiler about them using a flexible array.

With this, we can revert the unsafe_memcpy() call we have in
tun_dst_unclone() [1], and resolve the false field-spanning write
warning caused by the memcpy() in ip_tunnel_info_opts_set().

The layout of struct ip_tunnel_info remains the same with this patch.
Before this patch, there was an implicit padding at the end of the
struct, options would be written at 'info + 1' which is after the
padding.
This will remain the same as this patch explicitly aligns 'options'.
The alignment is needed as the options are later casted to different
structs, and might result in unaligned memory access.

Pahole output before this patch:
struct ip_tunnel_info {
    struct ip_tunnel_key       key;                  /*     0    64 */

    /* XXX last struct has 1 byte of padding */

    /* --- cacheline 1 boundary (64 bytes) --- */
    struct ip_tunnel_encap     encap;                /*    64     8 */
    struct dst_cache           dst_cache;            /*    72    16 */
    u8                         options_len;          /*    88     1 */
    u8                         mode;                 /*    89     1 */

    /* size: 96, cachelines: 2, members: 5 */
    /* padding: 6 */
    /* paddings: 1, sum paddings: 1 */
    /* last cacheline: 32 bytes */
};

Pahole output after this patch:
struct ip_tunnel_info {
    struct ip_tunnel_key       key;                  /*     0    64 */

    /* XXX last struct has 1 byte of padding */

    /* --- cacheline 1 boundary (64 bytes) --- */
    struct ip_tunnel_encap     encap;                /*    64     8 */
    struct dst_cache           dst_cache;            /*    72    16 */
    u8                         options_len;          /*    88     1 */
    u8                         mode;                 /*    89     1 */

    /* XXX 6 bytes hole, try to pack */

    u8                         options[] __attribute__((__aligned__(16))); /*    96     0 */

    /* size: 96, cachelines: 2, members: 6 */
    /* sum members: 90, holes: 1, sum holes: 6 */
    /* paddings: 1, sum paddings: 1 */
    /* forced alignments: 1, forced holes: 1, sum forced holes: 6 */
    /* last cacheline: 32 bytes */
} __attribute__((__aligned__(16)));

[1] Commit 13cfd6a6d7ac ("net: Silence false field-spanning write warning in metadata_dst memcpy")

Link: https://lore.kernel.org/all/53D1D353-B8F6-4ADC-8F29-8C48A7C9C6F1@kernel.org/
Suggested-by: Kees Cook <kees@kernel.org>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20250219143256.370277-3-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoip_tunnel: Use ip_tunnel_info() helper instead of 'info + 1'
Gal Pressman [Wed, 19 Feb 2025 14:32:55 +0000 (16:32 +0200)] 
ip_tunnel: Use ip_tunnel_info() helper instead of 'info + 1'

Tunnel options should not be accessed directly, use the ip_tunnel_info()
accessor instead.

Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20250219143256.370277-2-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Jakub Kicinski [Thu, 20 Feb 2025 18:36:34 +0000 (10:36 -0800)] 
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-6.14-rc4).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge tag 'net-6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 20 Feb 2025 18:19:54 +0000 (10:19 -0800)] 
Merge tag 'net-6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Smaller than usual with no fixes from any subtree.

  Current release - regressions:

   - core: fix race of rtnl_net_lock(dev_net(dev))

  Previous releases - regressions:

   - core: remove the single page frag cache for good

   - flow_dissector: fix handling of mixed port and port-range keys

   - sched: cls_api: fix error handling causing NULL dereference

   - tcp:
       - adjust rcvq_space after updating scaling ratio
       - drop secpath at the same time as we currently drop dst

   - eth: gtp: suppress list corruption splat in gtp_net_exit_batch_rtnl().

  Previous releases - always broken:

   - vsock:
       - fix variables initialization during resuming
       - for connectible sockets allow only connected

   - eth:
       - geneve: fix use-after-free in geneve_find_dev()
       - ibmvnic: don't reference skb after sending to VIOS"

* tag 'net-6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (34 commits)
  Revert "net: skb: introduce and use a single page frag cache"
  net: allow small head cache usage with large MAX_SKB_FRAGS values
  nfp: bpf: Add check for nfp_app_ctrl_msg_alloc()
  tcp: drop secpath at the same time as we currently drop dst
  net: axienet: Set mac_managed_pm
  arp: switch to dev_getbyhwaddr() in arp_req_set_public()
  net: Add non-RCU dev_getbyhwaddr() helper
  sctp: Fix undefined behavior in left shift operation
  selftests/bpf: Add a specific dst port matching
  flow_dissector: Fix port range key handling in BPF conversion
  selftests/net/forwarding: Add a test case for tc-flower of mixed port and port-range
  flow_dissector: Fix handling of mixed port and port-range keys
  geneve: Suppress list corruption splat in geneve_destroy_tunnels().
  gtp: Suppress list corruption splat in gtp_net_exit_batch_rtnl().
  dev: Use rtnl_net_dev_lock() in unregister_netdev().
  net: Fix dev_net(dev) race in unregister_netdevice_notifier_dev_net().
  net: Add net_passive_inc() and net_passive_dec().
  net: pse-pd: pd692x0: Fix power limit retrieval
  MAINTAINERS: trim the GVE entry
  gve: set xdp redirect target only when it is available
  ...

3 months agoMerge tag 'v6.14-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Thu, 20 Feb 2025 16:59:00 +0000 (08:59 -0800)] 
Merge tag 'v6.14-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - Fix for chmod regression

 - Two reparse point related fixes

 - One minor cleanup (for GCC 14 compiles)

 - Fix for SMB3.1.1 POSIX Extensions reporting incorrect file type

* tag 'v6.14-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Treat unhandled directory name surrogate reparse points as mount directory nodes
  cifs: Throw -EOPNOTSUPP error on unsupported reparse point type from parse_reparse_point()
  smb311: failure to open files of length 1040 when mounting with SMB3.1.1 POSIX extensions
  smb: client, common: Avoid multiple -Wflex-array-member-not-at-end warnings
  smb: client: fix chmod(2) regression with ATTR_READONLY

3 months agoMerge tag 'bcachefs-2025-02-20' of git://evilpiepirate.org/bcachefs
Linus Torvalds [Thu, 20 Feb 2025 16:51:57 +0000 (08:51 -0800)] 
Merge tag 'bcachefs-2025-02-20' of git://evilpiepirate.org/bcachefs

Pull bcachefs fixes from Kent Overstreet:
 "Small stuff:

   - The fsck code for Hongbo's directory i_size patch was wrong, caught
     by transaction restart injection: we now have the CI running
     another test variant with restart injection enabled

   - Another fixup for reflink pointers to missing indirect extents:
     previous fix was for fsck code, this fixes the normal runtime paths

   - Another small srcu lock hold time fix, reported by jpsollie"

* tag 'bcachefs-2025-02-20' of git://evilpiepirate.org/bcachefs:
  bcachefs: Fix srcu lock warning in btree_update_nodes_written()
  bcachefs: Fix bch2_indirect_extent_missing_error()
  bcachefs: Fix fsck directory i_size checking

3 months agoMerge tag 'xfs-fixes-6.14-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Linus Torvalds [Thu, 20 Feb 2025 16:48:55 +0000 (08:48 -0800)] 
Merge tag 'xfs-fixes-6.14-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Carlos Maiolino:
 "Just a collection of bug fixes, nothing really stands out"

* tag 'xfs-fixes-6.14-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: flush inodegc before swapon
  xfs: rename xfs_iomap_swapfile_activate to xfs_vm_swap_activate
  xfs: Do not allow norecovery mount with quotacheck
  xfs: do not check NEEDSREPAIR if ro,norecovery mount.
  xfs: fix data fork format filtering during inode repair
  xfs: fix online repair probing when CONFIG_XFS_ONLINE_REPAIR=n

3 months agoMerge branch 'net-remove-the-single-page-frag-cache-for-good'
Paolo Abeni [Thu, 20 Feb 2025 09:53:31 +0000 (10:53 +0100)] 
Merge branch 'net-remove-the-single-page-frag-cache-for-good'

Paolo Abeni says:

====================
net: remove the single page frag cache for good

This is another attempt at reverting commit dbae2b062824 ("net: skb:
introduce and use a single page frag cache"), as it causes regressions
in specific use-cases.

Reverting such commit uncovers an allocation issue for build with
CONFIG_MAX_SKB_FRAGS=45, as reported by Sabrina.

This series handle the latter in patch 1 and brings the revert in patch
2.

Note that there is a little chicken-egg problem, as I included into the
patch 1's changelog the splat that would be visible only applying first
the revert: I think current patch order is better for bisectability,
still the splat is useful for correct attribution.
====================

Link: https://patch.msgid.link/cover.1739899357.git.pabeni@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agoRevert "net: skb: introduce and use a single page frag cache"
Paolo Abeni [Tue, 18 Feb 2025 18:29:40 +0000 (19:29 +0100)] 
Revert "net: skb: introduce and use a single page frag cache"

After the previous commit is finally safe to revert commit dbae2b062824
("net: skb: introduce and use a single page frag cache"): do it here.

The intended goal of such change was to counter a performance regression
introduced by commit 3226b158e67c ("net: avoid 32 x truesize
under-estimation for tiny skbs").

Unfortunately, the blamed commit introduces another regression for the
virtio_net driver. Such a driver calls napi_alloc_skb() with a tiny
size, so that the whole head frag could fit a 512-byte block.

The single page frag cache uses a 1K fragment for such allocation, and
the additional overhead, under small UDP packets flood, makes the page
allocator a bottleneck.

Thanks to commit bf9f1baa279f ("net: add dedicated kmem_cache for
typical/small skb->head"), this revert does not re-introduce the
original regression. Actually, in the relevant test on top of this
revert, I measure a small but noticeable positive delta, just above
noise level.

The revert itself required some additional mangling due to recent updates
in the affected code.

Suggested-by: Eric Dumazet <edumazet@google.com>
Fixes: dbae2b062824 ("net: skb: introduce and use a single page frag cache")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agonet: allow small head cache usage with large MAX_SKB_FRAGS values
Paolo Abeni [Tue, 18 Feb 2025 18:29:39 +0000 (19:29 +0100)] 
net: allow small head cache usage with large MAX_SKB_FRAGS values

Sabrina reported the following splat:

    WARNING: CPU: 0 PID: 1 at net/core/dev.c:6935 netif_napi_add_weight_locked+0x8f2/0xba0
    Modules linked in:
    CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc1-net-00092-g011b03359038 #996
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
    RIP: 0010:netif_napi_add_weight_locked+0x8f2/0xba0
    Code: e8 c3 e6 6a fe 48 83 c4 28 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc c7 44 24 10 ff ff ff ff e9 8f fb ff ff e8 9e e6 6a fe <0f> 0b e9 d3 fe ff ff e8 92 e6 6a fe 48 8b 04 24 be ff ff ff ff 48
    RSP: 0000:ffffc9000001fc60 EFLAGS: 00010293
    RAX: 0000000000000000 RBX: ffff88806ce48128 RCX: 1ffff11001664b9e
    RDX: ffff888008f00040 RSI: ffffffff8317ca42 RDI: ffff88800b325cb6
    RBP: ffff88800b325c40 R08: 0000000000000001 R09: ffffed100167502c
    R10: ffff88800b3a8163 R11: 0000000000000000 R12: ffff88800ac1c168
    R13: ffff88800ac1c168 R14: ffff88800ac1c168 R15: 0000000000000007
    FS:  0000000000000000(0000) GS:ffff88806ce00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff888008201000 CR3: 0000000004c94001 CR4: 0000000000370ef0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    <TASK>
    gro_cells_init+0x1ba/0x270
    xfrm_input_init+0x4b/0x2a0
    xfrm_init+0x38/0x50
    ip_rt_init+0x2d7/0x350
    ip_init+0xf/0x20
    inet_init+0x406/0x590
    do_one_initcall+0x9d/0x2e0
    do_initcalls+0x23b/0x280
    kernel_init_freeable+0x445/0x490
    kernel_init+0x20/0x1d0
    ret_from_fork+0x46/0x80
    ret_from_fork_asm+0x1a/0x30
    </TASK>
    irq event stamp: 584330
    hardirqs last  enabled at (584338): [<ffffffff8168bf87>] __up_console_sem+0x77/0xb0
    hardirqs last disabled at (584345): [<ffffffff8168bf6c>] __up_console_sem+0x5c/0xb0
    softirqs last  enabled at (583242): [<ffffffff833ee96d>] netlink_insert+0x14d/0x470
    softirqs last disabled at (583754): [<ffffffff8317c8cd>] netif_napi_add_weight_locked+0x77d/0xba0

on kernel built with MAX_SKB_FRAGS=45, where SKB_WITH_OVERHEAD(1024)
is smaller than GRO_MAX_HEAD.

Such built additionally contains the revert of the single page frag cache
so that napi_get_frags() ends up using the page frag allocator, triggering
the splat.

Note that the underlying issue is independent from the mentioned
revert; address it ensuring that the small head cache will fit either TCP
and GRO allocation and updating napi_alloc_skb() and __netdev_alloc_skb()
to select kmalloc() usage for any allocation fitting such cache.

Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Suggested-by: Eric Dumazet <edumazet@google.com>
Fixes: 3948b05950fd ("net: introduce a config option to tweak MAX_SKB_FRAGS")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agoMerge tag 'linux-can-next-for-6.15-20250219' of git://git.kernel.org/pub/scm/linux...
Paolo Abeni [Thu, 20 Feb 2025 09:18:37 +0000 (10:18 +0100)] 
Merge tag 'linux-can-next-for-6.15-20250219' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2025-02-19

this is a pull request of 12 patches for net-next/master.

The first 4 patches are by Krzysztof Kozlowski and simplify the c_can
driver's c_can_plat_probe() function.

Ciprian Marian Costea contributes 3 patches to add S32G2/S32G3 support
to the flexcan driver.

Ruffalo Lavoisier's patch removes a duplicated word from the mcp251xfd
DT bindings documentation.

Oleksij Rempel extends the J1939 documentation.

The next patch is by Oliver Hartkopp and adds access for the Remote
Request Substitution bit in CAN-XL frames.

Henrik Brix Andersen's patch for the gs_usb driver adds support for
the CANnectivity firmware.

The last patch is by Robin van der Gracht and removes a duplicated
setup of RX FIFO in the rockchip_canfd driver.

linux-can-next-for-6.15-20250219

* tag 'linux-can-next-for-6.15-20250219' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
  can: rockchip_canfd: rkcanfd_chip_fifo_setup(): remove duplicated setup of RX FIFO
  can: gs_usb: add VID/PID for the CANnectivity firmware
  can: canxl: support Remote Request Substitution bit access
  can: j1939: Extend stack documentation with buffer size behavior
  dt-binding: can: mcp251xfd: remove duplicate word
  can: flexcan: add NXP S32G2/S32G3 SoC support
  can: flexcan: Add quirk to handle separate interrupt lines for mailboxes
  dt-bindings: can: fsl,flexcan: add S32G2/S32G3 SoC support
  can: c_can: Use syscon_regmap_lookup_by_phandle_args
  can: c_can: Use of_property_present() to test existence of DT property
  can: c_can: Simplify handling syscon error path
  can: c_can: Drop useless final probe failure message
====================

Link: https://patch.msgid.link/20250219113354.529611-1-mkl@pengutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agonfp: bpf: Add check for nfp_app_ctrl_msg_alloc()
Haoxiang Li [Tue, 18 Feb 2025 03:04:09 +0000 (11:04 +0800)] 
nfp: bpf: Add check for nfp_app_ctrl_msg_alloc()

Add check for the return value of nfp_app_ctrl_msg_alloc() in
nfp_bpf_cmsg_alloc() to prevent null pointer dereference.

Fixes: ff3d43f7568c ("nfp: bpf: implement helpers for FW map ops")
Cc: stable@vger.kernel.org
Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
Link: https://patch.msgid.link/20250218030409.2425798-1-haoxiang_li2024@163.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agotcp: drop secpath at the same time as we currently drop dst
Sabrina Dubroca [Mon, 17 Feb 2025 10:23:35 +0000 (11:23 +0100)] 
tcp: drop secpath at the same time as we currently drop dst

Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while
running tests that boil down to:
 - create a pair of netns
 - run a basic TCP test over ipcomp6
 - delete the pair of netns

The xfrm_state found on spi_byaddr was not deleted at the time we
delete the netns, because we still have a reference on it. This
lingering reference comes from a secpath (which holds a ref on the
xfrm_state), which is still attached to an skb. This skb is not
leaked, it ends up on sk_receive_queue and then gets defer-free'd by
skb_attempt_defer_free.

The problem happens when we defer freeing an skb (push it on one CPU's
defer_list), and don't flush that list before the netns is deleted. In
that case, we still have a reference on the xfrm_state that we don't
expect at this point.

We already drop the skb's dst in the TCP receive path when it's no
longer needed, so let's also drop the secpath. At this point,
tcp_filter has already called into the LSM hooks that may require the
secpath, so it should not be needed anymore. However, in some of those
places, the MPTCP extension has just been attached to the skb, so we
cannot simply drop all extensions.

Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/5055ba8f8f72bdcb602faa299faca73c280b7735.1739743613.git.sd@queasysnail.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agonet: axienet: Set mac_managed_pm
Nick Hu [Mon, 17 Feb 2025 05:58:42 +0000 (13:58 +0800)] 
net: axienet: Set mac_managed_pm

The external PHY will undergo a soft reset twice during the resume process
when it wake up from suspend. The first reset occurs when the axienet
driver calls phylink_of_phy_connect(), and the second occurs when
mdio_bus_phy_resume() invokes phy_init_hw(). The second soft reset of the
external PHY does not reinitialize the internal PHY, which causes issues
with the internal PHY, resulting in the PHY link being down. To prevent
this, setting the mac_managed_pm flag skips the mdio_bus_phy_resume()
function.

Fixes: a129b41fe0a8 ("Revert "net: phy: dp83867: perform soft reset and retain established link"")
Signed-off-by: Nick Hu <nick.hu@sifive.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250217055843.19799-1-nick.hu@sifive.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 months agoMerge branch 'selftests-drv-net-add-a-simple-tso-test'
Jakub Kicinski [Thu, 20 Feb 2025 03:08:53 +0000 (19:08 -0800)] 
Merge branch 'selftests-drv-net-add-a-simple-tso-test'

Jakub Kicinski says:

====================
selftests: drv-net: add a simple TSO test

Add a simple test for exercising TSO over tunnels.

Similarly to csum test we want to iterate over ip versions.
Rework how addresses are stored in env to make this easier.
====================

Link: https://patch.msgid.link/20250218225426.77726-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests: drv-net: add a simple TSO test
Jakub Kicinski [Tue, 18 Feb 2025 22:54:26 +0000 (14:54 -0800)] 
selftests: drv-net: add a simple TSO test

Add a simple test for TSO. Send a few MB of data and check device
stats to verify that the device was performing segmentation.
Do the same thing over a few tunnel types.

Injecting GSO packets directly would give us more ability to test
corner cases, but perhaps starting simple is good enough?

  # ./ksft-net-drv/drivers/net/hw/tso.py
  # Detected qstat for LSO wire-packets
  KTAP version 1
  1..14
  ok 1 tso.ipv4 # SKIP Test requires IPv4 connectivity
  ok 2 tso.vxlan4_ipv4 # SKIP Test requires IPv4 connectivity
  ok 3 tso.vxlan6_ipv4 # SKIP Test requires IPv4 connectivity
  ok 4 tso.vxlan_csum4_ipv4 # SKIP Test requires IPv4 connectivity
  ok 5 tso.vxlan_csum6_ipv4 # SKIP Test requires IPv4 connectivity
  ok 6 tso.gre4_ipv4 # SKIP Test requires IPv4 connectivity
  ok 7 tso.gre6_ipv4 # SKIP Test requires IPv4 connectivity
  ok 8 tso.ipv6
  ok 9 tso.vxlan4_ipv6
  ok 10 tso.vxlan6_ipv6
  ok 11 tso.vxlan_csum4_ipv6
  ok 12 tso.vxlan_csum6_ipv6
  # Testing with mangleid enabled
  ok 13 tso.gre4_ipv6
  ok 14 tso.gre6_ipv6
  # Totals: pass:7 fail:0 xfail:0 xpass:0 skip:7 error:0

Note that the test currently depends on the driver reporting
the LSO count via qstat, which appears to be relatively rare
(virtio, cisco/enic, sfc/efc; but virtio needs host support).

Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250218225426.77726-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests: drv-net: store addresses in dict indexed by ipver
Jakub Kicinski [Tue, 18 Feb 2025 22:54:25 +0000 (14:54 -0800)] 
selftests: drv-net: store addresses in dict indexed by ipver

Looks like more and more tests want to iterate over IP version,
run the same test over ipv4 and ipv6. The current naming of
members in the env class makes it a bit awkward, we have
separate members for ipv4 and ipv6 parameters.

Store the parameters inside dicts, so that tests can easily
index them with ip version.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250218225426.77726-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests: drv-net: get detailed interface info
Jakub Kicinski [Tue, 18 Feb 2025 22:54:24 +0000 (14:54 -0800)] 
selftests: drv-net: get detailed interface info

We already record output of ip link for NETIF in env for easy access.
Record the detailed version. TSO test will want to know the max tso size.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20250218225426.77726-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests: drv-net: resolve remote interface name
Jakub Kicinski [Tue, 18 Feb 2025 22:54:23 +0000 (14:54 -0800)] 
selftests: drv-net: resolve remote interface name

Find out and record in env the name of the interface which remote host
will use for the IP address provided via config.

Interface name is useful for mausezahn and for setting up tunnels.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20250218225426.77726-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'mptcp-rx-path-refactor'
Jakub Kicinski [Thu, 20 Feb 2025 03:05:30 +0000 (19:05 -0800)] 
Merge branch 'mptcp-rx-path-refactor'

Matthieu Baerts says:

====================
mptcp: rx path refactor

Paolo worked on this RX path refactor for these two main reasons:

- Currently, the MPTCP RX path introduces quite a bit of 'exceptional'
  accounting/locking processing WRT to plain TCP, adding up to the
  implementation complexity in a miserable way.

- The performance gap WRT plain TCP for single subflow connections is
  quite measurable.

The present refactor addresses both the above items: most of the
additional complexity is dropped, and single stream performances
increase measurably, from 55Gbps to 71Gbps in Paolo's loopback test.
As a reference, plain TCP was around 84Gbps on the same host.

The above comes to a price: the patch are invasive, even in subtle ways.

Note: patch 5/7 removes the sk_forward_alloc_get() helper, which caused
some trivial modifications in different places in the net tree: sockets,
IPv4, sched. That's why a few more people have been Cc here. Feel free
to only look at this patch 5/7.
====================

Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-0-4a47d90d7998@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agomptcp: micro-optimize __mptcp_move_skb()
Paolo Abeni [Tue, 18 Feb 2025 18:36:18 +0000 (19:36 +0100)] 
mptcp: micro-optimize __mptcp_move_skb()

After the RX path refactor the mentioned function is expected to run
frequently, let's optimize it a bit.

Scan for ready subflow from the last processed one, and stop after
traversing the list once or reaching the msk memory limit - instead of
looking for dubious per-subflow conditions.
Also re-order the memory limit checks, to avoid duplicate tests.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-7-4a47d90d7998@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agomptcp: dismiss __mptcp_rmem()
Paolo Abeni [Tue, 18 Feb 2025 18:36:17 +0000 (19:36 +0100)] 
mptcp: dismiss __mptcp_rmem()

After the RX path refactor, it become a wrapper for sk_rmem_alloc
access, with a slightly misleading name. Just drop it.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-6-4a47d90d7998@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: dismiss sk_forward_alloc_get()
Paolo Abeni [Tue, 18 Feb 2025 18:36:16 +0000 (19:36 +0100)] 
net: dismiss sk_forward_alloc_get()

After the previous patch we can remove the forward_alloc_get
proto callback, basically reverting commit 292e6077b040 ("net: introduce
sk_forward_alloc_get()") and commit 66d58f046c9d ("net: use
sk_forward_alloc_get() in sk_get_meminfo()").

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-5-4a47d90d7998@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agomptcp: cleanup mem accounting
Paolo Abeni [Tue, 18 Feb 2025 18:36:15 +0000 (19:36 +0100)] 
mptcp: cleanup mem accounting

After the previous patch, updating sk_forward_memory is cheap and
we can drop a lot of complexity from the MPTCP memory accounting,
removing the custom fwd mem allocations for rmem.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-4-4a47d90d7998@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agomptcp: move the whole rx path under msk socket lock protection
Paolo Abeni [Tue, 18 Feb 2025 18:36:14 +0000 (19:36 +0100)] 
mptcp: move the whole rx path under msk socket lock protection

After commit c2e6048fa1cf ("mptcp: fix race in release_cb") we can
move the whole MPTCP rx path under the socket lock leveraging the
release_cb.

We can drop a bunch of spin_lock pairs in the receive functions, use
a single receive queue and invoke __mptcp_move_skbs only when subflows
ask for it.

This will allow more cleanup in the next patch.

Some changes are worth specific mention:

The msk rcvbuf update now always happens under both the msk and the
subflow socket lock: we can drop a bunch of ONCE annotation and
consolidate the checks.

When the skbs move is delayed at msk release callback time, even the
msk rcvbuf update is delayed; additionally take care of such action in
__mptcp_move_skbs().

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-3-4a47d90d7998@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agomptcp: drop __mptcp_fastopen_gen_msk_ackseq()
Paolo Abeni [Tue, 18 Feb 2025 18:36:13 +0000 (19:36 +0100)] 
mptcp: drop __mptcp_fastopen_gen_msk_ackseq()

When we will move the whole RX path under the msk socket lock, updating
the already queued skb for passive fastopen socket at 3rd ack time will
be extremely painful and race prone

The map_seq for already enqueued skbs is used only to allow correct
coalescing with later data; preventing collapsing to the first skb of
a fastopen connect we can completely remove the
__mptcp_fastopen_gen_msk_ackseq() helper.

Before dropping this helper, a new item had to be added to the
mptcp_skb_cb structure. Because this item will be frequently tested in
the fast path -- almost on every packet -- and because there is free
space there, a single byte is used instead of a bitfield. This micro
optimisation slightly reduces the number of CPU operations to do the
associated check.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-2-4a47d90d7998@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agomptcp: consolidate subflow cleanup
Paolo Abeni [Tue, 18 Feb 2025 18:36:12 +0000 (19:36 +0100)] 
mptcp: consolidate subflow cleanup

Consolidate all the cleanup actions requiring the worker in a single
helper and ensure the dummy data fin creation for fallback socket is
performed only when the tcp rx queue is empty.

There are no functional changes intended, but this will simplify the
next patch, when the tcp rx queue spooling could be delayed at release_cb
time.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-1-4a47d90d7998@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonfc: hci: Remove unused nfc_llc_unregister
Dr. David Alan Gilbert [Wed, 19 Feb 2025 02:02:58 +0000 (02:02 +0000)] 
nfc: hci: Remove unused nfc_llc_unregister

nfc_llc_unregister() has been unused since it was added in 2012's
commit 67cccfe17d1b ("NFC: Add an LLC Core layer to HCI")

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250219020258.297995-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests: net: Fix minor typos in MPTCP and psock tests
Suchit [Tue, 18 Feb 2025 16:59:23 +0000 (22:29 +0530)] 
selftests: net: Fix minor typos in MPTCP and psock tests

Fixes minor spelling errors:
- `simult_flows.sh`: "al testcases" -> "all testcases"
- `psock_tpacket.c`: "accross" -> "across"

Signed-off-by: Suchit Karunakaran <suchitkarunakaran@gmail.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250218165923.20740-1-suchitkarunakaran@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'net-core-improvements-to-device-lookup-by-hardware-address'
Jakub Kicinski [Thu, 20 Feb 2025 03:01:00 +0000 (19:01 -0800)] 
Merge branch 'net-core-improvements-to-device-lookup-by-hardware-address'

Breno Leitao says:

====================
net: core: improvements to device lookup by hardware address.

The first patch adds a new dev_getbyhwaddr() helper function for
finding devices by hardware address when the rtnl lock is held. This
prevents PROVE_LOCKING warnings that occurred when rtnl lock was held
but the RCU read lock wasn't. The common address comparison logic is
extracted into dev_comp_addr() to avoid code duplication.

The second coverts arp_req_set_public() to the new helper.
====================

Link: https://patch.msgid.link/20250218-arm_fix_selftest-v5-0-d3d6892db9e1@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoarp: switch to dev_getbyhwaddr() in arp_req_set_public()
Breno Leitao [Tue, 18 Feb 2025 13:49:31 +0000 (05:49 -0800)] 
arp: switch to dev_getbyhwaddr() in arp_req_set_public()

The arp_req_set_public() function is called with the rtnl lock held,
which provides enough synchronization protection. This makes the RCU
variant of dev_getbyhwaddr() unnecessary. Switch to using the simpler
dev_getbyhwaddr() function since we already have the required rtnl
locking.

This change helps maintain consistency in the networking code by using
the appropriate helper function for the existing locking context.
Since we're not holding the RCU read lock in arp_req_set_public()
existing code could trigger false positive locking warnings.

Fixes: 941666c2e3e0 ("net: RCU conversion of dev_getbyhwaddr() and arp_ioctl()")
Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20250218-arm_fix_selftest-v5-2-d3d6892db9e1@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: Add non-RCU dev_getbyhwaddr() helper
Breno Leitao [Tue, 18 Feb 2025 13:49:30 +0000 (05:49 -0800)] 
net: Add non-RCU dev_getbyhwaddr() helper

Add dedicated helper for finding devices by hardware address when
holding rtnl_lock, similar to existing dev_getbyhwaddr_rcu(). This prevents
PROVE_LOCKING warnings when rtnl_lock is held but RCU read lock is not.

Extract common address comparison logic into dev_addr_cmp().

The context about this change could be found in the following
discussion:

Link: https://lore.kernel.org/all/20250206-scarlet-ermine-of-improvement-1fcac5@leitao/
Cc: kuniyu@amazon.com
Cc: ushankar@purestorage.com
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250218-arm_fix_selftest-v5-1-d3d6892db9e1@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'net-stmmac-further-cleanups'
Jakub Kicinski [Thu, 20 Feb 2025 02:57:31 +0000 (18:57 -0800)] 
Merge branch 'net-stmmac-further-cleanups'

Russell King says:

====================
net: stmmac: further cleanups

This small series does further cleanups to the stmmac driver:

1. Name priv->pause to indicate that it's a timeout and clarify the
   units of the "pause" module parameter
2. Remove useless priv->flow_ctrl member and deprecate the useless
  "flow_ctrl" module parameter
3. Fix long-standing signed-ness issue with "speed" passed around the
   driver from the mac_link_up method.
====================

Link: https://patch.msgid.link/Z7Rf2daOaf778TOg@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: "speed" passed to fix_mac_speed is an int
Russell King (Oracle) [Tue, 18 Feb 2025 10:24:39 +0000 (10:24 +0000)] 
net: stmmac: "speed" passed to fix_mac_speed is an int

priv->plat->fix_mac_speed() is called from stmmac_mac_link_up(), which
is passed the speed as an "int". However, fix_mac_speed() implicitly
casts this to an unsigned int. Some platform glue code print this value
using %u, others with %d. Some implicitly cast it back to an int, and
others to u32.

Good practice is to use one type and only one type to represent a value
being passed around a driver.

Switch all of these over to consistently use "int" when dealing with a
speed passed from stmmac_mac_link_up(), even though the speed will
always be positive.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Chen-Yu Tsai <wens@csie.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp>
Link: https://patch.msgid.link/E1tkKmN-004ObM-Ge@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: remove useless priv->flow_ctrl
Russell King (Oracle) [Tue, 18 Feb 2025 10:24:34 +0000 (10:24 +0000)] 
net: stmmac: remove useless priv->flow_ctrl

priv->flow_ctrl is only accessed by stmmac_main.c, and the only place
that it is read is in stmmac_mac_flow_ctrl(). This function is only
called from stmmac_mac_link_up() which always sets priv->flow_ctrl
immediately before calling this function.

Therefore, initialising this at probe time is ineffectual because it
will always be overwritten before it's read. As such, the "flow_ctrl"
module parameter has been useless for some time. Rather than remove
the module parameter, which would risk module load failure, change the
description to indicate that it is obsolete, and warn if it is set by
userspace.

Moreover, storing the value in the stmmac_priv has no benefit as it's
set and then immediately read stmmac_mac_flow_ctrl(). Instead, pass it
as a parameter to this function..

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp>
Link: https://patch.msgid.link/E1tkKmI-004ObG-DL@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: stmmac: clarify priv->pause and pause module parameter
Russell King (Oracle) [Tue, 18 Feb 2025 10:24:29 +0000 (10:24 +0000)] 
net: stmmac: clarify priv->pause and pause module parameter

priv->pause corresponds with "pause_time" in the 802.3 specification,
and is also called "pause_time" in the various MAC backends. For
consistency, use the same name in the core stmmac code.

Clarify the units of the "pause" module parameter which sets up this
struct member to indicate that it's in units of the pause_quanta
defined by 802.3, which is 512 bit times.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Link: https://patch.msgid.link/E1tkKmD-004ObA-9K@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agosctp: Fix undefined behavior in left shift operation
Yu-Chun Lin [Tue, 18 Feb 2025 08:12:16 +0000 (16:12 +0800)] 
sctp: Fix undefined behavior in left shift operation

According to the C11 standard (ISO/IEC 9899:2011, 6.5.7):
"If E1 has a signed type and E1 x 2^E2 is not representable in the result
type, the behavior is undefined."

Shifting 1 << 31 causes signed integer overflow, which leads to undefined
behavior.

Fix this by explicitly using '1U << 31' to ensure the shift operates on
an unsigned type, avoiding undefined behavior.

Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
Link: https://patch.msgid.link/20250218081217.3468369-1-eleanor15x@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'flow_dissector-fix-handling-of-mixed-port-and-port-range-keys'
Jakub Kicinski [Thu, 20 Feb 2025 02:55:32 +0000 (18:55 -0800)] 
Merge branch 'flow_dissector-fix-handling-of-mixed-port-and-port-range-keys'

Cong Wang says:

====================
flow_dissector: Fix handling of mixed port and port-range keys

This patchset contains two fixes for flow_dissector handling of mixed
port and port-range keys, for both tc-flower case and bpf case. Each
of them also comes with a selftest.
====================

Link: https://patch.msgid.link/20250218043210.732959-1-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests/bpf: Add a specific dst port matching
Cong Wang [Tue, 18 Feb 2025 04:32:10 +0000 (20:32 -0800)] 
selftests/bpf: Add a specific dst port matching

After this patch:

 #102/1   flow_dissector_classification/ipv4:OK
 #102/2   flow_dissector_classification/ipv4_continue_dissect:OK
 #102/3   flow_dissector_classification/ipip:OK
 #102/4   flow_dissector_classification/gre:OK
 #102/5   flow_dissector_classification/port_range:OK
 #102/6   flow_dissector_classification/ipv6:OK
 #102     flow_dissector_classification:OK
 Summary: 1/6 PASSED, 0 SKIPPED, 0 FAILED

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Link: https://patch.msgid.link/20250218043210.732959-5-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoflow_dissector: Fix port range key handling in BPF conversion
Cong Wang [Tue, 18 Feb 2025 04:32:09 +0000 (20:32 -0800)] 
flow_dissector: Fix port range key handling in BPF conversion

Fix how port range keys are handled in __skb_flow_bpf_to_target() by:
- Separating PORTS and PORTS_RANGE key handling
- Using correct key_ports_range structure for range keys
- Properly initializing both key types independently

This ensures port range information is correctly stored in its dedicated
structure rather than incorrectly using the regular ports key structure.

Fixes: 59fb9b62fb6c ("flow_dissector: Fix to use new variables for port ranges in bpf hook")
Reported-by: Qiang Zhang <dtzq01@gmail.com>
Closes: https://lore.kernel.org/netdev/CAPx+-5uvFxkhkz4=j_Xuwkezjn9U6kzKTD5jz4tZ9msSJ0fOJA@mail.gmail.com/
Cc: Yoshiki Komachi <komachi.yoshiki@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Link: https://patch.msgid.link/20250218043210.732959-4-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoselftests/net/forwarding: Add a test case for tc-flower of mixed port and port-range
Cong Wang [Tue, 18 Feb 2025 04:32:08 +0000 (20:32 -0800)] 
selftests/net/forwarding: Add a test case for tc-flower of mixed port and port-range

After this patch:

 # ./tc_flower_port_range.sh
 TEST: Port range matching - IPv4 UDP                                [ OK ]
 TEST: Port range matching - IPv4 TCP                                [ OK ]
 TEST: Port range matching - IPv6 UDP                                [ OK ]
 TEST: Port range matching - IPv6 TCP                                [ OK ]
 TEST: Port range matching - IPv4 UDP Drop                           [ OK ]

Cc: Qiang Zhang <dtzq01@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250218043210.732959-3-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoflow_dissector: Fix handling of mixed port and port-range keys
Cong Wang [Tue, 18 Feb 2025 04:32:07 +0000 (20:32 -0800)] 
flow_dissector: Fix handling of mixed port and port-range keys

This patch fixes a bug in TC flower filter where rules combining a
specific destination port with a source port range weren't working
correctly.

The specific case was when users tried to configure rules like:

tc filter add dev ens38 ingress protocol ip flower ip_proto udp \
dst_port 5000 src_port 2000-3000 action drop

The root cause was in the flow dissector code. While both
FLOW_DISSECTOR_KEY_PORTS and FLOW_DISSECTOR_KEY_PORTS_RANGE flags
were being set correctly in the classifier, the __skb_flow_dissect_ports()
function was only populating one of them: whichever came first in
the enum check. This meant that when the code needed both a specific
port and a port range, one of them would be left as 0, causing the
filter to not match packets as expected.

Fix it by removing the either/or logic and instead checking and
populating both key types independently when they're in use.

Fixes: 8ffb055beae5 ("cls_flower: Fix the behavior using port ranges with hw-offload")
Reported-by: Qiang Zhang <dtzq01@gmail.com>
Closes: https://lore.kernel.org/netdev/CAPx+-5uvFxkhkz4=j_Xuwkezjn9U6kzKTD5jz4tZ9msSJ0fOJA@mail.gmail.com/
Cc: Yoshiki Komachi <komachi.yoshiki@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250218043210.732959-2-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agonet: mana: Add debug logs in MANA network driver
Erni Sri Satya Vennela [Tue, 18 Feb 2025 01:34:15 +0000 (17:34 -0800)] 
net: mana: Add debug logs in MANA network driver

Add more logs to assist in debugging and monitoring
driver behaviour, making it easier to identify potential
issues  during development and testing.

Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1739842455-23899-1-git-send-email-ernis@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agoMerge branch 'gtp-geneve-suppress-list_del-splat-during-exit_batch_rtnl'
Jakub Kicinski [Thu, 20 Feb 2025 02:49:30 +0000 (18:49 -0800)] 
Merge branch 'gtp-geneve-suppress-list_del-splat-during-exit_batch_rtnl'

Kuniyuki Iwashima says:

====================
gtp/geneve: Suppress list_del() splat during ->exit_batch_rtnl().

The common pattern in tunnel device's ->exit_batch_rtnl() is iterating
two netdev lists for each netns: (i) for_each_netdev() to clean up
devices in the netns, and (ii) the device type specific list to clean
up devices in other netns.

list_for_each_entry(net, net_list, exit_list) {
for_each_netdev_safe(net, dev, next) {
/* (i)  call unregister_netdevice_queue(dev, list) */
}

list_for_each_entry_safe(xxx, xxx_next, &net->yyy, zzz) {
/* (ii) call unregister_netdevice_queue(xxx->dev, list) */
}
}

Then, ->exit_batch_rtnl() could touch the same device twice.

Say we have two netns A & B and device B that is created in netns A and
moved to netns B.

  1. cleanup_net() processes netns A and then B.

  2. ->exit_batch_rtnl() finds the device B while iterating netns A's (ii)

  [ device B is not yet unlinked from netns B as
    unregister_netdevice_many() has not been called. ]

  3. ->exit_batch_rtnl() finds the device B while iterating netns B's (i)

gtp and geneve calls ->dellink() at 2. and 3. that calls list_del() for (ii)
and unregister_netdevice_queue().

Calling unregister_netdevice_queue() twice is fine because it uses
list_move_tail(), but the 2nd list_del() triggers a splat when
CONFIG_DEBUG_LIST is enabled.

Possible solution is either of

 (a) Use list_del_init() in ->dellink()

 (b) Iterate dev with empty ->unreg_list for (i) like

#define for_each_netdev_alive(net, d) \
list_for_each_entry(d, &(net)->dev_base_head, dev_list) \
if (list_empty(&d->unreg_list))

 (c) Remove (i) and delegate it to default_device_exit_batch().

This series avoids the 2nd ->dellink() by (c) to suppress the splat for
gtp and geneve.

Note that IPv4/IPv6 tunnels calls just unregister_netdevice() during
->exit_batch_rtnl() and dev is unlinked from (ii) later in ->ndo_uninit(),
so they are safe.

Also, pfcp has the same pattern but is safe because
unregister_netdevice_many() is called for each netns.
====================

Link: https://patch.msgid.link/20250217203705.40342-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agogeneve: Suppress list corruption splat in geneve_destroy_tunnels().
Kuniyuki Iwashima [Mon, 17 Feb 2025 20:37:05 +0000 (12:37 -0800)] 
geneve: Suppress list corruption splat in geneve_destroy_tunnels().

As explained in the previous patch, iterating for_each_netdev() and
gn->geneve_list during ->exit_batch_rtnl() could trigger ->dellink()
twice for the same device.

If CONFIG_DEBUG_LIST is enabled, we will see a list_del() corruption
splat in the 2nd call of geneve_dellink().

Let's remove for_each_netdev() in geneve_destroy_tunnels() and delegate
that part to default_device_exit_batch().

Fixes: 9593172d93b9 ("geneve: Fix use-after-free in geneve_find_dev().")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250217203705.40342-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 months agogtp: Suppress list corruption splat in gtp_net_exit_batch_rtnl().
Kuniyuki Iwashima [Mon, 17 Feb 2025 20:37:04 +0000 (12:37 -0800)] 
gtp: Suppress list corruption splat in gtp_net_exit_batch_rtnl().

Brad Spengler reported the list_del() corruption splat in
gtp_net_exit_batch_rtnl(). [0]

Commit eb28fd76c0a0 ("gtp: Destroy device along with udp socket's netns
dismantle.") added the for_each_netdev() loop in gtp_net_exit_batch_rtnl()
to destroy devices in each netns as done in geneve and ip tunnels.

However, this could trigger ->dellink() twice for the same device during
->exit_batch_rtnl().

Say we have two netns A & B and gtp device B that resides in netns B but
whose UDP socket is in netns A.

  1. cleanup_net() processes netns A and then B.

  2. gtp_net_exit_batch_rtnl() finds the device B while iterating
     netns A's gn->gtp_dev_list and calls ->dellink().

  [ device B is not yet unlinked from netns B
    as unregister_netdevice_many() has not been called. ]

  3. gtp_net_exit_batch_rtnl() finds the device B while iterating
     netns B's for_each_netdev() and calls ->dellink().

gtp_dellink() cleans up the device's hash table, unlinks the dev from
gn->gtp_dev_list, and calls unregister_netdevice_queue().

Basically, calling gtp_dellink() multiple times is fine unless
CONFIG_DEBUG_LIST is enabled.

Let's remove for_each_netdev() in gtp_net_exit_batch_rtnl() and
delegate the destruction to default_device_exit_batch() as done
in bareudp.

[0]:
list_del corruption, ffff8880aaa62c00->next (autoslab_size_M_dev_P_net_core_dev_11127_8_1328_8_S_4096_A_64_n_139+0xc00/0x1000 [slab object]) is LIST_POISON1 (ffffffffffffff02) (prev is 0xffffffffffffff04)
kernel BUG at lib/list_debug.c:58!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 UID: 0 PID: 1804 Comm: kworker/u8:7 Tainted: G                T   6.12.13-grsec-full-20250211091339 #1
Tainted: [T]=RANDSTRUCT
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Workqueue: netns cleanup_net
RIP: 0010:[<ffffffff84947381>] __list_del_entry_valid_or_report+0x141/0x200 lib/list_debug.c:58
Code: c2 76 91 31 c0 e8 9f b1 f7 fc 0f 0b 4d 89 f0 48 c7 c1 02 ff ff ff 48 89 ea 48 89 ee 48 c7 c7 e0 c2 76 91 31 c0 e8 7f b1 f7 fc <0f> 0b 4d 89 e8 48 c7 c1 04 ff ff ff 48 89 ea 48 89 ee 48 c7 c7 60
RSP: 0018:fffffe8040b4fbd0 EFLAGS: 00010283
RAX: 00000000000000cc RBX: dffffc0000000000 RCX: ffffffff818c4054
RDX: ffffffff84947381 RSI: ffffffff818d1512 RDI: 0000000000000000
RBP: ffff8880aaa62c00 R08: 0000000000000001 R09: fffffbd008169f32
R10: fffffe8040b4f997 R11: 0000000000000001 R12: a1988d84f24943e4
R13: ffffffffffffff02 R14: ffffffffffffff04 R15: ffff8880aaa62c08
RBX: kasan shadow of 0x0
RCX: __wake_up_klogd.part.0+0x74/0xe0 kernel/printk/printk.c:4554
RDX: __list_del_entry_valid_or_report+0x141/0x200 lib/list_debug.c:58
RSI: vprintk+0x72/0x100 kernel/printk/printk_safe.c:71
RBP: autoslab_size_M_dev_P_net_core_dev_11127_8_1328_8_S_4096_A_64_n_139+0xc00/0x1000 [slab object]
RSP: process kstack fffffe8040b4fbd0+0x7bd0/0x8000 [kworker/u8:7+netns 1804 ]
R09: kasan shadow of process kstack fffffe8040b4f990+0x7990/0x8000 [kworker/u8:7+netns 1804 ]
R10: process kstack fffffe8040b4f997+0x7997/0x8000 [kworker/u8:7+netns 1804 ]
R15: autoslab_size_M_dev_P_net_core_dev_11127_8_1328_8_S_4096_A_64_n_139+0xc08/0x1000 [slab object]
FS:  0000000000000000(0000) GS:ffff888116000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000748f5372c000 CR3: 0000000015408000 CR4: 00000000003406f0 shadow CR4: 00000000003406f0
Stack:
 0000000000000000 ffffffff8a0c35e7 ffffffff8a0c3603 ffff8880aaa62c00
 ffff8880aaa62c00 0000000000000004 ffff88811145311c 0000000000000005
 0000000000000001 ffff8880aaa62000 fffffe8040b4fd40 ffffffff8a0c360d
Call Trace:
 <TASK>
 [<ffffffff8a0c360d>] __list_del_entry_valid include/linux/list.h:131 [inline] fffffe8040b4fc28
 [<ffffffff8a0c360d>] __list_del_entry include/linux/list.h:248 [inline] fffffe8040b4fc28
 [<ffffffff8a0c360d>] list_del include/linux/list.h:262 [inline] fffffe8040b4fc28
 [<ffffffff8a0c360d>] gtp_dellink+0x16d/0x360 drivers/net/gtp.c:1557 fffffe8040b4fc28
 [<ffffffff8a0d0404>] gtp_net_exit_batch_rtnl+0x124/0x2c0 drivers/net/gtp.c:2495 fffffe8040b4fc88
 [<ffffffff8e705b24>] cleanup_net+0x5a4/0xbe0 net/core/net_namespace.c:635 fffffe8040b4fcd0
 [<ffffffff81754c97>] process_one_work+0xbd7/0x2160 kernel/workqueue.c:3326 fffffe8040b4fd88
 [<ffffffff81757195>] process_scheduled_works kernel/workqueue.c:3407 [inline] fffffe8040b4fec0
 [<ffffffff81757195>] worker_thread+0x6b5/0xfa0 kernel/workqueue.c:3488 fffffe8040b4fec0
 [<ffffffff817782a0>] kthread+0x360/0x4c0 kernel/kthread.c:397 fffffe8040b4ff78
 [<ffffffff814d8594>] ret_from_fork+0x74/0xe0 arch/x86/kernel/process.c:172 fffffe8040b4ffb8
 [<ffffffff8110f509>] ret_from_fork_asm+0x29/0xc0 arch/x86/entry/entry_64.S:399 fffffe8040b4ffe8
 </TASK>
Modules linked in:

Fixes: eb28fd76c0a0 ("gtp: Destroy device along with udp socket's netns dismantle.")
Reported-by: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250217203705.40342-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>