git.ipfire.org Git - thirdparty/kernel/linux.git/log
5 months ago devlink: Improve the port attributes description
Parav Pandit [Tue, 24 Dec 2024 18:37:06 +0000 (20:37 +0200)] 
devlink: Improve the port attributes description

The current PF number description is vague and is sometimes interpreted as
some PF index. The VF number in the PCI specification starts at 1; however,
in the kernel it starts at 0 for the representor model.

Improve the description of devlink port attributes PF, VF and SF
numbers with these details.

Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/20241224183706.26571-1-parav@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago ptp: ocp: constify 'struct bin_attribute'
Thomas Weißschuh [Sun, 22 Dec 2024 20:08:20 +0000 (21:08 +0100)] 
ptp: ocp: constify 'struct bin_attribute'

The sysfs core now allows instances of 'struct bin_attribute' to be
moved into read-only memory. Make use of that to protect them against
accidental or malicious modifications.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20241222-sysfs-const-bin_attr-ptp-v1-1-5c1f3ee246fb@weissschuh.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago net: phy: fix phy_disable_eee
Heiner Kallweit [Fri, 20 Dec 2024 22:02:06 +0000 (23:02 +0100)] 
net: phy: fix phy_disable_eee

genphy_c45_write_eee_adv() becomes a no-op if phydev->supported_eee
is cleared. That's not what we want because this function is still
needed to clear the EEE advertisement register(s).
Fill phydev->eee_broken_modes instead to ensure that userspace
can't re-enable EEE advertising.

Fixes: b55498ff14bd ("net: phy: add phy_disable_eee")
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/57e2ae5f-4319-413c-b5c4-ebc8d049bc23@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago Merge branch 'net-lan969x-add-rgmii-support'
Jakub Kicinski [Mon, 23 Dec 2024 18:58:07 +0000 (10:58 -0800)] 
Merge branch 'net-lan969x-add-rgmii-support'

Daniel Machon says:

====================
net: lan969x: add RGMII support

== Description:

This series is the fourth of a multi-part series that prepares and adds
support for the new lan969x switch driver.

The upstreaming effort is split into multiple series (might change a
bit as we go along):

        1) Prepare the Sparx5 driver for lan969x (merged)

        2) Add support for lan969x (same basic features as Sparx5
           provides excl. FDMA and VCAP, merged).

        3) Add lan969x VCAP functionality (merged).

    --> 4) Add RGMII support.

        5) Add FDMA support.

== RGMII support:

The lan969x switch device includes two RGMII port interfaces (port 28
and 29) supporting data speeds of 1 Gbps, 100 Mbps and 10 Mbps.

== Patch breakdown:

Patch #1 does some preparation work.

Patch #2 adds new function: is_port_rgmii() to the match data ops.

Patch #3 uses the is_port_rgmii() in a number of places.

Patch #4 makes sure that we do not configure an RGMII device as a
         low-speed device, when doing a port config.

Patch #5 makes sure we only return the PCS if the port mode requires
         it.

Patch #6 adds checks for RGMII PHY modes in sparx5_verify_speeds().

Patch #7 adds registers required to configure RGMII.

Patch #8 adds RGMII implementation.

Patch #9 documents RGMII delays in the dt-bindings.

Details are in the commit descriptions of the individual patches.

v4: https://lore.kernel.org/20241213-sparx5-lan969x-switch-driver-4-v4-0-d1a72c9c4714@microchip.com
v3: https://lore.kernel.org/20241118-sparx5-lan969x-switch-driver-4-v3-0-3cefee5e7e3a@microchip.com
v2: https://lore.kernel.org/20241113-sparx5-lan969x-switch-driver-4-v2-0-0db98ac096d1@microchip.com
v1: https://lore.kernel.org/20241106-sparx5-lan969x-switch-driver-4-v1-0-f7f7316436bd@microchip.com
====================

Link: https://patch.msgid.link/20241220-sparx5-lan969x-switch-driver-4-v5-0-fa8ba5dff732@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago dt-bindings: net: sparx5: document RGMII delays
Daniel Machon [Fri, 20 Dec 2024 13:48:48 +0000 (14:48 +0100)] 
dt-bindings: net: sparx5: document RGMII delays

The lan969x switch device supports two RGMII port interfaces that can be
configured for MAC level rx and tx delays. Document two new properties
{rx,tx}-internal-delay-ps in the bindings, used to select these delays.

Tested-by: Robert Marko <robert.marko@sartura.hr>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241220-sparx5-lan969x-switch-driver-4-v5-9-fa8ba5dff732@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago net: lan969x: add RGMII implementation
Daniel Machon [Fri, 20 Dec 2024 13:48:47 +0000 (14:48 +0100)] 
net: lan969x: add RGMII implementation

The lan969x switch device includes two RGMII port interfaces (port 28
and 29) supporting data speeds of 1 Gbps, 100 Mbps and 10 Mbps. MAC
level delays are configurable through the HSIO_WRAP target by choosing
a phase shift selector that corresponds to a certain time delay in
nanoseconds.

Add new file: lan969x_rgmii.c that contains the implementation for
configuring the RGMII port devices. MAC level delays are configured
using the "{rx,tx}-internal-delay-ps" properties. These properties must
be specified independently of the phy-mode. If missing, or set to zero,
the MAC will not apply any delay.

Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Tested-by: Robert Marko <robert.marko@sartura.hr>
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20241220-sparx5-lan969x-switch-driver-4-v5-8-fa8ba5dff732@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago net: lan969x: add RGMII registers
Daniel Machon [Fri, 20 Dec 2024 13:48:46 +0000 (14:48 +0100)] 
net: lan969x: add RGMII registers

Configuration of RGMII is done by configuring the GPIO and clock
settings in the HSIOWRAP target, and configuring the RGMII port devices
in the DEVRGMII target. Both targets contain registers replicated for
the number of RGMII port devices, which is two.

Add said targets and register macros required to configure RGMII.

Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Tested-by: Robert Marko <robert.marko@sartura.hr>
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20241220-sparx5-lan969x-switch-driver-4-v5-7-fa8ba5dff732@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago net: sparx5: verify RGMII speeds
Daniel Machon [Fri, 20 Dec 2024 13:48:45 +0000 (14:48 +0100)] 
net: sparx5: verify RGMII speeds

When doing a port config, we verify the port speed against the PHY mode
and supported speeds of that PHY mode. Add checks for the four RGMII phy
modes: RGMII, RGMII_ID, RGMII_TXID and RGMII_RXID.

Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Tested-by: Robert Marko <robert.marko@sartura.hr>
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20241220-sparx5-lan969x-switch-driver-4-v5-6-fa8ba5dff732@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago net: sparx5: only return PCS for modes that require it
Daniel Machon [Fri, 20 Dec 2024 13:48:44 +0000 (14:48 +0100)] 
net: sparx5: only return PCS for modes that require it

The RGMII ports have no PCS to configure. Make sure we only return the
PCS for port modes that require it.

Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Tested-by: Robert Marko <robert.marko@sartura.hr>
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20241220-sparx5-lan969x-switch-driver-4-v5-5-fa8ba5dff732@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago net: sparx5: skip low-speed configuration when port is RGMII
Daniel Machon [Fri, 20 Dec 2024 13:48:43 +0000 (14:48 +0100)] 
net: sparx5: skip low-speed configuration when port is RGMII

When doing a port config, we configure low-speed port devices, among
other things. We have a check to ensure that the device is indeed a
low-speed device, and not a high-speed device. Add an additional check
to ensure that the device is not an RGMII device.

Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Tested-by: Robert Marko <robert.marko@sartura.hr>
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20241220-sparx5-lan969x-switch-driver-4-v5-4-fa8ba5dff732@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago net: sparx5: use is_port_rgmii() throughout
Daniel Machon [Fri, 20 Dec 2024 13:48:42 +0000 (14:48 +0100)] 
net: sparx5: use is_port_rgmii() throughout

Now that we can check if a given port is an RGMII port, use it in the
following cases:

 - To set RGMII PHY modes for RGMII port devices.

 - To avoid checking for a SerDes node in the devicetree, when the port
   is an RGMII port.

 - To bail out of sparx5_port_init() when the common configuration is
   done.

Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Tested-by: Robert Marko <robert.marko@sartura.hr>
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20241220-sparx5-lan969x-switch-driver-4-v5-3-fa8ba5dff732@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago net: sparx5: add function for RGMII port check
Daniel Machon [Fri, 20 Dec 2024 13:48:41 +0000 (14:48 +0100)] 
net: sparx5: add function for RGMII port check

The lan969x device contains two RGMII port interfaces, sitting at port
28 and 29. Add function: is_port_rgmii() to the match data ops, that
checks if a given port is an RGMII port or not. For Sparx5, this
function always returns false.

Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Tested-by: Robert Marko <robert.marko@sartura.hr>
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20241220-sparx5-lan969x-switch-driver-4-v5-2-fa8ba5dff732@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
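
The per-chip check described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the driver's code: `struct chip_ops`, `port_is_rgmii()` and the two ops instances are hypothetical names, and only the port numbers 28 and 29 and the "always false for Sparx5" behaviour come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for the driver's per-chip ops table. */
struct chip_ops {
	bool (*is_port_rgmii)(int portno);
};

/* Sparx5 has no RGMII ports, so the check always returns false. */
static bool sparx5_is_port_rgmii(int portno)
{
	(void)portno;
	return false;
}

/* lan969x: the two RGMII interfaces sit at ports 28 and 29. */
static bool lan969x_is_port_rgmii(int portno)
{
	return portno == 28 || portno == 29;
}

static const struct chip_ops sparx5_ops = {
	.is_port_rgmii = sparx5_is_port_rgmii,
};

static const struct chip_ops lan969x_ops = {
	.is_port_rgmii = lan969x_is_port_rgmii,
};

/* Common code asks the ops table instead of hard-coding port numbers. */
static bool port_is_rgmii(const struct chip_ops *ops, int portno)
{
	assert(ops != NULL && ops->is_port_rgmii != NULL);
	return ops->is_port_rgmii(portno);
}
```

Routing the question through an ops table keeps the shared port code free of chip-specific port numbers.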
5 months ago net: sparx5: do some preparation work
Daniel Machon [Fri, 20 Dec 2024 13:48:40 +0000 (14:48 +0100)] 
net: sparx5: do some preparation work

The sparx5_port_init() does initial configuration of a variety of
different features and options for each port. Some are shared for all
types of devices, some are not. As it is now, common configuration is
done after configuration of low-speed devices. This will not work when
adding RGMII support in a subsequent patch.

In preparation for lan969x RGMII support, move a block of code, that
configures 2g5 devices, down. This ensures that the configuration common
to all devices is done before configuration of 2g5, 5g, 10g and 25g
devices.

Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Tested-by: Robert Marko <robert.marko@sartura.hr>
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20241220-sparx5-lan969x-switch-driver-4-v5-1-fa8ba5dff732@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next...
Jakub Kicinski [Mon, 23 Dec 2024 18:46:49 +0000 (10:46 -0800)] 
Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
ixgbe, ixgbevf: Add support for Intel(R) E610 device

Piotr Kwapulinski says:

Add initial support for Intel(R) E610 Series of network devices. The E610
is based on X550 but adds firmware managed link, enhanced security
capabilities and support for updated server manageability.

* '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  ixgbevf: Add support for Intel(R) E610 device
  PCI: Add PCI_VDEVICE_SUB helper macro
  ixgbe: Enable link management in E610 device
  ixgbe: Clean up the E610 link management related code
  ixgbe: Add ixgbe_x540 multiple header inclusion protection
  ixgbe: Add support for EEPROM dump in E610 device
  ixgbe: Add support for NVM handling in E610 device
  ixgbe: Add link management support for E610 device
  ixgbe: Add support for E610 device capabilities detection
  ixgbe: Add support for E610 FW Admin Command Interface
====================

Link: https://patch.msgid.link/20241220201521.3363985-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago net: ethtool: Fix suspicious rcu_dereference usage
Kory Maincent [Fri, 20 Dec 2024 08:37:40 +0000 (09:37 +0100)] 
net: ethtool: Fix suspicious rcu_dereference usage

The __ethtool_get_ts_info function can be called with or without the
rtnl lock held. When the rtnl lock is not held, using rtnl_dereference()
triggers a warning due to the lack of lock context.

Add an rcu_read_lock() to ensure the lock is acquired and to maintain
synchronization.

Reported-by: syzbot+a344326c05c98ba19682@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/676147f8.050a0220.37aaf.0154.GAE@google.com/
Fixes: b9e3f7dc9ed9 ("net: ethtool: tsinfo: Enhance tsinfo to support several hwtstamp by net topology")
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241220083741.175329-1-kory.maincent@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago Merge branch 'eth-fbnic-support-basic-rss-config-and-setting-channel-count'
Jakub Kicinski [Mon, 23 Dec 2024 18:35:58 +0000 (10:35 -0800)] 
Merge branch 'eth-fbnic-support-basic-rss-config-and-setting-channel-count'

Jakub Kicinski says:

====================
eth: fbnic: support basic RSS config and setting channel count

Add support for basic RSS config (indirection table, key get and set),
and changing the number of channels.

  # ./ksft-net-drv/run_kselftest.sh -t drivers/net/hw:rss_ctx.py
  TAP version 13
  1..1
  # timeout set to 0
  # selftests: drivers/net/hw: rss_ctx.py
  # KTAP version 1
  # 1..15
  # ok 1 rss_ctx.test_rss_key_indir
  # ok 2 rss_ctx.test_rss_queue_reconfigure
  # ok 3 rss_ctx.test_rss_resize
  # ok 4 rss_ctx.test_hitless_key_update

  .. the rest of the tests are for additional contexts so they
  get skipped..

The slicing of the patches (and bugs) are mine, but I'm keeping
Alex as the author on the patches where he wrote 100% of the code.
====================

Link: https://patch.msgid.link/20241220025241.1522781-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago eth: fbnic: support ring channel set while up
Jakub Kicinski [Fri, 20 Dec 2024 02:52:41 +0000 (18:52 -0800)] 
eth: fbnic: support ring channel set while up

Implement the channel count changes. Copy the netdev priv,
allocate new channels using it. Stop, swap, start.
Then free the copy of the priv along with the channels it
holds, which are now the channels that used to be on the
real priv.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/20241220025241.1522781-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
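
The "copy, allocate, stop, swap, start, free" sequence from the commit above can be illustrated with a small userspace sketch. Everything here is an assumption made for illustration: `struct priv`, `dev_stop()`, `dev_start()` and `set_channels_live()` are hypothetical names, and an `int` array stands in for the real per-channel resources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, heavily reduced stand-in for the netdev private data. */
struct priv {
	int num_channels;
	int *channels;   /* placeholder for per-channel resources */
	bool running;
};

static void dev_stop(struct priv *p)  { p->running = false; }
static void dev_start(struct priv *p) { p->running = true; }

/*
 * Change the channel count while the device is up.
 * Returns 0 on success, -1 if allocation fails (device left untouched).
 */
static int set_channels_live(struct priv *real, int new_count)
{
	struct priv clone;
	int *old;

	assert(real != NULL && new_count > 0);

	/* Copy the priv and allocate the new channels on the copy. */
	clone = *real;
	clone.num_channels = new_count;
	clone.channels = calloc((size_t)new_count, sizeof(*clone.channels));
	if (!clone.channels)
		return -1;

	/* Stop, swap, start. */
	dev_stop(real);
	old = real->channels;
	real->channels = clone.channels;
	real->num_channels = clone.num_channels;
	clone.channels = old;
	dev_start(real);

	/* The copy now holds the channels that used to be on the real priv. */
	free(clone.channels);
	return 0;
}
```

Because all allocation happens before the stop, a failure leaves the running device exactly as it was.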
5 months ago eth: fbnic: support ring channel get and set while down
Jakub Kicinski [Fri, 20 Dec 2024 02:52:40 +0000 (18:52 -0800)] 
eth: fbnic: support ring channel get and set while down

Trivial implementation of ethtool channel get and set. Set is only
supported when the device is closed; the next patch will add code for
live reconfig.

Asymmetric configurations are supported (combined + extra Tx or Rx),
so are configurations with independent IRQs for Rx and Tx.
Having all 3 NAPI types (combined, Tx, Rx) is not supported.

We used to only call fbnic_reset_indir_tbl() during init.
Now that we call it after the device has been registered, we must
be careful not to override user config.

Link: https://patch.msgid.link/20241220025241.1522781-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago eth: fbnic: centralize the queue count and NAPI<>queue setting
Alexander Duyck [Fri, 20 Dec 2024 02:52:39 +0000 (18:52 -0800)] 
eth: fbnic: centralize the queue count and NAPI<>queue setting

To simplify dealing with RTNL_ASSERT() requirements further
down the line, move setting queue count and NAPI<>queue
association to their own helpers.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://patch.msgid.link/20241220025241.1522781-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago eth: fbnic: add IRQ reuse support
Jakub Kicinski [Fri, 20 Dec 2024 02:52:38 +0000 (18:52 -0800)] 
eth: fbnic: add IRQ reuse support

Change our method of swapping NAPIs without disturbing existing config.
This is primarily needed for "live reconfiguration" such as changing
the channel count when interface is already up.

Previously we were planning to use a trick of using shared interrupts.
We would install a second IRQ handler for the new NAPI, and make it
return IRQ_NONE until we were ready for it to take over. This works fine
functionally but breaks IRQ naming. The IRQ subsystem uses the IRQ name
to create the procfs entry; since both handlers used the same name,
the second handler wouldn't get a proc directory registered.
When the first one gets removed on a successful ring count change,
it would remove its directory and we would be left with none.

New approach uses a double pointer to the NAPI. The IRQ handler needs
to know how to locate the NAPI to schedule. We register a single IRQ handler
and give it a pointer to a pointer. We can then change what it points to
without re-registering. This may have a tiny perf impact, but really
really negligible.

Link: https://patch.msgid.link/20241220025241.1522781-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
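
The double-pointer idea from the commit above can be shown in a few lines of plain C. This is only a sketch under stated assumptions: `struct napi` is a hypothetical stand-in that counts schedule events, and `irq_handler()` and `swap_napi()` are illustrative names, not the driver's functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal NAPI stand-in: counts how often it was scheduled. */
struct napi {
	int scheduled;
};

/*
 * The handler is registered once with a pointer to a pointer. It looks up
 * the current NAPI on every interrupt, at the cost of one extra dereference.
 */
static void irq_handler(void *data)
{
	struct napi **napi_slot = data;

	assert(napi_slot != NULL && *napi_slot != NULL);
	(*napi_slot)->scheduled++;
}

/* Retarget the already-registered handler without re-registering it. */
static void swap_napi(struct napi **napi_slot, struct napi *new_napi)
{
	*napi_slot = new_napi;
}
```

Since the handler registration never changes, the IRQ keeps its name and its procfs entry across the swap.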
5 months ago eth: fbnic: store NAPIs in an array instead of the list
Jakub Kicinski [Fri, 20 Dec 2024 02:52:37 +0000 (18:52 -0800)] 
eth: fbnic: store NAPIs in an array instead of the list

We will need an array for storing NAPIs in the upcoming IRQ handler
reuse rework. Replace the current list we have, so that we are able
to reuse it later.

In a few places replace i as the iterator with t when we iterate
over triads; this seems slightly less confusing than having
i, j, k variables.

Link: https://patch.msgid.link/20241220025241.1522781-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago eth: fbnic: let user control the RSS hash fields
Alexander Duyck [Fri, 20 Dec 2024 02:52:36 +0000 (18:52 -0800)] 
eth: fbnic: let user control the RSS hash fields

Support setting the fields over which RSS computes its hash.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://patch.msgid.link/20241220025241.1522781-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago eth: fbnic: support setting RSS configuration
Alexander Duyck [Fri, 20 Dec 2024 02:52:35 +0000 (18:52 -0800)] 
eth: fbnic: support setting RSS configuration

Let the user program the RSS indirection table and the RSS key.
Straightforward implementation. Track the changes and don't bother
poking the HW if user asked for a config identical to what's already
programmed. The device only supports Toeplitz hash.

Similarly to the GET support - all the real code that does the programming
was part of initial driver submission, already.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://patch.msgid.link/20241220025241.1522781-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
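
The "don't poke the HW if the config is identical" behaviour described in the commit above might look roughly like this. It is a sketch only: the struct layout, the sizes `RSS_KEY_LEN` and `RSS_INDIR_LEN`, and the `hw_writes` counter that simulates hardware programming are all assumptions, not the fbnic implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RSS_KEY_LEN   40   /* hypothetical sizes, for illustration only */
#define RSS_INDIR_LEN 8

struct rss_state {
	uint8_t key[RSS_KEY_LEN];
	uint32_t indir[RSS_INDIR_LEN];
	int hw_writes;   /* counts simulated hardware programming */
};

/*
 * Apply a new key and/or indirection table (NULL means "leave as is").
 * Returns true if the hardware had to be reprogrammed.
 */
static bool rss_set(struct rss_state *st, const uint8_t *key,
		    const uint32_t *indir)
{
	bool changed = false;

	assert(st != NULL);

	if (key && memcmp(st->key, key, sizeof(st->key)) != 0) {
		memcpy(st->key, key, sizeof(st->key));
		changed = true;
	}
	if (indir && memcmp(st->indir, indir, sizeof(st->indir)) != 0) {
		memcpy(st->indir, indir, sizeof(st->indir));
		changed = true;
	}

	/* Only touch the (simulated) hardware when something changed. */
	if (changed)
		st->hw_writes++;
	return changed;
}
```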
5 months ago eth: fbnic: don't reset the secondary RSS indir table
Jakub Kicinski [Fri, 20 Dec 2024 02:52:34 +0000 (18:52 -0800)] 
eth: fbnic: don't reset the secondary RSS indir table

Secondary RSS indirection table is for additional contexts.
It can / should be initialized when such context is created.
Since we don't support creating RSS contexts, yet, this change
has no user visible effect.

Link: https://patch.msgid.link/20241220025241.1522781-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago eth: fbnic: support querying RSS config
Alexander Duyck [Fri, 20 Dec 2024 02:52:33 +0000 (18:52 -0800)] 
eth: fbnic: support querying RSS config

The initial driver submission already added all the RSS state,
as part of multi-queue support. Expose the configuration via
the ethtool APIs.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/20241220025241.1522781-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago eth: fbnic: reorder ethtool code
Jakub Kicinski [Fri, 20 Dec 2024 02:52:32 +0000 (18:52 -0800)] 
eth: fbnic: reorder ethtool code

Define ethtool callback handlers in the order in which they are defined
in the ops struct. It doesn't really matter what the order is,
but it's good to have an order.

Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://patch.msgid.link/20241220025241.1522781-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago Merge branch 'mlx5-misc-changes-2024-12-19'
Jakub Kicinski [Mon, 23 Dec 2024 18:34:48 +0000 (10:34 -0800)] 
Merge branch 'mlx5-misc-changes-2024-12-19'

Tariq Toukan says:

====================
mlx5 misc changes 2024-12-19

The first two patches by Rongwei add support for multi-host LAG. The new
multi-host NICs provide each host with partial ports, allowing each host
to maintain its unique LAG configuration.

Patches 3-7 by Moshe, Mark and Yevgeny are enhancements and preparations
in fs_core and HW steering, in preparation for future patchsets.

Patches 8-9 by Itamar add SW Steering support for ConnectX-8. They are
moved here after being part of previous submissions, yet to be accepted.

Patch 10 by Carolina cleans up an unnecessary log message.

Patch 11 by Patrisious allows RDMA RX steering creation over devices
with IB link layer.
====================

Link: https://patch.msgid.link/20241219175841.1094544-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago net/mlx5: fs, Add support for RDMA RX steering over IB link layer
Patrisious Haddad [Thu, 19 Dec 2024 17:58:41 +0000 (19:58 +0200)] 
net/mlx5: fs, Add support for RDMA RX steering over IB link layer

Relax the capability check for creating the RDMA RX steering domain
by considering only the capabilities reported by the firmware
as necessary for its creation, which in turn allows RDMA RX creation
over devices with IB link layer as well.

The table_miss_action_domain capability is required only for a specific
priority, which is handled in mlx5_rdma_enable_roce_steering().
The additional capability check for this case is already in place.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241219175841.1094544-12-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago net/mlx5: Remove PTM support log message
Carolina Jubran [Thu, 19 Dec 2024 17:58:40 +0000 (19:58 +0200)] 
net/mlx5: Remove PTM support log message

The absence of Precision Time Measurement support should not emit a
message, as it can be misleading in contexts where PTM is not required.

Remove the log message indicating the lack of PCIe PTM support.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241219175841.1094544-11-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago net/mlx5: DR, add support for ConnectX-8 steering
Itamar Gozlan [Thu, 19 Dec 2024 17:58:39 +0000 (19:58 +0200)] 
net/mlx5: DR, add support for ConnectX-8 steering

Add support for a new steering format version that is implemented by
ConnectX-8.
Except for several differences, the STEv3 is identical to STEv2, so
for most callbacks STEv3 context struct will call STEv2 functions.

Signed-off-by: Itamar Gozlan <igozlan@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241219175841.1094544-10-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago net/mlx5: DR, expand SWS STE callbacks and consolidate common structs
Itamar Gozlan [Thu, 19 Dec 2024 17:58:38 +0000 (19:58 +0200)] 
net/mlx5: DR, expand SWS STE callbacks and consolidate common structs

Expand SWS STE callbacks to support ConnectX-8 hardware.
Move common enums and structures to a shared header file.

Signed-off-by: Itamar Gozlan <igozlan@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241219175841.1094544-9-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago net/mlx5: HWS, do not initialize native API queues
Yevgeny Kliteynik [Thu, 19 Dec 2024 17:58:37 +0000 (19:58 +0200)] 
net/mlx5: HWS, do not initialize native API queues

HWS has two types of APIs:
 - Native: fastest and slimmest, async API.
   The user of this API is required to manage rule handles memory,
   and to poll for completion for each rule.
 - BWC: backward compatible API, similar semantics to SWS API.
   This layer is implemented above native API and it does all
   the work for the user, so that it is easy to switch between
   SWS and HWS.

Right now the existing users of HWS require only BWC API.
Therefore, in order to not waste resources, this patch disables
send queues allocation for native API.

If support for faster HWS rule insertion is required in the future
(such as for Connection Tracking), native queues can be enabled.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Itamar Gozlan <igozlan@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241219175841.1094544-8-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago net/mlx5: HWS, no need to expose mlx5hws_send_queues_open/close
Yevgeny Kliteynik [Thu, 19 Dec 2024 17:58:36 +0000 (19:58 +0200)] 
net/mlx5: HWS, no need to expose mlx5hws_send_queues_open/close

No need to have mlx5hws_send_queues_open/close in header.
Make them static and remove from header.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Itamar Gozlan <igozlan@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241219175841.1094544-7-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago net/mlx5: fs, retry insertion to hash table on EBUSY
Mark Bloch [Thu, 19 Dec 2024 17:58:35 +0000 (19:58 +0200)] 
net/mlx5: fs, retry insertion to hash table on EBUSY

When inserting into an rhashtable faster than it can grow, an -EBUSY error
may be encountered. Modify the insertion logic to retry on -EBUSY until
either a successful insertion or a genuine error is returned.

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/20241219175841.1094544-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
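
The retry-on-EBUSY pattern from the commit above can be sketched in userspace as follows. The table here is a hypothetical fake (`struct fake_table`, `fake_insert()`) that reports -EBUSY a fixed number of times; the real code inserts into a kernel rhashtable, which this sketch does not model.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a hash table that is busy while it grows. */
struct fake_table {
	int busy_rounds;   /* how many more inserts report -EBUSY */
	int fatal_err;     /* non-zero: fail with this errno once not busy */
	int entries;
};

static int fake_insert(struct fake_table *t)
{
	if (t->busy_rounds > 0) {
		t->busy_rounds--;
		return -EBUSY;
	}
	if (t->fatal_err)
		return -t->fatal_err;
	t->entries++;
	return 0;
}

/* Retry while the table reports -EBUSY; stop on success or any other error. */
static int insert_retry(struct fake_table *t)
{
	int err;

	assert(t != NULL);
	do {
		err = fake_insert(t);
	} while (err == -EBUSY);

	return err;
}
```

Only -EBUSY is treated as transient; any other error is returned to the caller unchanged.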
5 months ago net/mlx5: fs, add mlx5_fs_pool API
Moshe Shemesh [Thu, 19 Dec 2024 17:58:34 +0000 (19:58 +0200)] 
net/mlx5: fs, add mlx5_fs_pool API

Refactor fc_pool API to create generic fs_pool API, as HW steering has
more flow steering elements which can take advantage of the same pool of
bulks API. Change fs_counters code to use the fs_pool API.

Note: __counted_by was removed from struct mlx5_fc_bulk, as bulk_len is now
an inner struct member. It will be added back once __counted_by can support
inner struct members.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241219175841.1094544-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago net/mlx5: fs, add counter object to flow destination
Moshe Shemesh [Thu, 19 Dec 2024 17:58:33 +0000 (19:58 +0200)] 
net/mlx5: fs, add counter object to flow destination

Currently mlx5_flow_destination includes counter_id which is assigned in
case we use flow counter on the flow steering rule. However, counter_id
is not enough data in case of using HW Steering. Thus, have mlx5_fc
object as part of mlx5_flow_destination instead of counter_id and assign
it where needed.

In case counter_id is received from user space, create a local counter
object to represent it.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241219175841.1094544-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago net/mlx5: LAG, Support LAG over Multi-Host NICs
Rongwei Liu [Thu, 19 Dec 2024 17:58:32 +0000 (19:58 +0200)] 
net/mlx5: LAG, Support LAG over Multi-Host NICs

New multi-host NICs provide each host with partial ports,
allowing each host to maintain its unique LAG configuration.

On these multi-host NICs, the 'native_port_num' capability
is no longer continuous on each host and can exceed the
'num_lag_ports' capability. Therefore, it is necessary to
skip the PFs with ldev->pf[i].dev == NULL when querying/modifying
the lag devices' information.
There is no need to check dev.native_port_num against ldev->ports.

Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241219175841.1094544-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months ago net/mlx5: LAG, Refactor lag logic
Rongwei Liu [Thu, 19 Dec 2024 17:58:31 +0000 (19:58 +0200)] 
net/mlx5: LAG, Refactor lag logic

Wrap the lag pf access into two new macros:
1. ldev_for_each()
2. ldev_for_each_reverse()
The maximum number of lag ports and the index to `native_port_num`
mapping will be handled by the two new macros.
Users shouldn't use the for loop anymore.

Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241219175841.1094544-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
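
A pair of iteration macros that hide the port limit and skip unpopulated PF slots could be sketched as below. This is not the mlx5 code: the `sketch_` names, `SKETCH_MAX_PORTS` and the macro signatures are hypothetical, chosen only to illustrate forward and reverse iteration that skips entries whose `dev` pointer is NULL.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_PORTS 4   /* hypothetical limit */

struct sketch_pf { void *dev; };
struct sketch_ldev { struct sketch_pf pf[SKETCH_MAX_PORTS]; };

/* First index >= start with a populated PF, or SKETCH_MAX_PORTS if none. */
static int sketch_next(const struct sketch_ldev *ldev, int start)
{
	int i;

	assert(ldev != NULL);
	for (i = start < 0 ? 0 : start; i < SKETCH_MAX_PORTS; i++)
		if (ldev->pf[i].dev)
			return i;
	return SKETCH_MAX_PORTS;
}

/* Last index <= start with a populated PF, or -1 if none. */
static int sketch_prev(const struct sketch_ldev *ldev, int start)
{
	int i;

	assert(ldev != NULL);
	i = start >= SKETCH_MAX_PORTS ? SKETCH_MAX_PORTS - 1 : start;
	for (; i >= 0; i--)
		if (ldev->pf[i].dev)
			return i;
	return -1;
}

#define sketch_for_each(i, ldev) \
	for ((i) = sketch_next((ldev), 0); (i) < SKETCH_MAX_PORTS; \
	     (i) = sketch_next((ldev), (i) + 1))

#define sketch_for_each_reverse(i, ldev) \
	for ((i) = sketch_prev((ldev), SKETCH_MAX_PORTS - 1); (i) >= 0; \
	     (i) = sketch_prev((ldev), (i) - 1))

/* Example user: bitmask of populated PF slots, built via the macro. */
static unsigned int sketch_active_mask(const struct sketch_ldev *ldev)
{
	unsigned int mask = 0;
	int i;

	sketch_for_each(i, ldev)
		mask |= 1u << i;
	return mask;
}
```

Callers never see the port limit or the NULL check, which is why open-coded `for` loops over the PF array become unnecessary.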
5 months ago Merge branch 'add-rds-ptp-library-for-microchip-phys'
Jakub Kicinski [Mon, 23 Dec 2024 18:31:01 +0000 (10:31 -0800)] 
Merge branch 'add-rds-ptp-library-for-microchip-phys'

Divya Koppera says:

====================
Add rds ptp library for Microchip phys

Adds support for the rds ptp library in Microchip phys, where rds is the internal
code name for the ptp IP or hardware. This library will be re-used in
Microchip phys where the same ptp hardware is used. Register base addresses
and mmd may change, so the base addresses and mmd are made variable
in this library.
====================

Link: https://patch.msgid.link/20241219123311.30213-1-divya.koppera@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: phy: microchip_t1 : Add initialization of ptp for lan887x
Divya Koppera [Thu, 19 Dec 2024 12:33:11 +0000 (18:03 +0530)] 
net: phy: microchip_t1 : Add initialization of ptp for lan887x

Add initialization of ptp for lan887x.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Divya Koppera <divya.koppera@microchip.com>
Link: https://patch.msgid.link/20241219123311.30213-6-divya.koppera@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: phy: Makefile: Add makefile support for rds ptp in Microchip phys
Divya Koppera [Thu, 19 Dec 2024 12:33:10 +0000 (18:03 +0530)] 
net: phy: Makefile: Add makefile support for rds ptp in Microchip phys

Add makefile support for rds ptp library.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Divya Koppera <divya.koppera@microchip.com>
Link: https://patch.msgid.link/20241219123311.30213-5-divya.koppera@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: phy: Kconfig: Add rds ptp library support and 1588 optional flag in Microchip...
Divya Koppera [Thu, 19 Dec 2024 12:33:09 +0000 (18:03 +0530)] 
net: phy: Kconfig: Add rds ptp library support and 1588 optional flag in Microchip phys

Add ptp library support in Kconfig.
As some Microchip T1 phys support ptp, add a dependency
on the 1588 optional flag in Kconfig.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Divya Koppera <divya.koppera@microchip.com>
Link: https://patch.msgid.link/20241219123311.30213-4-divya.koppera@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: phy: microchip_rds_ptp : Add rds ptp library for Microchip phys
Divya Koppera [Thu, 19 Dec 2024 12:33:08 +0000 (18:03 +0530)] 
net: phy: microchip_rds_ptp : Add rds ptp library for Microchip phys

Add rds ptp library for Microchip phys.
1-step and 2-step modes are supported, over Ethernet and UDP (IPv4, IPv6).

Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Divya Koppera <divya.koppera@microchip.com>
Link: https://patch.msgid.link/20241219123311.30213-3-divya.koppera@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: phy: microchip_rds_ptp: Add header file for Microchip rds ptp library
Divya Koppera [Thu, 19 Dec 2024 12:33:07 +0000 (18:03 +0530)] 
net: phy: microchip_rds_ptp: Add header file for Microchip rds ptp library

This rds ptp header file covers the ptp macros for future Microchip phys
where the register addresses are the same but the base offset and mmd
address may change.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Divya Koppera <divya.koppera@microchip.com>
Link: https://patch.msgid.link/20241219123311.30213-2-divya.koppera@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoMerge branch 'vsock-test-tests-for-memory-leaks'
Jakub Kicinski [Mon, 23 Dec 2024 18:29:00 +0000 (10:29 -0800)] 
Merge branch 'vsock-test-tests-for-memory-leaks'

Michal Luczaj says:

====================
vsock/test: Tests for memory leaks

Series adds tests for recently fixed memory leaks[1]:

commit d7b0ff5a8667 ("virtio/vsock: Fix accept_queue memory leak")
commit fbf7085b3ad1 ("vsock: Fix sk_error_queue memory leak")
commit 60cf6206a1f5 ("virtio/vsock: Improve MSG_ZEROCOPY error handling")

Patch 1 is a non-functional preparatory cleanup.
Patch 2 is a test suite extension for picking specific tests.
Patch 3 explains the need for kmemleak scans.
Patch 4 adapts utility functions to handle MSG_ZEROCOPY.
Patches 5-6-7 add the tests.

NOTE: Test in the last patch ("vsock/test: Add test for MSG_ZEROCOPY
completion memory leak") may stop working even before this series is
merged. See changes proposed in [2]. The failslab variant would be
unaffected.

[1] https://lore.kernel.org/20241107-vsock-mem-leaks-v2-0-4e21bfcfc818@rbox.co
[2] https://lore.kernel.org/CANn89i+oL+qoPmbbGvE_RT3_3OWgeck7cCPcTafeehKrQZ8kyw@mail.gmail.com

v3: https://lore.kernel.org/20241218-test-vsock-leaks-v3-0-f1a4dcef9228@rbox.co
v2: https://lore.kernel.org/20241216-test-vsock-leaks-v2-0-55e1405742fc@rbox.co
v1: https://lore.kernel.org/20241206-test-vsock-leaks-v1-0-c31e8c875797@rbox.co
====================

Link: https://patch.msgid.link/20241219-test-vsock-leaks-v4-0-a416e554d9d7@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agovsock/test: Add test for MSG_ZEROCOPY completion memory leak
Michal Luczaj [Thu, 19 Dec 2024 09:49:34 +0000 (10:49 +0100)] 
vsock/test: Add test for MSG_ZEROCOPY completion memory leak

Exercise the ENOMEM error path by attempting to hit net.core.optmem_max
limit on send().

The test aims to create a memory leak; kmemleak should be employed.

Fixed by commit 60cf6206a1f5 ("virtio/vsock: Improve MSG_ZEROCOPY error
handling").

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20241219-test-vsock-leaks-v4-7-a416e554d9d7@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agovsock/test: Add test for sk_error_queue memory leak
Michal Luczaj [Thu, 19 Dec 2024 09:49:33 +0000 (10:49 +0100)] 
vsock/test: Add test for sk_error_queue memory leak

Ask for MSG_ZEROCOPY completion notification, but do not recv() it.
The test attempts to create a memory leak; kmemleak should be employed.

Fixed by commit fbf7085b3ad1 ("vsock: Fix sk_error_queue memory leak").

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20241219-test-vsock-leaks-v4-6-a416e554d9d7@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agovsock/test: Add test for accept_queue memory leak
Michal Luczaj [Thu, 19 Dec 2024 09:49:32 +0000 (10:49 +0100)] 
vsock/test: Add test for accept_queue memory leak

Attempt to enqueue a child after the queue was flushed, but before
SOCK_DONE flag has been set.

The test tries to produce a memory leak; kmemleak should be employed.
Because it deals with a race condition, the test may by its very nature
produce a false negative.

Fixed by commit d7b0ff5a8667 ("virtio/vsock: Fix accept_queue memory
leak").

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20241219-test-vsock-leaks-v4-5-a416e554d9d7@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agovsock/test: Adapt send_byte()/recv_byte() to handle MSG_ZEROCOPY
Michal Luczaj [Thu, 19 Dec 2024 09:49:31 +0000 (10:49 +0100)] 
vsock/test: Adapt send_byte()/recv_byte() to handle MSG_ZEROCOPY

For a zerocopy send(), buffer (always byte 'A') needs to be preserved (thus
it can not be on the stack) or the data recv()ed check in recv_byte() might
fail.

While there, change the printf format to 0x%02x so the '\0' bytes can be
seen.
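
A minimal sketch of the formatting choice mentioned above; the helper
name sketch_format_byte() is hypothetical and is not taken from the
test suite.

```c
#include <stdio.h>

/*
 * Format one byte with "0x%02x". A '\0' byte is rendered as "0x00"
 * and therefore stays visible in diagnostic output.
 */
int sketch_format_byte(char *out, size_t len, unsigned char byte)
{
	return snprintf(out, len, "0x%02x", byte);
}
```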

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20241219-test-vsock-leaks-v4-4-a416e554d9d7@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agovsock/test: Add README blurb about kmemleak usage
Michal Luczaj [Thu, 19 Dec 2024 09:49:30 +0000 (10:49 +0100)] 
vsock/test: Add README blurb about kmemleak usage

Document the suggested use of kmemleak for memory leak detection.

Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20241219-test-vsock-leaks-v4-3-a416e554d9d7@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agovsock/test: Introduce option to select tests
Michal Luczaj [Thu, 19 Dec 2024 09:49:29 +0000 (10:49 +0100)] 
vsock/test: Introduce option to select tests

Allow for selecting specific test IDs to be executed.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20241219-test-vsock-leaks-v4-2-a416e554d9d7@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agovsock/test: Use NSEC_PER_SEC
Michal Luczaj [Thu, 19 Dec 2024 09:49:28 +0000 (10:49 +0100)] 
vsock/test: Use NSEC_PER_SEC

Replace 1000000000ULL with NSEC_PER_SEC.

No functional change intended.
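
For illustration, a named constant in a time conversion might look like
the sketch below. The helper sketch_timespec_to_ns() is hypothetical,
and NSEC_PER_SEC is defined locally here only so that the sketch builds
on its own.

```c
#include <time.h>

/* Local definition for this sketch; the value is the one quoted above. */
#define NSEC_PER_SEC 1000000000ULL

/* Convert a timespec to nanoseconds without a bare magic number. */
unsigned long long sketch_timespec_to_ns(const struct timespec *ts)
{
	return (unsigned long long)ts->tv_sec * NSEC_PER_SEC +
	       (unsigned long long)ts->tv_nsec;
}
```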

Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20241219-test-vsock-leaks-v4-1-a416e554d9d7@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonetlink: correct nlmsg size for multicast notifications
Yuyang Huang [Sat, 21 Dec 2024 10:00:07 +0000 (19:00 +0900)] 
netlink: correct nlmsg size for multicast notifications

Corrected the netlink message size calculation for multicast group
join/leave notifications. The previous calculation did not account for
the inclusion of both IPv4/IPv6 addresses and ifa_cacheinfo in the
payload. This fix ensures that the allocated message size is
sufficient to hold all necessary information.

This patch also includes the following improvements:
* Uses GFP_KERNEL instead of GFP_ATOMIC when holding the RTNL mutex.
* Uses nla_total_size(sizeof(struct in6_addr)) instead of
  nla_total_size(16).
* Removes unnecessary EXPORT_SYMBOL().
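
The size arithmetic described above can be sketched in userspace as
follows. This is an illustration under stated assumptions, not the
kernel code: the struct layouts are local stand-ins, every sk_* name is
hypothetical, and the only facts relied on are that netlink attributes
carry a 4-byte header and are padded to 4 bytes.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_ALIGN4(len) (((len) + 3u) & ~(size_t)3u)
#define SK_NLA_HDRLEN  4u /* 2-byte length + 2-byte type */

/* Stand-ins for the address message header and its two attributes. */
struct sk_ifaddrmsg { uint8_t family, prefixlen, flags, scope; uint32_t index; };
struct sk_in6_addr  { uint8_t bytes[16]; };
struct sk_cacheinfo { uint32_t prefered, valid, cstamp, tstamp; };

/* Space taken by one attribute: header plus payload, padded to 4 bytes. */
size_t sk_nla_total_size(size_t payload)
{
	return SK_ALIGN4(SK_NLA_HDRLEN + payload);
}

/* Payload size for a notification carrying an address and cache info. */
size_t sk_mcaddr_msgsize(void)
{
	return SK_ALIGN4(sizeof(struct sk_ifaddrmsg)) +
	       sk_nla_total_size(sizeof(struct sk_in6_addr)) +
	       sk_nla_total_size(sizeof(struct sk_cacheinfo));
}
```

Using sizeof() on the payload type rather than a literal 16 keeps the
calculation tied to the structure it describes.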

Fixes: 2c2b61d2138f ("netlink: add IGMP/MLD join/leave notifications")
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Yuyang Huang <yuyanghuang@google.com>
Link: https://patch.msgid.link/20241221100007.1910089-1-yuyanghuang@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoselftests: drv-net: assume stats refresh is 0 if no ethtool -c support
Jakub Kicinski [Fri, 20 Dec 2024 00:31:16 +0000 (16:31 -0800)] 
selftests: drv-net: assume stats refresh is 0 if no ethtool -c support

Tests using HW stats wait for them to stabilize, using data from
ethtool -c as the delay. Not all drivers implement ethtool -c
so handle the errors gracefully.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20241220003116.1458863-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agosfc: Use netdev refcount tracking in struct efx_async_filter_insertion
YiFei Zhu [Thu, 19 Dec 2024 17:30:04 +0000 (17:30 +0000)] 
sfc: Use netdev refcount tracking in struct efx_async_filter_insertion

I was debugging some netdev refcount issues in OpenOnload, and one
of the places I was looking at was in the sfc driver. Only
struct efx_async_filter_insertion was not using netdev refcount tracker,
so add it here. GFP_ATOMIC because this code path is called by
ndo_rx_flow_steer which holds RCU.

This patch should be a no-op if !CONFIG_NET_DEV_REFCNT_TRACKER

Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241219173004.2615655-1-zhuyifei@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoMerge branch 'net-bridge-add-skb-drop-reasons-to-the-most-common-drop-points'
Jakub Kicinski [Mon, 23 Dec 2024 18:11:07 +0000 (10:11 -0800)] 
Merge branch 'net-bridge-add-skb-drop-reasons-to-the-most-common-drop-points'

Radu Rendec says:

====================
net/bridge: Add skb drop reasons to the most common drop points

The bridge input code may drop frames for various reasons and at various
points in the ingress handling logic. Currently kfree_skb() is used
everywhere, and therefore no drop reason is specified. Add drop reasons
to the most common drop points.

The purpose of this series is to address the most common drop points on
the bridge ingress path. It does not exhaustively add drop reasons to
the entire bridge code. The intention here is to incrementally add drop
reasons to the rest of the bridge code in follow up patches.

Most of the skb drop points that are addressed in this series can be
easily tested by sending crafted packets. The diagram below shows a
simple test configuration, and some examples using `packit`(*) are
also included. The bridge is set up with STP disabled.
(*) https://github.com/resurrecting-open-source-projects/packit

The following changes were *not* tested:
* SKB_DROP_REASON_NOMEM in br_flood(). It's not easy to trigger an OOM
  condition for testing purposes, while everything else works correctly.
* All drop reasons in br_multicast_flood(). I could not find an easy way
  to make a crafted packet get there.
* SKB_DROP_REASON_BRIDGE_INGRESS_STP_STATE in br_handle_frame_finish()
  when the port state is BR_STATE_DISABLED, because in that case the
  frame is already dropped in the switch/case block at the end of
  br_handle_frame().

    +-------+
    |  br0  |
    +---+---+
        |
    +---+---+  veth pair  +-------+
    | veth0 +-------------+ xeth0 |
    +-------+             +-------+

SKB_DROP_REASON_MAC_INVALID_SOURCE - br_handle_frame()
packit -t UDP -s 192.168.0.1 -d 192.168.0.2 -S 8000 -D 8000 \
  -e 01:22:33:44:55:66 -E aa:bb:cc:dd:ee:ff -c 1 \
  -p '0x de ad be ef' -i xeth0

SKB_DROP_REASON_MAC_IEEE_MAC_CONTROL - br_handle_frame()
packit -t UDP -s 192.168.0.1 -d 192.168.0.2 -S 8000 -D 8000 \
  -e 02:22:33:44:55:66 -E 01:80:c2:00:00:01 -c 1 \
  -p '0x de ad be ef' -i xeth0

SKB_DROP_REASON_BRIDGE_INGRESS_STP_STATE - br_handle_frame()
bridge link set dev veth0 state 0 # disabled
packit -t UDP -s 192.168.0.1 -d 192.168.0.2 -S 8000 -D 8000 \
  -e 02:22:33:44:55:66 -E aa:bb:cc:dd:ee:ff -c 1 \
  -p '0x de ad be ef' -i xeth0

SKB_DROP_REASON_BRIDGE_INGRESS_STP_STATE - br_handle_frame_finish()
bridge link set dev veth0 state 2 # learning
packit -t UDP -s 192.168.0.1 -d 192.168.0.2 -S 8000 -D 8000 \
  -e 02:22:33:44:55:66 -E aa:bb:cc:dd:ee:ff -c 1 \
  -p '0x de ad be ef' -i xeth0

SKB_DROP_REASON_NO_TX_TARGET - br_flood()
packit -t UDP -s 192.168.0.1 -d 192.168.0.2 -S 8000 -D 8000 \
  -e 02:22:33:44:55:66 -E aa:bb:cc:dd:ee:ff -c 1 \
  -p '0x de ad be ef' -i xeth0
====================

Link: https://patch.msgid.link/20241219163606.717758-1-rrendec@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: bridge: add skb drop reasons to the most common drop points
Radu Rendec [Thu, 19 Dec 2024 16:36:06 +0000 (11:36 -0500)] 
net: bridge: add skb drop reasons to the most common drop points

The bridge input code may drop frames for various reasons and at various
points in the ingress handling logic. Currently kfree_skb() is used
everywhere, and therefore no drop reason is specified. Add drop reasons
to the most common drop points.

Drop reasons are not added exhaustively to the entire bridge code. The
intention is to incrementally add drop reasons to the rest of the bridge
code in follow up patches.
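
As a rough userspace sketch of the idea, a drop point can return a
specific reason instead of an unspecified one. The enum and function
names below are hypothetical, only two checks are modelled, and the
assumption is that a source address is invalid when it is multicast or
all-zero and that 01:80:c2:00:00:01 is the IEEE MAC control address.

```c
#include <stdint.h>
#include <string.h>

enum sk_drop_reason {
	SK_DROP_NONE = 0,             /* frame is not dropped by these checks */
	SK_DROP_MAC_INVALID_SOURCE,
	SK_DROP_MAC_IEEE_MAC_CONTROL,
};

/* A usable source address is neither multicast nor all-zero. */
int sk_mac_is_valid_source(const uint8_t mac[6])
{
	static const uint8_t zero[6];

	return !(mac[0] & 0x01) && memcmp(mac, zero, 6) != 0;
}

/* Map a frame's addresses to the reason it would be dropped, if any. */
enum sk_drop_reason sk_classify(const uint8_t src[6], const uint8_t dst[6])
{
	static const uint8_t mac_control[6] = {
		0x01, 0x80, 0xc2, 0x00, 0x00, 0x01
	};

	if (!sk_mac_is_valid_source(src))
		return SK_DROP_MAC_INVALID_SOURCE;
	if (memcmp(dst, mac_control, 6) == 0)
		return SK_DROP_MAC_IEEE_MAC_CONTROL;
	return SK_DROP_NONE;
}
```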

Signed-off-by: Radu Rendec <rrendec@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20241219163606.717758-3-rrendec@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: vxlan: rename SKB_DROP_REASON_VXLAN_NO_REMOTE
Radu Rendec [Thu, 19 Dec 2024 16:36:05 +0000 (11:36 -0500)] 
net: vxlan: rename SKB_DROP_REASON_VXLAN_NO_REMOTE

The SKB_DROP_REASON_VXLAN_NO_REMOTE skb drop reason was introduced in
the specific context of vxlan. As it turns out, there are similar cases
when a packet needs to be dropped in other parts of the network stack,
such as the bridge module.

Rename SKB_DROP_REASON_VXLAN_NO_REMOTE and give it a more generic name,
so that it can be used in other parts of the network stack. This is not
a functional change, and the numeric value of the drop reason even
remains unchanged.

Signed-off-by: Radu Rendec <rrendec@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20241219163606.717758-2-rrendec@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoMerge branch 'add-more-feautues-for-enetc-v4-round-1'
Jakub Kicinski [Mon, 23 Dec 2024 17:54:35 +0000 (09:54 -0800)] 
Merge branch 'add-more-feautues-for-enetc-v4-round-1'

Wei Fang says:

====================
Add more features for ENETC v4 - round 1

Compared to ENETC v1 (LS1028A), ENETC v4 (i.MX95) adds more features, and
some features are configured completely differently from v1. In order to
more fully support ENETC v4, these features will be added through several
rounds of patch sets. This round adds Tx and Rx checksum offload, a
higher maximum number of chained Tx BDs, and large send offload (LSO).

Link: https://lore.kernel.org/20241107033817.1654163-1-wei.fang@nxp.com
Link: https://lore.kernel.org/20241111015216.1804534-1-wei.fang@nxp.com
Link: https://lore.kernel.org/20241112091447.1850899-1-wei.fang@nxp.com
Link: https://lore.kernel.org/20241115024744.1903377-1-wei.fang@nxp.com
Link: https://lore.kernel.org/20241118060630.1956134-1-wei.fang@nxp.com
Link: https://lore.kernel.org/20241119082344.2022830-1-wei.fang@nxp.com
Link: https://lore.kernel.org/20241204052932.112446-1-wei.fang@nxp.com
Link: https://lore.kernel.org/20241211063752.744975-1-wei.fang@nxp.com
Link: https://lore.kernel.org/20241213021731.1157535-1-wei.fang@nxp.com
====================

Link: https://patch.msgid.link/20241219054755.1615626-1-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: enetc: add UDP segmentation offload support
Wei Fang [Thu, 19 Dec 2024 05:47:55 +0000 (13:47 +0800)] 
net: enetc: add UDP segmentation offload support

Set the NETIF_F_GSO_UDP_L4 bit in hw_features and features because both
the i.MX95 ENETC and the LS1028A ENETC driver implement UDP segmentation.

- i.MX95 ENETC supports UDP segmentation via LSO.
- LS1028A ENETC supports UDP segmentation since the commit 3d5b459ba0e3
("net: tso: add UDP segmentation support").

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20241219054755.1615626-5-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: enetc: add LSO support for i.MX95 ENETC PF
Wei Fang [Thu, 19 Dec 2024 05:47:54 +0000 (13:47 +0800)] 
net: enetc: add LSO support for i.MX95 ENETC PF

ENETC rev 4.1 supports large send offload (LSO), segmenting large TCP
and UDP transmit units into multiple Ethernet frames. To support LSO,
software needs to fill some auxiliary information in Tx BD, such as LSO
header length, frame length, LSO maximum segment size, etc.

At 1Gbps link rate, TCP segmentation was tested using iperf3, and the
CPU performance before and after applying the patch was compared through
the top command. It can be seen that LSO saves a significant amount of
CPU cycles compared to software TSO.

Before applying the patch:
%Cpu(s):  0.1 us,  4.1 sy,  0.0 ni, 85.7 id,  0.0 wa,  0.5 hi,  9.7 si

After applying the patch:
%Cpu(s):  0.1 us,  2.3 sy,  0.0 ni, 94.5 id,  0.0 wa,  0.4 hi,  2.6 si

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Link: https://patch.msgid.link/20241219054755.1615626-4-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: enetc: update max chained Tx BD number for i.MX95 ENETC
Wei Fang [Thu, 19 Dec 2024 05:47:53 +0000 (13:47 +0800)] 
net: enetc: update max chained Tx BD number for i.MX95 ENETC

The maximum number of chained Tx BDs on the latest ENETC (i.MX95 ENETC,
rev 4.1) has been increased to 63. However, since MAX_SKB_FRAGS ranges
from 17 to 45, it is better to set ENETC4_MAX_SKB_FRAGS to MAX_SKB_FRAGS
for i.MX95 ENETC and later revisions.

In addition, add max_frags to struct enetc_drvdata to indicate the
maximum number of chained BDs supported by the device, because that
number differs between LS1028A and i.MX95 ENETC.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241219054755.1615626-3-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agonet: enetc: add Tx checksum offload for i.MX95 ENETC
Wei Fang [Thu, 19 Dec 2024 05:47:52 +0000 (13:47 +0800)] 
net: enetc: add Tx checksum offload for i.MX95 ENETC

In addition to supporting Rx checksum offload, i.MX95 ENETC also supports
Tx checksum offload. The transmit checksum offload is implemented through
the Tx BD. To support Tx checksum offload, software needs to fill some
auxiliary information in Tx BD, such as IP version, IP header offset and
size, whether L4 is UDP or TCP, etc.

As with Rx checksum offload, the Tx checksum offload capability is not
exposed in a register, so a tx_csum bit is added to struct enetc_drvdata
to indicate whether the device supports Tx checksum offload.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Link: https://patch.msgid.link/20241219054755.1615626-2-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 months agoudp: Deal with race between UDP socket address change and rehash
Stefano Brivio [Wed, 18 Dec 2024 16:21:16 +0000 (17:21 +0100)] 
udp: Deal with race between UDP socket address change and rehash

If a UDP socket changes its local address while it's receiving
datagrams, as a result of connect(), there is a period during which
a lookup operation might fail to find it, after the address is changed
but before the secondary hash (port and address) and the four-tuple
hash (local and remote ports and addresses) are updated.

Secondary hash chains were introduced by commit 30fff9231fad ("udp:
bind() optimisation") and, as a result, a rehash operation became
needed to make a bound socket reachable again after a connect().

This operation was introduced by commit 719f835853a9 ("udp: add
rehash on connect()") which isn't however a complete fix: the
socket will be found once the rehashing completes, but not while
it's pending.

This is noticeable with a socat(1) server in UDP4-LISTEN mode, and a
client sending datagrams to it. After the server receives the first
datagram (cf. _xioopen_ipdgram_listen()), it issues a connect() to
the address of the sender, in order to set up a directed flow.

Now, if the client, running on a different CPU thread, happens to
send a (subsequent) datagram while the server's socket changes its
address, but is not rehashed yet, this will result in a failed
lookup and a port unreachable error delivered to the client, as
apparent from the following reproducer:

  LEN=$(($(cat /proc/sys/net/core/wmem_default) / 4))
  dd if=/dev/urandom bs=1 count=${LEN} of=tmp.in

  while :; do
   taskset -c 1 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
   sleep 0.1 || sleep 1
   taskset -c 2 socat OPEN:tmp.in UDP4:localhost:1337,shut-null
   wait
  done

where the client will eventually get ECONNREFUSED on a write()
(typically the second or third one of a given iteration):

  2024/11/13 21:28:23 socat[46901] E write(6, 0x556db2e3c000, 8192): Connection refused

This issue was first observed as a rare failure in Podman's tests
checking UDP functionality while using pasta(1) to connect the
container's network namespace, which leads us to a reproducer with
the lookup error resulting in an ICMP packet on a tap device:

  LOCAL_ADDR="$(ip -j -4 addr show|jq -rM '.[] | .addr_info[0] | select(.scope == "global").local')"

  while :; do
   ./pasta --config-net -p pasta.pcap -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
   sleep 0.2 || sleep 1
   socat OPEN:tmp.in UDP4:${LOCAL_ADDR}:1337,shut-null
   wait
   cmp tmp.in tmp.out
  done

Once this fails:

  tmp.in tmp.out differ: char 8193, line 29

we can finally have a look at what's going on:

  $ tshark -r pasta.pcap
      1   0.000000           :: → ff02::16     ICMPv6 110 Multicast Listener Report Message v2
      2   0.168690 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
      3   0.168767 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
      4   0.168806 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
      5   0.168827 c6:47:05:8d:dc:04 → Broadcast    ARP 42 Who has 88.198.0.161? Tell 88.198.0.164
      6   0.168851 9a:55:9a:55:9a:55 → c6:47:05:8d:dc:04 ARP 42 88.198.0.161 is at 9a:55:9a:55:9a:55
      7   0.168875 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
      8   0.168896 88.198.0.164 → 88.198.0.161 ICMP 590 Destination unreachable (Port unreachable)
      9   0.168926 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
     10   0.168959 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
     11   0.168989 88.198.0.161 → 88.198.0.164 UDP 4138 60260 → 1337 Len=4096
     12   0.169010 88.198.0.161 → 88.198.0.164 UDP 42 60260 → 1337 Len=0

On the third datagram received, the network namespace of the container
initiates an ARP lookup to deliver the ICMP message.

In another variant of this reproducer, starting the client with:

  strace -f pasta --config-net -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc 2>strace.log &

and connecting to the socat server using a loopback address:

  socat OPEN:tmp.in UDP4:localhost:1337,shut-null

we can more clearly observe a sendmmsg() call failing after the
first datagram is delivered:

  [pid 278012] connect(173, 0x7fff96c95fc0, 16) = 0
  [...]
  [pid 278012] recvmmsg(173, 0x7fff96c96020, 1024, MSG_DONTWAIT, NULL) = -1 EAGAIN (Resource temporarily unavailable)
  [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = 1
  [...]
  [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = -1 ECONNREFUSED (Connection refused)

and, somewhat confusingly, after a connect() on the same socket
succeeded.

Until commit 4cdeeee9252a ("net: udp: prefer listeners bound to an
address"), the race between receive address change and lookup didn't
actually cause visible issues, because, once the lookup based on the
secondary hash chain failed, we would still attempt a lookup based on
the primary hash (destination port only), and find the socket with the
outdated secondary hash.

That change, however, dropped port-only lookups altogether as a side
effect, making the race visible.

To fix this, while avoiding the need to make address changes and
rehash atomic against lookups, reintroduce primary hash lookups as
fallback, if lookups based on four-tuple and secondary hashes fail.

To this end, introduce a simplified lookup implementation, which
doesn't take care of SO_REUSEPORT groups: if we have one, there are
multiple sockets that would match the four-tuple or secondary hash,
meaning that we can't run into this race at all.
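
The race window and the fallback can be modelled with a toy userspace
sketch. Everything here is hypothetical (toy_sock, toy_lookup(), linear
scans in place of hash chains), the four-tuple stage is left out for
brevity, and the model assumes a secondary lookup matches only when both
the stored hash key and the socket's current address equal the key being
searched.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_ADDR_ANY 0u

struct toy_sock {
	uint16_t port;       /* local port: key of the primary hash */
	uint32_t rcv_addr;   /* current local address */
	uint32_t hash2_addr; /* address used when it was last (re)hashed */
};

/* Secondary (port + address) lookup: stale entries do not match. */
const struct toy_sock *toy_lookup2(const struct toy_sock *socks, size_t n,
				   uint32_t key_addr, uint16_t port)
{
	for (size_t i = 0; i < n; i++) {
		const struct toy_sock *sk = &socks[i];

		if (sk->port == port && sk->hash2_addr == key_addr &&
		    sk->rcv_addr == key_addr)
			return sk;
	}
	return NULL;
}

/* Primary (port only) lookup: compares the current address directly. */
const struct toy_sock *toy_lookup1(const struct toy_sock *socks, size_t n,
				   uint32_t daddr, uint16_t port)
{
	for (size_t i = 0; i < n; i++) {
		const struct toy_sock *sk = &socks[i];

		if (sk->port == port &&
		    (sk->rcv_addr == daddr || sk->rcv_addr == TOY_ADDR_ANY))
			return sk;
	}
	return NULL;
}

/* Try exact address, then wildcard, then optionally the port-only scan. */
const struct toy_sock *toy_lookup(const struct toy_sock *socks, size_t n,
				  uint32_t daddr, uint16_t port,
				  int use_fallback)
{
	const struct toy_sock *sk = toy_lookup2(socks, n, daddr, port);

	if (!sk)
		sk = toy_lookup2(socks, n, TOY_ADDR_ANY, port);
	if (!sk && use_fallback)
		sk = toy_lookup1(socks, n, daddr, port);
	return sk;
}
```

A socket whose address has already changed but whose hash2_addr is
still the wildcard is missed by both secondary attempts and is found
only when the fallback is enabled.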

v2:
  - instead of synchronising lookup operations against address change
    plus rehash, reintroduce a simplified version of the original
    primary hash lookup as fallback

v1:
  - fix build with CONFIG_IPV6=n: add ifdef around sk_v6_rcv_saddr
    usage (Kuniyuki Iwashima)
  - directly use sk_rcv_saddr for IPv4 receive addresses instead of
    fetching inet_rcv_saddr (Kuniyuki Iwashima)
  - move inet_update_saddr() to inet_hashtables.h and use that
    to set IPv4/IPv6 addresses as suitable (Kuniyuki Iwashima)
  - rebase onto net-next, update commit message accordingly

Reported-by: Ed Santiago <santiago@redhat.com>
Link: https://github.com/containers/podman/issues/24147
Analysed-by: David Gibson <david@gibson.dropbear.id.au>
Fixes: 30fff9231fad ("udp: bind() optimisation")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 months agoMerge branch 'ipv4-consolidate-route-lookups-from-ipv4-sockets'
Jakub Kicinski [Fri, 20 Dec 2024 21:50:14 +0000 (13:50 -0800)] 
Merge branch 'ipv4-consolidate-route-lookups-from-ipv4-sockets'

Guillaume Nault says:

====================
ipv4: Consolidate route lookups from IPv4 sockets.

Create inet_sk_init_flowi4() so that the different IPv4 code paths that
need to do a route lookup based on an IPv4 socket don't need to
reimplement that logic.
====================

Link: https://patch.msgid.link/cover.1734357769.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agol2tp: Use inet_sk_init_flowi4() in l2tp_ip_sendmsg().
Guillaume Nault [Mon, 16 Dec 2024 17:21:54 +0000 (18:21 +0100)] 
l2tp: Use inet_sk_init_flowi4() in l2tp_ip_sendmsg().

Use inet_sk_init_flowi4() to automatically initialise the flowi4
structure in l2tp_ip_sendmsg() instead of passing parameters manually
to ip_route_output_ports().

Override ->daddr with the value passed in the msghdr structure if
provided.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: James Chapman <jchapman@katalix.com>
Link: https://patch.msgid.link/2ff22a3560c5050228928456662b80b9c84a8fe4.1734357769.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoipv4: Use inet_sk_init_flowi4() in __ip_queue_xmit().
Guillaume Nault [Mon, 16 Dec 2024 17:21:51 +0000 (18:21 +0100)] 
ipv4: Use inet_sk_init_flowi4() in __ip_queue_xmit().

Use inet_sk_init_flowi4() to automatically initialise the flowi4
structure in __ip_queue_xmit() instead of passing parameters manually
to ip_route_output_ports().

Override ->flowi4_tos with the value passed as parameter since that's
required by SCTP.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/37e64ffbd9adac187b14aa9097b095f5c86e85be.1734357769.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoipv4: Use inet_sk_init_flowi4() in inet_csk_rebuild_route().
Guillaume Nault [Mon, 16 Dec 2024 17:21:48 +0000 (18:21 +0100)] 
ipv4: Use inet_sk_init_flowi4() in inet_csk_rebuild_route().

Use inet_sk_init_flowi4() to automatically initialise the flowi4
structure in inet_csk_rebuild_route() instead of passing parameters
manually to ip_route_output_ports().

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/b270931636effa1095508e0f0a3e8c3a0e6d357f.1734357769.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoipv4: Use inet_sk_init_flowi4() in ip4_datagram_release_cb().
Guillaume Nault [Mon, 16 Dec 2024 17:21:46 +0000 (18:21 +0100)] 
ipv4: Use inet_sk_init_flowi4() in ip4_datagram_release_cb().

Use inet_sk_init_flowi4() to automatically initialise the flowi4
structure in ip4_datagram_release_cb() instead of passing parameters
manually to ip_route_output_ports().

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/9c326b8d9e919478f7952b21473d31da07eba2dd.1734357769.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoipv4: Define inet_sk_init_flowi4() and use it in inet_sk_rebuild_header().
Guillaume Nault [Mon, 16 Dec 2024 17:21:44 +0000 (18:21 +0100)] 
ipv4: Define inet_sk_init_flowi4() and use it in inet_sk_rebuild_header().

IPv4 code commonly has to initialise a flowi4 structure from an IPv4
socket. This requires looking at potential IPv4 options to set the
proper destination address, call flowi4_init_output() with the correct
set of parameters and run the sk_classify_flow security hook.

Instead of reimplementing these operations in different parts of the
stack, let's define inet_sk_init_flowi4() which does all these
operations.

The first user is inet_sk_rebuild_header(), where inet_sk_init_flowi4()
replaces ip_route_output_ports(). Unlike ip_route_output_ports(), which
sets the flowi4 structure and performs the route lookup in one go,
inet_sk_init_flowi4() only initialises the flow. The route lookup is
then done by ip_route_output_flow(). Decoupling flow initialisation
from route lookup makes this new interface applicable more broadly as
it will allow some users to overwrite specific struct flowi4 members
before the route lookup.
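
The split between flow initialisation and route lookup can be illustrated
with a small userspace sketch. Every type and function below
(example_sock, example_flow, example_init_flow(), example_route_lookup())
is a hypothetical, heavily simplified stand-in and not the kernel's
actual interface; the only point is that the caller gets a chance to
adjust the flow between the two steps.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-ins for a socket and a flow key. */
struct example_sock {
	uint32_t daddr, saddr;
	uint32_t opt_faddr;	/* first hop from a source-route option, 0 if none */
	int bound_dev_if;
};

struct example_flow {
	uint32_t daddr, saddr;
	int oif;
};

/* Step 1: initialise the flow from the socket. No lookup happens here. */
static void example_init_flow(struct example_flow *fl,
			      const struct example_sock *sk)
{
	assert(fl && sk);
	/* With a source-route option, the lookup must target the first hop. */
	fl->daddr = sk->opt_faddr ? sk->opt_faddr : sk->daddr;
	fl->saddr = sk->saddr;
	fl->oif = sk->bound_dev_if;
}

/* Step 2: a toy "route lookup" that only reports the output interface:
 * the flow's own oif if set, otherwise a default of 1. */
static int example_route_lookup(const struct example_flow *fl)
{
	assert(fl);
	return fl->oif ? fl->oif : 1;
}
```

A caller that needs a non-default value would call example_init_flow(),
overwrite the member it cares about (for instance fl.oif), and only then
call example_route_lookup().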

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/fd416275262b1f518d5abfcef740ce4f4a1a6522.1734357769.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet: dsa: microchip: Do not execute PTP driver code for unsupported switches
Tristram Ha [Wed, 18 Dec 2024 02:02:40 +0000 (18:02 -0800)] 
net: dsa: microchip: Do not execute PTP driver code for unsupported switches

The PTP driver code only works for certain KSZ switches such as KSZ9477,
KSZ9567, LAN937X and their variants.  This code is enabled by the kernel
configuration option CONFIG_NET_DSA_MICROCHIP_KSZ_PTP.  Because the DSA
driver is common to all KSZ switches, the PTP code must not run on
switches that do not support it.  A ptp_capable indication is added to
the chip data structure to signal whether to execute that code.
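
The gating pattern can be sketched in plain userspace C as follows. Only
the ptp_capable name comes from the commit message; the structure, the
table entries and example_ptp_init() are hypothetical and do not
reproduce the driver's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-chip description; only the ptp_capable name is taken
 * from the commit message, everything else is illustrative. */
struct example_chip_info {
	const char *name;
	bool ptp_capable;	/* true if the PTP code may run on this chip */
};

static const struct example_chip_info example_chips[] = {
	{ .name = "KSZ9477",   .ptp_capable = true },
	{ .name = "OTHER_KSZ", .ptp_capable = false },	/* placeholder chip */
};

/* Returns 1 if PTP setup ran, 0 if it was skipped for an unsupported chip. */
static int example_ptp_init(const struct example_chip_info *info)
{
	assert(info);
	if (!info->ptp_capable)
		return 0;	/* skip every PTP code path */

	/* ... PTP clock registration would happen here ... */
	return 1;
}
```

Keeping the flag in the per-chip data means each PTP entry point needs
only one check, instead of a list of chip IDs repeated at every call
site.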

Signed-off-by: Tristram Ha <tristram.ha@microchip.com>
Link: https://patch.msgid.link/20241218020240.70601-1-Tristram.Ha@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoqlcnic: use const 'struct bin_attribute' callbacks
Thomas Weißschuh [Thu, 19 Dec 2024 10:00:19 +0000 (11:00 +0100)] 
qlcnic: use const 'struct bin_attribute' callbacks

The sysfs core now provides callback variants that explicitly take a
const pointer. Use them so the non-const variants can be removed.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://patch.msgid.link/20241219-sysfs-const-bin_attr-net-v2-1-93bdaece3c90@weissschuh.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoMerge branch 'bridge-handle-changes-in-vlan_flag_bridge_binding'
Jakub Kicinski [Fri, 20 Dec 2024 21:16:46 +0000 (13:16 -0800)] 
Merge branch 'bridge-handle-changes-in-vlan_flag_bridge_binding'

Petr Machata says:

====================
bridge: Handle changes in VLAN_FLAG_BRIDGE_BINDING

When bridge binding is enabled on a VLAN netdevice, its link state should
track bridge ports that are members of the corresponding VLAN. This works
for newly-added netdevices. However, toggling the option does not have the
effect of enabling or disabling the behavior as appropriate.

In this patchset, have bridge react to bridge_binding toggles on VLAN
uppers.

There has been another attempt at supporting this behavior in 2022 by
Sevinj Aghayeva [0]. A discussion ensued that informed how this new
patchset is constructed, namely that the logic is in the bridge as opposed
to the 8021q driver, and the bridge reacts to NETDEV_CHANGE events on the
8021q upper.

Patches #1 and #2 contain the implementation, patches #3 and #4 a
selftest.

[0] https://lore.kernel.org/netdev/cover.1660100506.git.sevinj.aghayeva@gmail.com/
====================

Link: https://patch.msgid.link/cover.1734540770.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoselftests: net: Add a VLAN bridge binding selftest
Petr Machata [Wed, 18 Dec 2024 17:15:59 +0000 (18:15 +0100)] 
selftests: net: Add a VLAN bridge binding selftest

Add a test that exercises bridge binding.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/baf7244fd1fe223a6d93e027584fa9f99dee982c.1734540770.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoselftests: net: lib: Add a couple autodefer helpers
Petr Machata [Wed, 18 Dec 2024 17:15:58 +0000 (18:15 +0100)] 
selftests: net: lib: Add a couple autodefer helpers

Alongside the helper ip_link_set_up(), one to set the link down will be
useful as well. Add a helper to determine the link state as well,
ip_link_is_up(), and use it to short-circuit any changes if the state is
already the desired one.

Furthermore, add a helper bridge_vlan_add().

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/856d9e01725fdba21b7f6716358f645b19131af2.1734540770.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet: bridge: Handle changes in VLAN_FLAG_BRIDGE_BINDING
Petr Machata [Wed, 18 Dec 2024 17:15:57 +0000 (18:15 +0100)] 
net: bridge: Handle changes in VLAN_FLAG_BRIDGE_BINDING

When bridge binding is enabled on a VLAN netdevice, its link state should
track bridge ports that are members of the corresponding VLAN. This works
for newly-added netdevices. However toggling the option does not have the
effect of enabling or disabling the behavior as appropriate.

In this patch, react to bridge_binding toggles on VLAN uppers.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/90a8ca8aea4d81378b29d75d9e562433e0d5c7ff.1734540770.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet: bridge: Extract a helper to handle bridge_binding toggles
Petr Machata [Wed, 18 Dec 2024 17:15:56 +0000 (18:15 +0100)] 
net: bridge: Extract a helper to handle bridge_binding toggles

Currently, the BROPT_VLAN_BRIDGE_BINDING bridge option is only toggled when
VLAN devices are added on top of a bridge or removed from it. Extract the
toggling of the option to a function so that it could be invoked by a
subsequent patch when the state of an upper VLAN device changes.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/a7455f6fe1dfa7b13126ed8a7fb33d3b611eecb8.1734540770.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoinetpeer: avoid false sharing in inet_peer_xrlim_allow()
Eric Dumazet [Thu, 19 Dec 2024 15:03:30 +0000 (15:03 +0000)] 
inetpeer: avoid false sharing in inet_peer_xrlim_allow()

Under DoS, inet_peer_xrlim_allow() might be called millions
of times per second from different cpus.

Make sure to write over peer->rate_tokens and peer->rate_last
only when really needed.

Note that the inherent races of this function are still there;
we do not care about precise ICMP rate limiting.
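
A minimal single-threaded sketch of the "write only when needed" idea,
assuming a simple token bucket. The structure, the burst factor and
example_xrlim_allow() are hypothetical and not the kernel function; the
writes counter exists only to make the avoided stores visible. Real
concurrent code would also need suitably annotated or atomic accesses.

```c
#include <assert.h>
#include <stdbool.h>

#define EXAMPLE_BURST_FACTOR 6	/* illustrative burst multiplier */

/* Hypothetical stand-in for the per-peer rate-limit state. */
struct example_peer {
	unsigned long rate_tokens;
	unsigned long rate_last;
	unsigned long writes;	/* counts stores, for demonstration only */
};

static bool example_xrlim_allow(struct example_peer *peer,
				unsigned long now, unsigned long timeout)
{
	unsigned long token, otoken, delta;
	bool allow = false;

	assert(peer);
	token = otoken = peer->rate_tokens;
	delta = now - peer->rate_last;
	if (delta) {
		/* Time advanced: refill the bucket, capped at the burst size. */
		peer->rate_last = now;
		peer->writes++;
		token += delta;
		if (token > EXAMPLE_BURST_FACTOR * timeout)
			token = EXAMPLE_BURST_FACTOR * timeout;
	}
	if (token >= timeout) {
		token -= timeout;
		allow = true;
	}
	/* Store only if the value changed, so a depleted bucket that is hit
	 * again within the same tick leaves the cache line untouched. */
	if (token != otoken) {
		peer->rate_tokens = token;
		peer->writes++;
	}
	return allow;
}
```

In the flood case, repeated calls within one tick against an empty
bucket perform no stores at all, which is what keeps the cache line from
bouncing between CPUs.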

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20241219150330.3159027-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoMerge branch 'hisilicon-hns-deadcoding'
Jakub Kicinski [Fri, 20 Dec 2024 20:56:19 +0000 (12:56 -0800)] 
Merge branch 'hisilicon-hns-deadcoding'

Dr. David Alan Gilbert says:

====================
hisilicon hns deadcoding

From: "Dr. David Alan Gilbert" <linux@treblig.org>

A small set of deadcoding for functions that are not
called, and a couple of function pointers that they
called.

Build tested only; I don't have the hardware.
====================

Link: https://patch.msgid.link/20241218163341.40297-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet: hisilicon: hns: Remove unused enums
Dr. David Alan Gilbert [Wed, 18 Dec 2024 16:33:41 +0000 (16:33 +0000)] 
net: hisilicon: hns: Remove unused enums

The enums dsaf_roce_port_mode, dsaf_roce_port_num and dsaf_roce_qos_sl
are unused after the removal of the reset code.

Remove them.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Jijie Shao <shaojijie@huawei.com>
Link: https://patch.msgid.link/20241218163341.40297-5-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet: hisilicon: hns: Remove reset helpers
Dr. David Alan Gilbert [Wed, 18 Dec 2024 16:33:40 +0000 (16:33 +0000)] 
net: hisilicon: hns: Remove reset helpers

With hns_dsaf_roce_reset() removed in a previous patch, the two
helper member pointers, 'hns_dsaf_roce_srst' and 'hns_dsaf_srst_chns',
are now unread.

Remove them, and the helper functions that they were initialised
to, that is hns_dsaf_srst_chns(), hns_dsaf_srst_chns_acpi(),
hns_dsaf_roce_srst() and hns_dsaf_roce_srst_acpi().

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Jijie Shao <shaojijie@huawei.com>
Link: https://patch.msgid.link/20241218163341.40297-4-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet: hisilicon: hns: Remove unused hns_rcb_start
Dr. David Alan Gilbert [Wed, 18 Dec 2024 16:33:39 +0000 (16:33 +0000)] 
net: hisilicon: hns: Remove unused hns_rcb_start

hns_rcb_start() has been unused since 2016's
commit 454784d85de3 ("net: hns: delete redundancy ring enable operations")

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Jijie Shao <shaojijie@huawei.com>
Link: https://patch.msgid.link/20241218163341.40297-3-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agonet: hisilicon: hns: Remove unused hns_dsaf_roce_reset
Dr. David Alan Gilbert [Wed, 18 Dec 2024 16:33:38 +0000 (16:33 +0000)] 
net: hisilicon: hns: Remove unused hns_dsaf_roce_reset

hns_dsaf_roce_reset() has been unused since 2021's
commit 38d220882426 ("RDMA/hns: Remove support for HIP06")

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Jijie Shao <shaojijie@huawei.com>
Link: https://patch.msgid.link/20241218163341.40297-2-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoixgbevf: Add support for Intel(R) E610 device
Piotr Kwapulinski [Wed, 18 Dec 2024 13:12:38 +0000 (14:12 +0100)] 
ixgbevf: Add support for Intel(R) E610 device

Add support for the Intel(R) E610 Series of network devices. The E610
is based on X550 but adds firmware-managed link, enhanced security
capabilities and support for updated server manageability.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 months agoPCI: Add PCI_VDEVICE_SUB helper macro
Piotr Kwapulinski [Wed, 18 Dec 2024 13:12:37 +0000 (14:12 +0100)] 
PCI: Add PCI_VDEVICE_SUB helper macro

PCI_VDEVICE_SUB generates the pci_device_id struct layout for
the specific PCI device/subdevice. Private data may follow the
output.
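
How such a macro composes with trailing private data can be shown with a
userspace mock. The structure below is a simplified stand-in for struct
pci_device_id, and EXAMPLE_VDEVICE_SUB, the vendor ID and the device IDs
are all made up for illustration; the layout is an assumption modelled
on the description above, not a copy of the kernel macro.

```c
#include <assert.h>

/* Simplified stand-in for the kernel's struct pci_device_id. */
struct example_pci_device_id {
	unsigned int vendor, device;
	unsigned int subvendor, subdevice;
	unsigned int class, class_mask;
	unsigned long driver_data;	/* driver-private data */
};

#define EXAMPLE_VENDOR_ID_ACME 0x1234	/* made-up vendor ID */

/* Hypothetical macro: fills vendor/device/subvendor/subdevice and zeroes
 * class and class_mask, so the next positional initializer supplied by
 * the caller lands in driver_data. */
#define EXAMPLE_VDEVICE_SUB(vend, dev, subvend, subdev) \
	.vendor = EXAMPLE_VENDOR_ID_##vend, .device = (dev), \
	.subvendor = (subvend), .subdevice = (subdev), 0, 0

static const struct example_pci_device_id example_ids[] = {
	{ EXAMPLE_VDEVICE_SUB(ACME, 0x0010, 0x1234, 0x0001), 42 },
	{ 0 }	/* terminating entry */
};

/* Return driver_data of the first entry matching all four IDs, or 0. */
static unsigned long example_lookup(const struct example_pci_device_id *table,
				    unsigned int vendor, unsigned int device,
				    unsigned int subvendor,
				    unsigned int subdevice)
{
	assert(table);
	for (; table->vendor; table++) {
		if (table->vendor == vendor && table->device == device &&
		    table->subvendor == subvendor &&
		    table->subdevice == subdevice)
			return table->driver_data;
	}
	return 0;
}
```

Because the macro ends with two positional zeros, a value written after
it in the same braces initialises the member that follows class_mask,
which is the private data slot in this mock.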

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 months agoixgbe: Enable link management in E610 device
Piotr Kwapulinski [Thu, 5 Dec 2024 08:44:50 +0000 (09:44 +0100)] 
ixgbe: Enable link management in E610 device

Add high level link management support for E610 device. Enable the
following features:
- driver load
- bring up network interface
- IP address assignment
- pass traffic
- show statistics (e.g. via ethtool)
- disable network interface
- driver unload

Co-developed-by: Carolyn Wyborny <carolyn.wyborny@intel.com>
Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com>
Co-developed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Jan Glaza <jan.glaza@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 months agoixgbe: Clean up the E610 link management related code
Piotr Kwapulinski [Thu, 5 Dec 2024 08:44:49 +0000 (09:44 +0100)] 
ixgbe: Clean up the E610 link management related code

Required for enabling the link management in E610 device.

Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 months agoixgbe: Add ixgbe_x540 multiple header inclusion protection
Piotr Kwapulinski [Thu, 5 Dec 2024 08:44:48 +0000 (09:44 +0100)] 
ixgbe: Add ixgbe_x540 multiple header inclusion protection

Required so that x540-specific functions can be adopted by the E610 device.

Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 months agoixgbe: Add support for EEPROM dump in E610 device
Piotr Kwapulinski [Thu, 5 Dec 2024 08:44:47 +0000 (09:44 +0100)] 
ixgbe: Add support for EEPROM dump in E610 device

Add low level support for EEPROM dump for the specified network device.

Co-developed-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Signed-off-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 months agoixgbe: Add support for NVM handling in E610 device
Piotr Kwapulinski [Thu, 5 Dec 2024 08:44:46 +0000 (09:44 +0100)] 
ixgbe: Add support for NVM handling in E610 device

Add low level support for accessing NVM in E610 device. NVM operations are
handled via the Admin Command Interface. Add the following NVM specific
operations:
- acquire, release, read
- validate checksum
- read shadow ram

Co-developed-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Signed-off-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Co-developed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 months agoixgbe: Add link management support for E610 device
Piotr Kwapulinski [Thu, 5 Dec 2024 08:44:45 +0000 (09:44 +0100)] 
ixgbe: Add link management support for E610 device

Add low level link management support for E610 device. Link management
operations are handled via the Admin Command Interface. Add the following
link management operations:
- get link capabilities
- set up link
- get media type
- get link status, link status events
- link power management

Co-developed-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Signed-off-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Co-developed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Jan Glaza <jan.glaza@intel.com>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 months agoixgbe: Add support for E610 device capabilities detection
Piotr Kwapulinski [Thu, 5 Dec 2024 08:44:44 +0000 (09:44 +0100)] 
ixgbe: Add support for E610 device capabilities detection

Add low level support for E610 device capabilities detection. The
capabilities are discovered via the Admin Command Interface. Discover the
following capabilities:
- function caps: vmdq, dcb, rss, rx/tx qs, msix, nvm, orom, reset
- device caps: vsi, fdir, 1588
- phy caps

Co-developed-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Signed-off-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Co-developed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Jan Sokolowski <jan.sokolowski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 months agoixgbe: Add support for E610 FW Admin Command Interface
Piotr Kwapulinski [Thu, 5 Dec 2024 08:44:43 +0000 (09:44 +0100)] 
ixgbe: Add support for E610 FW Admin Command Interface

Add low level support for Admin Command Interface (ACI). ACI is the
Firmware interface used by a driver to communicate with E610 adapter. Add
the following ACI features:
- data structures, macros, register definitions
- commands handling
- events handling

Co-developed-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Signed-off-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Co-developed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 months agoMerge branch 'xdp-a-fistful-of-generic-changes-pt-iii'
Jakub Kicinski [Fri, 20 Dec 2024 03:51:17 +0000 (19:51 -0800)] 
Merge branch 'xdp-a-fistful-of-generic-changes-pt-iii'

Alexander Lobakin says:

====================
xdp: a fistful of generic changes pt. III

XDP for idpf is currently 5.(6) chapters:
* convert Rx to libeth;
* convert Tx and stats to libeth;
* generic XDP and XSk code changes;
* generic XDP and XSk code additions pt. 1;
* generic XDP and XSk code additions pt. 2 (you are here);
* actual XDP for idpf via new libeth_xdp;
* XSk for idpf (via ^).

Part III.3 does the following:
* adds generic functions to build skbs from xdp_buffs (regular and
  XSk) and attach frags to xdp_buffs (regular and XSk);
* adds helper to optimize XSk xmit in drivers.

Everything is prereq for libeth_xdp, but will be useful standalone
as well: less code in drivers, faster XSk XDP_PASS, smaller object
code.
====================

Link: https://patch.msgid.link/20241218174435.1445282-1-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoxsk: add generic XSk &xdp_buff -> skb conversion
Alexander Lobakin [Wed, 18 Dec 2024 17:44:33 +0000 (18:44 +0100)] 
xsk: add generic XSk &xdp_buff -> skb conversion

Same as with converting &xdp_buff to skb on Rx, the code which allocates
a new skb and copies the XSk frame there is identical across the
drivers, so make it generic. This includes copying all the frags if they
are present in the original buff.
System percpu page_pools greatly improve XDP_PASS performance on XSk:
instead of page_alloc() + page_free(), the net core recycles the same
pages, so the only overhead left is memcpy()s. When the Page Pool is
not compiled in, the whole function is a return-NULL (but it always
gets selected when eBPF is enabled).
Note that the passed buff gets freed if the conversion is done w/o any
error, assuming you don't need this buffer after you convert it to an
skb.

Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241218174435.1445282-6-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoxsk: make xsk_buff_add_frag() really add the frag via __xdp_buff_add_frag()
Alexander Lobakin [Wed, 18 Dec 2024 17:44:32 +0000 (18:44 +0100)] 
xsk: make xsk_buff_add_frag() really add the frag via __xdp_buff_add_frag()

Currently, xsk_buff_add_frag() only adds the frag to pool's linked list,
not doing anything with the &xdp_buff. The drivers do that manually and
the logic is the same.
Make it really add an skb frag, just like xdp_buff_add_frag() does, and
free frags on error if needed. This allows removing the duplicated
code from i40e and ice and avoids adding the same code again and again.

Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241218174435.1445282-5-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoxdp: add generic xdp_build_skb_from_buff()
Alexander Lobakin [Wed, 18 Dec 2024 17:44:31 +0000 (18:44 +0100)] 
xdp: add generic xdp_build_skb_from_buff()

The code which builds an skb from an &xdp_buff keeps multiplying itself
around the drivers with almost no changes. Let's try to stop that by
adding a generic function.
Unlike __xdp_build_skb_from_frame(), always allocate an skbuff head
using napi_build_skb() and make use of the available xdp_rxq pointer to
assign the Rx queue index. In case of PP-backed buffer, mark the skb to
be recycled, as every PP user's been switched to recycle skbs.

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241218174435.1445282-4-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agoxdp: add generic xdp_buff_add_frag()
Alexander Lobakin [Wed, 18 Dec 2024 17:44:30 +0000 (18:44 +0100)] 
xdp: add generic xdp_buff_add_frag()

The code piece which would attach a frag to &xdp_buff is almost
identical across the drivers supporting XDP multi-buffer on Rx.
Make it a generic elegant "oneliner".
Also, I see lots of drivers calculating frags_truesize as
`xdp->frame_sz * nr_frags`. I can't say this is fully correct, since
frags might be backed by chunks of different sizes, especially with
stuff like the header split. Even page_pool_alloc() can give you two
different truesizes on two subsequent requests to allocate the same
buffer size. Add a field to &skb_shared_info (unionized as there's no
free slot currently on x86_64) to track the "true" truesize. It can
be used later when updating the skb.
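
The difference between the two accounting schemes can be shown with a
toy model. The structure and helpers below are hypothetical userspace
stand-ins, not the kernel's xdp_buff or skb_shared_info; the sketch only
contrasts the `frame_sz * nr_frags` estimate with a running sum of the
sizes that actually back each frag.

```c
#include <assert.h>

#define EXAMPLE_MAX_FRAGS 4

/* Hypothetical stand-in for a multi-buffer packet's frag bookkeeping. */
struct example_buff {
	unsigned int frame_sz;		/* truesize of the head buffer */
	unsigned int nr_frags;
	unsigned int frags_truesize;	/* running sum of real frag truesizes */
};

/* Attach one fragment and account for the chunk that really backs it.
 * Returns 0 on success, -1 if there is no free fragment slot. */
static int example_add_frag(struct example_buff *buff, unsigned int truesize)
{
	assert(buff);
	if (buff->nr_frags >= EXAMPLE_MAX_FRAGS)
		return -1;
	buff->nr_frags++;
	buff->frags_truesize += truesize;
	return 0;
}

/* The estimate the commit message questions. */
static unsigned int example_estimated_truesize(const struct example_buff *buff)
{
	assert(buff);
	return buff->frame_sz * buff->nr_frags;
}
```

With a 4096-byte head and frags backed by 2048-byte and 4096-byte
chunks, the estimate is 8192 while the tracked sum is 6144, so the two
diverge as soon as the chunk sizes differ.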

Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241218174435.1445282-3-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agopage_pool: add page_pool_dev_alloc_netmem()
Alexander Lobakin [Wed, 18 Dec 2024 17:44:29 +0000 (18:44 +0100)] 
page_pool: add page_pool_dev_alloc_netmem()

Similarly to other _dev shorthands, add one for page_pool_alloc_netmem()
to allocate a netmem using the default Rx GFP flags (ATOMIC | NOWARN) to
make the page -> netmem transition of drivers easier.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241218174435.1445282-2-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 months agogre: Drop ip_route_output_gre().
Guillaume Nault [Wed, 18 Dec 2024 13:17:16 +0000 (14:17 +0100)] 
gre: Drop ip_route_output_gre().

We already have enough variants of ip_route_output*() functions. We
don't need a GRE specific one in the generic route.h header file.

Furthermore, ip_route_output_gre() is only used once, in ipgre_open(),
where it can be easily replaced by a simple call to
ip_route_output_key().

While there, and for clarity, explicitly set .flowi4_scope to
RT_SCOPE_UNIVERSE instead of relying on the implicit zero
initialisation.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/ab7cba47b8558cd4bfe2dc843c38b622a95ee48e.1734527729.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>