]> git.ipfire.org Git - thirdparty/linux.git/log
thirdparty/linux.git
6 weeks agoipv6: Simplify link-local address generation for IPv6 GRE.
Guillaume Nault [Mon, 16 Jun 2025 11:58:29 +0000 (13:58 +0200)] 
ipv6: Simplify link-local address generation for IPv6 GRE.

Since commit 3e6a0243ff00 ("gre: Fix again IPv6 link-local address
generation."), addrconf_gre_config() has stopped handling IP6GRE
devices specially and just calls the regular addrconf_addr_gen()
function to create their link-local IPv6 addresses.

We can thus avoid using addrconf_gre_config() for IP6GRE devices and
use the normal IPv6 initialisation path instead (that is, jump directly
to addrconf_dev_config() in addrconf_init_auto_addrs()).

See commit 3e6a0243ff00 ("gre: Fix again IPv6 link-local address
generation.") for a deeper explanation on how and why GRE devices
started handling their IPv6 link-local address generation specially,
why it was a problem, and why this is not even necessary in most cases
(especially for GRE over IPv6).

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/a9144be9c7ec3cf09f25becae5e8fdf141fde9f6.1750075076.git.gnault@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agonet/mlx4_en: Remove the redundant NULL check for the 'my_ets' object
Andrey Vatoropin [Mon, 16 Jun 2025 04:50:44 +0000 (04:50 +0000)] 
net/mlx4_en: Remove the redundant NULL check for the 'my_ets' object

Static analysis shows that pointer "my_ets" cannot be NULL because it
points to the object "struct ieee_ets".

Remove the extra NULL check. It is meaningless and harms the readability
of the code.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Signed-off-by: Andrey Vatoropin <a.vatoropin@crpt.ru>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250616045034.26000-1-a.vatoropin@crpt.ru
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agoMerge branch 'add-support-for-pse-budget-evaluation-strategy'
Jakub Kicinski [Thu, 19 Jun 2025 02:03:02 +0000 (19:03 -0700)] 
Merge branch 'add-support-for-pse-budget-evaluation-strategy'

Kory Maincent says:

====================
Add support for PSE budget evaluation strategy

This series brings support for budget evaluation strategy in the PSE
subsystem. PSE controllers can set priorities to decide which ports should
be turned off in case of special events like over-current.

This patch series adds support for two budget evaluation strategy.
1. Static Method:

   This method involves distributing power based on PD classification.
   It’s straightforward and stable, the PSE core keeping track of the
   budget and subtracting the power requested by each PD’s class.

   Advantages: Every PD gets its promised power at any time, which
   guarantees reliability.

   Disadvantages: PD classification steps are large, meaning devices
   request much more power than they actually need. As a result, the power
   supply may only operate at, say, 50% capacity, which is inefficient and
   wastes money.

2. Dynamic Method:

   To address the inefficiencies of the static method, vendors like
   Microchip have introduced dynamic power budgeting, as seen in the
   PD692x0 firmware. This method monitors the current consumption per port
   and subtracts it from the available power budget. When the budget is
   exceeded, lower-priority ports are shut down.

   Advantages: This method optimizes resource utilization, saving costs.

   Disadvantages: Low-priority devices may experience instability.

The UAPI allows adding support for software port priority mode managed from
userspace later if needed.

Patches 1-2: Add support for interrupt event report in PSE core, ethtool
     and ethtool specs.
Patch 3: Adds support for interrupt and event report in TPS23881 driver.
Patches 4,5: Add support for PSE power domain in PSE core and ethtool.
Patches 6-8: Add support for budget evaluation strategy in PSE core,
     ethtool and ethtool specs.
Patches 9-11: Add support for port priority and power supplies in PD692x0
      drivers.
Patches 12,13: Add support for port priority in TPS23881 drivers.
====================

Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-0-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agodt-bindings: net: pse-pd: ti,tps23881: Add interrupt description
Kory Maincent (Dent Project) [Tue, 17 Jun 2025 12:12:12 +0000 (14:12 +0200)] 
dt-bindings: net: pse-pd: ti,tps23881: Add interrupt description

Add an interrupt property to the device tree bindings for the TI TPS23881
PSE controller. The interrupt is primarily used to detect classification
and disconnection events, which are essential for managing the PSE
controller in compliance with the PoE standard.

Interrupt support is essential for the proper functioning of the TPS23881
controller. Without it, after a power-on (PWON), the controller will
no longer perform detection and classification. This could lead to
potential hazards, such as connecting a non-PoE device after a PoE device,
which might result in magic smoke.

Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-13-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: pse-pd: tps23881: Add support for static port priority feature
Kory Maincent (Dent Project) [Tue, 17 Jun 2025 12:12:11 +0000 (14:12 +0200)] 
net: pse-pd: tps23881: Add support for static port priority feature

This patch enhances PSE callbacks by introducing support for the static
port priority feature. It extends interrupt management to handle and report
detection, classification, and disconnection events. Additionally, it
introduces the pi_get_pw_req() callback, which provides information about
the power requested by the Powered Devices.

Interrupt support is essential for the proper functioning of the TPS23881
controller. Without it, after a power-on (PWON), the controller will
no longer perform detection and classification. This could lead to
potential hazards, such as connecting a non-PoE device after a PoE device,
which might result in magic smoke.

Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-12-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agodt-bindings: net: pse-pd: microchip,pd692x0: Add manager regulator supply
Kory Maincent (Dent Project) [Tue, 17 Jun 2025 12:12:10 +0000 (14:12 +0200)] 
dt-bindings: net: pse-pd: microchip,pd692x0: Add manager regulator supply

Adds the regulator supply parameter of the managers.
Update also the example as the regulator supply of the PSE PIs
should be the managers itself and not an external regulator.

Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-11-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: pse-pd: pd692x0: Add support for controller and manager power supplies
Kory Maincent (Dent Project) [Tue, 17 Jun 2025 12:12:09 +0000 (14:12 +0200)] 
net: pse-pd: pd692x0: Add support for controller and manager power supplies

Add support for managing the VDD and VDDA power supplies for the PD692x0
PSE controller, as well as the VAUX5 and VAUX3P3 power supplies for the
PD6920x PSE managers.

Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-10-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: pse-pd: pd692x0: Add support for PSE PI priority feature
Kory Maincent (Dent Project) [Tue, 17 Jun 2025 12:12:08 +0000 (14:12 +0200)] 
net: pse-pd: pd692x0: Add support for PSE PI priority feature

This patch extends the PSE callbacks by adding support for the newly
introduced pi_set_prio() callback, enabling the configuration of PSE PI
priorities. The current port priority is now also included in the status
information returned to users.

Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-9-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: ethtool: Add PSE port priority support feature
Kory Maincent (Dent Project) [Tue, 17 Jun 2025 12:12:07 +0000 (14:12 +0200)] 
net: ethtool: Add PSE port priority support feature

This patch expands the status information provided by ethtool for PSE c33
with current port priority and max port priority. It also adds a call to
pse_ethtool_set_prio() to configure the PSE port priority.

Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-8-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: pse-pd: Add support for budget evaluation strategies
Kory Maincent (Dent Project) [Tue, 17 Jun 2025 12:12:06 +0000 (14:12 +0200)] 
net: pse-pd: Add support for budget evaluation strategies

This patch introduces the ability to configure the PSE PI budget evaluation
strategies. Budget evaluation strategies is utilized by PSE controllers to
determine which ports to turn off first in scenarios such as power budget
exceedance.

The pis_prio_max value is used to define the maximum priority level
supported by the controller. Both the current priority and the maximum
priority are exposed to the user through the pse_ethtool_get_status call.

This patch add support for two mode of budget evaluation strategies.
1. Static Method:

   This method involves distributing power based on PD classification.
   It’s straightforward and stable, the PSE core keeping track of the
   budget and subtracting the power requested by each PD’s class.

   Advantages: Every PD gets its promised power at any time, which
   guarantees reliability.

   Disadvantages: PD classification steps are large, meaning devices
   request much more power than they actually need. As a result, the power
   supply may only operate at, say, 50% capacity, which is inefficient and
   wastes money.

   Priority max value is matching the number of PSE PIs within the PSE.

2. Dynamic Method:

   To address the inefficiencies of the static method, vendors like
   Microchip have introduced dynamic power budgeting, as seen in the
   PD692x0 firmware. This method monitors the current consumption per port
   and subtracts it from the available power budget. When the budget is
   exceeded, lower-priority ports are shut down.

   Advantages: This method optimizes resource utilization, saving costs.

   Disadvantages: Low-priority devices may experience instability.

   Priority max value is set by the PSE controller driver.

For now, budget evaluation methods are not configurable and cannot be
mixed. They are hardcoded in the PSE driver itself, as no current PSE
controller supports both methods.

Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-7-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: pse-pd: Add helper to report hardware enable status of the PI
Kory Maincent (Dent Project) [Tue, 17 Jun 2025 12:12:05 +0000 (14:12 +0200)] 
net: pse-pd: Add helper to report hardware enable status of the PI

Refactor code by introducing a helper function to retrieve the hardware
enabled state of the PI, avoiding redundant implementations in the
future.

Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-6-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: ethtool: Add support for new power domains index description
Kory Maincent (Dent Project) [Tue, 17 Jun 2025 12:12:04 +0000 (14:12 +0200)] 
net: ethtool: Add support for new power domains index description

Report the index of the newly introduced PSE power domain to the user,
enabling improved management of the power budget for PSE devices.

Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-5-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: pse-pd: Add support for PSE power domains
Kory Maincent (Dent Project) [Tue, 17 Jun 2025 12:12:03 +0000 (14:12 +0200)] 
net: pse-pd: Add support for PSE power domains

Introduce PSE power domain support as groundwork for upcoming port
priority features. Multiple PSE PIs can now be grouped under a single
PSE power domain, enabling future enhancements like defining available
power budgets, port priority modes, and disconnection policies. This
setup will allow the system to assess whether activating a port would
exceed the available power budget, preventing over-budget states
proactively.

Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-4-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: pse-pd: tps23881: Add support for PSE events and interrupts
Kory Maincent (Dent Project) [Tue, 17 Jun 2025 12:12:02 +0000 (14:12 +0200)] 
net: pse-pd: tps23881: Add support for PSE events and interrupts

Add support for PSE event reporting through interrupts. Set up the newly
introduced devm_pse_irq_helper helper to register the interrupt. Events are
reported for over-current and over-temperature conditions.

Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-3-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: pse-pd: Add support for reporting events
Kory Maincent (Dent Project) [Tue, 17 Jun 2025 12:12:01 +0000 (14:12 +0200)] 
net: pse-pd: Add support for reporting events

Add support for devm_pse_irq_helper() to register PSE interrupts and report
events such as over-current or over-temperature conditions. This follows a
similar approach to the regulator API but also sends notifications using a
dedicated PSE ethtool netlink socket.

Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-2-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: pse-pd: Introduce attached_phydev to pse control
Kory Maincent (Dent Project) [Tue, 17 Jun 2025 12:12:00 +0000 (14:12 +0200)] 
net: pse-pd: Introduce attached_phydev to pse control

In preparation for reporting PSE events via ethtool notifications,
introduce an attached_phydev field in the pse_control structure.
This field stores the phy_device associated with the PSE PI,
ensuring that notifications are sent to the correct network
interface.

The attached_phydev pointer is directly tied to the PHY lifecycle. It
is set when the PHY is registered and cleared when the PHY is removed.
There is no need to use a refcount, as doing so could interfere with
the PHY removal process.

Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-1-78a1a645e2ee@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'phc-support-in-ena-driver'
Jakub Kicinski [Thu, 19 Jun 2025 01:57:32 +0000 (18:57 -0700)] 
Merge branch 'phc-support-in-ena-driver'

David Arinzon says:

====================
PHC support in ENA driver

This patchset adds the support for PHC (PTP Hardware Clock)
in the ENA driver. The documentation part of the patchset
includes additional information, including statistics,
utilization and invocation examples through the testptp
utility.
====================

Link: https://patch.msgid.link/20250617110545.5659-1-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: ena: Add PHC documentation
David Arinzon [Tue, 17 Jun 2025 11:05:45 +0000 (14:05 +0300)] 
net: ena: Add PHC documentation

Provide the relevant information and guidelines
about the feature support in the ENA driver.

Signed-off-by: Amit Bernstein <amitbern@amazon.com>
Signed-off-by: David Arinzon <darinzon@amazon.com>
Link: https://patch.msgid.link/20250617110545.5659-10-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: ena: View PHC stats using debugfs
David Arinzon [Tue, 17 Jun 2025 11:05:44 +0000 (14:05 +0300)] 
net: ena: View PHC stats using debugfs

Add an entry named `phc_stats` to view the PHC statistics.
If PHC is enabled, the stats are printed, as below:

phc_cnt: 0
phc_exp: 0
phc_skp: 0
phc_err_dv: 0
phc_err_ts: 0

If PHC is disabled, no statistics will be displayed.

Signed-off-by: David Arinzon <darinzon@amazon.com>
Link: https://patch.msgid.link/20250617110545.5659-9-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: ena: Add debugfs support to the ENA driver
David Arinzon [Tue, 17 Jun 2025 11:05:43 +0000 (14:05 +0300)] 
net: ena: Add debugfs support to the ENA driver

Adding the base directory of debugfs to the driver.
In order for the folder to be unique per driver instantiation,
the chosen name is the device name.

This commit contains the initialization and the
base folder.

The creation of the base folder may fail, but is considered
non-fatal.

Signed-off-by: David Arinzon <darinzon@amazon.com>
Link: https://patch.msgid.link/20250617110545.5659-8-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: ena: Control PHC enable through devlink
David Arinzon [Tue, 17 Jun 2025 11:05:42 +0000 (14:05 +0300)] 
net: ena: Control PHC enable through devlink

Add the capability to set parameters through the devlink framework.

The parameter used for controlling PHC (enable/disable) details
are as follows:
- Name: enable_phc
- Type: Boolean (true - enable/false - disable)
- Mode: DEVLINK_PARAM_CMODE_DRIVERINIT
- Effect: Changes take place during driver initialization,
          any changes require a devlink reload to take effect.

Signed-off-by: David Arinzon <darinzon@amazon.com>
Link: https://patch.msgid.link/20250617110545.5659-7-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agodevlink: Add new "enable_phc" generic device param
David Arinzon [Tue, 17 Jun 2025 11:05:41 +0000 (14:05 +0300)] 
devlink: Add new "enable_phc" generic device param

Add a new device generic parameter to enable/disable the
PHC (PTP Hardware Clock) functionality in the device associated
with the devlink instance.

Signed-off-by: David Arinzon <darinzon@amazon.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250617110545.5659-6-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: ena: Add devlink port support
David Arinzon [Tue, 17 Jun 2025 11:05:40 +0000 (14:05 +0300)] 
net: ena: Add devlink port support

Add the basic functionality to support devlink port
for devlink model completeness purposes.
Current support is for registration/un-registration.

Signed-off-by: David Arinzon <darinzon@amazon.com>
Link: https://patch.msgid.link/20250617110545.5659-5-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: ena: Add device reload capability through devlink
David Arinzon [Tue, 17 Jun 2025 11:05:39 +0000 (14:05 +0300)] 
net: ena: Add device reload capability through devlink

Adding basic devlink capability support of reloading the driver.
This capability is required to support driver init type
devlink params (DEVLINK_PARAM_CMODE_DRIVERINIT). Such params
require reloading of the driver (destroy/restore sequence).
The reloading is done by the devlink framework using the
hooks provided by the driver.

Signed-off-by: David Arinzon <darinzon@amazon.com>
Link: https://patch.msgid.link/20250617110545.5659-4-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: ena: PHC silent reset
David Arinzon [Tue, 17 Jun 2025 11:05:38 +0000 (14:05 +0300)] 
net: ena: PHC silent reset

Each PHC device kernel registration receives a unique kernel index,
which is associated with a new PHC device file located at
"/dev/ptp<index>".
This device file serves as an interface for obtaining PHC timestamps.
Examples of tools that use "/dev/ptp" include testptp [1]
and chrony [2].

A reset flow may occur in the ENA driver while PHC is active.
During a reset, the driver will unregister and then re-register the
PHC device with the kernel.
Under race conditions, particularly during heavy PHC loads,
the driver’s reset flow might complete faster than the kernel’s PHC
unregister/register process.
This can result in the PHC index being different from what it was prior
to the reset, as the PHC index is selected using kernel ID
allocation [3].

While driver rmmod/insmod are done by the user, a reset may occur
at anytime, without the user awareness, consequently, the driver
might receive a new PHC index after the reset, potentially compromising
the user experience.

To prevent this issue, the PHC flow will detect the reset during PHC
destruction and will skip the PHC unregister/register calls to preserve
the kernel PHC index.
During the reset flow, any attempt to get a PHC timestamp will fail as
expected, but the kernel PHC index will remain unchanged.

[1]: https://github.com/torvalds/linux/blob/v6.1/tools/testing/selftests/ptp/testptp.c
[2]: https://github.com/mlichvar/chrony
[3]: https://www.kernel.org/doc/html/latest/core-api/idr.html

Signed-off-by: Amit Bernstein <amitbern@amazon.com>
Signed-off-by: David Arinzon <darinzon@amazon.com>
Link: https://patch.msgid.link/20250617110545.5659-3-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: ena: Add PHC support in the ENA driver
David Arinzon [Tue, 17 Jun 2025 11:05:37 +0000 (14:05 +0300)] 
net: ena: Add PHC support in the ENA driver

The ENA driver will be extended to support the new PHC feature using
ptp_clock interface [1]. this will provide timestamp reference for user
space to allow measuring time offset between the PHC and the system
clock in order to achieve nanosecond accuracy.

[1] - https://www.kernel.org/doc/html/latest/driver-api/ptp.html

Signed-off-by: Amit Bernstein <amitbern@amazon.com>
Signed-off-by: David Arinzon <darinzon@amazon.com>
Link: https://patch.msgid.link/20250617110545.5659-2-darinzon@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'udp_tunnel-remove-rtnl_lock-dependency'
Jakub Kicinski [Thu, 19 Jun 2025 01:53:53 +0000 (18:53 -0700)] 
Merge branch 'udp_tunnel-remove-rtnl_lock-dependency'

Stanislav Fomichev says:

====================
udp_tunnel: remove rtnl_lock dependency

Recently bnxt had to grow back a bunch of rtnl dependencies because
of udp_tunnel's infra. Add separate (global) mutext to protect
udp_tunnel state.
====================

Link: https://patch.msgid.link/20250616162117.287806-1-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoRevert "bnxt_en: bring back rtnl_lock() in the bnxt_open() path"
Stanislav Fomichev [Mon, 16 Jun 2025 16:21:17 +0000 (09:21 -0700)] 
Revert "bnxt_en: bring back rtnl_lock() in the bnxt_open() path"

This reverts commit 325eb217e41fa14f307c7cc702bd18d0bb38fe84.

udp_tunnel infra doesn't need RTNL, should be safe to get back
to only netdev instance lock.

Cc: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Link: https://patch.msgid.link/20250616162117.287806-7-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonetdevsim: remove udp_ports_sleep
Stanislav Fomichev [Mon, 16 Jun 2025 16:21:16 +0000 (09:21 -0700)] 
netdevsim: remove udp_ports_sleep

Now that there is only one path in udp_tunnel, there is no need to
have udp_ports_sleep knob. Remove it and adjust the test.

Cc: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Link: https://patch.msgid.link/20250616162117.287806-6-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: remove redundant ASSERT_RTNL() in queue setup functions
Stanislav Fomichev [Mon, 16 Jun 2025 16:21:15 +0000 (09:21 -0700)] 
net: remove redundant ASSERT_RTNL() in queue setup functions

The existing netdev_ops_assert_locked() already asserts that either
the RTNL lock or the per-device lock is held, making the explicit
ASSERT_RTNL() redundant.

Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Link: https://patch.msgid.link/20250616162117.287806-5-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoudp_tunnel: remove rtnl_lock dependency
Stanislav Fomichev [Mon, 16 Jun 2025 16:21:14 +0000 (09:21 -0700)] 
udp_tunnel: remove rtnl_lock dependency

Drivers that are using ops lock and don't depend on RTNL lock
still need to manage it because udp_tunnel's RTNL dependency.
Introduce new udp_tunnel_nic_lock and use it instead of
rtnl_lock. Drop non-UDP_TUNNEL_NIC_INFO_MAY_SLEEP mode from
udp_tunnel infra (udp_tunnel_nic_device_sync_work needs to
grab udp_tunnel_nic_lock mutex and might sleep).

Cover more places in v4:

- netlink
  - udp_tunnel_notify_add_rx_port (ndo_open)
    - triggers udp_tunnel_nic_device_sync_work
  - udp_tunnel_notify_del_rx_port (ndo_stop)
    - triggers udp_tunnel_nic_device_sync_work
  - udp_tunnel_get_rx_info (__netdev_update_features)
    - triggers NETDEV_UDP_TUNNEL_PUSH_INFO
  - udp_tunnel_drop_rx_info (__netdev_update_features)
    - triggers NETDEV_UDP_TUNNEL_DROP_INFO
  - udp_tunnel_nic_reset_ntf (ndo_open)

- notifiers
  - udp_tunnel_nic_netdevice_event, depending on the event:
    - triggers NETDEV_UDP_TUNNEL_PUSH_INFO
    - triggers NETDEV_UDP_TUNNEL_DROP_INFO

- ethnl_tunnel_info_reply_size
- udp_tunnel_nic_set_port_priv (two intel drivers)

Cc: Michael Chan <michael.chan@broadcom.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Link: https://patch.msgid.link/20250616162117.287806-4-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agovxlan: drop sock_lock
Stanislav Fomichev [Mon, 16 Jun 2025 16:21:13 +0000 (09:21 -0700)] 
vxlan: drop sock_lock

We won't be able to sleep soon in vxlan_offload_rx_ports and won't be
able to grab sock_lock. Instead of having separate spinlock to
manage sockets, rely on rtnl lock. This is similar to how geneve
manages its sockets.

Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250616162117.287806-3-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agogeneve: rely on rtnl lock in geneve_offload_rx_ports
Stanislav Fomichev [Mon, 16 Jun 2025 16:21:12 +0000 (09:21 -0700)] 
geneve: rely on rtnl lock in geneve_offload_rx_ports

udp_tunnel_push_rx_port will grab mutex in the next patch so
we can't use rcu. geneve_offload_rx_ports is called
from geneve_netdevice_event for NETDEV_UDP_TUNNEL_PUSH_INFO and
NETDEV_UDP_TUNNEL_DROP_INFO which both have ASSERT_RTNL.
Entries are added to and removed from the sock_list under rtnl
lock as well (when adding or removing a tunneling device).

Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250616162117.287806-2-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agotcp: Remove inet_hashinfo2_free_mod()
Yue Haibing [Tue, 17 Jun 2025 13:06:13 +0000 (21:06 +0800)] 
tcp: Remove inet_hashinfo2_free_mod()

DCCP was removed, inet_hashinfo2_free_mod() is unused now.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250617130613.498659-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agodpaa_eth: don't use fixed_phy_change_carrier
Heiner Kallweit [Mon, 16 Jun 2025 21:24:05 +0000 (23:24 +0200)] 
dpaa_eth: don't use fixed_phy_change_carrier

This effectively reverts 6e8b0ff1ba4c ("dpaa_eth: Add change_carrier()
for Fixed PHYs"). Usage of fixed_phy_change_carrier() requires that
fixed_phy_register() has been called before, directly or indirectly.
And that's not the case in this driver.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/7eb189b3-d5fd-4be6-8517-a66671a4e4e3@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet/mlx4e: Don't redefine IB_MTU_XXX enum
Mark Zhang [Tue, 17 Jun 2025 08:06:30 +0000 (11:06 +0300)] 
net/mlx4e: Don't redefine IB_MTU_XXX enum

Rely on existing IB_MTU_XXX definitions which exist in ib_verbs.h.

Reviewed-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/382c91ee506e7f1f3c1801957df6b28963484b7d.1750147222.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonfc: Remove checks for nla_data returning NULL
Simon Horman [Tue, 17 Jun 2025 08:45:39 +0000 (09:45 +0100)] 
nfc: Remove checks for nla_data returning NULL

The implementation of nla_data is as follows:

static inline void *nla_data(const struct nlattr *nla)
{
return (char *) nla + NLA_HDRLEN;
}

Excluding the case where nla is exactly -NLA_HDRLEN, it will not return
NULL. And it seems misleading to assume that it can, other than in this
corner case. So drop checks for this condition.

Flagged by Smatch.

Compile tested only.

Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250617-nfc-null-data-v1-1-c7525ead2e95@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'eth-migrate-more-drivers-to-new-rxfh-callbacks'
Jakub Kicinski [Wed, 18 Jun 2025 20:19:08 +0000 (13:19 -0700)] 
Merge branch 'eth-migrate-more-drivers-to-new-rxfh-callbacks'

Jakub Kicinski says:

====================
eth: migrate more drivers to new RXFH callbacks

Migrate a batch of drivers to the recently added dedicated
.get_rxfh_fields and .set_rxfh_fields ethtool callbacks.
====================

Link: https://patch.msgid.link/20250617014848.436741-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoeth: sxgbe: migrate to new RXFH callbacks
Jakub Kicinski [Tue, 17 Jun 2025 01:48:48 +0000 (18:48 -0700)] 
eth: sxgbe: migrate to new RXFH callbacks

Migrate to new callbacks added by commit 9bb00786fc61 ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").

RXFH is all this driver supports in RXNFC so old callbacks are
completely removed.

Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20250617014848.436741-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoeth: dpaa2: migrate to new RXFH callbacks
Jakub Kicinski [Tue, 17 Jun 2025 01:48:47 +0000 (18:48 -0700)] 
eth: dpaa2: migrate to new RXFH callbacks

Migrate to new callbacks added by commit 9bb00786fc61 ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").

Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20250617014848.436741-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoeth: dpaa: migrate to new RXFH callbacks
Jakub Kicinski [Tue, 17 Jun 2025 01:48:46 +0000 (18:48 -0700)] 
eth: dpaa: migrate to new RXFH callbacks

Migrate to new callbacks added by commit 9bb00786fc61 ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").

RXFH is all this driver supports in RXNFC so old callbacks are
completely removed.

Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20250617014848.436741-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoeth: mvpp2: migrate to new RXFH callbacks
Jakub Kicinski [Tue, 17 Jun 2025 01:48:45 +0000 (18:48 -0700)] 
eth: mvpp2: migrate to new RXFH callbacks

Migrate to new callbacks added by commit 9bb00786fc61 ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").

Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20250617014848.436741-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoeth: niu: migrate to new RXFH callbacks
Jakub Kicinski [Tue, 17 Jun 2025 01:48:44 +0000 (18:48 -0700)] 
eth: niu: migrate to new RXFH callbacks

Migrate to new callbacks added by commit 9bb00786fc61 ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").

Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20250617014848.436741-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'eth-migrate-some-drivers-to-new-rxfh-callbacks'
Jakub Kicinski [Wed, 18 Jun 2025 20:17:52 +0000 (13:17 -0700)] 
Merge branch 'eth-migrate-some-drivers-to-new-rxfh-callbacks'

Jakub Kicinski says:

====================
eth: migrate some drivers to new RXFH callbacks

Migrate a batch of drivers to the recently added dedicated
.get_rxfh_fields and .set_rxfh_fields ethtool callbacks.
====================

Link: https://patch.msgid.link/20250617014555.434790-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoeth: otx2: migrate to new RXFH callbacks
Jakub Kicinski [Tue, 17 Jun 2025 01:45:55 +0000 (18:45 -0700)] 
eth: otx2: migrate to new RXFH callbacks

Migrate to new callbacks added by commit 9bb00786fc61 ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").

Link: https://patch.msgid.link/20250617014555.434790-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoeth: thunder: migrate to new RXFH callbacks
Jakub Kicinski [Tue, 17 Jun 2025 01:45:54 +0000 (18:45 -0700)] 
eth: thunder: migrate to new RXFH callbacks

Migrate to new callbacks added by commit 9bb00786fc61 ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").

The driver has no other RXNFC functionality so the SET callback can
be now removed.

Link: https://patch.msgid.link/20250617014555.434790-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoeth: ena: migrate to new RXFH callbacks
Jakub Kicinski [Tue, 17 Jun 2025 01:45:53 +0000 (18:45 -0700)] 
eth: ena: migrate to new RXFH callbacks

Migrate to new callbacks added by commit 9bb00786fc61 ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").

The driver has no other RXNFC functionality so the SET callback can
be now removed.

Reviewed-by: David Arinzon <darinzon@amazon.com>
Link: https://patch.msgid.link/20250617014555.434790-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoeth: bnxt: migrate to new RXFH callbacks
Jakub Kicinski [Tue, 17 Jun 2025 01:45:52 +0000 (18:45 -0700)] 
eth: bnxt: migrate to new RXFH callbacks

Migrate to new callbacks added by commit 9bb00786fc61 ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").

Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250617014555.434790-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoeth: bnx2x: migrate to new RXFH callbacks
Jakub Kicinski [Tue, 17 Jun 2025 01:45:51 +0000 (18:45 -0700)] 
eth: bnx2x: migrate to new RXFH callbacks

Migrate to new callbacks added by commit 9bb00786fc61 ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").

The driver has no other RXNFC functionality so the SET callback can
be now removed.

Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://patch.msgid.link/20250617014555.434790-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'netconsole-msgid' into main
David S. Miller [Wed, 18 Jun 2025 09:46:31 +0000 (10:46 +0100)] 
Merge branch 'netconsole-msgid' into main

Gustavo Luiz Duarte says:

====================
netconsole: Add support for msgid in sysdata

This patch series introduces a new feature to netconsole which allows
appending a message ID to the userdata dictionary.

If the msgid feature is enabled, the message ID is built from a per-target 32
bit counter that is incremented and appended to every message sent to the target.

Example::
  echo 1 > "/sys/kernel/config/netconsole/cmdline0/userdata/msgid_enabled"
  echo "This is message #1" > /dev/kmsg
  echo "This is message #2" > /dev/kmsg
  13,434,54928466,-;This is message #1
   msgid=1
  13,435,54934019,-;This is message #2
   msgid=2

This feature can be used by the target to detect if messages were dropped or
reordered before reaching the target. This allows system administrators to
assess the reliability of their netconsole pipeline and detect loss of messages
due to network contention or temporary unavailability.

---
Changes in v3:
- Add kdoc documentation for msgcounter.
- Link to v2: https://lore.kernel.org/r/20250612-netconsole-msgid-v2-0-d4c1abc84bac@gmail.com

Changes in v2:
- Use wrapping_assign_add() to avoid warnings in UBSAN and friends.
- Improve documentation to clarify wrapping and distinguish msgid from sequnum.
- Rebase and fix conflict in prepare_extradata().
- Link to v1: https://lore.kernel.org/r/20250611-netconsole-msgid-v1-0-1784a51feb1e@gmail.com
====================

Suggested-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Gustavo Luiz Duarte <gustavold@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 weeks agodocs: netconsole: document msgid feature
Gustavo Luiz Duarte [Mon, 16 Jun 2025 17:08:39 +0000 (10:08 -0700)] 
docs: netconsole: document msgid feature

Add documentation explaining the msgid feature in netconsole.

This feature appends unique id to the userdata dictionary. The message
ID is populated from a per-target 32 bit counter which is incremented
for each message sent to the target. This allows a target to detect if
messages are dropped before reaching the target.

Signed-off-by: Gustavo Luiz Duarte <gustavold@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 weeks agoselftests: netconsole: Add tests for 'msgid' feature in sysdata
Gustavo Luiz Duarte [Mon, 16 Jun 2025 17:08:38 +0000 (10:08 -0700)] 
selftests: netconsole: Add tests for 'msgid' feature in sysdata

Extend the self-tests to cover the 'msgid' feature in sysdata.

Verify that msgid is appended to the message when the feature is enabled
and that it is not appended when the feature is disabled.

Signed-off-by: Gustavo Luiz Duarte <gustavold@gmail.com>
Reviewed-by: Breno Leitao <leitao@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 weeks agonetconsole: append msgid to sysdata
Gustavo Luiz Duarte [Mon, 16 Jun 2025 17:08:37 +0000 (10:08 -0700)] 
netconsole: append msgid to sysdata

Add msgcounter to the netconsole_target struct to generate message IDs.
If the msgid_enabled attribute is true, increment msgcounter and append
msgid=<msgcounter> to sysdata buffer before sending the message.

Signed-off-by: Gustavo Luiz Duarte <gustavold@gmail.com>
Reviewed-by: Breno Leitao <leitao@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 weeks agonetconsole: implement configfs for msgid_enabled
Gustavo Luiz Duarte [Mon, 16 Jun 2025 17:08:36 +0000 (10:08 -0700)] 
netconsole: implement configfs for msgid_enabled

Implement the _show and _store functions for the msgid_enabled configfs
attribute under userdata.
Set the sysdata_fields bit accordingly.

Reviewed-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Gustavo Luiz Duarte <gustavold@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 weeks agonetconsole: introduce 'msgid' as a new sysdata field
Gustavo Luiz Duarte [Mon, 16 Jun 2025 17:08:35 +0000 (10:08 -0700)] 
netconsole: introduce 'msgid' as a new sysdata field

This adds a new sysdata field to enable assigning a per-target unique id
to each message sent to that target. This id can later be appended as
part of sysdata, allowing targets to detect dropped netconsole messages.
Update count_extradata_entries() to take the new field into account.

Reviewed-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Gustavo Luiz Duarte <gustavold@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 weeks agodpll: remove documentation of rclk_dev_name
Simon Horman [Mon, 16 Jun 2025 12:58:35 +0000 (13:58 +0100)] 
dpll: remove documentation of rclk_dev_name

Remove documentation of rclk_dev_name member of dpll_device which
doesn't exist.

Flagged by ./scripts/kernel-doc -none

Introduced by commit 9431063ad323 ("dpll: core: Add DPLL framework base
functions")

Signed-off-by: Simon Horman <horms@kernel.org>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250616-dpll-member-v1-1-8c9e6b8e1fd4@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next...
Jakub Kicinski [Wed, 18 Jun 2025 01:50:57 +0000 (18:50 -0700)] 
Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
libeth: add libeth_xdp helper lib

Alexander Lobakin says:

Time to add XDP helpers infra to libeth to greatly simplify adding
XDP to idpf and iavf, as well as improve and extend XDP in ice and
i40e. Any vendor is free to reuse helpers. If this happens, I'm fine
with moving the folder of out intel/.

The helpers greatly simplify building xdp_buff, running a prog,
handling the verdict, implement XDP_TX, .ndo_xdp_xmit, XDP buffer
completion. Same applies to XSk (with XSk xmit instead of
.ndo_xdp_xmit, plus stuff like XSk wakeup).
They are entirely generic with no HW definitions or assumptions.
HW-specific stuff like parsing Rx desc / filling Tx desc is passed
from the driver as inline callbacks.

For now, key assumptions that optimize performance / avoid code
bloat, but might not fit every driver in driver/net/:
 * netmem holding the buffers are always order-0;
 * driver has separate XDP Tx queues, doesn't use stack queues for
   that. For best efficiency, you may want to have nr_cpu_ids XDP
   queues, but less (queue sharing) is also supported;
 * XDP Tx queues are interrupt-less and use "lazy" cleaning only
   when there are less than 1/4 free Tx descriptors of the queue
   size;
 * main target platforms are 64-bit, although 32-bit is also fully
   supported, but the code might be not as optimized for them.

Library code already supports multi-buffer for all kinds of Tx and
both header split and no split for Rx and Tx. Frags can come from
devmem/io_uring etc., direct `struct page *` is used only for header
buffers for which it's always true.
Drivers are free to pass their own Rx hints and XSK xmit hints ops.

XDP_TX and ndo_xdp_xmit use onstack bulk for the frames to be sent
and send them by batches of 16 buffers. This eats ~280 bytes on the
stack, but gives good boosts and allow to greatly optimize the main
sending function leaving it without any error/exception paths.

XSk xmit fills Tx descriptors in the loop unrolled by 8. This was
proven to improve perf on ice and i40e. XDP_TX and ndo_xdp_xmit
doesn't use unrolling as I wasn't able to get any improvements in
those scenenarios from this, while +1 Kb for their sending functions
for nothing doesn't sound reasonable.

XSk wakeup, instead of traditionally used "SW interrupts" provided
by NICs, uses IPI to schedule NAPI on the CPU corresponding to the
given queue pair. It gives better control over CPU distribution and
in general performs way better than "SW interrupts", plus allows us
to not pass any HW-specific callbacks there.

The code is built the way that all callbacks passed from drivers
get inlined; in general, most of hotpath gets inlined. Everything
slow/exception lands to .c files in the libeth folder, doesn't
create copies in the drivers themselves and doesn't overloat
hotpath.
Sure, inlining means that hotpath will be compiled into every driver
that uses the lib, but the core code is written in one place, so no
copying of bugs happens. Fixed once -- works everywhere.

The last commit might look like sorta hack, but it gives really good
boosts and decreases object code size, plus there are checks that
all those wider accesses are fully safe, so I don't feel anything
bad about it.

An example of using libeth_xdp can be found either on my GitHub or
on the mailing lists here ("XDP for idpf"). Macros for building
driver XDP functions lead to that some implementations (XDP_TX,
ndo_xdp_xmit etc.) consist of really only a few lines.

* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  libeth: xdp, xsk: access adjacent u32s as u64 where applicable
  libeth: xsk: add XSkFQ refill and XSk wakeup helpers
  libeth: xsk: add XSk Rx processing support
  libeth: xsk: add XSk xmit functions
  libeth: xsk: add XSk XDP_TX sending helpers
  libeth: xdp: add RSS hash hint and XDP features setup helpers
  libeth: xdp: add templates for building driver-side callbacks
  libeth: xdp: add XDP prog run and verdict result handling
  libeth: xdp: add helpers for preparing/processing &libeth_xdp_buff
  libeth: xdp: add XDPSQ cleanup timers
  libeth: xdp: add XDPSQ locking helpers
  libeth: xdp: add XDPSQE completion helpers
  libeth: xdp: add .ndo_xdp_xmit() helpers
  libeth: xdp: add XDP_TX buffers sending
  libeth: support native XDP and register memory model
  libeth: convert to netmem
  libeth, libie: clean symbol exports up a little
====================

Link: https://patch.msgid.link/20250616201639.710420-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'net-mlx5e-add-support-for-devmem-and-io_uring-tcp-zero-copy'
Jakub Kicinski [Wed, 18 Jun 2025 01:34:15 +0000 (18:34 -0700)] 
Merge branch 'net-mlx5e-add-support-for-devmem-and-io_uring-tcp-zero-copy'

Mark Bloch says:

====================
net/mlx5e: Add support for devmem and io_uring TCP zero-copy

This series adds support for zerocopy rx TCP with devmem and io_uring
for ConnectX7 NICs and above. For performance reasons and simplicity
HW-GRO will also be turned on when header-data split mode is on.

Performance
===========

Test setup:

* CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (single NUMA)
* NIC: ConnectX7
* Benchmarking tool: kperf [0]
* Single TCP flow
* Test duration: 60s

With application thread and interrupts pinned to the *same* core:

|------+-----------+----------|
| MTU  | epoll     | io_uring |
|------+-----------+----------|
| 1500 | 61.6 Gbps | 114 Gbps |
| 4096 | 69.3 Gbps | 151 Gbps |
| 9000 | 67.8 Gbps | 187 Gbps |
|------+-----------+----------|

The CPU usage for io_uring is 95%.

Reproduction steps for io_uring:

server --no-daemon -a 2001:db8::1 --no-memcmp --iou --iou_sendzc \
--iou_zcrx --iou_dev_name eth2 --iou_zcrx_queue_id 2

server --no-daemon -a 2001:db8::2 --no-memcmp --iou --iou_sendzc

client --src 2001:db8::2 --dst 2001:db8::1 \
--msg-zerocopy -t 60 --cpu-min=2 --cpu-max=2

Patch overview:
================

First, a netmem API for skb_can_coalesce is added to the core to be able
to do skb fragment coalescing on netmems.

The next patches introduce some cleanups in the internal SHAMPO code and
improvements to hw gro capability checks in FW.

A separate page_pool is introduced for headers, to be used only when
the rxq has a memory provider.

Then the driver is converted to use the netmem API and to allow support
for unreadable netmem page pool.

The queue management ops are implemented.

Finally, the tcp-data-split ring parameter is exposed.
References
==========
[0] kperf: git://git.kernel.dk/kperf.git
v1: https://lore.kernel.org/20250116215530.158886-1-saeed@kernel.org
v2: https://lore.kernel.org/1747950086-1246773-1-git-send-email-tariqt@nvidia.com
v3: https://lore.kernel.org/20250609145833.990793-1-mbloch@nvidia.com
v4: https://lore.kernel.org/20250610150950.1094376-1-mbloch@nvidia.com
v5: https://lore.kernel.org/20250612154648.1161201-1-mbloch@nvidia.com
====================

Link: https://patch.msgid.link/20250616141441.1243044-1-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet/mlx5e: Add TX support for netmems
Dragos Tatulea [Mon, 16 Jun 2025 14:14:41 +0000 (17:14 +0300)] 
net/mlx5e: Add TX support for netmems

Declare netmem TX support in netdev.

As required, use the netmem aware dma unmapping APIs
for unmapping netmems in tx completion path.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250616141441.1243044-13-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet/mlx5e: Support ethtool tcp-data-split settings
Saeed Mahameed [Mon, 16 Jun 2025 14:14:40 +0000 (17:14 +0300)] 
net/mlx5e: Support ethtool tcp-data-split settings

In mlx5, tcp header-data split requires HW GRO to be on.

Enabling it fails when HW GRO is off.
mlx5e_fix_features now keeps HW GRO on when tcp data split is enabled.
Finally, when tcp data split is disabled, features are updated to maybe
remove the forced HW GRO.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250616141441.1243044-12-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet/mlx5e: Implement queue mgmt ops and single channel swap
Saeed Mahameed [Mon, 16 Jun 2025 14:14:39 +0000 (17:14 +0300)] 
net/mlx5e: Implement queue mgmt ops and single channel swap

The bulk of the work is done in mlx5e_queue_mem_alloc, where we allocate
and create the new channel resources, similar to
mlx5e_safe_switch_params, but here we do it for a single channel using
existing params, sort of a clone channel.
To swap the old channel with the new one, we deactivate and close the
old channel then replace it with the new one, since the swap procedure
doesn't fail in mlx5, we do it all in one place (mlx5e_queue_start).

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Acked-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250616141441.1243044-11-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet/mlx5e: Add support for UNREADABLE netmem page pools
Saeed Mahameed [Mon, 16 Jun 2025 14:14:38 +0000 (17:14 +0300)] 
net/mlx5e: Add support for UNREADABLE netmem page pools

On netdev_rx_queue_restart, a special type of page pool maybe expected.

In this patch declare support for UNREADABLE netmem iov pages in the
pool params only when header data split shampo RQ mode is enabled, also
set the queue index in the page pool params struct.

Shampo mode requirement: Without header split rx needs to peek at the data,
we can't do UNREADABLE_NETMEM.

The patch also enables the use of a separate page pool for headers when
a memory provider is installed for the queue, otherwise the same common
page pool continues to be used.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250616141441.1243044-10-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet/mlx5e: Convert over to netmem
Saeed Mahameed [Mon, 16 Jun 2025 14:14:37 +0000 (17:14 +0300)] 
net/mlx5e: Convert over to netmem

mlx5e_page_frag holds the physical page itself, to naturally support
zc page pools, remove physical page reference from mlx5 and replace it
with netmem_ref, to avoid internal handling in mlx5 for net_iov backed
pages.

SHAMPO can issue packets that are not split into header and data. These
packets will be dropped if the data part resides in a net_iov as the
driver can't read into this area.

No performance degradation observed.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250616141441.1243044-9-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet/mlx5e: SHAMPO: Separate pool for headers
Saeed Mahameed [Mon, 16 Jun 2025 14:14:36 +0000 (17:14 +0300)] 
net/mlx5e: SHAMPO: Separate pool for headers

Allow allocating a separate page pool for headers when SHAMPO is on.
This will be useful for adding support to zc page pool, which has to be
different from the headers page pool.
For now, the pools are the same.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250616141441.1243044-8-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet/mlx5e: SHAMPO: Improve hw gro capability checking
Saeed Mahameed [Mon, 16 Jun 2025 14:14:35 +0000 (17:14 +0300)] 
net/mlx5e: SHAMPO: Improve hw gro capability checking

Add missing HW capabilities, declare the feature in
netdev->vlan_features, similar to other features in mlx5e_build_nic_netdev.
No functional change here as all by default disabled features are
explicitly disabled at the bottom of the function.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250616141441.1243044-7-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet/mlx5e: SHAMPO: Remove redundant params
Saeed Mahameed [Mon, 16 Jun 2025 14:14:34 +0000 (17:14 +0300)] 
net/mlx5e: SHAMPO: Remove redundant params

Two SHAMPO params are static and always the same, remove them from the
global mlx5e_params struct.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250616141441.1243044-6-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet/mlx5e: SHAMPO: Reorganize mlx5_rq_shampo_alloc
Saeed Mahameed [Mon, 16 Jun 2025 14:14:33 +0000 (17:14 +0300)] 
net/mlx5e: SHAMPO: Reorganize mlx5_rq_shampo_alloc

Drop redundant SHAMPO structure alloc/free functions.

Gather together function calls pertaining to header split info, pass
header per WQE (hd_per_wqe) as parameter to those function to avoid use
before initialization future mistakes.

Allocate HW GRO related info outside of the header related info scope.

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250616141441.1243044-5-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agopage_pool: Add page_pool_dev_alloc_netmems helper
Dragos Tatulea [Mon, 16 Jun 2025 14:14:32 +0000 (17:14 +0300)] 
page_pool: Add page_pool_dev_alloc_netmems helper

This is the netmem counterpart of page_pool_dev_alloc_pages() which
uses the default GFP flags for RX.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250616141441.1243044-4-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: Add skb_can_coalesce for netmem
Dragos Tatulea [Mon, 16 Jun 2025 14:14:31 +0000 (17:14 +0300)] 
net: Add skb_can_coalesce for netmem

Allow drivers that have moved over to netmem to do fragment coalescing.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250616141441.1243044-3-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: Allow const args for of page_to_netmem()
Dragos Tatulea [Mon, 16 Jun 2025 14:14:30 +0000 (17:14 +0300)] 
net: Allow const args for of page_to_netmem()

This allows calling page_to_netmem() with a const page * argument.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250616141441.1243044-2-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: tcp: tsq: Convert from tasklet to BH workqueue
Tejun Heo [Mon, 16 Jun 2025 18:10:47 +0000 (08:10 -1000)] 
net: tcp: tsq: Convert from tasklet to BH workqueue

The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws. To
replace tasklets, BH workqueue support was recently added. A BH workqueue
behaves similarly to regular workqueues except that the queued work items
are executed in the BH context.

This patch converts TCP Small Queues implementation from tasklet to BH
workqueue.

Semantically, this is an equivalent conversion and there shouldn't be any
user-visible behavior changes. While workqueue's queueing and execution
paths are a bit heavier than tasklet's, unless the work item is being queued
every packet, the difference hopefully shouldn't matter.

My experience with the networking stack is very limited and this patch
definitely needs attention from someone who actually understands networking.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/aFBeJ38AS1ZF3Dq5@slm.duckdns.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'ipmr-ip6mr-allow-mc-routing-locally-generated-mc-packets'
Jakub Kicinski [Wed, 18 Jun 2025 01:18:48 +0000 (18:18 -0700)] 
Merge branch 'ipmr-ip6mr-allow-mc-routing-locally-generated-mc-packets'

Petr Machata says:

====================
ipmr, ip6mr: Allow MC-routing locally-generated MC packets

Multicast routing is today handled in the input path. Locally generated MC
packets don't hit the IPMR code. Thus if a VXLAN remote address is
multicast, the driver needs to set an OIF during route lookup. In practice
that means that MC routing configuration needs to be kept in sync with the
VXLAN FDB and MDB. Ideally, the VXLAN packets would be routed by the MC
routing code instead.

To that end, this patchset adds support to route locally generated
multicast packets.

However, an installation that uses a VXLAN underlay netdevice for which it
also has matching MC routes, would get a different routing with this patch.
Previously, the MC packets would be delivered directly to the underlay
port, whereas now they would be MC-routed. In order to avoid this change in
behavior, introduce an IPCB/IP6CB flag. Unless the flag is set, the new
MC-routing code is skipped.

All this is keyed to a new VXLAN attribute, IFLA_VXLAN_MC_ROUTE. Only when
it is set does any of the above engage.

In addition to that, and as is the case today with MC forwarding,
IPV4_DEVCONF_MC_FORWARDING must be enabled for the netdevice that acts as a
source of MC traffic (i.e. the VXLAN PHYS_DEV), so an MC daemon must be
attached to the netdevice.

When a VXLAN netdevice with a MC remote is brought up, the physical
netdevice joins the indicated MC group. This is important for local
delivery of MC packets, so it is still necessary to configure a physical
netdevice -- the parameter cannot go away. The netdevice would however
typically not be a front panel port, but a dummy. An MC daemon would then
sit on top of that netdevice as well as any front panel ports that it needs
to service, and have routes set up between the two.

A way to configure the VXLAN netdevice to take advantage of the new MC
routing would be:

 # ip link add name d up type dummy
 # ip link add name vx10 up type vxlan id 1000 dstport 4789 \
local 192.0.2.1 group 225.0.0.1 ttl 16 dev d mrcoute
 # ip link set dev vx10 master br # plus vlans etc.

With the following MC routes:

 (192.0.2.1, 225.0.0.1) iif=d oil=swp1,swp2 # TX route
 (*, 225.0.0.1) iif=swp1 oil=d,swp2         # RX route
 (*, 225.0.0.1) iif=swp2 oil=d,swp1         # RX route

The RX path has not changed, with the exception of an extra MC hop. Packets
are delivered to the front panel port and MC-forwarded to the VXLAN
physical port, here "d". Since the port has joined the multicast group, the
packets are locally delivered, and end up being processed by the VXLAN
netdevice.

This patchset is based on earlier patches from Nikolay Aleksandrov and
Roopa Prabhu, though it underwent significant changes. Roopa broadly
presented the topic on LPC 2019 [0].

Patchset progression:

- Patches #1 to #4 add ip_mr_output()
- Patches #5 to #10 add ip6_mr_output()
- Patch #11 adds the VXLAN bits to enable MR engagement
- Patches #12 to #14 prepare selftest libraries
- Patch #15 includes a new test suite

[0] https://www.youtube.com/watch?v=xlReECfi-uo
====================

Link: https://patch.msgid.link/cover.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoselftests: forwarding: Add a test for verifying VXLAN MC underlay
Petr Machata [Mon, 16 Jun 2025 22:44:23 +0000 (00:44 +0200)] 
selftests: forwarding: Add a test for verifying VXLAN MC underlay

Add tests for MC-routing underlay VXLAN traffic.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/eecd2c0fefc754182e74be8e8e65751bf5749c21.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoselftests: forwarding: adf_mcd_start(): Allow configuring custom interfaces
Petr Machata [Mon, 16 Jun 2025 22:44:22 +0000 (00:44 +0200)] 
selftests: forwarding: adf_mcd_start(): Allow configuring custom interfaces

Tests may wish to add other interfaces to listen on. Notably locally
generated traffic uses dummy interfaces. The multicast daemon needs to know
about these so that it allows forming rules that involve these interfaces,
and so that net.ipv4.conf.X.mc_forwarding is set for the interfaces.

To that end, allow passing in a list of interfaces to configure in addition
to all the physical ones.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/2e8d83297985933be4850f2b9f296b3c27110388.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoselftests: net: lib: Add ip_link_has_flag()
Petr Machata [Mon, 16 Jun 2025 22:44:21 +0000 (00:44 +0200)] 
selftests: net: lib: Add ip_link_has_flag()

Add a helper to determine whether a given netdevice has a given flag.

Rewrite ip_link_is_up() in terms of the new helper.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/e1eb174a411f9d24735d095984c731d1d4a5a592.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoselftests: forwarding: lib: Move smcrouted helpers here
Petr Machata [Mon, 16 Jun 2025 22:44:20 +0000 (00:44 +0200)] 
selftests: forwarding: lib: Move smcrouted helpers here

router_multicast.sh has several helpers for work with smcrouted. Extract
them to lib.sh so that other selftests can use them as well. Convert the
helpers to defer in the process, because that simplifies the interface
quite a bit. Therefore have router_multicast.sh invoke
defer_scopes_cleanup() in its cleanup() function.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/410411c1a81225ce6e44542289b9c3ec21e5786c.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agovxlan: Support MC routing in the underlay
Petr Machata [Mon, 16 Jun 2025 22:44:19 +0000 (00:44 +0200)] 
vxlan: Support MC routing in the underlay

Locally-generated MC packets have so far not been subject to MC routing.
Instead an MC-enabled installation would maintain the MC routing tables,
and separately from that the list of interfaces to send packets to as part
of the VXLAN FDB and MDB.

In a previous patch, a ip_mr_output() and ip6_mr_output() routines were
added for IPv4 and IPv6. All locally generated MC traffic is now passed
through these functions. For reasons of backward compatibility, an SKB
(IPCB / IP6CB) flag guards the actual MC routing.

This patch adds logic to set the flag, and the UAPI to enable the behavior.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/d899655bb7e9b2521ee8c793e67056b9fd02ba12.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: ipv6: Add ip6_mr_output()
Petr Machata [Mon, 16 Jun 2025 22:44:18 +0000 (00:44 +0200)] 
net: ipv6: Add ip6_mr_output()

Multicast routing is today handled in the input path. Locally generated MC
packets don't hit the IPMR code today. Thus if a VXLAN remote address is
multicast, the driver needs to set an OIF during route lookup. Thus MC
routing configuration needs to be kept in sync with the VXLAN FDB and MDB.
Ideally, the VXLAN packets would be routed by the MC routing code instead.

To that end, this patch adds support to route locally generated multicast
packets. The newly-added routines do largely what ip6_mr_input() and
ip6_mr_forward() do: make an MR cache lookup to find where to send the
packets, and use ip6_output() to send each of them. When no cache entry is
found, the packet is punted to the daemon for resolution.

Similarly to the IPv4 case in a previous patch, the new logic is contingent
on a newly-added IP6CB flag being set.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/3bcc034a3ab4d3c291072fff38f78d7fbbeef4e6.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: ipv6: ip6mr: Split ip6mr_forward2() in two
Petr Machata [Mon, 16 Jun 2025 22:44:17 +0000 (00:44 +0200)] 
net: ipv6: ip6mr: Split ip6mr_forward2() in two

Some of the work of ip6mr_forward2() is specific to IPMR forwarding, and
should not take place on the output path. In order to allow reuse of the
common parts, extract out of the function a helper,
ip6mr_prepare_forward().

Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/8932bd5c0fbe3f662b158803b8509604fa7bc113.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: ipv6: ip6mr: Make ip6mr_forward2() void
Petr Machata [Mon, 16 Jun 2025 22:44:16 +0000 (00:44 +0200)] 
net: ipv6: ip6mr: Make ip6mr_forward2() void

Nobody uses the return value, so convert the function to void.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/e0bee259da0da58da96647ea9e21e6360c8f7e11.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: ipv6: ip6mr: Fix in/out netdev to pass to the FORWARD chain
Petr Machata [Mon, 16 Jun 2025 22:44:15 +0000 (00:44 +0200)] 
net: ipv6: ip6mr: Fix in/out netdev to pass to the FORWARD chain

The netfilter hook is invoked with skb->dev for input netdevice, and
vif_dev for output netdevice. However at the point of invocation, skb->dev
is already set to vif_dev, and MR-forwarded packets are reported with
in=out:

 # ip6tables -A FORWARD -j LOG --log-prefix '[forw]'
 # cd tools/testing/selftests/net/forwarding
 # ./router_multicast.sh
 # dmesg | fgrep '[forw]'
 [ 1670.248245] [forw]IN=v5 OUT=v5 [...]

For reference, IPv4 MR code shows in and out as appropriate.
Fix by caching skb->dev and using the updated value for output netdev.

Fixes: 7bc570c8b4f7 ("[IPV6] MROUTE: Support multicast forwarding.")
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/3141ae8386fbe13fef4b793faa75e6bae58d798a.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: ipv6: Add a flags argument to ip6tunnel_xmit(), udp_tunnel6_xmit_skb()
Petr Machata [Mon, 16 Jun 2025 22:44:14 +0000 (00:44 +0200)] 
net: ipv6: Add a flags argument to ip6tunnel_xmit(), udp_tunnel6_xmit_skb()

ip6tunnel_xmit() erases the contents of the SKB control block. In order to
be able to set particular IP6CB flags on the SKB, add a corresponding
parameter, and propagate it to udp_tunnel6_xmit_skb() as well.

In one of the following patches, VXLAN driver will use this facility to
mark packets as subject to IPv6 multicast routing.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/acb4f9f3e40c3a931236c3af08a720b017fbfbfb.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: ipv6: Make udp_tunnel6_xmit_skb() void
Petr Machata [Mon, 16 Jun 2025 22:44:13 +0000 (00:44 +0200)] 
net: ipv6: Make udp_tunnel6_xmit_skb() void

The function always returns zero, thus the return value does not carry any
signal. Just make it void.

Most callers already ignore the return value. However:

- Refold arguments of the call from sctp_v6_xmit() so that they fit into
  the 80-column limit.

- tipc_udp_xmit() initializes err from the return value, but that should
  already be always zero at that point. So there's no practical change, but
  elision of the assignment prompts a couple more tweaks to clean up the
  function.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/7facacf9d8ca3ca9391a4aee88160913671b868d.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: ipv4: Add ip_mr_output()
Petr Machata [Mon, 16 Jun 2025 22:44:12 +0000 (00:44 +0200)] 
net: ipv4: Add ip_mr_output()

Multicast routing is today handled in the input path. Locally generated MC
packets don't hit the IPMR code today. Thus if a VXLAN remote address is
multicast, the driver needs to set an OIF during route lookup. Thus MC
routing configuration needs to be kept in sync with the VXLAN FDB and MDB.
Ideally, the VXLAN packets would be routed by the MC routing code instead.

To that end, this patch adds support to route locally generated multicast
packets. The newly-added routines do largely what ip_mr_input() and
ip_mr_forward() do: make an MR cache lookup to find where to send the
packets, and use ip_mc_output() to send each of them. When no cache entry
is found, the packet is punted to the daemon for resolution.

However, an installation that uses a VXLAN underlay netdevice for which it
also has matching MC routes, would get a different routing with this patch.
Previously, the MC packets would be delivered directly to the underlay
port, whereas now they would be MC-routed. In order to avoid this change in
behavior, introduce an IPCB flag. Only if the flag is set will
ip_mr_output() actually engage, otherwise it reverts to ip_mc_output().

This code is based on work by Roopa Prabhu and Nikolay Aleksandrov.

Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/0aadbd49330471c0f758d54afb05eb3b6e3a6b65.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: ipv4: ipmr: Split ipmr_queue_xmit() in two
Petr Machata [Mon, 16 Jun 2025 22:44:11 +0000 (00:44 +0200)] 
net: ipv4: ipmr: Split ipmr_queue_xmit() in two

Some of the work of ipmr_queue_xmit() is specific to IPMR forwarding, and
should not take place on the output path. In order to allow reuse of the
common parts, split the function into two: the ipmr_prepare_xmit() helper
that takes care of the common bits, and the ipmr_queue_fwd_xmit(), which
invokes the former and encapsulates the whole forwarding algorithm.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/4e8db165572a4f8bd29a723a801e854e9d20df4d.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: ipv4: ipmr: ipmr_queue_xmit(): Drop local variable `dev'
Petr Machata [Mon, 16 Jun 2025 22:44:10 +0000 (00:44 +0200)] 
net: ipv4: ipmr: ipmr_queue_xmit(): Drop local variable `dev'

The variable is used for caching of rt->dst.dev. The netdevice referenced
therein does not change during the scope of validity of that local. At the
same time, the local is only used twice, and each of these uses will end up
in a different function in the following patches, further eliminating any
use the local could have had.

Drop the local altogether and inline the uses.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/c80600a4b51679fe78f429ccb6d60892c2f9e4de.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: ipv4: Add a flags argument to iptunnel_xmit(), udp_tunnel_xmit_skb()
Petr Machata [Mon, 16 Jun 2025 22:44:09 +0000 (00:44 +0200)] 
net: ipv4: Add a flags argument to iptunnel_xmit(), udp_tunnel_xmit_skb()

iptunnel_xmit() erases the contents of the SKB control block. In order to
be able to set particular IPCB flags on the SKB, add a corresponding
parameter, and propagate it to udp_tunnel_xmit_skb() as well.

In one of the following patches, VXLAN driver will use this facility to
mark packets as subject to IP multicast routing.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Acked-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/89c9daf9f2dc088b6b92ccebcc929f51742de91f.1750113335.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'misc-vlan-cleanups'
Jakub Kicinski [Wed, 18 Jun 2025 01:06:42 +0000 (18:06 -0700)] 
Merge branch 'misc-vlan-cleanups'

Gal Pressman says:

====================
Misc vlan cleanups

This patch series addresses compilation issues with objtool when VLAN
support is disabled (CONFIG_VLAN_8021Q=n) and makes related improvements
to the VLAN infrastructure.

When CONFIG_VLAN_8021Q=n, CONFIG_OBJTOOL=y, and CONFIG_OBJTOOL_WERROR=y,
the following compilation error occurs:

drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.o: error: objtool: parse_mirred.isra.0+0x370: mlx5e_tc_act_vlan_add_push_action() missing __noreturn in .c/.h or NORETURN() in noreturns.h

The error occurs because objtool cannot determine that unreachable BUG()
calls in VLAN code paths are actually dead code when VLAN support is
disabled.

First patch makes is_vlan_dev() a stub when VLAN is not configured,
allows compile-out of VLAN-dependent dead code paths and resolves the
objtool compilation error.

Second patch replaces BUG() calls with WARN_ON_ONCE(), as the usage of
BUG() should be avoided.

Third patch uses the "kernel" way of testing whether an option is
configured as builtin/module, instead of open-coding it.

v2: https://lore.kernel.org/20250610072611.1647593-1-gal@nvidia.com/
====================

Link: https://patch.msgid.link/20250616132626.1749331-1-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: vlan: Use IS_ENABLED() helper for CONFIG_VLAN_8021Q guard
Gal Pressman [Mon, 16 Jun 2025 13:26:26 +0000 (16:26 +0300)] 
net: vlan: Use IS_ENABLED() helper for CONFIG_VLAN_8021Q guard

The header currently tests the VLAN core with an explicit pair of 'if
defined' checks:
    #if defined(CONFIG_VLAN_8021Q) || defined(CONFIG_VLAN_8021Q_MODULE)

Instead, use IS_ENABLED() which is the kernel way to test whether an
option is configured as builtin/module.

This is purely cosmetic – no functional changes.

Reviewed-by: Alex Lazar <alazar@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20250616132626.1749331-4-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: vlan: Replace BUG() with WARN_ON_ONCE() in vlan_dev_* stubs
Gal Pressman [Mon, 16 Jun 2025 13:26:25 +0000 (16:26 +0300)] 
net: vlan: Replace BUG() with WARN_ON_ONCE() in vlan_dev_* stubs

When CONFIG_VLAN_8021Q=n, a set of stub helpers are used, three of these
helpers use BUG() unconditionally.

This code should not be reached, as callers of these functions should
always check for is_vlan_dev() first, but the usage of BUG() is not
recommended, replace it with WARN_ON() instead.

Reviewed-by: Alex Lazar <alazar@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20250616132626.1749331-3-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: vlan: Make is_vlan_dev() a stub when VLAN is not configured
Gal Pressman [Mon, 16 Jun 2025 13:26:24 +0000 (16:26 +0300)] 
net: vlan: Make is_vlan_dev() a stub when VLAN is not configured

Add a stub implementation of is_vlan_dev() that returns false when
VLAN support is not compiled in (CONFIG_VLAN_8021Q=n).

This allows us to compile-out VLAN-dependent dead code when it is not
needed.

This also resolves the following compilation error when:
* CONFIG_VLAN_8021Q=n
* CONFIG_OBJTOOL=y
* CONFIG_OBJTOOL_WERROR=y

drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.o: error: objtool: parse_mirred.isra.0+0x370: mlx5e_tc_act_vlan_add_push_action() missing __noreturn in .c/.h or NORETURN() in noreturns.h

The error occurs because objtool cannot determine that unreachable BUG()
(which doesn't return) calls in VLAN code paths are actually dead code
when VLAN support is disabled.

Signed-off-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20250616132626.1749331-2-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'net-use-new-gpio-line-value-setter-callbacks'
Jakub Kicinski [Wed, 18 Jun 2025 01:04:13 +0000 (18:04 -0700)] 
Merge branch 'net-use-new-gpio-line-value-setter-callbacks'

Bartosz Golaszewski says:

====================
net: use new GPIO line value setter callbacks

Commit 98ce1eb1fd87e ("gpiolib: introduce gpio_chip setters that return
values") added new line setter callbacks to struct gpio_chip. They allow
to indicate failures to callers. We're in the process of converting all
GPIO controllers to using them before removing the old ones. This series
converts all GPIO chips implemented under drivers/net/.

v1: https://lore.kernel.org/20250610-gpiochip-set-rv-net-v1-0-35668dd1c76f@linaro.org
====================

Link: https://patch.msgid.link/20250616-gpiochip-set-rv-net-v2-0-cae0b182a552@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: phy: qca807x: use new GPIO line value setter callbacks
Bartosz Golaszewski [Mon, 16 Jun 2025 07:24:08 +0000 (09:24 +0200)] 
net: phy: qca807x: use new GPIO line value setter callbacks

struct gpio_chip now has callbacks for setting line values that return
an integer, allowing to indicate failures. Convert the driver to using
them.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250616-gpiochip-set-rv-net-v2-5-cae0b182a552@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: can: mcp251x: use new GPIO line value setter callbacks
Bartosz Golaszewski [Mon, 16 Jun 2025 07:24:07 +0000 (09:24 +0200)] 
net: can: mcp251x: use new GPIO line value setter callbacks

struct gpio_chip now has callbacks for setting line values that return
an integer, allowing to indicate failures. Convert the driver to using
them.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Reviewed-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Reviewed-by: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://patch.msgid.link/20250616-gpiochip-set-rv-net-v2-4-cae0b182a552@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: can: mcp251x: propagate the return value of mcp251x_spi_write()
Bartosz Golaszewski [Mon, 16 Jun 2025 07:24:06 +0000 (09:24 +0200)] 
net: can: mcp251x: propagate the return value of mcp251x_spi_write()

Add an integer return value to mcp251x_write_bits() and use it to
propagate the one returned by mcp251x_spi_write(). Return that value on
error in the request() GPIO callback.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Reviewed-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Reviewed-by: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://patch.msgid.link/20250616-gpiochip-set-rv-net-v2-3-cae0b182a552@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: dsa: mt7530: use new GPIO line value setter callbacks
Bartosz Golaszewski [Mon, 16 Jun 2025 07:24:05 +0000 (09:24 +0200)] 
net: dsa: mt7530: use new GPIO line value setter callbacks

struct gpio_chip now has callbacks for setting line values that return
an integer, allowing to indicate failures. Convert the driver to using
them.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Link: https://patch.msgid.link/20250616-gpiochip-set-rv-net-v2-2-cae0b182a552@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: dsa: vsc73xx: use new GPIO line value setter callbacks
Bartosz Golaszewski [Mon, 16 Jun 2025 07:24:04 +0000 (09:24 +0200)] 
net: dsa: vsc73xx: use new GPIO line value setter callbacks

struct gpio_chip now has callbacks for setting line values that return
an integer, allowing to indicate failures. Convert the driver to using
them.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Link: https://patch.msgid.link/20250616-gpiochip-set-rv-net-v2-1-cae0b182a552@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agogve: Return error for unknown admin queue command
Alok Tiwari [Mon, 16 Jun 2025 05:45:01 +0000 (22:45 -0700)] 
gve: Return error for unknown admin queue command

In gve_adminq_issue_cmd(), return -EINVAL instead of 0 when an unknown
admin queue command opcode is encountered.

This prevents the function from silently succeeding on invalid input
and prevents undefined behavior by ensuring the function fails gracefully
when an unrecognized opcode is provided.

These changes improve error handling.

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Link: https://patch.msgid.link/20250616054504.1644770-2-alok.a.tiwari@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agogve: Fix various typos and improve code comments
Alok Tiwari [Mon, 16 Jun 2025 05:45:00 +0000 (22:45 -0700)] 
gve: Fix various typos and improve code comments

- Correct spelling and improves the clarity of comments
   "confiugration" -> "configuration"
   "spilt" -> "split"
   "It if is 0" -> "If it is 0"
   "DQ" -> "DQO" (correct abbreviation)
- Clarify BIT(0) flag usage in gve_get_priv_flags()
- Replaced hardcoded array size with GVE_NUM_PTYPES
  for clarity and maintainability.

These changes are purely cosmetic and do not affect functionality.

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Joe Damato <joe@dama.to>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250616054504.1644770-1-alok.a.tiwari@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoselftests: devmem: add ipv4 support to chunks test
Mina Almasry [Sun, 15 Jun 2025 20:35:11 +0000 (20:35 +0000)] 
selftests: devmem: add ipv4 support to chunks test

Add ipv4 support to the recently added chunks tests, which was added as
ipv6 only.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250615203511.591438-3-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>