Guillaume Nault [Mon, 16 Jun 2025 11:58:29 +0000 (13:58 +0200)]
ipv6: Simplify link-local address generation for IPv6 GRE.
Since commit 3e6a0243ff00 ("gre: Fix again IPv6 link-local address
generation."), addrconf_gre_config() has stopped handling IP6GRE
devices specially and just calls the regular addrconf_addr_gen()
function to create their link-local IPv6 addresses.
We can thus avoid using addrconf_gre_config() for IP6GRE devices and
use the normal IPv6 initialisation path instead (that is, jump directly
to addrconf_dev_config() in addrconf_init_auto_addrs()).
See commit 3e6a0243ff00 ("gre: Fix again IPv6 link-local address
generation.") for a deeper explanation on how and why GRE devices
started handling their IPv6 link-local address generation specially,
why it was a problem, and why this is not even necessary in most cases
(especially for GRE over IPv6).
====================
Add support for PSE budget evaluation strategy
This series brings support for budget evaluation strategy in the PSE
subsystem. PSE controllers can set priorities to decide which ports should
be turned off in case of special events like over-current.
This patch series adds support for two budget evaluation strategy.
1. Static Method:
This method involves distributing power based on PD classification.
It’s straightforward and stable, the PSE core keeping track of the
budget and subtracting the power requested by each PD’s class.
Advantages: Every PD gets its promised power at any time, which
guarantees reliability.
Disadvantages: PD classification steps are large, meaning devices
request much more power than they actually need. As a result, the power
supply may only operate at, say, 50% capacity, which is inefficient and
wastes money.
2. Dynamic Method:
To address the inefficiencies of the static method, vendors like
Microchip have introduced dynamic power budgeting, as seen in the
PD692x0 firmware. This method monitors the current consumption per port
and subtracts it from the available power budget. When the budget is
exceeded, lower-priority ports are shut down.
Advantages: This method optimizes resource utilization, saving costs.
Disadvantages: Low-priority devices may experience instability.
The UAPI allows adding support for software port priority mode managed from
userspace later if needed.
Patches 1-2: Add support for interrupt event report in PSE core, ethtool
and ethtool specs.
Patch 3: Adds support for interrupt and event report in TPS23881 driver.
Patches 4,5: Add support for PSE power domain in PSE core and ethtool.
Patches 6-8: Add support for budget evaluation strategy in PSE core,
ethtool and ethtool specs.
Patches 9-11: Add support for port priority and power supplies in PD692x0
drivers.
Patches 12,13: Add support for port priority in TPS23881 drivers.
====================
Add an interrupt property to the device tree bindings for the TI TPS23881
PSE controller. The interrupt is primarily used to detect classification
and disconnection events, which are essential for managing the PSE
controller in compliance with the PoE standard.
Interrupt support is essential for the proper functioning of the TPS23881
controller. Without it, after a power-on (PWON), the controller will
no longer perform detection and classification. This could lead to
potential hazards, such as connecting a non-PoE device after a PoE device,
which might result in magic smoke.
net: pse-pd: tps23881: Add support for static port priority feature
This patch enhances PSE callbacks by introducing support for the static
port priority feature. It extends interrupt management to handle and report
detection, classification, and disconnection events. Additionally, it
introduces the pi_get_pw_req() callback, which provides information about
the power requested by the Powered Devices.
Interrupt support is essential for the proper functioning of the TPS23881
controller. Without it, after a power-on (PWON), the controller will
no longer perform detection and classification. This could lead to
potential hazards, such as connecting a non-PoE device after a PoE device,
which might result in magic smoke.
Adds the regulator supply parameter of the managers.
Update also the example as the regulator supply of the PSE PIs
should be the managers itself and not an external regulator.
net: pse-pd: pd692x0: Add support for controller and manager power supplies
Add support for managing the VDD and VDDA power supplies for the PD692x0
PSE controller, as well as the VAUX5 and VAUX3P3 power supplies for the
PD6920x PSE managers.
net: pse-pd: pd692x0: Add support for PSE PI priority feature
This patch extends the PSE callbacks by adding support for the newly
introduced pi_set_prio() callback, enabling the configuration of PSE PI
priorities. The current port priority is now also included in the status
information returned to users.
net: ethtool: Add PSE port priority support feature
This patch expands the status information provided by ethtool for PSE c33
with current port priority and max port priority. It also adds a call to
pse_ethtool_set_prio() to configure the PSE port priority.
net: pse-pd: Add support for budget evaluation strategies
This patch introduces the ability to configure the PSE PI budget evaluation
strategies. Budget evaluation strategies is utilized by PSE controllers to
determine which ports to turn off first in scenarios such as power budget
exceedance.
The pis_prio_max value is used to define the maximum priority level
supported by the controller. Both the current priority and the maximum
priority are exposed to the user through the pse_ethtool_get_status call.
This patch add support for two mode of budget evaluation strategies.
1. Static Method:
This method involves distributing power based on PD classification.
It’s straightforward and stable, the PSE core keeping track of the
budget and subtracting the power requested by each PD’s class.
Advantages: Every PD gets its promised power at any time, which
guarantees reliability.
Disadvantages: PD classification steps are large, meaning devices
request much more power than they actually need. As a result, the power
supply may only operate at, say, 50% capacity, which is inefficient and
wastes money.
Priority max value is matching the number of PSE PIs within the PSE.
2. Dynamic Method:
To address the inefficiencies of the static method, vendors like
Microchip have introduced dynamic power budgeting, as seen in the
PD692x0 firmware. This method monitors the current consumption per port
and subtracts it from the available power budget. When the budget is
exceeded, lower-priority ports are shut down.
Advantages: This method optimizes resource utilization, saving costs.
Disadvantages: Low-priority devices may experience instability.
Priority max value is set by the PSE controller driver.
For now, budget evaluation methods are not configurable and cannot be
mixed. They are hardcoded in the PSE driver itself, as no current PSE
controller supports both methods.
Introduce PSE power domain support as groundwork for upcoming port
priority features. Multiple PSE PIs can now be grouped under a single
PSE power domain, enabling future enhancements like defining available
power budgets, port priority modes, and disconnection policies. This
setup will allow the system to assess whether activating a port would
exceed the available power budget, preventing over-budget states
proactively.
net: pse-pd: tps23881: Add support for PSE events and interrupts
Add support for PSE event reporting through interrupts. Set up the newly
introduced devm_pse_irq_helper helper to register the interrupt. Events are
reported for over-current and over-temperature conditions.
Add support for devm_pse_irq_helper() to register PSE interrupts and report
events such as over-current or over-temperature conditions. This follows a
similar approach to the regulator API but also sends notifications using a
dedicated PSE ethtool netlink socket.
net: pse-pd: Introduce attached_phydev to pse control
In preparation for reporting PSE events via ethtool notifications,
introduce an attached_phydev field in the pse_control structure.
This field stores the phy_device associated with the PSE PI,
ensuring that notifications are sent to the correct network
interface.
The attached_phydev pointer is directly tied to the PHY lifecycle. It
is set when the PHY is registered and cleared when the PHY is removed.
There is no need to use a refcount, as doing so could interfere with
the PHY removal process.
Jakub Kicinski [Thu, 19 Jun 2025 01:57:32 +0000 (18:57 -0700)]
Merge branch 'phc-support-in-ena-driver'
David Arinzon says:
====================
PHC support in ENA driver
This patchset adds the support for PHC (PTP Hardware Clock)
in the ENA driver. The documentation part of the patchset
includes additional information, including statistics,
utilization and invocation examples through the testptp
utility.
====================
David Arinzon [Tue, 17 Jun 2025 11:05:42 +0000 (14:05 +0300)]
net: ena: Control PHC enable through devlink
Add the capability to set parameters through the devlink framework.
The parameter used for controlling PHC (enable/disable) details
are as follows:
- Name: enable_phc
- Type: Boolean (true - enable/false - disable)
- Mode: DEVLINK_PARAM_CMODE_DRIVERINIT
- Effect: Changes take place during driver initialization,
any changes require a devlink reload to take effect.
David Arinzon [Tue, 17 Jun 2025 11:05:39 +0000 (14:05 +0300)]
net: ena: Add device reload capability through devlink
Adding basic devlink capability support of reloading the driver.
This capability is required to support driver init type
devlink params (DEVLINK_PARAM_CMODE_DRIVERINIT). Such params
require reloading of the driver (destroy/restore sequence).
The reloading is done by the devlink framework using the
hooks provided by the driver.
David Arinzon [Tue, 17 Jun 2025 11:05:38 +0000 (14:05 +0300)]
net: ena: PHC silent reset
Each PHC device kernel registration receives a unique kernel index,
which is associated with a new PHC device file located at
"/dev/ptp<index>".
This device file serves as an interface for obtaining PHC timestamps.
Examples of tools that use "/dev/ptp" include testptp [1]
and chrony [2].
A reset flow may occur in the ENA driver while PHC is active.
During a reset, the driver will unregister and then re-register the
PHC device with the kernel.
Under race conditions, particularly during heavy PHC loads,
the driver’s reset flow might complete faster than the kernel’s PHC
unregister/register process.
This can result in the PHC index being different from what it was prior
to the reset, as the PHC index is selected using kernel ID
allocation [3].
While driver rmmod/insmod are done by the user, a reset may occur
at anytime, without the user awareness, consequently, the driver
might receive a new PHC index after the reset, potentially compromising
the user experience.
To prevent this issue, the PHC flow will detect the reset during PHC
destruction and will skip the PHC unregister/register calls to preserve
the kernel PHC index.
During the reset flow, any attempt to get a PHC timestamp will fail as
expected, but the kernel PHC index will remain unchanged.
David Arinzon [Tue, 17 Jun 2025 11:05:37 +0000 (14:05 +0300)]
net: ena: Add PHC support in the ENA driver
The ENA driver will be extended to support the new PHC feature using
ptp_clock interface [1]. this will provide timestamp reference for user
space to allow measuring time offset between the PHC and the system
clock in order to achieve nanosecond accuracy.
Recently bnxt had to grow back a bunch of rtnl dependencies because
of udp_tunnel's infra. Add separate (global) mutext to protect
udp_tunnel state.
====================
udp_tunnel infra doesn't need RTNL, should be safe to get back
to only netdev instance lock.
Cc: Michael Chan <michael.chan@broadcom.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com> Link: https://patch.msgid.link/20250616162117.287806-7-stfomichev@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that there is only one path in udp_tunnel, there is no need to
have udp_ports_sleep knob. Remove it and adjust the test.
Cc: Michael Chan <michael.chan@broadcom.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com> Link: https://patch.msgid.link/20250616162117.287806-6-stfomichev@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: remove redundant ASSERT_RTNL() in queue setup functions
The existing netdev_ops_assert_locked() already asserts that either
the RTNL lock or the per-device lock is held, making the explicit
ASSERT_RTNL() redundant.
Drivers that are using ops lock and don't depend on RTNL lock
still need to manage it because udp_tunnel's RTNL dependency.
Introduce new udp_tunnel_nic_lock and use it instead of
rtnl_lock. Drop non-UDP_TUNNEL_NIC_INFO_MAY_SLEEP mode from
udp_tunnel infra (udp_tunnel_nic_device_sync_work needs to
grab udp_tunnel_nic_lock mutex and might sleep).
We won't be able to sleep soon in vxlan_offload_rx_ports and won't be
able to grab sock_lock. Instead of having separate spinlock to
manage sockets, rely on rtnl lock. This is similar to how geneve
manages its sockets.
Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250616162117.287806-3-stfomichev@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
geneve: rely on rtnl lock in geneve_offload_rx_ports
udp_tunnel_push_rx_port will grab mutex in the next patch so
we can't use rcu. geneve_offload_rx_ports is called
from geneve_netdevice_event for NETDEV_UDP_TUNNEL_PUSH_INFO and
NETDEV_UDP_TUNNEL_DROP_INFO which both have ASSERT_RTNL.
Entries are added to and removed from the sock_list under rtnl
lock as well (when adding or removing a tunneling device).
Heiner Kallweit [Mon, 16 Jun 2025 21:24:05 +0000 (23:24 +0200)]
dpaa_eth: don't use fixed_phy_change_carrier
This effectively reverts 6e8b0ff1ba4c ("dpaa_eth: Add change_carrier()
for Fixed PHYs"). Usage of fixed_phy_change_carrier() requires that
fixed_phy_register() has been called before, directly or indirectly.
And that's not the case in this driver.
Excluding the case where nla is exactly -NLA_HDRLEN, it will not return
NULL. And it seems misleading to assume that it can, other than in this
corner case. So drop checks for this condition.
David S. Miller [Wed, 18 Jun 2025 09:46:31 +0000 (10:46 +0100)]
Merge branch 'netconsole-msgid' into main
Gustavo Luiz Duarte says:
====================
netconsole: Add support for msgid in sysdata
This patch series introduces a new feature to netconsole which allows
appending a message ID to the userdata dictionary.
If the msgid feature is enabled, the message ID is built from a per-target 32
bit counter that is incremented and appended to every message sent to the target.
Example::
echo 1 > "/sys/kernel/config/netconsole/cmdline0/userdata/msgid_enabled"
echo "This is message #1" > /dev/kmsg
echo "This is message #2" > /dev/kmsg
13,434,54928466,-;This is message #1
msgid=1
13,435,54934019,-;This is message #2
msgid=2
This feature can be used by the target to detect if messages were dropped or
reordered before reaching the target. This allows system administrators to
assess the reliability of their netconsole pipeline and detect loss of messages
due to network contention or temporary unavailability.
---
Changes in v3:
- Add kdoc documentation for msgcounter.
- Link to v2: https://lore.kernel.org/r/20250612-netconsole-msgid-v2-0-d4c1abc84bac@gmail.com
Changes in v2:
- Use wrapping_assign_add() to avoid warnings in UBSAN and friends.
- Improve documentation to clarify wrapping and distinguish msgid from sequnum.
- Rebase and fix conflict in prepare_extradata().
- Link to v1: https://lore.kernel.org/r/20250611-netconsole-msgid-v1-0-1784a51feb1e@gmail.com
====================
Suggested-by: Breno Leitao <leitao@debian.org> Signed-off-by: Gustavo Luiz Duarte <gustavold@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Add documentation explaining the msgid feature in netconsole.
This feature appends unique id to the userdata dictionary. The message
ID is populated from a per-target 32 bit counter which is incremented
for each message sent to the target. This allows a target to detect if
messages are dropped before reaching the target.
Signed-off-by: Gustavo Luiz Duarte <gustavold@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
selftests: netconsole: Add tests for 'msgid' feature in sysdata
Extend the self-tests to cover the 'msgid' feature in sysdata.
Verify that msgid is appended to the message when the feature is enabled
and that it is not appended when the feature is disabled.
Signed-off-by: Gustavo Luiz Duarte <gustavold@gmail.com> Reviewed-by: Breno Leitao <leitao@debian.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Add msgcounter to the netconsole_target struct to generate message IDs.
If the msgid_enabled attribute is true, increment msgcounter and append
msgid=<msgcounter> to sysdata buffer before sending the message.
Signed-off-by: Gustavo Luiz Duarte <gustavold@gmail.com> Reviewed-by: Breno Leitao <leitao@debian.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Implement the _show and _store functions for the msgid_enabled configfs
attribute under userdata.
Set the sysdata_fields bit accordingly.
Reviewed-by: Breno Leitao <leitao@debian.org> Signed-off-by: Gustavo Luiz Duarte <gustavold@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
netconsole: introduce 'msgid' as a new sysdata field
This adds a new sysdata field to enable assigning a per-target unique id
to each message sent to that target. This id can later be appended as
part of sysdata, allowing targets to detect dropped netconsole messages.
Update count_extradata_entries() to take the new field into account.
Reviewed-by: Breno Leitao <leitao@debian.org> Signed-off-by: Gustavo Luiz Duarte <gustavold@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Time to add XDP helpers infra to libeth to greatly simplify adding
XDP to idpf and iavf, as well as improve and extend XDP in ice and
i40e. Any vendor is free to reuse helpers. If this happens, I'm fine
with moving the folder of out intel/.
The helpers greatly simplify building xdp_buff, running a prog,
handling the verdict, implement XDP_TX, .ndo_xdp_xmit, XDP buffer
completion. Same applies to XSk (with XSk xmit instead of
.ndo_xdp_xmit, plus stuff like XSk wakeup).
They are entirely generic with no HW definitions or assumptions.
HW-specific stuff like parsing Rx desc / filling Tx desc is passed
from the driver as inline callbacks.
For now, key assumptions that optimize performance / avoid code
bloat, but might not fit every driver in driver/net/:
* netmem holding the buffers are always order-0;
* driver has separate XDP Tx queues, doesn't use stack queues for
that. For best efficiency, you may want to have nr_cpu_ids XDP
queues, but less (queue sharing) is also supported;
* XDP Tx queues are interrupt-less and use "lazy" cleaning only
when there are less than 1/4 free Tx descriptors of the queue
size;
* main target platforms are 64-bit, although 32-bit is also fully
supported, but the code might be not as optimized for them.
Library code already supports multi-buffer for all kinds of Tx and
both header split and no split for Rx and Tx. Frags can come from
devmem/io_uring etc., direct `struct page *` is used only for header
buffers for which it's always true.
Drivers are free to pass their own Rx hints and XSK xmit hints ops.
XDP_TX and ndo_xdp_xmit use onstack bulk for the frames to be sent
and send them by batches of 16 buffers. This eats ~280 bytes on the
stack, but gives good boosts and allow to greatly optimize the main
sending function leaving it without any error/exception paths.
XSk xmit fills Tx descriptors in the loop unrolled by 8. This was
proven to improve perf on ice and i40e. XDP_TX and ndo_xdp_xmit
doesn't use unrolling as I wasn't able to get any improvements in
those scenenarios from this, while +1 Kb for their sending functions
for nothing doesn't sound reasonable.
XSk wakeup, instead of traditionally used "SW interrupts" provided
by NICs, uses IPI to schedule NAPI on the CPU corresponding to the
given queue pair. It gives better control over CPU distribution and
in general performs way better than "SW interrupts", plus allows us
to not pass any HW-specific callbacks there.
The code is built the way that all callbacks passed from drivers
get inlined; in general, most of hotpath gets inlined. Everything
slow/exception lands to .c files in the libeth folder, doesn't
create copies in the drivers themselves and doesn't overloat
hotpath.
Sure, inlining means that hotpath will be compiled into every driver
that uses the lib, but the core code is written in one place, so no
copying of bugs happens. Fixed once -- works everywhere.
The last commit might look like sorta hack, but it gives really good
boosts and decreases object code size, plus there are checks that
all those wider accesses are fully safe, so I don't feel anything
bad about it.
An example of using libeth_xdp can be found either on my GitHub or
on the mailing lists here ("XDP for idpf"). Macros for building
driver XDP functions lead to that some implementations (XDP_TX,
ndo_xdp_xmit etc.) consist of really only a few lines.
* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
libeth: xdp, xsk: access adjacent u32s as u64 where applicable
libeth: xsk: add XSkFQ refill and XSk wakeup helpers
libeth: xsk: add XSk Rx processing support
libeth: xsk: add XSk xmit functions
libeth: xsk: add XSk XDP_TX sending helpers
libeth: xdp: add RSS hash hint and XDP features setup helpers
libeth: xdp: add templates for building driver-side callbacks
libeth: xdp: add XDP prog run and verdict result handling
libeth: xdp: add helpers for preparing/processing &libeth_xdp_buff
libeth: xdp: add XDPSQ cleanup timers
libeth: xdp: add XDPSQ locking helpers
libeth: xdp: add XDPSQE completion helpers
libeth: xdp: add .ndo_xdp_xmit() helpers
libeth: xdp: add XDP_TX buffers sending
libeth: support native XDP and register memory model
libeth: convert to netmem
libeth, libie: clean symbol exports up a little
====================
====================
net/mlx5e: Add support for devmem and io_uring TCP zero-copy
This series adds support for zerocopy rx TCP with devmem and io_uring
for ConnectX7 NICs and above. For performance reasons and simplicity
HW-GRO will also be turned on when header-data split mode is on.
Performance
===========
Test setup:
* CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (single NUMA)
* NIC: ConnectX7
* Benchmarking tool: kperf [0]
* Single TCP flow
* Test duration: 60s
With application thread and interrupts pinned to the *same* core:
Saeed Mahameed [Mon, 16 Jun 2025 14:14:40 +0000 (17:14 +0300)]
net/mlx5e: Support ethtool tcp-data-split settings
In mlx5, tcp header-data split requires HW GRO to be on.
Enabling it fails when HW GRO is off.
mlx5e_fix_features now keeps HW GRO on when tcp data split is enabled.
Finally, when tcp data split is disabled, features are updated to maybe
remove the forced HW GRO.
Saeed Mahameed [Mon, 16 Jun 2025 14:14:39 +0000 (17:14 +0300)]
net/mlx5e: Implement queue mgmt ops and single channel swap
The bulk of the work is done in mlx5e_queue_mem_alloc, where we allocate
and create the new channel resources, similar to
mlx5e_safe_switch_params, but here we do it for a single channel using
existing params, sort of a clone channel.
To swap the old channel with the new one, we deactivate and close the
old channel then replace it with the new one, since the swap procedure
doesn't fail in mlx5, we do it all in one place (mlx5e_queue_start).
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Acked-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250616141441.1243044-11-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Saeed Mahameed [Mon, 16 Jun 2025 14:14:38 +0000 (17:14 +0300)]
net/mlx5e: Add support for UNREADABLE netmem page pools
On netdev_rx_queue_restart, a special type of page pool maybe expected.
In this patch declare support for UNREADABLE netmem iov pages in the
pool params only when header data split shampo RQ mode is enabled, also
set the queue index in the page pool params struct.
Shampo mode requirement: Without header split rx needs to peek at the data,
we can't do UNREADABLE_NETMEM.
The patch also enables the use of a separate page pool for headers when
a memory provider is installed for the queue, otherwise the same common
page pool continues to be used.
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250616141441.1243044-10-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Saeed Mahameed [Mon, 16 Jun 2025 14:14:37 +0000 (17:14 +0300)]
net/mlx5e: Convert over to netmem
mlx5e_page_frag holds the physical page itself, to naturally support
zc page pools, remove physical page reference from mlx5 and replace it
with netmem_ref, to avoid internal handling in mlx5 for net_iov backed
pages.
SHAMPO can issue packets that are not split into header and data. These
packets will be dropped if the data part resides in a net_iov as the
driver can't read into this area.
No performance degradation observed.
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250616141441.1243044-9-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Saeed Mahameed [Mon, 16 Jun 2025 14:14:36 +0000 (17:14 +0300)]
net/mlx5e: SHAMPO: Separate pool for headers
Allow allocating a separate page pool for headers when SHAMPO is on.
This will be useful for adding support to zc page pool, which has to be
different from the headers page pool.
For now, the pools are the same.
Add missing HW capabilities, declare the feature in
netdev->vlan_features, similar to other features in mlx5e_build_nic_netdev.
No functional change here as all by default disabled features are
explicitly disabled at the bottom of the function.
Drop redundant SHAMPO structure alloc/free functions.
Gather together function calls pertaining to header split info, pass
header per WQE (hd_per_wqe) as parameter to those function to avoid use
before initialization future mistakes.
Allocate HW GRO related info outside of the header related info scope.
Tejun Heo [Mon, 16 Jun 2025 18:10:47 +0000 (08:10 -1000)]
net: tcp: tsq: Convert from tasklet to BH workqueue
The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws. To
replace tasklets, BH workqueue support was recently added. A BH workqueue
behaves similarly to regular workqueues except that the queued work items
are executed in the BH context.
This patch converts TCP Small Queues implementation from tasklet to BH
workqueue.
Semantically, this is an equivalent conversion and there shouldn't be any
user-visible behavior changes. While workqueue's queueing and execution
paths are a bit heavier than tasklet's, unless the work item is being queued
every packet, the difference hopefully shouldn't matter.
My experience with the networking stack is very limited and this patch
definitely needs attention from someone who actually understands networking.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Cc: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/aFBeJ38AS1ZF3Dq5@slm.duckdns.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
ipmr, ip6mr: Allow MC-routing locally-generated MC packets
Multicast routing is today handled in the input path. Locally generated MC
packets don't hit the IPMR code. Thus if a VXLAN remote address is
multicast, the driver needs to set an OIF during route lookup. In practice
that means that MC routing configuration needs to be kept in sync with the
VXLAN FDB and MDB. Ideally, the VXLAN packets would be routed by the MC
routing code instead.
To that end, this patchset adds support to route locally generated
multicast packets.
However, an installation that uses a VXLAN underlay netdevice for which it
also has matching MC routes, would get a different routing with this patch.
Previously, the MC packets would be delivered directly to the underlay
port, whereas now they would be MC-routed. In order to avoid this change in
behavior, introduce an IPCB/IP6CB flag. Unless the flag is set, the new
MC-routing code is skipped.
All this is keyed to a new VXLAN attribute, IFLA_VXLAN_MC_ROUTE. Only when
it is set does any of the above engage.
In addition to that, and as is the case today with MC forwarding,
IPV4_DEVCONF_MC_FORWARDING must be enabled for the netdevice that acts as a
source of MC traffic (i.e. the VXLAN PHYS_DEV), so an MC daemon must be
attached to the netdevice.
When a VXLAN netdevice with a MC remote is brought up, the physical
netdevice joins the indicated MC group. This is important for local
delivery of MC packets, so it is still necessary to configure a physical
netdevice -- the parameter cannot go away. The netdevice would however
typically not be a front panel port, but a dummy. An MC daemon would then
sit on top of that netdevice as well as any front panel ports that it needs
to service, and have routes set up between the two.
A way to configure the VXLAN netdevice to take advantage of the new MC
routing would be:
# ip link add name d up type dummy
# ip link add name vx10 up type vxlan id 1000 dstport 4789 \
local 192.0.2.1 group 225.0.0.1 ttl 16 dev d mrcoute
# ip link set dev vx10 master br # plus vlans etc.
The RX path has not changed, with the exception of an extra MC hop. Packets
are delivered to the front panel port and MC-forwarded to the VXLAN
physical port, here "d". Since the port has joined the multicast group, the
packets are locally delivered, and end up being processed by the VXLAN
netdevice.
This patchset is based on earlier patches from Nikolay Aleksandrov and
Roopa Prabhu, though it underwent significant changes. Roopa broadly
presented the topic on LPC 2019 [0].
Patchset progression:
- Patches #1 to #4 add ip_mr_output()
- Patches #5 to #10 add ip6_mr_output()
- Patch #11 adds the VXLAN bits to enable MR engagement
- Patches #12 to #14 prepare selftest libraries
- Patch #15 includes a new test suite
Tests may wish to add other interfaces to listen on. Notably locally
generated traffic uses dummy interfaces. The multicast daemon needs to know
about these so that it allows forming rules that involve these interfaces,
and so that net.ipv4.conf.X.mc_forwarding is set for the interfaces.
To that end, allow passing in a list of interfaces to configure in addition
to all the physical ones.
Petr Machata [Mon, 16 Jun 2025 22:44:20 +0000 (00:44 +0200)]
selftests: forwarding: lib: Move smcrouted helpers here
router_multicast.sh has several helpers for work with smcrouted. Extract
them to lib.sh so that other selftests can use them as well. Convert the
helpers to defer in the process, because that simplifies the interface
quite a bit. Therefore have router_multicast.sh invoke
defer_scopes_cleanup() in its cleanup() function.
Petr Machata [Mon, 16 Jun 2025 22:44:19 +0000 (00:44 +0200)]
vxlan: Support MC routing in the underlay
Locally-generated MC packets have so far not been subject to MC routing.
Instead an MC-enabled installation would maintain the MC routing tables,
and separately from that the list of interfaces to send packets to as part
of the VXLAN FDB and MDB.
In a previous patch, a ip_mr_output() and ip6_mr_output() routines were
added for IPv4 and IPv6. All locally generated MC traffic is now passed
through these functions. For reasons of backward compatibility, an SKB
(IPCB / IP6CB) flag guards the actual MC routing.
This patch adds logic to set the flag, and the UAPI to enable the behavior.
Petr Machata [Mon, 16 Jun 2025 22:44:18 +0000 (00:44 +0200)]
net: ipv6: Add ip6_mr_output()
Multicast routing is today handled in the input path. Locally generated MC
packets don't hit the IPMR code today. Thus if a VXLAN remote address is
multicast, the driver needs to set an OIF during route lookup. Thus MC
routing configuration needs to be kept in sync with the VXLAN FDB and MDB.
Ideally, the VXLAN packets would be routed by the MC routing code instead.
To that end, this patch adds support to route locally generated multicast
packets. The newly-added routines do largely what ip6_mr_input() and
ip6_mr_forward() do: make an MR cache lookup to find where to send the
packets, and use ip6_output() to send each of them. When no cache entry is
found, the packet is punted to the daemon for resolution.
Similarly to the IPv4 case in a previous patch, the new logic is contingent
on a newly-added IP6CB flag being set.
Petr Machata [Mon, 16 Jun 2025 22:44:17 +0000 (00:44 +0200)]
net: ipv6: ip6mr: Split ip6mr_forward2() in two
Some of the work of ip6mr_forward2() is specific to IPMR forwarding, and
should not take place on the output path. In order to allow reuse of the
common parts, extract out of the function a helper,
ip6mr_prepare_forward().
Petr Machata [Mon, 16 Jun 2025 22:44:15 +0000 (00:44 +0200)]
net: ipv6: ip6mr: Fix in/out netdev to pass to the FORWARD chain
The netfilter hook is invoked with skb->dev for input netdevice, and
vif_dev for output netdevice. However at the point of invocation, skb->dev
is already set to vif_dev, and MR-forwarded packets are reported with
in=out:
Petr Machata [Mon, 16 Jun 2025 22:44:14 +0000 (00:44 +0200)]
net: ipv6: Add a flags argument to ip6tunnel_xmit(), udp_tunnel6_xmit_skb()
ip6tunnel_xmit() erases the contents of the SKB control block. In order to
be able to set particular IP6CB flags on the SKB, add a corresponding
parameter, and propagate it to udp_tunnel6_xmit_skb() as well.
In one of the following patches, VXLAN driver will use this facility to
mark packets as subject to IPv6 multicast routing.
Petr Machata [Mon, 16 Jun 2025 22:44:13 +0000 (00:44 +0200)]
net: ipv6: Make udp_tunnel6_xmit_skb() void
The function always returns zero, thus the return value does not carry any
signal. Just make it void.
Most callers already ignore the return value. However:
- Refold arguments of the call from sctp_v6_xmit() so that they fit into
the 80-column limit.
- tipc_udp_xmit() initializes err from the return value, but that should
already be always zero at that point. So there's no practical change, but
elision of the assignment prompts a couple more tweaks to clean up the
function.
Petr Machata [Mon, 16 Jun 2025 22:44:12 +0000 (00:44 +0200)]
net: ipv4: Add ip_mr_output()
Multicast routing is today handled in the input path. Locally generated MC
packets don't hit the IPMR code today. Thus if a VXLAN remote address is
multicast, the driver needs to set an OIF during route lookup. Thus MC
routing configuration needs to be kept in sync with the VXLAN FDB and MDB.
Ideally, the VXLAN packets would be routed by the MC routing code instead.
To that end, this patch adds support to route locally generated multicast
packets. The newly-added routines do largely what ip_mr_input() and
ip_mr_forward() do: make an MR cache lookup to find where to send the
packets, and use ip_mc_output() to send each of them. When no cache entry
is found, the packet is punted to the daemon for resolution.
However, an installation that uses a VXLAN underlay netdevice for which it
also has matching MC routes, would get a different routing with this patch.
Previously, the MC packets would be delivered directly to the underlay
port, whereas now they would be MC-routed. In order to avoid this change in
behavior, introduce an IPCB flag. Only if the flag is set will
ip_mr_output() actually engage, otherwise it reverts to ip_mc_output().
This code is based on work by Roopa Prabhu and Nikolay Aleksandrov.
Signed-off-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/0aadbd49330471c0f758d54afb05eb3b6e3a6b65.1750113335.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Petr Machata [Mon, 16 Jun 2025 22:44:11 +0000 (00:44 +0200)]
net: ipv4: ipmr: Split ipmr_queue_xmit() in two
Some of the work of ipmr_queue_xmit() is specific to IPMR forwarding, and
should not take place on the output path. In order to allow reuse of the
common parts, split the function into two: the ipmr_prepare_xmit() helper
that takes care of the common bits, and the ipmr_queue_fwd_xmit(), which
invokes the former and encapsulates the whole forwarding algorithm.
Petr Machata [Mon, 16 Jun 2025 22:44:10 +0000 (00:44 +0200)]
net: ipv4: ipmr: ipmr_queue_xmit(): Drop local variable `dev'
The variable is used for caching of rt->dst.dev. The netdevice referenced
therein does not change during the scope of validity of that local. At the
same time, the local is only used twice, and each of these uses will end up
in a different function in the following patches, further eliminating any
use the local could have had.
Petr Machata [Mon, 16 Jun 2025 22:44:09 +0000 (00:44 +0200)]
net: ipv4: Add a flags argument to iptunnel_xmit(), udp_tunnel_xmit_skb()
iptunnel_xmit() erases the contents of the SKB control block. In order to
be able to set particular IPCB flags on the SKB, add a corresponding
parameter, and propagate it to udp_tunnel_xmit_skb() as well.
In one of the following patches, VXLAN driver will use this facility to
mark packets as subject to IP multicast routing.
Jakub Kicinski [Wed, 18 Jun 2025 01:06:42 +0000 (18:06 -0700)]
Merge branch 'misc-vlan-cleanups'
Gal Pressman says:
====================
Misc vlan cleanups
This patch series addresses compilation issues with objtool when VLAN
support is disabled (CONFIG_VLAN_8021Q=n) and makes related improvements
to the VLAN infrastructure.
When CONFIG_VLAN_8021Q=n, CONFIG_OBJTOOL=y, and CONFIG_OBJTOOL_WERROR=y,
the following compilation error occurs:
drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.o: error: objtool: parse_mirred.isra.0+0x370: mlx5e_tc_act_vlan_add_push_action() missing __noreturn in .c/.h or NORETURN() in noreturns.h
The error occurs because objtool cannot determine that unreachable BUG()
calls in VLAN code paths are actually dead code when VLAN support is
disabled.
First patch makes is_vlan_dev() a stub when VLAN is not configured,
allows compile-out of VLAN-dependent dead code paths and resolves the
objtool compilation error.
Second patch replaces BUG() calls with WARN_ON_ONCE(), as the usage of
BUG() should be avoided.
Third patch uses the "kernel" way of testing whether an option is
configured as builtin/module, instead of open-coding it.
Gal Pressman [Mon, 16 Jun 2025 13:26:26 +0000 (16:26 +0300)]
net: vlan: Use IS_ENABLED() helper for CONFIG_VLAN_8021Q guard
The header currently tests the VLAN core with an explicit pair of 'if
defined' checks:
#if defined(CONFIG_VLAN_8021Q) || defined(CONFIG_VLAN_8021Q_MODULE)
Instead, use IS_ENABLED() which is the kernel way to test whether an
option is configured as builtin/module.
This is purely cosmetic – no functional changes.
Reviewed-by: Alex Lazar <alazar@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20250616132626.1749331-4-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gal Pressman [Mon, 16 Jun 2025 13:26:25 +0000 (16:26 +0300)]
net: vlan: Replace BUG() with WARN_ON_ONCE() in vlan_dev_* stubs
When CONFIG_VLAN_8021Q=n, a set of stub helpers are used, three of these
helpers use BUG() unconditionally.
This code should not be reached, as callers of these functions should
always check for is_vlan_dev() first, but the usage of BUG() is not
recommended, replace it with WARN_ON() instead.
Reviewed-by: Alex Lazar <alazar@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20250616132626.1749331-3-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gal Pressman [Mon, 16 Jun 2025 13:26:24 +0000 (16:26 +0300)]
net: vlan: Make is_vlan_dev() a stub when VLAN is not configured
Add a stub implementation of is_vlan_dev() that returns false when
VLAN support is not compiled in (CONFIG_VLAN_8021Q=n).
This allows us to compile-out VLAN-dependent dead code when it is not
needed.
This also resolves the following compilation error when:
* CONFIG_VLAN_8021Q=n
* CONFIG_OBJTOOL=y
* CONFIG_OBJTOOL_WERROR=y
drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.o: error: objtool: parse_mirred.isra.0+0x370: mlx5e_tc_act_vlan_add_push_action() missing __noreturn in .c/.h or NORETURN() in noreturns.h
The error occurs because objtool cannot determine that unreachable BUG()
(which doesn't return) calls in VLAN code paths are actually dead code
when VLAN support is disabled.
====================
net: use new GPIO line value setter callbacks
Commit 98ce1eb1fd87e ("gpiolib: introduce gpio_chip setters that return
values") added new line setter callbacks to struct gpio_chip. They allow
to indicate failures to callers. We're in the process of converting all
GPIO controllers to using them before removing the old ones. This series
converts all GPIO chips implemented under drivers/net/.
net: can: mcp251x: propagate the return value of mcp251x_spi_write()
Add an integer return value to mcp251x_write_bits() and use it to
propagate the one returned by mcp251x_spi_write(). Return that value on
error in the request() GPIO callback.
Alok Tiwari [Mon, 16 Jun 2025 05:45:01 +0000 (22:45 -0700)]
gve: Return error for unknown admin queue command
In gve_adminq_issue_cmd(), return -EINVAL instead of 0 when an unknown
admin queue command opcode is encountered.
This prevents the function from silently succeeding on invalid input
and prevents undefined behavior by ensuring the function fails gracefully
when an unrecognized opcode is provided.
Alok Tiwari [Mon, 16 Jun 2025 05:45:00 +0000 (22:45 -0700)]
gve: Fix various typos and improve code comments
- Correct spelling and improves the clarity of comments
"confiugration" -> "configuration"
"spilt" -> "split"
"It if is 0" -> "If it is 0"
"DQ" -> "DQO" (correct abbreviation)
- Clarify BIT(0) flag usage in gve_get_priv_flags()
- Replaced hardcoded array size with GVE_NUM_PTYPES
for clarity and maintainability.
These changes are purely cosmetic and do not affect functionality.