Time to add XDP helpers infra to libeth to greatly simplify adding
XDP to idpf and iavf, as well as improve and extend XDP in ice and
i40e. Any vendor is free to reuse helpers. If this happens, I'm fine
with moving the folder of out intel/.
The helpers greatly simplify building xdp_buff, running a prog,
handling the verdict, implement XDP_TX, .ndo_xdp_xmit, XDP buffer
completion. Same applies to XSk (with XSk xmit instead of
.ndo_xdp_xmit, plus stuff like XSk wakeup).
They are entirely generic with no HW definitions or assumptions.
HW-specific stuff like parsing Rx desc / filling Tx desc is passed
from the driver as inline callbacks.
For now, key assumptions that optimize performance / avoid code
bloat, but might not fit every driver in driver/net/:
* netmem holding the buffers are always order-0;
* driver has separate XDP Tx queues, doesn't use stack queues for
that. For best efficiency, you may want to have nr_cpu_ids XDP
queues, but less (queue sharing) is also supported;
* XDP Tx queues are interrupt-less and use "lazy" cleaning only
when there are less than 1/4 free Tx descriptors of the queue
size;
* main target platforms are 64-bit, although 32-bit is also fully
supported, but the code might be not as optimized for them.
Library code already supports multi-buffer for all kinds of Tx and
both header split and no split for Rx and Tx. Frags can come from
devmem/io_uring etc., direct `struct page *` is used only for header
buffers for which it's always true.
Drivers are free to pass their own Rx hints and XSK xmit hints ops.
XDP_TX and ndo_xdp_xmit use onstack bulk for the frames to be sent
and send them by batches of 16 buffers. This eats ~280 bytes on the
stack, but gives good boosts and allow to greatly optimize the main
sending function leaving it without any error/exception paths.
XSk xmit fills Tx descriptors in the loop unrolled by 8. This was
proven to improve perf on ice and i40e. XDP_TX and ndo_xdp_xmit
doesn't use unrolling as I wasn't able to get any improvements in
those scenenarios from this, while +1 Kb for their sending functions
for nothing doesn't sound reasonable.
XSk wakeup, instead of traditionally used "SW interrupts" provided
by NICs, uses IPI to schedule NAPI on the CPU corresponding to the
given queue pair. It gives better control over CPU distribution and
in general performs way better than "SW interrupts", plus allows us
to not pass any HW-specific callbacks there.
The code is built the way that all callbacks passed from drivers
get inlined; in general, most of hotpath gets inlined. Everything
slow/exception lands to .c files in the libeth folder, doesn't
create copies in the drivers themselves and doesn't overloat
hotpath.
Sure, inlining means that hotpath will be compiled into every driver
that uses the lib, but the core code is written in one place, so no
copying of bugs happens. Fixed once -- works everywhere.
The last commit might look like sorta hack, but it gives really good
boosts and decreases object code size, plus there are checks that
all those wider accesses are fully safe, so I don't feel anything
bad about it.
An example of using libeth_xdp can be found either on my GitHub or
on the mailing lists here ("XDP for idpf"). Macros for building
driver XDP functions lead to that some implementations (XDP_TX,
ndo_xdp_xmit etc.) consist of really only a few lines.
* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
libeth: xdp, xsk: access adjacent u32s as u64 where applicable
libeth: xsk: add XSkFQ refill and XSk wakeup helpers
libeth: xsk: add XSk Rx processing support
libeth: xsk: add XSk xmit functions
libeth: xsk: add XSk XDP_TX sending helpers
libeth: xdp: add RSS hash hint and XDP features setup helpers
libeth: xdp: add templates for building driver-side callbacks
libeth: xdp: add XDP prog run and verdict result handling
libeth: xdp: add helpers for preparing/processing &libeth_xdp_buff
libeth: xdp: add XDPSQ cleanup timers
libeth: xdp: add XDPSQ locking helpers
libeth: xdp: add XDPSQE completion helpers
libeth: xdp: add .ndo_xdp_xmit() helpers
libeth: xdp: add XDP_TX buffers sending
libeth: support native XDP and register memory model
libeth: convert to netmem
libeth, libie: clean symbol exports up a little
====================
====================
net/mlx5e: Add support for devmem and io_uring TCP zero-copy
This series adds support for zerocopy rx TCP with devmem and io_uring
for ConnectX7 NICs and above. For performance reasons and simplicity
HW-GRO will also be turned on when header-data split mode is on.
Performance
===========
Test setup:
* CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (single NUMA)
* NIC: ConnectX7
* Benchmarking tool: kperf [0]
* Single TCP flow
* Test duration: 60s
With application thread and interrupts pinned to the *same* core:
Saeed Mahameed [Mon, 16 Jun 2025 14:14:40 +0000 (17:14 +0300)]
net/mlx5e: Support ethtool tcp-data-split settings
In mlx5, tcp header-data split requires HW GRO to be on.
Enabling it fails when HW GRO is off.
mlx5e_fix_features now keeps HW GRO on when tcp data split is enabled.
Finally, when tcp data split is disabled, features are updated to maybe
remove the forced HW GRO.
Saeed Mahameed [Mon, 16 Jun 2025 14:14:39 +0000 (17:14 +0300)]
net/mlx5e: Implement queue mgmt ops and single channel swap
The bulk of the work is done in mlx5e_queue_mem_alloc, where we allocate
and create the new channel resources, similar to
mlx5e_safe_switch_params, but here we do it for a single channel using
existing params, sort of a clone channel.
To swap the old channel with the new one, we deactivate and close the
old channel then replace it with the new one, since the swap procedure
doesn't fail in mlx5, we do it all in one place (mlx5e_queue_start).
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Acked-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250616141441.1243044-11-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Saeed Mahameed [Mon, 16 Jun 2025 14:14:38 +0000 (17:14 +0300)]
net/mlx5e: Add support for UNREADABLE netmem page pools
On netdev_rx_queue_restart, a special type of page pool maybe expected.
In this patch declare support for UNREADABLE netmem iov pages in the
pool params only when header data split shampo RQ mode is enabled, also
set the queue index in the page pool params struct.
Shampo mode requirement: Without header split rx needs to peek at the data,
we can't do UNREADABLE_NETMEM.
The patch also enables the use of a separate page pool for headers when
a memory provider is installed for the queue, otherwise the same common
page pool continues to be used.
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250616141441.1243044-10-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Saeed Mahameed [Mon, 16 Jun 2025 14:14:37 +0000 (17:14 +0300)]
net/mlx5e: Convert over to netmem
mlx5e_page_frag holds the physical page itself, to naturally support
zc page pools, remove physical page reference from mlx5 and replace it
with netmem_ref, to avoid internal handling in mlx5 for net_iov backed
pages.
SHAMPO can issue packets that are not split into header and data. These
packets will be dropped if the data part resides in a net_iov as the
driver can't read into this area.
No performance degradation observed.
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250616141441.1243044-9-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Saeed Mahameed [Mon, 16 Jun 2025 14:14:36 +0000 (17:14 +0300)]
net/mlx5e: SHAMPO: Separate pool for headers
Allow allocating a separate page pool for headers when SHAMPO is on.
This will be useful for adding support to zc page pool, which has to be
different from the headers page pool.
For now, the pools are the same.
Add missing HW capabilities, declare the feature in
netdev->vlan_features, similar to other features in mlx5e_build_nic_netdev.
No functional change here as all by default disabled features are
explicitly disabled at the bottom of the function.
Drop redundant SHAMPO structure alloc/free functions.
Gather together function calls pertaining to header split info, pass
header per WQE (hd_per_wqe) as parameter to those function to avoid use
before initialization future mistakes.
Allocate HW GRO related info outside of the header related info scope.
Tejun Heo [Mon, 16 Jun 2025 18:10:47 +0000 (08:10 -1000)]
net: tcp: tsq: Convert from tasklet to BH workqueue
The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws. To
replace tasklets, BH workqueue support was recently added. A BH workqueue
behaves similarly to regular workqueues except that the queued work items
are executed in the BH context.
This patch converts TCP Small Queues implementation from tasklet to BH
workqueue.
Semantically, this is an equivalent conversion and there shouldn't be any
user-visible behavior changes. While workqueue's queueing and execution
paths are a bit heavier than tasklet's, unless the work item is being queued
every packet, the difference hopefully shouldn't matter.
My experience with the networking stack is very limited and this patch
definitely needs attention from someone who actually understands networking.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Cc: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/aFBeJ38AS1ZF3Dq5@slm.duckdns.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
ipmr, ip6mr: Allow MC-routing locally-generated MC packets
Multicast routing is today handled in the input path. Locally generated MC
packets don't hit the IPMR code. Thus if a VXLAN remote address is
multicast, the driver needs to set an OIF during route lookup. In practice
that means that MC routing configuration needs to be kept in sync with the
VXLAN FDB and MDB. Ideally, the VXLAN packets would be routed by the MC
routing code instead.
To that end, this patchset adds support to route locally generated
multicast packets.
However, an installation that uses a VXLAN underlay netdevice for which it
also has matching MC routes, would get a different routing with this patch.
Previously, the MC packets would be delivered directly to the underlay
port, whereas now they would be MC-routed. In order to avoid this change in
behavior, introduce an IPCB/IP6CB flag. Unless the flag is set, the new
MC-routing code is skipped.
All this is keyed to a new VXLAN attribute, IFLA_VXLAN_MC_ROUTE. Only when
it is set does any of the above engage.
In addition to that, and as is the case today with MC forwarding,
IPV4_DEVCONF_MC_FORWARDING must be enabled for the netdevice that acts as a
source of MC traffic (i.e. the VXLAN PHYS_DEV), so an MC daemon must be
attached to the netdevice.
When a VXLAN netdevice with a MC remote is brought up, the physical
netdevice joins the indicated MC group. This is important for local
delivery of MC packets, so it is still necessary to configure a physical
netdevice -- the parameter cannot go away. The netdevice would however
typically not be a front panel port, but a dummy. An MC daemon would then
sit on top of that netdevice as well as any front panel ports that it needs
to service, and have routes set up between the two.
A way to configure the VXLAN netdevice to take advantage of the new MC
routing would be:
# ip link add name d up type dummy
# ip link add name vx10 up type vxlan id 1000 dstport 4789 \
local 192.0.2.1 group 225.0.0.1 ttl 16 dev d mrcoute
# ip link set dev vx10 master br # plus vlans etc.
The RX path has not changed, with the exception of an extra MC hop. Packets
are delivered to the front panel port and MC-forwarded to the VXLAN
physical port, here "d". Since the port has joined the multicast group, the
packets are locally delivered, and end up being processed by the VXLAN
netdevice.
This patchset is based on earlier patches from Nikolay Aleksandrov and
Roopa Prabhu, though it underwent significant changes. Roopa broadly
presented the topic on LPC 2019 [0].
Patchset progression:
- Patches #1 to #4 add ip_mr_output()
- Patches #5 to #10 add ip6_mr_output()
- Patch #11 adds the VXLAN bits to enable MR engagement
- Patches #12 to #14 prepare selftest libraries
- Patch #15 includes a new test suite
Tests may wish to add other interfaces to listen on. Notably locally
generated traffic uses dummy interfaces. The multicast daemon needs to know
about these so that it allows forming rules that involve these interfaces,
and so that net.ipv4.conf.X.mc_forwarding is set for the interfaces.
To that end, allow passing in a list of interfaces to configure in addition
to all the physical ones.
Petr Machata [Mon, 16 Jun 2025 22:44:20 +0000 (00:44 +0200)]
selftests: forwarding: lib: Move smcrouted helpers here
router_multicast.sh has several helpers for work with smcrouted. Extract
them to lib.sh so that other selftests can use them as well. Convert the
helpers to defer in the process, because that simplifies the interface
quite a bit. Therefore have router_multicast.sh invoke
defer_scopes_cleanup() in its cleanup() function.
Petr Machata [Mon, 16 Jun 2025 22:44:19 +0000 (00:44 +0200)]
vxlan: Support MC routing in the underlay
Locally-generated MC packets have so far not been subject to MC routing.
Instead an MC-enabled installation would maintain the MC routing tables,
and separately from that the list of interfaces to send packets to as part
of the VXLAN FDB and MDB.
In a previous patch, a ip_mr_output() and ip6_mr_output() routines were
added for IPv4 and IPv6. All locally generated MC traffic is now passed
through these functions. For reasons of backward compatibility, an SKB
(IPCB / IP6CB) flag guards the actual MC routing.
This patch adds logic to set the flag, and the UAPI to enable the behavior.
Petr Machata [Mon, 16 Jun 2025 22:44:18 +0000 (00:44 +0200)]
net: ipv6: Add ip6_mr_output()
Multicast routing is today handled in the input path. Locally generated MC
packets don't hit the IPMR code today. Thus if a VXLAN remote address is
multicast, the driver needs to set an OIF during route lookup. Thus MC
routing configuration needs to be kept in sync with the VXLAN FDB and MDB.
Ideally, the VXLAN packets would be routed by the MC routing code instead.
To that end, this patch adds support to route locally generated multicast
packets. The newly-added routines do largely what ip6_mr_input() and
ip6_mr_forward() do: make an MR cache lookup to find where to send the
packets, and use ip6_output() to send each of them. When no cache entry is
found, the packet is punted to the daemon for resolution.
Similarly to the IPv4 case in a previous patch, the new logic is contingent
on a newly-added IP6CB flag being set.
Petr Machata [Mon, 16 Jun 2025 22:44:17 +0000 (00:44 +0200)]
net: ipv6: ip6mr: Split ip6mr_forward2() in two
Some of the work of ip6mr_forward2() is specific to IPMR forwarding, and
should not take place on the output path. In order to allow reuse of the
common parts, extract out of the function a helper,
ip6mr_prepare_forward().
Petr Machata [Mon, 16 Jun 2025 22:44:15 +0000 (00:44 +0200)]
net: ipv6: ip6mr: Fix in/out netdev to pass to the FORWARD chain
The netfilter hook is invoked with skb->dev for input netdevice, and
vif_dev for output netdevice. However at the point of invocation, skb->dev
is already set to vif_dev, and MR-forwarded packets are reported with
in=out:
Petr Machata [Mon, 16 Jun 2025 22:44:14 +0000 (00:44 +0200)]
net: ipv6: Add a flags argument to ip6tunnel_xmit(), udp_tunnel6_xmit_skb()
ip6tunnel_xmit() erases the contents of the SKB control block. In order to
be able to set particular IP6CB flags on the SKB, add a corresponding
parameter, and propagate it to udp_tunnel6_xmit_skb() as well.
In one of the following patches, VXLAN driver will use this facility to
mark packets as subject to IPv6 multicast routing.
Petr Machata [Mon, 16 Jun 2025 22:44:13 +0000 (00:44 +0200)]
net: ipv6: Make udp_tunnel6_xmit_skb() void
The function always returns zero, thus the return value does not carry any
signal. Just make it void.
Most callers already ignore the return value. However:
- Refold arguments of the call from sctp_v6_xmit() so that they fit into
the 80-column limit.
- tipc_udp_xmit() initializes err from the return value, but that should
already be always zero at that point. So there's no practical change, but
elision of the assignment prompts a couple more tweaks to clean up the
function.
Petr Machata [Mon, 16 Jun 2025 22:44:12 +0000 (00:44 +0200)]
net: ipv4: Add ip_mr_output()
Multicast routing is today handled in the input path. Locally generated MC
packets don't hit the IPMR code today. Thus if a VXLAN remote address is
multicast, the driver needs to set an OIF during route lookup. Thus MC
routing configuration needs to be kept in sync with the VXLAN FDB and MDB.
Ideally, the VXLAN packets would be routed by the MC routing code instead.
To that end, this patch adds support to route locally generated multicast
packets. The newly-added routines do largely what ip_mr_input() and
ip_mr_forward() do: make an MR cache lookup to find where to send the
packets, and use ip_mc_output() to send each of them. When no cache entry
is found, the packet is punted to the daemon for resolution.
However, an installation that uses a VXLAN underlay netdevice for which it
also has matching MC routes, would get a different routing with this patch.
Previously, the MC packets would be delivered directly to the underlay
port, whereas now they would be MC-routed. In order to avoid this change in
behavior, introduce an IPCB flag. Only if the flag is set will
ip_mr_output() actually engage, otherwise it reverts to ip_mc_output().
This code is based on work by Roopa Prabhu and Nikolay Aleksandrov.
Signed-off-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/0aadbd49330471c0f758d54afb05eb3b6e3a6b65.1750113335.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Petr Machata [Mon, 16 Jun 2025 22:44:11 +0000 (00:44 +0200)]
net: ipv4: ipmr: Split ipmr_queue_xmit() in two
Some of the work of ipmr_queue_xmit() is specific to IPMR forwarding, and
should not take place on the output path. In order to allow reuse of the
common parts, split the function into two: the ipmr_prepare_xmit() helper
that takes care of the common bits, and the ipmr_queue_fwd_xmit(), which
invokes the former and encapsulates the whole forwarding algorithm.
Petr Machata [Mon, 16 Jun 2025 22:44:10 +0000 (00:44 +0200)]
net: ipv4: ipmr: ipmr_queue_xmit(): Drop local variable `dev'
The variable is used for caching of rt->dst.dev. The netdevice referenced
therein does not change during the scope of validity of that local. At the
same time, the local is only used twice, and each of these uses will end up
in a different function in the following patches, further eliminating any
use the local could have had.
Petr Machata [Mon, 16 Jun 2025 22:44:09 +0000 (00:44 +0200)]
net: ipv4: Add a flags argument to iptunnel_xmit(), udp_tunnel_xmit_skb()
iptunnel_xmit() erases the contents of the SKB control block. In order to
be able to set particular IPCB flags on the SKB, add a corresponding
parameter, and propagate it to udp_tunnel_xmit_skb() as well.
In one of the following patches, VXLAN driver will use this facility to
mark packets as subject to IP multicast routing.
Jakub Kicinski [Wed, 18 Jun 2025 01:06:42 +0000 (18:06 -0700)]
Merge branch 'misc-vlan-cleanups'
Gal Pressman says:
====================
Misc vlan cleanups
This patch series addresses compilation issues with objtool when VLAN
support is disabled (CONFIG_VLAN_8021Q=n) and makes related improvements
to the VLAN infrastructure.
When CONFIG_VLAN_8021Q=n, CONFIG_OBJTOOL=y, and CONFIG_OBJTOOL_WERROR=y,
the following compilation error occurs:
drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.o: error: objtool: parse_mirred.isra.0+0x370: mlx5e_tc_act_vlan_add_push_action() missing __noreturn in .c/.h or NORETURN() in noreturns.h
The error occurs because objtool cannot determine that unreachable BUG()
calls in VLAN code paths are actually dead code when VLAN support is
disabled.
First patch makes is_vlan_dev() a stub when VLAN is not configured,
allows compile-out of VLAN-dependent dead code paths and resolves the
objtool compilation error.
Second patch replaces BUG() calls with WARN_ON_ONCE(), as the usage of
BUG() should be avoided.
Third patch uses the "kernel" way of testing whether an option is
configured as builtin/module, instead of open-coding it.
Gal Pressman [Mon, 16 Jun 2025 13:26:26 +0000 (16:26 +0300)]
net: vlan: Use IS_ENABLED() helper for CONFIG_VLAN_8021Q guard
The header currently tests the VLAN core with an explicit pair of 'if
defined' checks:
#if defined(CONFIG_VLAN_8021Q) || defined(CONFIG_VLAN_8021Q_MODULE)
Instead, use IS_ENABLED() which is the kernel way to test whether an
option is configured as builtin/module.
This is purely cosmetic – no functional changes.
Reviewed-by: Alex Lazar <alazar@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20250616132626.1749331-4-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gal Pressman [Mon, 16 Jun 2025 13:26:25 +0000 (16:26 +0300)]
net: vlan: Replace BUG() with WARN_ON_ONCE() in vlan_dev_* stubs
When CONFIG_VLAN_8021Q=n, a set of stub helpers are used, three of these
helpers use BUG() unconditionally.
This code should not be reached, as callers of these functions should
always check for is_vlan_dev() first, but the usage of BUG() is not
recommended, replace it with WARN_ON() instead.
Reviewed-by: Alex Lazar <alazar@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20250616132626.1749331-3-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gal Pressman [Mon, 16 Jun 2025 13:26:24 +0000 (16:26 +0300)]
net: vlan: Make is_vlan_dev() a stub when VLAN is not configured
Add a stub implementation of is_vlan_dev() that returns false when
VLAN support is not compiled in (CONFIG_VLAN_8021Q=n).
This allows us to compile-out VLAN-dependent dead code when it is not
needed.
This also resolves the following compilation error when:
* CONFIG_VLAN_8021Q=n
* CONFIG_OBJTOOL=y
* CONFIG_OBJTOOL_WERROR=y
drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.o: error: objtool: parse_mirred.isra.0+0x370: mlx5e_tc_act_vlan_add_push_action() missing __noreturn in .c/.h or NORETURN() in noreturns.h
The error occurs because objtool cannot determine that unreachable BUG()
(which doesn't return) calls in VLAN code paths are actually dead code
when VLAN support is disabled.
====================
net: use new GPIO line value setter callbacks
Commit 98ce1eb1fd87e ("gpiolib: introduce gpio_chip setters that return
values") added new line setter callbacks to struct gpio_chip. They allow
to indicate failures to callers. We're in the process of converting all
GPIO controllers to using them before removing the old ones. This series
converts all GPIO chips implemented under drivers/net/.
net: can: mcp251x: propagate the return value of mcp251x_spi_write()
Add an integer return value to mcp251x_write_bits() and use it to
propagate the one returned by mcp251x_spi_write(). Return that value on
error in the request() GPIO callback.
Alok Tiwari [Mon, 16 Jun 2025 05:45:01 +0000 (22:45 -0700)]
gve: Return error for unknown admin queue command
In gve_adminq_issue_cmd(), return -EINVAL instead of 0 when an unknown
admin queue command opcode is encountered.
This prevents the function from silently succeeding on invalid input
and prevents undefined behavior by ensuring the function fails gracefully
when an unrecognized opcode is provided.
Alok Tiwari [Mon, 16 Jun 2025 05:45:00 +0000 (22:45 -0700)]
gve: Fix various typos and improve code comments
- Correct spelling and improves the clarity of comments
"confiugration" -> "configuration"
"spilt" -> "split"
"It if is 0" -> "If it is 0"
"DQ" -> "DQO" (correct abbreviation)
- Clarify BIT(0) flag usage in gve_get_priv_flags()
- Replaced hardcoded array size with GVE_NUM_PTYPES
for clarity and maintainability.
These changes are purely cosmetic and do not affect functionality.
Yuyang Huang [Sat, 14 Jun 2025 05:35:22 +0000 (14:35 +0900)]
selftest: Add selftest for multicast address notifications
This commit adds a new kernel selftest to verify RTNLGRP_IPV4_MCADDR
and RTNLGRP_IPV6_MCADDR notifications. The test works by adding and
removing a dummy interface and then confirming that the system
correctly receives join and removal notifications for the 224.0.0.1
and ff02::1 multicast addresses.
The test relies on the iproute2 version to be 6.13+.
Tested by the following command:
$ vng -v --user root --cpus 16 -- \
make -C tools/testing/selftests TARGETS=net
TEST_PROGS=rtnetlink_notification.sh \
TEST_GEN_PROGS="" run_tests
Jakub Kicinski [Wed, 18 Jun 2025 00:52:32 +0000 (17:52 -0700)]
Merge branch 'net-dsa-b53-fix-bcm5325-support'
Álvaro Fernández Rojas says:
====================
net: dsa: b53: fix BCM5325 support
These patches get the BCM5325 switch working with b53.
The existing brcm legacy tag only works with BCM63xx switches.
We need to add a new legacy tag for BCM5325 and BCM5365 switches, which
require including the FCS and length.
I'm not really sure that everything here is correct since I don't work for
Broadcom and all this is based on the public datasheet available for the
BCM5325 and my own experiments with a Huawei HG556a (BCM6358).
Both sets of patches have been merged due to the change requested by Jonas
about BRCM_HDR register access depending on legacy tags.
====================
According to the datasheet, BCM5325 uses B53_PD_MODE_CTRL_25 register to
disable clocking to individual PHYs.
Only ports 1-4 can be enabled or disabled and the datasheet is explicit
about not toggling BIT(0) since it disables the PLL power and the switch.
net: dsa: b53: fix unicast/multicast flooding on BCM5325
BCM5325 doesn't implement UC_FLOOD_MASK, MC_FLOOD_MASK and IPMC_FLOOD_MASK
registers.
This has to be handled differently with other pages and registers.
net: dsa: b53: prevent GMII_PORT_OVERRIDE_CTRL access on BCM5325
BCM5325 doesn't implement GMII_PORT_OVERRIDE_CTRL register so we should
avoid reading or writing it.
PORT_OVERRIDE_RX_FLOW and PORT_OVERRIDE_TX_FLOW aren't defined on BCM5325
and we should use PORT_OVERRIDE_LP_FLOW_25 instead.
Florian Fainelli [Sat, 14 Jun 2025 07:59:51 +0000 (09:59 +0200)]
net: dsa: b53: add support for FDB operations on 5325/5365
BCM5325 and BCM5365 are part of a much older generation of switches which,
due to their limited number of ports and VLAN entries (up to 256) allowed
a single 64-bit register to hold a full ARL entry.
This requires a little bit of massaging when reading, writing and
converting ARL entries in both directions.
We need to be able to differentiate the BCM5325 variants because:
- BCM5325M switches lack the ARLIO_PAGE->VLAN_ID_IDX register.
- BCM5325E have less 512 ARL buckets instead of 1024.
Commit 46c5176c586c ("net: dsa: b53: support legacy tags") introduced
support for legacy tags, but it turns out that BCM5325 and BCM5365
switches require the original FCS value and length, so they have to be
treated differently.
net: dsa: tag_brcm: add support for legacy FCS tags
Add support for legacy Broadcom FCS tags, which are similar to
DSA_TAG_PROTO_BRCM_LEGACY.
BCM5325 and BCM5365 switches require including the original FCS value and
length, as opposed to BCM63xx switches.
Adding the original FCS value and length to DSA_TAG_PROTO_BRCM_LEGACY would
impact performance of BCM63xx switches, so it's better to create a new tag.
There is little need to have phy_intf_sel as a member of struct
visconti_eth when we have the PHY interface mode available from
phylink in visconti_eth_set_clk_tx_rate(). Without multiple
interface support, phylink is fixed to supporting only
plat->phy_interface, so we can be sure that "interface" passed
into this function is the same as plat->phy_interface.
Make phy_intf_sel local to visconti_eth_init_hw() and clean up.
Rather than testing dwmac->phy_intf_sel several times for the same
values in this function, group the code together. The only part
which was common was stopping the internal clock before programming
the clock setting.
This further improves the readability of this function.
Re-arrange the speed decode in visconti_eth_set_clk_tx_rate() to be
more readable by first checking to see if we're using RGMII or RMII
and then decoding the speed, rather than decoding the speed and then
testing the interface mode.
====================
Link NAPI instances to queues and IRQs
This patch series introduces netdev-genl support to rtase, enabling
user-space applications to query the relationships between IRQs,
queues, and NAPI instances.
====================
Alok Tiwari [Sun, 15 Jun 2025 08:48:12 +0000 (01:48 -0700)]
selftests: nettest: Fix typo in log and error messages for clarity
This patch corrects several logging and error message in nettest.c:
- Corrects function name in log messages "setsockopt" -> "getsockopt".
- Closes missing parentheses in "setsockopt(IPV6_FREEBIND)".
- Replaces misleading error text ("Invalid port") with the correct
description ("Invalid prefix length").
- remove Redundant wording like "status from status" and clarifies
context in IPC error messages.
These changes improve readability and aid in debugging test output.
RACK-TLP loss detection has been enabled as the default loss detection
algorithm for Linux TCP since 2018, in:
commit b38a51fec1c1 ("tcp: disable RFC6675 loss detection")
In case users ran into unexpected bugs or performance regressions,
that commit allowed Linux system administrators to revert to using
RFC3517/RFC6675 loss recovery by setting net.ipv4.tcp_recovery to 0.
In the seven years since 2018, our team has not heard reports of
anyone reverting Linux TCP to use RFC3517/RFC6675 loss recovery, and
we can't find any record in web searches of such a revert.
RACK-TLP was published as a standards-track RFC, RFC8985, in February
2021.
Several other major TCP implementations have default-enabled RACK-TLP
at this point as well.
RACK-TLP offers several significant performance advantages over
RFC3517/RFC6675 loss recovery, including much better performance in
the common cases of tail drops, lost retransmissions, and reordering.
It is now time to remove the obsolete and unused RFC3517/RFC6675 loss
recovery code. This will allow a substantial simplification of the
Linux TCP code base, and removes 12 bytes of state in every tcp_sock
for 64-bit machines (8 bytes on 32-bit machines).
To arrange the commits in reasonable sizes, this patch series is split
into 3 commits:
(1) Removes the core RFC3517/RFC6675 logic.
(2) Removes the RFC3517/RFC6675 hint state and the first layer of logic that
updates that state.
(3) Removes the emptied-out tcp_clear_retrans_hints_partial() helper function
and all of its call sites.
====================
Neal Cardwell [Sun, 15 Jun 2025 00:14:34 +0000 (20:14 -0400)]
tcp: remove RFC3517/RFC6675 hint state: lost_skb_hint, lost_cnt_hint
Now that obsolete RFC3517/RFC6675 TCP loss detection has been removed,
we can remove the somewhat complex and intrusive code to maintain its
hint state: lost_skb_hint and lost_cnt_hint.
This commit makes tcp_clear_retrans_hints_partial() empty. We will
remove tcp_clear_retrans_hints_partial() and its call sites in the
next commit.
Neal Cardwell [Sun, 15 Jun 2025 00:14:33 +0000 (20:14 -0400)]
tcp: remove obsolete and unused RFC3517/RFC6675 loss recovery code
RACK-TLP loss detection has been enabled as the default loss detection
algorithm for Linux TCP since 2018, in:
commit b38a51fec1c1 ("tcp: disable RFC6675 loss detection")
In case users ran into unexpected bugs or performance regressions,
that commit allowed Linux system administrators to revert to using
RFC3517/RFC6675 loss recovery by setting net.ipv4.tcp_recovery to 0.
In the seven years since 2018, our team has not heard reports of
anyone reverting Linux TCP to use RFC3517/RFC6675 loss recovery, and
we can't find any record in web searches of such a revert.
RACK-TLP was published as a standards-track RFC, RFC8985, in February
2021.
Several other major TCP implementations have default-enabled RACK-TLP
at this point as well.
RACK-TLP offers several significant performance advantages over
RFC3517/RFC6675 loss recovery, including much better performance in
the common cases of tail drops, lost retransmissions, and reordering.
It is now time to remove the obsolete and unused RFC3517/RFC6675 loss
recovery code. This will allow a substantial simplification of the
Linux TCP code base, and removes 12 bytes of state in every tcp_sock
for 64-bit machines (8 bytes on 32-bit machines).
To arrange the commits in reasonable sizes, this patch series is split
into 3 commits. The following 2 commits remove bookkeeping state and
code that is no longer needed after this removal of RFC3517/RFC6675
loss recovery.
Doug Berger [Sat, 14 Jun 2025 02:58:16 +0000 (19:58 -0700)]
net: bcmgenet: update PHY power down
The disable sequence in bcmgenet_phy_power_set() is updated to
match the inverse sequence and timing (and spacing) of the
enable sequence. This ensures that LEDs driven by the GENET IP
are disabled when the GPHY is powered down.
[Note, I'm wondering if actually this is a case of a missing call;
the other similar function is called in __verify_octeon_config_info(),
but I don't have or know the hardware.]
validate_cn23xx_pf_config_info() was added in 2016 by
commit 72c0091293c0 ("liquidio: CN23XX device init and sriov config")
The stmmac platform code already gets the "stmmaceth" clock, so there
is no need for drivers to get it. Use the stored pointer in struct
plat_stmmacenet_data instead of getting and storing our own pointer.
net: stmmac: rk: use device rather than platform device in rk_priv_data
All the code in dwmac-rk uses &bsp_priv->pdev->dev, nothing uses
bsp_priv->pdev directly. Store the struct device rather than the
struct platform_device in struct rk_priv_data, and simplifying the
code.
====================
vsock/test: Improve transport_uaf test
Increase the coverage of a test implemented in commit 301a62dfb0d0
("vsock/test: Add test for UAF due to socket unbinding"). Take this
opportunity to factor out some utility code, drop a redundant sync between
client and server, and introduce a /proc/kallsyms harvesting logic for
auto-detecting registered vsock transports.
Michal Luczaj [Wed, 11 Jun 2025 19:56:52 +0000 (21:56 +0200)]
vsock/test: Cover more CIDs in transport_uaf test
Increase the coverage of test for UAF due to socket unbinding, and losing
transport in general. It's a follow up to commit 301a62dfb0d0 ("vsock/test:
Add test for UAF due to socket unbinding") and discussion in [1].
The idea remains the same: take an unconnected stream socket with a
transport assigned and then attempt to switch the transport by trying (and
failing) to connect to some other CID. Now do this iterating over all the
well known CIDs (plus one).
While at it, drop the redundant synchronization between client and server.
Some single-transport setups can't be tested effectively; a warning is
issued. Depending on transports available, a variety of splats are possible
on unpatched machines. After reverting commit 78dafe1cf3af ("vsock: Orphan
socket after transport release") and commit fcdd2242c023 ("vsock: Keep the
binding until socket destruction"):
BUG: KASAN: slab-use-after-free in __vsock_bind+0x61f/0x720
Read of size 4 at addr ffff88811ff46b54 by task vsock_test/1475
Call Trace:
dump_stack_lvl+0x68/0x90
print_report+0x170/0x53d
kasan_report+0xc2/0x180
__vsock_bind+0x61f/0x720
vsock_connect+0x727/0xc40
__sys_connect+0xe8/0x100
__x64_sys_connect+0x6e/0xc0
do_syscall_64+0x92/0x1c0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Michal Luczaj [Wed, 11 Jun 2025 19:56:51 +0000 (21:56 +0200)]
vsock/test: Introduce get_transports()
Return a bitmap of registered vsock transports. As guesstimated by grepping
/proc/kallsyms (CONFIG_KALLSYMS=y) for known symbols of type `struct
vsock_transport`, or `struct virtio_transport` in case the vsock_transport
is embedded within.
Note that the way `enum transport` and `transport_ksyms[]` are defined
triggers checkpatch.pl:
util.h:11: ERROR: Macros with complex values should be enclosed in parentheses
util.h:20: ERROR: Macros with complex values should be enclosed in parentheses
util.h:20: WARNING: Argument 'symbol' is not used in function-like macro
util.h:28: WARNING: Argument 'name' is not used in function-like macro
While commit 15d4734c7a58 ("checkpatch: qualify do-while-0 advice")
suggests it is known that the ERRORs heuristics are insufficient, I can not
find many other places where preprocessor is used in this
checkpatch-unhappy fashion. Notable exception being bcachefs, e.g.
fs/bcachefs/alloc_background_format.h. WARNINGs regarding unused macro
arguments seem more common, e.g. __ASM_SEL in arch/x86/include/asm/asm.h.
In other words, this might be unnecessarily complex. The same can be
achieved by just telling human to keep the order:
Jakub Kicinski [Tue, 17 Jun 2025 21:43:58 +0000 (14:43 -0700)]
Merge branch 'shradha_v6.16-rc1' of https://github.com/shradhagupta6/linux
Shradha Gupta says:
====================
Allow dyn MSI-X vector allocation of MANA
In this patchset we want to enable the MANA driver to be able to
allocate MSI-X vectors in PCI dynamically.
The first patch exports pci_msix_prepare_desc() in PCI to be able to
correctly prepare descriptors for dynamically added MSI-X vectors.
The second patch adds the support of dynamic vector allocation in
pci-hyperv PCI controller by enabling the MSI_FLAG_PCI_MSIX_ALLOC_DYN
flag and using the pci_msix_prepare_desc() exported in first patch.
The third patch adds a detailed description of the irq_setup(), to
help understand the function design better.
The fourth patch is a preparation patch for mana changes to support
dynamic IRQ allocation. It contains changes in irq_setup() to allow
skipping first sibling CPU sets, in case certain IRQs are already
affinitized to them.
The fifth patch has the changes in MANA driver to be able to allocate
MSI-X vectors dynamically. If the support does not exist it defaults to
older behavior.
* 'shradha_v6.16-rc1' of https://github.com/shradhagupta6/linux:
net: mana: Allocate MSI-X vectors dynamically
net: mana: Allow irq_setup() to skip cpus for affinity
net: mana: explain irq_setup() algorithm
PCI: hv: Allow dynamic MSI-X vector allocation
PCI/MSI: Export pci_msix_prepare_desc() for dynamic MSI-X allocations
====================
Paolo Abeni [Tue, 17 Jun 2025 12:58:47 +0000 (14:58 +0200)]
Merge branch 'intel-next-queue-1GbE'
Tony Nguyen says:
====================
Faizal Rahim says:
MAC Merge support for frame preemption was previously added for igc:
https://lore.kernel.org/netdev/20250418163822.3519810-1-anthony.l.nguyen@intel.com/
This series builds on that work and adds support for:
- Harmonizing taprio and mqprio queue priority behavior, based on past
discussions and suggestions:
https://lore.kernel.org/all/20250214102206.25dqgut5tbak2rkz@skbuf/
- Enabling preemptible queue support for both taprio and mqprio, with
priority harmonization as a prerequisite.
Patch organization:
- Patches 1-3: Preparation work for patches 6 and 7
- Patches 4-5: Queue priority harmonization
- Patches 6-7: Add preemptible queue support
====================
Wei Fang [Fri, 13 Jun 2025 09:36:05 +0000 (17:36 +0800)]
net: enetc: replace PCVLANR1/2 with SICVLANR1/2 and remove dead branch
Both PF and VF have rx-vlan-offload enabled, however, the PCVLANR1/2
registers are resources controlled by PF, so VF cannot access these
two registers. Fortunately, the hardware provides SICVLANR1/2 registers
for each SI to reflect the value of PCVLANR1/2 registers. Therefore,
use SICVLANR1/2 instead of PCVLANR1/2. Note that this is not an issue
in actual use, because the current driver does not support custom TPID,
the driver will not access these two registers in actual use, so this
modification is just an optimization.
In addition, since ENETC_RXBD_FLAG_TPID is defined as GENMASK(1, 0),
the possible values are only 0, 1, 2, 3, so the default branch will
never be true, so remove the default branch.
Shradha Gupta [Wed, 11 Jun 2025 14:11:13 +0000 (07:11 -0700)]
net: mana: Allocate MSI-X vectors dynamically
Currently, the MANA driver allocates MSI-X vectors statically based on
MANA_MAX_NUM_QUEUES and num_online_cpus() values and in some cases ends
up allocating more vectors than it needs. This is because, by this time
we do not have a HW channel and do not know how many IRQs should be
allocated.
To avoid this, we allocate 1 MSI-X vector during the creation of HWC and
after getting the value supported by hardware, dynamically add the
remaining MSI-X vectors.
Shradha Gupta [Wed, 11 Jun 2025 14:10:42 +0000 (07:10 -0700)]
net: mana: Allow irq_setup() to skip cpus for affinity
In order to prepare the MANA driver to allocate the MSI-X IRQs
dynamically, we need to enhance irq_setup() to allow skipping
affinitizing IRQs to the first CPU sibling group.
This would be for cases when the number of IRQs is less than or equal
to the number of online CPUs. In such cases for dynamically added IRQs
the first CPU sibling group would already be affinitized with HWC IRQ.
Yury Norov [Wed, 11 Jun 2025 14:10:29 +0000 (07:10 -0700)]
net: mana: explain irq_setup() algorithm
Commit 91bfe210e196 ("net: mana: add a function to spread IRQs per CPUs")
added the irq_setup() function that distributes IRQs on CPUs according
to a tricky heuristic. The corresponding commit message explains the
heuristic.
Duplicate it in the source code to make available for readers without
digging git in history. Also, add more detailed explanation about how
the heuristics is implemented.
Shradha Gupta [Wed, 11 Jun 2025 14:10:15 +0000 (07:10 -0700)]
PCI: hv: Allow dynamic MSI-X vector allocation
Allow dynamic MSI-X vector allocation for pci_hyperv PCI controller
by adding support for the flag MSI_FLAG_PCI_MSIX_ALLOC_DYN and using
pci_msix_prepare_desc() to prepare the MSI-X descriptors.
Shradha Gupta [Wed, 11 Jun 2025 14:10:01 +0000 (07:10 -0700)]
PCI/MSI: Export pci_msix_prepare_desc() for dynamic MSI-X allocations
For supporting dynamic MSI-X vector allocation by PCI controllers, enabling
the flag MSI_FLAG_PCI_MSIX_ALLOC_DYN is not enough, msix_prepare_msi_desc()
to prepare the MSI descriptor is also needed.
Export pci_msix_prepare_desc() to allow PCI controllers to support dynamic
MSI-X vector allocation.
Jakub Kicinski [Fri, 13 Jun 2025 17:27:51 +0000 (10:27 -0700)]
eth: gianfar: migrate to new RXFH callbacks
Migrate to new callbacks added by commit 9bb00786fc61 ("net: ethtool:
add dedicated callbacks for getting and setting rxfh fields").
Uniquely, this driver supports only the SET operation. It does not
support GET at all. The SET callback also always returns 0, even
tho it checks a bunch of conditions, and if my quick reading is
right, expects the user to insert filtering rules for given flow
type first? Long story short it seems too convoluted to easily
add the GET as part of the conversion.