Joe Damato [Wed, 5 Feb 2025 19:37:47 +0000 (19:37 +0000)]
netdev-genl: Elide napi_id when not present
There are at least two cases where napi_id may not present and the
napi_id should be elided:
1. Queues could be created, but napi_enable may not have been called
yet. In this case, there may be a NAPI but it may not have an ID and
output of a napi_id should be elided.
2. TX-only NAPIs currently do not have NAPI IDs. If a TX queue happens
to be linked with a TX-only NAPI, elide the NAPI ID from the netlink
output as a NAPI ID of 0 is not useful for users.
Signed-off-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250205193751.297211-1-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 7 Feb 2025 00:27:33 +0000 (16:27 -0800)]
Merge branch 'io_uring-zero-copy-rx'
David Wei says:
====================
io_uring zero copy rx
This patchset contains net/ patches needed by a new io_uring request
implementing zero copy rx into userspace pages, eliminating a kernel
to user copy.
We configure a page pool that a driver uses to fill a hw rx queue to
hand out user pages instead of kernel pages. Any data that ends up
hitting this hw rx queue will thus be dma'd into userspace memory
directly, without needing to be bounced through kernel memory. 'Reading'
data out of a socket instead becomes a _notification_ mechanism, where
the kernel tells userspace where the data is. The overall approach is
similar to the devmem TCP proposal.
This relies on hw header/data split, flow steering and RSS to ensure
packet headers remain in kernel memory and only desired flows hit a hw
rx queue configured for zero copy. Configuring this is outside of the
scope of this patchset.
We share netdev core infra with devmem TCP. The main difference is that
io_uring is used for the uAPI and the lifetime of all objects are bound
to an io_uring instance. Data is 'read' using a new io_uring request
type. When done, data is returned via a new shared refill queue. A zero
copy page pool refills a hw rx queue from this refill queue directly. Of
course, the lifetime of these data buffers are managed by io_uring
rather than the networking stack, with different refcounting rules.
This patchset is the first step adding basic zero copy support. We will
extend this iteratively with new features e.g. dynamically allocated
zero copy areas, THP support, dmabuf support, improved copy fallback,
general optimisations and more.
In terms of netdev support, we're first targeting Broadcom bnxt. Patches
aren't included since Taehee Yoo has already sent a more comprehensive
patchset adding support in [1]. Google gve should already support this,
and Mellanox mlx5 support is WIP pending driver changes.
===========
Performance
===========
Note: Comparison with epoll + TCP_ZEROCOPY_RECEIVE isn't done yet.
Test setup:
* AMD EPYC 9454
* Broadcom BCM957508 200G
* Kernel v6.11 base [2]
* liburing fork [3]
* kperf fork [4]
* 4K MTU
* Single TCP flow
With application thread + net rx softirq pinned to _different_ cores:
David Wei [Tue, 4 Feb 2025 21:56:21 +0000 (13:56 -0800)]
net: add helpers for setting a memory provider on an rx queue
Add helpers that properly prep or remove a memory provider for an rx
queue then restart the queue.
Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-11-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov [Tue, 4 Feb 2025 21:56:20 +0000 (13:56 -0800)]
net: page_pool: add memory provider helpers
Add helpers for memory providers to interact with page pools.
net_mp_niov_{set,clear}_page_pool() serve to [dis]associate a net_iov
with a page pool. If used, the memory provider is responsible to match
"set" calls with "clear" once a net_iov is not going to be used by a page
pool anymore, changing a page pool, etc.
Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-10-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov [Tue, 4 Feb 2025 21:56:19 +0000 (13:56 -0800)]
net: prepare for non devmem TCP memory providers
There is a good bunch of places in generic paths assuming that the only
page pool memory provider is devmem TCP. As we want to reuse the net_iov
and provider infrastructure, we need to patch it up and explicitly check
the provider type when we branch into devmem TCP code.
Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-9-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov [Tue, 4 Feb 2025 21:56:18 +0000 (13:56 -0800)]
net: page_pool: add a mp hook to unregister_netdevice*
Devmem TCP needs a hook in unregister_netdevice_many_notify() to upkeep
the set tracking queues it's bound to, i.e. ->bound_rxqs. Instead of
devmem sticking directly out of the genetic path, add a mp function.
Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-8-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov [Tue, 4 Feb 2025 21:56:17 +0000 (13:56 -0800)]
net: page_pool: add callback for mp info printing
Add a mandatory callback that prints information about the memory
provider to netlink.
Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-7-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Wei [Tue, 4 Feb 2025 21:56:16 +0000 (13:56 -0800)]
netdev: add io_uring memory provider info
Add a nested attribute for io_uring memory provider info. For now it is
empty and its presence indicates that a particular page pool or queue
has an io_uring memory provider attached.
Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-6-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov [Tue, 4 Feb 2025 21:56:15 +0000 (13:56 -0800)]
net: page_pool: create hooks for custom memory providers
A spin off from the original page pool memory providers patch by Jakub,
which allows extending page pools with custom allocators. One of such
providers is devmem TCP, and the other is io_uring zerocopy added in
following patches.
Pavel Begunkov [Tue, 4 Feb 2025 21:56:14 +0000 (13:56 -0800)]
net: generalise net_iov chunk owners
Currently net_iov stores a pointer to struct dmabuf_genpool_chunk_owner,
which serves as a useful abstraction to share data and provide a
context. However, it's too devmem specific, and we want to reuse it for
other memory providers, and for that we need to decouple net_iov from
devmem. Make net_iov to point to a new base structure called
net_iov_area, which dmabuf_genpool_chunk_owner extends.
Reviewed-by: Mina Almasry <almasrymina@google.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-4-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov [Tue, 4 Feb 2025 21:56:13 +0000 (13:56 -0800)]
net: prefix devmem specific helpers
Add prefixes to all helpers that are specific to devmem TCP, i.e.
net_iov_binding[_id].
Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-3-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov [Tue, 4 Feb 2025 21:56:12 +0000 (13:56 -0800)]
net: page_pool: don't cast mp param to devmem
page_pool_check_memory_provider() is a generic path and shouldn't assume
anything about the actual type of the memory provider argument. It's
fine while devmem is the only provider, but cast away the devmem
specific binding types to avoid confusion.
Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-2-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 5 Feb 2025 17:33:52 +0000 (09:33 -0800)]
tools: ynl: add all headers to makefile deps
The Makefile.deps lists uAPI headers to make the build work when
system headers are older than in-tree headers. The problem doesn't
occur for new headers, because system headers are not there at all.
But out-of-tree YNL clone on GH also uses this header to identify
header dependencies, and one day the system headers will exist,
and will get out of date. So let's add the headers we missed.
I don't think this is a fix, but FWIW the commits which added
the missing headers are:
commit 04e65df94b31 ("netlink: spec: add shaper YAML spec")
commit 49922401c219 ("ethtool: separate definitions that are gonna be generated")
Linus Torvalds [Thu, 6 Feb 2025 17:14:54 +0000 (09:14 -0800)]
Merge tag 'net-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Interestingly the recent kmemleak improvements allowed our CI to catch
a couple of percpu leaks addressed here.
We (mostly Jakub, to be accurate) are working to increase review
coverage over the net code-base tweaking the MAINTAINER entries.
Current release - regressions:
- core: harmonize tstats and dstats
- ipv6: fix dst refleaks in rpl, seg6 and ioam6 lwtunnels
- eth: tun: revert fix group permission check
- eth: stmmac: revert "specify hardware capability value when FIFO
size isn't specified"
Previous releases - regressions:
- udp: gso: do not drop small packets when PMTU reduces
- rxrpc: fix race in call state changing vs recvmsg()
- eth: ice: fix Rx data path for heavy 9k MTU traffic
- eth: vmxnet3: fix tx queue race condition with XDP
Previous releases - always broken:
- sched: pfifo_tail_enqueue: drop new packet when sch->limit == 0
- ethtool: ntuple: fix rss + ring_cookie check
- rxrpc: fix the rxrpc_connection attend queue handling
Misc:
- recognize Kuniyuki Iwashima as a maintainer"
* tag 'net-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (34 commits)
Revert "net: stmmac: Specify hardware capability value when FIFO size isn't specified"
MAINTAINERS: add a sample ethtool section entry
MAINTAINERS: add entry for ethtool
rxrpc: Fix race in call state changing vs recvmsg()
rxrpc: Fix call state set to not include the SERVER_SECURING state
net: sched: Fix truncation of offloaded action statistics
tun: revert fix group permission check
selftests/tc-testing: Add a test case for qdisc_tree_reduce_backlog()
netem: Update sch->q.qlen before qdisc_tree_reduce_backlog()
selftests/tc-testing: Add a test case for pfifo_head_drop qdisc when limit==0
pfifo_tail_enqueue: Drop new packet when sch->limit == 0
selftests: mptcp: connect: -f: no reconnect
net: rose: lock the socket in rose_bind()
net: atlantic: fix warning during hot unplug
rxrpc: Fix the rxrpc_connection attend queue handling
net: harmonize tstats and dstats
selftests: drv-net: rss_ctx: don't fail reconfigure test if queue offset not supported
selftests: drv-net: rss_ctx: add missing cleanup in queue reconfigure
ethtool: ntuple: fix rss + ring_cookie check
ethtool: rss: fix hiding unsupported fields in dumps
...
Revert "net: stmmac: Specify hardware capability value when FIFO size isn't specified"
This reverts commit 8865d22656b4, which caused breakage for platforms
which are not using xgmac2 or gmac4. Only these two cores have the
capability of providing the FIFO sizes from hardware capability fields
(which are provided in priv->dma_cap.[tr]x_fifo_size.)
All other cores can not, which results in these two fields containing
zero. We also have platforms that do not provide a value in
priv->plat->[tr]x_fifo_size, resulting in these also being zero.
This causes the new tests introduced by the reverted commit to fail,
and produce e.g.:
An example of such a platform which fails is QEMU's npcm750-evb.
This uses dwmac1000 which, as noted above, does not have the capability
to provide the FIFO sizes from hardware.
Therefore, revert the commit to maintain compatibility with the way
the driver used to work.
Alexander Duyck [Tue, 4 Feb 2025 01:00:38 +0000 (17:00 -0800)]
eth: fbnic: set IFF_UNICAST_FLT to avoid enabling promiscuous mode when adding unicast addrs
I realized when we were adding unicast addresses we were enabling
promiscuous mode. I did a bit of digging and realized we had overlooked
setting the driver private flag to indicate we supported unicast filtering.
Example below shows the table with 00deadbeef01 as the main NIC address,
and 5 additional addresses in the 00deadbeefX0 format.
Before rule 31 would be active. With this change it correctly sticks
to just the unicast filters.
Signed-off-by: Alexander Duyck <alexanderduyck@meta.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250204010038.1404268-2-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Alexander Duyck [Tue, 4 Feb 2025 01:00:37 +0000 (17:00 -0800)]
eth: fbnic: add MAC address TCAM to debugfs
Add read only access to the 32-entry MAC address TCAM via debugfs.
BMC filtering shares the same table so this is quite useful
to access during debug. See next commit for an example output.
Signed-off-by: Alexander Duyck <alexanderduyck@meta.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250204010038.1404268-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Mon, 3 Feb 2025 21:55:09 +0000 (13:55 -0800)]
tools: ynl-gen: don't output external constants
A definition with a "header" property is an "external" definition
for C code, as in it is defined already in another C header file.
Other languages will need the exact value but C codegen should
not recreate it. So don't output those definitions in the uAPI
header.
Jakub Kicinski [Tue, 4 Feb 2025 21:57:50 +0000 (13:57 -0800)]
MAINTAINERS: add a sample ethtool section entry
I feel like we don't do a good enough keeping authors of driver
APIs around. The ethtool code base was very nicely compartmentalized
by Michal. Establish a precedent of creating MAINTAINERS entries
for "sections" of the ethtool API. Use Andrew and cable test as
a sample entry. The entry should ideally cover 3 elements:
a core file, test(s), and keywords. The last one is important
because we intend the entries to cover core code *and* reviews
of drivers implementing given API!
Jakub Kicinski [Tue, 4 Feb 2025 21:57:29 +0000 (13:57 -0800)]
MAINTAINERS: add entry for ethtool
Michal did an amazing job converting ethtool to Netlink, but never
added an entry to MAINTAINERS for himself. Create a formal entry
so that we can delegate (portions) of this code to folks.
Over the last 3 years majority of the reviews have been done by
Andrew and I. I suppose Michal didn't want to be on the receiving
end of the flood of patches.
====================
Support one PTP device per hardware clock
This series contains two features from Jianbo, followed by simple
cleanups.
Patches 1-9 by Jianbo add support for one PTP device per hardware clock,
described below [1].
Patches 10-12 by Jianbo add support for 200Gbps per-lane link modes in
kernel and mlx5 driver.
Patches 13-15 are simple cleanups by Gal and Carolina.
[1]
PHC (PTP hardware clock) is normally shared by multiple functions
(PF/VF/SF). mlx5 driver currently creates a separate PTP device for each
network interface that shares one PHC.
PHC can be configured to work as free running mode or real time mode.
In this series, only one PTP device is created for the shared PHC when
it is running in real time mode.
To support this feature,
* Firmware needs to support clock identity. When functions share a
PHC, the clock identities they query are same.
* Driver dynamically allocates mlx5_clock to represent a PHC.
* New devcom component is added for hardware clock. Functions are
grouped by the identity, and one mlx5_clock is allocated and shared
by the functions with the same identity.
* When PTP device accesses PHC by its callbacks, the first function
in the clock devcom list is selected to send commands to firmware.
* PPS IN event is armed on one function. It should be re-armed on
the other one when current is unloaded.
====================
Carolina Jubran [Mon, 3 Feb 2025 21:35:16 +0000 (23:35 +0200)]
net/mlx5e: Avoid WARN_ON when configuring MQPRIO with HTB offload enabled
When attempting to enable MQPRIO while HTB offload is already
configured, the driver currently returns `-EINVAL` and triggers a
`WARN_ON`, leading to an unnecessary call trace.
Update the code to handle this case more gracefully by returning
`-EOPNOTSUPP` instead, while also providing a helpful user message.
Jianbo Liu [Mon, 3 Feb 2025 21:35:13 +0000 (23:35 +0200)]
net/mlx5e: Support FEC settings for 200G per lane link modes
Add support to show and config FEC by ethtool for 200G/lane link
modes. The RS encoding setting is mapped, and can be overridden to
FEC_RS_544_514_INTERLEAVED_QUAD for these modes.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jianbo Liu [Mon, 3 Feb 2025 21:35:10 +0000 (23:35 +0200)]
net/mlx5: Generate PPS IN event on new function for shared clock
As a specific function (mdev) is chosen to send MTPPSE command to
firmware, the event is generated only on that function. When that
function is unloaded, the PPS event can't be forward to PTP device,
even when there are other functions in the group, and PTP device is
not destroyed. To resolve this problem, need to send MTPPSE again from
new function, and dis-arm the event on old function after that.
PPS events are handled by EQ notifier. The async EQs and notifiers are
destroyed in mlx5_eq_table_destroy() which is called before
mlx5_cleanup_clock(). During the period between
mlx5_eq_table_destroy() and mlx5_cleanup_clock(), the events can't be
handled. To avoid event loss, add mlx5_clock_unload() in mlx5_unload()
to arm the event on other available function, and mlx5_clock_load in
mlx5_load() for symmetry.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jianbo Liu [Mon, 3 Feb 2025 21:35:09 +0000 (23:35 +0200)]
net/mlx5: Support one PTP device per hardware clock
Currently, mlx5 driver exposes a PTP device for each network interface,
resulting in multiple device nodes representing the same underlying
PHC (PTP hardware clock). This causes problem if it is trying to
synchronize to itself. For instance, when ptp4l operates on multiple
interfaces following different masters, phc2sys attempts to
synchronize them in automatic mode.
PHC can be configured to work as free running mode or real time mode.
All functions can access it directly. In this patch, we create one PTP
device for each PHC when it's running in real time mode. All the
functions share the same PTP device if the clock identifies they query
are same, and they are already grouped by devcom in previous commit.
The first mdev in the peer list is chosen when sending
MTPPS/MTUTC/MTPPSE/MRTCQ to firmware. Since the function can be
unloaded at any time, we need to use a mutex lock to protect the mdev
pointer used in PTP and PPS callbacks. Besides, new one should be
picked from the peer list when the current is not available.
The clock info, which is used by IB, is shared by all the interfaces
using the same hardware clock.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jianbo Liu [Mon, 3 Feb 2025 21:35:08 +0000 (23:35 +0200)]
net/mlx5: Move PPS notifier and out_work to clock_state
The PPS notifier is currently in mlx5_clock, and mlx5_clock can be
shared in later patch, so the notifier should be registered for each
device to avoid any event miss. Besides, the out_work is scheduled by
PPS out event which is triggered only when the device is in free
running mode. So, both are moved to mlx5_core_dev's clock_state.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jianbo Liu [Mon, 3 Feb 2025 21:35:06 +0000 (23:35 +0200)]
net/mlx5: Change clock in mlx5_core_dev to mlx5_clock pointer
Change clock member in mlx5_core_dev to a pointer, so it can point to
a clock shared by multiple functions in later patch.
For now, each function has its own clock, so mdev in mlx5_clock_priv
is the back pointer to the function. Later it points to one (normally
the first one) of the multiple functions sharing the same clock.
Change mlx5_init_clock() to return error if mlx5_clock is not
allocated. Besides, a null clock is defined and used when hardware
clock is not supported. So, the clock pointer is always pointing to
something valid.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jianbo Liu [Mon, 3 Feb 2025 21:35:05 +0000 (23:35 +0200)]
net/mlx5: Add API to get mlx5_core_dev from mlx5_clock
The mdev is calculated directly from mlx5_clock, as it's one of the
fields in mlx5_core_dev. Move to a function so it can be easily
changed in next patch.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jianbo Liu [Mon, 3 Feb 2025 21:35:04 +0000 (23:35 +0200)]
net/mlx5: Add init and destruction functions for a single HW clock
Move hardware clock initialization and destruction to the functions,
which will be used for dynamically allocated clock. Such clock is
shared by all the devices if the queried clock identities are same.
The out_work is for PPS out event, which can't be triggered when clock
is shared, so INIT_WORK is not moved to the initialization function.
Besides, we still need to register notifier for each device.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jianbo Liu [Mon, 3 Feb 2025 21:35:03 +0000 (23:35 +0200)]
net/mlx5: Change parameters for PTP internal functions
In later patch, the mlx5_clock will be allocated dynamically, its
address can be obtained from mlx5_core_dev struct, but mdev can't be
obtained from mlx5_clock because it can be shared by multiple
interfaces. So change the parameter for such internal functions, only
mdev is passed down from the callers.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
vxlan: Age FDB entries based on Rx traffic
tl;dr - This patchset prevents VXLAN FDB entries from lingering if
traffic is only forwarded to a silent host.
The VXLAN driver maintains two timestamps for each FDB entry: 'used' and
'updated'. The first is refreshed by both the Rx and Tx paths and the
second is refreshed upon migration.
The driver ages out entries according to their 'used' time which means
that an entry can linger when traffic is only forwarded to a silent host
that might have migrated to a different remote.
This patchset solves the problem by adjusting the above semantics and
aligning them to those of the bridge driver. That is, 'used' time is
refreshed by the Tx path, 'updated' time is refresh by Rx path or user
space updates and entries are aged out according to their 'updated'
time.
Patches #1-#2 perform small changes in how the 'used' and 'updated'
fields are accessed.
Patches #3-#5 refresh the 'updated' time where needed.
Patch #6 flips the driver to age out FDB entries according to their
'updated' time.
Patch #7 removes unnecessary updates to the 'used' time.
Patch #8 extends a test case to cover aging of FDB entries in the
presence of Tx traffic.
====================
Ido Schimmel [Tue, 4 Feb 2025 14:55:48 +0000 (16:55 +0200)]
vxlan: Avoid unnecessary updates to FDB 'used' time
Now that the VXLAN driver ages out FDB entries based on their 'updated'
time we can remove unnecessary updates of the 'used' time from the Rx
path and the control path, so that the 'used' time is only updated by
the Tx path.
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250204145549.1216254-8-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Tue, 4 Feb 2025 14:55:47 +0000 (16:55 +0200)]
vxlan: Age out FDB entries based on 'updated' time
Currently, the VXLAN driver ages out FDB entries based on their 'used'
time which is refreshed by both the Tx and Rx paths. This means that an
FDB entry will not age out if traffic is only forwarded to the target
host:
# ip link add name vx1 up type vxlan id 10010 local 192.0.2.1 dstport 4789 learning ageing 10
# bridge fdb add 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1
# bridge fdb get 00:11:22:33:44:55 br vx1 self
00:11:22:33:44:55 dev vx1 dst 198.51.100.1 self
# mausezahn vx1 -a own -b 00:11:22:33:44:55 -c 0 -p 100 -q &
# sleep 20
# bridge fdb get 00:11:22:33:44:55 br vx1 self
00:11:22:33:44:55 dev vx1 dst 198.51.100.1 self
This is wrong as an FDB entry will remain present when we no longer have
an indication that the host is still behind the current remote. It is
also inconsistent with the bridge driver:
# ip link add name br1 up type bridge ageing_time $((10 * 100))
# ip link add name swp1 up master br1 type dummy
# bridge fdb add 00:11:22:33:44:55 dev swp1 master dynamic
# bridge fdb get 00:11:22:33:44:55 br br1
00:11:22:33:44:55 dev swp1 master br1
# mausezahn br1 -a own -b 00:11:22:33:44:55 -c 0 -p 100 -q &
# sleep 20
# bridge fdb get 00:11:22:33:44:55 br br1
Error: Fdb entry not found.
Solve this by aging out entries based on their 'updated' time, which is
not refreshed by the Tx path:
# ip link add name vx1 up type vxlan id 10010 local 192.0.2.1 dstport 4789 learning ageing 10
# bridge fdb add 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1
# bridge fdb get 00:11:22:33:44:55 br vx1 self
00:11:22:33:44:55 dev vx1 dst 198.51.100.1 self
# mausezahn vx1 -a own -b 00:11:22:33:44:55 -c 0 -p 100 -q &
# sleep 20
# bridge fdb get 00:11:22:33:44:55 br vx1 self
Error: Fdb entry not found.
But is refreshed by the Rx path:
# ip address add 192.0.2.1/32 dev lo
# ip link add name vx1 up type vxlan id 10010 local 192.0.2.1 dstport 4789 localbypass
# ip link add name vx2 up type vxlan id 20010 local 192.0.2.1 dstport 4789 learning ageing 10
# bridge fdb add 00:11:22:33:44:55 dev vx1 self static dst 127.0.0.1 vni 20010
# mausezahn vx1 -a 00:aa:bb:cc:dd:ee -b 00:11:22:33:44:55 -c 0 -p 100 -q &
# sleep 20
# bridge fdb get 00:aa:bb:cc:dd:ee br vx2 self
00:aa:bb:cc:dd:ee dev vx2 dst 127.0.0.1 self
# pkill mausezahn
# sleep 20
# bridge fdb get 00:aa:bb:cc:dd:ee br vx2 self
Error: Fdb entry not found.
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250204145549.1216254-7-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Tue, 4 Feb 2025 14:55:46 +0000 (16:55 +0200)]
vxlan: Refresh FDB 'updated' time upon user space updates
When a host migrates to a different remote and a packet is received from
the new remote, the corresponding FDB entry is updated and its 'updated'
time is refreshed.
However, when user space replaces the remote of an FDB entry, its
'updated' time is not refreshed:
# ip link add name vx1 up type vxlan id 10010 dstport 4789
# bridge fdb add 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1
# sleep 10
# bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]'
10
# bridge fdb replace 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.2
# bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]'
10
This can lead to the entry being aged out prematurely and it is also
inconsistent with the bridge driver:
# ip link add name br1 up type bridge
# ip link add name swp1 master br1 up type dummy
# ip link add name swp2 master br1 up type dummy
# bridge fdb add 00:11:22:33:44:55 dev swp1 master dynamic vlan 1
# sleep 10
# bridge -s -j fdb get 00:11:22:33:44:55 br br1 vlan 1 | jq '.[]["updated"]'
10
# bridge fdb replace 00:11:22:33:44:55 dev swp2 master dynamic vlan 1
# bridge -s -j fdb get 00:11:22:33:44:55 br br1 vlan 1 | jq '.[]["updated"]'
0
Adjust the VXLAN driver to refresh the 'updated' time of an FDB entry
whenever one of its attributes is changed by user space:
# ip link add name vx1 up type vxlan id 10010 dstport 4789
# bridge fdb add 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1
# sleep 10
# bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]'
10
# bridge fdb replace 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.2
# bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]'
0
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250204145549.1216254-6-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Tue, 4 Feb 2025 14:55:45 +0000 (16:55 +0200)]
vxlan: Refresh FDB 'updated' time upon 'NTF_USE'
The 'NTF_USE' flag can be used by user space to refresh FDB entries so
that they will not age out. Currently, the VXLAN driver implements it by
refreshing the 'used' field in the FDB entry as this is the field
according to which FDB entries are aged out.
Subsequent patches will switch the VXLAN driver to age out entries based
on the 'updated' field. Prepare for this change by refreshing the
'updated' field upon 'NTF_USE'. This is consistent with the bridge
driver's FDB:
# ip link add name br1 up type bridge
# ip link add name swp1 master br1 up type dummy
# bridge fdb add 00:11:22:33:44:55 dev swp1 master dynamic vlan 1
# sleep 10
# bridge fdb replace 00:11:22:33:44:55 dev swp1 master dynamic vlan 1
# bridge -s -j fdb get 00:11:22:33:44:55 br br1 vlan 1 | jq '.[]["updated"]'
10
# sleep 10
# bridge fdb replace 00:11:22:33:44:55 dev swp1 master use dynamic vlan 1
# bridge -s -j fdb get 00:11:22:33:44:55 br br1 vlan 1 | jq '.[]["updated"]'
0
Before:
# ip link add name vx1 up type vxlan id 10010 dstport 4789
# bridge fdb add 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1
# sleep 10
# bridge fdb replace 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1
# bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]'
10
# sleep 10
# bridge fdb replace 00:11:22:33:44:55 dev vx1 self use dynamic dst 198.51.100.1
# bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]'
20
After:
# ip link add name vx1 up type vxlan id 10010 dstport 4789
# bridge fdb add 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1
# sleep 10
# bridge fdb replace 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1
# bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]'
10
# sleep 10
# bridge fdb replace 00:11:22:33:44:55 dev vx1 self use dynamic dst 198.51.100.1
# bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]'
0
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250204145549.1216254-5-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Tue, 4 Feb 2025 14:55:44 +0000 (16:55 +0200)]
vxlan: Always refresh FDB 'updated' time when learning is enabled
Currently, when learning is enabled and a packet is received from the
expected remote, the 'updated' field of the FDB entry is not refreshed.
This will become a problem when we switch the VXLAN driver to age out
entries based on the 'updated' field.
Solve this by always refreshing an FDB entry when we receive a packet
with a matching source MAC address, regardless if it was received via
the expected remote or not as it indicates the host is alive. This is
consistent with the bridge driver's FDB.
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250204145549.1216254-4-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Tue, 4 Feb 2025 14:55:42 +0000 (16:55 +0200)]
vxlan: Annotate FDB data races
The 'used' and 'updated' fields in the FDB entry structure can be
accessed concurrently by multiple threads, leading to reports such as
[1]. Can be reproduced using [2].
Suppress these reports by annotating these accesses using
READ_ONCE() / WRITE_ONCE().
[1]
BUG: KCSAN: data-race in vxlan_xmit / vxlan_xmit
write to 0xffff942604d263a8 of 8 bytes by task 286 on cpu 0:
vxlan_xmit+0xb29/0x2380
dev_hard_start_xmit+0x84/0x2f0
__dev_queue_xmit+0x45a/0x1650
packet_xmit+0x100/0x150
packet_sendmsg+0x2114/0x2ac0
__sys_sendto+0x318/0x330
__x64_sys_sendto+0x76/0x90
x64_sys_call+0x14e8/0x1c00
do_syscall_64+0x9e/0x1a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
read to 0xffff942604d263a8 of 8 bytes by task 287 on cpu 2:
vxlan_xmit+0xadf/0x2380
dev_hard_start_xmit+0x84/0x2f0
__dev_queue_xmit+0x45a/0x1650
packet_xmit+0x100/0x150
packet_sendmsg+0x2114/0x2ac0
__sys_sendto+0x318/0x330
__x64_sys_sendto+0x76/0x90
x64_sys_call+0x14e8/0x1c00
do_syscall_64+0x9e/0x1a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
value changed: 0x00000000fffbac6e -> 0x00000000fffbac6f
Reported by Kernel Concurrency Sanitizer on:
CPU: 2 UID: 0 PID: 287 Comm: mausezahn Not tainted 6.13.0-rc7-01544-gb4b270f11a02 #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
[2]
#!/bin/bash
set +H
echo whitelist > /sys/kernel/debug/kcsan
echo !vxlan_xmit > /sys/kernel/debug/kcsan
ip link add name vx0 up type vxlan id 10010 dstport 4789 local 192.0.2.1
bridge fdb add 00:11:22:33:44:55 dev vx0 self static dst 198.51.100.1
taskset -c 0 mausezahn vx0 -a own -b 00:11:22:33:44:55 -c 0 -q &
taskset -c 2 mausezahn vx0 -a own -b 00:11:22:33:44:55 -c 0 -q &
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250204145549.1216254-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ipv4: ip_gre: Fix set but not used warning in ipgre_err() if IPv4-only
if CONFIG_NET_IPGRE is enabled, but CONFIG_IPV6 is disabled:
net/ipv4/ip_gre.c: In function ‘ipgre_err’:
net/ipv4/ip_gre.c:144:22: error: variable ‘data_len’ set but not used [-Werror=unused-but-set-variable]
144 | unsigned int data_len = 0;
| ^~~~~~~~
Fix this by moving all data_len processing inside the IPV6-only section
that uses its result.
Jakub Kicinski [Thu, 6 Feb 2025 02:47:50 +0000 (18:47 -0800)]
Merge branch 'rxrpc-call-state-fixes'
David Howells says:
====================
rxrpc: Call state fixes
Here some call state fixes for AF_RXRPC.
(1) Fix the state of a call to not treat the challenge-response cycle as
part of an incoming call's state set. The problem is that it makes
handling received of the final packet in the receive phase difficult
as that wants to change the call state - but security negotiations may
not yet be complete.
(2) Fix a race between the changing of the call state at the end of the
request reception phase of a service call, recvmsg() collecting the last
data and sendmsg() trying to send the reply before the I/O thread has
advanced the call state.
David Howells [Tue, 4 Feb 2025 23:05:54 +0000 (23:05 +0000)]
rxrpc: Fix race in call state changing vs recvmsg()
There's a race in between the rxrpc I/O thread recording the end of the
receive phase of a call and recvmsg() examining the state of the call to
determine whether it has completed.
The problem is that call->_state records the I/O thread's view of the call,
not the application's view (which may lag), so that alone is not
sufficient. To this end, the application also checks whether there is
anything left in call->recvmsg_queue for it to pick up. The call must be
in state RXRPC_CALL_COMPLETE and the recvmsg_queue empty for the call to be
considered fully complete.
In rxrpc_input_queue_data(), the latest skbuff is added to the queue and
then, if it was marked as LAST_PACKET, the state is advanced... But this
is two separate operations with no locking around them.
As a consequence, the lack of locking means that sendmsg() can jump into
the gap on a service call and attempt to send the reply - but then get
rejected because the I/O thread hasn't advanced the state yet.
Simply flipping the order in which things are done isn't an option as that
impacts the client side, causing the checks in rxrpc_kernel_check_life() as
to whether the call is still alive to race instead.
Fix this by moving the update of call->_state inside the skb queue
spinlocked section where the packet is queued on the I/O thread side.
rxrpc's recvmsg() will then automatically sync against this because it has
to take the call->recvmsg_queue spinlock in order to dequeue the last
packet.
rxrpc's sendmsg() doesn't need amending as the app shouldn't be calling it
to send a reply until recvmsg() indicates it has returned all of the
request.
Fixes: 93368b6bd58a ("rxrpc: Move call state changes from recvmsg to I/O thread") Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/20250204230558.712536-3-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Tue, 4 Feb 2025 23:05:53 +0000 (23:05 +0000)]
rxrpc: Fix call state set to not include the SERVER_SECURING state
The RXRPC_CALL_SERVER_SECURING state doesn't really belong with the other
states in the call's state set as the other states govern the call's Rx/Tx
phase transition and govern when packets can and can't be received or
transmitted. The "Securing" state doesn't actually govern the reception of
packets and would need to be split depending on whether or not we've
received the last packet yet (to mirror RECV_REQUEST/ACK_REQUEST).
The "Securing" state is more about whether or not we can start forwarding
packets to the application as recvmsg will need to decode them and the
decoding can't take place until the challenge/response exchange has
completed.
Fix this by removing the RXRPC_CALL_SERVER_SECURING state from the state
set and, instead, using a flag, RXRPC_CALL_CONN_CHALLENGING, to track
whether or not we can queue the call for reception by recvmsg() or notify
the kernel app that data is ready. In the event that we've already
received all the packets, the connection event handler will poke the app
layer in the appropriate manner.
Also there's a race whereby the app layer sees the last packet before rxrpc
has managed to end the rx phase and change the state to one amenable to
allowing a reply. Fix this by queuing the packet after calling
rxrpc_end_rx_phase().
Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/20250204230558.712536-2-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Tue, 4 Feb 2025 12:38:39 +0000 (14:38 +0200)]
net: sched: Fix truncation of offloaded action statistics
In case of tc offload, when user space queries the kernel for tc action
statistics, tc will query the offloaded statistics from device drivers.
Among other statistics, drivers are expected to pass the number of
packets that hit the action since the last query as a 64-bit number.
Unfortunately, tc treats the number of packets as a 32-bit number,
leading to truncation and incorrect statistics when the number of
packets since the last query exceeds 0xffffffff:
$ tc -s filter show dev swp2 ingress
filter protocol all pref 1 flower chain 0
filter protocol all pref 1 flower chain 0 handle 0x1
skip_sw
in_hw in_hw_count 1
action order 1: mirred (Egress Redirect to device swp1) stolen
index 1 ref 1 bind 1 installed 58 sec used 0 sec
Action statistics:
Sent 1133877034176 bytes 536959475 pkt (dropped 0, overlimits 0 requeues 0)
[...]
According to the above, 2111-byte packets were redirected which is
impossible as only 64-byte packets were transmitted and the MTU was
1500.
Fix by treating packets as a 64-bit number:
$ tc -s filter show dev swp2 ingress
filter protocol all pref 1 flower chain 0
filter protocol all pref 1 flower chain 0 handle 0x1
skip_sw
in_hw in_hw_count 1
action order 1: mirred (Egress Redirect to device swp1) stolen
index 1 ref 1 bind 1 installed 61 sec used 0 sec
Action statistics:
Sent 1370624380864 bytes 21416005951 pkt (dropped 0, overlimits 0 requeues 0)
[...]
Eric Dumazet [Tue, 4 Feb 2025 14:48:25 +0000 (14:48 +0000)]
net: flush_backlog() small changes
Add READ_ONCE() around reads of skb->dev->reg_state, because
this field can be changed from other threads/cpus.
Instead of calling dev_kfree_skb_irq() and kfree_skb()
while interrupts are masked and locks held,
use a temporary list and use __skb_queue_purge_reason()
Use SKB_DROP_REASON_DEV_READY drop reason to better
describe why these skbs are dropped.
Aswin Karuvally [Tue, 4 Feb 2025 10:31:35 +0000 (11:31 +0100)]
s390/net: Remove LCS driver
The original Open Systems Adapter (OSA) was introduced by IBM in the
mid-90s. These were then superseded by OSA-Express in 1999 which used
Queued Direct IO to greatly improve throughput. The newer cards
retained the older, slower non-QDIO (OSE) modes for compatibility with
older systems. In Linux, the lcs driver was responsible for cards
operating in the older OSE mode and the qeth driver was introduced to
allow the OSA-Express cards to operate in the newer QDIO (OSD) mode.
For an S390 machine from 1998 or later, there is no reason to use the
OSE mode and lcs driver as all OSA cards since 1999 provide the faster
OSD mode. As a result, it's been years since we have heard of a
customer configuration involving the lcs driver.
This patch removes the lcs driver. The technology it supports has been
obsolete for past 25+ years and is irrelevant for current use cases.
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Peter Oberparleiter <oberpar@linux.ibm.com> Signed-off-by: Aswin Karuvally <aswin@linux.ibm.com> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250204103135.1619097-1-wintera@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cxgb4: Avoid a -Wflex-array-member-not-at-end warning
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.
Move the conflicting declaration to the end of the structure. Notice
that `struct ethtool_dump` is a flexible structure --a structure that
contains a flexible-array member.
Fix the following warning:
./drivers/net/ethernet/chelsio/cxgb4/cxgb4.h:1215:29: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://patch.msgid.link/Z6GBZ4brXYffLkt_@kspp Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cong Wang [Tue, 4 Feb 2025 00:58:41 +0000 (16:58 -0800)]
selftests/tc-testing: Add a test case for qdisc_tree_reduce_backlog()
Integrate the test case provided by Mingi Cho into TDC.
All test results:
1..4
ok 1 ca5e - Check class delete notification for ffff:
ok 2 e4b7 - Check class delete notification for root ffff:
ok 3 33a9 - Check ingress is not searchable on backlog update
ok 4 a4b9 - Test class qlen notification
Cong Wang [Tue, 4 Feb 2025 00:58:40 +0000 (16:58 -0800)]
netem: Update sch->q.qlen before qdisc_tree_reduce_backlog()
qdisc_tree_reduce_backlog() notifies parent qdisc only if child
qdisc becomes empty, therefore we need to reduce the backlog of the
child qdisc before calling it. Otherwise it would miss the opportunity
to call cops->qlen_notify(), in the case of DRR, it resulted in UAF
since DRR uses ->qlen_notify() to maintain its active list.
Fixes: f8d4bc455047 ("net/sched: netem: account for backlog updates from child qdisc") Cc: Martin Ottens <martin.ottens@fau.de> Reported-by: Mingi Cho <mincho@theori.io> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Link: https://patch.msgid.link/20250204005841.223511-4-xiyou.wangcong@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Quang Le [Tue, 4 Feb 2025 00:58:39 +0000 (16:58 -0800)]
selftests/tc-testing: Add a test case for pfifo_head_drop qdisc when limit==0
When limit == 0, pfifo_tail_enqueue() must drop new packet and
increase dropped packets count of the qdisc.
All test results:
1..16
ok 1 a519 - Add bfifo qdisc with system default parameters on egress
ok 2 585c - Add pfifo qdisc with system default parameters on egress
ok 3 a86e - Add bfifo qdisc with system default parameters on egress with handle of maximum value
ok 4 9ac8 - Add bfifo qdisc on egress with queue size of 3000 bytes
ok 5 f4e6 - Add pfifo qdisc on egress with queue size of 3000 packets
ok 6 b1b1 - Add bfifo qdisc with system default parameters on egress with invalid handle exceeding maximum value
ok 7 8d5e - Add bfifo qdisc on egress with unsupported argument
ok 8 7787 - Add pfifo qdisc on egress with unsupported argument
ok 9 c4b6 - Replace bfifo qdisc on egress with new queue size
ok 10 3df6 - Replace pfifo qdisc on egress with new queue size
ok 11 7a67 - Add bfifo qdisc on egress with queue size in invalid format
ok 12 1298 - Add duplicate bfifo qdisc on egress
ok 13 45a0 - Delete nonexistent bfifo qdisc
ok 14 972b - Add prio qdisc on egress with invalid format for handles
ok 15 4d39 - Delete bfifo qdisc twice
ok 16 d774 - Check pfifo_head_drop qdisc enqueue behaviour when limit == 0
Quang Le [Tue, 4 Feb 2025 00:58:38 +0000 (16:58 -0800)]
pfifo_tail_enqueue: Drop new packet when sch->limit == 0
Expected behaviour:
In case we reach scheduler's limit, pfifo_tail_enqueue() will drop a
packet in scheduler's queue and decrease scheduler's qlen by one.
Then, pfifo_tail_enqueue() enqueue new packet and increase
scheduler's qlen by one. Finally, pfifo_tail_enqueue() return
`NET_XMIT_CN` status code.
Weird behaviour:
In case we set `sch->limit == 0` and trigger pfifo_tail_enqueue() on a
scheduler that has no packet, the 'drop a packet' step will do nothing.
This means the scheduler's qlen still has value equal 0.
Then, we continue to enqueue new packet and increase scheduler's qlen by
one. In summary, we can leverage pfifo_tail_enqueue() to increase qlen by
one and return `NET_XMIT_CN` status code.
The problem is:
Let's say we have two qdiscs: Qdisc_A and Qdisc_B.
- Qdisc_A's type must have '->graft()' function to create parent/child relationship.
Let's say Qdisc_A's type is `hfsc`. Enqueue packet to this qdisc will trigger `hfsc_enqueue`.
- Qdisc_B's type is pfifo_head_drop. Enqueue packet to this qdisc will trigger `pfifo_tail_enqueue`.
- Qdisc_B is configured to have `sch->limit == 0`.
- Qdisc_A is configured to route the enqueued's packet to Qdisc_B.
Enqueue packet through Qdisc_A will lead to:
- hfsc_enqueue(Qdisc_A) -> pfifo_tail_enqueue(Qdisc_B)
- Qdisc_B->q.qlen += 1
- pfifo_tail_enqueue() return `NET_XMIT_CN`
- hfsc_enqueue() check for `NET_XMIT_SUCCESS` and see `NET_XMIT_CN` => hfsc_enqueue() don't increase qlen of Qdisc_A.
The whole process lead to a situation where Qdisc_A->q.qlen == 0 and Qdisc_B->q.qlen == 1.
Replace 'hfsc' with other type (for example: 'drr') still lead to the same problem.
This violate the design where parent's qlen should equal to the sum of its childrens'qlen.
Bug impact: This issue can be used for user->kernel privilege escalation when it is reachable.
Fixes: 57dbb2d83d10 ("sched: add head drop fifo queue") Reported-by: Quang Le <quanglex97@gmail.com> Signed-off-by: Quang Le <quanglex97@gmail.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Link: https://patch.msgid.link/20250204005841.223511-2-xiyou.wangcong@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The '-f' parameter is there to force the kernel to emit MPTCP FASTCLOSE
by closing the connection with unread bytes in the receive queue.
The xdisconnect() helper was used to stop the connection, but it does
more than that: it will shut it down, then wait before reconnecting to
the same address. This causes the mptcp_join's "fastclose test" to fail
all the time.
This failure is due to a recent change, with commit 218cc166321f
("selftests: mptcp: avoid spurious errors on disconnect"), but that went
unnoticed because the test is currently ignored. The recent modification
only shown an existing issue: xdisconnect() doesn't need to be used
here, only the shutdown() part is needed.
Fixes: 6bf41020b72b ("selftests: mptcp: update and extend fastclose test-cases") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250204-net-mptcp-sft-conn-f-v1-1-6b470c72fffa@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Petr Machata [Tue, 4 Feb 2025 17:37:15 +0000 (18:37 +0100)]
bridge: mdb: Allow replace of a host-joined group
Attempts to replace an MDB group membership of the host itself are
currently bounced:
# ip link add name br up type bridge vlan_filtering 1
# bridge mdb replace dev br port br grp 239.0.0.1 vid 2
# bridge mdb replace dev br port br grp 239.0.0.1 vid 2
Error: bridge: Group is already joined by host.
A similar operation done on a member port would succeed. Ignore the check
for replacement of host group memberships as well.
The bit of code that this enables is br_multicast_host_join(), which, for
already-joined groups only refreshes the MC group expiration timer, which
is desirable; and a userspace notification, also desirable.
Change a selftest that exercises this code path from expecting a rejection
to expecting a pass. The rest of MDB selftests pass without modification.
====================
net-sysfs: remove the rtnl_trylock/restart_syscall construction
The series initially aimed at improving spins (and thus delays) while
accessing net sysfs under rtnl lock contention[1]. The culprit was the
trylock/restart_syscall constructions. There wasn't much interest at the
time but it got traction recently for other reasons (lowering the rtnl
lock pressure).
Since v1[2]:
- Do not export rtnl_lock_interruptible [Stephen].
- Add netdev_warn_once messages in rx_queue_add_kobject [Jakub].
Since the RFC[1]:
- Limit the breaking of the sysfs protection to sysfs_rtnl_lock() only
as this is not needed in the whole rtnl locking section thanks to the
additional check on dev_isalive(). This simplifies error handling as
well as the unlocking path.
- Used an interruptible version of rtnl_lock, as done by Jakub in
his experiments.
- Removed a WARN_ONCE_ONCE [Greg].
- Removed explicit inline markers [Stephen].
Most of the reasoning is explained in comments added in patch 1. This
was tested by stress-testing net sysfs attributes (read/write ops) while
adding/removing queues and adding/removing veths, all in parallel. I
also used an OCP single node cluster, spawning lots of pods.
Antoine Tenart [Tue, 4 Feb 2025 17:03:12 +0000 (18:03 +0100)]
net-sysfs: prevent uncleared queues from being re-added
With the (upcoming) removal of the rtnl_trylock/restart_syscall logic
and because of how Tx/Rx queues are implemented (and their
requirements), it might happen that a queue is re-added before having
the chance to be cleared. In such rare case, do not complete the queue
addition operation.
Antoine Tenart [Tue, 4 Feb 2025 17:03:11 +0000 (18:03 +0100)]
net-sysfs: move queue attribute groups outside the default groups
Rx/tx queues embed their own kobject for registering their per-queue
sysfs files. The issue is they're using the kobject default groups for
this and entirely rely on the kobject refcounting for releasing their
sysfs paths.
In order to remove rtnl_trylock calls we need sysfs files not to rely on
their associated kobject refcounting for their release. Thus we here
move queues sysfs files from the kobject default groups to their own
groups which can be removed separately.
Antoine Tenart [Tue, 4 Feb 2025 17:03:10 +0000 (18:03 +0100)]
net-sysfs: remove rtnl_trylock from device attributes
There is an ABBA deadlock between net device unregistration and sysfs
files being accessed[1][2]. To prevent this from happening all paths
taking the rtnl lock after the sysfs one (actually kn->active refcount)
use rtnl_trylock and return early (using restart_syscall)[3], which can
make syscalls to spin for a long time when there is contention on the
rtnl lock[4].
There are not many possibilities to improve the above:
- Rework the entire net/ locking logic.
- Invert two locks in one of the paths — not possible.
But here it's actually possible to drop one of the locks safely: the
kernfs_node refcount. More details in the code itself, which comes with
lots of comments.
Note that we check the device is alive in the added sysfs_rtnl_lock
helper to disallow sysfs operations to run after device dismantle has
started. This also help keeping the same behavior as before. Because of
this calls to dev_isalive in sysfs ops were removed.
Linus Torvalds [Wed, 5 Feb 2025 16:13:07 +0000 (08:13 -0800)]
Merge tag 'for-6.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- add lockdep annotation for relocation root to fix a splat warning
while merging roots
- fix assertion failure when splitting ordered extent after transaction
abort
- don't print 'qgroup inconsistent' message when rescan process updates
qgroup data sooner than the subvolume deletion process
- fix use-after-free (accessing the error number) when attempting to
join an aborted transaction
- avoid starting new transaction if not necessary when cleaning qgroup
during subvolume drop
* tag 'for-6.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: avoid starting new transaction when cleaning qgroup during subvolume drop
btrfs: fix use-after-free when attempting to join an aborted transaction
btrfs: do not output error message if a qgroup has been already cleaned up
btrfs: fix assertion failure when splitting ordered extent after transaction abort
btrfs: fix lockdep splat while merging a relocation root
Heiner Kallweit [Mon, 3 Feb 2025 20:41:36 +0000 (21:41 +0100)]
net: phy: realtek: use string choices helpers
Use string choices helpers to simplify the code.
Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202501190707.qQS8PGHW-lkp@intel.com/ Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Breno Leitao [Mon, 3 Feb 2025 19:04:15 +0000 (11:04 -0800)]
netconsole: selftest: Add test for fragmented messages
Add a new selftest to verify netconsole's handling of messages that
exceed the packet size limit and require fragmentation. The test sends
messages with varying sizes and userdata, validating that:
1. Large messages are correctly fragmented and reassembled
2. Userdata fields are properly preserved across fragments
3. Messages work correctly with and without kernel release version
appending
The test creates a networking environment using netdevsim, sends
messages through /dev/kmsg, and verifies the received fragments maintain
message integrity.
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.
Remove unused flexible-array member `buf` and, with this, fix the following
warnings:
drivers/net/ethernet/aquantia/atlantic/aq_hw.h:197:36: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
drivers/net/ethernet/aquantia/atlantic/hw_atl/../aq_hw.h:197:36: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
Suggested-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Igor Russkikh <irusskikh@marvell.com> Link: https://patch.msgid.link/Z6F3KZVfnAZ2FoJm@kspp Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Mon, 3 Feb 2025 21:58:16 +0000 (13:58 -0800)]
net: warn if NAPI instance wasn't shut down
Drivers should always disable a NAPI instance before removing it.
If they don't the instance may be queued for polling.
Since commit 86e25f40aa1e ("net: napi: Add napi_config")
we also remove the NAPI from the busy polling hash table
in napi_disable(), so not disabling would leave a stale
entry there.
Use of busy polling is relatively uncommon so bugs may be lurking
in the drivers. Add an explicit warning.
mlxsw_sp_ipip_lb_ul_vr_id() has been unused since 2020's
commit acde33bf7319 ("mlxsw: spectrum_router: Reduce
mlxsw_sp_ipip_fib_entry_op_gre4()")
mlxsw_sp_rif_exists() has been unused since 2023's
commit 49c3a615d382 ("mlxsw: spectrum_router: Replay MACVLANs when RIF is
made")
mlxsw_sp_rif_vid() has been unused since 2023's
commit a5b52692e693 ("mlxsw: spectrum_switchdev: Manage RIFs on PVID
change")
Remove them.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20250203190141.204951-1-linux@treblig.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
mlx5dr_domain_sync() was added in 2019 by
commit 70605ea545e8 ("net/mlx5: DR, Expose APIs for direct rule managing")
but hasn't been used.
Remove it.
mlx5dr_domain_sync() was the only user of
mlx5dr_send_ring_force_drain().
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://patch.msgid.link/20250203185958.204794-1-linux@treblig.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The last use of mlx4_find_cached_mac() was removed in 2014 by
commit 2f5bb473681b ("mlx4: Add ref counting to port MAC table for RoCE")
mlx4_zone_free_entries() was added in 2014 by
commit 7a89399ffad7 ("net/mlx4: Add mlx4_bitmap zone allocator")
but hasn't been used. (The _unique version is used)
Remove them.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://patch.msgid.link/20250203185229.204279-1-linux@treblig.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jacob Moroni [Mon, 3 Feb 2025 14:36:05 +0000 (09:36 -0500)]
net: atlantic: fix warning during hot unplug
Firmware deinitialization performs MMIO accesses which are not
necessary if the device has already been removed. In some cases,
these accesses happen via readx_poll_timeout_atomic which ends up
timing out, resulting in a warning at hw_atl2_utils_fw.c:112:
Fix this by skipping firmware deinitialization altogether if the
PCI device is no longer present.
Tested with an AQC113 attached via Thunderbolt by performing
repeated unplug cycles while traffic was running via iperf.
Fixes: 97bde5c4f909 ("net: ethernet: aquantia: Support for NIC-specific code") Signed-off-by: Jacob Moroni <mail@jakemoroni.com> Reviewed-by: Igor Russkikh <irusskikh@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250203143604.24930-3-mail@jakemoroni.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Tue, 4 Feb 2025 19:01:48 +0000 (11:01 -0800)]
Merge tag 'kthreads-fixes-2025-02-04' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks
Pull kthreads fix from Frederic Weisbecker:
- Properly handle return value when allocation fails for the preferred
affinity
* tag 'kthreads-fixes-2025-02-04' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks:
kthread: Fix return value on kzalloc() failure in kthread_affine_preferred()
Linus Torvalds [Tue, 4 Feb 2025 17:52:01 +0000 (09:52 -0800)]
Merge tag 'livepatching-for-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching
Pull livepatching fix from Petr Mladek:
- Fix livepatching selftests for util-linux-2.40.x
* tag 'livepatching-for-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
selftests: livepatch: handle PRINTK_CALLER in check_result()
Linus Torvalds [Tue, 4 Feb 2025 17:50:02 +0000 (09:50 -0800)]
Merge tag 'platform-drivers-x86-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver fixes from Ilpo Järvinen:
- ideapad-laptop: Pass a correct pointer to the driver data
- intel/ifs: Provide a link to the IFS test images
- intel/pmc: Use large enough type when decoding LTR value
* tag 'platform-drivers-x86-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86/intel/ifs: Update documentation with image download path
platform/x86/intel: pmc: fix ltr decode in pmc_core_ltr_show()
platform/x86: ideapad-laptop: pass a correct pointer to the driver data
David Howells [Mon, 3 Feb 2025 11:03:04 +0000 (11:03 +0000)]
rxrpc: Fix the rxrpc_connection attend queue handling
The rxrpc_connection attend queue is never used because conn::attend_link
is never initialised and so is always NULL'd out and thus always appears to
be busy. This requires the following fix:
(1) Fix this the attend queue problem by initialising conn::attend_link.
And, consequently, two further fixes for things masked by the above bug:
(2) Fix rxrpc_input_conn_event() to handle being invoked with a NULL
sk_buff pointer - something that can now happen with the above change.
(3) Fix the RXRPC_SKB_MARK_SERVICE_CONN_SECURED message to carry a pointer
to the connection and a ref on it.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org Fixes: f2cce89a074e ("rxrpc: Implement a mechanism to send an event notification to a connection") Link: https://patch.msgid.link/20250203110307.7265-3-dhowells@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jithu Joseph [Fri, 31 Jan 2025 20:53:15 +0000 (12:53 -0800)]
platform/x86/intel/ifs: Update documentation with image download path
The documentation previously listed the path to download In Field Scan
(IFS) test images as "TBD".
Update the documentation to include the correct image download
location. Also move the download link to the appropriate section within
the documentation.
Reported-by: Anisse Astier <anisse@astier.eu> Signed-off-by: Jithu Joseph <jithu.joseph@intel.com> Link: https://lore.kernel.org/r/20250131205315.1585663-1-jithu.joseph@intel.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Paolo Abeni [Sat, 1 Feb 2025 18:02:51 +0000 (19:02 +0100)]
net: harmonize tstats and dstats
After the blamed commits below, some UDP tunnel use dstats for
accounting. On the xmit path, all the UDP-base tunnels ends up
using iptunnel_xmit_stats() for stats accounting, and the latter
assumes the relevant (tunnel) network device uses tstats.
The end result is some 'funny' stat report for the mentioned UDP
tunnel, e.g. when no packet is actually dropped and a bunch of
packets are transmitted:
Address the issue ensuring the same binary layout for the overlapping
fields of dstats and tstats. While this solution is a bit hackish, is
smaller and with no performance pitfall compared to other alternatives
i.e. supporting both dstat and tstat in iptunnel_xmit_stats() or
reverting the blamed commit.
With time we should possibly move all the IP-based tunnel (and virtual
devices) to dstats.
Fixes: c77200c07491 ("bareudp: Handle stats using NETDEV_PCPU_STAT_DSTATS.") Fixes: 6fa6de302246 ("geneve: Handle stats using NETDEV_PCPU_STAT_DSTATS.") Fixes: be226352e8dc ("vxlan: Handle stats using NETDEV_PCPU_STAT_DSTATS.") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/2e1c444cf0f63ae472baff29862c4c869be17031.1738432804.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
ethtool: rss: minor fixes for recent RSS changes
Make sure RSS_GET messages are consistent in do and dump.
Fix up a recently added safety check for RSS + queue offset.
Adjust related tests so that they pass on devices which
don't support RSS + queue offset.
====================
Jakub Kicinski [Sat, 1 Feb 2025 01:30:38 +0000 (17:30 -0800)]
ethtool: ntuple: fix rss + ring_cookie check
The info.flow_type is for RXFH commands, ntuple flow_type is inside
the flow spec. The check currently does nothing, as info.flow_type
is 0 (or even uninitialized by user space) for ETHTOOL_SRXCLSRLINS.
Fixes: 9e43ad7a1ede ("net: ethtool: only allow set_rxnfc with rss + ring_cookie if driver opts in") Reviewed-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20250201013040.725123-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 1 Feb 2025 01:30:37 +0000 (17:30 -0800)]
ethtool: rss: fix hiding unsupported fields in dumps
Commit ec6e57beaf8b ("ethtool: rss: don't report key if device
doesn't support it") intended to stop reporting key fields for
additional rss contexts if device has a global hashing key.
Later we added dump support and the filtering wasn't properly
added there. So we end up reporting the key fields in dumps
but not in dos:
The drivers/net/hw/rss_ctx.py selftest catches this when run on
a device with single key, already:
# Check| At /root/./ksft-net-drv/drivers/net/hw/rss_ctx.py, line 381, in test_rss_context_dump:
# Check| ksft_ne(set(data.get('hkey', [1])), {0}, "key is all zero")
# Check failed {0} == {0} key is all zero
not ok 8 rss_ctx.test_rss_context_dump
Fixes: f6122900f4e2 ("ethtool: rss: support dumping RSS contexts") Reviewed-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20250201013040.725123-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>