net: renesas: rswitch: Convert to for_each_available_child_of_node()
Simplify rswitch_get_port_node() by using the
for_each_available_child_of_node() helper instead of manually ignoring
unavailable child nodes (and leaking a reference to them in the process).
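A minimal sketch of the resulting pattern, for illustration only (the function name and the matching-by-"reg" logic are assumptions about the shape of rswitch_get_port_node(), not a copy of it):
#include <linux/of.h>

/* Sketch: return the available child of @parent whose "reg" equals @index.
 * for_each_available_child_of_node() skips children whose status is not
 * "okay" and drops the reference of every node it iterates past, so the
 * only reference left is the one on the returned node (the caller must
 * of_node_put() it).
 */
static struct device_node *find_port_node(struct device_node *parent, u32 index)
{
    struct device_node *port;
    u32 reg;

    for_each_available_child_of_node(parent, port) {
        if (of_property_read_u32(port, "reg", &reg))
            continue;
        if (reg == index)
            return port;
    }
    return NULL;
}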
Jakub Kicinski [Fri, 7 Feb 2025 19:56:12 +0000 (11:56 -0800)]
Merge branch 'net-stmmac-yet-more-eee-updates'
Russell King says:
====================
net: stmmac: yet more EEE updates
Continuing the STMMAC EEE cleanups from last cycle, this series
further cleans up the EEE code and fixes a problem with the existing
implementation - disabling EEE does not immediately stop LPI
signalling; it keeps being signalled until the next packet is
transmitted. It likely also fixes a potential race condition between
disabling LPI and the software timer.
====================
Add a new method to control LPI mode configuration. This is architected
to have three configuration states: LPI disabled, LPI forced (active),
or LPI under hardware timer control. This reflects the three modes
which the main body of the driver wishes to deal with.
We pass in whether transmit clock gating should be used, and the
hardware timer value in microseconds to be set when using hardware
timer control.
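An illustrative sketch of such a three-state interface (the enum and callback names below are assumptions for illustration, not necessarily the identifiers used by the stmmac driver):
/* Three LPI configuration states as described above (illustrative names). */
enum example_lpi_mode {
    EXAMPLE_LPI_DISABLE,   /* no LPI signalling */
    EXAMPLE_LPI_FORCED,    /* LPI forced active under software control */
    EXAMPLE_LPI_TIMER,     /* LPI entry controlled by the hardware timer */
};

/* One method covers all three states; @tx_clk_gating selects transmit
 * clock gating and @et_us is the hardware timer value in microseconds,
 * only meaningful for EXAMPLE_LPI_TIMER.
 */
struct example_lpi_ops {
    int (*set_lpi_mode)(void *hw, enum example_lpi_mode mode,
                        bool tx_clk_gating, u32 et_us);
};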
net: stmmac: use common LPI_CTRL_STATUS bit definitions
The bit definitions for the LPI control/status register are
identical across all MAC versions, with the exception that some
bits may not be implemented. Provide definitions for bits in this
register in common.h, convert to use them, and remove the core-
specific definitions.
net: stmmac: remove unnecessary LPI disable when enabling LPI
Remove the unnecessary LPI disable when enabling LPI - as noted in
previous commits, there will never be two consecutive calls to
stmmac_mac_enable_tx_lpi() without an intervening
stmmac_mac_disable_tx_lpi().
net: stmmac: clear priv->tx_path_in_lpi_mode when disabling LPI
As other code paths do, clear priv->tx_path_in_lpi_mode when disabling
LPI. This is done after the software timer has been deleted and
hardware LPI has been disabled.
Phylink will not call the mac_disable_tx_lpi() and mac_enable_tx_lpi()
methods randomly - the first method to be called will be the enable
method, and thereafter the disable method will be called exactly once between
subsequent enable calls. Thus there is a guaranteed ordering.
Therefore, we know the previous state of priv->eee_enabled, and can
remove it from both methods.
Since priv->eee_active is assigned with a constant value in each of
these methods, there is no need to test its value later. Remove these
unnecessary tests.
net: stmmac: remove priv->dma_cap.eee test in tx_lpi methods
The tests for priv->dma_cap.eee in stmmac_mac_{en,dis}able_tx_lpi()
are useless as these methods will only be called when using phylink
managed EEE, and that will only be enabled if the LPI capabilities
in phylink_config have been populated during initialisation. This
only occurs when priv->dma_cap.eee was true.
As priv->dma_cap.eee remains constant during the lifetime of the driver
instance, there is no need to re-check it in these methods.
LPIATE enables the hardware timer for entering LPI mode. To ensure that
the correct mode is used, clear LPIATE when using manual/software-timed
mode to prevent the hardware using the timer.
stmmac_main.c avoids this being a problem at the moment by calling
stmmac_set_eee_lpi_timer(..., 0) before switching to software mode.
We no longer need to call stmmac_set_eee_lpi_timer(..., 0) when
disabling EEE as stmmac_reset_eee_mode() will now clear all LPI
settings.
net: stmmac: ensure LPI is disabled when disabling EEE
When EEE is disabled, we call stmmac_set_eee_lpi_timer(..., 0).
For dwmac4, this will result in LPIATE being cleared, but LPIEN and
LPITXA being set, causing LPI mode to be signalled (if it wasn't
before).
For other MACs, stmmac_set_eee_lpi_timer() does nothing, which means
that LPI mode will continue to be signalled despite the expectation
for it to be disabled.
In both cases, LPI mode will be terminated when the transmitter has
a packet to send, and LPIEN will be cleared by hardware.
Call stmmac_reset_eee_mode() to ensure that LPI mode is disabled when
EEE mode is requested to be disabled.
Ted Chen [Thu, 6 Feb 2025 14:00:02 +0000 (22:00 +0800)]
vxlan: Remove unnecessary comments for vxlan_rcv() and vxlan_err_lookup()
Remove the two unnecessary comments around vxlan_rcv() and
vxlan_err_lookup(), which indicate that the callers are from
net/ipv{4,6}/udp.c. These callers are trivial to find. Additionally, the
comment for vxlan_rcv() missed that the caller could also be from
net/ipv6/udp.c.
Suggested-by: Nikolay Aleksandrov <razor@blackwall.org> Suggested-by: Ido Schimmel <idosch@idosch.org> Signed-off-by: Ted Chen <znscnchen@gmail.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250206140002.116178-1-znscnchen@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are a lot of net drivers using of_get_child_by_name() followed by
of_device_is_available() to find the available child node by name for a
given parent. Provide a helper for these users to simplify the code.
v1->v2:
* Make it a series as per [1] to cover the dependency.
* Added Rb tag from Rob for patch#1; this patch can be merged through
net as it is the main user.
* Updated all the patches with the net-next patch suffix.
* Dropped _free() usage.
Biju Das [Wed, 5 Feb 2025 12:42:21 +0000 (12:42 +0000)]
of: base: Add of_get_available_child_by_name()
There are a lot of drivers using of_get_child_by_name() followed by
of_device_is_available() to find the available child node by name for a
given parent. Provide a helper for these users to simplify the code.
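A sketch of what such a helper can look like, built from the two calls it replaces (this mirrors the pattern described above; the exact upstream implementation may differ in detail, and the function name here is illustrative):
#include <linux/of.h>

/* Return the child named @name only if it is available (status "okay");
 * otherwise drop the reference and return NULL. The caller is still
 * responsible for of_node_put() on a non-NULL result.
 */
struct device_node *example_get_available_child_by_name(const struct device_node *node,
                                                        const char *name)
{
    struct device_node *child;

    child = of_get_child_by_name(node, name);
    if (child && !of_device_is_available(child)) {
        of_node_put(child);
        child = NULL;
    }

    return child;
}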
Suggested-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Fri, 7 Feb 2025 01:34:36 +0000 (17:34 -0800)]
Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
ice: managing MSI-X in driver
Michal Swiatkowski says:
This is another attempt to allow the user to manage the number of MSI-X
vectors used for each feature in ice. The first attempt was via the
devlink resources API; it wasn't accepted upstream. Static MSI-X
allocation using devlink resources also isn't really user friendly.
This attempt uses a more dynamic approach: "dynamic" across the whole
kernel when the platform supports it, and "dynamic" within the driver
when it doesn't.
To achieve that, reuse the global devlink parameters pf_msix_max and
pf_msix_min. They fit how ice hardware counts MSI-X. In the case of ice,
the amount of MSI-X reported on PCI is the whole MSI-X budget for the
card (including MSI-X for VFs). Having pf_msix_max allows the user to
statically set how many MSI-X vectors they want on the PF and how many
should be reserved for VFs.
pf_msix_min is used to set the minimum number of MSI-X with which the
ice driver should probe correctly.
The meaning of this field in the dynamic vs. static allocation cases (see the sketch after this list):
- on system with dynamic MSI-X allocation support
* alloc pf_msix_min as static, rest will be allocated dynamically
- on system without dynamic MSI-X allocation support
* try alloc pf_msix_max as static, minimum acceptable result is
pf_msix_min
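A rough sketch of that policy in code (all structure, field and function names here are made up for illustration; only pci_msix_can_alloc_dyn() and pci_alloc_irq_vectors() are real PCI core APIs):
#include <linux/pci.h>

struct example_pf {
    struct pci_dev *pdev;
    u16 msix_min;   /* pf_msix_min */
    u16 msix_max;   /* pf_msix_max */
};

static int example_alloc_pf_msix(struct example_pf *pf)
{
    if (pci_msix_can_alloc_dyn(pf->pdev))
        /* Dynamic support: allocate only the static minimum up front,
         * the rest is requested per feature, on demand.
         */
        return pci_alloc_irq_vectors(pf->pdev, pf->msix_min,
                                     pf->msix_min, PCI_IRQ_MSIX);

    /* No dynamic support: try to get up to pf_msix_max statically,
     * accepting anything down to pf_msix_min.
     */
    return pci_alloc_irq_vectors(pf->pdev, pf->msix_min, pf->msix_max,
                                 PCI_IRQ_MSIX);
}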
As Jesse and Piotr suggested, pf_msix_max and pf_msix_min can (and
probably should) be stored in NVM. This patchset isn't implementing
that.
The dynamic (kernel or driver) approach means that splitting MSI-X between
RDMA and eth in case of an MSI-X shortage isn't correct. It can work
when the dynamic part is only on the driver side, but it can't when it
is on the kernel side.
Let's remove this code and move to MSI-X allocation feature by feature.
If there are no more MSI-X vectors for a feature, the feature works with
fewer MSI-X or is turned off.
There is a regression here. With MSI-X splitting the user can run RDMA and
eth even on a system without enough MSI-X. Now only eth will work. RDMA
can be turned on by lowering the number of PF queues and reprobing the
RDMA driver.
Example:
72 CPUs; eth, RDMA and flow director (1 MSI-X), 1 MSI-X for OICR
on the PF, and 1 more for RDMA. The card is using 1 + 72 + 1 + 72 + 1 = 147.
After these changes we have a static base vector for SRIOV (probably SIOV
in the future). The last patch from this series simplifies the VF
MSI-X managing code, based on the static vector.
Changing queues using ethtool now also changes MSI-X. If there is
enough MSI-X it is always one to one. When there is not enough, there
will be more queues than MSI-X.
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
ice: init flow director before RDMA
ice: simplify VF MSI-X managing
ice: enable_rdma devlink param
ice: treat dyn_allowed only as suggestion
ice, irdma: move interrupts code to irdma
ice: get rid of num_lan_msix field
ice: remove splitting MSI-X between features
ice: devlink PF MSI-X max and min parameter
ice: count combined queues using Rx/Tx count
====================
net: pcs: rzn1-miic: fill in PCS supported_interfaces
Populate the PCS supported_interfaces bitmap with the interfaces that
this PCS supports. This makes the manual checking in miic_validate()
redundant, so remove that.
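A sketch of how a PCS driver fills the bitmap at probe time (the RGMII/MII set below is an assumption for illustration; the real driver picks the modes its hardware supports):
#include <linux/phylink.h>

static void example_pcs_fill_interfaces(struct phylink_pcs *pcs)
{
    /* Once these bits are set, phylink itself restricts the PCS to these
     * interface modes, so a manual check in the validate callback becomes
     * redundant.
     */
    __set_bit(PHY_INTERFACE_MODE_MII, pcs->supported_interfaces);
    __set_bit(PHY_INTERFACE_MODE_RGMII, pcs->supported_interfaces);
    __set_bit(PHY_INTERFACE_MODE_RGMII_ID, pcs->supported_interfaces);
    __set_bit(PHY_INTERFACE_MODE_RGMII_RXID, pcs->supported_interfaces);
    __set_bit(PHY_INTERFACE_MODE_RGMII_TXID, pcs->supported_interfaces);
}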
====================
enic: Use Page Pool API for receiving packets
Use the Page Pool API for RX. The Page Pool API improves bandwidth and
CPU overhead by recycling pages instead of allocating new buffers in the
driver. Also, page pool fragment allocation for smaller MTUs is used
to allow multiple packets to share pages.
RX code was moved to its own file and some refactoring was done
beforehand to make the page pool changes more transparent and to simplify
the resulting code.
====================
John Daley [Wed, 5 Feb 2025 23:54:16 +0000 (15:54 -0800)]
enic: remove copybreak tunable
With the move to using the Page Pool API for RX, rx copybreak was not
showing any improvement in host CPU overhead, latency or bandwidth so
the driver no longer makes use of the rx_copybreak setting. This patch
removes the ethtool tuneable hooks to set and get the rx copybreak since
they and now no-ops. Rx copybreak was the only tunable supported, so
remove the set and get tunable callbacks all together.
Co-developed-by: Nelson Escobar <neescoba@cisco.com> Signed-off-by: Nelson Escobar <neescoba@cisco.com> Co-developed-by: Satish Kharat <satishkh@cisco.com> Signed-off-by: Satish Kharat <satishkh@cisco.com> Signed-off-by: John Daley <johndale@cisco.com> Link: https://patch.msgid.link/20250205235416.25410-5-johndale@cisco.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
John Daley [Wed, 5 Feb 2025 23:54:15 +0000 (15:54 -0800)]
enic: Use the Page Pool API for RX
The Page Pool API improves bandwidth and CPU overhead by recycling pages
instead of allocating new buffers in the driver. Make use of page pool
fragment allocation for smaller MTUs so that multiple packets can share
a page. For MTUs larger than PAGE_SIZE, adjust the 'order' page
parameter so that contiguous pages can be used to receive the larger
packets.
The RQ descriptor field 'os_buf' is repurposed to hold page pointers
allocated from page_pool instead of SKBs. When packets arrive, SKBs are
allocated and the page pointers are attached instead of preallocating SKBs.
The 'alloc_fail' netdev statistic is incremented when page_pool_dev_alloc()
fails.
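A minimal sketch of the allocation side of that scheme (structure and function names are illustrative; page_pool_dev_alloc() and page_pool_get_dma_addr() are the generic page pool helpers being relied on):
#include <net/page_pool/helpers.h>

/* Allocate an RX buffer of @buf_len bytes from @pool. For small buffers
 * page_pool_dev_alloc() hands out a fragment so several packets can share
 * one page; for larger buffers it falls back to whole pages. Returns the
 * DMA address to post to the RQ, or 0 on failure (caller bumps alloc_fail).
 */
static dma_addr_t example_alloc_rx_buf(struct page_pool *pool,
                                       unsigned int buf_len,
                                       struct page **pagep,
                                       unsigned int *offset)
{
    struct page *page;

    page = page_pool_dev_alloc(pool, offset, &buf_len);
    if (!page)
        return 0;

    *pagep = page;
    return page_pool_get_dma_addr(page) + *offset;
}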
Co-developed-by: Nelson Escobar <neescoba@cisco.com> Signed-off-by: Nelson Escobar <neescoba@cisco.com> Co-developed-by: Satish Kharat <satishkh@cisco.com> Signed-off-by: Satish Kharat <satishkh@cisco.com> Signed-off-by: John Daley <johndale@cisco.com> Link: https://patch.msgid.link/20250205235416.25410-4-johndale@cisco.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
John Daley [Wed, 5 Feb 2025 23:54:13 +0000 (15:54 -0800)]
enic: Move RX functions to their own file
Move RX handler code into its own file in preparation for further
changes. Some formatting changes were necessary in order to satisfy
checkpatch but there were no functional changes.
Co-developed-by: Nelson Escobar <neescoba@cisco.com> Signed-off-by: Nelson Escobar <neescoba@cisco.com> Co-developed-by: Satish Kharat <satishkh@cisco.com> Signed-off-by: Satish Kharat <satishkh@cisco.com> Signed-off-by: John Daley <johndale@cisco.com> Link: https://patch.msgid.link/20250205235416.25410-2-johndale@cisco.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Joe Damato [Wed, 5 Feb 2025 19:37:47 +0000 (19:37 +0000)]
netdev-genl: Elide napi_id when not present
There are at least two cases where napi_id may not be present and the
napi_id should be elided:
1. Queues could be created, but napi_enable may not have been called
yet. In this case, there may be a NAPI but it may not have an ID and
output of a napi_id should be elided.
2. TX-only NAPIs currently do not have NAPI IDs. If a TX queue happens
to be linked with a TX-only NAPI, elide the NAPI ID from the netlink
output as a NAPI ID of 0 is not useful for users.
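A sketch of the intended check (simplified and with an illustrative function name; MIN_NAPI_ID and NETDEV_A_QUEUE_NAPI_ID are the existing kernel definitions):
#include <linux/netdevice.h>
#include <net/busy_poll.h>
#include <net/netlink.h>
#include <uapi/linux/netdev.h>

/* Emit the NAPI ID attribute only when a NAPI instance is linked to the
 * queue and it already has a valid (non-zero) ID; otherwise elide it.
 */
static int example_put_napi_id(struct sk_buff *rsp, const struct napi_struct *napi)
{
    if (napi && napi->napi_id >= MIN_NAPI_ID)
        return nla_put_u32(rsp, NETDEV_A_QUEUE_NAPI_ID, napi->napi_id);

    return 0;
}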
Signed-off-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250205193751.297211-1-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 7 Feb 2025 00:27:33 +0000 (16:27 -0800)]
Merge branch 'io_uring-zero-copy-rx'
David Wei says:
====================
io_uring zero copy rx
This patchset contains net/ patches needed by a new io_uring request
implementing zero copy rx into userspace pages, eliminating a kernel
to user copy.
We configure a page pool that a driver uses to fill a hw rx queue to
hand out user pages instead of kernel pages. Any data that ends up
hitting this hw rx queue will thus be dma'd into userspace memory
directly, without needing to be bounced through kernel memory. 'Reading'
data out of a socket instead becomes a _notification_ mechanism, where
the kernel tells userspace where the data is. The overall approach is
similar to the devmem TCP proposal.
This relies on hw header/data split, flow steering and RSS to ensure
packet headers remain in kernel memory and only desired flows hit a hw
rx queue configured for zero copy. Configuring this is outside of the
scope of this patchset.
We share netdev core infra with devmem TCP. The main difference is that
io_uring is used for the uAPI and the lifetimes of all objects are bound
to an io_uring instance. Data is 'read' using a new io_uring request
type. When done, data is returned via a new shared refill queue. A zero
copy page pool refills a hw rx queue from this refill queue directly. Of
course, the lifetimes of these data buffers are managed by io_uring
rather than the networking stack, with different refcounting rules.
This patchset is the first step adding basic zero copy support. We will
extend this iteratively with new features e.g. dynamically allocated
zero copy areas, THP support, dmabuf support, improved copy fallback,
general optimisations and more.
In terms of netdev support, we're first targeting Broadcom bnxt. Patches
aren't included since Taehee Yoo has already sent a more comprehensive
patchset adding support in [1]. Google gve should already support this,
and Mellanox mlx5 support is WIP pending driver changes.
===========
Performance
===========
Note: Comparison with epoll + TCP_ZEROCOPY_RECEIVE isn't done yet.
Test setup:
* AMD EPYC 9454
* Broadcom BCM957508 200G
* Kernel v6.11 base [2]
* liburing fork [3]
* kperf fork [4]
* 4K MTU
* Single TCP flow
With application thread + net rx softirq pinned to _different_ cores:
David Wei [Tue, 4 Feb 2025 21:56:21 +0000 (13:56 -0800)]
net: add helpers for setting a memory provider on an rx queue
Add helpers that properly prep or remove a memory provider for an rx
queue then restart the queue.
Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-11-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov [Tue, 4 Feb 2025 21:56:20 +0000 (13:56 -0800)]
net: page_pool: add memory provider helpers
Add helpers for memory providers to interact with page pools.
net_mp_niov_{set,clear}_page_pool() serve to [dis]associate a net_iov
with a page pool. If used, the memory provider is responsible for matching
"set" calls with "clear" once a net_iov is not going to be used by a page
pool anymore, when changing a page pool, etc.
Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-10-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov [Tue, 4 Feb 2025 21:56:19 +0000 (13:56 -0800)]
net: prepare for non devmem TCP memory providers
There is a good bunch of places in generic paths assuming that the only
page pool memory provider is devmem TCP. As we want to reuse the net_iov
and provider infrastructure, we need to patch it up and explicitly check
the provider type when we branch into devmem TCP code.
Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-9-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov [Tue, 4 Feb 2025 21:56:18 +0000 (13:56 -0800)]
net: page_pool: add a mp hook to unregister_netdevice*
Devmem TCP needs a hook in unregister_netdevice_many_notify() to upkeep
the set tracking queues it's bound to, i.e. ->bound_rxqs. Instead of
devmem sticking directly out of the generic path, add a mp function.
Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-8-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov [Tue, 4 Feb 2025 21:56:17 +0000 (13:56 -0800)]
net: page_pool: add callback for mp info printing
Add a mandatory callback that prints information about the memory
provider to netlink.
Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-7-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Wei [Tue, 4 Feb 2025 21:56:16 +0000 (13:56 -0800)]
netdev: add io_uring memory provider info
Add a nested attribute for io_uring memory provider info. For now it is
empty and its presence indicates that a particular page pool or queue
has an io_uring memory provider attached.
Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-6-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov [Tue, 4 Feb 2025 21:56:15 +0000 (13:56 -0800)]
net: page_pool: create hooks for custom memory providers
A spin off from the original page pool memory providers patch by Jakub,
which allows extending page pools with custom allocators. One such
provider is devmem TCP, and the other is io_uring zerocopy, added in
the following patches.
Pavel Begunkov [Tue, 4 Feb 2025 21:56:14 +0000 (13:56 -0800)]
net: generalise net_iov chunk owners
Currently net_iov stores a pointer to struct dmabuf_genpool_chunk_owner,
which serves as a useful abstraction to share data and provide a
context. However, it's too devmem specific, and we want to reuse it for
other memory providers, and for that we need to decouple net_iov from
devmem. Make net_iov point to a new base structure called
net_iov_area, which dmabuf_genpool_chunk_owner extends.
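A sketch of the resulting layering, condensed from the description above (the structure names are prefixed "example_" and the field layout is illustrative):
/* Generic per-area data that net_iov can point to, regardless of which
 * memory provider owns it.
 */
struct example_net_iov_area {
    struct net_iov *niovs;        /* array of net_iovs for this area */
    size_t num_niovs;
    unsigned long base_virtual;   /* offset into the backing memory */
};

/* The devmem TCP owner now simply extends the generic area, so devmem
 * specifics stay out of the shared net_iov definition; devmem code can
 * recover the owner from the area via container_of().
 */
struct example_dmabuf_chunk_owner {
    struct example_net_iov_area area;
    struct net_devmem_dmabuf_binding *binding;  /* devmem-specific part */
};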
Reviewed-by: Mina Almasry <almasrymina@google.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-4-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov [Tue, 4 Feb 2025 21:56:13 +0000 (13:56 -0800)]
net: prefix devmem specific helpers
Add prefixes to all helpers that are specific to devmem TCP, i.e.
net_iov_binding[_id].
Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-3-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov [Tue, 4 Feb 2025 21:56:12 +0000 (13:56 -0800)]
net: page_pool: don't cast mp param to devmem
page_pool_check_memory_provider() is a generic path and shouldn't assume
anything about the actual type of the memory provider argument. It's
fine while devmem is the only provider, but cast away the devmem
specific binding types to avoid confusion.
Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-2-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 5 Feb 2025 17:33:52 +0000 (09:33 -0800)]
tools: ynl: add all headers to makefile deps
The Makefile.deps lists uAPI headers to make the build work when
system headers are older than in-tree headers. The problem doesn't
occur for new headers, because system headers are not there at all.
But the out-of-tree YNL clone on GH also uses this file to identify
header dependencies, and one day the system headers will exist
and will get out of date. So let's add the headers we missed.
I don't think this is a fix, but FWIW the commits which added
the missing headers are:
commit 04e65df94b31 ("netlink: spec: add shaper YAML spec")
commit 49922401c219 ("ethtool: separate definitions that are gonna be generated")
Linus Torvalds [Thu, 6 Feb 2025 17:14:54 +0000 (09:14 -0800)]
Merge tag 'net-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Interestingly the recent kmemleak improvements allowed our CI to catch
a couple of percpu leaks addressed here.
We (mostly Jakub, to be accurate) are working to increase review
coverage over the net code-base by tweaking the MAINTAINERS entries.
Current release - regressions:
- core: harmonize tstats and dstats
- ipv6: fix dst refleaks in rpl, seg6 and ioam6 lwtunnels
- eth: tun: revert fix group permission check
- eth: stmmac: revert "specify hardware capability value when FIFO
size isn't specified"
Previous releases - regressions:
- udp: gso: do not drop small packets when PMTU reduces
- rxrpc: fix race in call state changing vs recvmsg()
- eth: ice: fix Rx data path for heavy 9k MTU traffic
- eth: vmxnet3: fix tx queue race condition with XDP
Previous releases - always broken:
- sched: pfifo_tail_enqueue: drop new packet when sch->limit == 0
- ethtool: ntuple: fix rss + ring_cookie check
- rxrpc: fix the rxrpc_connection attend queue handling
Misc:
- recognize Kuniyuki Iwashima as a maintainer"
* tag 'net-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (34 commits)
Revert "net: stmmac: Specify hardware capability value when FIFO size isn't specified"
MAINTAINERS: add a sample ethtool section entry
MAINTAINERS: add entry for ethtool
rxrpc: Fix race in call state changing vs recvmsg()
rxrpc: Fix call state set to not include the SERVER_SECURING state
net: sched: Fix truncation of offloaded action statistics
tun: revert fix group permission check
selftests/tc-testing: Add a test case for qdisc_tree_reduce_backlog()
netem: Update sch->q.qlen before qdisc_tree_reduce_backlog()
selftests/tc-testing: Add a test case for pfifo_head_drop qdisc when limit==0
pfifo_tail_enqueue: Drop new packet when sch->limit == 0
selftests: mptcp: connect: -f: no reconnect
net: rose: lock the socket in rose_bind()
net: atlantic: fix warning during hot unplug
rxrpc: Fix the rxrpc_connection attend queue handling
net: harmonize tstats and dstats
selftests: drv-net: rss_ctx: don't fail reconfigure test if queue offset not supported
selftests: drv-net: rss_ctx: add missing cleanup in queue reconfigure
ethtool: ntuple: fix rss + ring_cookie check
ethtool: rss: fix hiding unsupported fields in dumps
...
Revert "net: stmmac: Specify hardware capability value when FIFO size isn't specified"
This reverts commit 8865d22656b4, which caused breakage for platforms
which are not using xgmac2 or gmac4. Only these two cores have the
capability of providing the FIFO sizes from hardware capability fields
(which are provided in priv->dma_cap.[tr]x_fifo_size.)
All other cores can not, which results in these two fields containing
zero. We also have platforms that do not provide a value in
priv->plat->[tr]x_fifo_size, resulting in these also being zero.
This causes the new tests introduced by the reverted commit to fail
and produce errors.
An example of such a platform which fails is QEMU's npcm750-evb.
This uses dwmac1000 which, as noted above, does not have the capability
to provide the FIFO sizes from hardware.
Therefore, revert the commit to maintain compatibility with the way
the driver used to work.
Alexander Duyck [Tue, 4 Feb 2025 01:00:38 +0000 (17:00 -0800)]
eth: fbnic: set IFF_UNICAST_FLT to avoid enabling promiscuous mode when adding unicast addrs
I realized when we were adding unicast addresses we were enabling
promiscuous mode. I did a bit of digging and realized we had overlooked
setting the driver private flag to indicate we supported unicast filtering.
Example below shows the table with 00deadbeef01 as the main NIC address,
and 5 additional addresses in the 00deadbeefX0 format.
Before, rule 31 would be active. With this change it correctly sticks
to just the unicast filters.
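The flag in question is the standard netdev private flag for this; the fix boils down to advertising it at netdev setup time, roughly:
/* Advertise unicast filtering support so the core programs extra unicast
 * addresses through the UC address list instead of falling back to
 * promiscuous mode (sketch of the one-line change).
 */
netdev->priv_flags |= IFF_UNICAST_FLT;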
Signed-off-by: Alexander Duyck <alexanderduyck@meta.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250204010038.1404268-2-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Alexander Duyck [Tue, 4 Feb 2025 01:00:37 +0000 (17:00 -0800)]
eth: fbnic: add MAC address TCAM to debugfs
Add read only access to the 32-entry MAC address TCAM via debugfs.
BMC filtering shares the same table so this is quite useful
to access during debug. See next commit for an example output.
Signed-off-by: Alexander Duyck <alexanderduyck@meta.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250204010038.1404268-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Mon, 3 Feb 2025 21:55:09 +0000 (13:55 -0800)]
tools: ynl-gen: don't output external constants
A definition with a "header" property is an "external" definition
for C code, as in it is defined already in another C header file.
Other languages will need the exact value but C codegen should
not recreate it. So don't output those definitions in the uAPI
header.
Jakub Kicinski [Tue, 4 Feb 2025 21:57:50 +0000 (13:57 -0800)]
MAINTAINERS: add a sample ethtool section entry
I feel like we don't do a good enough job keeping authors of driver
APIs around. The ethtool code base was very nicely compartmentalized
by Michal. Establish a precedent of creating MAINTAINERS entries
for "sections" of the ethtool API. Use Andrew and cable test as
a sample entry. The entry should ideally cover 3 elements:
a core file, test(s), and keywords. The last one is important
because we intend the entries to cover core code *and* reviews
of drivers implementing given API!
Jakub Kicinski [Tue, 4 Feb 2025 21:57:29 +0000 (13:57 -0800)]
MAINTAINERS: add entry for ethtool
Michal did an amazing job converting ethtool to Netlink, but never
added an entry to MAINTAINERS for himself. Create a formal entry
so that we can delegate (portions) of this code to folks.
Over the last 3 years the majority of the reviews have been done by
Andrew and me. I suppose Michal didn't want to be on the receiving
end of the flood of patches.
====================
Support one PTP device per hardware clock
This series contains two features from Jianbo, followed by simple
cleanups.
Patches 1-9 by Jianbo add support for one PTP device per hardware clock,
described below [1].
Patches 10-12 by Jianbo add support for 200Gbps per-lane link modes in
kernel and mlx5 driver.
Patches 13-15 are simple cleanups by Gal and Carolina.
[1]
PHC (PTP hardware clock) is normally shared by multiple functions
(PF/VF/SF). mlx5 driver currently creates a separate PTP device for each
network interface that shares one PHC.
PHC can be configured to work in free running mode or real time mode.
In this series, only one PTP device is created for the shared PHC when
it is running in real time mode.
To support this feature,
* Firmware needs to support clock identity. When functions share a
PHC, the clock identities they query are the same.
* Driver dynamically allocates mlx5_clock to represent a PHC.
* New devcom component is added for hardware clock. Functions are
grouped by the identity, and one mlx5_clock is allocated and shared
by the functions with the same identity.
* When PTP device accesses PHC by its callbacks, the first function
in the clock devcom list is selected to send commands to firmware.
* PPS IN event is armed on one function. It should be re-armed on
the other one when the current one is unloaded.
====================
Carolina Jubran [Mon, 3 Feb 2025 21:35:16 +0000 (23:35 +0200)]
net/mlx5e: Avoid WARN_ON when configuring MQPRIO with HTB offload enabled
When attempting to enable MQPRIO while HTB offload is already
configured, the driver currently returns `-EINVAL` and triggers a
`WARN_ON`, leading to an unnecessary call trace.
Update the code to handle this case more gracefully by returning
`-EOPNOTSUPP` instead, while also providing a helpful user message.
Jianbo Liu [Mon, 3 Feb 2025 21:35:13 +0000 (23:35 +0200)]
net/mlx5e: Support FEC settings for 200G per lane link modes
Add support to show and configure FEC via ethtool for 200G/lane link
modes. The RS encoding setting is mapped, and can be overridden to
FEC_RS_544_514_INTERLEAVED_QUAD for these modes.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jianbo Liu [Mon, 3 Feb 2025 21:35:10 +0000 (23:35 +0200)]
net/mlx5: Generate PPS IN event on new function for shared clock
As a specific function (mdev) is chosen to send the MTPPSE command to
firmware, the event is generated only on that function. When that
function is unloaded, the PPS event can't be forwarded to the PTP device,
even when there are other functions in the group and the PTP device is
not destroyed. To resolve this problem, we need to send MTPPSE again from
a new function, and disarm the event on the old function after that.
PPS events are handled by EQ notifier. The async EQs and notifiers are
destroyed in mlx5_eq_table_destroy() which is called before
mlx5_cleanup_clock(). During the period between
mlx5_eq_table_destroy() and mlx5_cleanup_clock(), the events can't be
handled. To avoid event loss, add mlx5_clock_unload() in mlx5_unload()
to arm the event on another available function, and mlx5_clock_load() in
mlx5_load() for symmetry.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jianbo Liu [Mon, 3 Feb 2025 21:35:09 +0000 (23:35 +0200)]
net/mlx5: Support one PTP device per hardware clock
Currently, mlx5 driver exposes a PTP device for each network interface,
resulting in multiple device nodes representing the same underlying
PHC (PTP hardware clock). This causes problems if it is trying to
synchronize to itself. For instance, when ptp4l operates on multiple
interfaces following different masters, phc2sys attempts to
synchronize them in automatic mode.
PHC can be configured to work in free running mode or real time mode.
All functions can access it directly. In this patch, we create one PTP
device for each PHC when it's running in real time mode. All the
functions share the same PTP device if the clock identities they query
are the same, and they are already grouped by devcom in the previous
commit.
The first mdev in the peer list is chosen when sending
MTPPS/MTUTC/MTPPSE/MRTCQ to firmware. Since the function can be
unloaded at any time, we need to use a mutex lock to protect the mdev
pointer used in the PTP and PPS callbacks. Besides, a new one should be
picked from the peer list when the current one is not available.
The clock info, which is used by IB, is shared by all the interfaces
using the same hardware clock.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jianbo Liu [Mon, 3 Feb 2025 21:35:08 +0000 (23:35 +0200)]
net/mlx5: Move PPS notifier and out_work to clock_state
The PPS notifier is currently in mlx5_clock, and mlx5_clock can be
shared in a later patch, so the notifier should be registered for each
device to avoid missing any events. Besides, the out_work is scheduled by
the PPS out event, which is triggered only when the device is in free
running mode. So, both are moved to mlx5_core_dev's clock_state.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jianbo Liu [Mon, 3 Feb 2025 21:35:06 +0000 (23:35 +0200)]
net/mlx5: Change clock in mlx5_core_dev to mlx5_clock pointer
Change the clock member in mlx5_core_dev to a pointer, so it can point to
a clock shared by multiple functions in a later patch.
For now, each function has its own clock, so mdev in mlx5_clock_priv
is the back pointer to the function. Later it points to one (normally
the first one) of the multiple functions sharing the same clock.
Change mlx5_init_clock() to return an error if mlx5_clock is not
allocated. Besides, a null clock is defined and used when the hardware
clock is not supported. So, the clock pointer always points to
something valid.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jianbo Liu [Mon, 3 Feb 2025 21:35:05 +0000 (23:35 +0200)]
net/mlx5: Add API to get mlx5_core_dev from mlx5_clock
The mdev is calculated directly from mlx5_clock, as it's one of the
fields in mlx5_core_dev. Move this to a function so it can be easily
changed in the next patch.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jianbo Liu [Mon, 3 Feb 2025 21:35:04 +0000 (23:35 +0200)]
net/mlx5: Add init and destruction functions for a single HW clock
Move hardware clock initialization and destruction to new functions,
which will be used for the dynamically allocated clock. Such a clock is
shared by all the devices if the queried clock identities are the same.
The out_work is for the PPS out event, which can't be triggered when the
clock is shared, so INIT_WORK is not moved to the initialization function.
Besides, we still need to register the notifier for each device.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jianbo Liu [Mon, 3 Feb 2025 21:35:03 +0000 (23:35 +0200)]
net/mlx5: Change parameters for PTP internal functions
In a later patch, the mlx5_clock will be allocated dynamically. Its
address can be obtained from the mlx5_core_dev struct, but mdev can't be
obtained from mlx5_clock because it can be shared by multiple
interfaces. So change the parameters of such internal functions so that
only mdev is passed down from the callers.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
vxlan: Age FDB entries based on Rx traffic
tl;dr - This patchset prevents VXLAN FDB entries from lingering if
traffic is only forwarded to a silent host.
The VXLAN driver maintains two timestamps for each FDB entry: 'used' and
'updated'. The first is refreshed by both the Rx and Tx paths and the
second is refreshed upon migration.
The driver ages out entries according to their 'used' time which means
that an entry can linger when traffic is only forwarded to a silent host
that might have migrated to a different remote.
This patchset solves the problem by adjusting the above semantics and
aligning them with those of the bridge driver. That is, the 'used' time is
refreshed by the Tx path, the 'updated' time is refreshed by the Rx path
or user space updates, and entries are aged out according to their
'updated' time.
Patches #1-#2 perform small changes in how the 'used' and 'updated'
fields are accessed.
Patches #3-#5 refresh the 'updated' time where needed.
Patch #6 flips the driver to age out FDB entries according to their
'updated' time.
Patch #7 removes unnecessary updates to the 'used' time.
Patch #8 extends a test case to cover aging of FDB entries in the
presence of Tx traffic.
====================
Ido Schimmel [Tue, 4 Feb 2025 14:55:48 +0000 (16:55 +0200)]
vxlan: Avoid unnecessary updates to FDB 'used' time
Now that the VXLAN driver ages out FDB entries based on their 'updated'
time we can remove unnecessary updates of the 'used' time from the Rx
path and the control path, so that the 'used' time is only updated by
the Tx path.
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250204145549.1216254-8-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Tue, 4 Feb 2025 14:55:47 +0000 (16:55 +0200)]
vxlan: Age out FDB entries based on 'updated' time
Currently, the VXLAN driver ages out FDB entries based on their 'used'
time which is refreshed by both the Tx and Rx paths. This means that an
FDB entry will not age out if traffic is only forwarded to the target
host:
# ip link add name vx1 up type vxlan id 10010 local 192.0.2.1 dstport 4789 learning ageing 10
# bridge fdb add 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1
# bridge fdb get 00:11:22:33:44:55 br vx1 self
00:11:22:33:44:55 dev vx1 dst 198.51.100.1 self
# mausezahn vx1 -a own -b 00:11:22:33:44:55 -c 0 -p 100 -q &
# sleep 20
# bridge fdb get 00:11:22:33:44:55 br vx1 self
00:11:22:33:44:55 dev vx1 dst 198.51.100.1 self
This is wrong as an FDB entry will remain present when we no longer have
an indication that the host is still behind the current remote. It is
also inconsistent with the bridge driver:
# ip link add name br1 up type bridge ageing_time $((10 * 100))
# ip link add name swp1 up master br1 type dummy
# bridge fdb add 00:11:22:33:44:55 dev swp1 master dynamic
# bridge fdb get 00:11:22:33:44:55 br br1
00:11:22:33:44:55 dev swp1 master br1
# mausezahn br1 -a own -b 00:11:22:33:44:55 -c 0 -p 100 -q &
# sleep 20
# bridge fdb get 00:11:22:33:44:55 br br1
Error: Fdb entry not found.
Solve this by aging out entries based on their 'updated' time, which is
not refreshed by the Tx path:
# ip link add name vx1 up type vxlan id 10010 local 192.0.2.1 dstport 4789 learning ageing 10
# bridge fdb add 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1
# bridge fdb get 00:11:22:33:44:55 br vx1 self
00:11:22:33:44:55 dev vx1 dst 198.51.100.1 self
# mausezahn vx1 -a own -b 00:11:22:33:44:55 -c 0 -p 100 -q &
# sleep 20
# bridge fdb get 00:11:22:33:44:55 br vx1 self
Error: Fdb entry not found.
But is refreshed by the Rx path:
# ip address add 192.0.2.1/32 dev lo
# ip link add name vx1 up type vxlan id 10010 local 192.0.2.1 dstport 4789 localbypass
# ip link add name vx2 up type vxlan id 20010 local 192.0.2.1 dstport 4789 learning ageing 10
# bridge fdb add 00:11:22:33:44:55 dev vx1 self static dst 127.0.0.1 vni 20010
# mausezahn vx1 -a 00:aa:bb:cc:dd:ee -b 00:11:22:33:44:55 -c 0 -p 100 -q &
# sleep 20
# bridge fdb get 00:aa:bb:cc:dd:ee br vx2 self
00:aa:bb:cc:dd:ee dev vx2 dst 127.0.0.1 self
# pkill mausezahn
# sleep 20
# bridge fdb get 00:aa:bb:cc:dd:ee br vx2 self
Error: Fdb entry not found.
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250204145549.1216254-7-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Tue, 4 Feb 2025 14:55:46 +0000 (16:55 +0200)]
vxlan: Refresh FDB 'updated' time upon user space updates
When a host migrates to a different remote and a packet is received from
the new remote, the corresponding FDB entry is updated and its 'updated'
time is refreshed.
However, when user space replaces the remote of an FDB entry, its
'updated' time is not refreshed:
# ip link add name vx1 up type vxlan id 10010 dstport 4789
# bridge fdb add 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1
# sleep 10
# bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]'
10
# bridge fdb replace 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.2
# bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]'
10
This can lead to the entry being aged out prematurely and it is also
inconsistent with the bridge driver:
# ip link add name br1 up type bridge
# ip link add name swp1 master br1 up type dummy
# ip link add name swp2 master br1 up type dummy
# bridge fdb add 00:11:22:33:44:55 dev swp1 master dynamic vlan 1
# sleep 10
# bridge -s -j fdb get 00:11:22:33:44:55 br br1 vlan 1 | jq '.[]["updated"]'
10
# bridge fdb replace 00:11:22:33:44:55 dev swp2 master dynamic vlan 1
# bridge -s -j fdb get 00:11:22:33:44:55 br br1 vlan 1 | jq '.[]["updated"]'
0
Adjust the VXLAN driver to refresh the 'updated' time of an FDB entry
whenever one of its attributes is changed by user space:
# ip link add name vx1 up type vxlan id 10010 dstport 4789
# bridge fdb add 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1
# sleep 10
# bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]'
10
# bridge fdb replace 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.2
# bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]'
0
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250204145549.1216254-6-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Tue, 4 Feb 2025 14:55:45 +0000 (16:55 +0200)]
vxlan: Refresh FDB 'updated' time upon 'NTF_USE'
The 'NTF_USE' flag can be used by user space to refresh FDB entries so
that they will not age out. Currently, the VXLAN driver implements it by
refreshing the 'used' field in the FDB entry as this is the field
according to which FDB entries are aged out.
Subsequent patches will switch the VXLAN driver to age out entries based
on the 'updated' field. Prepare for this change by refreshing the
'updated' field upon 'NTF_USE'. This is consistent with the bridge
driver's FDB:
# ip link add name br1 up type bridge
# ip link add name swp1 master br1 up type dummy
# bridge fdb add 00:11:22:33:44:55 dev swp1 master dynamic vlan 1
# sleep 10
# bridge fdb replace 00:11:22:33:44:55 dev swp1 master dynamic vlan 1
# bridge -s -j fdb get 00:11:22:33:44:55 br br1 vlan 1 | jq '.[]["updated"]'
10
# sleep 10
# bridge fdb replace 00:11:22:33:44:55 dev swp1 master use dynamic vlan 1
# bridge -s -j fdb get 00:11:22:33:44:55 br br1 vlan 1 | jq '.[]["updated"]'
0
Before:
# ip link add name vx1 up type vxlan id 10010 dstport 4789
# bridge fdb add 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1
# sleep 10
# bridge fdb replace 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1
# bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]'
10
# sleep 10
# bridge fdb replace 00:11:22:33:44:55 dev vx1 self use dynamic dst 198.51.100.1
# bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]'
20
After:
# ip link add name vx1 up type vxlan id 10010 dstport 4789
# bridge fdb add 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1
# sleep 10
# bridge fdb replace 00:11:22:33:44:55 dev vx1 self dynamic dst 198.51.100.1
# bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]'
10
# sleep 10
# bridge fdb replace 00:11:22:33:44:55 dev vx1 self use dynamic dst 198.51.100.1
# bridge -s -j -p fdb get 00:11:22:33:44:55 br vx1 self | jq '.[]["updated"]'
0
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250204145549.1216254-5-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Tue, 4 Feb 2025 14:55:44 +0000 (16:55 +0200)]
vxlan: Always refresh FDB 'updated' time when learning is enabled
Currently, when learning is enabled and a packet is received from the
expected remote, the 'updated' field of the FDB entry is not refreshed.
This will become a problem when we switch the VXLAN driver to age out
entries based on the 'updated' field.
Solve this by always refreshing an FDB entry when we receive a packet
with a matching source MAC address, regardless of whether it was received
via the expected remote or not, as it indicates the host is alive. This is
consistent with the bridge driver's FDB.
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250204145549.1216254-4-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Tue, 4 Feb 2025 14:55:42 +0000 (16:55 +0200)]
vxlan: Annotate FDB data races
The 'used' and 'updated' fields in the FDB entry structure can be
accessed concurrently by multiple threads, leading to reports such as
[1]. Can be reproduced using [2].
Suppress these reports by annotating these accesses using
READ_ONCE() / WRITE_ONCE().
[1]
BUG: KCSAN: data-race in vxlan_xmit / vxlan_xmit
write to 0xffff942604d263a8 of 8 bytes by task 286 on cpu 0:
vxlan_xmit+0xb29/0x2380
dev_hard_start_xmit+0x84/0x2f0
__dev_queue_xmit+0x45a/0x1650
packet_xmit+0x100/0x150
packet_sendmsg+0x2114/0x2ac0
__sys_sendto+0x318/0x330
__x64_sys_sendto+0x76/0x90
x64_sys_call+0x14e8/0x1c00
do_syscall_64+0x9e/0x1a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
read to 0xffff942604d263a8 of 8 bytes by task 287 on cpu 2:
vxlan_xmit+0xadf/0x2380
dev_hard_start_xmit+0x84/0x2f0
__dev_queue_xmit+0x45a/0x1650
packet_xmit+0x100/0x150
packet_sendmsg+0x2114/0x2ac0
__sys_sendto+0x318/0x330
__x64_sys_sendto+0x76/0x90
x64_sys_call+0x14e8/0x1c00
do_syscall_64+0x9e/0x1a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
value changed: 0x00000000fffbac6e -> 0x00000000fffbac6f
Reported by Kernel Concurrency Sanitizer on:
CPU: 2 UID: 0 PID: 287 Comm: mausezahn Not tainted 6.13.0-rc7-01544-gb4b270f11a02 #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
[2]
#!/bin/bash
set +H
echo whitelist > /sys/kernel/debug/kcsan
echo !vxlan_xmit > /sys/kernel/debug/kcsan
ip link add name vx0 up type vxlan id 10010 dstport 4789 local 192.0.2.1
bridge fdb add 00:11:22:33:44:55 dev vx0 self static dst 198.51.100.1
taskset -c 0 mausezahn vx0 -a own -b 00:11:22:33:44:55 -c 0 -q &
taskset -c 2 mausezahn vx0 -a own -b 00:11:22:33:44:55 -c 0 -q &
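A minimal sketch of the annotation pattern being applied (illustrative context; the real driver touches 'used' and 'updated' in several places):
/* Writers store the timestamp with WRITE_ONCE() and lockless readers load
 * it with READ_ONCE(), so concurrent Tx/Rx threads can touch the field
 * without taking the FDB hash lock.
 */
static void example_refresh_fdb_updated(struct vxlan_fdb *f)
{
    unsigned long now = jiffies;

    if (READ_ONCE(f->updated) != now)
        WRITE_ONCE(f->updated, now);
}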
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250204145549.1216254-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ipv4: ip_gre: Fix set but not used warning in ipgre_err() if IPv4-only
If CONFIG_NET_IPGRE is enabled, but CONFIG_IPV6 is disabled:
net/ipv4/ip_gre.c: In function ‘ipgre_err’:
net/ipv4/ip_gre.c:144:22: error: variable ‘data_len’ set but not used [-Werror=unused-but-set-variable]
144 | unsigned int data_len = 0;
| ^~~~~~~~
Fix this by moving all data_len processing inside the IPv6-only section
that uses its result.
Jakub Kicinski [Thu, 6 Feb 2025 02:47:50 +0000 (18:47 -0800)]
Merge branch 'rxrpc-call-state-fixes'
David Howells says:
====================
rxrpc: Call state fixes
Here are some call state fixes for AF_RXRPC.
(1) Fix the state of a call to not treat the challenge-response cycle as
part of an incoming call's state set. The problem is that it makes
handling received of the final packet in the receive phase difficult
as that wants to change the call state - but security negotiations may
not yet be complete.
(2) Fix a race between the changing of the call state at the end of the
request reception phase of a service call, recvmsg() collecting the last
data and sendmsg() trying to send the reply before the I/O thread has
advanced the call state.
David Howells [Tue, 4 Feb 2025 23:05:54 +0000 (23:05 +0000)]
rxrpc: Fix race in call state changing vs recvmsg()
There's a race in between the rxrpc I/O thread recording the end of the
receive phase of a call and recvmsg() examining the state of the call to
determine whether it has completed.
The problem is that call->_state records the I/O thread's view of the call,
not the application's view (which may lag), so that alone is not
sufficient. To this end, the application also checks whether there is
anything left in call->recvmsg_queue for it to pick up. The call must be
in state RXRPC_CALL_COMPLETE and the recvmsg_queue empty for the call to be
considered fully complete.
In rxrpc_input_queue_data(), the latest skbuff is added to the queue and
then, if it was marked as LAST_PACKET, the state is advanced... But this
is two separate operations with no locking around them.
As a consequence, the lack of locking means that sendmsg() can jump into
the gap on a service call and attempt to send the reply - but then get
rejected because the I/O thread hasn't advanced the state yet.
Simply flipping the order in which things are done isn't an option as that
impacts the client side, causing the checks in rxrpc_kernel_check_life() as
to whether the call is still alive to race instead.
Fix this by moving the update of call->_state inside the skb queue
spinlocked section where the packet is queued on the I/O thread side.
rxrpc's recvmsg() will then automatically sync against this because it has
to take the call->recvmsg_queue spinlock in order to dequeue the last
packet.
rxrpc's sendmsg() doesn't need amending as the app shouldn't be calling it
to send a reply until recvmsg() indicates it has returned all of the
request.
Fixes: 93368b6bd58a ("rxrpc: Move call state changes from recvmsg to I/O thread") Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/20250204230558.712536-3-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Tue, 4 Feb 2025 23:05:53 +0000 (23:05 +0000)]
rxrpc: Fix call state set to not include the SERVER_SECURING state
The RXRPC_CALL_SERVER_SECURING state doesn't really belong with the other
states in the call's state set as the other states govern the call's Rx/Tx
phase transition and govern when packets can and can't be received or
transmitted. The "Securing" state doesn't actually govern the reception of
packets and would need to be split depending on whether or not we've
received the last packet yet (to mirror RECV_REQUEST/ACK_REQUEST).
The "Securing" state is more about whether or not we can start forwarding
packets to the application as recvmsg will need to decode them and the
decoding can't take place until the challenge/response exchange has
completed.
Fix this by removing the RXRPC_CALL_SERVER_SECURING state from the state
set and, instead, using a flag, RXRPC_CALL_CONN_CHALLENGING, to track
whether or not we can queue the call for reception by recvmsg() or notify
the kernel app that data is ready. In the event that we've already
received all the packets, the connection event handler will poke the app
layer in the appropriate manner.
Also there's a race whereby the app layer sees the last packet before rxrpc
has managed to end the rx phase and change the state to one amenable to
allowing a reply. Fix this by queuing the packet after calling
rxrpc_end_rx_phase().
Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/20250204230558.712536-2-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Tue, 4 Feb 2025 12:38:39 +0000 (14:38 +0200)]
net: sched: Fix truncation of offloaded action statistics
In case of tc offload, when user space queries the kernel for tc action
statistics, tc will query the offloaded statistics from device drivers.
Among other statistics, drivers are expected to pass the number of
packets that hit the action since the last query as a 64-bit number.
Unfortunately, tc treats the number of packets as a 32-bit number,
leading to truncation and incorrect statistics when the number of
packets since the last query exceeds 0xffffffff:
$ tc -s filter show dev swp2 ingress
filter protocol all pref 1 flower chain 0
filter protocol all pref 1 flower chain 0 handle 0x1
skip_sw
in_hw in_hw_count 1
action order 1: mirred (Egress Redirect to device swp1) stolen
index 1 ref 1 bind 1 installed 58 sec used 0 sec
Action statistics:
Sent 1133877034176 bytes 536959475 pkt (dropped 0, overlimits 0 requeues 0)
[...]
According to the above, 2111-byte packets were redirected which is
impossible as only 64-byte packets were transmitted and the MTU was
1500.
Fix by treating packets as a 64-bit number:
$ tc -s filter show dev swp2 ingress
filter protocol all pref 1 flower chain 0
filter protocol all pref 1 flower chain 0 handle 0x1
skip_sw
in_hw in_hw_count 1
action order 1: mirred (Egress Redirect to device swp1) stolen
index 1 ref 1 bind 1 installed 61 sec used 0 sec
Action statistics:
Sent 1370624380864 bytes 21416005951 pkt (dropped 0, overlimits 0 requeues 0)
[...]
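For illustration, the class of bug being fixed is a plain width problem, independent of tc (the numbers below are made up):
/* A 64-bit packet count from the driver must be kept in a 64-bit variable;
 * storing it in a u32 silently wraps once it exceeds 0xffffffff.
 */
u64 pkts_from_driver = 5000000000ULL;  /* > 0xffffffff */
u32 truncated = pkts_from_driver;      /* wraps to 705032704 */
u64 kept = pkts_from_driver;           /* full value preserved */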
Eric Dumazet [Tue, 4 Feb 2025 14:48:25 +0000 (14:48 +0000)]
net: flush_backlog() small changes
Add READ_ONCE() around reads of skb->dev->reg_state, because
this field can be changed from other threads/cpus.
Instead of calling dev_kfree_skb_irq() and kfree_skb()
while interrupts are masked and locks held,
use a temporary list and use __skb_queue_purge_reason().
Use SKB_DROP_REASON_DEV_READY drop reason to better
describe why these skbs are dropped.
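A sketch of the resulting pattern (illustrative and simplified, not the actual flush_backlog() code):
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static void example_flush(struct sk_buff_head *input_q, spinlock_t *lock)
{
    struct sk_buff *skb, *tmp;
    struct sk_buff_head list;
    unsigned long flags;

    __skb_queue_head_init(&list);

    spin_lock_irqsave(lock, flags);
    skb_queue_walk_safe(input_q, skb, tmp) {
        /* reg_state can change from other threads/cpus, hence READ_ONCE() */
        if (READ_ONCE(skb->dev->reg_state) == NETREG_UNREGISTERING) {
            __skb_unlink(skb, input_q);
            __skb_queue_tail(&list, skb);
        }
    }
    spin_unlock_irqrestore(lock, flags);

    /* Free outside the irqs-off/locked section, with a precise drop reason. */
    __skb_queue_purge_reason(&list, SKB_DROP_REASON_DEV_READY);
}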
Aswin Karuvally [Tue, 4 Feb 2025 10:31:35 +0000 (11:31 +0100)]
s390/net: Remove LCS driver
The original Open Systems Adapter (OSA) was introduced by IBM in the
mid-90s. These were then superseded by OSA-Express in 1999 which used
Queued Direct IO to greatly improve throughput. The newer cards
retained the older, slower non-QDIO (OSE) modes for compatibility with
older systems. In Linux, the lcs driver was responsible for cards
operating in the older OSE mode and the qeth driver was introduced to
allow the OSA-Express cards to operate in the newer QDIO (OSD) mode.
For an S390 machine from 1998 or later, there is no reason to use the
OSE mode and lcs driver as all OSA cards since 1999 provide the faster
OSD mode. As a result, it's been years since we have heard of a
customer configuration involving the lcs driver.
This patch removes the lcs driver. The technology it supports has been
obsolete for the past 25+ years and is irrelevant for current use cases.
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Peter Oberparleiter <oberpar@linux.ibm.com> Signed-off-by: Aswin Karuvally <aswin@linux.ibm.com> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250204103135.1619097-1-wintera@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cxgb4: Avoid a -Wflex-array-member-not-at-end warning
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.
Move the conflicting declaration to the end of the structure. Notice
that `struct ethtool_dump` is a flexible structure, i.e. a structure that
contains a flexible-array member.
Fix the following warning:
./drivers/net/ethernet/chelsio/cxgb4/cxgb4.h:1215:29: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
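As a toy illustration of both the warning and the fix (not the cxgb4
structure itself; `struct dump_hdr` stands in for `struct ethtool_dump`):

/* Toy example, not the cxgb4 code. */
struct dump_hdr {
	unsigned int len;
	unsigned char data[];		/* flexible-array member */
};

struct before {				/* triggers -Wflex-array-member-not-at-end */
	struct dump_hdr hdr;		/* flexible structure not at the end */
	void *buf;
};

struct after {				/* fixed: flexible structure moved to the end */
	void *buf;
	struct dump_hdr hdr;
};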
Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://patch.msgid.link/Z6GBZ4brXYffLkt_@kspp Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cong Wang [Tue, 4 Feb 2025 00:58:41 +0000 (16:58 -0800)]
selftests/tc-testing: Add a test case for qdisc_tree_reduce_backlog()
Integrate the test case provided by Mingi Cho into TDC.
All test results:
1..4
ok 1 ca5e - Check class delete notification for ffff:
ok 2 e4b7 - Check class delete notification for root ffff:
ok 3 33a9 - Check ingress is not searchable on backlog update
ok 4 a4b9 - Test class qlen notification
Cong Wang [Tue, 4 Feb 2025 00:58:40 +0000 (16:58 -0800)]
netem: Update sch->q.qlen before qdisc_tree_reduce_backlog()
qdisc_tree_reduce_backlog() notifies the parent qdisc only if the child
qdisc becomes empty, therefore we need to reduce the backlog of the
child qdisc before calling it. Otherwise it would miss the opportunity
to call cops->qlen_notify(); in the case of DRR, this resulted in a UAF,
since DRR uses ->qlen_notify() to maintain its active list.
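A simplified sketch of the ordering (not the exact netem code; 'sch' is
the qdisc dropping the packet):

	/* Broken order: sch->q.qlen still counts the dropped packet, so
	 * qdisc_tree_reduce_backlog() never sees the qdisc as empty and the
	 * parent's ->qlen_notify() is not called. */
	qdisc_tree_reduce_backlog(sch, 1, pkt_len);
	sch->qstats.backlog -= pkt_len;
	sch->q.qlen--;

	/* Fixed order: account the drop first, then walk up the tree. */
	sch->qstats.backlog -= pkt_len;
	sch->q.qlen--;
	qdisc_tree_reduce_backlog(sch, 1, pkt_len);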
Fixes: f8d4bc455047 ("net/sched: netem: account for backlog updates from child qdisc") Cc: Martin Ottens <martin.ottens@fau.de> Reported-by: Mingi Cho <mincho@theori.io> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Link: https://patch.msgid.link/20250204005841.223511-4-xiyou.wangcong@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Quang Le [Tue, 4 Feb 2025 00:58:39 +0000 (16:58 -0800)]
selftests/tc-testing: Add a test case for pfifo_head_drop qdisc when limit==0
When limit == 0, pfifo_tail_enqueue() must drop the new packet and
increase the qdisc's dropped-packet count.
All test results:
1..16
ok 1 a519 - Add bfifo qdisc with system default parameters on egress
ok 2 585c - Add pfifo qdisc with system default parameters on egress
ok 3 a86e - Add bfifo qdisc with system default parameters on egress with handle of maximum value
ok 4 9ac8 - Add bfifo qdisc on egress with queue size of 3000 bytes
ok 5 f4e6 - Add pfifo qdisc on egress with queue size of 3000 packets
ok 6 b1b1 - Add bfifo qdisc with system default parameters on egress with invalid handle exceeding maximum value
ok 7 8d5e - Add bfifo qdisc on egress with unsupported argument
ok 8 7787 - Add pfifo qdisc on egress with unsupported argument
ok 9 c4b6 - Replace bfifo qdisc on egress with new queue size
ok 10 3df6 - Replace pfifo qdisc on egress with new queue size
ok 11 7a67 - Add bfifo qdisc on egress with queue size in invalid format
ok 12 1298 - Add duplicate bfifo qdisc on egress
ok 13 45a0 - Delete nonexistent bfifo qdisc
ok 14 972b - Add prio qdisc on egress with invalid format for handles
ok 15 4d39 - Delete bfifo qdisc twice
ok 16 d774 - Check pfifo_head_drop qdisc enqueue behaviour when limit == 0
Quang Le [Tue, 4 Feb 2025 00:58:38 +0000 (16:58 -0800)]
pfifo_tail_enqueue: Drop new packet when sch->limit == 0
Expected behaviour:
In case we reach the scheduler's limit, pfifo_tail_enqueue() will drop a
packet from the scheduler's queue and decrease the scheduler's qlen by
one. Then pfifo_tail_enqueue() enqueues the new packet and increases the
scheduler's qlen by one. Finally, pfifo_tail_enqueue() returns the
`NET_XMIT_CN` status code.
Weird behaviour:
In case we set `sch->limit == 0` and trigger pfifo_tail_enqueue() on a
scheduler that has no packets, the 'drop a packet' step does nothing.
This means the scheduler's qlen still equals 0.
Then we continue to enqueue the new packet and increase the scheduler's
qlen by one. In summary, we can leverage pfifo_tail_enqueue() to increase
qlen by one and still get the `NET_XMIT_CN` status code.
The problem is:
Let's say we have two qdiscs: Qdisc_A and Qdisc_B.
- Qdisc_A's type must have a '->graft()' function to create a parent/child relationship.
Let's say Qdisc_A's type is `hfsc`. Enqueueing a packet to this qdisc triggers `hfsc_enqueue`.
- Qdisc_B's type is pfifo_head_drop. Enqueueing a packet to this qdisc triggers `pfifo_tail_enqueue`.
- Qdisc_B is configured to have `sch->limit == 0`.
- Qdisc_A is configured to route the enqueued packet to Qdisc_B.
Enqueueing a packet through Qdisc_A leads to:
- hfsc_enqueue(Qdisc_A) -> pfifo_tail_enqueue(Qdisc_B)
- Qdisc_B->q.qlen += 1
- pfifo_tail_enqueue() returns `NET_XMIT_CN`
- hfsc_enqueue() checks for `NET_XMIT_SUCCESS`, sees `NET_XMIT_CN`, and therefore does not increase the qlen of Qdisc_A.
The whole process leads to a situation where Qdisc_A->q.qlen == 0 and Qdisc_B->q.qlen == 1.
Replacing 'hfsc' with another type (for example 'drr') still leads to the same problem.
This violates the design where a parent's qlen should equal the sum of its children's qlen.
Bug impact: This issue can be used for user->kernel privilege escalation when it is reachable.
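One way to enforce the expected behaviour, sketched here in a simplified
form (the function body below is an illustration, not a verbatim copy of
the upstream fix), is to drop the incoming packet outright when the limit
is zero, before the 'drop one, enqueue one' logic runs:

/* Simplified sketch of a zero-limit guard in pfifo_tail_enqueue(). */
static int pfifo_tail_enqueue_sketch(struct sk_buff *skb, struct Qdisc *sch,
				     struct sk_buff **to_free)
{
	/* With limit == 0 there is nothing to replace: drop the new packet,
	 * count the drop, and keep qlen at 0 instead of returning NET_XMIT_CN. */
	if (unlikely(sch->limit == 0))
		return qdisc_drop(skb, sch, to_free);

	if (likely(sch->q.qlen < sch->limit))
		return qdisc_enqueue_tail(skb, sch);

	/* Queue full: drop one packet from the head, enqueue the new one. */
	__qdisc_queue_drop_head(sch, &sch->q, to_free);
	qdisc_qstats_drop(sch);
	qdisc_enqueue_tail(skb, sch);
	return NET_XMIT_CN;
}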
Fixes: 57dbb2d83d10 ("sched: add head drop fifo queue") Reported-by: Quang Le <quanglex97@gmail.com> Signed-off-by: Quang Le <quanglex97@gmail.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Link: https://patch.msgid.link/20250204005841.223511-2-xiyou.wangcong@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The '-f' parameter is there to force the kernel to emit MPTCP FASTCLOSE
by closing the connection with unread bytes in the receive queue.
The xdisconnect() helper was used to stop the connection, but it does
more than that: it will shut the connection down, then wait before
reconnecting to the same address. This causes mptcp_join's "fastclose
test" to fail every time.
This failure is due to a recent change, commit 218cc166321f
("selftests: mptcp: avoid spurious errors on disconnect"), but it went
unnoticed because the test is currently ignored. The recent modification
only revealed an existing issue: xdisconnect() doesn't need to be used
here, only the shutdown() part is needed.
Fixes: 6bf41020b72b ("selftests: mptcp: update and extend fastclose test-cases") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250204-net-mptcp-sft-conn-f-v1-1-6b470c72fffa@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Petr Machata [Tue, 4 Feb 2025 17:37:15 +0000 (18:37 +0100)]
bridge: mdb: Allow replace of a host-joined group
Attempts to replace an MDB group membership of the host itself are
currently bounced:
# ip link add name br up type bridge vlan_filtering 1
# bridge mdb replace dev br port br grp 239.0.0.1 vid 2
# bridge mdb replace dev br port br grp 239.0.0.1 vid 2
Error: bridge: Group is already joined by host.
A similar operation done on a member port would succeed. Ignore the check
for replacement of host group memberships as well.
The bit of code that this enables is br_multicast_host_join(), which, for
already-joined groups, only refreshes the MC group expiration timer (which
is desirable) and emits a userspace notification (also desirable).
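A rough, hypothetical sketch of the relaxed check (variable names such as
nlmsg_flags are assumptions; this is not the actual br_mdb.c code):

	/* Hypothetical sketch, not the actual bridge code. */
	if (mp->host_joined && !(nlmsg_flags & NLM_F_REPLACE)) {
		NL_SET_ERR_MSG_MOD(extack, "Group is already joined by host");
		return -EEXIST;
	}
	/* Otherwise fall through to br_multicast_host_join(), which for an
	 * already-joined group only rearms the group expiration timer and
	 * generates a userspace notification. */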
Change a selftest that exercises this code path from expecting a rejection
to expecting a pass. The rest of MDB selftests pass without modification.