]> git.ipfire.org Git - thirdparty/linux.git/log
thirdparty/linux.git
3 weeks agoice: add support for reading and unpacking Rx queue context
Jacob Keller [Wed, 18 Jun 2025 22:24:36 +0000 (15:24 -0700)] 
ice: add support for reading and unpacking Rx queue context

In order to support live migration, the ice driver will need to read
certain data from the Rx queue context. This is stored in the hardware in a
packed format.

Since we use <linux/packing.h> for the mapping between the packed hardware
format and the unpacked structure, it is trivial to enable unpacking
support via the unpack_fields() function.

Add the ice_unpack_rxq_ctx() function based on the unpack_fields() API.
Re-use the same field definitions from the packing implementation.

Add ice_copy_rxq_ctx_from_hw() to copy the Rx queue context data from the
hardware registers.

Use these to implement ice_read_rxq_ctx() which will return the Rx queue
context to the caller in its unpacked ice_rlan_ctx struct.

This will enable the migration logic access to the relevant data about the
Rx device queues. It can easily be copied to the target system as part of
the migration payload, where it will be used to configure the Rx queues.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
3 weeks agoMerge branch 'net-dsa-rzn1_a5psw-add-compile_test'
Paolo Abeni [Thu, 10 Jul 2025 13:41:03 +0000 (15:41 +0200)] 
Merge branch 'net-dsa-rzn1_a5psw-add-compile_test'

Rosen Penev says:

====================
net: dsa: rzn1_a5psw: add COMPILE_TEST

Allows the various bots to test compilation. Also threw in a small devm
conversion for enabling clocks.
====================

Link: https://patch.msgid.link/20250708014144.2514-1-rosenp@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agonet: dsa: rzn1_a5psw: use devm to enable clocks
Rosen Penev [Tue, 8 Jul 2025 01:41:44 +0000 (18:41 -0700)] 
net: dsa: rzn1_a5psw: use devm to enable clocks

The remove function has these in the wrong order. The switch should be
unregistered last. Simpler to use devm so that the right thing is done.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
Link: https://patch.msgid.link/20250708014144.2514-3-rosenp@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agonet: dsa: rzn1_a5psw: add COMPILE_TEST
Rosen Penev [Tue, 8 Jul 2025 01:41:43 +0000 (18:41 -0700)] 
net: dsa: rzn1_a5psw: add COMPILE_TEST

There's no architecture specific requirement for it to compile. Allows
the bots to test compilation properly.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250708014144.2514-2-rosenp@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agonet: xsk: introduce XDP_MAX_TX_SKB_BUDGET setsockopt
Jason Xing [Fri, 4 Jul 2025 16:01:38 +0000 (00:01 +0800)] 
net: xsk: introduce XDP_MAX_TX_SKB_BUDGET setsockopt

This patch provides a setsockopt method to let applications leverage to
adjust how many descs to be handled at most in one send syscall. It
mitigates the situation where the default value (32) that is too small
leads to higher frequency of triggering send syscall.

Considering the prosperity/complexity the applications have, there is no
absolutely ideal suggestion fitting all cases. So keep 32 as its default
value like before.

The patch does the following things:
- Add XDP_MAX_TX_SKB_BUDGET socket option.
- Set max_tx_budget to 32 by default in the initialization phase as a
  per-socket granular control.
- Set the range of max_tx_budget as [32, xs->tx->nentries].

The idea behind this comes out of real workloads in production. We use a
user-level stack with xsk support to accelerate sending packets and
minimize triggering syscalls. When the packets are aggregated, it's not
hard to hit the upper bound (namely, 32). The moment user-space stack
fetches the -EAGAIN error number passed from sendto(), it will loop to try
again until all the expected descs from tx ring are sent out to the driver.
Enlarging the XDP_MAX_TX_SKB_BUDGET value contributes to less frequency of
sendto() and higher throughput/PPS.

Here is what I did in production, along with some numbers as follows:
For one application I saw lately, I suggested using 128 as max_tx_budget
because I saw two limitations without changing any default configuration:
1) XDP_MAX_TX_SKB_BUDGET, 2) socket sndbuf which is 212992 decided by
net.core.wmem_default. As to XDP_MAX_TX_SKB_BUDGET, the scenario behind
this was I counted how many descs are transmitted to the driver at one
time of sendto() based on [1] patch and then I calculated the
possibility of hitting the upper bound. Finally I chose 128 as a
suitable value because 1) it covers most of the cases, 2) a higher
number would not bring evident results. After twisting the parameters,
a stable improvement of around 4% for both PPS and throughput and less
resources consumption were found to be observed by strace -c -p xxx:
1) %time was decreased by 7.8%
2) error counter was decreased from 18367 to 572

[1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@gmail.com/

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20250704160138.48677-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 weeks agoMerge branch 'net-mlx5-misc-changes-2025-07-09'
Jakub Kicinski [Thu, 10 Jul 2025 02:47:47 +0000 (19:47 -0700)] 
Merge branch 'net-mlx5-misc-changes-2025-07-09'

Tariq Toukan says:

====================
net/mlx5: misc changes 2025-07-09

This series contains misc enhancements to the mlx5 driver.
====================

Link: https://patch.msgid.link/1752009387-13300-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet/mlx5e: RX, Remove unnecessary RQT redirects
Tariq Toukan [Tue, 8 Jul 2025 21:16:27 +0000 (00:16 +0300)] 
net/mlx5e: RX, Remove unnecessary RQT redirects

RQTs (Receive Queue Table) should redirect traffic to the channels' RQs
when they're active.  Otherwise, redirect to the designated "drop RQ".

RQTs are created in "inactive" state, pointing to the "drop RQ".
In activate and de-activate flows, do not "deactivate" the rest of RQTs
(beyond the num of channels), as they are already inactive.

This cuts down unnecessary execution of FW commands (MODIFY_RQT), and
improves the latency of open/close channels or configuration change.

Perf:
NIC: Connect-X7.
Configuration: 1 combined channel, max num channels 248.
Measure time for "interface up + interface down".

Before: 0.313 sec
After:  0.057 sec (5.5x faster)

247 MODIFY_RQT commands saved in interface up.
247 MODIFY_RQT commands saved in interface down.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1752009387-13300-6-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet/mlx5: Warn when write combining is not supported
Maor Gottlieb [Tue, 8 Jul 2025 21:16:26 +0000 (00:16 +0300)] 
net/mlx5: Warn when write combining is not supported

Warn if write combining is not supported, as it can impact latency.
Add the warning message to be printed only when the driver actually
run the test and detect unsupported state, rather than when
inheriting parent's result for SFs.

Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1752009387-13300-5-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet/mlx5e: Replace recursive VLAN push handling with an iterative loop
Gal Pressman [Tue, 8 Jul 2025 21:16:25 +0000 (00:16 +0300)] 
net/mlx5e: Replace recursive VLAN push handling with an iterative loop

mlx5e_tc_act_vlan_add_push_action() uses tail-recursion to walk through
a stack of VLAN devices.

There is no need for a complicated recursion with unnecessary stack
consumption and less obvious code flow, rewrite the function so that it
uses a do while loop instead.

Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1752009387-13300-4-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet/mlx5e: CT: extract a memcmp from a spinlock section
Cosmin Ratiu [Tue, 8 Jul 2025 21:16:24 +0000 (00:16 +0300)] 
net/mlx5e: CT: extract a memcmp from a spinlock section

This reduces the time the lock is held and reduces contention.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1752009387-13300-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet/mlx5e: Remove unused VLAN insertion logic in TX path
Carolina Jubran [Tue, 8 Jul 2025 21:16:23 +0000 (00:16 +0300)] 
net/mlx5e: Remove unused VLAN insertion logic in TX path

The VLAN insertion capability (`wqe_vlan_insert`) was never enabled on
all mlx5 devices. When VLAN TX offload is advertised but this
capability is not supported, the driver uses inline headers to insert
the VLAN tag.

To support this, the driver used to set the
`MLX5E_SQ_STATE_VLAN_NEED_L2_INLINE` bit to enforce L2 inline mode
when `wqe_vlan_insert` was not supported. Since the capability is
disabled on all devices, this logic was always active, and the SQ flag
has become redundant. L2 inline is enforced unconditionally for
VLAN-tagged packets.

The `skb_vlan_tag_present()` check in the else-if block of
`mlx5e_sq_xmit_wqe()` is never true by this point in the TX flow,
as the VLAN tag has already been inserted by the driver using inline
headers. As a result, this code is never executed.

Remove the redundant SQ state, dead VLAN insertion code block, and
related logic.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1752009387-13300-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agovsock/test: fix test for null ptr deref when transport changes
Stefano Garzarella [Tue, 8 Jul 2025 11:17:01 +0000 (13:17 +0200)] 
vsock/test: fix test for null ptr deref when transport changes

In test_stream_transport_change_client(), the client sends CONTROL_CONTINUE
on each iteration, even when connect() is unsuccessful. This causes a flood
of control messages in the server that hangs around for more than 10
seconds after the test finishes, triggering several timeouts and causing
subsequent tests to fail. This was discovered in testing a newly proposed
test that failed in this way on the client side:
    ...
    33 - SOCK_STREAM transport change null-ptr-deref...ok
    34 - SOCK_STREAM ioctl(SIOCINQ) functionality...recv timed out

The CONTROL_CONTINUE message is used only to tell to the server to call
accept() to consume successful connections, so that subsequent connect()
will not fail for finding the queue full.

Send CONTROL_CONTINUE message only when the connect() has succeeded, or
found the queue full. Note that the second connect() can also succeed if
the first one was interrupted after sending the request.

Fixes: 3a764d93385c ("vsock/test: Add test for null ptr deref when transport changes")
Cc: leonardi@redhat.com
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Link: https://patch.msgid.link/20250708111701.129585-1-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoMerge branch 'net-phy-bcm54811-phy-initialization'
Jakub Kicinski [Thu, 10 Jul 2025 02:32:32 +0000 (19:32 -0700)] 
Merge branch 'net-phy-bcm54811-phy-initialization'

 says:

====================
net: phy: bcm54811: PHY initialization

Proper bcm54811 PHY driver initialization for MII-Lite.

The bcm54811 PHY in MLP package must be setup for MII-Lite interface
mode by software. Normally, the PHY to MAC interface is selected in
hardware by setting the bootstrap pins of the PHY. However, MII and
MII-Lite share the same hardware setup and must be distinguished by
software, setting appropriate bit in a configuration register.
The MII-Lite interface mode is non-standard one, defined by Broadcom
for some of their PHYs. The MII-Lite lightness consist in omitting
RXER, TXER, CRS and COL signals of the standard MII interface.
Absence of COL them makes half-duplex links modes impossible but
does not interfere with Broadcom's BroadR-Reach link modes, because
they are full-duplex only.
To do it in a clean way, MII-Lite must be introduced first, including
its limitation to link modes (no half-duplex), because it is a
prerequisite for the patch #3 of this series. The patch #4 does not
depend on MII-Lite directly but both #3 and #4 are necessary for
bcm54811 to work properly without additional configuration steps to be
done - for example in the bootloader, before the kernel starts.

PATCH 1 - Add MII-Lite PHY interface mode as defined by Broadcom for
   their two-wire PHYs. It can be used with most Ethernet controllers
   under certain limitations (no half-duplex link modes etc.).

PATCH 2 - Add MII-Lite PHY interface type

PATCH 3 - Activation of MII-Lite interface mode on Broadcom bcm5481x
   PHYs

PATCH 4 - Initialize the BCM54811 PHY properly so that it conforms
   to the datasheet regarding a reserved bit in the LRE Control
   register, which must be written to zero after every device reset.
   Ignore the LDS capability bit in LRE Status register on bcm54811.
====================

Link: https://patch.msgid.link/20250708090140.61355-1-kamilh@axis.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: phy: bcm54811: PHY initialization
Kamil Horák - 2N [Tue, 8 Jul 2025 09:01:40 +0000 (11:01 +0200)] 
net: phy: bcm54811: PHY initialization

Reset the bit 12 in PHY's LRE Control register upon initialization.
According to the datasheet, this bit must be written to zero after
every device reset.

Signed-off-by: Kamil Horák - 2N <kamilh@axis.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250708090140.61355-5-kamilh@axis.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: phy: bcm5481x: MII-Lite activation
Kamil Horák - 2N [Tue, 8 Jul 2025 09:01:39 +0000 (11:01 +0200)] 
net: phy: bcm5481x: MII-Lite activation

Broadcom PHYs featuring the BroadR-Reach two-wire link mode are usually
capable to operate in simplified MII mode, without TXER, RXER, CRS and
COL signals as defined for the MII. The absence of COL signal makes
half-duplex link modes impossible, however, the BroadR-Reach modes are
all full-duplex only.
Depending on the IC encapsulation, there exist MII-Lite-only PHYs such
as bcm54811 in MLP. The PHY itself is hardware-strapped to select among
multiple RGMII and MII-Lite modes, but the MII-Lite mode must be also
activated by software.

Add MII-Lite activation for bcm5481x PHYs.

Signed-off-by: Kamil Horák - 2N <kamilh@axis.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20250708090140.61355-4-kamilh@axis.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agodt-bindings: ethernet-phy: add MII-Lite phy interface type
Kamil Horák - 2N [Tue, 8 Jul 2025 09:01:38 +0000 (11:01 +0200)] 
dt-bindings: ethernet-phy: add MII-Lite phy interface type

Some Broadcom PHYs are capable to operate in simplified MII mode,
without TXER, RXER, CRS and COL signals as defined for the MII.
The MII-Lite mode can be used on most Ethernet controllers with full
MII interface by just leaving the input signals (RXER, CRS, COL)
inactive. The absence of COL signal makes half-duplex link modes
impossible but does not interfere with BroadR-Reach link modes on
Broadcom PHYs, because they are all full-duplex only.

Add new interface type "mii-lite" to phy-connection-type enum.

Signed-off-by: Kamil Horák - 2N <kamilh@axis.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20250708090140.61355-3-kamilh@axis.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: phy: MII-Lite PHY interface mode
Kamil Horák - 2N [Tue, 8 Jul 2025 09:01:37 +0000 (11:01 +0200)] 
net: phy: MII-Lite PHY interface mode

Some Broadcom PHYs are capable to operate in simplified MII mode,
without TXER, RXER, CRS and COL signals as defined for the MII.
The MII-Lite mode can be used on most Ethernet controllers with full
MII interface by just leaving the input signals (RXER, CRS, COL)
inactive. The absence of COL signal makes half-duplex link modes
impossible but does not interfere with BroadR-Reach link modes on
Broadcom PHYs, because they are all full-duplex only.

Add MII-Lite interface mode, especially for Broadcom two-wire PHYs.

Signed-off-by: Kamil Horák - 2N <kamilh@axis.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20250708090140.61355-2-kamilh@axis.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: usb: enable the work after stop usbnet by ip down/up
Zqiang [Tue, 8 Jul 2025 08:16:53 +0000 (16:16 +0800)] 
net: usb: enable the work after stop usbnet by ip down/up

Oleksij reported that:
The smsc95xx driver fails after one down/up cycle, like this:
 $ nmcli device set enu1u1 managed no
 $ p a a 10.10.10.1/24 dev enu1u1
 $ ping -c 4 10.10.10.3
 $ ip l s dev enu1u1 down
 $ ip l s dev enu1u1 up
 $ ping -c 4 10.10.10.3
The second ping does not reach the host. Networking also fails on other interfaces.

Enable the work by replacing the disable_work_sync() with cancel_work_sync().

[Jun Miao: completely write the commit changelog]

Fixes: 2c04d279e857 ("net: usb: Convert tasklet API to new bottom half workqueue mechanism")
Reported-by: Oleksij Rempel <o.rempel@pengutronix.de>
Tested-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Jun Miao <jun.miao@intel.com>
Link: https://patch.msgid.link/20250708081653.307815-1-jun.miao@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoMerge branch 'vsock-introduce-siocinq-ioctl-support'
Jakub Kicinski [Thu, 10 Jul 2025 02:30:00 +0000 (19:30 -0700)] 
Merge branch 'vsock-introduce-siocinq-ioctl-support'

Xuewei Niu says:

====================
vsock: Introduce SIOCINQ ioctl support

Introduce SIOCINQ ioctl support for vsock, indicating the length of unread
bytes.

Similar with SIOCOUTQ ioctl, the information is transport-dependent.

The first patch adds SIOCINQ ioctl support in AF_VSOCK.

Thanks to @dexuan, the second patch is to fix the issue where hyper-v
`hvs_stream_has_data()` doesn't return the readable bytes.

The third patch wraps the ioctl into `ioctl_int()`, which implements a
retry mechanism to prevent immediate failure.

The last one adds two test cases to check the functionality. The changes
have been tested, and the results are as expected.
====================

Link: https://patch.msgid.link/20250708-siocinq-v6-0-3775f9a9e359@antgroup.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agotest/vsock: Add ioctl SIOCINQ tests
Xuewei Niu [Tue, 8 Jul 2025 06:36:14 +0000 (14:36 +0800)] 
test/vsock: Add ioctl SIOCINQ tests

Add SIOCINQ ioctl tests for both SOCK_STREAM and SOCK_SEQPACKET.

The client waits for the server to send data, and checks if the SIOCINQ
ioctl value matches the data size. After consuming the data, the client
checks if the SIOCINQ value is 0.

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Tested-by: Luigi Leonardi <leonardi@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Link: https://patch.msgid.link/20250708-siocinq-v6-4-3775f9a9e359@antgroup.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agotest/vsock: Add retry mechanism to ioctl wrapper
Xuewei Niu [Tue, 8 Jul 2025 06:36:13 +0000 (14:36 +0800)] 
test/vsock: Add retry mechanism to ioctl wrapper

Wrap the ioctl in `ioctl_int()`, which takes a pointer to the actual
int value and an expected int value. The function will not return until
either the ioctl returns the expected value or a timeout occurs, thus
avoiding immediate failure.

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Link: https://patch.msgid.link/20250708-siocinq-v6-3-3775f9a9e359@antgroup.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agovsock: Add support for SIOCINQ ioctl
Xuewei Niu [Tue, 8 Jul 2025 06:36:12 +0000 (14:36 +0800)] 
vsock: Add support for SIOCINQ ioctl

Add support for SIOCINQ ioctl, indicating the length of bytes unread in the
socket. The value is obtained from `vsock_stream_has_data()`.

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Link: https://patch.msgid.link/20250708-siocinq-v6-2-3775f9a9e359@antgroup.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agohv_sock: Return the readable bytes in hvs_stream_has_data()
Dexuan Cui [Tue, 8 Jul 2025 06:36:11 +0000 (14:36 +0800)] 
hv_sock: Return the readable bytes in hvs_stream_has_data()

When hv_sock was originally added, __vsock_stream_recvmsg() and
vsock_stream_has_data() actually only needed to know whether there
is any readable data or not, so hvs_stream_has_data() was written to
return 1 or 0 for simplicity.

However, now hvs_stream_has_data() should return the readable bytes
because vsock_data_ready() -> vsock_stream_has_data() needs to know the
actual bytes rather than a boolean value of 1 or 0.

The SIOCINQ ioctl support also needs hvs_stream_has_data() to return
the readable bytes.

Let hvs_stream_has_data() return the readable bytes of the payload in
the next host-to-guest VMBus hv_sock packet.

Note: there may be multiple incoming hv_sock packets pending in the
VMBus channel's ringbuffer, but so far there is not a VMBus API that
allows us to know all the readable bytes in total without reading and
caching the payload of the multiple packets, so let's just return the
readable bytes of the next single packet. In the future, we'll either
add a VMBus API that allows us to know the total readable bytes without
touching the data in the ringbuffer, or the hv_sock driver needs to
understand the VMBus packet format and parse the packets directly.

Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Link: https://patch.msgid.link/20250708-siocinq-v6-1-3775f9a9e359@antgroup.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoDocumentation: xsk: correct the obsolete references and examples
Jason Xing [Tue, 8 Jul 2025 06:29:07 +0000 (14:29 +0800)] 
Documentation: xsk: correct the obsolete references and examples

The modified lines are mainly related to the following commits[1][2]
which remove those tests and examples. Since samples/bpf has been
deprecated, we can refer to more examples that are easily searched
in the various xdp-projects, like the following link:
https://github.com/xdp-project/bpf-examples/tree/main/AF_XDP-example

[1]
commit f36600634282 ("libbpf: move xsk.{c,h} into selftests/bpf")
[2]
commit cfb5a2dbf141 ("bpf, samples: Remove AF_XDP samples")

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250708062907.11557-1-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoskbuff: Add MSG_MORE flag to optimize tcp large packet transmission
Feng Yang [Tue, 8 Jul 2025 05:40:53 +0000 (13:40 +0800)] 
skbuff: Add MSG_MORE flag to optimize tcp large packet transmission

When using sockmap for forwarding, the average latency for different packet sizes
after sending 10,000 packets is as follows:
size    old(us)         new(us)
512     56              55
1472    58              58
1600    106             81
3000    145             105
5000    182             125

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250708054053.39551-1-yangfeng59949@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoMerge branch 'converge-on-using-secs_to_jiffies-part-two'
Jakub Kicinski [Thu, 10 Jul 2025 02:25:03 +0000 (19:25 -0700)] 
Merge branch 'converge-on-using-secs_to_jiffies-part-two'

Easwar Hariharan says:

====================
Converge on using secs_to_jiffies() part two

This is the second series (part 1*) that converts users of msecs_to_jiffies() that
either use the multiply pattern of either of:
- msecs_to_jiffies(N*1000) or
- msecs_to_jiffies(N*MSEC_PER_SEC)

where N is a constant or an expression, to avoid the multiplication.

The conversion is made with Coccinelle with the secs_to_jiffies() script
in scripts/coccinelle/misc. Attention is paid to what the best change
can be rather than restricting to what the tool provides.

v1: https://lore.kernel.org/20250219-netdev-secs-to-jiffies-part-2-v1-0-c484cc63611b@linux.microsoft.com
====================

Link: https://patch.msgid.link/20250707-netdev-secs-to-jiffies-part-2-v2-0-b7817036342f@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: ipconfig: convert timeouts to secs_to_jiffies()
Easwar Hariharan [Mon, 7 Jul 2025 22:03:33 +0000 (15:03 -0700)] 
net: ipconfig: convert timeouts to secs_to_jiffies()

Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced
secs_to_jiffies().  As the value here is a multiple of 1000, use
secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication.

This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with
the following Coccinelle rules:

@depends on patch@
expression E;
@@

-msecs_to_jiffies(E * 1000)
+secs_to_jiffies(E)

-msecs_to_jiffies(E * MSEC_PER_SEC)
+secs_to_jiffies(E)

While here, manually convert a couple timeouts denominated in seconds

Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Link: https://patch.msgid.link/20250707-netdev-secs-to-jiffies-part-2-v2-2-b7817036342f@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet/smc: convert timeouts to secs_to_jiffies()
Easwar Hariharan [Mon, 7 Jul 2025 22:03:32 +0000 (15:03 -0700)] 
net/smc: convert timeouts to secs_to_jiffies()

Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced
secs_to_jiffies().  As the value here is a multiple of 1000, use
secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication.

This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with
the following Coccinelle rules:

@depends on patch@
expression E;
@@

-msecs_to_jiffies(E * 1000)
+secs_to_jiffies(E)

-msecs_to_jiffies(E * MSEC_PER_SEC)
+secs_to_jiffies(E)

Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Link: https://patch.msgid.link/20250707-netdev-secs-to-jiffies-part-2-v2-1-b7817036342f@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agogve: make IRQ handlers and page allocation NUMA aware
Bailey Forrest [Mon, 7 Jul 2025 21:01:07 +0000 (14:01 -0700)] 
gve: make IRQ handlers and page allocation NUMA aware

All memory in GVE is currently allocated without regard for the NUMA
node of the device. Because access to NUMA-local memory access is
significantly cheaper than access to a remote node, this change attempts
to ensure that page frags used in the RX path, including page pool
frags, are allocated on the NUMA node local to the gVNIC device. Note
that this attempt is best-effort. If necessary, the driver will still
allocate non-local memory, as __GFP_THISNODE is not passed. Descriptor
ring allocations are not updated, as dma_alloc_coherent handles that.

This change also modifies the IRQ affinity setting to only select CPUs
from the node local to the device, preserving the behavior that TX and
RX queues of the same index share CPU affinity.

Signed-off-by: Bailey Forrest <bcf@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Harshitha Ramamurthy <hramamurthy@google.com>
Signed-off-by: Jeroen de Borst <jeroendb@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250707210107.2742029-1-jeroendb@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoMerge branch 'add-microchip-zl3073x-support-part-1'
Jakub Kicinski [Thu, 10 Jul 2025 02:08:56 +0000 (19:08 -0700)] 
Merge branch 'add-microchip-zl3073x-support-part-1'

Ivan Vecera says:

====================
Add Microchip ZL3073x support (part 1)

Add support for Microchip Azurite DPLL/PTP/SyncE chip family that
provides DPLL and PTP functionality. This series bring first part
that adds the core functionality and basic DPLL support.

The next part of the series will bring additional DPLL functionality
like eSync support, phase offset and frequency offset reporting and
phase adjustments.

Testing was done by myself and by Prathosh Satish on Microchip EDS2
development board with ZL30732 DPLL chip connected over I2C bus.
====================

Link: https://patch.msgid.link/20250704182202.1641943-1-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agodpll: zl3073x: Add support to get/set frequency on pins
Ivan Vecera [Fri, 4 Jul 2025 18:22:02 +0000 (20:22 +0200)] 
dpll: zl3073x: Add support to get/set frequency on pins

Add support to get/set frequency on pins. The frequency for input
pins (references) is computed in the device according this formula:

 freq = base_freq * multiplier * (nominator / denominator)

where the base_freq comes from the list of supported base frequencies
and other parameters are arbitrary numbers. All these parameters are
16-bit unsigned integers.

The frequency for output pin is determined by the frequency of
synthesizer the output pin is connected to and divisor of the output
to which is the given pin belongs. The resulting frequency of the
P-pin and the N-pin from this output pair depends on the signal
format of this output pair.

The device supports so-called N-divided signal formats where for the
N-pin there is an additional divisor. The frequencies for both pins
from such output pair are computed:

 P-pin-freq = synth_freq / output_div
 N-pin-freq = synth_freq / output_div / n_div

For other signal-format types both P and N pin have the same frequency
based only synth frequency and output divisor.

Implement output pin callbacks to get and set frequency. The frequency
setting for the output non-N-divided signal format is simple as we have
to compute just new output divisor. For N-divided formats it is more
complex because by changing of output divisor we change frequency for
both P and N pins. In this case if we are changing frequency for P-pin
we have to compute also new N-divisor for N-pin to keep its current
frequency. From this and the above it follows that the frequency of
the N-pin cannot be higher than the frequency of the P-pin and the
callback must take this limitation into account.

Co-developed-by: Prathosh Satish <Prathosh.Satish@microchip.com>
Signed-off-by: Prathosh Satish <Prathosh.Satish@microchip.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250704182202.1641943-13-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agodpll: zl3073x: Implement input pin state setting in automatic mode
Ivan Vecera [Fri, 4 Jul 2025 18:22:01 +0000 (20:22 +0200)] 
dpll: zl3073x: Implement input pin state setting in automatic mode

Implement input pin state setting when the DPLL is running in automatic
mode. Unlike manual mode, the DPLL mode switching is not used here and
the implementation uses special priority value (15) to make the given
pin non-selectable.

When the user sets state of the pin as disconnected the driver
internally sets its priority in HW to 15 that prevents the DPLL to
choose this input pin. Conversely, if the pin status is set to
selectable, the driver sets the pin priority in HW to the original saved
value.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250704182202.1641943-12-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agodpll: zl3073x: Add support to get/set priority on input pins
Ivan Vecera [Fri, 4 Jul 2025 18:22:00 +0000 (20:22 +0200)] 
dpll: zl3073x: Add support to get/set priority on input pins

Add support for getting and setting input pin priority. Implement
required callbacks and set appropriate capability for input pins.
Although the pin priority make sense only if the DPLL is running in
automatic mode we have to expose this capability unconditionally because
input pins (references) are shared between all DPLLs where one of them
can run in automatic mode while the other one not.

Co-developed-by: Prathosh Satish <Prathosh.Satish@microchip.com>
Signed-off-by: Prathosh Satish <Prathosh.Satish@microchip.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250704182202.1641943-11-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agodpll: zl3073x: Implement input pin selection in manual mode
Ivan Vecera [Fri, 4 Jul 2025 18:21:59 +0000 (20:21 +0200)] 
dpll: zl3073x: Implement input pin selection in manual mode

Implement input pin state setting if the DPLL is running in manual mode.
The driver indicates manual mode if the DPLL mode is one of ref-lock,
forced-holdover, freerun.

Use these modes to implement input pin state change between connected
and disconnected states. When the user set the particular pin as
connected the driver marks this input pin as forced reference and
switches the DPLL mode to ref-lock. When the use set the pin as
disconnected the driver switches the DPLL to freerun or forced holdover
mode. The switch to holdover mode is done if the DPLL has holdover
capability (e.g is currently locked with holdover acquired).

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250704182202.1641943-10-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agodpll: zl3073x: Register DPLL devices and pins
Ivan Vecera [Fri, 4 Jul 2025 18:21:58 +0000 (20:21 +0200)] 
dpll: zl3073x: Register DPLL devices and pins

Enumerate all available DPLL channels and registers a DPLL device for
each of them. Check all input references and outputs and register
DPLL pins for them.

Number of registered DPLL pins depends on configuration of references
and outputs. If the reference or output is configured as differential
one then only one DPLL pin is registered. Both references and outputs
can be also disabled from firmware configuration and in this case
no DPLL pins are registered.

All registrable references are registered to all available DPLL devices
with exception of DPLLs that are configured in NCO (numerically
controlled oscillator) mode. In this mode DPLL channel acts as PHC and
cannot be locked to any reference.

Device outputs are connected to one of synthesizers and each synthesizer
is driven by some DPLL channel. So output pins belonging to given output
are registered to DPLL device that drives associated synthesizer.

Finally add kworker task to monitor async changes on all DPLL channels
and input pins and to notify about them DPLL core. Output pins are not
monitored as their parameters are not changed asynchronously by the
device.

Co-developed-by: Prathosh Satish <Prathosh.Satish@microchip.com>
Signed-off-by: Prathosh Satish <Prathosh.Satish@microchip.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250704182202.1641943-9-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agodpll: zl3073x: Read DPLL types and pin properties from system firmware
Ivan Vecera [Fri, 4 Jul 2025 18:21:57 +0000 (20:21 +0200)] 
dpll: zl3073x: Read DPLL types and pin properties from system firmware

Add support for reading of DPLL types and optional pin properties from
the system firmware (DT, ACPI...).

The DPLL types are stored in property 'dpll-types' as string array and
possible values 'pps' and 'eec' are mapped to DPLL enums DPLL_TYPE_PPS
and DPLL_TYPE_EEC.

The pin properties are stored under 'input-pins' and 'output-pins'
sub-nodes and the following ones are supported:

* reg
    integer that specifies pin index
* label
    string that is used by driver as board label
* connection-type
    string that indicates pin connection type
* supported-frequencies-hz
    array of u64 values what frequencies are supported / allowed for
    given pin with respect to hardware wiring

Do not blindly trust system firmware and filter out frequencies that
cannot be configured/represented in device (input frequencies have to
be factorized by one of the base frequencies and output frequencies have
to divide configured synthesizer frequency).

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250704182202.1641943-8-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agodpll: zl3073x: Fetch invariants during probe
Ivan Vecera [Fri, 4 Jul 2025 18:21:56 +0000 (20:21 +0200)] 
dpll: zl3073x: Fetch invariants during probe

Several configuration parameters will remain constant at runtime,
so we can load them during probe to avoid excessive reads from
the hardware.

Read the following parameters from the device during probe and store
them for later use:

* enablement status and frequencies of the synthesizers and their
  associated DPLL channels
* enablement status and type (single-ended or differential) of input pins
* associated synthesizers, signal format, and enablement status of
  outputs

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250704182202.1641943-7-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agodpll: Add basic Microchip ZL3073x support
Ivan Vecera [Fri, 4 Jul 2025 18:21:55 +0000 (20:21 +0200)] 
dpll: Add basic Microchip ZL3073x support

Microchip Azurite ZL3073x represents chip family providing DPLL
and optionally PHC (PTP) functionality. The chips can be connected
be connected over I2C or SPI bus.

They have the following characteristics:
* up to 5 separate DPLL units (channels)
* 5 synthesizers
* 10 input pins (references)
* 10 outputs
* 20 output pins (output pin pair shares one output)
* Each reference and output can operate in either differential or
  single-ended mode (differential mode uses 2 pins)
* Each output is connected to one of the synthesizers
* Each synthesizer is driven by one of the DPLL unit

The device uses 7-bit addresses and 8-bits values. It exposes 8-, 16-,
32- and 48-bits registers in address range <0x000,0x77F>. Due to 7bit
addressing, the range is organized into pages of 128 bytes, with each
page containing a page selector register at address 0x7F.
For reading/writing multi-byte registers, the device supports bulk
transfers.

Add basic functionality to access device registers, probe functionality
both I2C and SPI cases and add devlink support to provide info and
to set clock ID parameter.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250704182202.1641943-6-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agodevlink: Add new "clock_id" generic device param
Ivan Vecera [Fri, 4 Jul 2025 18:21:54 +0000 (20:21 +0200)] 
devlink: Add new "clock_id" generic device param

Add a new device generic parameter to specify clock ID that should
be used by the device for registering DPLL devices and pins.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250704182202.1641943-5-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agodevlink: Add support for u64 parameters
Ivan Vecera [Fri, 4 Jul 2025 18:21:53 +0000 (20:21 +0200)] 
devlink: Add support for u64 parameters

Only 8, 16 and 32-bit integers are supported for numeric devlink
parameters. The subsequent patch adds support for DPLL clock ID
that is defined as 64-bit number. Add support for u64 parameter
type.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250704182202.1641943-4-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agodt-bindings: dpll: Add support for Microchip Azurite chip family
Ivan Vecera [Fri, 4 Jul 2025 18:21:52 +0000 (20:21 +0200)] 
dt-bindings: dpll: Add support for Microchip Azurite chip family

Add DT bindings for Microchip Azurite DPLL chip family. These chips
provide up to 5 independent DPLL channels, 10 differential or
single-ended inputs and 10 differential or 20 single-ended outputs.
They can be connected via I2C or SPI busses.

Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250704182202.1641943-3-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agodt-bindings: dpll: Add DPLL device and pin
Ivan Vecera [Fri, 4 Jul 2025 18:21:51 +0000 (20:21 +0200)] 
dt-bindings: dpll: Add DPLL device and pin

Add a common DT schema for DPLL device and its associated pins.
The DPLL (device phase-locked loop) is a device used for precise clock
synchronization in networking and telecom hardware.

The device includes one or more DPLLs (channels) and one or more
physical input/output pins.

Each DPLL channel is used either to provide a pulse-per-clock signal or
to drive an Ethernet equipment clock.

The input and output pins have the following properties:
* label: specifies board label
* connection type: specifies its usage depending on wiring
* list of supported or allowed frequencies: depending on how the pin
  is connected and where)
* embedded sync capability: indicates whether the pin supports this

Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250704182202.1641943-2-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agovirtio-net: xsk: rx: move the xdp->data adjustment to buf_to_xdp()
Bui Quang Minh [Sat, 5 Jul 2025 07:55:14 +0000 (14:55 +0700)] 
virtio-net: xsk: rx: move the xdp->data adjustment to buf_to_xdp()

This commit does not do any functional changes. It moves xdp->data
adjustment for buffer other than first buffer to buf_to_xdp() helper so
that the xdp_buff adjustment does not scatter over different functions.

Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://patch.msgid.link/20250705075515.34260-1-minhquangbui99@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoMerge branch 'add-vf-drivers-for-wangxun-virtual-functions'
Jakub Kicinski [Thu, 10 Jul 2025 01:39:15 +0000 (18:39 -0700)] 
Merge branch 'add-vf-drivers-for-wangxun-virtual-functions'

Mengyuan Lou says:

====================
Add vf drivers for wangxun virtual functions

Introduces basic support for Wangxun’s virtual function (VF) network
drivers, specifically txgbevf and ngbevf. These drivers provide SR-IOV
VF functionality for Wangxun 10/25/40G network devices.
The first three patches add common APIs for Wangxun VF drivers, including
mailbox communication and shared initialization logic.These abstractions
are placed in libwx to reduce duplication across VF drivers.
Patches 4–8 introduce the txgbevf driver, including:
PCI device initialization, Hardware reset, Interrupt setup, Rx/Tx datapath
implementation and link status changeing flow.
Patches 9–12 implement the ngbevf driver, mirroring the functionality
added in txgbevf.

v2: https://lore.kernel.org/20250625102058.19898-1-mengyuanlou@net-swift.com
v1: https://lore.kernel.org/20250611083559.14175-1-mengyuanlou@net-swift.com
====================

Link: https://patch.msgid.link/20250704094923.652-1-mengyuanlou@net-swift.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: ngbevf: add link update flow
Mengyuan Lou [Fri, 4 Jul 2025 09:49:23 +0000 (17:49 +0800)] 
net: ngbevf: add link update flow

Add link update flow to wangxun 1G virtual functions.
Get link status from pf in mbox, and if it is failed then
check the vx_status, because vx_status switching is too slow.

Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
Link: https://patch.msgid.link/20250704094923.652-13-mengyuanlou@net-swift.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: ngbevf: init interrupts and request irqs
Mengyuan Lou [Fri, 4 Jul 2025 09:49:22 +0000 (17:49 +0800)] 
net: ngbevf: init interrupts and request irqs

Add specific parameters for irq alloc, then use
wx_init_interrupt_scheme to initialize interrupt
allocation in probe.
Add .ndo_start_xmit support and start all queues.

Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
Link: https://patch.msgid.link/20250704094923.652-12-mengyuanlou@net-swift.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: ngbevf: add sw init pci info and reset hardware
Mengyuan Lou [Fri, 4 Jul 2025 09:49:21 +0000 (17:49 +0800)] 
net: ngbevf: add sw init pci info and reset hardware

Do sw init and reset hw for ngbevf virtual functions, then
register netdev.

Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
Link: https://patch.msgid.link/20250704094923.652-11-mengyuanlou@net-swift.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: wangxun: add ngbevf build
Mengyuan Lou [Fri, 4 Jul 2025 09:49:20 +0000 (17:49 +0800)] 
net: wangxun: add ngbevf build

Add doc build infrastructure for ngbevf driver.
Implement the basic PCI driver loading and unloading interface.
Initialize the id_table which support 1G virtual
functions for Wangxun.

Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
Link: https://patch.msgid.link/20250704094923.652-10-mengyuanlou@net-swift.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: txgbevf: add link update flow
Mengyuan Lou [Fri, 4 Jul 2025 09:49:19 +0000 (17:49 +0800)] 
net: txgbevf: add link update flow

Add link update flow to wangxun 10/25/40G virtual functions.
Get link status from pf in mbox, and if it is failed then
check the vx_status, because vx_status switching is too slow.

Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
Link: https://patch.msgid.link/20250704094923.652-9-mengyuanlou@net-swift.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: txgbevf: Support Rx and Tx process path
Mengyuan Lou [Fri, 4 Jul 2025 09:49:18 +0000 (17:49 +0800)] 
net: txgbevf: Support Rx and Tx process path

Improve the configuration of Rx and Tx ring.
Setup and alloc resources.
Configure Rx and Tx unit on hardware.
Add .ndo_start_xmit support and start all queues.

Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
Link: https://patch.msgid.link/20250704094923.652-8-mengyuanlou@net-swift.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: txgbevf: init interrupts and request irqs
Mengyuan Lou [Fri, 4 Jul 2025 09:49:17 +0000 (17:49 +0800)] 
net: txgbevf: init interrupts and request irqs

Add irq alloc flow functions for vf.
Alloc pcie msix irqs for drivers and request_irq for tx/rx rings
and misc other events.
If the application is successful, config vertors for interrupts.
Enable interrupts mask in wxvf_irq_enable.

Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
Link: https://patch.msgid.link/20250704094923.652-7-mengyuanlou@net-swift.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: txgbevf: add sw init pci info and reset hardware
Mengyuan Lou [Fri, 4 Jul 2025 09:49:16 +0000 (17:49 +0800)] 
net: txgbevf: add sw init pci info and reset hardware

Add sw init and reset hw for txgbevf virtual functions
which initialize basic parameters, and then register netdev.

Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
Link: https://patch.msgid.link/20250704094923.652-6-mengyuanlou@net-swift.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: wangxun: add txgbevf build
Mengyuan Lou [Fri, 4 Jul 2025 09:49:15 +0000 (17:49 +0800)] 
net: wangxun: add txgbevf build

Add doc build infrastructure for txgbevf driver.
Implement the basic PCI driver loading and unloading interface.
Initialize the id_table which support 10/25/40G virtual
functions for Wangxun.
Ioremap the space of bar0 and bar4 which will be used.

Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
Link: https://patch.msgid.link/20250704094923.652-5-mengyuanlou@net-swift.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: libwx: add wangxun vf common api
Mengyuan Lou [Fri, 4 Jul 2025 09:49:14 +0000 (17:49 +0800)] 
net: libwx: add wangxun vf common api

Add common wx_configure_vf and wx_set_mac_vf for
ngbevf and txgbevf.

Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
Link: https://patch.msgid.link/20250704094923.652-4-mengyuanlou@net-swift.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: libwx: add base vf api for vf drivers
Mengyuan Lou [Fri, 4 Jul 2025 09:49:13 +0000 (17:49 +0800)] 
net: libwx: add base vf api for vf drivers

Implement mbox_write_and_read_ack functions which are
used to set basic functions like set_mac, get_link.etc
for vf.

Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
Link: https://patch.msgid.link/20250704094923.652-3-mengyuanlou@net-swift.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: libwx: add mailbox api for wangxun vf drivers
Mengyuan Lou [Fri, 4 Jul 2025 09:49:12 +0000 (17:49 +0800)] 
net: libwx: add mailbox api for wangxun vf drivers

Implements the mailbox interfaces for Wangxun vf drivers which
will be used in txgbevf and ngbevf.

Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
Link: https://patch.msgid.link/20250704094923.652-2-mengyuanlou@net-swift.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agodt-bindings: net: Add support for Sophgo CV1800 dwmac
Inochi Amaoto [Thu, 3 Jul 2025 02:12:19 +0000 (10:12 +0800)] 
dt-bindings: net: Add support for Sophgo CV1800 dwmac

The GMAC IP on CV1800 series SoC is a standard Synopsys
DesignWare MAC (version 3.70a).

Add necessary compatible string for this device.

Signed-off-by: Inochi Amaoto <inochiama@gmail.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://patch.msgid.link/20250703021220.124195-1-inochiama@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoudp: remove udp_tunnel_gro_init()
Eric Dumazet [Mon, 7 Jul 2025 09:16:34 +0000 (09:16 +0000)] 
udp: remove udp_tunnel_gro_init()

Use DEFINE_MUTEX() to initialize udp_tunnel_gro_type_lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250707091634.311974-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agodt-bindings: net: altr,socfpga-stmmac.yaml: add minItems to iommus
Matthew Gerlach [Mon, 7 Jul 2025 15:44:09 +0000 (08:44 -0700)] 
dt-bindings: net: altr,socfpga-stmmac.yaml: add minItems to iommus

Add missing 'minItems: 1' to iommus property of the Altera SOCFPGA SoC
implementation of the Synopsys DWMAC.

Fixes: 6d359cf464f4 ("dt-bindings: net: Convert socfpga-dwmac bindings to yaml")
Signed-off-by: Matthew Gerlach <matthew.gerlach@altera.com>
Reviewed-by: Yanteng Si <siyanteng@cqsoftware.com.cn>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20250707154409.15527-1-matthew.gerlach@altera.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: dt-bindings: ixp4xx-ethernet: Support fixed links
Linus Walleij [Fri, 4 Jul 2025 12:54:26 +0000 (14:54 +0200)] 
net: dt-bindings: ixp4xx-ethernet: Support fixed links

This ethernet controller is using fixed links for DSA switches
in two already existing device trees, so make sure the checker
does not complain like this:

intel-ixp42x-linksys-wrv54g.dtb: ethernet@c8009000 (intel,ixp4xx-ethernet):
'fixed-link' does not match any of the regexes: '^pinctrl-[0-9]+$'
from schema $id: http://devicetree.org/schemas/net/intel,ixp4xx-ethernet.yaml#

intel-ixp42x-usrobotics-usr8200.dtb: ethernet@c800a000 (intel,ixp4xx-ethernet):
'fixed-link' does not match any of the regexes: '^pinctrl-[0-9]+$'
from schema $id: http://devicetree.org/schemas/net/intel,ixp4xx-ethernet.yaml#

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202507040609.K9KytWBA-lkp@intel.com/
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20250704-ixp4xx-ethernet-binding-fix-v1-1-8ac360d5bc9b@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoMerge branch 'ipv6-drop-rtnl-from-mcast-c-and-anycast-c'
Jakub Kicinski [Wed, 9 Jul 2025 01:32:41 +0000 (18:32 -0700)] 
Merge branch 'ipv6-drop-rtnl-from-mcast-c-and-anycast-c'

Kuniyuki Iwashima says:

====================
ipv6: Drop RTNL from mcast.c and anycast.c

This is a prep series for RCU conversion of RTM_NEWNEIGH, which needs
RTNL during neigh_table.{pconstructor,pdestructor}() touching IPv6
multicast code.

Currently, IPv6 multicast code is protected by lock_sock() and
inet6_dev->mc_lock, and RTNL is not actually needed.

In addition, anycast code is also in the same situation and does not
need RTNL at all.

This series removes RTNL from net/ipv6/{mcast.c,anycast.c} and finally
removes setsockopt_needs_rtnl() from do_ipv6_setsockopt().

v2: https://lore.kernel.org/20250624202616.526600-1-kuni1840@gmail.com
v1: https://lore.kernel.org/20250616233417.1153427-1-kuni1840@gmail.com
====================

Link: https://patch.msgid.link/20250702230210.3115355-1-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoipv6: Remove setsockopt_needs_rtnl().
Kuniyuki Iwashima [Wed, 2 Jul 2025 23:01:32 +0000 (16:01 -0700)] 
ipv6: Remove setsockopt_needs_rtnl().

We no longer need to hold RTNL for IPv6 socket options.

Let's remove setsockopt_needs_rtnl().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-16-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoipv6: anycast: Don't hold RTNL for IPV6_JOIN_ANYCAST.
Kuniyuki Iwashima [Wed, 2 Jul 2025 23:01:31 +0000 (16:01 -0700)] 
ipv6: anycast: Don't hold RTNL for IPV6_JOIN_ANYCAST.

inet6_sk(sk)->ipv6_ac_list is protected by lock_sock().

In ipv6_sock_ac_join(), only __dev_get_by_index(), __dev_get_by_flags(),
and __in6_dev_get() require RTNL.

__dev_get_by_flags() is only used by ipv6_sock_ac_join() and can be
converted to RCU version.

Let's replace RCU version helper and drop RTNL from IPV6_JOIN_ANYCAST.

setsockopt_needs_rtnl() will be removed in the next patch.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-15-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoipv6: anycast: Unify two error paths in ipv6_sock_ac_join().
Kuniyuki Iwashima [Wed, 2 Jul 2025 23:01:30 +0000 (16:01 -0700)] 
ipv6: anycast: Unify two error paths in ipv6_sock_ac_join().

The next patch will replace __dev_get_by_index() and __dev_get_by_flags()
to RCU + refcount version.

Then, we will need to call dev_put() in some error paths.

Let's unify two error paths to make the next patch cleaner.

Note that we add READ_ONCE() for net->ipv6.devconf_all->forwarding
and idev->conf.forwarding as we will drop RTNL that protects them.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-14-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoipv6: anycast: Don't hold RTNL for IPV6_LEAVE_ANYCAST and IPV6_ADDRFORM.
Kuniyuki Iwashima [Wed, 2 Jul 2025 23:01:29 +0000 (16:01 -0700)] 
ipv6: anycast: Don't hold RTNL for IPV6_LEAVE_ANYCAST and IPV6_ADDRFORM.

inet6_sk(sk)->ipv6_ac_list is protected by lock_sock().

In ipv6_sock_ac_drop() and ipv6_sock_ac_close(),
only __dev_get_by_index() and __in6_dev_get() requrie RTNL.

Let's replace them with dev_get_by_index() and in6_dev_get()
and drop RTNL from IPV6_LEAVE_ANYCAST and IPV6_ADDRFORM.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-13-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoipv6: anycast: Don't use rtnl_dereference().
Kuniyuki Iwashima [Wed, 2 Jul 2025 23:01:28 +0000 (16:01 -0700)] 
ipv6: anycast: Don't use rtnl_dereference().

inet6_dev->ac_list is protected by inet6_dev->lock, so rtnl_dereference()
is a bit rough annotation.

As done in mcast.c, we can use ac_dereference() that checks if
inet6_dev->lock is held.

Let's replace rtnl_dereference() with a new helper ac_dereference().

Note that now addrconf_join_solict() / addrconf_leave_solict() in
__ipv6_dev_ac_inc() / __ipv6_dev_ac_dec() does not need RTNL, so we
can remove ASSERT_RTNL() there.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-12-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoipv6: mcast: Remove unnecessary ASSERT_RTNL and comment.
Kuniyuki Iwashima [Wed, 2 Jul 2025 23:01:27 +0000 (16:01 -0700)] 
ipv6: mcast: Remove unnecessary ASSERT_RTNL and comment.

Now, RTNL is not needed for mcast code, and what's commented in
ip6_mc_msfget() is apparent by for_each_pmc_socklock(), which has
lockdep annotation for lock_sock().

Let's remove the comment and ASSERT_RTNL() in ipv6_mc_rejoin_groups().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-11-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoipv6: mcast: Don't hold RTNL for MCAST_ socket options.
Kuniyuki Iwashima [Wed, 2 Jul 2025 23:01:26 +0000 (16:01 -0700)] 
ipv6: mcast: Don't hold RTNL for MCAST_ socket options.

In ip6_mc_source() and ip6_mc_msfilter(), per-socket mld data is
protected by lock_sock() and inet6_dev->mc_lock is also held for
some per-interface functions.

ip6_mc_find_dev_rtnl() only depends on RTNL.  If we want to remove
it, we need to check inet6_dev->dead under mc_lock to close the race
with addrconf_ifdown(), as mentioned earlier.

Let's do that and drop RTNL for the rest of MCAST_ socket options.

Note that ip6_mc_msfilter() has unnecessary lock dances and they
are integrated into one to avoid the last-minute error and simplify
the error handling.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-10-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoipv6: mcast: Don't hold RTNL in ipv6_sock_mc_close().
Kuniyuki Iwashima [Wed, 2 Jul 2025 23:01:25 +0000 (16:01 -0700)] 
ipv6: mcast: Don't hold RTNL in ipv6_sock_mc_close().

In __ipv6_sock_mc_close(), per-socket mld data is protected by lock_sock(),
and only __dev_get_by_index() and __in6_dev_get() require RTNL.

Let's call __ipv6_sock_mc_drop() and drop RTNL in ipv6_sock_mc_close().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-9-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoipv6: mcast: Don't hold RTNL for IPV6_DROP_MEMBERSHIP and MCAST_LEAVE_GROUP.
Kuniyuki Iwashima [Wed, 2 Jul 2025 23:01:24 +0000 (16:01 -0700)] 
ipv6: mcast: Don't hold RTNL for IPV6_DROP_MEMBERSHIP and MCAST_LEAVE_GROUP.

In __ipv6_sock_mc_drop(), per-socket mld data is protected by lock_sock(),
and only __dev_get_by_index() and __in6_dev_get() require RTNL.

Let's use dev_get_by_index() and in6_dev_get() and drop RTNL for
IPV6_ADD_MEMBERSHIP and MCAST_JOIN_GROUP.

Note that __ipv6_sock_mc_drop() is factorised to reuse in the next patch.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-8-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoipv6: mcast: Don't hold RTNL for IPV6_ADD_MEMBERSHIP and MCAST_JOIN_GROUP.
Kuniyuki Iwashima [Wed, 2 Jul 2025 23:01:23 +0000 (16:01 -0700)] 
ipv6: mcast: Don't hold RTNL for IPV6_ADD_MEMBERSHIP and MCAST_JOIN_GROUP.

In __ipv6_sock_mc_join(), per-socket mld data is protected by lock_sock(),
and only __dev_get_by_index() requires RTNL.

Let's use dev_get_by_index() and drop RTNL for IPV6_ADD_MEMBERSHIP and
MCAST_JOIN_GROUP.

Note that we must call rt6_lookup() and dev_hold() under RCU.

If rt6_lookup() returns an entry from the exception table, dst_dev_put()
could change rt->dev.dst to loopback concurrently, and the original device
could lose the refcount before dev_hold() and unblock device registration.

dst_dev_put() is called from NETDEV_UNREGISTER and synchronize_net() follows
it, so as long as rt6_lookup() and dev_hold() are called within the same
RCU critical section, the dev is alive.

Even if the race happens, they are synchronised by idev->dead and mcast
addresses are cleaned up.

For the racy access to rt->dst.dev, we use dst_dev().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-7-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoipv6: mcast: Use in6_dev_get() in ipv6_dev_mc_dec().
Kuniyuki Iwashima [Wed, 2 Jul 2025 23:01:22 +0000 (16:01 -0700)] 
ipv6: mcast: Use in6_dev_get() in ipv6_dev_mc_dec().

As well as __ipv6_dev_mc_inc(), all code in __ipv6_dev_mc_dec() are
protected by inet6_dev->mc_lock, and RTNL is not needed.

Let's use in6_dev_get() in ipv6_dev_mc_dec() and remove ASSERT_RTNL()
in __ipv6_dev_mc_dec().

Now, we can remove the RTNL comment above addrconf_leave_solict() too.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-6-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoipv6: mcast: Remove mca_get().
Kuniyuki Iwashima [Wed, 2 Jul 2025 23:01:21 +0000 (16:01 -0700)] 
ipv6: mcast: Remove mca_get().

Since commit 63ed8de4be81 ("mld: add mc_lock for protecting per-interface
mld data"), the newly allocated struct ifmcaddr6 cannot be removed until
inet6_dev->mc_lock is released, so mca_get() and mc_put() are unnecessary.

Let's remove the extra refcounting.

Note that mca_get() was only used in __ipv6_dev_mc_inc().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-5-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoipv6: mcast: Check inet6_dev->dead under idev->mc_lock in __ipv6_dev_mc_inc().
Kuniyuki Iwashima [Wed, 2 Jul 2025 23:01:20 +0000 (16:01 -0700)] 
ipv6: mcast: Check inet6_dev->dead under idev->mc_lock in __ipv6_dev_mc_inc().

Since commit 63ed8de4be81 ("mld: add mc_lock for protecting
per-interface mld data"), every multicast resource is protected
by inet6_dev->mc_lock.

RTNL is unnecessary in terms of protection but still needed for
synchronisation between addrconf_ifdown() and __ipv6_dev_mc_inc().

Once we removed RTNL, there would be a race below, where we could
add a multicast address to a dead inet6_dev.

  CPU1                            CPU2
  ====                            ====
  addrconf_ifdown()               __ipv6_dev_mc_inc()
                                    if (idev->dead) <-- false
    dead = true                       return -ENODEV;
    ipv6_mc_destroy_dev() / ipv6_mc_down()
      mutex_lock(&idev->mc_lock)
      ...
      mutex_unlock(&idev->mc_lock)
                                    mutex_lock(&idev->mc_lock)
                                    ...
                                    mutex_unlock(&idev->mc_lock)

The race window can be easily closed by checking inet6_dev->dead
under inet6_dev->mc_lock in __ipv6_dev_mc_inc() as addrconf_ifdown()
will acquire it after marking inet6_dev dead.

Let's check inet6_dev->dead under mc_lock in __ipv6_dev_mc_inc().

Note that now __ipv6_dev_mc_inc() no longer depends on RTNL and
we can remove ASSERT_RTNL() there and the RTNL comment above
addrconf_join_solict().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-4-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoipv6: mcast: Replace locking comments with lockdep annotations.
Kuniyuki Iwashima [Wed, 2 Jul 2025 23:01:19 +0000 (16:01 -0700)] 
ipv6: mcast: Replace locking comments with lockdep annotations.

Commit 63ed8de4be81 ("mld: add mc_lock for protecting per-interface
mld data") added the same comments regarding locking to many functions.

Let's replace the comments with lockdep annotation, which is more helpful.

Note that we just remove the comment for mld_clear_zeros() and
mld_send_cr(), where mc_dereference() is used in the entry of the
function.

While at it, a comment for __ipv6_sock_mc_join() is moved back to the
correct place.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-3-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoipv6: ndisc: Remove __in6_dev_get() in pndisc_{constructor,destructor}().
Kuniyuki Iwashima [Wed, 2 Jul 2025 23:01:18 +0000 (16:01 -0700)] 
ipv6: ndisc: Remove __in6_dev_get() in pndisc_{constructor,destructor}().

ipv6_dev_mc_{inc,dec}() has the same check.

Let's remove __in6_dev_get() from pndisc_constructor() and
pndisc_destructor().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-2-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoMerge branch 'net-xsk-update-tx-queue-consumer'
Jakub Kicinski [Wed, 9 Jul 2025 01:28:10 +0000 (18:28 -0700)] 
Merge branch 'net-xsk-update-tx-queue-consumer'

Jason Xing says:

====================
net: xsk: update tx queue consumer

Patch 1 makes sure the consumer is updated at the end of generic xmit.
Patch 2 adds corresponding test.
====================

Link: https://patch.msgid.link/20250703141712.33190-1-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoselftests/bpf: add a new test to check the consumer update case
Jason Xing [Thu, 3 Jul 2025 14:17:12 +0000 (22:17 +0800)] 
selftests/bpf: add a new test to check the consumer update case

The subtest sends 33 packets at one time on purpose to see if xsk
exitting __xsk_generic_xmit() updates the global consumer of tx queue
when reaching the max loop (max_tx_budget, 32 by default). The number 33
can avoid xskq_cons_peek_desc() updates the consumer when it's about to
quit sending, to accurately check if the issue that the first patch
resolves remains. The new case will not check this issue in zero copy
mode.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250703141712.33190-3-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agonet: xsk: update tx queue consumer immediately after transmission
Jason Xing [Thu, 3 Jul 2025 14:17:11 +0000 (22:17 +0800)] 
net: xsk: update tx queue consumer immediately after transmission

For afxdp, the return value of sendto() syscall doesn't reflect how many
descs handled in the kernel. One of use cases is that when user-space
application tries to know the number of transmitted skbs and then decides
if it continues to send, say, is it stopped due to max tx budget?

The following formular can be used after sending to learn how many
skbs/descs the kernel takes care of:

  tx_queue.consumers_before - tx_queue.consumers_after

Prior to the current patch, in non-zc mode, the consumer of tx queue is
not immediately updated at the end of each sendto syscall when error
occurs, which leads to the consumer value out-of-dated from the perspective
of user space. So this patch requires store operation to pass the cached
value to the shared value to handle the problem.

More than those explicit errors appearing in the while() loop in
__xsk_generic_xmit(), there are a few possible error cases that might
be neglected in the following call trace:
__xsk_generic_xmit()
    xskq_cons_peek_desc()
        xskq_cons_read_desc()
    xskq_cons_is_valid_desc()
It will also cause the premature exit in the while() loop even if not
all the descs are consumed.

Based on the above analysis, using @sent_frame could cover all the possible
cases where it might lead to out-of-dated global state of consumer after
finishing __xsk_generic_xmit().

The patch also adds a common helper __xsk_tx_release() to keep align
with the zc mode usage in xsk_tx_release().

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250703141712.33190-2-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoMerge branch 'af_unix-introduce-so_inq-scm_inq'
Jakub Kicinski [Wed, 9 Jul 2025 01:05:27 +0000 (18:05 -0700)] 
Merge branch 'af_unix-introduce-so_inq-scm_inq'

Kuniyuki Iwashima says:

====================
af_unix: Introduce SO_INQ & SCM_INQ.

We have an application that uses almost the same code for TCP and
AF_UNIX (SOCK_STREAM).

The application uses TCP_INQ for TCP, but AF_UNIX doesn't have it
and requires an extra syscall, ioctl(SIOCINQ) or getsockopt(SO_MEMINFO)
as an alternative.

Also, ioctl(SIOCINQ) for AF_UNIX SOCK_STREAM is more expensive because
it needs to iterate all skb in the receive queue.

This series adds a cached field for SIOCINQ to speed it up and introduce
SO_INQ, the generic version of TCP_INQ to get the queue length as cmsg in
each recvmsg().
====================

Link: https://patch.msgid.link/20250702223606.1054680-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoselftest: af_unix: Add test for SO_INQ.
Kuniyuki Iwashima [Wed, 2 Jul 2025 22:35:19 +0000 (22:35 +0000)] 
selftest: af_unix: Add test for SO_INQ.

Let's add a simple test to check the basic functionality of SO_INQ.

The test does the following:

  1. Create socketpair in self->fd[]
  2. Enable SO_INQ
  3. Send data via self->fd[0]
  4. Receive data from self->fd[1]
  5. Compare the SCM_INQ cmsg with ioctl(SIOCINQ)

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250702223606.1054680-8-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoaf_unix: Introduce SO_INQ.
Kuniyuki Iwashima [Wed, 2 Jul 2025 22:35:18 +0000 (22:35 +0000)] 
af_unix: Introduce SO_INQ.

We have an application that uses almost the same code for TCP and
AF_UNIX (SOCK_STREAM).

TCP can use TCP_INQ, but AF_UNIX doesn't have it and requires an
extra syscall, ioctl(SIOCINQ) or getsockopt(SO_MEMINFO) as an
alternative.

Let's introduce the generic version of TCP_INQ.

If SO_INQ is enabled, recvmsg() will put a cmsg of SCM_INQ that
contains the exact value of ioctl(SIOCINQ).  The cmsg is also
included when msg->msg_get_inq is non-zero to make sockets
io_uring-friendly.

Note that SOCK_CUSTOM_SOCKOPT is flagged only for SOCK_STREAM to
override setsockopt() for SOL_SOCKET.

By having the flag in struct unix_sock, instead of struct sock, we
can later add SO_INQ support for TCP and reuse tcp_sk(sk)->recvmsg_inq.

Note also that supporting custom getsockopt() for SOL_SOCKET will need
preparation for other SOCK_CUSTOM_SOCKOPT users (UDP, vsock, MPTCP).

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250702223606.1054680-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoaf_unix: Cache state->msg in unix_stream_read_generic().
Kuniyuki Iwashima [Wed, 2 Jul 2025 22:35:17 +0000 (22:35 +0000)] 
af_unix: Cache state->msg in unix_stream_read_generic().

In unix_stream_read_generic(), state->msg is fetched multiple times.

Let's cache it in a local variable.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250702223606.1054680-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoaf_unix: Use cached value for SOCK_STREAM in unix_inq_len().
Kuniyuki Iwashima [Wed, 2 Jul 2025 22:35:16 +0000 (22:35 +0000)] 
af_unix: Use cached value for SOCK_STREAM in unix_inq_len().

Compared to TCP, ioctl(SIOCINQ) for AF_UNIX SOCK_STREAM socket is more
expensive, as unix_inq_len() requires iterating through the receive queue
and accumulating skb->len.

Let's cache the value for SOCK_STREAM to a new field during sendmsg()
and recvmsg().

The field is protected by the receive queue lock.

Note that ioctl(SIOCINQ) for SOCK_DGRAM returns the length of the first
skb in the queue.

SOCK_SEQPACKET still requires iterating through the queue because we do
not touch functions shared with unix_dgram_ops.  But, if really needed,
we can support it by switching __skb_try_recv_datagram() to a custom
version.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250702223606.1054680-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoaf_unix: Don't use skb_recv_datagram() in unix_stream_read_skb().
Kuniyuki Iwashima [Wed, 2 Jul 2025 22:35:15 +0000 (22:35 +0000)] 
af_unix: Don't use skb_recv_datagram() in unix_stream_read_skb().

unix_stream_read_skb() calls skb_recv_datagram() with MSG_DONTWAIT,
which is mostly equivalent to sock_error(sk) + skb_dequeue().

In the following patch, we will add a new field to cache the number
of bytes in the receive queue.  Then, we want to avoid introducing
atomic ops in the fast path, so we will reuse the receive queue lock.

As a preparation for the change, let's not use skb_recv_datagram()
in unix_stream_read_skb().

Note that sock_error() is now moved out of the u->iolock mutex as
the mutex does not synchronise the peer's close() at all.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250702223606.1054680-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoaf_unix: Don't check SOCK_DEAD in unix_stream_read_skb().
Kuniyuki Iwashima [Wed, 2 Jul 2025 22:35:14 +0000 (22:35 +0000)] 
af_unix: Don't check SOCK_DEAD in unix_stream_read_skb().

unix_stream_read_skb() checks SOCK_DEAD only when the dequeued skb is
OOB skb.

unix_stream_read_skb() is called for a SOCK_STREAM socket in SOCKMAP
when data is sent to it.

The function is invoked via sk_psock_verdict_data_ready(), which is
set to sk->sk_data_ready().

During sendmsg(), we check if the receiver has SOCK_DEAD, so there
is no point in checking it again later in ->read_skb().

Also, unix_read_skb() for SOCK_DGRAM does not have the test either.

Let's remove the SOCK_DEAD test in unix_stream_read_skb().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250702223606.1054680-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoaf_unix: Don't hold unix_state_lock() in __unix_dgram_recvmsg().
Kuniyuki Iwashima [Wed, 2 Jul 2025 22:35:13 +0000 (22:35 +0000)] 
af_unix: Don't hold unix_state_lock() in __unix_dgram_recvmsg().

When __skb_try_recv_datagram() returns NULL in __unix_dgram_recvmsg(),
we hold unix_state_lock() unconditionally.

This is because SOCK_SEQPACKET sk needs to return EOF in case its peer
has been close()d concurrently.

This behaviour totally depends on the timing of the peer's close() and
reading sk->sk_shutdown, and taking the lock does not play a role.

Let's drop the lock from __unix_dgram_recvmsg() and use READ_ONCE().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250702223606.1054680-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoMerge branch 'eth-fbnic-add-firmware-logging-support'
Jakub Kicinski [Wed, 9 Jul 2025 00:05:50 +0000 (17:05 -0700)] 
Merge branch 'eth-fbnic-add-firmware-logging-support'

Lee Trager says:

====================
eth: fbnic: Add firmware logging support

Firmware running on fbnic generates device logs. These logs contain useful
information about the device which may or may not be related to the host.
Logs are stored in a ring buffer and accessible through DebugFS.
====================

Link: https://patch.msgid.link/20250702192207.697368-1-lee@trager.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoeth: fbnic: Create fw_log file in DebugFS
Lee Trager [Wed, 2 Jul 2025 19:12:12 +0000 (12:12 -0700)] 
eth: fbnic: Create fw_log file in DebugFS

Allow reading the firmware log in DebugFS by accessing the fw_log file.
Buffer is read while a spinlock is acquired.

Signed-off-by: Lee Trager <lee@trager.us>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250702192207.697368-7-lee@trager.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoeth: fbnic: Enable firmware logging
Lee Trager [Wed, 2 Jul 2025 19:12:11 +0000 (12:12 -0700)] 
eth: fbnic: Enable firmware logging

The firmware log buffer is enabled during probe and freed during remove.
Early versions of firmware do not support sending logs. Once the mailbox is
up driver will enable logging when supported firmware versions are detected.
Logging is disabled before the mailbox is freed.

Signed-off-by: Lee Trager <lee@trager.us>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250702192207.697368-6-lee@trager.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoeth: fbnic: Add mailbox support for firmware logs
Lee Trager [Wed, 2 Jul 2025 19:12:10 +0000 (12:12 -0700)] 
eth: fbnic: Add mailbox support for firmware logs

By default firmware will not send logs to the host. This must be explicitly
enabled by the driver. The mailbox has the concept of a flag which is a u32
used as a boolean. Lack of flag defaults to a value of false. When enabling
logging historical logs may be optionally requested. These are log messages
generated by the NIC before the driver was loaded. The driver also sends a
log version to support changing the logging format in the future.

[SEND_LOGS_REQ] = {
    [SEND_LOGS]          /* flag to request log reporting */
    [SEND_LOGS_HISTORY]  /* flag to request historical logs */
    [SEND_LOGS_VERSION]  /* u32 indicating the log format version */
}

Logs may be sent to the user either one at a time, or when historical logs
are requested in bulk. Firmware may not send more than 14 messages in bulk
to prevent flooding the mailbox.

[LOG_MSG] = {
    [LOG_INDEX]     /* entry 0 - u64 index of log */
    [LOG_MSEC]      /* entry 0 - u32 timestamp of log */
    [LOG_MSG]       /* entry 0 - char log message up to 256 */
    [LOG_LENGTH]    /* u32 of remaining log items in arrays */
    [LOG_INDEX_ARRAY] = {
        [LOG_INDEX] /* entry 1 - u64 index of log */
[LOG_INDEX] /* entry 2 - u64 index of log */
...
    }
    [LOG_MSEC_ARRAY] = {
        [LOG_MSEC]  /* entry 1 - u32 timestamp of log */
[LOG_MSEC]  /* entry 2 - u32 timestamp of log */
...
    }
    [LOG_MSG_ARRAY] = {
        [LOG_MSG]   /* entry 1 - char log message up to 256 */
[LOG_MSG]   /* entry 2 - char log message up to 256 */
...
    }
}

Signed-off-by: Lee Trager <lee@trager.us>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250702192207.697368-5-lee@trager.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoeth: fbnic: Create ring buffer for firmware logs
Lee Trager [Wed, 2 Jul 2025 19:12:09 +0000 (12:12 -0700)] 
eth: fbnic: Create ring buffer for firmware logs

When enabled, firmware may send logs messages which are specific to the
device and not the host. Create a ring buffer to store these messages
which are read by a user through DebugFS. Buffer access is protected by
a spinlock.

Signed-off-by: Lee Trager <lee@trager.us>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250702192207.697368-4-lee@trager.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoeth: fbnic: Use FIELD_PREP to generate minimum firmware version
Lee Trager [Wed, 2 Jul 2025 19:12:08 +0000 (12:12 -0700)] 
eth: fbnic: Use FIELD_PREP to generate minimum firmware version

Create a new macro based on FIELD_PREP to generate easily readable minimum
firmware version ints. This macro will prevent the mistake from the
previous patch from happening again.

Signed-off-by: Lee Trager <lee@trager.us>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250702192207.697368-3-lee@trager.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoeth: fbnic: Fix incorrect minimum firmware version
Lee Trager [Wed, 2 Jul 2025 19:12:07 +0000 (12:12 -0700)] 
eth: fbnic: Fix incorrect minimum firmware version

The full minimum version is 0.10.6-0. The six is now correctly defined as
patch and shifted appropriately. 0.10.6-0 is a preproduction version of
firmware which was released over a year and a half ago. All production
devices meet this requirement.

Signed-off-by: Lee Trager <lee@trager.us>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250702192207.697368-2-lee@trager.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 weeks agoMerge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox...
Jakub Kicinski [Tue, 8 Jul 2025 23:59:56 +0000 (16:59 -0700)] 
Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux

Tariq Toukan says:

====================
mlx5-next updates 2025-07-08

The following pull-request contains common mlx5 updates
for your *net-next* tree.

v2: https://lore.kernel.org/1751574385-24672-1-git-send-email-tariqt@nvidia.com

* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
  net/mlx5: Check device memory pointer before usage
  net/mlx5: fs, fix RDMA TRANSPORT init cleanup flow
  net/mlx5: Add IFC bits for PCIe Congestion Event object
  net/mlx5: Small refactor for general object capabilities
  net/mlx5: fs, add multiple prios to RDMA TRANSPORT steering domain
====================

Link: https://patch.msgid.link/1752002102-11316-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoMerge branch 'net-migrate-remaining-drivers-to-dedicated-_rxfh_context-ops'
Jakub Kicinski [Tue, 8 Jul 2025 18:56:42 +0000 (11:56 -0700)] 
Merge branch 'net-migrate-remaining-drivers-to-dedicated-_rxfh_context-ops'

Jakub Kicinski says:

====================
net: migrate remaining drivers to dedicated _rxfh_context ops

Around a year ago Ed added dedicated ops for managing RSS contexts.
This significantly improved the clarity of the driver facing API.
Migrate the remaining 3 drivers and remove the old way of muxing
the RSS context operations via .set_rxfh().

v2: https://lore.kernel.org/20250702030606.1776293-1-kuba@kernel.org
v1: https://lore.kernel.org/20250630160953.1093267-1-kuba@kernel.org
====================

Link: https://patch.msgid.link/20250707184115.2285277-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: ethtool: reduce indent for _rxfh_context ops
Jakub Kicinski [Mon, 7 Jul 2025 18:41:15 +0000 (11:41 -0700)] 
net: ethtool: reduce indent for _rxfh_context ops

Now that we don't have the compat code we can reduce the indent
a little. No functional changes.

Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://patch.msgid.link/20250707184115.2285277-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: ethtool: remove the compat code for _rxfh_context ops
Jakub Kicinski [Mon, 7 Jul 2025 18:41:14 +0000 (11:41 -0700)] 
net: ethtool: remove the compat code for _rxfh_context ops

All drivers are now converted to dedicated _rxfh_context ops.
Remove the use of >set_rxfh() to manage additional contexts.

Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://patch.msgid.link/20250707184115.2285277-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoeth: mlx5: migrate to the *_rxfh_context ops
Jakub Kicinski [Mon, 7 Jul 2025 18:41:13 +0000 (11:41 -0700)] 
eth: mlx5: migrate to the *_rxfh_context ops

Convert mlx5 to dedicated RXFH ops. This is a fairly shallow
conversion, TBH, most of the driver code stays as is, but we
let the core allocate the context ID for the driver.

mlx5e_rx_res_rss_get_rxfh() and friends are made void, since
core only calls the driver for context 0. The second call
is right after context creation so it must exist (tm).

Tested with drivers/net/hw/rss_ctx.py on MCX6.

Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20250707184115.2285277-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoeth: ice: drop the dead code related to rss_contexts
Jakub Kicinski [Mon, 7 Jul 2025 18:41:12 +0000 (11:41 -0700)] 
eth: ice: drop the dead code related to rss_contexts

ICE appears to have some odd form of rss_context use plumbed
in for .get_rxfh. The .set_rxfh side does not support creating
contexts, however, so this must be dead code. For at least a year
now (since commit 7964e7884643 ("net: ethtool: use the tracking
array for get_rxfh on custom RSS contexts")) we have not been
calling .get_rxfh with a non-zero rss_context. We just get
the info from the RSS XArray under dev->ethtool.

Remove what must be dead code in the driver, clear the support flags.

Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250707184115.2285277-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>