git.ipfire.org Git - thirdparty/kernel/linux.git/log

tls: remove dead sockmap (psock) handling from the SW path

TLS and sockmap are now mutually exclusive. Try to delete the code
from sendmsg and recvmsg path which is now obviously dead.

The main goal is to delete enough code for AI security scanners
to no longer bother us with sockmap related bugs. At the same
time retain the code in case someone has the cycles to fix
all of this and make the integration work, again.

If the integration does not get restored we can wipe the rest
of the skmsg code from TLS in two or three releases.

The changes on the Tx side are deeper since that's where most
of the bugs are; the Rx side simply takes the data from sockmap
and gives it to the user. On Tx, split record handling and
rolling back the iterator were the two problem areas.

Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20260614014102.461064-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tls: reject the combination of TLS and sockmap

TLS and sockmap (BPF psock) integration hides a lot of latent bugs.
These bugs may be more or less relevant for real users, but they
are definitely exploitable.

We could not find anyone actively using this integration so let's
reject this config. Adding a TLS socket to a sockmap was already
rejected by sk_psock_init() through the inet_csk_has_ulp() check.
We need to reject the attempts to configure the TLS keys (rather
than adding the ULP itself) because checking prior to the ULP
installation is tricky without risking a race with sockmap getting
added in parallel (sockmap does not hold the socket lock).

This patch is a minimal rejection of the feature. A subsequent patch
in the series will do a light dead code removal. Full cleanup would
require a major rewrite of the Tx path, since we don't need skmsg any
more.

Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20260614014102.461064-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'atm-remove-more-dead-code'

Jakub Kicinski says:

====================
atm: remove more dead code

Commit 6deb53595092 ("net: remove unused ATM protocols and legacy
ATM device drivers") removed a good chunk of old ATM drivers.
Our goal going forward is to limit the ATM support to PPPoATM
used in ADSL deployments.

A recent burst of AI generated fixes for net/atm/signaling.c and
net/atm/svc.c made me look closer at the remaining code. PPPoATM runs
over permanent virtual circuits (PF_ATMPVC) with a statically
configured VPI/VCI. We can drop switched virtual circuits (SVCs)
and user-space signaling (atmsigd) support. While digging around
I noticed a few more obviously dead pieces of code.

Annoyingly, I have applied one "fix" to QoS config which will
now make net conflict with this series :/
====================

Link: https://patch.msgid.link/20260615194416.752559-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

atm: remove orphaned uAPI for deleted drivers, protocols and SVCs

ATM removals have left a number of uAPI headers and ioctl
definitions with no in-kernel implementation behind them:

- device headers for adapters deleted with the legacy PCI/SBUS drivers:
   atm_eni.h, atm_he.h, atm_idt77105.h, atm_nicstar.h, atm_zatm.h and
   the atmtcp pair atm_tcp.h / <linux/atm_tcp.h>
- protocol headers for the removed CLIP, LANE and MPOA stacks:
   atmarp.h, atmclip.h, atmlec.h, atmmpc.h
- atmsvc.h and the SVC / p2mp / local-address ioctls in atmdev.h
   (ATM_{GET,RST,ADD,DEL}ADDR, ATM_{ADD,DEL,GET}LECSADDR,
   ATM_{ADD,DROP}PARTY) left behind by the SVC and address-registry
   removals

None of these are referenced by any remaining in-tree code.
Let's try to delete all this. Chances are nobody cares about
these headers any more. I'm keeping this separate from the
kernel side code changes for ease of revert, in case I am
proven wrong...

Link: https://patch.msgid.link/20260615194416.752559-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

atm: remove unused ATM PHY operations

The PHY operations are vestiges of the SAR/framer split used by the
removed PCI/SBUS ATM adapters:

- atmdev_ops::phy_put / ::phy_get (register accessors) are never called
   by the core and solos-pci only listed them as NULL
- struct atmphy_ops and atm_dev::phy have no users at all - nothing
   assigns or dereferences them

Remove all of them. atm_dev::phy_data is kept: solos-pci repurposes it
to stash its per-port channel index.

Link: https://patch.msgid.link/20260615194416.752559-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

atm: remove the unused pre_send and send_bh device operations

atmdev_ops::pre_send (a TX pre-processing hook) and ::send_bh (a
bottom-half capable send variant) have no implementation behind them:
no remaining ATM driver sets either, so vcc_sendmsg() always skipped
pre_send and the raw AAL0/AAL5 paths always fell back to ->send().
The drivers that used these hooks were removed with the legacy ATM
adapters.

Drop both operations and the dead branches that tested for them.

Link: https://patch.msgid.link/20260615194416.752559-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

atm: remove the unused change_qos device operation

atmdev_ops::change_qos() was the hook for renegotiating the traffic
parameters of an already-connected VCC, driven from SO_ATMQOS on a
connected socket (and previously from the SVC as_modify path, now gone).
None of the ATM drivers left in tree implement it - solos-pci only listed
change_qos = NULL - so atm_change_qos() always returned -EOPNOTSUPP.

Drop the operation and return -EOPNOTSUPP directly.

Link: https://patch.msgid.link/20260615194416.752559-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

atm: remove SVC socket support and the signaling daemon interface

ATM switched virtual circuits (SVCs) are set up and torn down by a
user-space signaling daemon (atmsigd) which the kernel talks to over
a dedicated "sigd" socket: the kernel marshals Q.2931-style requests
(as_connect, as_listen, as_accept, as_close, ...) to the daemon and
applies the results to PF_ATMSVC sockets. This is the machinery behind
classical SVC use and was the foundation for LANE / MPOA, all of which
have been removed.

DSL deployments do not use any of this. PPPoATM and BR2684 run over
permanent virtual circuits (PF_ATMPVC) with a statically configured
VPI/VCI; no atmsigd, no Q.2931. Neither remaining ATM driver
(solos-pci, the USB DSL modems) is reachable through the SVC path.

Remove the SVC socket family and the signaling interface:

- delete net/atm/svc.c, net/atm/signaling.c and signaling.h
- drop atmsvc_init()/atmsvc_exit() and the PF_ATMSVC registration and
   module alias
- drop the ATMSIGD_CTRL ioctl (sigd_attach) and the /proc/net/atm/svc
   file
- fold the SVC branch out of atm_change_qos(); all sockets are PVCs now

The obsolete ATM_SETSC ioctl stub is left in place (it already just
warns and returns 0), as is the struct atm_vcc SVC bookkeeping shared
with the queueing layer.

Link: https://patch.msgid.link/20260615194416.752559-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

atm: remove the local ATM (NSAP) address registry

net/atm/addr.c maintained the per-device lists of local NSAP addresses
(dev->local) and ILMI-learned LECS addresses (dev->lecs). These exist
solely to serve SVC signaling: the lists are populated through the
ATM_{ADD,DEL,RST}ADDR / ATM_{ADD,DEL,GET}LECSADDR ioctls used by the
atmsigd / ILMI daemons, and consumed when registering addresses with the
signaling daemon. The LECS list belonged to LAN Emulation, which has
been removed.

With no SVC users in a DSL-only configuration these lists are always
empty, so drop the registry entirely:

- remove the ADDR/LECSADDR/RSTADDR ioctls
- drop the now-always-empty "atmaddress" sysfs attribute
- remove the dev->local / dev->lecs lists, structs and enums
- delete net/atm/addr.c and net/atm/addr.h

The device ESI ("MAC" address) and its ATM_{G,S}ETESI ioctls and
"address" sysfs attribute are retained - the USB DSL modems populate
the ESI.

Link: https://patch.msgid.link/20260615194416.752559-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

atm: remove dead SONET PHY ioctls

The SONET_* ioctls are SONET/SDH PHY controls that atm_dev_ioctl() and
the compat path only ever forwarded to the driver's ->ioctl() handler.
The PHY drivers that implemented them (the S/UNI library and the framers
on the removed PCI/SBUS adapters) are gone, and neither surviving driver
services them: solos-pci has no ->ioctl, and usbatm handles only
ATM_QUERYLOOP. They now uniformly return an error regardless.

Drop the SONET compat passthrough and the SONET cases in atm_dev_ioctl(),
along with the now-unused linux/sonet.h includes. The SONET_* uAPI
definitions are untouched.

Link: https://patch.msgid.link/20260615194416.752559-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

atm: remove the unused send_oam / push_oam callbacks

The atmdev_ops::send_oam device operation and the atm_vcc::push_oam
callback were the kernel's interface for raw F4/F5 OAM cell exchange.
Nothing assigns them a non-NULL value and nothing ever invokes them:
the core only ever initialises push_oam to NULL (in vcc_create() and the
AAL init helpers) and the Solos driver only lists send_oam = NULL for
documentation. The drivers that actually drove OAM through these hooks
were removed along with the legacy ATM adapters.

Drop both callbacks and the NULL initialisers.

Link: https://patch.msgid.link/20260615194416.752559-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

atm: remove AAL3/4 transport support

AAL3/4 is an obsolete connection-oriented ATM adaptation layer that has
seen no real use since the SMDS-era hardware it was designed for (90s?).
We are only maintaining ATM support in-tree to keep PPPoATM running,
and PPPoATM runs over AAL5.

Drop the "raw" AAL3/4 transport (atm_init_aal34()) and the ATM_AAL34
cases in the connect and traffic-parameter paths. A vcc_connect() with
qos.aal == ATM_AAL34 now fails with -EPROTOTYPE.

uAPI cleanup is performed later, separately.

Link: https://patch.msgid.link/20260615194416.752559-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: sja1105: fix lastused timestamp in flower stats

flow_stats_update() takes an absolute timestamp for lastused, not a
delta. Fix that.

Signed-off-by: David Yang <mmyangfl@gmail.com>
Link: https://patch.msgid.link/20260614141320.1133321-1-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'extend-netkit-io_uring-zc-selftests'

Daniel Borkmann says:

====================
Extend netkit io_uring ZC selftests

Small follow-up to the HW net selftests, in particular to add a
selftest showing that a large rx_buf_len for io_uring ZC is also
supported with netkit queue leasing.
====================

Link: https://patch.msgid.link/20260614102607.863838-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net: Add hugepage kernel config dependency for zcrx

test_iou_zcrx_large_buf in drivers/net/hw/nk_qlease.py runs iou-zcrx
with rx_buf_len > page size, backed by a hugepage-mapped area. Thus
add the hugepage dependency to the kernel config.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260614102607.863838-5-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net: Add netkit io_uring ZC test for large rx_buf_len

Add test_iou_zcrx_large_buf, which runs iou-zcrx with rx_buf_len >
page size (-x 2) through a netkit-leased RX queue. The netkit ifindex
is opaque to io_uring, but rx_page_size is honoured by the leased
physical qops via netif_mp_open_rxq()'s lease redirect.

Originally, I also added a BIG TCP variant on top, but dropped it
here as fbnic (and the QEMU fbnic model) has no BIG TCP support
to exercise it at this point.

Tested against the QEMU fbnic emulation. The new test exercises
the > page rx_buf_len path only when the leased NIC advertises
QCFG_RX_PAGE_SIZE; otherwise it skips.

For fbnic, I used Bjorn's patches locally [0]:

  # ./nk_qlease.py
  TAP version 13
  1..5
  ok 1 nk_qlease.test_iou_zcrx
  ok 2 nk_qlease.test_iou_zcrx_large_buf
  ok 3 nk_qlease.test_attrs
  ok 4 nk_qlease.test_attach_xdp_with_mp
  ok 5 nk_qlease.test_destroy
  # Totals: pass:5 fail:0 xfail:0 xpass:0 skip:0 error:0

Without those patches (aka not advertising QCFG_RX_PAGE_SIZE):

  # ./nk_qlease.py
  TAP version 13
  1..5
  ok 1 nk_qlease.test_iou_zcrx
  ok 2 nk_qlease.test_iou_zcrx_large_buf # SKIP Large chunks are not supported -95
  ok 3 nk_qlease.test_attrs
  ok 4 nk_qlease.test_attach_xdp_with_mp
  ok 5 nk_qlease.test_destroy
  # Totals: pass:4 fail:0 xfail:0 xpass:0 skip:1 error:0

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://lore.kernel.org/netdev/20260522113225.241337-1-bjorn@kernel.org/
Link: https://patch.msgid.link/20260614102607.863838-4-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net: Use public NetDrvContEnv API in nk_qlease fixtures

Expose the netkit host ifname as a public attribute nk_host_ifname
(symmetric with the already-public nk_guest_ifname), rename _attach_bpf
to a public attach_bpf, and add a public detach_bpf helper that
encapsulates the tc-filter teardown bookkeeping. Switch the fixtures
to this public API. No functional change and keeps pylint happy.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260614102607.863838-3-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net: Move netkit lease hw setup into per-test fixtures

The HW counterpart of nk_qlease.py was carrying its lease setup in main()
and stashing src_queue / nk_queue / nk_*_ifname on cfg, which had drawbacks
called out during the review at [0].

This is the deferred half of the cleanup that landed in commit e254ffb9502c
("selftests/net: Split netdevsim tests from HW tests in nk_qlease") which
was the SW counterpart of nk_qlease.py.

While at it, convert the open-coded "ip netns exec" prefixes in the test
bodies over to the ns= argument of cmd() / bkg().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://lore.kernel.org/netdev/20260408162238.16709090@kernel.org/
Link: https://patch.msgid.link/20260614102607.863838-2-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'ionic-expose-more-port-stats-to-ethtool'

Eric Joyner says:

====================
ionic: Expose more port stats to ethtool [part]

The primary aim of this patchset is to support the reporting of new port
statistics (and one old one) that firmware sends to the driver. A scheme
for these extra stats is introduced in order to prevent devices that
don't support these new statistics from unconditionally setting them or
reporting them in ethtool.
====================

Link: https://patch.msgid.link/20260614205303.48088-1-eric.joyner@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ionic: Get "link_down_count" ext link stat from firmware

The number of times that link has gone down at the port level is tracked
by the firmware and sent to the driver via regular DMA writes to an
instance of struct ionic_port_status in the driver's memory.

This statistic was never reported in favor of a driver-derived stat, but
doing it in the driver was never necessary since firmware had been
reporting it the whole time. Since it would be more accurate and true to
the description of the statistic to get this count at the PHY level,
replace the driver-calculated statistic with one derived from the
firmware one and remove the driver-calculated one entirely.

The stat reported by the ethtool .get_link_ext_stats() handler is
normalized to 0 on driver load and after any device reset that requires
the driver to rebuild state; counter overflows are handled as well.
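
For illustration only (this is not the driver code), the normalization
described above can be modelled in a few lines of Python. The 16-bit
counter width and the names LinkDownCounter, rebase() and update() are
assumptions made for this sketch; the commit message does not state them.

```python
COUNTER_BITS = 16  # assumed width of the firmware counter, not from the commit
COUNTER_MASK = (1 << COUNTER_BITS) - 1


class LinkDownCounter:
    """Report a wrapping firmware counter relative to a baseline."""

    def __init__(self, fw_value: int) -> None:
        self.last_fw = fw_value & COUNTER_MASK
        self.total = 0  # value shown to user space, starts at 0

    def rebase(self, fw_value: int) -> None:
        """Call on driver load or after a reset that rebuilds driver state."""
        self.last_fw = fw_value & COUNTER_MASK
        self.total = 0

    def update(self, fw_value: int) -> int:
        """Fold the latest firmware reading into the reported total."""
        fw_value &= COUNTER_MASK
        # Modular subtraction gives the right delta across a single wrap.
        delta = (fw_value - self.last_fw) & COUNTER_MASK
        self.total += delta
        self.last_fw = fw_value
        return self.total
```

The reported value depends only on deltas between successive firmware
readings, so the absolute value the firmware holds at load time never
leaks into the statistic.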

Signed-off-by: Eric Joyner <eric.joyner@amd.com>
Link: https://patch.msgid.link/20260614205303.48088-5-eric.joyner@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ionic: Report "rx_bits_phy" stat to ethtool

This stat contains the number of total bits that the PHY has received;
it's useful for BER calculations. Add it to the ethtool stats output.

However, since this is one of the new "extra port stats", it's reported
in a different manner than the existing port stats and only
conditionally added to the ethtool stats output list: both the
DEV_CAP_EXTRA_STATS capability must be supported by the firmware, and
the firmware must set the value of the statistic to something other than
IONIC_STAT_INVALID.

To help support this scheme, the extra port stats region is initialized to
0xff's/IONIC_STAT_INVALID by the driver, to ensure the statistics that
the driver knows about but the firmware does not are still invalid
to the driver.
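
A rough Python model of this scheme, for illustration only. The 64-bit
all-ones value for IONIC_STAT_INVALID, the helper names and the
"future_stat" entry are assumptions of this sketch, not taken from the
driver.

```python
IONIC_STAT_INVALID = 0xFFFFFFFFFFFFFFFF  # assumed: all-ones 64-bit sentinel


def init_extra_stats(names):
    """Driver side: pre-fill every known stat with the invalid sentinel."""
    return {name: IONIC_STAT_INVALID for name in names}


def reportable_stats(stats, has_extra_stats_cap):
    """Return only the stats that should show up in ethtool output."""
    if not has_extra_stats_cap:
        # Firmware does not advertise the capability: report nothing.
        return {}
    # Firmware overwrites only the stats it knows; the rest stay invalid.
    return {k: v for k, v in stats.items() if v != IONIC_STAT_INVALID}


# "future_stat" is a hypothetical stat the firmware never fills in.
stats = init_extra_stats(["rx_bits_phy", "future_stat"])
stats["rx_bits_phy"] = 123  # simulated firmware DMA write
```

With the capability present, reportable_stats(stats, True) contains only
rx_bits_phy; without it the result is empty.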

Signed-off-by: Eric Joyner <eric.joyner@amd.com>
Link: https://patch.msgid.link/20260614205303.48088-4-eric.joyner@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ionic: Update ionic_if.h with new extra port stats

Add a new structure to report additional statistics from the firmware to
struct ionic_port_info. This new struct currently only contains FEC
related statistics, but any new port-level statistics collected by the
firmware would go into it.

The new structure is located in the same area as the unused
ionic_port_pb_stats structure, so this patch also removes that and its
supporting enumerations since they were never used in this driver.

Finally, to indicate firmware support for the new structure, introduce a
new device capability that the driver can use to see if the attached
device supports reporting these extra stats.

Signed-off-by: Eric Joyner <eric.joyner@amd.com>
Link: https://patch.msgid.link/20260614205303.48088-3-eric.joyner@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ionic: Fix check in ionic_get_link_ext_stats

The current check will fail if SR-IOV is not initialized for the
physical function; this is because is_physfn is 0 if sriov_init() isn't
run or fails. Change the check that prevents getting the link down count
to use is_virtfn instead so that VFs don't get this functionality, which
was the original intent.
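
The difference between the two checks can be tabulated with a small
Python model. This is a sketch only: the exact form of the old test is
inferred from the description above, and the function names are made up.

```python
def old_check_allows(is_physfn: bool, is_virtfn: bool) -> bool:
    # Assumed shape of the old test: only devices flagged as a PF pass.
    return is_physfn


def new_check_allows(is_physfn: bool, is_virtfn: bool) -> bool:
    # New test: everything except a VF passes.
    return not is_virtfn


# (is_physfn, is_virtfn) for the three situations of interest.
CASES = {
    "PF, SR-IOV initialized": (True, False),
    "PF, SR-IOV not initialized": (False, False),
    "VF": (False, True),
}

for name, flags in CASES.items():
    print(f"{name}: old={old_check_allows(*flags)} new={new_check_allows(*flags)}")
```

Only the middle case changes: a PF without SR-IOV set up was refused by
the old test and is accepted by the new one, while VFs stay excluded.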

Fixes: 132b4ebfa090 ("ionic: add support for ethtool extended stat link_down_count")
Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Eric Joyner <eric.joyner@amd.com>
Link: https://patch.msgid.link/20260614205303.48088-2-eric.joyner@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-dsa-mxl862xx-serdes-ports'

Daniel Golle says:

====================
net: dsa: mxl862xx: SerDes ports

Add support for the two SerDes PCS interfaces of the MxL862xx switch
ICs, which can both either be used to connect PHYs or SFP cages, or as
CPU port(s). 1000Base-X, 2500Base-X, 10GBase-R, 10GBase-KR, SGMII,
QSGMII and USXGMII (single 10G or quad 2.5G) are supported.

The firmware only added the API to directly control the PCS as of
version 1.0.84, so the PCS features are gated behind a version check.

As the driver is growing, do some refactoring to break out the phylink
parts into mxl862xx-phylink.h.
====================

Link: https://patch.msgid.link/cover.1781319534.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: mxl862xx: add support for SerDes ports

The MxL862xx has two XPCS/SerDes interfaces (XPCS0 for ports 9-12,
XPCS1 for ports 13-16). Each can operate in various single-lane modes
(SGMII, 1000Base-X, 2500Base-X, 10GBase-R, 10GBase-KR, USXGMII) or as
QSGMII or 10G_QXGMII providing four sub-ports per interface.

Implement phylink PCS operations using the firmware's XPCS API:

  - pcs_enable/pcs_disable: refcount the sub-ports sharing an XPCS
    and power it down once the last sub-port is released.
  - pcs_config: configure negotiation mode and CL37/SGMII advertising.
  - pcs_get_state: read link state and the link-partner ability word
    from firmware and decode using phylink's standard CL37, SGMII, and
    USXGMII decoders.
  - pcs_an_restart: restart CL37 or CL73 auto-negotiation.
  - pcs_link_up: force speed/duplex for SGMII.
  - pcs_inband_caps: report per-mode in-band status capabilities.
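
The pcs_enable/pcs_disable refcounting can be pictured with the
following Python sketch. It is a simplified model, not the driver: the
class name SharedXpcs is hypothetical, and powering up on the first
enable is an assumption (the text above only states the power-down on
last release).

```python
class SharedXpcs:
    """One XPCS shared by several sub-ports (e.g. QSGMII)."""

    def __init__(self) -> None:
        self.users = 0
        self.powered = False

    def enable(self) -> None:
        # Assumed: the first user powers the XPCS up.
        if self.users == 0:
            self.powered = True
        self.users += 1

    def disable(self) -> None:
        if self.users == 0:
            raise RuntimeError("disable without matching enable")
        self.users -= 1
        # The last sub-port to leave powers the XPCS down.
        if self.users == 0:
            self.powered = False
```

With four sub-ports enabled, the XPCS stays powered until the fourth
disable() call.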

Register a PCS instance for each SerDes interface and
QSGMII/10G_QXGMII sub-ports during setup. Advertise the supported
interface modes in phylink_get_caps based on port number.

Firmware older than 1.0.84 lacks the XPCS API and instead configures
the SerDes itself, using defaults stored in flash. mac_select_pcs()
returns NULL in that case while the single-lane interface modes stay
advertised, so a CPU port keeps working in the firmware-configured
mode.

Because Linux lacks support for expressing PHY-side role modes, only
the MAC side of SGMII, QSGMII, USXGMII and 10G_QXGMII is implemented
for now.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/736e4df02e4cb8c530c1670cbe7efac20b5d696d.1781319534.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: mxl862xx: move API macros to mxl862xx-host.h

Move the MXL862XX_API_WRITE, MXL862XX_API_READ and
MXL862XX_API_READ_QUIET convenience macros from mxl862xx.c to
mxl862xx-host.h next to the mxl862xx_api_wrap() prototype they wrap.
This makes them available to other compilation units that include
mxl862xx-host.h, which is needed once the SerDes PCS code in
mxl862xx-phylink.c also calls firmware commands.

No functional change.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/914f57931e79cc3932a9f32813465c08d29cf4bf.1781319534.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: mxl862xx: move phylink stubs to mxl862xx-phylink.c

Move the phylink MAC operations and get_caps callback from mxl862xx.c
into a dedicated mxl862xx-phylink.c file. This prepares for the SerDes
PCS implementation which adds substantial phylink/PCS code -- keeping
it in a separate file avoids function-position churn in the main
driver file.

No functional change.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/fb9336de94bef47a0834287cbca87954e5e4c795.1781319534.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: mxl862xx: store firmware version for feature gating

Query the firmware version at init (already done in wait_ready),
cache it in priv->fw_version, and provide MXL862XX_FW_VER_MIN()
for version-gated code paths throughout the driver.

MXL862XX_FW_VER() packs major/minor/revision into a u32 with
bitwise shifts so that versions compare with natural ordering,
independent of host endianness.
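
As a sketch of the idea (in Python rather than C), packing the version
with shifts gives integers that order the same way as the version
tuples. The field widths below (8/8/16 bits) and the function names are
assumptions for illustration; the commit message does not give the
layout.

```python
def fw_ver(major: int, minor: int, revision: int) -> int:
    # Assumed layout: 8 bits major, 8 bits minor, 16 bits revision.
    return ((major & 0xFF) << 24) | ((minor & 0xFF) << 16) | (revision & 0xFFFF)


def fw_ver_min(current: int, major: int, minor: int, revision: int) -> bool:
    """True if the cached firmware version is at least major.minor.revision."""
    return current >= fw_ver(major, minor, revision)


cached = fw_ver(1, 0, 84)  # stand-in for the version cached at init
pcs_api_available = fw_ver_min(cached, 1, 0, 84)
```

Because the value is built arithmetically rather than by overlaying a
struct on memory, the comparison gives the same answer on little-endian
and big-endian hosts.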

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/91a26a8ffeaa2ce1729f98347e93e779973976bb.1781319534.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: use int instead of atomic_t for qdma users counter

The QDMA users counter is always accessed with the RTNL lock held, so
we do not require atomic_t for it.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'tcp-rehash-onto-different-local-ecmp-path-on-retransmit-timeout'

Neil Spring says:

====================
tcp: rehash onto different local ECMP path on retransmit timeout

Currently sk_rethink_txhash() re-rolls the socket's txhash on RTO,
PLB, and spurious-retransmission events, but the new hash is not
propagated into the IPv6 ECMP path selection.  The cached
route is reused and fib6_select_path() is never re-invoked, so
the connection uses the same local ECMP decision.

This series adds the two missing pieces:

1. __sk_dst_reset() alongside sk_rethink_txhash() so the cached dst
   is invalidated and the next transmit triggers a fresh route lookup.

2. fl6->mp_hash set from sk_txhash before each route lookup so
   fib6_select_path() picks a path from the (potentially re-rolled) hash.

The override applies only to fib_multipath_hash_policy 0 (the default L3
policy).  Its hash includes the flow label, but that is 0 by default
(np->flow_label is unset; auto_flowlabels computes the on-wire label
later, per packet), so flows to the same peer share one local path.
Keying it on sk_txhash makes that local path per-connection and lets a
rehash re-select it; even when a flow label is present (reflected REPFLOW
or explicitly set) only local path selection changes -- the on-wire flow
label is unaffected.  Policies 1-3 are left unchanged.

Patch 1 is the kernel change; patch 2 adds selftests covering rehash on
SYN, SYN/ACK, midstream RTO, midstream spurious-retransmission, and PLB
events, plus a policy 1 negative test, a flowlabel-leak regression test,
a dst-rebuild consistency test, and a syncookie path-consistency test.
====================

Link: https://patch.msgid.link/20260615042158.1600746-1-ntspring@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: add local ECMP rehash test

Add ecmp_rehash.sh with nine scenarios verifying that TCP rehash
selects a different local ECMP path for IPv6:

  - SYN retransmission (forward path blocked during setup)
  - SYN/ACK retransmission (reverse path blocked during setup)
  - Midstream RTO (forward path blocked on established connection)
  - Midstream ACK rehash (reverse path blocked on established connection)
  - PLB rehash (ECN-driven congestion on established connection)
  - Hash policy 1 negative test (rehash attempted but path unchanged)
  - No flowlabel leak (client mp_hash does not alter on-wire flowlabel)
  - Dst rebuild consistency (dst invalidation does not change path)
  - Syncookie server path consistency (SYN-ACK and post-cookie ACKs
    use the same ECMP path)

The policy 1 test verifies that fib_multipath_hash_policy=1 computes
a deterministic 5-tuple hash, so txhash re-rolls do not change the
ECMP path while TcpTimeoutRehash still increments.

The flowlabel leak test sets auto_flowlabels=0 on the client and
installs tc filters on client egress that drop TCP packets with
nonzero flowlabel, confirming that the client's fl6->mp_hash does
not leak into the on-wire IPv6 flow label.

The PLB test needs DCTCP, a restricted congestion control.  Rather
than relax the host-global tcp_allowed_congestion_control (no
per-netns equivalent), it pins dctcp on the test routes via the
congctl route attribute, confined to the test namespaces.

The dst rebuild test streams data, invalidates the cached dst by
adding and removing a dummy route (bumping the fib6_node sernum),
and verifies that traffic stays on the same path.  The sernum change
causes ip6_dst_check() to fail on the next transmit, triggering a
fresh route lookup via inet6_csk_route_socket().
ECMP_REBUILD_ROUNDS=10 repeats the check to reduce the probability
of a buggy kernel passing by chance with 2-way ECMP.
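
A back-of-the-envelope check of why repeating the round helps, under
the assumption (mine, not stated in the commit) that a buggy kernel
picks one of the paths uniformly and independently on each rebuild:

```python
def false_pass_probability(paths: int, rounds: int) -> float:
    """Chance that random path selection stays on one path every round."""
    return (1.0 / paths) ** rounds


p = false_pass_probability(paths=2, rounds=10)
print(f"{p:.6f}")
```

With 2-way ECMP and 10 rounds this is 1/1024, i.e. roughly one false
pass in a thousand runs under that assumption.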

The syncookie server path consistency test verifies that the
server's SYN-ACK and subsequent ACKs use the same ECMP path.
With syncookies, the request socket is freed after the SYN-ACK,
so cookie_tcp_reqsk_init() must derive the same txhash (from the
cookie) that was used for the SYN-ACK's route lookup.

The syncookie test forces tcp_syncookies=2; it skips when
CONFIG_SYN_COOKIES is not available.  selftests/net/config selects
it (and CONFIG_TCP_CONG_DCTCP for the PLB test).

Signed-off-by: Neil Spring <ntspring@meta.com>
Link: https://patch.msgid.link/20260615042158.1600746-3-ntspring@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: rehash onto different local ECMP path on retransmit timeout

Currently sk_rethink_txhash() re-rolls the socket's txhash on RTO, PLB,
and spurious-retransmission events, but the cached route is reused and
the new hash is not propagated into the ECMP path selection logic.  Two
changes are needed to make rehash select a different local ECMP path:

1. Add __sk_dst_reset() alongside sk_rethink_txhash() in
   tcp_write_timeout(), tcp_rcv_spurious_retrans(), and
   tcp_plb_check_rehash() so the cached dst is invalidated and the
   next transmit triggers a fresh route lookup.

2. Set fl6->mp_hash from sk_txhash (or tcp_rsk(req)->txhash for
   SYN/ACK retransmits and syncookies) in tcp_v6_connect(),
   inet6_sk_rebuild_header(), inet6_csk_route_req(),
   inet6_csk_route_socket(), tcp_v6_send_response(), and
   cookie_v6_check() so fib6_select_path() picks a path based on the
   new hash.

The mp_hash override only applies to fib_multipath_hash_policy 0 (the
default L3 policy).  Its hash includes the flow label, but that is 0 by
default -- np->flow_label is unset, and auto_flowlabels only computes
the on-wire label later, per packet -- so flows to the same peer share
one local path.  Keying the hash on sk_txhash makes the local path
per-connection and lets a rehash re-select it.  Policies 1-3 are left
unchanged.

The mp_hash assignment is factored into a small helper,
ip6_ecmp_set_mp_hash(), shared by inet6_csk_route_req(),
inet6_csk_route_socket(), tcp_v6_connect(), inet6_sk_rebuild_header(),
tcp_v6_send_response(), and cookie_v6_check().  It applies
(txhash >> 1) ?: 1 for policy 0 (the >> 1 keeps mp_hash in the 31-bit
range; ?: 1 keeps it non-zero, since 0 would fall back to
rt6_multipath_hash()).  inet6_csk_route_socket() calls it only for
sk_protocol == IPPROTO_TCP so that non-TCP callers (e.g., L2TP via
inet6_csk_xmit) fall through to rt6_multipath_hash() and retain their
existing flow-key-based ECMP behavior.
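
The arithmetic of (txhash >> 1) ?: 1 can be checked with a small Python
equivalent. This is only a model of the expression quoted above; the
function name is made up, and Python's "or" stands in for the GNU C
"?:" operator.

```python
U32_MASK = 0xFFFFFFFF


def mp_hash_from_txhash(txhash: int) -> int:
    """Model of (txhash >> 1) ?: 1 for a 32-bit txhash."""
    h = (txhash & U32_MASK) >> 1  # result fits in 31 bits
    return h or 1  # never return 0


for sample in (0, 1, 4, 0xFFFFFFFF):
    print(hex(sample), "->", hex(mp_hash_from_txhash(sample)))
```

Both 0 and 1 map to 1, and the largest 32-bit input maps to 0x7fffffff,
so the result always lies in the range 1..0x7fffffff.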

tcp_v6_send_response() also sets mp_hash from the response txhash so
that a control packet (a RST from the full socket, or an ACK from a
time-wait socket) selects the same local ECMP nexthop as the
connection's txhash rather than falling back to the flow hash.  The
time-wait socket's tw_txhash is copied from sk_txhash when the
connection enters TIME_WAIT, so it reflects any rehash that occurred.

Setting mp_hash explicitly is necessary because the default ECMP hash
derives from fl6->flowlabel via np->flow_label, which is not updated
from sk_txhash (REPFLOW is off by default).  ip6_make_flowlabel()
cannot help either, as it runs after the route lookup.

As a consequence, for policy 0 the local ECMP path of an IPv6 TCP
flow follows sk_txhash even when fl6->flowlabel is non-zero, e.g. a
reflected (REPFLOW) or explicitly set (IPV6_FLOWLABEL_MGR) flow
label.  This is intentional: only local path selection changes, so
rehash can recover from a failed path; the on-wire flow label is
unchanged.

sk_set_txhash() is moved before ip6_dst_lookup_flow() in
tcp_v6_connect() so the initial ECMP path is selected by the same
txhash that subsequent route rebuilds will use.  This avoids
unintended path changes when the cached dst is naturally invalidated
(e.g., by PMTU discovery or route changes).

The rehash sites (tcp_write_timeout(), tcp_plb_check_rehash(), and
tcp_rcv_spurious_retrans()) call __sk_rethink_txhash_reset_dst(),
which re-rolls the txhash and, when it changed, drops the cached dst
so the next transmit re-runs route selection.  The dst reset is
guarded by sk->sk_family == AF_INET6 since IPv4 ECMP does not
currently use sk_txhash for path selection.  For IPv4-mapped IPv6
sockets this produces a redundant dst reset on a cold path
(RTO/PLB); the subsequent IPv4 route lookup returns the same result.
The helper is deliberately separate from sk_rethink_txhash() itself:
dst_negative_advice() calls sk_rethink_txhash() before its own dst op,
so resetting the dst inside sk_rethink_txhash() would skip that op
(e.g. rt6_remove_exception_rt()).
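
As a rough illustration of the helper's contract described above (not
the kernel code), the following Python model re-rolls a txhash and drops
the cached dst only when the hash changed and the socket family is
AF_INET6. The Sock class, its fields, and the function signature are
hypothetical stand-ins; the new hash is passed in explicitly so the
behavior is deterministic.

```python
from dataclasses import dataclass
from typing import Optional

AF_INET = 2    # Linux address family values
AF_INET6 = 10


@dataclass
class Sock:
    """Hypothetical stand-in for the few struct sock fields used here."""
    family: int
    txhash: int
    cached_dst: Optional[str] = None


def rethink_txhash_reset_dst(sk: Sock, new_hash: int) -> bool:
    """Model of the helper: re-roll txhash, drop dst if it changed (IPv6 only)."""
    changed = new_hash != sk.txhash
    sk.txhash = new_hash
    if changed and sk.family == AF_INET6:
        sk.cached_dst = None  # next transmit re-runs route selection
    return changed


v6 = Sock(AF_INET6, txhash=0x1111, cached_dst="nexthop-A")
v4 = Sock(AF_INET, txhash=0x1111, cached_dst="nexthop-A")
rethink_txhash_reset_dst(v6, 0x2222)
rethink_txhash_reset_dst(v4, 0x2222)
```

In this model the IPv6 socket loses its cached dst while the IPv4 socket
keeps it, mirroring the sk_family guard described in the text.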

For syncookies, cookie_init_sequence() computes the cookie value
before route_req() and sets txhash so the SYN-ACK selects the same
ECMP path that cookie_v6_check() will use when the full socket is
created.  cookie_tcp_reqsk_init() derives txhash from the cookie so
the full socket's ECMP path matches the SYN-ACK.  Both the SYN-ACK
assignment in tcp_conn_request() and the full-socket assignment in
cookie_tcp_reqsk_init() set txhash from the cookie for IPv4 and IPv6
alike.  On IPv6 this drives ECMP path selection; on IPv4, which does
not use sk_txhash for ECMP, it only affects TX-queue selection.  That
selection scales the hash by its high bits (reciprocal_scale()), which
are uniform in the keyed secure_tcp_syn_cookie() output -- the MSS index
only perturbs the low bits -- so the queue distribution matches
net_tx_rndhash().
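
The high-bits argument can be checked with a small Python model of
reciprocal_scale(), which computes (val * ep_ro) >> 32 on a 32-bit
value. The queue count and the sample hash below are arbitrary values
chosen for illustration, not taken from the patch.

```python
def reciprocal_scale(val: int, ep_ro: int) -> int:
    """Map a 32-bit value into [0, ep_ro) as (val * ep_ro) >> 32."""
    return ((val & 0xFFFFFFFF) * ep_ro) >> 32


NUM_QUEUES = 8             # hypothetical TX queue count
base = 0x9E3779B9          # arbitrary sample hash
perturbed = base ^ 0x3     # flip two low bits, as a small index would

queue_base = reciprocal_scale(base, NUM_QUEUES)
queue_perturbed = reciprocal_scale(perturbed, NUM_QUEUES)
```

Both values land on the same queue here because the result is dominated
by the top bits of the hash. A low-bit change can only move the result
when the value sits right at a bucket boundary.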

cookie_init_sequence() is split from the former version that also
called tcp_synq_overflow() and incremented SYNCOOKIESSENT; those
side effects are now in cookie_record_sent(), called after
route_req() succeeds so they are not bumped when route_req() fails.
cookie_record_sent() is guarded by CONFIG_SYN_COOKIES to
match the guard on tcp_synq_overflow().  route_req() receives 0 as
tw_isn for the syncookie path so that tcp_v6_init_req() still saves
ireq->pktopts for REPFLOW flowlabel reflection and IPv6 cmsg
options.  The ecn_ok clear for syncookies without timestamps stays
after tcp_ecn_create_request() so it takes precedence.

Signed-off-by: Neil Spring <ntspring@meta.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260615042158.1600746-2-ntspring@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Fix MODULE_LICENSE to match SPDX GPL-2.0-only identifier

Both airoha_eth.c and airoha_npu.c declare SPDX-License-Identifier:
GPL-2.0-only but use MODULE_LICENSE("GPL"), which the kernel module
loader interprets as GPL-2.0+ (any GPL version). This mismatch causes
license compliance tools (FOSSology, ScanCode, etc.) to misidentify
the effective license as more permissive than intended.

Replace MODULE_LICENSE("GPL") with MODULE_LICENSE("GPL v2") to
align with the GPL-2.0-only SPDX identifier. Per include/linux/module.h,
"GPL v2" maps to GPL-2.0-only, matching the source files' declared
license.

Signed-off-by: Wayen <win847@gmail.com>
Link: https://patch.msgid.link/6a2ded59.63d39acb.391892.7632@mx.google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sctp: correct CONFIG_SCTP_DBG_OBJCNT macro name in comment

A comment in <net/sctp/sctp.h> incorrectly refers to
CONFIG_SCTP_DBG_OBJCOUNT instead of CONFIG_SCTP_DBG_OBJCNT. Correct it.

Discovered while searching for CONFIG_* symbols referenced in code but
not defined in any Kconfig file.

Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
Link: https://patch.msgid.link/20260613233725.162470-1-enelsonmoore@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: HWS: correct CONFIG_MLX5_HW_STEERING macro name in comment

A comment in
drivers/net/ethernet/mellanox/mlx5/core/steering/hws/fs_hws.h
incorrectly refers to CONFIG_MLX5_HWS_STEERING instead of
CONFIG_MLX5_HW_STEERING. Correct it.

Discovered while searching for CONFIG_* symbols referenced in code but
not defined in any Kconfig file.

Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20260613225904.140791-1-enelsonmoore@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Fix typos in comments and Kconfig

Fix several typos found during code review:
- Kconfig: "Aiorha" -> "Airoha" in NET_AIROHA_FLOW_STATS help text
- Comment: "CMD1" -> "CDM1" (Central DMA, not Command)
- Comments: "GMD1/2/3/4" -> "GDM1/2/3/4" (Gigabit DMA, not GMD)

These are pure comment and documentation fixes with no functional impact.

Signed-off-by: Wayen.Yan <win847@gmail.com>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/6a2ca74a.c5b1db4e.21a698.01e7@mx.google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Fix non-standard return value in airoha_ppe_get_wdma_info()

airoha_ppe_get_wdma_info() returns -1 when the last path in the
forwarding path stack is not of type DEV_PATH_MTK_WDMA. This is not
a standard kernel error code. Replace it with -EINVAL since the
input path type is invalid from the caller's perspective.

Signed-off-by: Wayen.Yan <win847@gmail.com>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/6a2ca3d9.ad59c0a6.147df9.2a62@mx.google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'docs-net-more-adjustments-to-docs'

Jakub Kicinski says:

====================
docs: net: more adjustments to docs

A few small updates to the docs.
This is trying to prepare docs for getting fed directly
into AI reviews.
====================

Link: https://patch.msgid.link/20260613165846.2913092-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

docs: net: fix minor issues with strparser docs

Not sure if anyone would read this doc, but the API has evolved
since it was written. Update to:
- show the int return type for strp_init()
- refer to strp_data_ready(), not the old strp_tcp_data_ready() name
- direct users to strp_msg(skb) for strparser metadata instead of
  treating skb->cb as struct strp_msg directly

Link: https://patch.msgid.link/20260613165846.2913092-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

docs: net: fix minor issues with devlink docs

Update devlink documentation to match current code:

- describe health reporter defaults (it's currently under "callbacks"),
  best-effort auto-dump, and port-scoped reporters
- fix generic parameter names and values
- fix nested devlink setup wording and registration ordering

Link: https://patch.msgid.link/20260613165846.2913092-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

docs: net: tls-offload: document tls_dev_del, tls_dev_resync, and rekey

Fill in some gaps in the TLS offload doc:

- describe the tls_dev_del and tls_dev_resync callbacks
- add a mention of rekeying being out of scope for now

Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20260613165846.2913092-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-atlantic-add-ptp-support-for-aqc113-antigua'

Sukhdeep Singh says:

====================
net: atlantic: add PTP support for AQC113 (Antigua)

This series adds IEEE 1588 PTP support for the AQC113 (Antigua) network
controller. AQC113 is the successor to the existing AQC107 (Atlantic)
chip already supported by the atlantic driver.

AQC113 uses a substantially different hardware architecture for PTP
compared to AQC107:

  - Dual on-chip TSG clocks with direct register access instead of
    PHY-based timestamping via firmware
  - TX timestamps via descriptor writeback instead of firmware mailbox
  - Hardware L3/L4 RX filters for PTP multicast steering with both
    IPv4 and IPv6 support
  - Reference-counted shared filter slots managed through an Action
    Resolver Table (ART), allowing multiple rules to share L3/L4
    hardware filters when their match criteria are identical

The series is structured in four parts:

Patches 1-3 prepare the existing L3/L4 filter path:

  Patch 1 corrects flow_type masking and IPv6 address handling in
  aq_set_data_fl3l4(). Patch 2 moves the active_ipv4/ipv6 bitmap
  updates to after the hardware write succeeds. Patch 3 decouples
  the function from driver-internal structures so it can be called
  directly by the AQC113 PTP filter setup code.

Patches 4-5 add the AQC113 hardware infrastructure:

  Patch 4 adds the low-level register definitions and accessor
  functions. Patch 5 adds filter data structures and firmware
  capability query.

Patches 6-7 implement the AQC113 L2/L3/L4 RX filter management:

  Patch 6 fixes the AQC113 HW init path: ART section selection,
  L2 filter slot assignment, and MAC address programming. Patch 7
  implements the complete L3/L4 RX filter ops including reference-
  counted ART sharing and IPv4/IPv6 steering.

Patches 8-12 add the AQC113 PTP feature:

  Patch 8 reserves the dedicated PTP traffic class buffer and
  configures the TX path. Patch 9 extends the hw_ops interface
  with PTP-specific function pointers and updates AQC107 to the
  new signatures. Patches 10-12 implement the full PTP subsystem:
  Patch 10 adds the hw_atl2 register-level PTP clock ops, Patch 11
  adds TX timestamp polling and PTP TX traffic classification, and
  Patch 12 integrates PTP into aq_ptp and the driver core.

The existing AQC107 PTP implementation is not functionally changed
by this series; AQC113-specific code paths are gated on chip
detection throughout.

Tested on AQC113 at 1G, 2.5G, 5G, and 10G link speeds using
ptp4l/phc2sys with hardware timestamping in both L2 and L4
(IPv4/IPv6) modes.
====================

Link: https://patch.msgid.link/20260610115448.272-1-sukhdeeps@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: atlantic: add AQC113 PTP support in aq_ptp and driver core

aq_ptp.c / aq_ptp.h:
- Add aq_ptp_state enum (AQ_PTP_FIRST_INIT, AQ_PTP_LINK_UP,
  AQ_PTP_NO_LINK) to distinguish first init from link-change events;
  on AQC113 only reset the TSG clock on first init to avoid disrupting
  ongoing synchronization.
- Add aq_ptp_dpath_enable() for comprehensive L3/L4 PTP filter
  setup/teardown, replacing the previous single-filter approach with
  an array of 4 slots for IPv4 and IPv6 PTP multicast addresses
  (224.0.1.129, 224.0.0.107, ff0e::181, ff02::6b).
- Add aq_ptp_parse_rx_filters() to map hwtstamp_rx_filters to L2/L4
  enable flags and call aq_ptp_dpath_enable().
- Re-apply RX filters on link change (hardware state lost after reset).
- Extend PTP ring alloc/init/start/stop to handle AQC113 PTP ring ops.
- Add per-instance PTP offset table for AQC113 with empirically measured
  values at 100M/1G/2.5G/5G/10G link speeds.
- Export aq_ptp_dpath_enable() and updated ring helpers in aq_ptp.h.

aq_hw.h:
- Include hw_atl2/hw_atl2.h for AQC113 PTP type definitions.

aq_nic.c:
- Account for PTP IRQ vector (AQ_HW_PTP_IRQS) in vector count math.
- Call hw_atl2 PTP re-enable hook after hardware reset in
  aq_nic_update_link_status().

aq_pci_func.c:
- Pass PTP IRQ index to aq_ptp_irq_alloc() in probe path.

Signed-off-by: Sukhdeep Singh <sukhdeeps@marvell.com>
Link: https://patch.msgid.link/20260610115448.272-13-sukhdeeps@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: atlantic: add AQC113 TX timestamp polling and PTP TX classification

aq_ring.h / aq_ring.c:
- Add ptp_ts_deadline field to aq_ring_s to track TX timestamp timeout.
- In aq_ring_tx_clean(): when hw_ring_tx_ptp_get_ts() returns 0 (HW has
  not yet written back the timestamp), clear buff->is_mapped and buff->pa
  before breaking to prevent double dma_unmap on retry.  When
  ptp_ts_deadline expires, dequeue and drop the head of skb_ring to keep
  it in lockstep with buff_ring, then clear request_ts and free the skb
  via dev_kfree_skb_any() to unblock the ring.

aq_main.c:
- Add IPv6 PTP packet detection in aq_ndev_start_xmit() using
  ipv6_hdr()->nexthdr for ETH_P_IPV6 frames, steering them through
  aq_ptp_xmit() alongside the existing IPv4 path.
- Use PTP_EV_PORT/PTP_GEN_PORT constants instead of magic numbers 319/320.
- Remove duplicate aq_reapply_rxnfc_all_rules() and
  aq_filters_vlans_update() calls from aq_ndev_open() - now covered by
  aq_nic_start(), which also ensures filters are restored correctly
  after PM resume.
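
The IPv6 detection described in the first item above can be sketched in
Python as follows. This is a simplified model rather than the driver
code: it assumes an untagged Ethernet frame, checks only the fixed IPv6
header's Next Header field (extension headers are not walked), and
matches on the UDP destination port. The function names and the frame
builder are hypothetical.

```python
import struct

ETH_P_IPV6 = 0x86DD
IPPROTO_UDP = 17
PTP_EV_PORT = 319   # PTP event messages
PTP_GEN_PORT = 320  # PTP general messages


def is_ipv6_ptp(frame: bytes) -> bool:
    """Return True for an untagged IPv6/UDP frame sent to a PTP port."""
    if len(frame) < 14 + 40 + 8:
        return False
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETH_P_IPV6:
        return False
    nexthdr = frame[14 + 6]          # IPv6 Next Header field
    if nexthdr != IPPROTO_UDP:       # extension headers are not walked
        return False
    (dport,) = struct.unpack_from("!H", frame, 14 + 40 + 2)
    return dport in (PTP_EV_PORT, PTP_GEN_PORT)


def build_frame(dport: int, nexthdr: int = IPPROTO_UDP) -> bytes:
    """Build a minimal Ethernet + IPv6 + UDP frame for the example."""
    eth = bytes(12) + struct.pack("!H", ETH_P_IPV6)
    ipv6 = struct.pack("!IHBB", 0x60000000, 8, nexthdr, 64) + bytes(32)
    udp = struct.pack("!HHHH", 12345, dport, 8, 0)
    return eth + ipv6 + udp
```

For example, is_ipv6_ptp(build_frame(PTP_EV_PORT)) is true, while a
frame addressed to any other UDP port is left on the normal TX path.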

aq_nic.c:
- Move aq_reapply_rxnfc_all_rules() and aq_filters_vlans_update() into
  aq_nic_start() after hardware init, replacing the duplicate calls that
  were removed from aq_ndev_open().

Signed-off-by: Sukhdeep Singh <sukhdeeps@marvell.com>
Link: https://patch.msgid.link/20260610115448.272-12-sukhdeeps@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: atlantic: add AQC113 PTP hardware ops in hw_atl2

Add the hardware-layer PTP implementation for AQC113 (Antigua):

- hw_atl2.h/hw_atl2_utils.h/hw_atl2_internal.h: add PTP offset
  constants, RX timestamp size (HW_ATL2_RX_TS_SIZE=8), and reduced
  HW_ATL2_RXBUF_MAX=172 (AQC113 on-chip RX packet buffer hardware
  limit for data TCs).

- hw_atl2.c: implement hw_atl2_enable_ptp() to reset and enable TSG
  clocks and set PTP TC scheduling priority after hardware reset.

- hw_atl2.c: implement hw_atl2_adj_sys_clock(), hw_atl2_adj_clock_freq(),
  and aq_get_ptp_ts() for TSG clock read/adjust/increment operations.

- hw_atl2.c: implement hw_atl2_gpio_pulse() for PPS output generation
  via TSG pulse generator.

- hw_atl2.c: implement hw_atl2_hw_tx_ptp_ring_init() and
  hw_atl2_hw_rx_ptp_ring_init() for PTP ring setup.

- hw_atl2.c: implement hw_atl2_hw_ring_tx_ptp_get_ts() to read TX
  timestamp from descriptor writeback, and hw_atl2_hw_rx_extract_ts()
  to extract RX timestamp from the 8-byte packet trailer.

- hw_atl2.c: add hw_atl2_hw_get_clk_sel() helper.

- Wire all new ops into hw_atl2_ops.

Signed-off-by: Sukhdeep Singh <sukhdeeps@marvell.com>
Link: https://patch.msgid.link/20260610115448.272-11-sukhdeeps@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: atlantic: extend hw_ops and TX descriptor for AQC113 PTP

Extend the aq_hw_ops interface with new function pointers required for
PTP support on AQC113:
- enable_ptp: enable/disable PTP counter with clock selection
- hw_ring_tx_ptp_get_ts: read TX timestamp from descriptor writeback
- hw_tx_ptp_ring_init/hw_rx_ptp_ring_init: per-ring PTP initialization
- hw_get_clk_sel: query active TSG clock selection

Update existing hw_ops signatures to support AQC113 dual-clock
architecture:
- hw_gpio_pulse: add clk_sel and hightime parameters
- hw_extts_gpio_enable: add channel parameter

Add PTP-related hardware defines:
- AQ_HW_TXD_CTL_TS_EN/TS_TSG0 for TX descriptor timestamp control
- AQ2_HW_PTP_COUNTER_HZ for AQC113 TSG clock frequency
- AQ_HW_PTP_IRQS for PTP interrupt vector accounting
- PTP enable flags (L2/L4) and TSG clock selection constants

Add request_ts and clk_sel bitfields to aq_ring_buff_s for per-packet
TX timestamp request tracking.

Update hw_atl_b0.c (AQC107) implementations:
- Adapt gpio_pulse and extts_gpio_enable to new signatures
- Add TX descriptor timestamp bits for AQC113 when ANTIGUA chip
  feature is detected

Signed-off-by: Sukhdeep Singh <sukhdeeps@marvell.com>
Link: https://patch.msgid.link/20260610115448.272-10-sukhdeeps@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: atlantic: add AQC113 PTP traffic class and TX path setup

Add PTP traffic class (TC) buffer reservation and TX path
improvements for AQC113:

- Reserve dedicated TX and RX buffer space for PTP TC when PTP is
  enabled, reducing user TC buffers accordingly (TX: 8KB, RX: 16KB).
- Configure PTP TC with no flow control and highest priority
  scheduling to ensure timely PTP packet transmission.

TX path improvements:
- Increase TX data and descriptor read-request limits when firmware
  has already enabled extended PCIe tag mode.

Also simplify RSS queue calculation in hw_atl2_hw_rss_set() by
extracting to a local variable and use unsigned types for loop
variables to match their usage.

Signed-off-by: Sukhdeep Singh <sukhdeeps@marvell.com>
Link: https://patch.msgid.link/20260610115448.272-9-sukhdeeps@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: atlantic: implement AQC113 L2/L3/L4 RX filter ops

Implement complete RX filter management for AQC113 hardware:

- Add tag-based ethertype filter policy (hw_atl2_filter_tag_get/put)
  that allocates and releases ART tags for L2 ethertype filters.

- Add L3/L4 filter sharing via serialized usage counters in
  hw_atl2_l3_filter/hw_atl2_l4_filter, managed through
  hw_atl2_rxf_l3_get/put and hw_atl2_rxf_l4_get/put.

- Implement L3 (IPv4/IPv6 source/destination address and protocol)
  filter find, get (program HW and increment refcount), and put
  (decrement refcount and clear HW when last user releases).

- Implement L4 (TCP/UDP/SCTP source/destination port) filter management
  with the same find/get/put pattern.

- Add combined L3L4 filter configuration (hw_atl2_new_fl3l4_configure)
  that translates legacy aq_rx_filter_l3l4 commands into AQC113 separate
  L3+L4 filter programming with Action Resolver Table (ART) entries.

- Add L2 ethertype filter set/clear (hw_atl2_hw_fl2_set/clear) with
  tag-based ART integration.

- Wire .hw_filter_l2_set, .hw_filter_l2_clear, .hw_filter_l3l4_set
  into hw_atl2_ops.
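
The find/get/put pattern with usage counters can be modeled roughly as
below. This is a minimal Python sketch, not the hw_atl2 code: the
FilterTable and Slot classes, the tuple keys, and the -1 "not found"
convention are assumptions made for illustration, and "programming the
hardware" is reduced to storing the key.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slot:
    key: Optional[tuple] = None   # match criteria held by this slot
    usage: int = 0                # number of rules sharing this slot


class FilterTable:
    def __init__(self, size: int) -> None:
        self.slots: List[Slot] = [Slot() for _ in range(size)]

    def find(self, key: tuple) -> int:
        """Return the index of an in-use slot with identical criteria."""
        for i, s in enumerate(self.slots):
            if s.usage and s.key == key:
                return i
        return -1

    def get(self, key: tuple) -> int:
        """Share a matching slot or claim a free one; -1 if the table is full."""
        idx = self.find(key)
        if idx < 0:
            idx = next((i for i, s in enumerate(self.slots) if not s.usage), -1)
            if idx < 0:
                return -1
            self.slots[idx].key = key      # stands in for the HW write
        self.slots[idx].usage += 1
        return idx

    def put(self, idx: int) -> None:
        """Drop one reference; clear the slot when the last user releases it."""
        s = self.slots[idx]
        if s.usage == 0:
            return
        s.usage -= 1
        if s.usage == 0:
            s.key = None                   # stands in for clearing the HW


l3 = FilterTable(2)
a = l3.get(("ipv4", "224.0.1.129"))
b = l3.get(("ipv4", "224.0.1.129"))   # identical criteria share slot a
c = l3.get(("ipv4", "224.0.0.107"))
d = l3.get(("ipv6", "ff0e::181"))     # table is full
shared_usage = l3.slots[a].usage
l3.put(a)
still_present = l3.find(("ipv4", "224.0.1.129"))
l3.put(b)
after_last_put = l3.find(("ipv4", "224.0.1.129"))
```

The slot stays programmed after the first put() and is only cleared once
the second user releases it.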

Signed-off-by: Sukhdeep Singh <sukhdeeps@marvell.com>
Link: https://patch.msgid.link/20260610115448.272-8-sukhdeeps@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: atlantic: fix AQC113 HW init: ART, L2 filter slot, MAC address

Fix initialization issues in hw_atl2 to correctly support AQC113:

- hw_atl2_hw_reset: replace unconditional priv memset with selective
  field clears so that l3l4_filters[].l3_index and l4_index can be
  initialized to -1 (not allocated) rather than 0; 0 is a valid filter
  index and would incorrectly appear as an occupied slot after a reset.

- hw_atl2_hw_init_new_rx_filters: use firmware-reported ART section
  base and count (clamped to 16) instead of hardcoded 0xFFFF mask;
  enable simultaneous IPv4/IPv6 L3 filter mode (rpf_l3_v6_v4_select);
  tag the UC MAC slot using firmware-supplied l2_filters_base_index
  instead of hardcoded HW_ATL2_MAC_UC.

- hw_atl2_hw_init_rx_path: enable only the firmware-assigned MAC slot
  (priv->l2_filters_base_index) instead of always slot 0.

- Add hw_atl2_hw_mac_addr_set() that programs the MAC address into
  the firmware-assigned L2 filter slot.  Wire into hw_atl2_ops
  replacing the A1 hw_atl_b0_hw_mac_addr_set; call it from hw_init.

- Wire .hw_get_regs into hw_atl2_ops.

Signed-off-by: Sukhdeep Singh <sukhdeeps@marvell.com>
Link: https://patch.msgid.link/20260610115448.272-7-sukhdeeps@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: atlantic: add AQC113 filter data structures, firmware query and register dump

Add filter infrastructure for AQC113 hardware:

- Define L3 (IPv4/IPv6), L4 (TCP/UDP/SCTP), and combined L3L4 filter
  structures with serialized usage counter for filter sharing.
- Define tag policy structure for ethertype filter management.
- Add RPF L3/L4 command bit definitions for filter programming.
- Add filter count constants for L3L4, L3V4, L4, VLAN, and ethertype.
- Extend hw_atl2_priv with filter arrays, base indices, and counts
  discovered from firmware.

Query filter capabilities from firmware shared memory at init time
to discover available L2/L3/L4/VLAN/ethertype filter resources and
ART (Action Resolver Table) configuration.

Add hardware register dump utility for AQC113 debug support.

Signed-off-by: Sukhdeep Singh <sukhdeeps@marvell.com>
Link: https://patch.msgid.link/20260610115448.272-6-sukhdeeps@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: atlantic: add AQC113 hardware register definitions and accessors

Add low-level hardware register definitions and accessor functions
for AQC113 (Antigua) chip features:

- L3/L4 filter command, tag, and address registers for IPv4/IPv6
- Ethertype filter tag registers
- TSG (Time Stamp Generator) clock control, modification, and
  GPIO event generation/input timestamp registers
- TX descriptor timestamp writeback, timestamp enable, and AVB
  enable registers
- TX data/descriptor read request limit registers
- TPB highest priority TC registers
- PCIe extended tag enable register
- RX descriptor timestamp request register
- Action resolver section enable getter
- GPIO special mode and TSG external GPIO TS input select

Signed-off-by: Sukhdeep Singh <sukhdeeps@marvell.com>
Link: https://patch.msgid.link/20260610115448.272-5-sukhdeeps@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: atlantic: decouple aq_set_data_fl3l4() from driver internals

Refactor aq_set_data_fl3l4() to take an ethtool_rx_flow_spec pointer and
an explicit HW register location instead of driver-internal structures
(aq_nic_s, aq_rx_filter). This makes the function reusable for PTP
filter setup which constructs flow specs independently.

Key changes:
- Add aq_is_ipv6_flow_type() helper to derive IPv6 status from the
  flow_type field, replacing the dependency on rx_fltrs->fl3l4.is_ipv6
  shared state.
- Change aq_set_data_fl3l4() signature to accept (fsp, data, location,
  add) and export it via aq_filters.h.
- Update aq_add_del_fl3l4() to compute the HW register location and
  pass it explicitly.

Signed-off-by: Sukhdeep Singh <sukhdeeps@marvell.com>
Link: https://patch.msgid.link/20260610115448.272-4-sukhdeeps@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: atlantic: move active_ipv4/ipv6 bitmap updates after HW write

Move active_ipv4/active_ipv6 bitmap updates from aq_set_data_fl3l4()
into aq_add_del_fl3l4() after the hardware write succeeds. The bitmaps
track which filter slots are actively programmed in hardware and must
only be updated once the HW write is confirmed.

The bitmap updates in aq_nic_reserve_filter() and aq_nic_release_filter()
are intentionally retained: they guard the aq_check_approve_fl3l4()
IPv4/IPv6 mixing validation for callers such as the AQC113 PTP path that
program filters directly via hw_atl2_new_fl3l4_configure() without going
through aq_add_del_fl3l4().

Signed-off-by: Sukhdeep Singh <sukhdeeps@marvell.com>
Link: https://patch.msgid.link/20260610115448.272-3-sukhdeeps@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: atlantic: correct L3L4 filter flow_type masking and IPv6 handling

Correct three issues in aq_set_data_fl3l4() required for the AQC113
PTP filter path introduced later in this series:

1. Mask FLOW_EXT from flow_type before the protocol switch statement.
   Flow types with FLOW_EXT set (e.g. TCP_V4_FLOW | FLOW_EXT) fall
   through to the default case and skip protocol comparison flags.

2. Extend the L3 address comparison check to cover all four IPv6
   words. The original code only checked ip_src[0]/ip_dst[0] and
   required !is_ipv6, so CMP_SRC_ADDR_L3/CMP_DEST_ADDR_L3 were never
   set for IPv6 filters.

3. Use explicit flow type checks for port extraction instead of
   negating IP_USER_FLOW/IPV6_USER_FLOW. The old check did not mask
   FLOW_EXT, so IP_USER_FLOW | FLOW_EXT would incorrectly attempt
   port extraction. Use the actual flow type to pick the correct
   union member directly.
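
The effect of masking FLOW_EXT can be shown with a short Python sketch.
The constants are the ethtool UAPI flow type values; the helper names
are hypothetical and only model the checks described above, not the
driver's actual functions.

```python
FLOW_EXT = 0x80000000

TCP_V4_FLOW = 0x01
UDP_V4_FLOW = 0x02
SCTP_V4_FLOW = 0x03
TCP_V6_FLOW = 0x05
UDP_V6_FLOW = 0x06
SCTP_V6_FLOW = 0x07
IP_USER_FLOW = 0x0D
IPV6_USER_FLOW = 0x0E

L4_FLOWS = {TCP_V4_FLOW, UDP_V4_FLOW, SCTP_V4_FLOW,
            TCP_V6_FLOW, UDP_V6_FLOW, SCTP_V6_FLOW}
V6_FLOWS = {TCP_V6_FLOW, UDP_V6_FLOW, SCTP_V6_FLOW, IPV6_USER_FLOW}


def base_flow_type(flow_type: int) -> int:
    """Strip the FLOW_EXT flag so the value matches the protocol cases."""
    return flow_type & ~FLOW_EXT


def has_l4_ports(flow_type: int) -> bool:
    """Explicit check: only TCP/UDP/SCTP flow specs carry port fields."""
    return base_flow_type(flow_type) in L4_FLOWS


def is_ipv6(flow_type: int) -> bool:
    """Derive the address family from the flow type itself."""
    return base_flow_type(flow_type) in V6_FLOWS
```

With the mask applied, TCP_V4_FLOW | FLOW_EXT is treated like
TCP_V4_FLOW, and IP_USER_FLOW | FLOW_EXT no longer qualifies for port
extraction.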

Signed-off-by: Sukhdeep Singh <sukhdeeps@marvell.com>
Link: https://patch.msgid.link/20260610115448.272-2-sukhdeeps@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-dsa-netc-add-bridge-mode-support'

Wei Fang says:

====================
net: dsa: netc: add bridge mode support

This series adds bridge mode support to the NETC DSA switch driver,
covering both VLAN-aware and VLAN-unaware operation.

The NETC switch manages forwarding through a set of hardware tables
accessed via NTMP: the FDB table (FDBT), VLAN filter table (VFT), egress
treatment table (ETT), and egress count table (ECT). The series extends
the NTMP layer with the operations required for bridging, then builds the
DSA bridge callbacks on top.

Since all switch ports share the VFT, only one VLAN-aware bridge is
supported.

FDB aging is managed in software. A periodic delayed work sweeps the
table using the hardware activity element mechanism, with a default aging
time of 300 seconds matching the IEEE 802.1Q standard. Per-port entries
are also flushed immediately on bridge leave and link-down events.
====================

Link: https://patch.msgid.link/20260611021458.2629145-1-wei.fang@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: netc: implement dynamic FDB entry ageing

The NETC switch does not age out dynamic FDB entries automatically.
Without software management, stale entries persist after topology
changes and cause incorrect forwarding.

Add a delayed work that periodically removes entries that have not been
refreshed within the specified number of cycles. The effective ageing
time is:

ageing_time = fdbt_ageing_delay * 100

Default values are 3s interval and 100 cycles (300s total), matching
the IEEE 802.1Q default ageing time. The work starts when the first
port joins a bridge (tracked via br_cnt) and is cancelled when the
last port leaves. All FDB operations are serialized under fdbt_lock.
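
The arithmetic behind the defaults can be written out as a small Python
sketch. It assumes the cycle count stays fixed at 100 and that a
requested ageing time is converted back into a sweep interval by
integer division; how the driver actually splits the value between
interval and cycles is not stated here, so both helpers are
hypothetical.

```python
AGEING_CYCLES = 100  # inactive cycles before removal (default from the text)


def ageing_time_s(delay_s: int, cycles: int = AGEING_CYCLES) -> int:
    """Effective ageing time in seconds for a given sweep interval."""
    return delay_s * cycles


def delay_ms_for(ageing_time_ms: int, cycles: int = AGEING_CYCLES) -> int:
    """Assumed inverse: sweep interval (ms) for a requested ageing time (ms)."""
    return ageing_time_ms // cycles


default_total = ageing_time_s(3)        # 3 s interval * 100 cycles
default_delay = delay_ms_for(300_000)   # 300 s expressed in milliseconds
```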

Implement .set_ageing_time() to allow the bridge layer to reconfigure
ageing parameters on demand.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260611021458.2629145-10-wei.fang@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: netc: add bridge mode support

Wire up the port_bridge_join, port_bridge_leave and port_vlan_filtering
DSA callbacks to support both VLAN-unaware and VLAN-aware bridge modes.

For VLAN-unaware bridges, each bridge instance is assigned a dedicated
internal PVID via NETC_VLAN_UNAWARE_PVID(bridge.num), counting down
from VID 4095. A VFT entry is created for this PVID with hardware MAC
learning and flood-on-miss forwarding enabled. The CPU port is included
as a VFT member so frames can reach the host. The reserved VID range is
blocked in port_vlan_add to prevent user-space conflicts.

Only one VLAN-aware bridge is supported at a time; this constraint is
enforced in port_bridge_join and port_vlan_filtering. The per-port PVID
is tracked in software and written to the BPDVR register whenever VLAN
filtering is active.

When a port leaves the bridge, its dynamic FDB entries are flushed right
away in port_bridge_leave(), without waiting for the ageing cycle. When
a link down event occurs on a port, netc_mac_link_down() will also clear
the port's dynamic FDB entries via netc_port_remove_dynamic_entries().
Non-bridge ports have no dynamic FDB entries, so this call is always
safe. Additionally, a .port_fast_age() callback is added to flush the
dynamic FDB entries associated with a port.

Host flood rules are removed from the ingress port filter table when a
port joins a bridge to avoid bypassing FDB lookup and MAC learning.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260611021458.2629145-9-wei.fang@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: netc: add VLAN filter table and egress treatment management

Implement the DSA .port_vlan_add and .port_vlan_del operations to enable
VLAN-aware bridge offloading on the NETC switch.

VLAN membership is maintained in the VLAN Filter Table (VFT). Adding the
first port to a VLAN creates a new VFT entry with hardware MAC learning
and flood-on-miss forwarding; subsequent ports update the existing
entry's membership bitmap. Removing the last port deletes the entry.
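
The membership handling above can be modeled with a dictionary of
per-VID port bitmaps. This is a simplified Python sketch under stated
assumptions: the class and method names are hypothetical, hardware
programming is reduced to dictionary updates, and VID 0 is skipped as
this commit message describes for both callbacks.

```python
from typing import Dict


class VlanFilterTable:
    """Software model: VID -> bitmap of member ports."""

    def __init__(self) -> None:
        self.entries: Dict[int, int] = {}

    def port_vlan_add(self, vid: int, port: int) -> None:
        if vid == 0:
            return                      # VID 0 is ignored
        # First port creates the entry, later ports extend its bitmap.
        self.entries[vid] = self.entries.get(vid, 0) | (1 << port)

    def port_vlan_del(self, vid: int, port: int) -> None:
        if vid not in self.entries:
            return
        bitmap = self.entries[vid] & ~(1 << port)
        if bitmap:
            self.entries[vid] = bitmap  # other members remain
        else:
            del self.entries[vid]       # last port removed: delete entry


vft = VlanFilterTable()
vft.port_vlan_add(100, 0)
vft.port_vlan_add(100, 2)
members_after_add = vft.entries[100]
vft.port_vlan_del(100, 0)
members_after_del = vft.entries[100]
vft.port_vlan_del(100, 2)
entry_gone = 100 not in vft.entries
```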

Egress tagging is handled through the Egress Treatment Table (ETT). Each
VLAN is allocated a group of ETT entries, one per available port. Ports
are assigned a sequential ett_offset during initialisation, used to
address each port's entry within the group. Untagged ports configure the
ETT to strip the outer VLAN tag; tagged ports pass frames through
unmodified. Each ETT group is optionally paired with an Egress Counter
Table (ECT) group for per-port frame counting, allocated on a best-effort
basis. When the egress rule of an ETT entry changes, the counter of the
corresponding ECT entry is restarted so that it counts only frames that
match the new egress rule.

A software shadow list serialised by vft_lock tracks active VLAN state
across both port membership and egress tagging. VID 0 is used for single
port mode and is ignored by both callbacks.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260611021458.2629145-8-wei.fang@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: enetc: add helpers to set/clear table bitmap

NTMP index tables require software to allocate and manage entry IDs.
Add two bitmap helper functions to facilitate this management:

ntmp_lookup_free_eid(): finds the first zero bit in the given bitmap,
sets it to mark the entry as in-use, and returns the corresponding entry
ID. Returns NTMP_NULL_ENTRY_ID if no free entry is available.

ntmp_clear_eid_bitmap(): clears the bit associated with the given entry
ID in the bitmap to mark the entry as free. It is a no-op if the entry
ID is NTMP_NULL_ENTRY_ID.

Both functions are exported for use by other modules, such as the NETC
switch driver which needs to manage group index bitmaps for the Egress
Treatment Table (ETT) and Egress Count Table (ECT).
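
The behavior of the two helpers can be modeled in a few lines of Python.
This is a sketch of the described semantics only: the bitmap is a plain
list of booleans, the function names are shortened stand-ins, and the
sentinel value used for NTMP_NULL_ENTRY_ID is an assumption.

```python
from typing import List

NTMP_NULL_ENTRY_ID = 0xFFFFFFFF  # assumed sentinel value


def lookup_free_eid(bitmap: List[bool]) -> int:
    """Claim the first free entry and return its ID, or the null ID."""
    for eid, used in enumerate(bitmap):
        if not used:
            bitmap[eid] = True
            return eid
    return NTMP_NULL_ENTRY_ID


def clear_eid(bitmap: List[bool], eid: int) -> None:
    """Mark an entry as free; the null ID is silently ignored."""
    if eid == NTMP_NULL_ENTRY_ID:
        return
    bitmap[eid] = False


groups = [False, False]
first = lookup_free_eid(groups)
second = lookup_free_eid(groups)
third = lookup_free_eid(groups)      # nothing left
clear_eid(groups, third)             # no-op on the null ID
clear_eid(groups, first)
reused = lookup_free_eid(groups)     # the freed ID is handed out again
```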

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260611021458.2629145-7-wei.fang@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: netc: initialize the group bitmap of ETT and ECT

The Egress Treatment Table (ETT) and Egress Count Table (ECT) are both
index tables whose entry IDs are allocated by software. Every num_ports
entries form a group, where each entry in the group corresponds to one
port. To facilitate group allocation and management, initialize the group
index bitmaps for both tables based on hardware capabilities reported by
ETTCAPR and ECTCAPR registers.

The bitmap size per table is calculated as the total number of hardware
entries divided by the number of available ports, which gives the number
of groups available for software allocation. A set bit in the bitmap
represents a group index that has been allocated.
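
The sizing rule can be illustrated with a short Python sketch. The
group count follows the description above; the mapping from a group
index and a per-port offset to an entry ID assumes that groups are laid
out back to back, which is an inference from "every num_ports entries
form a group" rather than something stated outright. The table size and
port count are made-up numbers.

```python
def num_groups(total_entries: int, num_ports: int) -> int:
    """Number of whole groups that fit in the hardware table."""
    return total_entries // num_ports


def group_entry_id(group: int, num_ports: int, port_offset: int) -> int:
    """Assumed layout: groups are contiguous, one entry per port."""
    return group * num_ports + port_offset


# Hypothetical capability values: 64 table entries, 5 available ports.
groups_available = num_groups(64, 5)
entry_id = group_entry_id(group=2, num_ports=5, port_offset=3)
```

With these numbers, 12 groups are available and the four leftover
entries cannot form a complete group.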

These bitmaps will be used by subsequent patches that add VLAN support.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260611021458.2629145-6-wei.fang@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: enetc: add "Update" operation to the egress count table

The egress count table is a static bounded index table in which
egress-related statistics are maintained. The table is implemented as a
linear array of entries accessed using an index (0, 1, 2, ..., n) that
uniquely identifies an entry within the array. Egress Counter Entry ID
(EC_EID) is used as an index to an entry in this table. The EC_EID is
specified in the egress treatment table.

Egress count table entries are always present and enabled. The table
only supports access via entry ID, which is assigned by software. It
supports the Update, Query, and Query-followed-by-Update operations.
Currently, only the Update operation is implemented.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260611021458.2629145-5-wei.fang@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: enetc: add interfaces to manage egress treatment table

Each entry in the egress treatment table contains the egress packet
processing actions to be applied to a grouping or scope of packets
exiting on a particular egress port of the switch. A scope of packets,
for example, could be the packets exiting a particular VLAN, matching
a particular 802.1Q bridge forwarding entry or belonging to a stream
identified at ingress. The egress treatment table is implemented as a
linear array of entries accessed using an index (0, 1, 2, ..., n) that
uniquely identifies an entry within the array.

The egress treatment table only supports access via entry ID, which is
assigned by software. It supports Add, Update, Delete and Query
operations. Note that the Query operation is not implemented yet.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260611021458.2629145-4-wei.fang@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: enetc: add "Update" and "Delete" operations to VLAN filter table

Add two interfaces to manage entries in the VLAN filter table:

ntmp_vft_update_entry(): Update the configuration element data of the
specified VLAN filter entry based on the given VLAN ID. It uses the
exact key access method to locate the entry.

ntmp_vft_delete_entry(): Delete the VLAN filter entry corresponding to
the specified VLAN ID. It also uses the exact key access method to
identify the target entry.

In addition, introduce struct vft_req_qd to describe the request data
buffer format for Query and Delete actions of the VLAN filter table,
which contains a common request data header and a VLAN access key.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260611021458.2629145-3-wei.fang@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: enetc: add interfaces to manage dynamic FDB entries

Add three interfaces to manage dynamic entries in the FDB table:

ntmp_fdbt_update_activity_element(): Update the activity element of all
dynamic FDB entries. For each entry, if its activity flag is not set,
which means no packet has matched this entry since the last update, the
activity counter is incremented. Otherwise, both the activity flag and
activity counter are reset. The activity counter is used to track how
long an FDB entry has been inactive, which is useful for implementing
an ageing mechanism.

ntmp_fdbt_delete_ageing_entries(): Delete all dynamic FDB entries whose
activity flag is not set and whose activity counter is greater than or
equal to the specified threshold. This is used to remove stale entries
that have been inactive for too long.

ntmp_fdbt_delete_port_dynamic_entries(): Delete all dynamic FDB entries
associated with the specified switch port. This is typically called when
a port goes down or is removed from a bridge.
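The three operations above can be modelled in software as follows. This is
only a sketch of the described semantics: struct fdb_entry, the function
names, and the saturation of the counter at UINT8_MAX are assumptions made
for illustration. The real entries live in hardware and are manipulated
through NTMP commands.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of one dynamic FDB entry. */
struct fdb_entry {
	bool in_use;     /* slot holds a valid dynamic entry */
	bool act_flag;   /* set when a packet matched the entry */
	uint8_t act_cnt; /* update rounds without a match */
	int port;        /* switch port the entry points at */
};

/* Activity update: idle entries count up, active entries are reset. */
static void fdb_update_activity(struct fdb_entry *tbl, size_t n)
{
	assert(tbl || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!tbl[i].in_use)
			continue;
		if (!tbl[i].act_flag) {
			/* Saturating increment is an assumption of this sketch. */
			if (tbl[i].act_cnt < UINT8_MAX)
				tbl[i].act_cnt++;
		} else {
			tbl[i].act_flag = false;
			tbl[i].act_cnt = 0;
		}
	}
}

/* Ageing: drop idle entries whose counter reached the threshold. */
static size_t fdb_delete_ageing(struct fdb_entry *tbl, size_t n,
				uint8_t threshold)
{
	size_t removed = 0;

	assert(tbl || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].in_use && !tbl[i].act_flag &&
		    tbl[i].act_cnt >= threshold) {
			tbl[i].in_use = false;
			removed++;
		}
	}
	return removed;
}

/* Port flush: drop every dynamic entry that points at the given port. */
static size_t fdb_delete_port(struct fdb_entry *tbl, size_t n, int port)
{
	size_t removed = 0;

	assert(tbl || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].in_use && tbl[i].port == port) {
			tbl[i].in_use = false;
			removed++;
		}
	}
	return removed;
}
```

A periodic worker would typically call the activity update once per ageing
interval and then the ageing deletion with a threshold equal to the ageing
time divided by that interval.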

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260611021458.2629145-2-wei.fang@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net/openvswitch: add SET action test

Add test_action_set exercising OVS_ACTION_ATTR_SET with an ipv4 dst
rewrite. The test verifies the SET action in three steps: first
confirm normal forwarding, then apply set(ipv4(dst=10.0.0.99)) to
rewrite the destination to an address nobody owns and verify ping
fails, then restore normal forwarding and verify connectivity
recovers.

Signed-off-by: Minxi Hou <houminxi@gmail.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20260612130503.311240-1-houminxi@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-sfp-extend-smbus-support'

Jonas Jelonek says:

====================
net: sfp: extend SMBus support

Today, the SFP driver only drives I2C adapters that advertise full
I2C_FUNC_I2C, or SMBus-only adapters via single-byte transfers (with
hwmon disabled). Several SoCs ship I2C/SMBus-only controllers that
support more than just byte access -- e.g. word and I2C block -- and
have SFP cages wired to them. Today, those adapters either work
poorly or not at all.

This series teaches the SFP driver to use the larger SMBus access
modes when the adapter advertises them, and along the way starts
honoring i2c_adapter quirks on read/write length so adapters that
cap below the SFP block size are handled correctly. Patch 1 is a
small prep doing only the quirks handling; patch 2 extends the
SMBus path itself.

Capability matrix supported by patch 2:
  - BYTE only:                   single-byte access (unchanged).
  - BYTE + WORD:                 word for >=2-byte chunks, byte tail.
  - I2C_BLOCK present:           block as the universal transport.
  - WORD only (no BYTE/BLOCK):   accepted with WARN_ONCE; works for
                                 even-length transfers, odd-length
                                 transfers will error at xfer time.

Adapters with asymmetric R/W capabilities (e.g. only READ_I2C_BLOCK
without WRITE_I2C_BLOCK) remain functionally correct but use the
worse-supported direction's max for both directions, since
i2c_max_block_size is a single field. No mainline I2C driver was
seen advertising such asymmetry; per-direction sizes can be added
later if needed.
====================

Link: https://patch.msgid.link/20260614133418.2068201-1-jelonek.jonas@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sfp: extend SMBus support

Commit 7662abf4db94 ("net: phy: sfp: Add support for SMBus module access")
added SMBus access for SFP modules, but limited it to single-byte
transfers. As a side effect, hwmon is disabled (16-bit reads cannot be
guaranteed atomic) and a warning is printed.

Many SMBus-only I2C controllers in the wild support more than just
byte access, and SFP cages are often wired to such controllers
rather than to a full-featured I2C controller -- e.g. the SMBus
controllers in the Realtek longan and mango SoCs, which advertise
word access and I2C block reads. Today, they cannot drive an SFP at
all without falling back to the byte-only path.

Extend sfp_smbus_read()/sfp_smbus_write() so that, in addition to
the existing byte access, they also use SMBus word access and SMBus
I2C block access whenever the adapter advertises them. Both
directions are handled in a single read and a single write helper
that pick the largest supported transfer per chunk and fall back as
needed.

I2C-block is preferred unconditionally when available: the protocol
carries any length 1..32, so it can serve every chunk -- including
the 1- and 2-byte tails -- without help from word or byte access.
Note that this requires I2C_FUNC_SMBUS_I2C_BLOCK, which reads a
caller-specified number of bytes. This deviates from the official
SMBus Block Read (length is supplied by the slave) but is widely
supported by Linux I2C controllers/drivers.

Capability matrix this implementation supports:

  - BYTE only:                  works (unchanged behaviour); 1-byte
                                xfers, hwmon disabled.
  - BYTE + WORD:                word for >=2-byte chunks, byte for
                                trailing odd byte.
  - I2C_BLOCK present (with or
    without BYTE/WORD):         block as the universal transport for
                                every chunk.
  - WORD only (no BYTE/BLOCK):  accepted with WARN_ONCE. Even-length
                                transfers work; odd-length transfers
                                (e.g. the 3-byte cotsworks fixup
                                write) hit the BYTE branch which the
                                adapter does not implement, so the
                                xfer returns an error and the
                                operation is aborted. No mainline
                                I2C driver was found to advertise
                                WORD without BYTE; the warning lets
                                us learn about it if it ever shows
                                up.
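The capability matrix above can be read as a per-chunk selection rule. The
sketch below models that rule only; the CAP_* bits, next_xfer_len() and
count_xfers() are hypothetical names and do not correspond to the actual
sfp_smbus_read()/sfp_smbus_write() code, which works on I2C_FUNC_SMBUS_*
flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical capability bits standing in for I2C_FUNC_SMBUS_* flags. */
enum {
	CAP_BYTE  = 1 << 0,
	CAP_WORD  = 1 << 1,
	CAP_BLOCK = 1 << 2,
};

#define BLOCK_MAX 32 /* an SMBus I2C block transfer carries 1..32 bytes */

/*
 * Return the size of the next transfer for `remaining` bytes, or 0 if
 * the adapter has no access mode that fits.
 */
static size_t next_xfer_len(unsigned int caps, size_t remaining,
			    size_t max_block)
{
	assert(max_block >= 1 && max_block <= BLOCK_MAX);

	if (remaining == 0)
		return 0;
	if (caps & CAP_BLOCK)	/* block serves every chunk, tails included */
		return remaining < max_block ? remaining : max_block;
	if ((caps & CAP_WORD) && remaining >= 2)
		return 2;
	if (caps & CAP_BYTE)
		return 1;
	return 0;		/* WORD-only adapter with an odd tail */
}

/* Count transfers needed for `len` bytes, or -1 if it cannot complete. */
static int count_xfers(unsigned int caps, size_t len, size_t max_block)
{
	int n = 0;

	while (len) {
		size_t step = next_xfer_len(caps, len, max_block);

		if (!step)
			return -1;
		len -= step;
		n++;
	}
	return n;
}
```

With this model a 3-byte write on a WORD-only adapter fails after the first
word, which matches the odd-length case called out in the matrix.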

Adapters with asymmetric R/W capabilities (e.g. only READ_I2C_BLOCK
but not WRITE_I2C_BLOCK) remain functionally correct -- the
per-iteration fallback uses the direction-specific bits -- but the
shared i2c_max_block_size is sized by the all-bits-set check, so a
transfer in the better-supported direction is not upgraded. None of
the mainline I2C bus drivers surveyed during review advertise such
asymmetry; promoting i2c_max_block_size to per-direction sizes can
be revisited if needed.

Signed-off-by: Jonas Jelonek <jelonek.jonas@gmail.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260614133418.2068201-3-jelonek.jonas@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sfp: apply I2C adapter quirks to limit block size

The SFP driver assumes all I2C adapters support reading and writing the
pre-defined block size SFP_EEPROM_BLOCK_SIZE of 16 bytes. This constant
was probably chosen based on good guesses and known limitations of a
range of I2C adapters and SFP modules.

However, some I2C adapters support even less and usually declare this
via I2C quirks. Theoretically, such an adapter may provide full
functionality but only support a read and write length of e.g. 8 bytes.
Currently, the SFP driver doesn't account for that.

Add handling for I2C quirks in SFP I2C configuration taking the fields
max_read_len and max_write_len in struct i2c_adapter_quirks into account
to further limit the maximum block size if needed.
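A minimal sketch of the clamping described above, assuming that a quirk
value of 0 means "no limit". struct quirks is a hypothetical stand-in that
carries only the two fields named in this message, and limit_block_size()
is an illustrative name, not the function added by the patch.

```c
#include <assert.h>
#include <stddef.h>

#define SFP_EEPROM_BLOCK_SIZE 16

/* Hypothetical stand-in for the two relevant quirk fields. */
struct quirks {
	unsigned short max_read_len;  /* 0 is treated as "no limit" */
	unsigned short max_write_len; /* 0 is treated as "no limit" */
};

/* Return the block size to use, capped by whichever quirk is smaller. */
static size_t limit_block_size(const struct quirks *q)
{
	size_t size = SFP_EEPROM_BLOCK_SIZE;

	if (!q)			/* adapter declares no quirks */
		return size;
	if (q->max_read_len && q->max_read_len < size)
		size = q->max_read_len;
	if (q->max_write_len && q->max_write_len < size)
		size = q->max_write_len;

	assert(size > 0);
	return size;
}
```

Using one shared size for both directions keeps the sketch simple; an
adapter with an 8-byte read limit and no write limit ends up with 8-byte
blocks in both directions.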

Signed-off-by: Jonas Jelonek <jelonek.jonas@gmail.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260614133418.2068201-2-jelonek.jonas@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'nf-next-26-06-14' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for net-next.
More specifically, this contains conncount rework to address AI related
reports, assorted Netfilter updates and two small incremental updates on
IPVS:

1) Replace old obsolete workqueues (system_wq, system_unbound_wq)
   in IPVS, from Marco Crivellari.

2) Replace WARN_ON{_ONCE} by DEBUG_NET_WARN_ON_ONCE in nf_tables.
   In recent years, reporters have said that the use of WARN_ON{_ONCE}
   in conjunction with panic_on_warn=1 results in DoS. Let's replace
   it by DEBUG_NET_WARN_ON_ONCE so this is only exercised by test
   infrastructure and fuzzers, while also providing context to AI
   agents. From Fernando F. Mancera.

Five patches from Florian Westphal to address AI reports in the conncount
infrastructures:

3) Fix missing rcu read lock section when calling
   __ovs_ct_limit_get_zone_limit().

4) Add a dedicated lock per rbtree; this increases memory
   usage but it should improve scalability.

5) Add a helper function to find the rbtree node, no functional
   changes are intended.

6) Add sequence counter to detect concurrent tree modifications
   and retry lookups.

7) Add locks to GC conncount walk and address other nitpicks.

Then, several assorted updates:

8) Defensive tree-wide addition of NULL checks for ct extensions.

9) Bail out if flowtable bypass cannot be fully set up from the
   flow offload expression, instead of lazy building a likely
   incomplete one.

10) Fix documentation for the new conn_max sysctl toggle in IPVS.

11) Add nf_dev_xmit_recursion*() helpers and use them, to address
    recent AI reports.

* tag 'nf-next-26-06-14' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nf_dup_netdev: add nf_dev_xmit_recursion*() helpers and use them
  ipvs: fix doc syntax for conn_max sysctl
  netfilter: flowtable: bail out if forward path cannot be discovered
  netfilter: conntrack: check NULL when retrieving ct extension
  netfilter: nf_conncount: gc and rcu fixes
  netfilter: nf_conncount: add sequence counter to detect tree modifications
  netfilter: nf_conncount: split count_tree_node rbtree walk into helper
  netfilter: nf_conncount: use per nf_conncount_data spinlocks
  netfilter: nf_conncount: callers must hold rcu read lock
  netfilter: nf_tables: use DEBUG_NET_WARN_ON_ONCE in packet and control paths
  ipvs: Replace use of system_unbound_wq with system_dfl_long_wq
====================

Link: https://patch.msgid.link/20260614114605.474783-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'netdev-expose-page-pool-order-via-netlink'

Dragos Tatulea says:

====================
netdev: expose page pool order via netlink

This small series exposes io_uring's high order page configuration
via the page_pool netlink interface and updates the appropriate
selftest to check this value.
====================

Link: https://patch.msgid.link/20260612211709.1456966-2-dtatulea@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

io_uring/zcrx: selftests: verify rx_buf_len for large chunks

Check the newly added rx_buf_len page_pool field for io_uring
in the existing large-chunks test after the receiver is up.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Link: https://patch.msgid.link/20260612211709.1456966-4-dtatulea@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netdev: expose io_uring rx_page_order order via netlink

This adds observability for the io_uring zcrx rx-buf-len configuration.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/20260612211709.1456966-3-dtatulea@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'selftests-vsock-improve-vng-version-and-quirk-handling'

Bobby Eshleman says:

====================
selftests/vsock: improve vng version and quirk handling

As vng has continued updating, there have been two things in our
selftests that have been affected. One is that newer versions always
emit the vng version warning, and two is that we have a workaround that
is not needed in newer versions.

This series just updates the version handling to allow all newer
versions without warning and version-gates the workaround to only those
versions that don't have the commit that fixed the root cause.

Additionally, we add a function for comparing major.minor versions which
is used in both patches.
====================

Link: https://patch.msgid.link/20260612-vsock-test-update-v1-0-7d7eeed3ac8f@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/vsock: skip vng setsid workaround on >= 1.41

virtme-ng 1.41 ships the upstream fix for the SIGTTOU hang
(https://github.com/arighi/virtme-ng/pull/453), so the setsid wrapper in
vng_dry_run() is no longer needed there. Gate the workaround on the vng
version: setsid is used for vng < 1.41, and vng is invoked directly on
>= 1.41.

Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260612-vsock-test-update-v1-2-7d7eeed3ac8f@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/vsock: accept vng 1.33 or >= 1.36

The current vng version check uses a discrete allowlist of "1.33",
"1.36", and "1.37", which forces a script update on every new release
even though all post-1.36 releases work.

Replace the discrete list with: "1.33", or any version >= 1.36. 1.34
and 1.35 are skipped because they were not tested. Add a version_lt()
helper that compares MAJOR.MINOR numerically, so the check reads as a
straightforward version comparison.
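The selftest helper itself is a shell function; the same comparison logic
can be sketched in C as below. parse_version() and vng_version_supported()
are hypothetical names, and treating an unparsable version as "not lower"
and "not supported" is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Parse "MAJOR.MINOR"; any trailing text (for example ".2") is ignored. */
static bool parse_version(const char *s, int *major, int *minor)
{
	assert(s && major && minor);
	return sscanf(s, "%d.%d", major, minor) == 2;
}

/* True if version a is strictly lower than version b, compared numerically. */
static bool version_lt(const char *a, const char *b)
{
	int amaj, amin, bmaj, bmin;

	if (!parse_version(a, &amaj, &amin) ||
	    !parse_version(b, &bmaj, &bmin))
		return false;	/* assumption: unparsable is never "lower" */
	return amaj < bmaj || (amaj == bmaj && amin < bmin);
}

/* Accept 1.33 or anything >= 1.36; 1.34 and 1.35 were not tested. */
static bool vng_version_supported(const char *v)
{
	int maj, min;

	if (!parse_version(v, &maj, &min))
		return false;
	if (maj == 1 && min == 33)
		return true;
	return !version_lt(v, "1.36");
}
```

Comparing the components as integers matters here: a plain string compare
would order "1.9" after "1.36".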

Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260612-vsock-test-update-v1-1-7d7eeed3ac8f@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: sfp: detect presence via I2C when no MOD_DEF0 GPIO

An SFP cage (compatible "sff,sfp") whose MOD_DEF0 signal is not wired to a
GPIO currently falls back to sff_gpio_get_state(), which unconditionally
reports the module as present. An empty cage therefore fails its probe and
is parked in SFP_MOD_ERROR forever; because SFP_F_PRESENT never deasserts
there is no REMOVE event to recover the state machine, so a module inserted
after boot is never detected, and empty cages spam -EIO at boot.

This affects boards that do not route any cage presence signal to a
software-readable input. On the NicGiga S100-0800S-M (RTL9303, 8x SFP+) the
cage I2C bus is the switch's SMBus master; TX_DISABLE is driven via a
PCA9534 I/O expander, but no MOD_ABS/MOD_DEF0 line reaches a readable GPIO
(the RTL9303 gpio0 lines read stuck-low, the single PCA9534 is fully
consumed by TX_DISABLE, and there is no RTL8231). The Horaco ZX-SW82TS-L2P
(RTL9302D, 2x SFP+) is independently affected in the same way.

For such an SFP cage, derive presence from a throttled single-byte I2C read
of the module EEPROM instead: a successful read asserts SFP_F_PRESENT,
R_PROBE_ABSENT consecutive failures clear it (to ride out a transient error
on a live module). The existing poll then emits SFP_E_INSERT / SFP_E_REMOVE
normally, giving working hot-plug and silencing the boot-time -EIO spam on
empty cages. Presence is re-probed every T_PROBE_PRESENT, so insertion is
detected within that interval and removal within
T_PROBE_PRESENT * R_PROBE_ABSENT.
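The presence logic above amounts to a small debounce state machine. The
sketch below models only that logic: struct presence and presence_update()
are hypothetical names, and the value chosen for R_PROBE_ABSENT is a
placeholder rather than the value used by the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define R_PROBE_ABSENT 3 /* placeholder; not taken from the patch */

struct presence {
	bool present;       /* models SFP_F_PRESENT */
	unsigned int fails; /* consecutive failed probes while present */
};

/*
 * Feed the result of one throttled single-byte EEPROM read.
 * Returns true if the presence state changed, which is where the
 * caller would see an insert or remove event.
 */
static bool presence_update(struct presence *p, bool read_ok)
{
	bool old;

	assert(p);
	old = p->present;

	if (read_ok) {
		p->present = true;
		p->fails = 0;
	} else if (p->present && ++p->fails >= R_PROBE_ABSENT) {
		/* Enough consecutive failures: treat the module as removed. */
		p->present = false;
		p->fails = 0;
	}
	return p->present != old;
}
```

A single successful read resets the failure count, so one transient error
on a live module does not trigger a removal.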

A soldered-down module (compatible "sff,sff") has no presence signal and is
genuinely always present, so it continues to use sff_gpio_get_state(); the
new path is gated on the cage type advertising SFP_F_PRESENT.

Signed-off-by: Greg Patrick <gregspatrick@hotmail.com>
Tested-by: Manuel Stocker <mensi@mensi.ch>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260611175341.2223184-1-gregspatrick@hotmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'devlink-warn-on-resource-id-collision-with-parent_top'

David Yang says:

====================
devlink: Warn on resource ID collision with PARENT_TOP

Filter out the ambiguous case of

enum {
    MY_RESOURCE_ID_A,  /* == DEVLINK_RESOURCE_ID_PARENT_TOP ! */
    MY_RESOURCE_ID_B,
    ...
};

register(..., MY_RESOURCE_ID_A, DEVLINK_RESOURCE_ID_PARENT_TOP, ...);
register(..., MY_RESOURCE_ID_B, MY_RESOURCE_ID_A, ...);
====================

Link: https://patch.msgid.link/20260611070856.889700-1-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: Warn on resource ID collision with PARENT_TOP

ID 0 serves as the sentinel DEVLINK_RESOURCE_ID_PARENT_TOP to mark
top-level resources. While it is technically possible to use 0 as a real
resource ID, a user might be tempted to write:

enum {
    MY_RESOURCE_ID_A,  /* == DEVLINK_RESOURCE_ID_PARENT_TOP ! */
    MY_RESOURCE_ID_B,
    MY_RESOURCE_ID_C,
    MY_RESOURCE_ID_D,
    ...
};

register(..., MY_RESOURCE_ID_C, DEVLINK_RESOURCE_ID_PARENT_TOP, ...);
register(..., MY_RESOURCE_ID_D, MY_RESOURCE_ID_C, ...);
/* D is a child of C */

register(..., MY_RESOURCE_ID_A, DEVLINK_RESOURCE_ID_PARENT_TOP, ...);
register(..., MY_RESOURCE_ID_B, MY_RESOURCE_ID_A, ...);
/* Is B intentionally top-level, or is it actually a child of A? */

Add a WARN_ON() to catch this and prevent confusion.
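As an illustration of the driver-side fix, the sketch below starts the
enum at 1 and checks at compile time that no real ID equals the sentinel.
RESOURCE_ID_PARENT_TOP mirrors DEVLINK_RESOURCE_ID_PARENT_TOP, and
resource_id_is_ambiguous() is a hypothetical helper; the exact condition
used by the new WARN_ON() is not quoted in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RESOURCE_ID_PARENT_TOP 0 /* mirrors DEVLINK_RESOURCE_ID_PARENT_TOP */

/* Real resource IDs start at 1 so that none equals the sentinel. */
enum my_resource_id {
	MY_RESOURCE_ID_A = 1,
	MY_RESOURCE_ID_B,
};

static_assert(MY_RESOURCE_ID_A != RESOURCE_ID_PARENT_TOP,
	      "resource IDs must not collide with PARENT_TOP");

/* Hypothetical check: an ID equal to the sentinel cannot name a parent. */
static bool resource_id_is_ambiguous(uint64_t resource_id)
{
	return resource_id == RESOURCE_ID_PARENT_TOP;
}
```

With the IDs shifted, registering B with parent MY_RESOURCE_ID_A can no
longer be mistaken for a top-level registration.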

Signed-off-by: David Yang <mmyangfl@gmail.com>
Link: https://patch.msgid.link/20260611070856.889700-6-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: mv88e6xxx: Avoid devlink resource IDs collision with PARENT_TOP

The devlink resource ID for ATU collides with the sentinel
DEVLINK_RESOURCE_ID_PARENT_TOP (0). As a result, ATU_bin_* are in fact
registered as top-level siblings, not as children of ATU.

Whether intentional or unintentional, clarify it by keeping the real
resource IDs starting at 1. Unfortunately ATU_bin_* are already
registered at top-level, so keep their parent as PARENT_TOP.

Signed-off-by: David Yang <mmyangfl@gmail.com>
Link: https://patch.msgid.link/20260611070856.889700-5-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: hellcreek: avoid devlink resource IDs collision with PARENT_TOP

This might not cause real problems, but the hellcreek devlink resource
ID collides with the sentinel DEVLINK_RESOURCE_ID_PARENT_TOP (0). Avoid
it by keeping the real resource IDs starting at 1.

Signed-off-by: David Yang <mmyangfl@gmail.com>
Acked-by: Kurt Kanzenbach <kurt@linutronix.de>
Link: https://patch.msgid.link/20260611070856.889700-4-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: b53: avoid devlink resource IDs collision with PARENT_TOP

This might not cause real problems, but the b53 devlink resource ID
collides with the sentinel DEVLINK_RESOURCE_ID_PARENT_TOP (0). Avoid it
by keeping the real resource IDs starting at 1.

Signed-off-by: David Yang <mmyangfl@gmail.com>
Link: https://patch.msgid.link/20260611070856.889700-3-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: dsa_loop: avoid devlink resource IDs collision with PARENT_TOP

This might not cause real problems, but the dsa_loop devlink resource ID
collides with the sentinel DEVLINK_RESOURCE_ID_PARENT_TOP (0). Avoid it
by keeping the real resource IDs starting at 1.

Signed-off-by: David Yang <mmyangfl@gmail.com>
Link: https://patch.msgid.link/20260611070856.889700-2-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'ipv4-fib-remove-rtnl-in-fib_net_exit_batch'

Kuniyuki Iwashima says:

====================
ipv4: fib: Remove RTNL in fib_net_exit_batch().

Currently, we flush all IPv4 routes at ->exit_batch() during
netns dismantle, which requires an extra RTNL.

IPv4 routes are not added from the fast path unlike IPv6, so
we can flush routes before default_device_exit_batch().

However, there is implicit ordering between ip_fib_net_exit()
and default_device_exit_batch().

This series detangles it and moves ip_fib_net_exit() to
->exit_rtnl() to save the RTNL dance.

The same change for IPv6 will need more work.
====================

Link: https://patch.msgid.link/20260612063225.455191-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: fib: Convert fib_net_exit_batch() to ->exit_rtnl().

Currently, IPv4 routes are flushed in ->exit_batch() after
all devices are unregistered.

Unlike IPv6, IPv4 routes are not added from the fast path,
so we can flush routes before default_device_exit_batch().

Let's call ip_fib_net_exit() from ->exit_rtnl() to save
one RTNL locking dance.

ip_fib_net_exit() must use list_del_rcu() for fib_table
for the fast path on dying dev.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260612063225.455191-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: fib: Avoid calling fib_trie_table() in fib_new_table() for dying net.

We will call ip_fib_net_exit() from ->exit_rtnl().

All fib_table will be destroyed before devices are unregistered.

During device unregistration, inetdev_destroy() could call
fib_del_ifaddr(), which calls fib_magic(RTM_DELROUTE).

fib_magic() calls fib_new_table(), but we do not want to create
a new table after ip_fib_net_exit() destroys all tables.

As a prep, let's add check_net() before fib_trie_table() in
fib_new_table().

fib_trie_table() is also called from fib_trie_unmerge(), but
fib_get_table() fails first in fib_unmerge(), so the same
problem does not occur there.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260612063225.455191-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: fib: Free net->ipv4.{fib_table_hash,notifier_ops} without RTNL.

We will call ip_fib_net_exit() from ->exit_rtnl().

However, some paths will still access net->ipv4.fib_table_hash
after ->exit_rtnl().

For example, fib_flush() is called from fib_disable_ip() for
NETDEV_UNREGISTER.

Let's move kfree(net->ipv4.fib_table_hash) and fib4_notifier_exit()
from ip_fib_net_exit() to its caller.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260612063225.455191-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: fib: Call fib_proc_exit() and nl_fib_lookup_exit() at ->pre_exit().

We will call ip_fib_net_exit() from ->exit_rtnl().

Since the exit callbacks are called in the following order,

  1. ->pre_exit()
  ~~~ synchronize_rcu() ~~~
  2. ->exit_rtnl()   : ip_fib_net_exit()
  3. ->exit()        : fib_proc_exit() / nl_fib_lookup_exit()
  4. ->exit_batch()  : fib4_semantics_exit()

the reverse order of fib_net_init() would get messed up.

Let's move fib_proc_exit() and nl_fib_lookup_exit() to ->pre_exit().

This is fine because procfs/netlink access from userspace cannot
occur at this point and synchronize_rcu() is not needed.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260612063225.455191-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: fib: Flush all fib_info in fib_table_flush() during netns dismantle.

Even when fib_table_flush() is called with flush_all true, it does
not flush all fib_info due to this condition:

  !(fi->fib_flags & RTNH_F_DEAD) && !fib_props[fa->fa_type].error

This creates an implicit ordering between default_device_exit_batch()
and fib_net_exit_batch().

fib_table_flush(flush_all=true) must be called after all devices
are NETDEV_UNREGISTERed, which is after nexthop_flush_dev() marks
RTNH_F_DEAD.

This would cause memory leak if the order were reversed.

fib_table_flush() does not skip non-dead error routes when flush_all
is true:

  !flush_all &&
  !(fi->fib_flags & RTNH_F_DEAD) && fib_props[fa->fa_type].error

Let's merge the two conditions not to skip all non-dead fib_info
during netns dismantle.
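The effect of merging can be checked with a small truth-table sketch. It is
built only from the two conditions as quoted in this message; skip_merged()
is one possible reading of the merge and is an assumption, not the code
from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* The two skip conditions as quoted above, applied one after the other. */
static bool skip_old(bool flush_all, bool dead, bool error)
{
	if (!dead && !error)
		return true;
	if (!flush_all && !dead && error)
		return true;
	return false;
}

/* Assumed merged form: non-dead entries are skipped only outside dismantle. */
static bool skip_merged(bool flush_all, bool dead, bool error)
{
	(void)error;
	return !flush_all && !dead;
}

/* Outside dismantle (flush_all false) both forms must agree. */
static void verify_regular_flush_unchanged(void)
{
	for (int dead = 0; dead < 2; dead++)
		for (int error = 0; error < 2; error++)
			assert(skip_old(false, dead, error) ==
			       skip_merged(false, dead, error));
}

/* Count the input combinations where the two forms differ. */
static int count_differences(void)
{
	int n = 0;

	for (int fa = 0; fa < 2; fa++)
		for (int dead = 0; dead < 2; dead++)
			for (int error = 0; error < 2; error++)
				n += skip_old(fa, dead, error) !=
				     skip_merged(fa, dead, error);
	return n;
}
```

Under this reading the only behavioural change is for non-dead, non-error
entries when flush_all is true, which are now flushed instead of skipped.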

Note that we could further apply !flush_all to the basic table
id check and the rtmsg_fib() call in the loop.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260612063225.455191-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: hellcreek: replace kcalloc with struct_size

One fewer allocation for the priv struct.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
Acked-by: Kurt Kanzenbach <kurt@linutronix.de>
Link: https://patch.msgid.link/20260608045640.5172-1-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-mlx5-add-switchdev-mode-support-for-socket-direct-single-netdev-part-2-2'

Tariq Toukan says:

====================
net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 2/2

This is part 2. Find part 1 here:
https://lore.kernel.org/all/20260531113954.395443-1-tariqt@nvidia.com/

This series enables Socket Direct single netdev to operate in switchdev
mode with shared FDB. SD single netdev combines multiple PCI functions
behind a single netdev interface. To support switchdev offloads, these
functions must participate in virtual LAG (shared FDB).

Design

Rather than introducing a separate LAG instance for SD, this series
integrates SD secondary devices into the existing LAG structure
(priv.lag) created at probe time. Each lag_func entry carries a
group_id field that identifies its SD group membership (0 means not
part of any SD group). An xarray mark (XA_MARK_PORT) distinguishes
physical port entries from SD secondaries, enabling a single unified
iterator that filters by group:

  - MLX5_LAG_FILTER_PORTS: iterate port-level entries only (existing
    behavior, used by bonding, FW LAG commands, v2p_map)
  - MLX5_LAG_FILTER_ALL: iterate all devices including SD secondaries
    (used by MPESW shared FDB across all devices)
  - specific group_id: iterate only devices in that SD group (used by
    per-group SD shared FDB operations)

Existing callers use mlx5_ldev_for_each() which maps to
MLX5_LAG_FILTER_PORTS, preserving current behavior for non-SD
configurations.
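The unified iterator can be pictured as one predicate applied to every
lag_func entry. The sketch below is a simplified model: it uses a plain
array instead of an xarray, a bool instead of XA_MARK_PORT, and invented
FILTER_* values, so none of it reflects the actual mlx5 definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Invented encodings for the two special filters; group IDs are > 0. */
#define FILTER_PORTS (-1)
#define FILTER_ALL   (-2)

struct lag_func {
	bool is_port; /* models the XA_MARK_PORT mark */
	int group_id; /* 0 means not part of any SD group */
};

/* Decide whether an entry is visited under the given filter. */
static bool lag_func_matches(const struct lag_func *f, int filter)
{
	assert(f);
	if (filter == FILTER_ALL)
		return true;
	if (filter == FILTER_PORTS)
		return f->is_port;
	return filter > 0 && f->group_id == filter;
}

/* Count the entries an iteration with this filter would visit. */
static size_t lag_count(const struct lag_func *tbl, size_t n, int filter)
{
	size_t hits = 0;

	assert(tbl || n == 0);
	for (size_t i = 0; i < n; i++)
		hits += lag_func_matches(&tbl[i], filter);
	return hits;
}
```

Because the primary's entry also carries the group_id, filtering by group
returns the primary together with its secondaries.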

Lifecycle and ownership

The SD LAG lifecycle is tied to the SD group, not to bonding events:

1. At PCI probe, mlx5_lag_add_mdev() creates the LAG structure
   (priv.lag) for each LAG-capable PF, e.g. SD primary devices.

2. During mlx5_sd_init(), after the SD group is fully formed (primary
   and secondaries paired), sd_lag_init() registers the secondary
   devices into the primary's existing priv.lag by calling
   mlx5_ldev_add_mdev() with the SD group_id. The primary's lag_func
   also gets its group_id set. No separate LAG instance is created.

3. After all the devices in the SD group transition to switchdev,
   mlx5_lag_shared_fdb_create() is invoked with the group_id to create
   a software-only shared FDB scoped to that SD group. This sets
   sd_fdb_active on all lag_func entries in the group. No FW LAG
   commands are issued since SD devices share the same physical port.

4. If MPESW (multi-port eswitch) is enabled on top of SD groups, the
   per-group SD shared FDB is torn down first, then MPESW shared FDB is
   created spanning all devices (ports + SD secondaries) using
   MLX5_LAG_FILTER_ALL. On MPESW disable, per-group SD shared FDB is
   restored.

5. On SD teardown (mlx5_sd_cleanup or device unbind), sd_lag_cleanup()
   removes secondaries from priv.lag and clears the primary's group_id.
   The LAG structure itself is not destroyed.

The sd_fdb_active flag is set on all lag_func entries in a group (not
just the primary), so any device can detect the SD shared FDB state
during lag_disable_change teardown without needing to look up peer
entries.

SD shared FDB is a pure software construct -- unlike regular LAG modes
(ROCE, SRIOV, MPESW), it does not issue FW create_lag/destroy_lag
commands. The software vport LAG for SD is implemented via eswitch
egress ACL bounce rules, managed by the IB layer through
mlx5_eth_lag_init(). The software LAG demux is implemented via
steering rules that use a new destination, VHCA_RX.

Patches

E-Switch preparation (patch 1):
  - Skip uplink IB rep load for SD secondary devices

Devcom support (patches 2-3):
  - Expose locked variant of send_event
  - Add DEVCOM_CANT_FAIL for non-rollback events

SD core hardening (patches 4-6):
  - Make primary/secondary role determination more robust
  - Add L2 table silent mode query support
  - Expand vport metadata for SD secondary devices

SD switchdev transition (patches 7-8):
  - Support switchdev mode transition with shared FDB
  - Notify SD on eswitch disable

LAG integration (patches 9-12):
  - Store demux resources per master lag_func
  - Disable both regular and SD LAG on lag_disable_change
  - Introduce software vport LAG implementation
  - Add MPESW over SD LAG support

Deferred init (patches 13-14):
  - Tie rep load/unload to SD LAG state
  - Defer vport metadata init until SD is ready

Enablement (patch 15):
  - Enable SD over ECPF and allow switchdev transition

v2: https://lore.kernel.org/20260608135547.482825-1-tariqt@nvidia.com
v1: https://lore.kernel.org/20260604114455.434711-1-tariqt@nvidia.com
====================

Link: https://patch.msgid.link/20260612113904.537595-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: SD, enable SD over ECPF and allow switchdev transition

Remove the restriction blocking SD on embedded CPU PFs (ECPF), enabling
SD functionality on BlueField DPUs. Remove the blocker preventing SD
devices from transitioning to switchdev mode.

The infrastructure added in earlier patches properly handles this case.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260612113904.537595-16-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: SD, defer vport metadata init until SD is ready

Allow SD devices to transition to switchdev before the SD group is
fully up. Metadata allocation requires the SD group to be ready, so
defer it from esw_offloads_enable() until SD shared-FDB activation.

Add mlx5_esw_offloads_init_deferred_metadata() which allocates per-vport
metadata and refreshes the ingress ACLs that were previously programmed
with metadata=0. The helper is idempotent and can be called multiple
times.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260612113904.537595-15-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: E-Switch, Tie rep load/unload to SD LAG state

On an SD device, vport representors are not functional until the SD
group is combined and shared FDB is active. Skip the initial load and
the reload paths in that window; reps are loaded as part of the SD LAG
activation flow once it becomes active.

In addition, explicitly unload representors when SD LAG is destroyed.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260612113904.537595-14-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: LAG, add MPESW over SD LAG support

Enable MPESW LAG creation over SD LAG members, forming a composite LAG
hierarchy. This allows bonding multiple SD groups together under a
single MPESW configuration with shared FDB.

When composite MPESW is enabled, the individual SD LAG shared FDB
configurations are temporarily torn down. They are recreated when the
composite LAG is disabled.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260612113904.537595-13-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: LAG, introduce software vport LAG implementation

SD LAG is a virtual LAG without hardware LAG support, so it cannot use
the firmware vport LAG commands. Implement a software-based vport LAG
using egress ACL bounce rules.

Add esw_set_slave_egress_rule() to create an egress ACL rule on the
slave's manager vport that bounces traffic to the master's manager
vport. This achieves the same traffic steering as hardware vport LAG.

Redirect mlx5_cmd_create_vport_lag() and mlx5_cmd_destroy_vport_lag()
to the software implementation when operating in SD LAG mode.
In addition, adjust lag_demux creation to check SD LAG mode as well.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260612113904.537595-12-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: LAG, disable both regular and SD LAG on lag_disable_change

Extend mlx5_lag_disable_change() to properly disable both regular LAG
and SD LAG when requested. Each LAG type uses its own devcom component
for locking.

Use the mlx5_sd_get_devcom() helper to retrieve the SD devcom component,
which is needed for proper locking when disabling SD LAG.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260612113904.537595-11-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: LAG, store demux resources per master lag_func

The lag demux resources (flow table, flow group, and rules xarray)
are stored on the shared ldev. With Socket Direct, multiple SD groups
each create their own demux FT/FG during their master's IB device
initialization. Since they all write to the same ldev fields, the
second group's init overwrites the first group's pointers, leaking
the first group's FT/FG.

During teardown, the cleanup uses the overwritten pointers, destroying
the wrong group's resources and leaving leaked flow tables in the LAG
namespace. These leaked tables can interfere with subsequently created
demux tables.

Move the demux resources from the shared ldev to per-master lag_func
instances. Each master device now owns its own independent demux
state. The rule_add and rule_del helpers look up the appropriate
master's lag_func via the existing filter/group infrastructure.
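
The overwrite can be illustrated with a simplified model. This is a
sketch under assumptions, not the driver layout: all sketch_* types and
fields are hypothetical, and a single pointer stands in for the flow
table, flow group and rules xarray.

```c
#include <stddef.h>

/* Stands in for one group's demux FT/FG/rules. */
struct sketch_demux {
	int ft_id;
};

/* After the change: each master owns its own demux slot. */
struct sketch_lag_func {
	struct sketch_demux *demux;
};

struct sketch_ldev {
	struct sketch_demux *demux;      /* before: one slot shared by all groups */
	struct sketch_lag_func funcs[2]; /* one entry per SD group master */
};

/* Old scheme: the second caller overwrites the first caller's pointer. */
static void sketch_demux_init_shared(struct sketch_ldev *ldev,
				     struct sketch_demux *d)
{
	ldev->demux = d;
}

/* New scheme: state is keyed by the master, so nothing is overwritten. */
static void sketch_demux_init_per_master(struct sketch_ldev *ldev, int master,
					 struct sketch_demux *d)
{
	ldev->funcs[master].demux = d;
}
```

With the shared slot, the first group's pointer is unreachable after
the second init, which is the leak described above; with per-master
slots both pointers remain available for teardown.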

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260612113904.537595-10-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: E-Switch, notify SD on eswitch disable

When eswitch is disabled, notify the SD layer so it can clean up
SD-specific resources such as the TX flow table root configuration
on secondary devices.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260612113904.537595-9-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: SD, support switchdev mode transition with shared FDB

When the eswitch transitions, propagate the change to SD: secondaries
get their TX flow table root reconfigured for the new mode, and when
all group devices move to switchdev, the per-group shared FDB is
activated.

Shared FDB activation is best-effort - failure does not block the
eswitch transition; the next transition retries.
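
The best-effort behaviour can be sketched as follows. This is an
illustration under assumptions rather than the actual flow: the
sketch_* names are hypothetical and the activation_fails field exists
only to simulate a failure.

```c
#include <stdbool.h>

enum sketch_mode { SKETCH_LEGACY, SKETCH_SWITCHDEV };

/* Hypothetical model of one SD group. */
struct sketch_sd_group {
	enum sketch_mode modes[4];
	int ndevs;
	bool shared_fdb_active;
	bool activation_fails; /* knob to simulate an activation error */
};

static bool sketch_all_switchdev(const struct sketch_sd_group *g)
{
	int i;

	for (i = 0; i < g->ndevs; i++)
		if (g->modes[i] != SKETCH_SWITCHDEV)
			return false;
	return true;
}

static int sketch_activate_shared_fdb(struct sketch_sd_group *g)
{
	if (g->activation_fails)
		return -1;
	g->shared_fdb_active = true;
	return 0;
}

/* The mode change itself always succeeds in this sketch. Shared FDB
 * activation is attempted once every device is in switchdev; its result
 * is deliberately ignored, so a later transition can try again.
 */
static int sketch_esw_mode_set(struct sketch_sd_group *g, int dev,
			       enum sketch_mode mode)
{
	g->modes[dev] = mode;

	if (sketch_all_switchdev(g) && !g->shared_fdb_active)
		(void)sketch_activate_shared_fdb(g);

	return 0;
}
```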

Note: the existing mlx5_get_sd() guard that blocks switchdev for SD
devices is intentionally retained. It will be removed once all
supporting patches are in place.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260612113904.537595-8-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: SD, extend vport metadata for SD secondary devices

In Socket Direct configurations the primary and secondary PFs share the
same native_port_num. The eswitch vport metadata encodes pf_num in its
upper bits to distinguish vports across PFs. Without SD-awareness, both
PFs generate identical metadata, causing FDB rules to steer traffic to
the wrong representor.

Add mlx5_sd_pf_num_get() which remaps the pf_num for SD devices.
Use it so each PF in an SD group produces unique vport metadata.
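
The collision and its fix can be sketched as below. The bit width, the
sd_index and ports_total fields and the remap formula are assumptions
chosen for illustration; this text does not specify how
mlx5_sd_pf_num_get() computes its result.

```c
#include <stdint.h>

/* Assumed layout: low 16 bits hold a per-vport id, pf_num sits above. */
#define SKETCH_VPORT_BITS 16u
#define SKETCH_VPORT_MASK ((1u << SKETCH_VPORT_BITS) - 1u)

/* Hypothetical description of one PF in an SD group. */
struct sketch_pf {
	uint32_t native_port_num; /* same on primary and secondary */
	uint32_t sd_index;        /* assumption: 0 = primary, 1.. = secondaries */
	uint32_t ports_total;     /* assumption: number of native ports */
};

/* Without SD-awareness, pf_num is just the native port number. */
static uint32_t sketch_pf_num_plain(const struct sketch_pf *pf)
{
	return pf->native_port_num;
}

/* Hypothetical remap: give every SD member its own pf_num slot. */
static uint32_t sketch_pf_num_sd(const struct sketch_pf *pf)
{
	return pf->sd_index * pf->ports_total + pf->native_port_num;
}

/* Encode pf_num in the upper bits and the vport id in the lower bits. */
static uint32_t sketch_metadata(uint32_t pf_num, uint32_t vport_id)
{
	return (pf_num << SKETCH_VPORT_BITS) | (vport_id & SKETCH_VPORT_MASK);
}
```

With the plain pf_num, a primary and a secondary on the same native
port yield the same metadata for the same vport id; with the remapped
pf_num the two values differ.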

Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260612113904.537595-7-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>