From: Jakub Kicinski
Date: Thu, 4 Jun 2026 00:42:33 +0000 (-0700)
Subject: Merge branch 'net-mlx5-add-switchdev-mode-support-for-socket-direct-single-netdev...
X-Git-Url: http://git.ipfire.org/gitweb/?a=commitdiff_plain;h=c0192c7ec1fc5fa4ec9793bb460204715e2d9cd3;p=thirdparty%2Fkernel%2Flinux.git

Merge branch 'net-mlx5-add-switchdev-mode-support-for-socket-direct-single-netdev-part-1-2'

Tariq Toukan says:

====================
net/mlx5: Add switchdev mode support for Socket Direct single netdev, part 1/2

This series enables Socket Direct single netdev to operate in switchdev
mode with shared FDB.

SD single netdev combines multiple PCI functions behind a single netdev
interface. To support switchdev offloads, these functions must
participate in virtual LAG (shared FDB).

Design

Rather than introducing a separate LAG instance for SD, this series
integrates SD secondary devices into the existing LAG structure
(priv.lag) created at probe time. Each lag_func entry carries a
group_id field that identifies its SD group membership (0 means not
part of any SD group).

An xarray mark (XA_MARK_PORT) distinguishes physical port entries from
SD secondaries, enabling a single unified iterator that takes a filter:

- MLX5_LAG_FILTER_PORTS: iterate port-level entries only (existing
  behavior, used by bonding, FW LAG commands, v2p_map)
- MLX5_LAG_FILTER_ALL: iterate all devices including SD secondaries
  (used by MPESW shared FDB across all devices)
- specific group_id: iterate only devices in that SD group (used by
  per-group SD shared FDB operations)

Existing callers use mlx5_ldev_for_each(), which maps to
MLX5_LAG_FILTER_PORTS, preserving current behavior for non-SD
configurations.

Lifecycle and ownership

The SD LAG lifecycle is tied to the SD group, not to bonding events:

1. At PCI probe, mlx5_lag_add_mdev() creates the LAG structure
   (priv.lag) for each LAG-capable PF (e.g. SD primary devices).

2. During mlx5_sd_init(), after the SD group is fully formed (primary
   and secondaries paired), sd_lag_init() registers the secondary
   devices into the primary's existing priv.lag by calling
   mlx5_ldev_add_mdev() with the SD group_id. The primary's lag_func
   also gets its group_id set. No separate LAG instance is created.

3. After all devices in the SD group transition to switchdev,
   mlx5_lag_shared_fdb_create() is invoked with the group_id to create
   a software-only shared FDB scoped to that SD group. This sets
   sd_fdb_active on all lag_func entries in the group. No FW LAG
   commands are issued since SD devices share the same physical port.

4. If MPESW (multi-port eswitch) is enabled on top of SD groups, the
   per-group SD shared FDB is torn down first, then the MPESW shared
   FDB is created spanning all devices (ports + SD secondaries) using
   MLX5_LAG_FILTER_ALL. On MPESW disable, the per-group SD shared FDB
   is restored.

5. On SD teardown (mlx5_sd_cleanup or device unbind), sd_lag_cleanup()
   removes secondaries from priv.lag and clears the primary's
   group_id. The LAG structure itself is not destroyed.

The sd_fdb_active flag is set on all lag_func entries in a group (not
just the primary), so any device can detect the SD shared FDB state
during lag_disable_change teardown without needing to look up peer
entries.

SD shared FDB is a pure software construct -- unlike regular LAG modes
(ROCE, SRIOV, MPESW), it does not issue FW create_lag/destroy_lag
commands. The software vport LAG for SD is implemented via eswitch
egress ACL bounce rules, managed by the IB layer through
mlx5_eth_lag_init(). The software LAG demux is implemented via
steering rules that use a new destination, VHCA_RX.
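
The filter semantics and the per-group sd_fdb_active handling described
above can be illustrated with a small userspace model. This is only a
sketch under stated assumptions, not the driver code: every sketch_*
name is hypothetical, a plain array stands in for the xarray, the
is_port flag stands in for XA_MARK_PORT, and the numeric values chosen
for the two filter constants are arbitrary (the cover letter names the
filters but does not give their encoding).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed encoding: two reserved values above any real group_id. */
#define SKETCH_FILTER_PORTS (UINT32_MAX - 1)
#define SKETCH_FILTER_ALL   UINT32_MAX
#define SKETCH_MAX_FUNCS    8

/* Hypothetical stand-in for a lag_func entry. */
struct sketch_lag_func {
    bool in_use;
    bool is_port;        /* models XA_MARK_PORT on the entry */
    uint32_t group_id;   /* 0: not part of any SD group */
    bool sd_fdb_active;
};

struct sketch_lag {
    struct sketch_lag_func funcs[SKETCH_MAX_FUNCS];
};

/* Decide whether one entry is visited under the given filter. */
static bool sketch_match(const struct sketch_lag_func *f, uint32_t filter)
{
    if (!f->in_use)
        return false;
    if (filter == SKETCH_FILTER_ALL)
        return true;
    if (filter == SKETCH_FILTER_PORTS)
        return f->is_port;
    /* Otherwise the filter is a specific SD group_id. */
    return f->group_id != 0 && f->group_id == filter;
}

/* Return the first matching index >= i, or SKETCH_MAX_FUNCS if none. */
static size_t sketch_next(const struct sketch_lag *lag, size_t i,
                          uint32_t filter)
{
    while (i < SKETCH_MAX_FUNCS && !sketch_match(&lag->funcs[i], filter))
        i++;
    return i;
}

/* Single iterator; the filter selects ports, all, or one SD group. */
#define sketch_for_each(i, lag, filter)                  \
    for ((i) = sketch_next((lag), 0, (filter));          \
         (i) < SKETCH_MAX_FUNCS;                         \
         (i) = sketch_next((lag), (i) + 1, (filter)))

static size_t sketch_count(const struct sketch_lag *lag, uint32_t filter)
{
    size_t i, n = 0;

    sketch_for_each(i, lag, filter)
        n++;
    return n;
}

/* Set or clear sd_fdb_active on every entry of one SD group. */
static void sketch_sd_fdb_set(struct sketch_lag *lag, uint32_t group_id,
                              bool active)
{
    size_t i;

    assert(group_id != 0 && group_id < SKETCH_FILTER_PORTS);
    sketch_for_each(i, lag, group_id)
        lag->funcs[i].sd_fdb_active = active;
}

/* Two SD groups: entries 0/1 are the primaries (ports), 2/3 secondaries. */
static struct sketch_lag sketch_example(void)
{
    struct sketch_lag lag = { .funcs = {
        { .in_use = true, .is_port = true,  .group_id = 1 },
        { .in_use = true, .is_port = true,  .group_id = 2 },
        { .in_use = true, .is_port = false, .group_id = 1 },
        { .in_use = true, .is_port = false, .group_id = 2 },
    } };

    return lag;
}
```

With the layout built by sketch_example(), the ports filter visits the
two primaries, the all filter visits four entries, and filtering on
group 1 visits the primary and secondary of that group. Setting the
flag on every member of the group, rather than only on the primary,
mirrors the point made above that any device can check the SD shared
FDB state without looking up its peers.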

Patches

Infrastructure (patches 1, 5-6):
- Factor out shared FDB code into a dedicated file
- Extend lag_func with group_id and sd_fdb_active fields; add
  XA_MARK_PORT and unified iterator with group_id filter
- Extend shared FDB API with group_id parameter

E-Switch preparation (patches 2-3):
- Align eswitch disable sequence ordering
- Move devcom init from TC to eswitch layer

SD group management (patches 4, 7-9):
- Replace peer count check with direct peer lookup
- Register SD secondaries in the existing LAG at SD init time
- Block RoCE and VF LAG for SD devices
- Block multipath LAG for SD devices

Switchdev integration (patch 10):
- Keep netdev resources local in switchdev mode

Steering (patches 11-12):
- Track peer flow slots with bitmap for selective peer flow deletion
- Enable TC flow steering for SD LAG

Enablement (patch 13):
- Verify unique vhca_id count for cross-VHCA RQT
====================

Link: https://patch.msgid.link/20260531113954.395443-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski
---

c0192c7ec1fc5fa4ec9793bb460204715e2d9cd3