====================
Intel Wired LAN Updates 2024-04-30 (ixgbe, i40e, ice)
This series includes updates to support Energy-Efficient Ethernet (EEE) on
E610 devices in the ixgbe driver, support for an unmanaged DPLL output on
E830, as well as some other minor cleanups and improvements across ixgbe,
i40e, and ice.
Jedrzej begins with the first six patches preparing the ixgbe driver to
support EEE, adding a EEE capability flag, updating the supported EEE
speeds, updating the ACI command structures with the fields related to
EEE, moving the EEE config validation out for re-use, and finally
implementing the EEE support for E610 hardware.
Aleksandr fixes the ixgbe_update_flash_X550() logic to prevent unaligned
access in ixgbe_host_interface_command(). Note: this has no functional
change on x86, and is being sent through net-next as it is considered a
minor cleanup.
Jacob (hi!) modifies the i40e driver to only timestamp PTP event packets,
instead of timestamping every V2 event frame. This avoids wasting the
limited number of timestamp slots for frames which the PTP protocol does
not care about.
Jacob also extends the devlink flash notification message reporting that
users can activate the new firmware via devlink reload to explicitly
indicate the required "fw_activate" action.
Byungchul Park fixes the ice_lbtest_receive_frames() function to use
netmem_desc instead of the page structure.
Przemyslaw Korba fixes a truncation warning in ice_dpll_init_fwnode_pins()
by increasing the allowed length of the pin_name string on the stack to 16.
Ivan Vecera adds some bounds checking to ice_dpll_rclk_state_on_pin_get/set()
and moves the CGU register macros to be under the header guard ifdef in
ice_dpll.h
====================
Introduced by commit ad1df4f2d591 ("ice: dpll: Support E825-C SyncE and
dynamic pin discovery"):
ice_dpll.c: In function ‘ice_dpll_init’:
ice_dpll.c:3588:59: error: ‘%u’ directive output may be truncated
writing between 1 and 10 bytes into a region of size 4
[-Werror=format-truncation=] snprintf(pin_name, sizeof(pin_name),
"rclk%u", i);
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Przemyslaw Korba <przemyslaw.korba@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-13-6f27ae1cd073@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Byungchul Park [Fri, 1 May 2026 06:37:23 +0000 (23:37 -0700)]
ice: access @pp through netmem_desc instead of page
To eliminate the use of struct page in page pool, the page pool users
should use netmem descriptor and APIs instead.
Make ice driver access @pp through netmem_desc instead of page.
Signed-off-by: Byungchul Park <byungchul@sk.com> Tested-by: Alexander Nowlin <alexander.nowlin@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-12-6f27ae1cd073@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jacob Keller [Fri, 1 May 2026 06:37:22 +0000 (23:37 -0700)]
ice: mention fw_activate action along with devlink reload
The ice driver reports a helpful status message when updating firmware
indicating what action is necessary to enable the new firmware. This is
done because some updates require power cycling or rebooting the machine
but some can be activated via devlink.
The ice driver only supports activating firmware with the specific action
of "fw_activate" a bare "devlink dev reload" will *not* update the
firmware, and will only perform driver reinitialization.
Update the status message to explicitly reflect that the reload must use
the fw_activate action.
I considered modifying the text to spell out the full command, but felt
that was both overkill and something that would belong better as part of
the user space program and not hard coded into the kernel driver output.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-11-6f27ae1cd073@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jacob Keller [Fri, 1 May 2026 06:37:21 +0000 (23:37 -0700)]
i40e: only timestamp PTP event packets
The i40e_ptp_set_timestamp_mode() function is responsible for configuring
hardware timestamping. When programming receive timestamping, the logic
must determine how to configure the PRTTSYN_CTL1 register for receive
timestamping.
The i40e hardware does not support timestamping all frames. Instead,
timestamps are captured into one of the four PRTTSYN_RXTIME registers.
Currently, the driver configures hardware to timestamp all V2 packets on
ports 319 and 320, including all message types. This timestamps
significantly more packets than is actually requested by the
HWTSTAMP_FILTER_PTP_V2_EVENT filter type.
The documentation for HWTSTAMP_FILTER_PTP_V2_EVENT indicates that it should
timestamp PTP v2 messages on any layer, including any kind of event
packets.
Timestamping other packets is acceptable, but not required by the filter.
Doing so wastes valuable slots in the Rx timestamp registers. For most
applications this doesn't cause a problem. However, for extremely high
rates of messages, it becomes possible that one of the critical event
packets is not timestamped.
The PTP protocol only requires timestamps for event messages on port 319,
but hardware is timestamping on both 319 and 320, and timestamping message
types which do not need a timestamp value.
The i40e hardware actually has a more strict filtering option. First, only
timestamp layer 4 messages on port 319 instead of both 319 and 320. Second,
note that hardware has a specific mode to timestamp only event packets
(those with message type < 8).
Update the configuration to use the strict mode that only timestamps event
messages, switching the TSYNTYPE field from 10b to 11b which limits the
timestamping only to eventpackets with a Message Type of < 8. Note that the
X700 series datasheet seems to indicate that the V2MSESTYPE field is no
longer relevant. However, we only tested and validated with leaving the
V2MESSTYPE field set to 0xF for the "wildcard" behavior it documents. This
might not be required but it in that case setting it appears harmless, so
leave it as is.
This avoids wasting the valuable Rx timestamp register slots on non-event
frames, and may reduce faults when operating under high event rates.
ixgbe: fix unaligned u32 access in ixgbe_update_flash_X550()
ixgbe_host_interface_command() treats its buffer as a u32 array. The
local buffer we pass in was a union of byte-sized fields, which gives
it 1-byte alignment on the stack. On strict-align architectures this
can cause unaligned 32-bit accesses.
Add a u32 member to union ixgbe_hic_hdr2 so the object is 4-byte
aligned, and pass the u32 member when calling
ixgbe_host_interface_command().
No functional change on x86; prevents unaligned accesses on
architectures that enforce natural alignment.
Fixes: 49425dfc7451 ("ixgbe: Add support for x550em_a 10G MAC type") Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Fixes: 6a14ee0cfb19 ("ixgbe: Add X550 support function pointers") Tested-by: Rinitha S <sx.rinitha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-9-6f27ae1cd073@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add E610 specific implementation of .get_eee() and .set_eee() ethtool
callbacks.
Introduce ixgbe_setup_eee_e610() which is used to set EEE config
on E610 device via ixgbe_aci_set_phy_cfg() (0x0601 ACI command).
Assign it to dedicated mac operation.
E610 devices support EEE feature specifically for 2.5, 5 and 10G link
speeds. When user try to set EEE for unsupported speeds log it.
Setting timer and setting EEE advertised speeds are not yet supported.
EEE shall be enabled by default for E610 devices.
Add EEE statuis logging during link watchdog run.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-6-6f27ae1cd073@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ixgbe: E610: update ACI command structs with EEE fields
There were recent changes in some of the ACI commands,
which have been extended with EEE related fields.
Set PHY Config, Get PHY Caps and Get Link Info have been
affected.
Align SW structs to the recent FW changes.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-4-6f27ae1cd073@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ixgbe: E610: use new version of 0x601 ACI command buffer
Since FW version 1.40, buffer size of the 0x601 cmd has been increased
by 2B - from 24 to 26B. Buffer has been extended with new field
which can be used to configure EEE entry delay.
Pre-1.40 FW versions still expect 24B buffer and throws error when
receipts 26B buffer. To keep compatibility, check whether EEE
device capability flag is set and basing on it use appropriate
size of the command buffer.
Additionally place Set PHY Config capabilities defines out of
structs definitions.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-3-6f27ae1cd073@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Despite there was no EEE (Energy Efficient Ethernet) feature
support for E610 adapters, eee_speeds_supported variable was
defined and even initialized with some EEE speeds.
As E610 adapter supports EEE only for 10G, 5G and 2.5G speeds,
update hw.phy.eee_speeds_supported. Remove unsupported speeds -
10M, 100M and 1G.
Add also entry for 5G speed in EEE speeds mapping array used
by ethtool callbacks.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-2-6f27ae1cd073@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Recently EEE functionality support has been introduced to E610 FW.
Currently ixgbe driver has no possibility to detect whether NVM
loaded on given adapter supports EEE.
There's dedicated device capability element reflecting FW support
for given EEE link speed.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-1-6f27ae1cd073@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Yang [Thu, 30 Apr 2026 11:45:25 +0000 (19:45 +0800)]
net: dsa: yt921x: Refactor long register helpers
Dealing long registers with u64 is good, until you realize there are
longer 96-bit registers.
Refactor reg64 helpers to use u32 arrays instead of u64 values, in
preparation for 96-bit registers. We do not keep the separate u64
version for reg64 to avoid duplicated wrappers, although it looks better
when dealing with reg64 *only*.
Helpers for reg96 should be added when they are actually used to avoid
function unused warnings.
David Yang [Thu, 30 Apr 2026 11:45:24 +0000 (19:45 +0800)]
net: dsa: pass extack to dsa_switch_ops :: port_policer_add()
Drivers might have error messages to propagate to user space. Propagate
the netlink extack so that they can inform user space in a verbal way of
their limitations.
Make the according transformations to the two users (sja1105 and felix).
net/mlx5: Add vhca_id_type support to IPsec alias creation
When creating an alias FT for MPV IPsec, if alias creation with
sw_vhca_id is supported use it instead of using the hw_vhca_id.
This in turn allows IPsec to work properly after live migration,
in case a VF was live migrated and his hw_vhca_id changed due to
migration which can happen if you migrate to a VF with a different index
than yours, IPsec would fail to start post migration, this patch
resolves the issue by using sw_vhca_id instead which doesn't change post
migration.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260430061958.225245-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Biggers [Wed, 29 Apr 2026 21:08:56 +0000 (21:08 +0000)]
Documentation/tcp_ao: Document the supported MAC algorithms and lengths
Update the TCP-AO documentation to fix some incorrect terminology and
claims regarding the MAC algorithms, and document which MAC algorithms
and lengths the Linux implementation supports.
====================
net/mlx5: enable sub-page allocations for mlx5_frag_buf
This series aims to improve memory utilization for DMA-coherent
fragmented-buffer allocations on systems with large PAGE_SIZE.
Before this change, such allocations were page-granular, as they were
backed by full pages. On large-page systems this caused significant
internal waste for small objects. For example, a single 4K request
consumed an entire 64K page.
The common kernel solution for sub-page coherent DMA allocations is the
DMA pool API. However, those pools do not return pages to the system
until teardown. That behavior is not a good fit for mlx5_frag_buf
allocations, since they back interface resources (WQs and CQs).
Interfaces may be removed dynamically, so their memory footprint should
reflect live usage to avoid situations where large amounts of memory
remain tied up in pools.
This series introduces a lightweight mlx5-local pool implementation for
sub-page coherent DMA allocations, which immediately returns free
backing pages. It wires mlx5_frag_buf allocations to use these internal
pools, while keeping the mechanism reusable for other mlx5-internal
coherent DMA allocation users in follow-up work.
====================
net/mlx5: use internal dma pools for frag buf alloc
Add mlx5_dma_pool alloc/free paths, and wire mlx5_frag_buf allocation
and free paths to use them.
mlx5_frag_buf_alloc_node() now selects an mlx5_dma_pool to allocate
fragments from, instead of directly allocating full coherent pages.
mlx5_frag_buf_free() frees from the respective pool.
mlx5_dma_pool_alloc() keeps allocation fast by maintaining pages with
available indexes at the head of the list, so the common allocation path
can take a free index immediately. New backing pages are allocated only
when no free index is available.
mlx5_dma_pool_free() returns released indexes to the pool and frees a
backing page once all of its indexes become free. This avoids keeping
fully free pages for the lifetime of the pool and reduces coherent DMA
memory footprint.
Introduce mlx5 DMA pool and pool-page data structures, and add the
creation and teardown paths.
Each NUMA node owns a set of mlx5_dma_pool instances, each one with a
different block size. The sizes are defined as all powers of two
starting from MLX5_ADAPTER_PAGE_SHIFT and up to PAGE_SHIFT. Since
mlx5_frag_bufs are used to back objects whose sizes are encoded relative
to MLX5_ADAPTER_PAGE_SHIFT, a smaller block_shift value cannot be used.
Requests larger than PAGE_SIZE continue to be handled as page-sized
fragments, as in the existing frag-buf allocation model.
Wire mlx5_frag_buf pools init/cleanup hooks into
mlx5_mdev_init()/uninit() and the init unwind path.
Keep temporary no-op stubs in alloc.c so lifecycle ordering is in place
before the coherent DMA sub-page allocator implementation is added in
follow-up patches.
Qingfang Deng [Wed, 29 Apr 2026 02:38:46 +0000 (10:38 +0800)]
pppoe: optimize hash with word access
Currently, hash_item() processes the 6-byte Ethernet address and the
2-byte session ID byte-wise to compute a hash.
Optimize this by using 16-bit word operations: XOR three 16-bit words
from the Ethernet address and the 16-bit session ID, then fold the
result. This reduces the total number of loads and XORs. The Ethernet
addresses in a skb and struct pppoe_addr are both 2-byte aligned, so the
u16 pointer cast is safe.
Lorenzo Bianconi [Thu, 30 Apr 2026 08:47:38 +0000 (10:47 +0200)]
net: airoha: configure QoS channel for HW accelerated flowtable traffic
As done for the SW path, configure the QoS channel for HW accelerated
traffic according to the user port index when forwarding to a DSA port,
or rely on the GDM port identifier otherwise. This allows HTB shaping
to be applied to HW accelerated traffic.
selftests: drv-net: Enable ntuple-filters if supported
Certain devices which support ntuple-filters do not enable the feature
by default. The existing tests will skip (if they check for the feature),
or fail if they blindly attempt to install rules. Therefore, attempt to turn
on ntuple-filters if the device supports them.
Eric Dumazet [Thu, 30 Apr 2026 07:40:04 +0000 (07:40 +0000)]
ip6mr: plug drop_reason to ip6mr_cache_report()
- Check mrt->mroute_sk earlier in the function.
- Use sock_queue_rcv_skb_reason() instead of sock_queue_rcv_skb().
- Use sk_skb_reason_drop() instead of kfree_skb().
Note that we return -ENOMEM if sock_queue_rcv_skb_reason() failed,
as the precise error is not really needed for callers.
drivers/net/Space.c is the last remnant of the linux-2.4.x driver model
that required each subsystem and device driver init function to be called
from init/main.c explicitly, before the introduction of initcall levels.
In linux-7.0, this was only used for a handful of ISA network drivers,
with the ne2000 driver being the last one.
Fold the code into ne.c directly, with minimal changes to preserve
the existing command line parsing.
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260429145624.2948432-2-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The cs89x0 driver is really two in one, and they are mutually exclusive:
- the ISA driver was used on 486-era PCs. It likely has no remaining
users, like the other ethernet drivers that got removed in
linux-7.1. The DMA support in here is the last device driver use of
the deprecated isa_bus_to_virt() interface, all other users are either
x86 specific or or got converted to the normal dma-mapping interface.
The driver was maintained by Andrew Morton at the time, based on
the linux-2.2 vendor driver from Cirrus Logic.
- the platform_driver instance was used on some embedded Arm boards
around the same time, such as the EP7211 Development Kit. This
is the same chip, but uses modern devicetree based probing and no DMA.
This was added by Alexander Shiyan.
Remove the ISA driver as a cleanup, including all of the outdated
documentation referring to its configuration.
Jakub Kicinski [Wed, 29 Apr 2026 21:30:01 +0000 (14:30 -0700)]
net: tls: reshuffle the device ops check
We try to validate during registration that the netdev
has ops if it has features. This is currently somewhat sillily
written because we have a dereference before a NULL check
on the ops struct. Straighten this out.
No functional change intended other than saving ourselves
the very theoretical crash with a bad driver.
Note that we check earlier in the function that either ops
or TLS features are set for the device in question.
Jakub Kicinski [Fri, 1 May 2026 01:53:20 +0000 (18:53 -0700)]
Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Tariq Toukan says:
====================
mlx5-next updates 2026-04-29
* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Extend query_esw_functions output for multi-function support
net/mlx5: Remove unused host_sf_enable field
net/mlx5: Add function_id_type for enable/disable_hca cmds
mlx5: Rename the vport number enums for host PF and VF
====================
====================
bridge: Do not suppress ARP probes and DAD NS unconditionally
When using bridge neighbor suppression in EVPN deployments, Duplicate
Address Detection (DAD) is currently broken for both IPv4 (ARP probes)
and IPv6 (DAD Neighbor Solicitations). This prevents proper address
conflict detection across the VXLAN fabric.
The neighbor suppression feature allows the bridge to reply to ARP/NS
messages on behalf of remote hosts when FDB and neighbor entries exist,
suppressing unnecessary flooding over the VXLAN overlay. However, the
current implementation unconditionally suppresses ARP probes and DAD NS,
which breaks DAD.
For DAD to work correctly:
- When the bridge doesn't know the answer:
flood the probe/DAD packet to allow remote VTEPs to respond.
- When the bridge knows the answer:
reply to indicate the address is in use.
This series fixes the issue by adjusting the early suppression checks to
exclude ARP probes and DAD NS from unconditional suppression, allowing
them to reach the normal FDB lookup path. Gratuitous ARP and IPv6
unsolicited-NA messages are still suppressed unconditionally as before.
selftests: net: Add tests for ARP probe and DAD NS handling
Add test cases to verify that ARP probes and DAD Neighbor Solicitations
are handled correctly by the bridge neighbor suppression feature.
When neighbor suppression is enabled on a bridge VXLAN port, the bridge
should reply to ARP/NS messages on behalf of remote hosts when both FDB
and neighbor entries exist, and the answer is known. However, when
either the FDB or the neighbor exists, ARP probes / DAD NS should be
treated like regular ARP requests / NS and flood to VXLAN.
Add two new test functions:
neigh_suppress_arp_probe(): Tests ARP probe handling by triggering
duplicate address detection using arping -D. Verifies that probes are
flooded when the bridge doesn't know the answer, and suppressed when FDB
and neighbor entries exist.
neigh_suppress_dad_ns(): Tests DAD NS handling by constructing DAD NS
packets using mausezahn and verifies correct flooding/suppression
behavior.
Per-port ARP probe suppression
------------------------------
TEST: ARP probe suppression [ OK ]
TEST: "neigh_suppress" is on [ OK ]
TEST: ARP probe suppression [FAIL]
TEST: FDB and neighbor entry installation [ OK ]
TEST: arping [FAIL]
TEST: ARP probe suppression [FAIL]
TEST: neighbor removal [ OK ]
TEST: ARP probe suppression [FAIL]
TEST: "neigh_suppress" is off [ OK ]
TEST: ARP probe suppression [FAIL]
Per-port DAD NS suppression
---------------------------
TEST: DAD NS suppression [ OK ]
TEST: "neigh_suppress" is on [ OK ]
TEST: DAD NS suppression [FAIL]
TEST: FDB and neighbor entry installation [ OK ]
TEST: DAD NS suppression [FAIL]
TEST: neighbor removal [ OK ]
TEST: DAD NS suppression [FAIL]
TEST: DAD NS proxy NA reply [FAIL]
TEST: "neigh_suppress" is off [ OK ]
TEST: DAD NS suppression [FAIL]
Per-port ARP probe suppression
------------------------------
TEST: ARP probe suppression [ OK ]
TEST: "neigh_suppress" is on [ OK ]
TEST: ARP probe suppression [ OK ]
TEST: FDB and neighbor entry installation [ OK ]
TEST: arping [ OK ]
TEST: ARP probe suppression [ OK ]
TEST: neighbor removal [ OK ]
TEST: ARP probe suppression [ OK ]
TEST: "neigh_suppress" is off [ OK ]
TEST: ARP probe suppression [ OK ]
Per-port DAD NS suppression
---------------------------
TEST: DAD NS suppression [ OK ]
TEST: "neigh_suppress" is on [ OK ]
TEST: DAD NS suppression [ OK ]
TEST: FDB and neighbor entry installation [ OK ]
TEST: DAD NS suppression [ OK ]
TEST: neighbor removal [ OK ]
TEST: DAD NS suppression [ OK ]
TEST: DAD NS proxy NA reply [ OK ]
TEST: "neigh_suppress" is off [ OK ]
TEST: DAD NS suppression [ OK ]
bridge: Do not suppress ARP probes and DAD NS unconditionally
When neighbor suppression is enabled on a VXLAN port, the bridge is
expected to reply to ARP/NS messages on behalf of remote hosts when both
FDB and neighbor entries exist. This allows the bridge to suppress
flooding of these messages to the VXLAN overlay.
According to RFC 9161 ("Operational Aspects of Proxy ARP/ND in Ethernet
Virtual Private Networks"):
"A PE SHOULD reply to broadcast/multicast address resolution messages,
i.e., ARP Requests, ARP probes, NS messages, as well as DAD NS messages.
An ARP probe is an ARP Request constructed with an all-zero sender IP
address that may be used by hosts for IPv4 Address Conflict Detection as
specified in [RFC5227]".
However, the current implementation unconditionally suppresses ARP probes
and DAD Neighbor Solicitations, which breaks Duplicate Address Detection
(DAD) over EVPN.
For DAD to work correctly over the VXLAN fabric:
- When the bridge does not know the answer:
flood the probe/DAD packet to allow remote VTEPs to respond.
- When the bridge knows the answer:
reply to indicate the address is in use.
Fix by adjusting the early suppression checks to exclude ARP probes and
DAD NS from unconditional suppression.
When replying to a DAD NS, br_nd_send() is adjusted to set the NA
destination to the all-nodes multicast address (ff02::1) and clear the
Solicited flag, in accordance with RFC 4861 section 7.2.4.
D. Wythe [Wed, 29 Apr 2026 02:16:37 +0000 (10:16 +0800)]
net/smc: cap allocation order for SMC-R physically contiguous buffers
The alloc_pages() cannot satisfy requests exceeding MAX_PAGE_ORDER,
and attempting such allocations will lead to guaranteed failures
and potential kernel warnings.
For SMCR_PHYS_CONT_BUFS, cap the allocation order to MAX_PAGE_ORDER.
This ensures the attempts to allocate the largest possible physically
contiguous chunk succeed, instead of failing with an invalid order.
This also avoids redundant "try-fail-degrade" cycles in
__smc_buf_create().
For SMCR_MIXED_BUFS, no cap is needed: if the order exceeds
MAX_PAGE_ORDER, alloc_pages() will silently fail (__GFP_NOWARN)
and automatically fall back to virtual memory.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com> Link: https://patch.msgid.link/20260429021637.21815-1-alibuda@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 1 May 2026 00:10:20 +0000 (17:10 -0700)]
Merge tag 'wireless-next-2026-04-30' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
Some new content already, notably:
- mac80211: major rework of station bandwidth handling,
fixing issues with lower capability than AP
- general: cleanups for EMLSR spec issues (drafts differed)
- ath9k: GPIO interface improvements
- ath12k: replace dynamic memory allocation in WMI RX path
* tag 'wireless-next-2026-04-30' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (39 commits)
wifi: brcmsmac: phy_lcn: Remove dead code in wlc_lcnphy_radio_2064_channel_tune_4313()
wifi: mac80211: always allow transmitting null-data on TXQs
wifi: mac80211: use kstrtobool_from_user() in debugfs callbacks
wifi: cfg80211: validate cipher suite for NAN Data keys
wifi: nl80211: check link is beaconing for color change
wifi: mac80211: clarify an 802.11 VHT spec reference
wifi: mac80211: fix per-station PHY capability bandwidth
wifi: mac80211: clarify per-STA bandwidth handling
wifi: nl80211: always validate AP operation/PHY regulatory
wifi: cfg80211: provide HT/VHT operation for AP beacon
wifi: nl80211: reject too short HT/VHT/HE/EHT capability/operation
wifi: cfg80211: move AP HT/VHT/... operation to beacon info
wifi: nl80211: reject beacons with bad HE operation
wifi: cfg80211: remove HE/SAE H2E required fields
wifi: mac80211: remove ieee80211_sta_cur_vht_bw()
wifi: mac80211: clean up ieee80211_sta_cap_rx_bw()
wifi: mac80211: clean up initial STA NSS/bandwidth handling
wifi: mac80211: clean up STA NSS handling
wifi: mac80211: simplify ieee80211_sta_rx_bw_to_chan_width()
wifi: nl80211: document channel opmode change channel width
...
====================
- eth:
- stmmac: prevent NULL deref when RX memory exhausted
- airoha: do not read uninitialized fragment address
- rtl8150: fix use-after-free in rtl8150_start_xmit()
Misc:
- add Ido Schimmel as IPv4/IPv6 maintainer
- add David Heidelberg as NFC subsystem maintainer"
* tag 'net-7.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (79 commits)
net/sched: cls_flower: revert unintended changes
sfc: fix error code in efx_devlink_info_running_versions()
net: tls: fix strparser anchor skb leak on offload RX setup failure
ice: add dpll peer notification for paired SMA and U.FL pins
ice: fix missing dpll notifications for SW pins
dpll: export __dpll_pin_change_ntf() for use under dpll_lock
ice: fix SMA and U.FL pin state changes affecting paired pin
ice: fix missing SMA pin initialization in DPLL subsystem
ice: fix infinite recursion in ice_cfg_tx_topo via ice_init_dev_hw
ice: fix NULL pointer dereference in ice_reset_all_vfs()
iavf: add VIRTCHNL_OP_ADD_VLAN to success completion handler
iavf: wait for PF confirmation before removing VLAN filters
iavf: stop removing VLAN filters from PF on interface down
iavf: rename IAVF_VLAN_IS_NEW to IAVF_VLAN_ADDING
page_pool: fix memory-provider leak in page_pool_create_percpu() error path
bonding: 3ad: implement proper RCU rules for port->aggregator
net: airoha: Do not return err in ndo_stop() callback
hv_sock: fix ARM64 support
MAINTAINERS: update the IPv4/IPv6 entry and add Ido Schimmel
selftests: drv-net: clarify linters and frameworks in README
...
Merge tag 'sound-7.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A bunch of small fixes. One minor fix is found in the core side for
data race in PCM OSS layer, while remaining changes are various
device-specific fixes and quirks.
- Core: PCM OSS data race fix
- HD-audio: Fixes for TAS2781, CS35L56, and Realtek/Conexant quirks;
avoidance of a WARN_ON for HDMI channel mapping
- USB-audio: Improvements in UAC3 parsing robustness (leaks, size
checks) and fixes for potential endless loops
- ASoC: Driver-specific fixes for CS35L56, Intel bytcr_wm5102,
Spacemit, AW88395, and others, plus a new quirk for Steam Deck
OLED
- Misc: A UAF fix in aloop driver, division by zero fix in ua101
driver and leak fixes in caiaq driver"
* tag 'sound-7.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (32 commits)
ALSA: hda/tas2781: Fix incorrect bit update for non-book-zero or book 0 pages >1
ALSA: hda: cs35l56: Fix uninitialized value in cs35l56_hda_read_acpi()
ALSA: hda/conexant: Fix missing error check for jack detection
ALSA: hda: Avoid WARN_ON() for HDMI chmap slot checks
ALSA: usb-audio: Fix quirk entry placement for PreSonus AudioBox USB
ASoC: spacemit: adjust FIFO trigger threshold to half FIFO size
ASoC: spacemit: move hw constraints from hw_params to startup
ASoC: codecs: ab8500: Fix casting of private data
ASoC: cs35l56: Fix illegal writes to OTP_MEM registers
ASoC: Intel: bytcr_wm5102: Fix MCLK leak on platform_clock_control error
ALSA: usb-audio: Avoid potential endless loop in convert_chmap_v3()
ALSA: usb-audio: Fix potential leak of pd at parsing UAC3 streams
ALSA: caiaq: Don't abort when no input device is available
ALSA: caiaq: Fix potentially leftover ep1_in_urb at error path
ASoC: aw88395: Fix kernel panic caused by invalid GPIO error pointer
ALSA: caiaq: fix usb_dev refcount leak on probe failure
sound: ua101: fix division by zero at probe
ALSA: usb-audio: apply quirk for Playstation PDP Riffmaster
ALSA: hda: Remove duplicate cmedia entries in codecs Makefile
ALSA: hda/realtek: Add micmute LED quirk for Acer Aspire A315-44P
...
Paolo Abeni [Thu, 30 Apr 2026 14:22:06 +0000 (16:22 +0200)]
Merge branch 'dpll-add-pin-operational-state'
Ivan Vecera says:
====================
dpll: add pin operational state
Add pin operational state (operstate) to the DPLL subsystem to
separate administrative intent from actual hardware status.
Currently pin-state mixes what the user requested (connected,
selectable, disconnected) with what the hardware is actually doing.
This makes it difficult to diagnose situations where a user sets
a pin as selectable or connected but the hardware cannot use it
due to signal issues.
The new operstate attribute is reported inside the pin-parent-device
nest alongside the existing state and is read-only. Defined values:
- active: pin is qualified and actively used by the DPLL
- standby: pin is qualified but not actively used by the DPLL
- no-signal: pin does not have a valid signal
- qual-failed: pin signal failed qualification checks
Patch 1 adds the operstate enum, netlink attribute and the
operstate_on_dpll_get callback to the DPLL subsystem. It also
updates Documentation/driver-api/dpll.rst to describe the
separation between admin state and operational state.
Patch 2 implements the callback for ZL3073x input pins using the
reference monitor status register. It also refactors the existing
state_on_dpll_get to return purely administrative state and switches
periodic monitoring to track operstate changes.
====================
Ivan Vecera [Tue, 28 Apr 2026 15:49:07 +0000 (17:49 +0200)]
dpll: zl3073x: implement pin operational state reporting
Implement operstate_on_dpll_get callback for input pins to report
the actual hardware status:
- active: pin is the currently locked reference
- standby: signal is valid but pin is not actively used
- no-signal: reference monitor reports Loss of Signal (LOS)
- qual-failed: reference monitor reports a qualification failure
(SCM, CFM, GST, PFM, eSync or Split-XO)
Separate administrative state (state_on_dpll_get) from operational
state: admin state now reports purely the user-requested intent
(connected in reflock mode, selectable in auto mode).
Switch periodic monitoring to track operstate changes instead of
the mixed admin/oper state that was previously reported.
Ivan Vecera [Tue, 28 Apr 2026 15:49:06 +0000 (17:49 +0200)]
dpll: add pin operational state
Add pin-operstate enum and operstate_on_dpll_get callback to report
the actual hardware status of a pin with respect to its parent DPLL
device. Unlike pin-state (which reflects administrative intent set
by the user), operstate reflects what the hardware is actually doing.
Defined operational states:
- active: pin is qualified and actively used by the DPLL
- standby: pin is qualified but not actively used by the DPLL
- no-signal: pin does not have a valid signal
- qual-failed: pin signal failed qualification
The operstate is reported inside the pin-parent-device nested
attribute alongside the existing state and phase-offset attributes.
Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Petr Oros <poros@redhat.com> Link: https://patch.msgid.link/20260428154907.2820654-2-ivecera@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Wed, 29 Apr 2026 07:39:11 +0000 (09:39 +0200)]
net/sched: cls_flower: revert unintended changes
While applying the blamed commit 4ca07b9239bd ("net: mctp i2c: check
length before marking flow active"), I unintentionally included
unrelated and unacceptable changes.
Revert them.
Fixes: 4ca07b9239bd ("net: mctp i2c: check length before marking flow active") Reported-by: Jeremy Kerr <jk@codeconstruct.com.au> Closes: https://lore.kernel.org/netdev/bd8704fe0bd53e278add5cde4873256656623e2e.camel@codeconstruct.com.au/ Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/043026a53ff84da88b17648c4b0d17f0331749cb.1777447863.git.pabeni@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Dan Carpenter [Wed, 29 Apr 2026 06:48:17 +0000 (09:48 +0300)]
sfc: fix error code in efx_devlink_info_running_versions()
Return -EIO if efx_mcdi_rpc() doesn't return enough space.
Fixes: 14743ddd2495 ("sfc: add devlink info support for ef100") Signed-off-by: Dan Carpenter <error27@gmail.com> Reviewed-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://patch.msgid.link/afGpsbLRHL4_H0KS@stanley.mountain Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When tls_set_device_offload_rx() fails at tls_dev_add(), the error path
calls tls_sw_free_resources_rx() to clean up the SW context that was
initialized by tls_set_sw_offload(). This function calls
tls_sw_release_resources_rx() (which stops the strparser via
tls_strp_stop()) and tls_sw_free_ctx_rx() (which kfrees the context),
but never frees the anchor skb that was allocated by alloc_skb(0) in
tls_strp_init().
Note that tls_sw_free_resources_rx() is exclusively used for this
"failed to start offload" code path, there's no other caller.
The leak did not exist before commit 84c61fe1a75b ("tls: rx: do not use
the standard strparser"), because the standard strparser doesn't try
to pre-allocate an skb.
The normal close path in tls_sk_proto_close() handles cleanup by calling
tls_sw_strparser_done() (which calls tls_strp_done()) after dropping
the socket lock, because tls_strp_done() does cancel_work_sync() and
the strparser work handler takes the socket lock.
Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://patch.msgid.link/20260428231559.1358502-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
Intel Wired LAN Update 2026-04-27 (ice, iavf)
Petr Oros from RedHat has accumulated a number of fixes for the Intel ice
and iavf drivers, bundled together in this series.
First, a series of 4 fixes to resolve issues with the iavf driver logic for
handling VLAN filters. This includes keeping VLAN filters while the
interface is brought down, waiting for confirmation on filter deletion
before deleting filters from the driver tracking structures, and handling
the VIRTCHNL_OP_ADD_VLAN for the old v1 VLAN_ADD command.
A fix for a crash in ice_reset_all_vfs(), properly checking for errors when
ice_vf_rebuild_vsi() fails.
A fix for a possible infinite recursion in ice_cfg_tx_topo() that occurs
when trying to apply invalid Tx topology configuration.
A fix to initialize the SMA pins in the DPLL subsystem properly.
A fix to change the SMA and U.FL pin state for paired pins, ensuring that
all flows changing one pin will also update its shared pin appropriately.
A preparatory patch to export __dpll_pin_change_ntf() so that drivers can
notify pin changes while already holding the dpll_lock.
A fix to ensure DPLL notifications are sent for the software-controlled
pins which wrap the physical CGU input/output pins.
A fix to add DPLL notifications for peer pins when changing the SMA or U.FL
pins, ensuring DPLL subsystem is notified about the paired connected pins.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
====================
Petr Oros [Tue, 28 Apr 2026 05:22:23 +0000 (22:22 -0700)]
ice: add dpll peer notification for paired SMA and U.FL pins
SMA and U.FL pins share physical signal paths in pairs (SMA1/U.FL1 and
SMA2/U.FL2). When one pin's state changes via a PCA9575 GPIO write,
the paired pin's state also changes, but no notification is sent for
the peer pin. Userspace consumers monitoring the peer via dpll netlink
subscribe never learn about the update.
Add ice_dpll_sw_pin_notify_peer() which sends a change notification for
the paired SW pin. Call it from ice_dpll_pin_sma_direction_set(),
ice_dpll_sma_pin_state_set(), and ice_dpll_ufl_pin_state_set() after
pf->dplls.lock is released. Use __dpll_pin_change_ntf() because
dpll_lock is still held by the dpll netlink layer (dpll_pin_pre_doit).
Fixes: 2dd5d03c77e2 ("ice: redesign dpll sma/u.fl pins control") Signed-off-by: Petr Oros <poros@redhat.com> Tested-by: Alexander Nowlin <alexander.nowlin@intel.com> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-11-cdcb48303fd8@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Petr Oros [Tue, 28 Apr 2026 05:22:22 +0000 (22:22 -0700)]
ice: fix missing dpll notifications for SW pins
The SMA/U.FL pin redesign (commit 2dd5d03c77e2 ("ice: redesign dpll
sma/u.fl pins control")) introduced software-controlled pins that wrap
backing CGU input/output pins, but never updated the notification and
data paths to propagate pin events to these SW wrappers.
The periodic work sends dpll_pin_change_ntf() only for direct CGU input
pins. SW pins that wrap these inputs never receive change or phase
offset notifications, so userspace consumers such as synce4l monitoring
SMA pins via dpll netlink never learn about state transitions or phase
offset updates. Similarly, ice_dpll_phase_offset_get() reads the SW
pin's own phase_offset field which is never updated; the PPS monitor
writes to the backing CGU input's field instead.
Fix by introducing ice_dpll_pin_ntf(), a wrapper around
dpll_pin_change_ntf() that also notifies any registered SMA/U.FL pin
whose backing CGU input matches. Replace all direct
dpll_pin_change_ntf() calls in the periodic notification paths with
this wrapper. Fix ice_dpll_phase_offset_get() to return the backing
CGU input's phase_offset for input-direction SW pins.
Fixes: 2dd5d03c77e2 ("ice: redesign dpll sma/u.fl pins control") Signed-off-by: Petr Oros <poros@redhat.com> Tested-by: Alexander Nowlin <alexander.nowlin@intel.com> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-10-cdcb48303fd8@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Ivan Vecera [Tue, 28 Apr 2026 05:22:21 +0000 (22:22 -0700)]
dpll: export __dpll_pin_change_ntf() for use under dpll_lock
Export __dpll_pin_change_ntf() so that drivers can send pin change
notifications from within pin callbacks, which are already called
under dpll_lock. Using dpll_pin_change_ntf() in that context would
deadlock.
Add lockdep_assert_held() to catch misuse without the lock held.
Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: Petr Oros <poros@redhat.com> Tested-by: Alexander Nowlin <alexander.nowlin@intel.com> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-9-cdcb48303fd8@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Petr Oros [Tue, 28 Apr 2026 05:22:20 +0000 (22:22 -0700)]
ice: fix SMA and U.FL pin state changes affecting paired pin
SMA and U.FL pins share physical signal paths in pairs (SMA1/U.FL1 and
SMA2/U.FL2) controlled by the PCA9575 GPIO expander. Each pair can
only have one active pin at a time: SMA1 output and U.FL1 output share
the same CGU output, SMA2 input and U.FL2 input share the same CGU
input. The PCA9575 register bits determine which connector in each
pair owns the signal path.
The driver does not account for this pairing in two places:
ice_dpll_ufl_pin_state_set() modifies PCA9575 bits and disables the
backing CGU pin without checking whether the U.FL pin is currently
active. Disconnecting an already inactive U.FL pin flips bits that
the paired SMA pin relies on, breaking its connection.
ice_dpll_sma_direction_set() does not propagate direction changes to
the paired U.FL pin. For SMA2/U.FL2 the ICE_SMA2_UFL2_RX_DIS bit is
never managed, so U.FL2 stays disconnected after SMA2 switches to
output. For both pairs the backing CGU pin of the U.FL side is never
enabled when a direction change activates it, so userspace sees the
pin as disconnected even though the routing is correct.
Fix by guarding the U.FL disconnect path against inactive pins and by
updating the paired U.FL pin fully on SMA direction changes: manage
ICE_SMA2_UFL2_RX_DIS for the SMA2/U.FL2 pair and enable the backing
CGU pin whenever the peer becomes active.
Fixes: 2dd5d03c77e2 ("ice: redesign dpll sma/u.fl pins control") Signed-off-by: Petr Oros <poros@redhat.com> Tested-by: Alexander Nowlin <alexander.nowlin@intel.com> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-8-cdcb48303fd8@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Petr Oros [Tue, 28 Apr 2026 05:22:19 +0000 (22:22 -0700)]
ice: fix missing SMA pin initialization in DPLL subsystem
The DPLL SMA/U.FL pin redesign introduced ice_dpll_sw_pin_frequency_get()
which gates frequency reporting on the pin's active flag. This flag is
determined by ice_dpll_sw_pins_update() from the PCA9575 GPIO expander
state. Before the redesign, SMA pins were exposed as direct HW
input/output pins and ice_dpll_frequency_get() returned the CGU
frequency unconditionally — the PCA9575 state was never consulted.
The PCA9575 powers on with all outputs high, setting ICE_SMA1_DIR_EN,
ICE_SMA1_TX_EN, ICE_SMA2_DIR_EN and ICE_SMA2_TX_EN. Nothing in the
driver writes the register during initialization, so
ice_dpll_sw_pins_update() sees all pins as inactive and
ice_dpll_sw_pin_frequency_get() permanently returns 0 Hz for every
SW pin.
Fix this by writing a default SMA configuration in
ice_dpll_init_info_sw_pins(): clear all SMA bits, then set SMA1 and
SMA2 as active inputs (DIR_EN=0) with U.FL1 output and U.FL2 input
disabled. Each SMA/U.FL pair shares a physical signal path so only
one pin per pair can be active at a time. U.FL pins still report
frequency 0 after this fix: U.FL1 (output-only) is disabled by
ICE_SMA1_TX_EN which keeps the TX output buffer off, and U.FL2
(input-only) is disabled by ICE_SMA2_UFL2_RX_DIS. They can be
activated by changing the corresponding SMA pin direction via dpll
netlink.
Fixes: 2dd5d03c77e2 ("ice: redesign dpll sma/u.fl pins control") Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Tested-by: Alexander Nowlin <alexander.nowlin@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-7-cdcb48303fd8@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Petr Oros [Tue, 28 Apr 2026 05:22:18 +0000 (22:22 -0700)]
ice: fix infinite recursion in ice_cfg_tx_topo via ice_init_dev_hw
On certain E810 configurations where firmware supports Tx scheduler
topology switching (tx_sched_topo_comp_mode_en), ice_cfg_tx_topo()
may need to apply a new 5-layer or 9-layer topology from the DDP
package. If the AQ command to set the topology fails (e.g. due to
invalid DDP data or firmware limitations), the global configuration
lock must still be cleared via a CORER reset.
Commit 86aae43f21cf ("ice: don't leave device non-functional if Tx
scheduler config fails") correctly fixed this by refactoring
ice_cfg_tx_topo() to always trigger CORER after acquiring the global
lock and re-initialize hardware via ice_init_hw() afterwards.
However, commit 8a37f9e2ff40 ("ice: move ice_deinit_dev() to the end
of deinit paths") later moved ice_init_dev_hw() into ice_init_hw(),
breaking the reinit path introduced by 86aae43f21cf. This creates an
infinite recursive call chain:
Fix by moving ice_init_dev_hw() back out of ice_init_hw() and calling
it explicitly from ice_probe() and ice_devlink_reinit_up(). The third
caller, ice_cfg_tx_topo(), intentionally does not need ice_init_dev_hw()
during its reinit, it only needs the core HW reinitialization. This
breaks the recursion cleanly without adding flags or guards.
The deinit ordering changes from commit 8a37f9e2ff40 ("ice: move
ice_deinit_dev() to the end of deinit paths") which fixed slow rmmod
are preserved, only the init-side placement of ice_init_dev_hw() is
reverted.
Fixes: 8a37f9e2ff40 ("ice: move ice_deinit_dev() to the end of deinit paths") Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Alexander Nowlin <alexander.nowlin@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-6-cdcb48303fd8@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Petr Oros [Tue, 28 Apr 2026 05:22:17 +0000 (22:22 -0700)]
ice: fix NULL pointer dereference in ice_reset_all_vfs()
ice_reset_all_vfs() ignores the return value of ice_vf_rebuild_vsi().
When the VSI rebuild fails (e.g. during NVM firmware update via
nvmupdate64e), ice_vsi_rebuild() tears down the VSI on its error path,
leaving txq_map and rxq_map as NULL. The subsequent unconditional call
to ice_vf_post_vsi_rebuild() leads to a NULL pointer dereference in
ice_ena_vf_q_mappings() when it accesses vsi->txq_map[0].
The single-VF reset path in ice_reset_vf() already handles this
correctly by checking the return value of ice_vf_reconfig_vsi() and
skipping ice_vf_post_vsi_rebuild() on failure.
Apply the same pattern to ice_reset_all_vfs(): check the return value
of ice_vf_rebuild_vsi() and skip ice_vf_post_vsi_rebuild() and
ice_eswitch_attach_vf() on failure. The VF is left safely disabled
(ICE_VF_STATE_INIT not set, VFGEN_RSTAT not set to VFACTIVE) and can
be recovered via a VFLR triggered by a PCI reset of the VF
(sysfs reset or driver rebind).
Note that this patch does not prevent the VF VSI rebuild from failing
during NVM update — the underlying cause is firmware being in a
transitional state while the EMP reset is processed, which can cause
Admin Queue commands (ice_add_vsi, ice_cfg_vsi_lan) to fail. This
patch only prevents the subsequent NULL pointer dereference that
crashes the kernel when the rebuild does fail.
dmesg:
ice 0000:13:00.0: firmware recommends not updating fw.mgmt, as it
may result in a downgrade. continuing anyways
ice 0000:13:00.1: ice_init_nvm failed -5
ice 0000:13:00.1: Rebuild failed, unload and reload driver
Petr Oros [Tue, 28 Apr 2026 05:22:16 +0000 (22:22 -0700)]
iavf: add VIRTCHNL_OP_ADD_VLAN to success completion handler
The V1 ADD_VLAN opcode had no success handler; filters sent via V1
stayed in ADDING state permanently. Add a fallthrough case so V1
filters also transition ADDING -> ACTIVE on PF confirmation.
Critically, add an `if (v_retval) break` guard: the error switch in
iavf_virtchnl_completion() does NOT return after handling errors,
it falls through to the success switch. Without this guard, a
PF-rejected ADD would incorrectly mark ADDING filters as ACTIVE,
creating a driver/HW mismatch where the driver believes the filter
is installed but the PF never accepted it.
For V2, this is harmless: iavf_vlan_add_reject() in the error
block already kfree'd all ADDING filters, so the success handler
finds nothing to transition.
Fixes: 968996c070ef ("iavf: Fix VLAN_V2 addition/rejection") Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-4-cdcb48303fd8@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Petr Oros [Tue, 28 Apr 2026 05:22:15 +0000 (22:22 -0700)]
iavf: wait for PF confirmation before removing VLAN filters
The VLAN filter DELETE path was asymmetric with the ADD path: ADD
waits for PF confirmation (ADD -> ADDING -> ACTIVE), but DELETE
immediately frees the filter struct after sending the DEL message
without waiting for the PF response.
This is problematic because:
- If the PF rejects the DEL, the filter remains in HW but the driver
has already freed the tracking structure, losing sync.
- Race conditions between DEL pending and other operations
(add, reset) cannot be properly resolved if the filter struct
is already gone.
Add IAVF_VLAN_REMOVING state to make the DELETE path symmetric:
In iavf_del_vlans(), transition filters from REMOVE to REMOVING
instead of immediately freeing them. The new DEL completion handler
in iavf_virtchnl_completion() frees filters on success or reverts
them to ACTIVE on error.
Update iavf_add_vlan() to handle the REMOVING state: if a DEL is
pending and the user re-adds the same VLAN, queue it for ADD so
it gets re-programmed after the PF processes the DEL.
The !VLAN_FILTERING_ALLOWED early-exit path still frees filters
directly since no PF message is sent in that case.
Also update iavf_del_vlan() to skip filters already in REMOVING
state: DEL has been sent to PF and the completion handler will
free the filter when PF confirms. Without this guard, the sequence
DEL(pending) -> user-del -> second DEL could cause the PF to return
an error for the second DEL (filter already gone), causing the
completion handler to incorrectly revert a deleted filter back to
ACTIVE.
Fixes: 968996c070ef ("iavf: Fix VLAN_V2 addition/rejection") Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-3-cdcb48303fd8@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Petr Oros [Tue, 28 Apr 2026 05:22:14 +0000 (22:22 -0700)]
iavf: stop removing VLAN filters from PF on interface down
When a VF goes down, the driver currently sends DEL_VLAN to the PF for
every VLAN filter (ACTIVE -> DISABLE -> send DEL -> INACTIVE), then
re-adds them all on UP (INACTIVE -> ADD -> send ADD -> ADDING ->
ACTIVE). This round-trip is unnecessary because:
1. The PF disables the VF's queues via VIRTCHNL_OP_DISABLE_QUEUES,
which already prevents all RX/TX traffic regardless of VLAN filter
state.
2. The VLAN filters remaining in PF HW while the VF is down is
harmless - packets matching those filters have nowhere to go with
queues disabled.
3. The DEL+ADD cycle during down/up creates race windows where the
VLAN filter list is incomplete. With spoofcheck enabled, the PF
enables TX VLAN filtering on the first non-zero VLAN add, blocking
traffic for any VLANs not yet re-added.
Remove the entire DISABLE/INACTIVE state machinery:
- Remove IAVF_VLAN_DISABLE and IAVF_VLAN_INACTIVE enum values
- Remove iavf_restore_filters() and its call from iavf_open()
- Remove VLAN filter handling from iavf_clear_mac_vlan_filters(),
rename it to iavf_clear_mac_filters()
- Remove DEL_VLAN_FILTER scheduling from iavf_down()
- Remove all DISABLE/INACTIVE handling from iavf_del_vlans()
VLAN filters now stay ACTIVE across down/up cycles. Only explicit
user removal (ndo_vlan_rx_kill_vid) or PF/VF reset triggers VLAN
filter deletion/re-addition.
Fixes: ed1f5b58ea01 ("i40evf: remove VLAN filters on close") Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-2-cdcb48303fd8@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Petr Oros [Tue, 28 Apr 2026 05:22:13 +0000 (22:22 -0700)]
iavf: rename IAVF_VLAN_IS_NEW to IAVF_VLAN_ADDING
Rename the IAVF_VLAN_IS_NEW state to IAVF_VLAN_ADDING to better
describe what the state represents: an ADD request has been sent to
the PF and is waiting for a response.
This is a pure rename with no behavioral change, preparing for a
cleanup of the VLAN filter state machine.
Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-1-cdcb48303fd8@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This series is targeting net-next for 7.2. To make this series
self-contained in the networking code, I dropped the patches that remove
support for transformation cloning from the crypto API, which is a
further negative 275-line cleanup and optimization this series enables.
That will be done as a follow-up, either through the crypto tree for
7.3, or still through net-next for 7.2 at maintainer preference.
This series refactors the TCP-AO (TCP Authentication Option) code to do
MAC and KDF computations using lib/crypto/ instead of crypto_ahash.
This greatly simplifies the code and makes it much more efficient. The
entire tcp_sigpool mechanism becomes unnecessary and is removed, as the
problems it was designed to solve don't exist with the library APIs.
The crypto API's support for crypto transformation cloning also becomes
unnecessary and will be removed in follow-up patches. Note that as part
of that, we'll be able to roll back the addition of the reference count
to crypto_tfm, which had regressed performance for all crypto API users.
To make this simplification and optimization possible, this series also
updates the TCP-AO code to support a specific set of algorithms, rather
than arbitrary algorithms that don't make sense and are very likely not
being used, e.g. CRC-32 and HMAC-MD5.
Specifically, this series retains the support for AES-128-CMAC,
HMAC-SHA1, and HMAC-SHA256. AES-128-CMAC and HMAC-SHA1 are the only
algorithms that are actually standardized for use in TCP-AO, while
HMAC-SHA256 makes sense to continue supporting as a Linux extension. Of
course, other algorithms can still be (re-)added later if ever needed.
It's worth noting that TCP-AO MACs are limited to 20 bytes by the TCP
options space, which limits the benefit of further algorithm upgrades.
This series passes the tcp_ao selftests
(sudo make -C tools/testing/selftests/net/tcp_ao/ run_tests).
To get a sense for how much more efficient this makes the TCP-AO code,
here's a microbenchmark for tcp_ao_hash_skb() with skb->len == 128:
Eric Biggers [Mon, 27 Apr 2026 17:27:27 +0000 (10:27 -0700)]
net/tcp: Remove tcp_sigpool
tcp_sigpool is no longer used. It existed only as a workaround for
issues in the design of the crypto_ahash API, which have been avoided by
switching to the much easier-to-use library APIs instead. Remove it.
Eric Biggers [Mon, 27 Apr 2026 17:27:26 +0000 (10:27 -0700)]
net/tcp-ao: Return void from functions that can no longer fail
Since tcp-ao now uses the crypto library API instead of crypto_ahash,
and MACs and keys now have a statically-known maximum size, many tcp-ao
functions can no longer fail. Propagate this change up into the return
types of various functions.
Eric Biggers [Mon, 27 Apr 2026 17:27:25 +0000 (10:27 -0700)]
net/tcp-ao: Use stack-allocated MAC and traffic_key buffers
Now that the maximum MAC and traffic key lengths are statically-known
small values, allocate MACs and traffic keys on the stack instead of
with kmalloc. This eliminates multiple failure-prone GFP_ATOMIC
allocations.
Note that some cases such as tcp_ao_prepare_reset() are left unchanged
for now since they would require slightly wider changes.
Eric Biggers [Mon, 27 Apr 2026 17:27:24 +0000 (10:27 -0700)]
net/tcp-ao: Use crypto library API instead of crypto_ahash
Currently the kernel's TCP-AO implementation does the MAC and KDF
computations using the crypto_ahash API. This API is inefficient and
difficult to use, and it has required extensive workarounds in the form
of per-CPU preallocated objects (tcp_sigpool) to work at all.
Let's use lib/crypto/ instead. This means switching to straightforward
stack-allocated structures, virtually addressed buffers, and direct
function calls. It also means removing quite a bit of error handling.
This makes TCP-AO quite a bit faster.
This also enables many additional cleanups, which later commits will
handle: removing tcp-sigpool, removing support for crypto_tfm cloning,
removing more error handling, and replacing more dynamically-allocated
buffers with stack buffers based on the now-statically-known limits.
Eric Biggers [Mon, 27 Apr 2026 17:27:23 +0000 (10:27 -0700)]
net/tcp-ao: Drop support for most non-RFC-specified algorithms
RFC 5926 (https://datatracker.ietf.org/doc/html/rfc5926) specifies the
use of AES-128-CMAC and HMAC-SHA1 with TCP-AO. This includes a
specification for how traffic keys shall be derived for each algorithm.
Support for any other algorithms with TCP-AO isn't standardized, though
an expired Internet Draft (a work-in-progress document, not a standard)
from 2019 does propose adding HMAC-SHA256 support:
https://datatracker.ietf.org/doc/html/draft-nayak-tcp-sha2-03
Since both documents specify the KDF for each algorithm individually, it
isn't necessarily clear how any other algorithm should be integrated.
Nevertheless, the Linux implementation of TCP-AO allows userspace to
specify the MAC algorithm as a string tcp_ao_add::alg_name naming either
"cmac(aes128)" or an arbitrary algorithm in the crypto_ahash API. The
set of valid strings is undocumented. The implementation assumes that
"cmac(aes128)" is the only algorithm that requires an entropy extraction
step and that all algorithms accept keys with length equal to the
untruncated MAC; thus, arbitrary HMAC algorithms probably do work, but
some other MAC algorithms like AES-256-CMAC have never actually worked.
Unfortunately, this undocumented string allows many obsolete, insecure,
or redundant algorithms. For example, "hmac(md5)" and the
non-cryptographic "crc32" are accepted. It also ties the implementation
to crypto_ahash and requires that most memory be dynamically allocated,
making the implementation unnecessarily complex and inefficient. Still
furthermore, this implementation requires the crypto API to support
"transformation cloning", whose only user is this feature.
Fortunately, it's very likely that only a few algorithms are actually
used in practice. Let's restrict the set of allowed algorithms to
"cmac(aes128)" (or "cmac(aes)" with keylen=16), "hmac(sha1)", and
"hmac(sha256)". The first two are the actually standard ones, while
HMAC-SHA256 seems like a reasonable algorithm to continue supporting as
a Linux extension, considering the Internet Draft for it and the fact
that SHA-256 is the usual choice of upgrade from the outdated SHA-1.
If any other algorithm ever turns out to be needed, e.g. HMAC-SHA512, it
can of course be (re-)added in library form. However, note that the TCP
options space limits TCP-AO MACs to 20 bytes (160 bits) anyway, which
limits the potential benefit of any further upgrade to the algorithm.
Merge tag 'trace-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix inverted check of registering the stats for branch tracing
When calling register_stat_tracer() which returns zero on success and
negative on error, the callers were checking the return of zero as an
error and printing a warning message. Because this was just a normal
printk() message and not a WARN(), it wasn't caught in any testing.
Fix the check to print the warning message when an error actually
happens.
- Fix a typo in a comment in tracepoint.h
- Limit the size of event probes to 3K in size
It is possible to create a dynamic event probe via the tracefs system
that is greater than the max size of an event that the ring buffer
can hold. This basically causes the event to become useless.
Limit the size of an event probe to be 3K as that should be large
enough to handle any dynamic events being created, and fits within
the PAGE_SIZE sub-buffers of the ring buffer.
* tag 'trace-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/probes: Limit size of event probe to 3K
tracepoint: Fix typo in tracepoint.h comment
tracing: branch: Fix inverted check on stat tracer registration
Hasan Basbunar [Tue, 28 Apr 2026 17:07:39 +0000 (19:07 +0200)]
page_pool: fix memory-provider leak in page_pool_create_percpu() error path
When page_pool_create_percpu() fails on page_pool_list(), it falls
through to its err_uninit: label, which calls page_pool_uninit().
At that point page_pool_init() has already taken two references
when the user requested PP_FLAG_ALLOW_UNREADABLE_NETMEM:
Neither is undone by page_pool_uninit(); both are only undone by
__page_pool_destroy() (success-side teardown). The error path
therefore leaks the per-provider reference taken by mp_ops->init
(io_zcrx_ifq->refs in the io_uring zcrx provider, the dmabuf
binding refcount in the devmem provider) plus one increment of
the page_pool_mem_providers static branch on every failure of
xa_alloc_cyclic() inside page_pool_list().
The leaked io_zcrx_ifq->refs in turn pins everything
io_zcrx_ifq_free() would release on cleanup: ifq->user (uid),
ifq->mm_account (mmdrop), ifq->dev (device refcount),
ifq->netdev_tracker (netdev refcount), and the rbuf region.
The leaked static branch increment forces all subsequent
page_pool_alloc_netmems() and page_pool_return_page() callers to
take the slow mp_ops branch for the lifetime of the kernel.
The same shape applies to the devmem dmabuf provider via
mp_dmabuf_devmem_init()/mp_dmabuf_devmem_destroy().
Restore the cleanup symmetry by moving the mp_ops->destroy() and
static_branch_dec() calls out of __page_pool_destroy() and into
page_pool_uninit(), so page_pool_uninit() is again the strict
inverse of page_pool_init(). page_pool_uninit() has only two
callers (the err_uninit: path and __page_pool_destroy()), so this
preserves the single-call invariant on the success path while
fixing the err path. The error path of page_pool_init() itself
still skips the mp_ops cleanup correctly: mp_ops->init is the
last action that takes a reference before page_pool_init() returns
0, so when it returns an error neither the refcount nor the static
branch has been touched.
Triggering the bug requires xa_alloc_cyclic() to fail with -ENOMEM,
which under normal GFP_KERNEL retry behaviour is rare. It is
deterministic under CONFIG_FAULT_INJECTION with fail_page_alloc /
xa fault injection, or under sustained memory pressure. The leak
is silent: there is no warning, and the released kernel build
continues running with a permanently-incremented static branch.
value changed: 0x0000000000000000 -> 0xffff88813cf5c400
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 UID: 0 PID: 22063 Comm: syz.0.31122 Tainted: G W syzkaller #0 PREEMPT(full)
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
Fixes: 47e91f56008b ("bonding: use RCU protection for 3ad xmit path") Reported-by: syzbot+9bb2ff2a4ab9e17307e1@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/69f0a82f.050a0220.3aadc4.0000.GAE@google.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jay Vosburgh <jv@jvosburgh.net> Cc: Andrew Lunn <andrew+netdev@lunn.ch> Link: https://patch.msgid.link/20260428123207.3809211-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Lorenzo Bianconi [Tue, 28 Apr 2026 06:53:16 +0000 (08:53 +0200)]
net: airoha: Do not return err in ndo_stop() callback
Always complete the airoha_dev_stop() routine regardless of the
airoha_set_vip_for_gdm_port() return value, since errors from
ndo_stop() are ignored by the networking stack and the interface is
always considered down after the call.
Lorenzo Bianconi [Tue, 28 Apr 2026 05:23:38 +0000 (07:23 +0200)]
net: airoha: Rename get_src_port_id callback in get_sport
For code consistency, rename get_src_port_id callback in get_sport.
Please note this patch does not introduce any logical change and it is
just a cosmetic patch.
r8152: Use ocp/mdio test and clear functions in r8157_hw_phy_cfg()
Replace explicit testing of bits and clearing these bits by existing
functions ocp_word_test_and_clr_bits() and r8152_mdio_test_and_clr_bit()
to re-use this code.
This allows to remove the "ocp_data" variable. Also remove the "ret" variable
which was incorrectly used for the r8153_phy_status() return value which
is a u16, so that the remaining "data" variable is sufficient.
====================
net/mlx5: Fix E-Switch work queue deadlock with devlink lock
mlx5_eswitch_cleanup() calls destroy_workqueue() while holding the
devlink lock through mlx5_uninit_one(). E-Switch workqueue workers also
need the devlink lock, but previously took it before checking whether
their work item was stale. Cleanup can therefore wait for a worker that
is blocked on the same devlink lock.
Mode changes have the same ordering hazard: the mode-change path holds
devlink lock while tearing down the current mode, and old work may still
be pending on the E-Switch workqueue.
Fix this by making esw_wq_handler() check the generation counter before
attempting to take devlink lock. The worker uses devl_trylock(); if the
lock is busy and the work is still current, it sleeps on an E-Switch wait
queue with a short timeout. Invalidation increments the generation
counter and wakes the wait queue, so stale workers exit without spinning
or blocking cleanup.
The generation counter already existed but was buried in
mlx5_esw_functions and only covered function-change events. The three
patches get from there to the fix in small steps.
Patch 1 moves the counter up to mlx5_eswitch. Pure refactor,
no behavior change.
Patch 2 cleans up the work queue plumbing: factors out the repeated
lock/check/dispatch boilerplate into a single esw_wq_handler() and
adds mlx5_esw_add_work() as the one place to enqueue work.
Patch 3 is the actual fix: check the generation before the lock, use
devl_trylock() instead of devl_lock(), add a wait queue so lock retries
do not spin, and invalidate pending work at the earliest safe operation
boundary. Cleanup invalidates before destroy_workqueue(), and mode
teardown unregisters the work-producing notifiers before invalidating so
new notifier work cannot capture the new generation.
====================
Mark Bloch [Tue, 28 Apr 2026 05:10:17 +0000 (08:10 +0300)]
net/mlx5: E-Switch, fix deadlock between devlink lock and esw->wq
mlx5_eswitch_cleanup() calls destroy_workqueue() while holding the
devlink lock through mlx5_uninit_one(). E-Switch workqueue workers also
need the devlink lock, but previously took it before checking whether
their work item was stale. This can deadlock when cleanup waits for a
worker that is blocked on the same devlink lock.
Mode changes have the same ordering hazard: the mode-change path holds
devlink lock while tearing down the current mode, and old work may still
be pending on the E-Switch workqueue.
Fix this by making esw_wq_handler() check the generation counter before
attempting to take devlink lock. The worker uses devl_trylock(); if the
lock is busy and the work is still current, it sleeps on an E-Switch wait
queue with a short timeout. Invalidation increments the generation
counter and wakes the wait queue, so stale workers exit without spinning
or blocking cleanup.
Invalidate work at the earliest safe operation boundary. Cleanup
invalidates before destroy_workqueue(), and QoS cleanup runs after the
workqueue is destroyed. Mode teardown unregisters the work-producing
notifiers first, then invalidates the queue before tearing down
FDB/QoS/rate-node state. This prevents new notifier work from capturing
the new generation while still making old work stale before expensive
teardown starts.
mlx5_devlink_eswitch_mode_set() now relies on
mlx5_eswitch_disable_locked() for the mode-change invalidation instead
of incrementing the generation after disable. mlx5_eswitch_disable()
gets the same coverage. SR-IOV enable/disable paths invalidate before VF
state changes so work against the old VF count or mode is discarded.
Remove the conditional generation increment in
mlx5_eswitch_event_handler_unregister(); mlx5_eswitch_disable_locked()
now handles it unconditionally after the relevant notifiers are
unregistered.
Mark Bloch [Tue, 28 Apr 2026 05:10:16 +0000 (08:10 +0300)]
net/mlx5: E-Switch, introduce generic work queue dispatch helper
Each E-Switch work item requires the same boilerplate: acquire the
devlink lock, check whether the work is stale, dispatch to the
appropriate handler, and release the lock. Factor this out.
Add a func callback to mlx5_host_work so the generic handler
esw_wq_handler() can dispatch to the right function without
duplicating locking logic. Introduce mlx5_esw_add_work() as the
single enqueue point: it stamps the work item with the current
generation counter and queues it onto the E-Switch work queue.
Refactor esw_vfs_changed_event_handler() to match the new contract:
it no longer receives work_gen or out as parameters. It queries
mlx5_esw_query_functions() itself and owns the kvfree() of the
result. The devlink lock is acquired and released by esw_wq_handler()
before dispatching, so the handler runs with the lock already held.
Update mlx5_esw_funcs_changed_handler() to use mlx5_esw_add_work().
Mark Bloch [Tue, 28 Apr 2026 05:10:15 +0000 (08:10 +0300)]
net/mlx5: E-Switch, move work queue generation counter
The generation counter in mlx5_esw_functions is used to detect stale
work items on the E-Switch work queue. Move it from mlx5_esw_functions
to the top-level mlx5_eswitch struct so it can guard all work types,
not just function-change events.
This is a mechanical refactor: no behavioral change.
Hamza Mahfooz [Tue, 28 Apr 2026 12:53:39 +0000 (08:53 -0400)]
hv_sock: fix ARM64 support
VMBUS ring buffers must be page aligned. Therefore, the current value of
24K presents a challenge on ARM64 kernels (with 64K pages). So, use
VMBUS_RING_SIZE() to ensure they are always aligned and large enough to
hold all of the relevant data.
Cc: stable@vger.kernel.org Fixes: 77ffe33363c0 ("hv_sock: use HV_HYP_PAGE_SIZE for Hyper-V communication") Tested-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260428125339.13963-1-hamzamahfooz@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: phy: aquantia: use ADVERTISE_XNP for extended next page advertising
When configuring the link parameters in forced mode for the AQR-105, the
Extended Next Page bit gets advertised for Multi-Gigabit modes.
This is done through bit 12 of MDIO_AN_ADVERTISE in MDIO_MMD_AN. This
contains a copy of the MII_ADVERTISE, for which 802.3 defines bit 12 as
the Extended Next Page advertising. This bit used to be marked as
reserved, but a proper define for it was added in :
commit e7a62edd34b1 ("net: phy: qcom: at803x: Use the correct bit to disable extended next page")
Let's use it instead of the ADVERTISE_RESV definition, making the code
more self-documenting.
Jakub Kicinski [Wed, 29 Apr 2026 23:55:57 +0000 (16:55 -0700)]
Merge branch 'net-psp-add-more-validation'
Jakub Kicinski says:
====================
net: psp: add more validation
Address some AI code-scan issues with the PSP code.
I don't think any of these are real bugs, but they may
become bugs in the future. The two real bugs discovered
were posted separately for net. AI reports 3 more which
seem plain wrong (rx SPI "leak" on error etc.).
====================
Jakub Kicinski [Tue, 28 Apr 2026 20:53:52 +0000 (13:53 -0700)]
psp: validate IPv4 header fields in psp_dev_rcv()
psp_dev_rcv() is called from the NIC driver's RX completion path
before the frame reaches ip_rcv_core(), so the IP header has not
been validated in SW, yet. We expect that the device has done
all this validation, but let's also add the SW checks, to avoid
surprises.
Jakub Kicinski [Tue, 28 Apr 2026 20:53:51 +0000 (13:53 -0700)]
psp: add a comment about a psp_dev add netlink notification
In psp_dev_create(), the DEV_ADD_NTF netlink notification is sent
before the device is published to the netdev via rcu_assign_pointer().
IIRC this is intentional because a single PSP device is expected
to be shared with multiple netdevs. So we are trying to default to
not having the netdev info. We can change it if someone complains
but for now just add a comment that it's intentional.
Jakub Kicinski [Tue, 28 Apr 2026 20:53:50 +0000 (13:53 -0700)]
psp: validate protocol before mutating skb in psp_dev_encapsulate()
Code checkers / AI scans will complain that we have already modified
the packet by the time we realize that protocol is not IP.
Move the skb->protocol check to before skb_push()/memmove() so that
the skb is not left in a corrupted state when the function returns
false for an unsupported protocol. psp_dev_rcv() follows similar
pattern.
Today this path is unreachable because both in-tree callers (mlx5 and
netdevsim) only reach psp_dev_encapsulate() from TCP socket TX paths
where skb->protocol is always ETH_P_IP or ETH_P_IPV6, and both drop
the skb on a false return, anyway.
Jakub Kicinski [Tue, 28 Apr 2026 20:39:24 +0000 (13:39 -0700)]
MAINTAINERS: update the IPv4/IPv6 entry and add Ido Schimmel
The IPv4/IPv6 and routing code is not very well separated from
the TCP/UDP code. Scope it down properly by providing a more
accurate file list, instead of net/ipv4/ and net/ipv6/
Now that the entry is more accurately representing layer 3
and routing merge in the nexthop entry into it.
Add Ido Schimmel as a co-maintainer, Ido's git history speaks
for itself.
Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260428203924.1229169-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>