]> git.ipfire.org Git - thirdparty/kernel/linux.git/log
thirdparty/kernel/linux.git
7 weeks agonet: sched: introduce qdisc-specific drop reason tracing
Jesper Dangaard Brouer [Thu, 26 Feb 2026 13:44:12 +0000 (14:44 +0100)] 
net: sched: introduce qdisc-specific drop reason tracing

Create new enum qdisc_drop_reason and trace_qdisc_drop tracepoint
for qdisc layer drop diagnostics with direct qdisc context visibility.

The new tracepoint includes qdisc handle, parent, kind (name), and
device information. Existing SKB_DROP_REASON_QDISC_DROP is retained
for backwards compatibility via kfree_skb_reason().

Convert qdiscs with drop reasons to use the new infrastructure.

Change CAKE's cobalt_should_drop() return type from enum skb_drop_reason
to enum qdisc_drop_reason to fix implicit enum conversion warnings.
Use QDISC_DROP_UNSPEC as the 'not dropped' sentinel instead of
SKB_NOT_DROPPED_YET. Both have the same compiled value (0), so the
comparison logic remains semantically equivalent.

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211345275.3011628.1974310302645218067.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoMerge branch 'icmp-fix-icmp-error-source-address-over-xfrm-tunnel'
Jakub Kicinski [Sat, 28 Feb 2026 23:08:19 +0000 (15:08 -0800)] 
Merge branch 'icmp-fix-icmp-error-source-address-over-xfrm-tunnel'

Antony Antony says:

====================
icmp: Fix icmp error source address over xfrm tunnel

icmp: Fix icmp error source address over xfrm tunnel

This fix, originally sent to XFRM/IPsec, has been recommended by
Steffen Klassert to submit to the net tree, since it changes ICMP
behavior.

The patch addresses a minor issue related to the IPv4 source address
of ICMP error messages. The bug only occurs when xfrm policies are
configured. It originated from an old 2011 commit:

commit 415b3334a21a ("icmp: Fix regression in nexthop resolution during
replies.")
====================

Link: https://patch.msgid.link/cover.1772101380.git.antony.antony@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoselftests: net: add ICMP error source address test over xfrm tunnel
Antony Antony [Thu, 26 Feb 2026 10:28:21 +0000 (11:28 +0100)] 
selftests: net: add ICMP error source address test over xfrm tunnel

Test that ICMP error messages generated by an IPsec gateway use
the correct source address (the gateway's address, not the
unreachable destination).

Signed-off-by: Antony Antony <antony.antony@secunet.com>
Link: https://patch.msgid.link/79d526f96cf2252d71550d38772876bc72c7e3c7.1772101380.git.antony.antony@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoicmp: fix ICMP error source address when xfrm policy matches
Antony Antony [Thu, 26 Feb 2026 10:27:51 +0000 (11:27 +0100)] 
icmp: fix ICMP error source address when xfrm policy matches

When an IPsec gateway generates an ICMP error (e.g., Destination Host
Unreachable), the source address incorrectly shows the unreachable
destination instead of the gateway's address. IPv6 behaves correctly.

Before fix:
  ping 10.1.6.3
  From 10.1.6.3 icmp_seq=1 Destination Host Unreachable
  (wrong - 10.1.6.3 is the unreachable host)

After fix:
  ping 10.1.6.3
  From 10.1.5.2 icmp_seq=1 Destination Host Unreachable
  (correct - 10.1.5.2 is the gateway)

The fix removes the memcpy that overwrote fl4 with fl4_dec after
xfrm_lookup(). A follow-up commit adds a selftest.

Fixes: 415b3334a21a ("icmp: Fix regression in nexthop resolution during replies.")
Cc: stable+noautosel@kernel.org # Avoid false positives in tests
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Acked-by: Tobias Brunner <tobias@strongswan.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/19a0156ff6e76baa323a81d710510d399a6ff63a.1772101380.git.antony.antony@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoMerge branch 'npc-hw-block-support-for-cn20k'
Jakub Kicinski [Sat, 28 Feb 2026 18:30:26 +0000 (10:30 -0800)] 
Merge branch 'npc-hw-block-support-for-cn20k'

Ratheesh Kannoth says:

====================
NPC HW block support for cn20k

This patchset adds comprehensive support for the CN20K NPC
architecture. CN20K introduces significant changes in MCAM layout,
parser design, KPM/KPU mapping, index management, virtual index handling,
and dynamic rule installation. The patches update the AF, PF/VF, and
common layers to correctly support these new capabilities while
preserving compatibility with previous silicon variants.

MCAM on CN20K differs from older designs: the hardware now contains
two vertical banks of depth 8192, and thirty-two horizontal subbanks of
depth 256. Each subbank can be configured as x2 or x4, enabling
256-bit or 512-bit key storage. Several allocation models are added to
support this layout, including contiguous and non-contiguous allocation
with or without reference ranges and priorities.

Parser and extraction logic are also enhanced. CN20K introduces a new
profile model where up to twenty-four extractors may be configured for
each parsing profile. A new KPM profile scheme is added, grouping
sixteen KPUs into eight KPM profiles, each formed by two KPUs.

Support is added for default index allocation for CN20K-specific
MCAM entry structures, virtual index allocation, improved defragmentation,
and TC rule installation by allowing the AF driver to determine
required x2/x4 rule width during flow install.
====================

Link: https://patch.msgid.link/20260224080009.4147301-1-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoocteontx2-af: npc: Use common structures
Ratheesh Kannoth [Tue, 24 Feb 2026 08:00:09 +0000 (13:30 +0530)] 
octeontx2-af: npc: Use common structures

CN20K and legacy silicon differ in the size of key words used
in NPC MCAM. However, SoC-specific structures are not required
for low-level functions. Remove the SoC-specific structures
and rename the macros to improve readability.

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-14-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoocteontx2-af: npc: cn20k: add debugfs support
Ratheesh Kannoth [Tue, 24 Feb 2026 08:00:08 +0000 (13:30 +0530)] 
octeontx2-af: npc: cn20k: add debugfs support

CN20K silicon divides the NPC MCAM into banks and subbanks, with each
subbank configurable for x2 or x4 key widths. This patch adds debugfs
entries to expose subbank usage details and their configured key type.

A debugfs entry is also added to display the default MCAM indexes
allocated for each pcifunc.

Additionally, debugfs support is introduced to show the mapping between
virtual indexes and real MCAM indexes, and vice versa.

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-13-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoocteontx2-pf: cn20k: Add TC rules support
Subbaraya Sundeep [Tue, 24 Feb 2026 08:00:07 +0000 (13:30 +0530)] 
octeontx2-pf: cn20k: Add TC rules support

Unlike previous silicons, MCAM entries required for TC rules in CN20K
are allocated dynamically. The key size can also be dynamic, i.e., X2 or
X4. Based on the size of the TC rule match criteria, the AF driver
allocates an X2 or X4 rule. This patch implements the required changes
for CN20K TC by requesting an MCAM entry from the AF driver on the fly
when the user installs a rule. Based on the TC rule priority added or
deleted by the user, the PF driver shifts MCAM entries accordingly. If
there is a mix of X2 and X4 rules and the user tries to install a rule
in the middle of existing rules, the PF driver detects this and rejects
the rule since X2 and X4 rules cannot be shifted in hardware.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-12-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoocteontx2-af: npc: cn20k: Allocate MCAM entry for flow installation
Ratheesh Kannoth [Tue, 24 Feb 2026 08:00:06 +0000 (13:30 +0530)] 
octeontx2-af: npc: cn20k: Allocate MCAM entry for flow installation

In CN20K, the PF/VF driver is unaware of the NPC MCAM entry type (x2/x4)
required for a particular TC rule when the user installs rules through the
TC command. This forces the PF/VF driver to first query the AF driver for
the rule size, then allocate an entry, and finally install the flow. This
sequence requires three mailbox request/response exchanges from the PF. To
speed up the installation, the `install_flow` mailbox request message is
extended with additional fields that allow the AF driver to determine the
required NPC MCAM entry type, allocate the MCAM entry, and complete the
flow installation in a single step.

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-11-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoocteontx2-af: npc: cn20k: virtual index support
Ratheesh Kannoth [Tue, 24 Feb 2026 08:00:05 +0000 (13:30 +0530)] 
octeontx2-af: npc: cn20k: virtual index support

This patch adds support for virtual MCAM index allocation and
improves CN20K MCAM defragmentation handling. A new field is
introduced in the non-ref, non-contiguous MCAM allocation mailbox
request to indicate that virtual indexes should be returned instead
of physical ones. Virtual indexes allow the hardware to move mapped
MCAM entries internally, enabling defragmentation and preventing
scattered allocations across subbanks. The patch also enhances
defragmentation by treating non-ref, non-contiguous allocations as
ideal candidates for packing sparsely used regions, which can free
up subbanks for potential x2 or x4 configuration. All such
allocations are tracked and always returned as virtual indexes so
they remain stable even when entries are moved during defrag.
During defragmentation, MCAM entries may shift between subbanks,
but their virtual indexes remain unchanged. Additionally, this
update fixes an issue where entry statistics were not being
restored correctly after defragmentation.

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-10-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoocteontx2-af: npc: cn20k: Add new mailboxes for CN20K silicon
Suman Ghosh [Tue, 24 Feb 2026 08:00:04 +0000 (13:30 +0530)] 
octeontx2-af: npc: cn20k: Add new mailboxes for CN20K silicon

To enable enhanced MCAM capabilities for CN20K, the struct mcam_entry
has been extended to support expanded keyword requirements.
Specifically, the kw and kw_mask arrays have been increased from
a size of 7 to 8 to accommodate the additional keyword field
introduced for CN20K.

To ensure seamless integration while preserving compatibility
with existing platforms, dedicated CN20K-specific mailboxes
have been introduced that leverage the updated struct mcam_entry.
This approach allows CN20K to utilize the extended structure
without impacting current implementations.

This patch identifies the relevant mailboxes and introduces the
following CN20K-specific additions:

New mailboxes added:
1. `NPC_CN20K_MCAM_WRITE_ENTRY`
2. `NPC_CN20K_MCAM_ALLOC_AND_WRITE_ENTRY`
3. `NPC_CN20K_MCAM_READ_ENTRY`
4. `NPC_CN20K_MCAM_READ_BASE_RULE`

Signed-off-by: Suman Ghosh <sumang@marvell.com>
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-9-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoocteontx2-af: npc: cn20k: Prepare for new SoC
Ratheesh Kannoth [Tue, 24 Feb 2026 08:00:03 +0000 (13:30 +0530)] 
octeontx2-af: npc: cn20k: Prepare for new SoC

Current code pass mcam_entry structure to all low
level functions. This is not proper:

1) We need to modify all functions to support a new SoC
2) It does not look good to pass soc specific structure to
all common functions.

This patch adds a mcam meta data structure, which is populated
and passed to low level functions.

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-8-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoocteontx2-af: npc: cn20k: Use common APIs
Suman Ghosh [Tue, 24 Feb 2026 08:00:02 +0000 (13:30 +0530)] 
octeontx2-af: npc: cn20k: Use common APIs

In cn20k silicon, the register definitions and the algorithms used to
read, write, copy, and enable MCAM entries have changed. This patch
updates the common APIs to support both cn20k and previous silicon
variants.

Additionally, cn20k introduces a new algorithm for MCAM index management.
The common APIs are updated to invoke the cn20k-specific index management
routines for allocating, freeing, and retrieving default MCAM entries.

Signed-off-by: Suman Ghosh <sumang@marvell.com>
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-7-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoocteontx2-af: npc: cn20k: Allocate default MCAM indexes
Ratheesh Kannoth [Tue, 24 Feb 2026 08:00:01 +0000 (13:30 +0530)] 
octeontx2-af: npc: cn20k: Allocate default MCAM indexes

Reserving MCAM entries in the AF driver for installing default MCAM
entries is not an efficient allocation method, as it results in
significant wastage of entries. This patch allocates MCAM indexes for
promiscuous, multicast, broadcast, and unicast traffic in descending
order of indexes (from lower to higher priority) when the NIX LF is
attached to the PF/VF.

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-6-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoocteontx2-af: npc: cn20k: MKEX profile support
Suman Ghosh [Tue, 24 Feb 2026 08:00:00 +0000 (13:30 +0530)] 
octeontx2-af: npc: cn20k: MKEX profile support

In new silicon variant cn20k, a new parser profile is introduced. Instead
of having two layer-data information per key field type, a new key
extractor concept is introduced. As part of this change now a maximum of
24 extractor can be configured per packet parsing profile. For example,
LA type(ether) can have 24 unique parsing key, LC type(ip), LD
type(tcp/udp) also can have unique 24 parsing key associated.

Signed-off-by: Suman Ghosh <sumang@marvell.com>
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-5-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoocteontx2-af: npc: cn20k: Add default profile
Suman Ghosh [Tue, 24 Feb 2026 07:59:59 +0000 (13:29 +0530)] 
octeontx2-af: npc: cn20k: Add default profile

Default mkex profile for cn20k silicon. This commit
changes attribute of objects to may_be_unused to
avoid compiler warning

Signed-off-by: Suman Ghosh <sumang@marvell.com>
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-4-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoocteontx2-af: npc: cn20k: KPM profile changes
Suman Ghosh [Tue, 24 Feb 2026 07:59:58 +0000 (13:29 +0530)] 
octeontx2-af: npc: cn20k: KPM profile changes

KPU (Kangaroo Processing Unit) profiles are primarily used to set the
required packet pointers that will be used in later stages for key
generation. In the new CN20K silicon variant, a new KPM profile is
introduced alongside the existing KPU profiles.

In CN20K, a total of 16 KPUs are grouped into 8 KPM profiles. As per
the current hardware design, each KPM configuration contains a
combination of 2 KPUs:
    KPM0 = KPU0 + KPU8
    KPM1 = KPU1 + KPU9
    ...
    KPM7 = KPU7 + KPU15

This configuration enables more efficient use of KPU resources. This
patch adds support for the new KPM profile configuration.

Signed-off-by: Suman Ghosh <sumang@marvell.com>
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-3-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoocteontx2-af: npc: cn20k: Index management
Ratheesh Kannoth [Tue, 24 Feb 2026 07:59:57 +0000 (13:29 +0530)] 
octeontx2-af: npc: cn20k: Index management

In CN20K silicon, the MCAM is divided vertically into two banks.
Each bank has a depth of 8192.

The MCAM is divided horizontally into 32 subbanks, with each subbank
having a depth of 256.

Each subbank can accommodate either x2 keys or x4 keys. x2 keys are
256 bits in size, and x4 keys are 512 bits in size.

    Bank1                   Bank0
    |-----------------------------|
    |               |             | subbank 31 { depth 256 }
    |               |             |
    |-----------------------------|
    |               |             | subbank 30
    |               |             |
    ------------------------------
    ...............................

    |-----------------------------|
    |               |             | subbank 0
    |               |             |
    ------------------------------|

This patch implements the following allocation schemes in NPC.
The allocation API accepts reference (ref), limit, contig, priority,
and count values. For example, specifying ref=100, limit=200,
contig=1, priority=LOW, and count=20 will allocate 20 contiguous
MCAM entries between entries 100 and 200.

1. Contiguous allocation with ref, limit, and priority.
2. Non-contiguous allocation with ref, limit, and priority.
3. Non-contiguous allocation without ref.
4. Contiguous allocation without ref.

Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260224080009.4147301-2-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoipv6: sit: Replace deprecated strcpy with strscpy
Thorsten Blum [Fri, 27 Feb 2026 00:45:42 +0000 (01:45 +0100)] 
ipv6: sit: Replace deprecated strcpy with strscpy

strcpy() has been deprecated [1] because it performs no bounds checking
on the destination buffer, which can lead to buffer overflows. Replace
it with the safer strscpy().  Use the two-argument version of strscpy()
to copy 'parms->name' in ipip6_tunnel_locate().

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strcpy
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20260227004541.798966-3-thorsten.blum@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoMerge branch 'gve-support-larger-ring-sizes-in-dqo-qpl-mode'
Jakub Kicinski [Sat, 28 Feb 2026 16:58:40 +0000 (08:58 -0800)] 
Merge branch 'gve-support-larger-ring-sizes-in-dqo-qpl-mode'

Max Yuan says:

====================
gve: Support larger ring sizes in DQO-QPL mode

This patch series updates the gve driver to improve Queue Page List
(QPL) management and enable support for larger ring sizes when using the
DQO-QPL queue format.

Previously, the driver used hardcoded multipliers to determine the
number of pages to register for QPLs (e.g., 2x ring size for RX). This
rigid approach made it difficult to support larger ring sizes without
potentially exceeding the "max_registered_pages" limit reported by the
device.

The first patch introduces a unified and flexible logic for calculating
QPL page requirements. It balances TX and RX page allocations based on
the configured ring sizes and scales the total count down proportionally
if it would otherwise exceed the device's global registration limit.

The second patch leverages this new flexibility to stop ignoring the
maximum ring size supported by the device in DQO-QPL mode. Users can now
configure ring sizes up to the device-reported maximum, as the driver
will automatically adjust the QPL size to stay within allowed memory
bounds.
====================

Link: https://patch.msgid.link/20260225182342.1049816-1-joshwash@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agogve: Enable reading max ring size from the device in DQO-QPL mode
Matt Olson [Wed, 25 Feb 2026 18:23:42 +0000 (10:23 -0800)] 
gve: Enable reading max ring size from the device in DQO-QPL mode

The gVNIC device indicates a device option (MODIFY_RING) to the driver,
which presents a range of ring sizes from which the user is allowed to
select. But in DQO-QPL queue format, the driver ignores the "max" of
this range and instead allows the user to configure the ring size in the
range [min, default]. This was done because increasing the ring size
could result in the number of registered pages being higher than the max
allowed by the device.

In order to support large ring sizes, stop ignoring the "max" of the
range presented in the MODIFY_RING option.

Signed-off-by: Matt Olson <maolson@google.com>
Signed-off-by: Max Yuan <maxyuan@google.com>
Reviewed-by: Jordan Rhee <jordanrhee@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Link: https://patch.msgid.link/20260225182342.1049816-3-joshwash@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agogve: Update QPL page registration logic
Matt Olson [Wed, 25 Feb 2026 18:23:41 +0000 (10:23 -0800)] 
gve: Update QPL page registration logic

For DQO, change QPL page registration logic to be more flexible to honor
the "max_registered_pages" parameter from the gVNIC device.

Previously the number of RX pages per QPL was hardcoded to twice the
ring size, and the number of TX pages per QPL was dictated by the device
in the DQO-QPL device option. Now [in DQO-QPL mode], the driver will
ignore the "tx_pages_per_qpl" parameter indicated in the DQO-QPL device
option and instead allocate up to (tx_queue_length / 2) pages per TX QPL
and up to (rx_queue_length * 2) pages per RX QPL while keeping the total
number of pages under the "max_registered_pages".

Merge DQO and GQI QPL page calculation logic into a unified
gve_update_num_qpl_pages function. Add rx_pages_per_qpl to the priv
struct for consumption by both DQO and GQI.

Signed-off-by: Matt Olson <maolson@google.com>
Signed-off-by: Max Yuan <maxyuan@google.com>
Reviewed-by: Jordan Rhee <jordanrhee@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Link: https://patch.msgid.link/20260225182342.1049816-2-joshwash@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agokeys, dns: Use kmalloc_flex to improve dns_resolver_preparse
Thorsten Blum [Thu, 26 Feb 2026 21:49:29 +0000 (22:49 +0100)] 
keys, dns: Use kmalloc_flex to improve dns_resolver_preparse

Use kmalloc_flex() when allocating a new 'struct user_key_payload' in
dns_resolver_preparse() to replace the open-coded size arithmetic.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20260226214930.785423-3-thorsten.blum@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: fix sock compilation error under CONFIG_PREEMPT_RT
Jiayuan Chen [Sat, 28 Feb 2026 11:13:18 +0000 (19:13 +0800)] 
net: fix sock compilation error under CONFIG_PREEMPT_RT

When CONFIG_PREEMPT_RT is enabled, __SPIN_LOCK_UNLOCKED() expands to a
brace-enclosed initializer rather than a compound literal, which cannot
be used in assignment expressions. This causes a build failure:

  net/core/sock.c:3787:29: error: expected expression before '{' token
   3787 |                 tmp.slock = __SPIN_LOCK_UNLOCKED(tmp.slock);

Use declaration-with-initializer instead of assignment, consistent with
how __SPIN_LOCK_UNLOCKED() is used elsewhere in the kernel (e.g.
DEFINE_SPINLOCK).

Fixes: 5151ec54f586 ("net: use try_cmpxchg() in lock_sock_nested()")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228111319.79506-1-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoMerge branch 'net-ethernet-litex-minor-improvment-for-the-codebase'
Jakub Kicinski [Sat, 28 Feb 2026 03:25:19 +0000 (19:25 -0800)] 
Merge branch 'net-ethernet-litex-minor-improvment-for-the-codebase'

Inochi Amaoto says:

====================
net: ethernet: litex: minor improvment for the codebase

Improve the litex code for using the device managed function to register
netdev and replace all the "pdev->dev" with dev pointer instead.
====================

Link: https://patch.msgid.link/20260227003351.752934-1-inochiama@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: ethernet: litex: use device pointer to simplify code.
Inochi Amaoto [Fri, 27 Feb 2026 00:33:48 +0000 (08:33 +0800)] 
net: ethernet: litex: use device pointer to simplify code.

As there is already a device pointer in the probe function, replace
all "&pdev->dev" pattern with this predefined device pointer.

Signed-off-by: Inochi Amaoto <inochiama@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260227003351.752934-3-inochiama@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: ethernet: litex: use devm_register_netdev() to register netdev
Inochi Amaoto [Fri, 27 Feb 2026 00:33:47 +0000 (08:33 +0800)] 
net: ethernet: litex: use devm_register_netdev() to register netdev

Use devm_register_netdev to avoid unnecessary remove() callback in
platform_driver structure.

Signed-off-by: Inochi Amaoto <inochiama@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260227003351.752934-2-inochiama@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet/handshake: Fixed grammar mistake
Leon Kral [Fri, 27 Feb 2026 00:07:47 +0000 (00:07 +0000)] 
net/handshake: Fixed grammar mistake

The word "a" was used instead of "an" which is grammatically incorrect.
Fixed by changing from "a" to "an". This improves readability of the
documentation.

Signed-off-by: Leon Kral <leon.j.kral@protonmail.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Link: https://patch.msgid.link/20260227001151.41610-1-leon.j.kral@protonmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoNFC: fix header file kernel-doc warnings
Randy Dunlap [Thu, 26 Feb 2026 22:10:04 +0000 (14:10 -0800)] 
NFC: fix header file kernel-doc warnings

Repair some of the comments:
- use the correct enum names
- don't use "/**" for a non-kernel-doc comment

to fix these warnings:

Warning: include/uapi/linux/nfc.h:127 Excess enum value
 '@NFC_EVENT_DEVICE_DEACTIVATED' description in 'nfc_commands'
Warning: include/uapi/linux/nfc.h:204 Excess enum value
 '@NFC_ATTR_APDU' description in 'nfc_attrs'
Warning: include/uapi/linux/nfc.h:302 expecting prototype for Pseudo().
 Prototype was for NFC_RAW_HEADER_SIZE() instead

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20260226221004.1037909-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: inline skb_add_rx_frag_netmem()
Eric Dumazet [Thu, 26 Feb 2026 04:12:13 +0000 (04:12 +0000)] 
net: inline skb_add_rx_frag_netmem()

This critical helper (via skb_add_rx_frag()) is mostly used
from drivers rx fast path.

It is time to inline it, this actually saves space in vmlinux:

size vmlinux.old vmlinux
   text    data     bss     dec     hex filename
37350766        23092977        4846992 65290735        3e441ef vmlinux.old
37350600        23092977        4846992 65290569        3e44149 vmlinux

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260226041213.1892561-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoipv6: discard fragment queue earlier if there is malformed datagram
Fernando Fernandez Mancera [Wed, 25 Feb 2026 13:37:58 +0000 (14:37 +0100)] 
ipv6: discard fragment queue earlier if there is malformed datagram

Currently the kernel IPv6 implementation is not dicarding the fragment
queue upon receiving a IPv6 fragment that is not 8 bytes aligned. It
relies on queue expiration to free the queue.

While RFC 8200 section 4.5 does not explicitly mention that the rest of
fragments must be discarded, it does not make sense to keep them. The
parameter problem message is sent regardless that. In addition, if the
sender is able to re-compose the datagram so it is 8 bytes aligned it
would qualify as a new whole datagram not fitting into the same fragment
queue.

The same situation happens if segment end is exceeding the IPv6 maximum
packet length. The sooner we can free resources the better during
reassembly, the better.

Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20260225133758.4553-1-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agor8152: Add 2500baseT EEE status/configuration support
Birger Koblitz [Tue, 24 Feb 2026 17:40:14 +0000 (18:40 +0100)] 
r8152: Add 2500baseT EEE status/configuration support

The r8152 driver supports the RTL8156, which is a 2.5Gbit Ethernet controller for
USB 3.0, for which support is added for configuring and displaying the EEE
advertisement status for 2.5GBit connections.

The patch also corrects the determination of whether EEE is active to include
the 2.5GBit connection status and make the determination dependent not on the
desired speed configuration (tp->speed), but on the actual speed used by
the controller. For consistency, this is corrected also for the RTL8152/3.

This was tested on an Edimax EU-4307 V1.0 USB-Ethernet adapter with RTL8156,
and a SECOMP Value 12.99.1115 USB-C 3.1 Ethernet converter with RTL8153.

Signed-off-by: Birger Koblitz <mail@birger-koblitz.de>
Link: https://patch.msgid.link/20260224-b4-eee2g5-v2-1-cf5c83df036e@birger-koblitz.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agovmxnet3: Suppress page allocation warning for massive Rx Data ring
Aaron Tomlin [Thu, 26 Feb 2026 16:31:21 +0000 (11:31 -0500)] 
vmxnet3: Suppress page allocation warning for massive Rx Data ring

The vmxnet3 driver supports an Rx Data ring (rx-mini) to optimise the
processing of small packets. The size of this ring's DMA-coherent memory
allocation is determined by the product of the primary Rx ring size and
the data ring descriptor size:

    sz = rq->rx_ring[0].size * rq->data_ring.desc_size;

When a user configures the maximum supported parameters via ethtool
(rx_ring[0].size = 4096, data_ring.desc_size = 2048), the required
contiguous memory allocation reaches 8 MB (8,388,608 bytes).

In environments lacking Contiguous Memory Allocator (CMA),
dma_alloc_coherent() falls back to the standard zone buddy allocator. An
8 MB allocation translates to a page order of 11, which strictly exceeds
the default MAX_PAGE_ORDER (10) on most architectures.

Consequently, __alloc_pages_noprof() catches the oversize request and
triggers a loud kernel warning stack trace:

    WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)

This warning is unnecessary and alarming to system administrators because
the vmxnet3 driver already handles this allocation failure gracefully.
If dma_alloc_coherent() returns NULL, the driver safely disables the
Rx Data ring (adapter->rxdataring_enabled = false) and falls back to
standard, streaming DMA packet processing.

To resolve this, append the __GFP_NOWARN flag to the dma_alloc_coherent()
gfp_mask. This instructs the page allocator to silently fail the
allocation if it exceeds order limits or memory is too fragmented,
preventing the spurious warning stack trace.

Furthermore, enhance the subsequent netdev_err() fallback message to
include the requested allocation size. This provides critical debugging
context to the administrator (e.g., revealing that an 8 MB allocation
was attempted and failed) without making hardcoded assumptions about
the state of the system's configurations.

Reviewed-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Link: https://patch.msgid.link/20260226163121.4045808-1-atomlin@atomlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: ethernet: mtk_eth_soc: avoid writing to ESW registers on MT7628
Joris Vaisvila [Thu, 26 Feb 2026 15:45:18 +0000 (15:45 +0000)] 
net: ethernet: mtk_eth_soc: avoid writing to ESW registers on MT7628

The MT7628 has a fixed-link PHY and does not expose MAC control
registers. Writes to these registers only corrupt the ESW VLAN
configuration.

This patch explicitly registers no-op phylink_mac_ops for MT7628, as
after removing the invalid register accesses, the existing
phylink_mac_ops effectively become no-ops.

This code was introduced by commit 296c9120752b
("net: ethernet: mediatek: Add MT7628/88 SoC support")

Signed-off-by: Joris Vaisvila <joey@tinyisr.com>
Reviewed-by: Daniel Golle <daniel@makrotpia.org>
Reviewed-by: Stefan Roese <stefan.roese@mailbox.org>
Link: https://patch.msgid.link/20260226154547.68553-1-joey@tinyisr.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agotls: don't select STREAM_PARSER
Sabrina Dubroca [Thu, 26 Feb 2026 14:26:27 +0000 (15:26 +0100)] 
tls: don't select STREAM_PARSER

ktls was converted to its own stream parser in commit
84c61fe1a75b ("tls: rx: do not use the standard strparser"), but the
Kconfig dependency was left. The only part of the original strparser
that's shared with ktls are a few structs (strp_msg, sk_skb_cb) and
the strp_msg helper, those don't require building the net/strparser
code.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/cb41e513a30eeaac0b419284cc87433f049b2ee0.1771871995.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: use try_cmpxchg() in lock_sock_nested()
Eric Dumazet [Thu, 26 Feb 2026 02:12:15 +0000 (02:12 +0000)] 
net: use try_cmpxchg() in lock_sock_nested()

Add a fast path in lock_sock_nested(), to avoid acquiring
the socket spinlock only to set @owned to one:

        spin_lock_bh(&sk->sk_lock.slock);
        if (unlikely(sock_owned_by_user_nocheck(sk)))
                __lock_sock(sk);
        sk->sk_lock.owned = 1;
        spin_unlock_bh(&sk->sk_lock.slock);

On x86_64 compiler generates something quite efficient:

00000000000077c0 <lock_sock_nested>:
    77c0:       f3 0f 1e fa                 endbr64
    77c4:       e8 00 00 00 00              call   __fentry__
    77c9:       b9 01 00 00 00              mov    $0x1,%ecx
    77ce:       31 c0                       xor    %eax,%eax
    77d0:       f0 48 0f b1 8f 48 01 00 00  lock cmpxchg %rcx,0x148(%rdi)
    77d9:       75 06                       jne    slow_path
    77db:       2e e9 00 00 00 00           cs jmp __x86_return_thunk-0x4
slow_path:      ...

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20260226021215.1764237-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet/hsr: update outdated comments
Kexin Sun [Wed, 25 Feb 2026 14:51:59 +0000 (22:51 +0800)] 
net/hsr: update outdated comments

The function hsr_rcv() was renamed hsr_handle_frame() and moved to
net/hsr/hsr_slave.c by commit 81ba6afd6e64 ("net/hsr: Switch from
dev_add_pack() to netdev_rx_handler_register()").

Update all remaining references in the comments accordingly.

Signed-off-by: Kexin Sun <kexinsun@smail.nju.edu.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260225145159.2953-1-kexinsun@smail.nju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: phy: micrel: Add support for lan9645x internal phy
Jens Emil Schulz Østergaard [Thu, 26 Feb 2026 08:24:19 +0000 (09:24 +0100)] 
net: phy: micrel: Add support for lan9645x internal phy

LAN9645X is a family of switch chips with 5 internal copper phys. The
internal PHY is based on parts of LAN8832. This is a low-power, single
port triple-speed (10BASE-T/100BASE-TX/1000BASE-T) ethernet physical
layer transceiver (PHY) that supports transmission and reception of data
on standard CAT-5, as well as CAT-5e and CAT-6 Unshielded Twisted
Pair (UTP) cables.

Add support for the internal PHY of the lan9645x chip family.

Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: Jens Emil Schulz Østergaard <jensemil.schulzostergaard@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260226-phy_micrel_add_support_for_lan9645x_internal_phy-v3-1-1fe82379962b@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonetmem: remove the pp fields from net_iov
Byungchul Park [Tue, 24 Feb 2026 06:14:24 +0000 (15:14 +0900)] 
netmem: remove the pp fields from net_iov

Now that the pp fields in net_iov have no users, remove them from
net_iov and clean up.

Signed-off-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20260224061424.11219-1-byungchul@sk.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: airoha: fix typo in function name
Zhengping Zhang [Thu, 26 Feb 2026 02:37:08 +0000 (10:37 +0800)] 
net: airoha: fix typo in function name

Corrected the typo in the function name from
 `airhoa_is_lan_gdm_port` to `airoha_is_lan_gdm_port`. This change ensures
 consistency in the API naming convention.

Signed-off-by: Zhengping Zhang <aquapinn@qq.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/tencent_E4FD5D6BC0131E617D848896F5F9FCED6E0A@qq.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: atlantic: fix reading SFP module info on some AQC100 cards
Tiernan Hubble [Wed, 25 Feb 2026 00:20:25 +0000 (18:20 -0600)] 
net: atlantic: fix reading SFP module info on some AQC100 cards

Commit 853a2944aaf3 ("net: atlantic: support reading SFP module info")
added support for reading SFP module info on AQC100-based cards. However,
it only supports reading directly from the controller's hardware
registers, and this does not seem to be supported on certain cards,
including my TRENDnet TEG-10GECSFP V3. "ethtool -m" times out when reading
certain registers, even when I increase the read poll timeout values.

The DPDK "atlantic" driver reads module info via firmware calls instead of
directly reading the hardware registers, provided that the NIC's firmware
version supports it.

This change adapts the DPDK firmware call code to the kernel driver. It
preserves the old hardware-based module read code as a fallback when the
firmware does not support it, to avoid breaking cards that are currently
working.

Tested on 2 different TRENDnet TEG-10GECSFP V3 cards, both with firmware
version 3.1.121 (current at the time of this patch). Both cards correctly
reported module info for a passive DAC cable and 2 different 10G optical
transceivers.

Signed-off-by: Tiernan Hubble <thubble@thubble.ca>
Link: https://patch.msgid.link/20260225002026.1754045-1-thubble@thubble.ca
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoMerge branch 'support-phys-that-have-inband-autoneg-disabled-with-gem'
Jakub Kicinski [Fri, 27 Feb 2026 03:19:27 +0000 (19:19 -0800)] 
Merge branch 'support-phys-that-have-inband-autoneg-disabled-with-gem'

Charles Perry says:

====================
Support PHYs that have inband autoneg disabled with GEM

I'm testing SGMII with a VSC8574 PHY [1] and microchip HPSC SoC [2].

The link can work with or without autoneg, as long as the MAC and the PHY
are configured the same way. This doesn't work with the current MAC driver
because the MAC inband autoneg is always enabled (in the ->mac_config()
phylink_mac_ops). More precisely, the PHY driver (mscc_main.c) has
phylink's ->config_inband() implemented while the MAC ->pcs_config() ops
has an empty body.

This is based on code written by Sean Anderson [3]. Let me know if I
should add a From: or Co-developed-by: tag.

Logs with inband autoneg (managed = "in-band-status"):

  root@p64h:~# ifconfig eth1 up 10.180.59.33
  macb 40004184000.ethernet eth1: PHY 4000c21e000.mdio-mdio:02 doesn't supply possible interfaces
  macb 40004184000.ethernet eth1: PHY [4000c21e000.mdio-mdio:02] driver [Microsemi GE VSC8574 SyncE] (irq=POLL)
  macb 40004184000.ethernet eth1: phy: sgmii setting supported 00000000,00000000,00000000,000042ff advertising 00000000,00000000,00000000,000042ff
  macb 40004184000.ethernet eth1: configuring for inband/sgmii link mode
  macb 40004184000.ethernet eth1: major config, requested inband/sgmii
  macb 40004184000.ethernet eth1: interface sgmii inband modes: pcs=03 phy=03
  macb 40004184000.ethernet eth1: major config, active inband/inband,an-enabled/sgmii
  macb 40004184000.ethernet eth1: phylink_mac_config: mode=inband/sgmii/none adv=00000000,00000000,00000000,000042ff pause=00
  macb_pcs_config: PCSANADV=0x1 PCSCNTRL=0x1040
  macb_pcs_get_state: PCSSTS=0x109 PCSANLPBASE=0x1
  macb_pcs_get_state: PCSSTS=0x12d PCSANLPBASE=0x1801
  macb 40004184000.ethernet eth1: phy link down sgmii/Unknown/Unknown/none/off/nolpi
  macb_pcs_get_state: PCSSTS=0x12d PCSANLPBASE=0x1801
  macb_pcs_get_state: PCSSTS=0x12d PCSANLPBASE=0x1801
  macb 40004184000.ethernet eth1: phy link up sgmii/1Gbps/Full/none/tx/nolpi
  macb_pcs_get_state: PCSSTS=0x129 PCSANLPBASE=0x9801
  macb_pcs_get_state: PCSSTS=0x12d PCSANLPBASE=0x9801
  macb 40004184000.ethernet eth1: Link is Up - 1Gbps/Full - flow control tx

Logs without inband autoneg:

  root@p64h:~# ifconfig eth1 up 10.180.59.33
  macb 40004184000.ethernet eth1: PHY 4000c21e000.mdio-mdio:02 doesn't supply possible interfaces
  macb 40004184000.ethernet eth1: PHY [4000c21e000.mdio-mdio:02] driver [Microsemi GE VSC8574 SyncE] (irq=POLL)
  macb 40004184000.ethernet eth1: phy: sgmii setting supported 00000000,00000000,00000000,000042ff advertising 00000000,00000000,00000000,000042ff
  macb 40004184000.ethernet eth1: configuring for phy/sgmii link mode
  macb 40004184000.ethernet eth1: major config, requested phy/sgmii
  macb 40004184000.ethernet eth1: interface sgmii inband modes: pcs=03 phy=03
  macb 40004184000.ethernet eth1: major config, active phy/outband/sgmii
  macb 40004184000.ethernet eth1: phylink_mac_config: mode=phy/sgmii/none adv=00000000,00000000,00000000,00000000 pause=00
  macb_pcs_config: PCSANADV=0x1 PCSCNTRL=0x40
  macb 40004184000.ethernet eth1: phy link down sgmii/Unknown/Unknown/none/off/nolpi
  macb 40004184000.ethernet eth1: phy link up sgmii/1Gbps/Full/none/tx/nolpi
  macb 40004184000.ethernet eth1: Link is Up - 1Gbps/Full - flow control tx

The above logs are generated with an additional printk() in macb_psc_config()
and macb_pcs_get_state() and "#define DEBUG" in phylink.c.

[1]: https://www.microchip.com/en-us/product/vsc8574
[2]: https://www.microchip.com/en-us/products/microprocessors/64-bit-mpus/pic64-hpsc
[3]: https://lore.kernel.org/all/20250610233547.3588356-1-sean.anderson@linux.dev/
====================

Link: https://patch.msgid.link/20260224202854.112813-1-charles.perry@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: macb: add the .pcs_inband_caps() callback for SGMII
Charles Perry [Tue, 24 Feb 2026 20:28:54 +0000 (12:28 -0800)] 
net: macb: add the .pcs_inband_caps() callback for SGMII

In SGMII mode, GEM can work with or without inband
autonegotiation.

Signed-off-by: Charles Perry <charles.perry@microchip.com>
Link: https://patch.msgid.link/20260224202854.112813-4-charles.perry@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: macb: add support for reporting SGMII inband link status
Charles Perry [Tue, 24 Feb 2026 20:28:53 +0000 (12:28 -0800)] 
net: macb: add support for reporting SGMII inband link status

This makes it possible to use in-band autonegotiation with
SGMII.

If using a device tree, this can be done by adding the managed =
"in-band-status" property to the gem node.

Signed-off-by: Charles Perry <charles.perry@microchip.com>
Link: https://patch.msgid.link/20260224202854.112813-3-charles.perry@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: macb: fix SGMII with inband aneg disabled
Charles Perry [Tue, 24 Feb 2026 20:28:52 +0000 (12:28 -0800)] 
net: macb: fix SGMII with inband aneg disabled

Make it possible to connect a PHY which does not use inband
autoneg to a gem MAC using phylink's information.

The previous implementation relied on whether or not the link
was a fixed-link to disable SGMII autoneg. This commit extend
this to all link which are not configured for inband
autonegotiation.

Signed-off-by: Charles Perry <charles.perry@microchip.com>
Link: https://patch.msgid.link/20260224202854.112813-2-charles.perry@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoinet: remove three EXPORT_SYMBOL()
Eric Dumazet [Wed, 25 Feb 2026 13:40:23 +0000 (13:40 +0000)] 
inet: remove three EXPORT_SYMBOL()

inet_rcv_saddr_equal() and inet_csk_listen_stop() are not used
from any modules.

inet_csk_accept() can use EXPORT_IPV6_MOD()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260225134023.1176738-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agodocs: ethtool: clarify the bit-by-bit bitset format description
Yohei Kojima [Wed, 25 Feb 2026 15:12:09 +0000 (00:12 +0900)] 
docs: ethtool: clarify the bit-by-bit bitset format description

Clarify the bit-by-bit bitset format's behavior around mandatory
attributes and bit identification. More specifically, the following
changes are made:

* Rephrase a misleading sentence which implies name and index are
  mutually exclusive
* Describe that ETHTOOL_A_BITSET_BITS nest is mandatory
* Describe that a request fails if inconsistent identifiers are given

Signed-off-by: Yohei Kojima <yk@y-koj.net>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/ef90a56965ca66e57aa177929ce3e10c5ca815fa.1772031974.git.yk@y-koj.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoocteontx2-af: CGX: replace kfree() with rvu_free_bitmap()
Bo Sun [Wed, 25 Feb 2026 08:23:48 +0000 (16:23 +0800)] 
octeontx2-af: CGX: replace kfree() with rvu_free_bitmap()

mac_to_index_bmap is allocated with rvu_alloc_bitmap(), so free it
with rvu_free_bitmap() instead of open-coding kfree(.bmap) to keep
the alloc/free API pairing consistent.

Signed-off-by: Bo Sun <bo@mboxify.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Jijie Shao <shaojijie@huawei.com>
Link: https://patch.msgid.link/20260225082348.2519131-1-bo@mboxify.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agodocs: net: document neigh gc_interval sysctl
Gabriel Goller [Wed, 25 Feb 2026 09:58:10 +0000 (10:58 +0100)] 
docs: net: document neigh gc_interval sysctl

Add entry for the neigh/default/gc_interval sysctl. This sysctl is
unused since kernel v2.6.8.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Gabriel Goller <g.goller@proxmox.com>
Link: https://patch.msgid.link/20260225095822.44050-1-g.goller@proxmox.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: stmmac: ptp: limit n_per_out
Russell King (Oracle) [Wed, 25 Feb 2026 10:01:19 +0000 (10:01 +0000)] 
net: stmmac: ptp: limit n_per_out

ptp_clock_ops.n_per_out sets the number of PPS outputs, which the PTP
subsystem uses to validate userspace input, such as the index number
used in a PTP_CLK_REQ_PEROUT request.

stmmac_enable() uses this to index the priv->pps array, which is an
array of size STMMAC_PPS_MAX. ptp_clock_ops.n_per_out is initialised
using priv->dma_cap.pps_out_num, which is a three bit field read from
hardware.

Documentation that I've checked suggests that values >= 5 are reserved,
but that doesn't mean such values won't appear, and if they do, we
can overrun the priv->pps array in stmmac_enable().

stmmac_ptp_register() has protection against this in its loop, but it
doesn't act to limit ptp_clock_ops.n_per_out.

Fix this by introducing a local variable, pps_out_num which is limited
to STMMAC_PPS_MAX, and use that when initialising the array and setting
priv->ptp_clock_ops.n_per_out. Print a warning when we limit the number
of outputs.

Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vvBhn-0000000ArCg-4C4u@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Jakub Kicinski [Thu, 26 Feb 2026 18:20:47 +0000 (10:20 -0800)] 
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-7.0-rc2).

Conflicts:

tools/testing/selftests/drivers/net/hw/rss_ctx.py
  19c3a2a81d2b ("selftests: drv-net: rss: Generate unique ports for RSS context tests")
  ce5a0f4612db ("selftests: drv-net: rss_ctx: test RSS contexts persist after ifdown/up")

include/net/inet_connection_sock.h
  858d2a4f67ff6 ("tcp: fix potential race in tcp_v6_syn_recv_sock()")
  fcd3d039fab69 ("tcp: make tcp_v{4,6}_send_check() static")
https://lore.kernel.org/aZ8PSFLzBrEU3I89@sirena.org.uk

drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/pool.c
  69050f8d6d075 ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
  bf4afc53b77ae ("Convert 'alloc_obj' family to use the new default GFP_KERNEL argument")
  8a96b9144f18a ("net/mlx5e: Alloc xsk channel param out of mlx5e_open_xsk()")

Adjacent changes:

net/netfilter/ipvs/ip_vs_ctl.c
  c59bd9e62e06 ("ipvs: use more counters to avoid service lookups")
  bf4afc53b77a ("Convert 'alloc_obj' family to use the new default GFP_KERNEL argument")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoMerge tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 26 Feb 2026 16:00:13 +0000 (08:00 -0800)] 
Merge tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from IPsec, Bluetooth and netfilter

  Current release - regressions:

   - wifi: fix dev_alloc_name() return value check

   - rds: fix recursive lock in rds_tcp_conn_slots_available

  Current release - new code bugs:

   - vsock: lock down child_ns_mode as write-once

  Previous releases - regressions:

   - core:
      - do not pass flow_id to set_rps_cpu()
      - consume xmit errors of GSO frames

   - netconsole: avoid OOB reads, msg is not nul-terminated

   - netfilter: h323: fix OOB read in decode_choice()

   - tcp: re-enable acceptance of FIN packets when RWIN is 0

   - udplite: fix null-ptr-deref in __udp_enqueue_schedule_skb().

   - wifi: brcmfmac: fix potential kernel oops when probe fails

   - phy: register phy led_triggers during probe to avoid AB-BA deadlock

   - eth:
      - bnxt_en: fix deleting of Ntuple filters
      - wan: farsync: fix use-after-free bugs caused by unfinished tasklets
      - xscale: check for PTP support properly

  Previous releases - always broken:

   - tcp: fix potential race in tcp_v6_syn_recv_sock()

   - kcm: fix zero-frag skb in frag_list on partial sendmsg error

   - xfrm:
      - fix race condition in espintcp_close()
      - always flush state and policy upon NETDEV_UNREGISTER event

   - bluetooth:
      - purge error queues in socket destructors
      - fix response to L2CAP_ECRED_CONN_REQ

   - eth:
      - mlx5:
         - fix circular locking dependency in dump
         - fix "scheduling while atomic" in IPsec MAC address query
      - gve: fix incorrect buffer cleanup for QPL
      - team: avoid NETDEV_CHANGEMTU event when unregistering slave
      - usb: validate USB endpoints"

* tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits)
  netfilter: nf_conntrack_h323: fix OOB read in decode_choice()
  dpaa2-switch: validate num_ifs to prevent out-of-bounds write
  net: consume xmit errors of GSO frames
  vsock: document write-once behavior of the child_ns_mode sysctl
  vsock: lock down child_ns_mode as write-once
  selftests/vsock: change tests to respect write-once child ns mode
  net/mlx5e: Fix "scheduling while atomic" in IPsec MAC address query
  net/mlx5: Fix missing devlink lock in SRIOV enable error path
  net/mlx5: E-switch, Clear legacy flag when moving to switchdev
  net/mlx5: LAG, disable MPESW in lag_disable_change()
  net/mlx5: DR, Fix circular locking dependency in dump
  selftests: team: Add a reference count leak test
  team: avoid NETDEV_CHANGEMTU event when unregistering slave
  net: mana: Fix double destroy_workqueue on service rescan PCI path
  MAINTAINERS: Update maintainer entry for QUALCOMM ETHQOS ETHERNET DRIVER
  dpll: zl3073x: Remove redundant cleanup in devm_dpll_init()
  selftests/net: packetdrill: Verify acceptance of FIN packets when RWIN is 0
  tcp: re-enable acceptance of FIN packets when RWIN is 0
  vsock: Use container_of() to get net namespace in sysctl handlers
  net: usb: kaweth: validate USB endpoints
  ...

8 weeks agonetfilter: nf_conntrack_h323: fix OOB read in decode_choice()
Vahagn Vardanian [Wed, 25 Feb 2026 13:06:18 +0000 (14:06 +0100)] 
netfilter: nf_conntrack_h323: fix OOB read in decode_choice()

In decode_choice(), the boundary check before get_len() uses the
variable `len`, which is still 0 from its initialization at the top of
the function:

    unsigned int type, ext, len = 0;
    ...
    if (ext || (son->attr & OPEN)) {
        BYTE_ALIGN(bs);
        if (nf_h323_error_boundary(bs, len, 0))  /* len is 0 here */
            return H323_ERROR_BOUND;
        len = get_len(bs);                        /* OOB read */

When the bitstream is exactly consumed (bs->cur == bs->end), the check
nf_h323_error_boundary(bs, 0, 0) evaluates to (bs->cur + 0 > bs->end),
which is false.  The subsequent get_len() call then dereferences
*bs->cur++, reading 1 byte past the end of the buffer.  If that byte
has bit 7 set, get_len() reads a second byte as well.

This can be triggered remotely by sending a crafted Q.931 SETUP message
with a User-User Information Element containing exactly 2 bytes of
PER-encoded data ({0x08, 0x00}) to port 1720 through a firewall with
the nf_conntrack_h323 helper active.  The decoder fully consumes the
PER buffer before reaching this code path, resulting in a 1-2 byte
heap-buffer-overflow read confirmed by AddressSanitizer.

Fix this by checking for 2 bytes (the maximum that get_len() may read)
instead of the uninitialized `len`.  This matches the pattern used at
every other get_len() call site in the same file, where the caller
checks for 2 bytes of available data before calling get_len().

Fixes: ec8a8f3c31dd ("netfilter: nf_ct_h323: Extend nf_h323_error_boundary to work on bits as well")
Signed-off-by: Vahagn Vardanian <vahagn@redrays.io>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260225130619.1248-2-fw@strlen.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agodpaa2-switch: validate num_ifs to prevent out-of-bounds write
Junrui Luo [Tue, 24 Feb 2026 11:05:56 +0000 (19:05 +0800)] 
dpaa2-switch: validate num_ifs to prevent out-of-bounds write

The driver obtains sw_attr.num_ifs from firmware via dpsw_get_attributes()
but never validates it against DPSW_MAX_IF (64). This value controls
iteration in dpaa2_switch_fdb_get_flood_cfg(), which writes port indices
into the fixed-size cfg->if_id[DPSW_MAX_IF] array. When firmware reports
num_ifs >= 64, the loop can write past the array bounds.

Add a bound check for num_ifs in dpaa2_switch_init().

dpaa2_switch_fdb_get_flood_cfg() appends the control interface (port
num_ifs) after all matched ports. When num_ifs == DPSW_MAX_IF and all
ports match the flood filter, the loop fills all 64 slots and the control
interface write overflows by one entry.

The check uses >= because num_ifs == DPSW_MAX_IF is also functionally
broken.

build_if_id_bitmap() silently drops any ID >= 64:
      if (id[i] < DPSW_MAX_IF)
          bmap[id[i] / 64] |= ...

Fixes: 539dda3c5d19 ("staging: dpaa2-switch: properly setup switching domains")
Signed-off-by: Junrui Luo <moonafterrain@outlook.com>
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Link: https://patch.msgid.link/SYBPR01MB78812B47B7F0470B617C408AAF74A@SYBPR01MB7881.ausprd01.prod.outlook.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agobonding: print churn state via netlink
Hangbin Liu [Tue, 24 Feb 2026 02:02:14 +0000 (02:02 +0000)] 
bonding: print churn state via netlink

Currently, the churn state is printed only in sysfs. Add netlink support
so users could get the state via netlink.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20260224020215.6012-1-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agopppoe: remove kernel-mode relay support
Qingfang Deng [Tue, 24 Feb 2026 01:50:52 +0000 (09:50 +0800)] 
pppoe: remove kernel-mode relay support

The kernel-mode PPPoE relay feature and its two associated ioctls
(PPPOEIOCSFWD and PPPOEIOCDFWD) are not used by any existing userspace
PPPoE implementations. The most commonly-used package, RP-PPPoE [1],
handles the relaying entirely in userspace.

This legacy code has remained in the driver since its introduction in
kernel 2.3.99-pre7 for over two decades, but has served no practical
purpose.

Remove the unused relay code.

[1] https://dianne.skoll.ca/projects/rp-pppoe/

Signed-off-by: Qingfang Deng <dqfext@gmail.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20260224015053.42472-1-dqfext@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet: consume xmit errors of GSO frames
Jakub Kicinski [Mon, 23 Feb 2026 23:51:00 +0000 (15:51 -0800)] 
net: consume xmit errors of GSO frames

udpgro_frglist.sh and udpgro_bench.sh are the flakiest tests
currently in NIPA. They fail in the same exact way, TCP GRO
test stalls occasionally and the test gets killed after 10min.

These tests use veth to simulate GRO. They attach a trivial
("return XDP_PASS;") XDP program to the veth to force TSO off
and NAPI on.

Digging into the failure mode we can see that the connection
is completely stuck after a burst of drops. The sender's snd_nxt
is at sequence number N [1], but the receiver claims to have
received (rcv_nxt) up to N + 3 * MSS [2]. Last piece of the puzzle
is that senders rtx queue is not empty (let's say the block in
the rtx queue is at sequence number N - 4 * MSS [3]).

In this state, sender sends a retransmission from the rtx queue
with a single segment, and sequence numbers N-4*MSS:N-3*MSS [3].
Receiver sees it and responds with an ACK all the way up to
N + 3 * MSS [2]. But sender will reject this ack as TCP_ACK_UNSENT_DATA
because it has no recollection of ever sending data that far out [1].
And we are stuck.

The root cause is the mess of the xmit return codes. veth returns
an error when it can't xmit a frame. We end up with a loss event
like this:

  -------------------------------------------------
  |   GSO super frame 1   |   GSO super frame 2   |
  |-----------------------------------------------|
  | seg | seg | seg | seg | seg | seg | seg | seg |
  |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |
  -------------------------------------------------
     x    ok    ok    <ok>|  ok    ok    ok   <x>
                          \\
   snd_nxt

"x" means packet lost by veth, and "ok" means it went thru.
Since veth has TSO disabled in this test it sees individual segments.
Segment 1 is on the retransmit queue and will be resent.

So why did the sender not advance snd_nxt even tho it clearly did
send up to seg 8? tcp_write_xmit() interprets the return code
from the core to mean that data has not been sent at all. Since
TCP deals with GSO super frames, not individual segment the crux
of the problem is that loss of a single segment can be interpreted
as loss of all. TCP only sees the last return code for the last
segment of the GSO frame (in <> brackets in the diagram above).

Of course for the problem to occur we need a setup or a device
without a Qdisc. Otherwise Qdisc layer disconnects the protocol
layer from the device errors completely.

We have multiple ways to fix this.

 1) make veth not return an error when it lost a packet.
    While this is what I think we did in the past, the issue keeps
    reappearing and it's annoying to debug. The game of whack
    a mole is not great.

 2) fix the damn return codes
    We only talk about NETDEV_TX_OK and NETDEV_TX_BUSY in the
    documentation, so maybe we should make the return code from
    ndo_start_xmit() a boolean. I like that the most, but perhaps
    some ancient, not-really-networking protocol would suffer.

 3) make TCP ignore the errors
    It is not entirely clear to me what benefit TCP gets from
    interpreting the result of ip_queue_xmit()? Specifically once
    the connection is established and we're pushing data - packet
    loss is just packet loss?

 4) this fix
    Ignore the rc in the Qdisc-less+GSO case, since it's unreliable.
    We already always return OK in the TCQ_F_CAN_BYPASS case.
    In the Qdisc-less case let's be a bit more conservative and only
    mask the GSO errors. This path is taken by non-IP-"networks"
    like CAN, MCTP etc, so we could regress some ancient thing.
    This is the simplest, but also maybe the hackiest fix?

Similar fix has been proposed by Eric in the past but never committed
because original reporter was working with an OOT driver and wasn't
providing feedback (see Link).

Link: https://lore.kernel.org/CANn89iJcLepEin7EtBETrZ36bjoD9LrR=k4cfwWh046GB+4f9A@mail.gmail.com
Fixes: 1f59533f9ca5 ("qdisc: validate frames going through the direct_xmit path")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260223235100.108939-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agoMerge branch 'vsock-add-write-once-semantics-to-child_ns_mode'
Paolo Abeni [Thu, 26 Feb 2026 10:10:05 +0000 (11:10 +0100)] 
Merge branch 'vsock-add-write-once-semantics-to-child_ns_mode'

Bobby Eshleman says:

====================
vsock: add write-once semantics to child_ns_mode

Two administrator processes may race when setting child_ns_mode: one
sets it to "local" and creates a namespace, but another changes it to
"global" in between. The first process ends up with a namespace in the
wrong mode. Make child_ns_mode write-once so that a namespace manager
can set it once, check the value, and be guaranteed it won't change
before creating its namespaces. Writing a different value after the
first write returns -EBUSY.

One patch for the implementation, one for docs, and one for tests.

v2: https://lore.kernel.org/r/20260218-vsock-ns-write-once-v2-0-19e4c50d509a@meta.com
v1: https://lore.kernel.org/r/20260217-vsock-ns-write-once-v1-1-a1fb30f289a9@meta.com
====================

Link: https://patch.msgid.link/20260223-vsock-ns-write-once-v3-0-c0cde6959923@meta.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agovsock: document write-once behavior of the child_ns_mode sysctl
Bobby Eshleman [Mon, 23 Feb 2026 22:38:34 +0000 (14:38 -0800)] 
vsock: document write-once behavior of the child_ns_mode sysctl

Update the vsock child_ns_mode documentation to include the new
write-once semantics of setting child_ns_mode. The semantics are
implemented in a preceding patch in this series.

Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20260223-vsock-ns-write-once-v3-3-c0cde6959923@meta.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agovsock: lock down child_ns_mode as write-once
Bobby Eshleman [Mon, 23 Feb 2026 22:38:33 +0000 (14:38 -0800)] 
vsock: lock down child_ns_mode as write-once

Two administrator processes may race when setting child_ns_mode as one
process sets child_ns_mode to "local" and then creates a namespace, but
another process changes child_ns_mode to "global" between the write and
the namespace creation. The first process ends up with a namespace in
"global" mode instead of "local". While this can be detected after the
fact by reading ns_mode and retrying, it is fragile and error-prone.

Make child_ns_mode write-once so that a namespace manager can set it
once and be sure it won't change. Writing a different value after the
first write returns -EBUSY. This applies to all namespaces, including
init_net, where an init process can write "local" to lock all future
namespaces into local mode.

Fixes: eafb64f40ca4 ("vsock: add netns to vsock core")
Suggested-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Co-developed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20260223-vsock-ns-write-once-v3-2-c0cde6959923@meta.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agoselftests/vsock: change tests to respect write-once child ns mode
Bobby Eshleman [Mon, 23 Feb 2026 22:38:32 +0000 (14:38 -0800)] 
selftests/vsock: change tests to respect write-once child ns mode

The child_ns_mode sysctl parameter becomes write-once in a future patch
in this series, which breaks existing tests. This patch updates the
tests to respect this new policy. No additional tests are added.

Add "global-parent" and "local-parent" namespaces as intermediaries to
spawn namespaces in the given modes. This avoids the need to change
"child_ns_mode" in the init_ns. nsenter must be used because ip netns
unshares the mount namespace so nested "ip netns add" breaks exec calls
from the init ns. Adds nsenter to the deps check.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260223-vsock-ns-write-once-v3-1-c0cde6959923@meta.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agoMerge branch 'net-mlx5e-shampo-allow-high-order-pages-in-zerocopy-mode'
Paolo Abeni [Thu, 26 Feb 2026 09:54:41 +0000 (10:54 +0100)] 
Merge branch 'net-mlx5e-shampo-allow-high-order-pages-in-zerocopy-mode'

Tariq Toukan says:

====================
net/mlx5e: SHAMPO, Allow high order pages in zerocopy mode

This series adds support for high order pages when io_uring/devmem
zero copy is used.

See detailed description by Dragos below.

The first patches are moving code around to allow using queue specific
parameters that are not just for XSK. They are a bit large as they touch
a lot of functions.

The middle part of the series is updating various formulas to remove
remaining hardcoded use of PAGE_SIZE/PAGE_SHIFT.

The last part adds support for high order pages by implementing the
queue configuration functions and allowing larger rx_page_size
configurations when in zero-copy mode.

Results show an increase in BW and a decrease in CPU usage.
The benchmark was done with the zcrx samples from liburing [0].

rx_buf_len=4K, oncpu [1]:
packets=3358832 (MB=820027), rps=55794 (MB/s=13621)
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:       9    1.56    0.00   18.09   13.42    0.00   66.80    0.00    0.00    0.00    0.12

rx_buf_len=128K, oncpu [2]:
packets=3781376 (MB=923187), rps=62813 (MB/s=15335)
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:       9    0.33    0.00    7.61   18.86    0.00   73.08    0.00    0.00    0.00    0.12

rx_buf_len=4K, offcpu [3]:
packets=3460368 (MB=844816), rps=57481 (MB/s=14033)
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:       9    0.00    0.00    0.26    0.00    0.00   92.63    0.00    0.00    0.00    7.11
Average:      11    3.04    0.00   68.09   28.87    0.00    0.00    0.00    0.00    0.00    0.00

rx_buf_len=128K, offcpu [4]:
packets=4119840 (MB=1005820), rps=68435 (MB/s=16707)
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:       9    0.00    0.00    0.87    0.00    0.00   63.77    0.00    0.00    0.00   35.36
Average:      11    1.96    0.00   43.68   54.37    0.00    0.00    0.00    0.00    0.00    0.00

[0] https://github.com/isilence/liburing/tree/zcrx/rx-buf-len

[1] commands:
  $> taskset -c 9 ./zcrx 6 -i eth2 -q 9 -A 1 -B 4096 -S 33554432
  $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000

[2] commands:
  $> taskset -c 9 ./zcrx 6 -i eth2 -q 9 -A 1 -B 131072 -S 33554432
  $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000

[3] commands:
  $> taskset -c 11 ./zcrx 6 -i eth2 -q 9 -A 1 -B 4096 -S 33554432
  $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000

[4] commands:
  $> taskset -c 11 ./zcrx 6 -i eth2 -q 9 -A 1 -B 131072 -S 33554432
  $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000
====================

Link: https://patch.msgid.link/20260223204155.1783580-1-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet/mlx5e: SHAMPO, Allow high order pages in zerocopy mode
Dragos Tatulea [Mon, 23 Feb 2026 20:41:55 +0000 (22:41 +0200)] 
net/mlx5e: SHAMPO, Allow high order pages in zerocopy mode

Allow high order pages only when SHAMPO mode is enabled (hw-gro) and the
queue is used for zerocopy (has memory provider ops set). The limit is
128K and it was chosen for the following reasons:
- 256K size requires a special case during MTT calculation to split the
  page in two. That's because two MTTs are needed to form an octword.
- Higher sizes require increasing WQE size and/or reducing the number
  of WQEs.
- Having the RQ lined with too few large pages can lead to refill
  issues.

Results show an increase in BW and a decrease in CPU usage.
The benchmark was done with the zcrx samples from liburing [0].

rx_buf_len=4K, oncpu [1]:
packets=3358832 (MB=820027), rps=55794 (MB/s=13621)
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:       9    1.56    0.00   18.09   13.42    0.00   66.80    0.00    0.00    0.00    0.12

rx_buf_len=128K, oncpu [2]:
packets=3781376 (MB=923187), rps=62813 (MB/s=15335)
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:       9    0.33    0.00    7.61   18.86    0.00   73.08    0.00    0.00    0.00    0.12

rx_buf_len=4K, offcpu [3]:
packets=3460368 (MB=844816), rps=57481 (MB/s=14033)
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:       9    0.00    0.00    0.26    0.00    0.00   92.63    0.00    0.00    0.00    7.11
Average:      11    3.04    0.00   68.09   28.87    0.00    0.00    0.00    0.00    0.00    0.00

rx_buf_len=128K, offcpu [4]:
packets=4119840 (MB=1005820), rps=68435 (MB/s=16707)
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:       9    0.00    0.00    0.87    0.00    0.00   63.77    0.00    0.00    0.00   35.36
Average:      11    1.96    0.00   43.68   54.37    0.00    0.00    0.00    0.00    0.00    0.00

[0] https://github.com/isilence/liburing/tree/zcrx/rx-buf-len

[1] commands:
  $> taskset -c 9 ./zcrx 6 -i eth2 -q 9 -A 1 -B 4096 -S 33554432
  $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000

[2] commands:
  $> taskset -c 9 ./zcrx 6 -i eth2 -q 9 -A 1 -B 131072 -S 33554432
  $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000

[3] commands:
  $> taskset -c 11 ./zcrx 6 -i eth2 -q 9 -A 1 -B 4096 -S 33554432
  $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000

[4] commands:
  $> taskset -c 11 ./zcrx 6 -i eth2 -q 9 -A 1 -B 131072 -S 33554432
  $> ./send-zerocopy tcp -6 -D 2001:db8::1 -t 60 -C 0 -l 1 -b 1 -n 1 -z 1 -d -s 256000

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-16-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet/mlx5e: Add param helper to calculate max page size
Dragos Tatulea [Mon, 23 Feb 2026 20:41:54 +0000 (22:41 +0200)] 
net/mlx5e: Add param helper to calculate max page size

This function will be necessary to determine the upper limit of
rx-page-size.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-15-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet/mlx5e: Pass netdev queue config to param calculations
Dragos Tatulea [Mon, 23 Feb 2026 20:41:53 +0000 (22:41 +0200)] 
net/mlx5e: Pass netdev queue config to param calculations

If set, take rx_page_size into consideration when calculating
the page shift in Multi Packet WQE mode.

The queue config is saved in the mlx5e_rq_opt_param struct which is
added to the mlx5e_channel_param struct. Now the configuration can be
read from the struct instead of adding it as an argument to all call
sites. For consistency, the queue config is assigned in
mlx5e_build_channel_param().

The queue configuration is read only from queue management ops
as that's the only place where it is currently useful. Furthermore,
netdev_queue_config() expects netdev->queue_mgmt_ops to be
set which is not always the case (representor netdevs).

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-14-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet/mlx5e: Add queue config ops for page size
Dragos Tatulea [Mon, 23 Feb 2026 20:41:52 +0000 (22:41 +0200)] 
net/mlx5e: Add queue config ops for page size

For now allow only PAGE_SIZE. A subsequent patch will add support for
high order pages in zero-copy mode.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-13-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet/mlx5e: RX, Make page frag bias more robust
Dragos Tatulea [Mon, 23 Feb 2026 20:41:51 +0000 (22:41 +0200)] 
net/mlx5e: RX, Make page frag bias more robust

The formula uses the system page size but does not account
for high order pages.

One way to fix this would be to adapt the formula to take
into account the pool order. This would require calculating it
for every allocation or adding an additional rq struct member to
hold the bias max.

However, the above is not really needed as the driver doesn't
check the bias value. It has other means to calculate the expected
number of fragments based on context.

This patch simply sets the value to the max possible value. A sanity
check is added during queue init phase to avoid having really big pages
from using more fragments than the type can fit.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-12-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet/mlx5e: Alloc rq drop page based on calculated page_shift
Dragos Tatulea [Mon, 23 Feb 2026 20:41:50 +0000 (22:41 +0200)] 
net/mlx5e: Alloc rq drop page based on calculated page_shift

An upcoming patch will allow setting the page order for RX
pages to be greater than 0. Make sure that the drop page will
also be allocated with the right size when that happens.

Take extra care when calculating the drop page size to
account for page_shift < PAGE_SHIFT which can happen for xsk.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-11-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet/mlx5e: Set page_pool order based on calculated page_shift
Dragos Tatulea [Mon, 23 Feb 2026 20:41:49 +0000 (22:41 +0200)] 
net/mlx5e: Set page_pool order based on calculated page_shift

Instead of unconditionally setting the page_pool to 0, calculate it from
page_shift for MPWQE case.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-10-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet/mlx5e: SHAMPO, Always calculate page size
Dragos Tatulea [Mon, 23 Feb 2026 20:41:48 +0000 (22:41 +0200)] 
net/mlx5e: SHAMPO, Always calculate page size

Adapt the rx path in SHAMPO mode to calculate page size based on
configured page_shift when dealing with payload data.

This is necessary as an upcoming patch will add support for using
different page sizes.

This change has no functional changes.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-9-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet/mlx5e: Drop unused channel parameters
Dragos Tatulea [Mon, 23 Feb 2026 20:41:47 +0000 (22:41 +0200)] 
net/mlx5e: Drop unused channel parameters

The channel parameters from struct mlx5_qmgmt_data are
built in mlx5e_queue_mem_alloc() but are not used.

mlx5e_open_channel() builds the channel parameters internally and those
parameters will be the ones that are used when opening the queue.

This patch drops the unused parameters.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-8-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet/mlx5e: Move xsk param into new option container struct
Dragos Tatulea [Mon, 23 Feb 2026 20:41:46 +0000 (22:41 +0200)] 
net/mlx5e: Move xsk param into new option container struct

The xsk parameter configuration (struct mlx5e_xsk_param) is passed
around many places during parameter calculation. It is used to contain
channel specific information (as opposed to the global info from
struct mlx5e_params).

Upcoming changes will need to push similar channel specific rq
configuration. Instead of adding one more parameter to all these
functions, create a new container structure that has optional rq
specific parameters. The xsk parameter will be the first of such kind.

The new container struct is itself optional. That means that before
checking its members, it has to be checked itself for validity.

This patch has no functional changes.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-7-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet/mlx5e: Alloc xsk channel param out of mlx5e_open_xsk()
Dragos Tatulea [Mon, 23 Feb 2026 20:41:45 +0000 (22:41 +0200)] 
net/mlx5e: Alloc xsk channel param out of mlx5e_open_xsk()

Currently the allocation and filling of the xsk channel
parameters was done in mlx5e_open_xsk().

Move this responsibility out of mlx5e_open_xsk() and have
the function take an already filled mlx5e_channel_param.

mlx5e_open_channel() already allocates channel parameters.
The only precaution that is needed is to call
mlx5e_build_xsk_channel_param() before mlx5e_open_xsk().

mlx5e_xsk_enable_locked() now allocates and fills the xsk parameters.

For simplicity, link the xsk parameters in struct mlx5e_channel_params
so that channel params can be passed around.

This patch has no functional changes.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-6-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet/mlx5e: Expose and rename xsk channel parameter function
Dragos Tatulea [Mon, 23 Feb 2026 20:41:44 +0000 (22:41 +0200)] 
net/mlx5e: Expose and rename xsk channel parameter function

mlx5e_build_xsk_cparam() is meant to be the alternative
to mlx5e_build_channel_param(). It calculates only the parameters
that it requires using the previously configured mlx5e_xsk_param.

Move this function to params.c to be alongside
mlx5e_build_channel_param() and give it a similar name.

Expose the function as it will be needed by upcoming changes.

This patch has no functional changes.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-5-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet/mlx5e: Extract max_xsk_wqebbs into its own function
Dragos Tatulea [Mon, 23 Feb 2026 20:41:43 +0000 (22:41 +0200)] 
net/mlx5e: Extract max_xsk_wqebbs into its own function

Calculating max_xsk_wqebbs seems large enough to deserve its own
function. It will make upcoming changes easier.

This patch has no functional changes.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-4-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet/mlx5e: Extract striding rq param calculation in function
Dragos Tatulea [Mon, 23 Feb 2026 20:41:42 +0000 (22:41 +0200)] 
net/mlx5e: Extract striding rq param calculation in function

Calculating parameters for striding rq is large enough
to deserve its own function. As the names are also very long
it is very easy to hit on the 80 char limitation every time
a change is made. This is an additional sign that it should
be extracted into its own function.

This patch has no functional change.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-3-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agonet/mlx5e: Make mlx5e_rq_param naming consistent
Dragos Tatulea [Mon, 23 Feb 2026 20:41:41 +0000 (22:41 +0200)] 
net/mlx5e: Make mlx5e_rq_param naming consistent

This structure is used under different names: rq_param, rq_params,
param, rqp. Refactor the code to use a single name: rq_param.

This patch has no functional change.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260223204155.1783580-2-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 weeks agoMerge branch 'mlx5-misc-fixes-2026-02-24'
Jakub Kicinski [Thu, 26 Feb 2026 04:01:52 +0000 (20:01 -0800)] 
Merge branch 'mlx5-misc-fixes-2026-02-24'

Tariq Toukan says:

====================
mlx5 misc fixes 2026-02-24

This patchset provides misc bug fixes from the team to the mlx5
core and Eth drivers.
====================

Link: https://patch.msgid.link/20260224114652.1787431-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet/mlx5e: Fix "scheduling while atomic" in IPsec MAC address query
Jianbo Liu [Tue, 24 Feb 2026 11:46:52 +0000 (13:46 +0200)] 
net/mlx5e: Fix "scheduling while atomic" in IPsec MAC address query

Fix a "scheduling while atomic" bug in mlx5e_ipsec_init_macs() by
replacing mlx5_query_mac_address() with ether_addr_copy() to get the
local MAC address directly from netdev->dev_addr.

The issue occurs because mlx5_query_mac_address() queries the hardware
which involves mlx5_cmd_exec() that can sleep, but it is called from
the mlx5e_ipsec_handle_event workqueue which runs in atomic context.

The MAC address is already available in netdev->dev_addr, so no need
to query hardware. This avoids the sleeping call and resolves the bug.

Call trace:
  BUG: scheduling while atomic: kworker/u112:2/69344/0x00000200
  __schedule+0x7ab/0xa20
  schedule+0x1c/0xb0
  schedule_timeout+0x6e/0xf0
  __wait_for_common+0x91/0x1b0
  cmd_exec+0xa85/0xff0 [mlx5_core]
  mlx5_cmd_exec+0x1f/0x50 [mlx5_core]
  mlx5_query_nic_vport_mac_address+0x7b/0xd0 [mlx5_core]
  mlx5_query_mac_address+0x19/0x30 [mlx5_core]
  mlx5e_ipsec_init_macs+0xc1/0x720 [mlx5_core]
  mlx5e_ipsec_build_accel_xfrm_attrs+0x422/0x670 [mlx5_core]
  mlx5e_ipsec_handle_event+0x2b9/0x460 [mlx5_core]
  process_one_work+0x178/0x2e0
  worker_thread+0x2ea/0x430

Fixes: cee137a63431 ("net/mlx5e: Handle ESN update events")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260224114652.1787431-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet/mlx5: Fix missing devlink lock in SRIOV enable error path
Shay Drory [Tue, 24 Feb 2026 11:46:51 +0000 (13:46 +0200)] 
net/mlx5: Fix missing devlink lock in SRIOV enable error path

The cited commit miss to add locking in the error path of
mlx5_sriov_enable(). When pci_enable_sriov() fails,
mlx5_device_disable_sriov() is called to clean up. This cleanup function
now expects to be called with the devlink instance lock held.

Add the missing devl_lock(devlink) and devl_unlock(devlink)

Fixes: 84a433a40d0e ("net/mlx5: Lock mlx5 devlink reload callbacks")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260224114652.1787431-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet/mlx5: E-switch, Clear legacy flag when moving to switchdev
Shay Drory [Tue, 24 Feb 2026 11:46:50 +0000 (13:46 +0200)] 
net/mlx5: E-switch, Clear legacy flag when moving to switchdev

The cited commit introduced MLX5_PRIV_FLAGS_SWITCH_LEGACY to identify
when a transition to legacy mode is requested via devlink.  However, the
logic failed to clear this flag if the mode was subsequently changed
back to MLX5_ESWITCH_OFFLOADS (switchdev).  Consequently, if a user
toggled from legacy to switchdev, the flag remained set, leaving the
driver with wrong state indicating

Fix this by explicitly clearing the MLX5_PRIV_FLAGS_SWITCH_LEGACY bit
when the requested mode is MLX5_ESWITCH_OFFLOADS.

Fixes: 2a4f56fbcc47 ("net/mlx5e: Keep netdev when leave switchdev for devlink set legacy only")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260224114652.1787431-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet/mlx5: LAG, disable MPESW in lag_disable_change()
Shay Drory [Tue, 24 Feb 2026 11:46:49 +0000 (13:46 +0200)] 
net/mlx5: LAG, disable MPESW in lag_disable_change()

mlx5_lag_disable_change() unconditionally called mlx5_disable_lag() when
LAG was active, which is incorrect for MLX5_LAG_MODE_MPESW.
Hnece, call mlx5_disable_mpesw() when running in MPESW mode.

Fixes: a32327a3a02c ("net/mlx5: Lag, Control MultiPort E-Switch single FDB mode")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260224114652.1787431-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet/mlx5: DR, Fix circular locking dependency in dump
Shay Drory [Tue, 24 Feb 2026 11:46:48 +0000 (13:46 +0200)] 
net/mlx5: DR, Fix circular locking dependency in dump

Fix a circular locking dependency between dbg_mutex and the domain
rx/tx mutexes that could lead to a deadlock.

The dump path in dr_dump_domain_all() was acquiring locks in the order:
  dbg_mutex -> rx.mutex -> tx.mutex

While the table/matcher creation paths acquire locks in the order:
  rx.mutex -> tx.mutex -> dbg_mutex

This inverted lock ordering creates a circular dependency. Fix this by
changing dr_dump_domain_all() to acquire the domain lock before
dbg_mutex, matching the order used in mlx5dr_table_create() and
mlx5dr_matcher_create().

Lockdep splat:
 ======================================================
 WARNING: possible circular locking dependency detected
 6.19.0-rc6net_next_e817c4e #1 Not tainted
 ------------------------------------------------------
 sos/30721 is trying to acquire lock:
 ffff888102df5900 (&dmn->info.rx.mutex){+.+.}-{4:4}, at:
dr_dump_start+0x131/0x450 [mlx5_core]

 but task is already holding lock:
 ffff888102df5bc0 (&dmn->dump_info.dbg_mutex){+.+.}-{4:4}, at:
dr_dump_start+0x10b/0x450 [mlx5_core]

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #2 (&dmn->dump_info.dbg_mutex){+.+.}-{4:4}:
        __mutex_lock+0x91/0x1060
        mlx5dr_matcher_create+0x377/0x5e0 [mlx5_core]
        mlx5_cmd_dr_create_flow_group+0x62/0xd0 [mlx5_core]
        mlx5_create_flow_group+0x113/0x1c0 [mlx5_core]
        mlx5_chains_create_prio+0x453/0x2290 [mlx5_core]
        mlx5_chains_get_table+0x2e2/0x980 [mlx5_core]
        esw_chains_create+0x1e6/0x3b0 [mlx5_core]
        esw_create_offloads_fdb_tables.cold+0x62/0x63f [mlx5_core]
        esw_offloads_enable+0x76f/0xd20 [mlx5_core]
        mlx5_eswitch_enable_locked+0x35a/0x500 [mlx5_core]
        mlx5_devlink_eswitch_mode_set+0x561/0x950 [mlx5_core]
        devlink_nl_eswitch_set_doit+0x67/0xe0
        genl_family_rcv_msg_doit+0xe0/0x130
        genl_rcv_msg+0x188/0x290
        netlink_rcv_skb+0x4b/0xf0
        genl_rcv+0x24/0x40
        netlink_unicast+0x1ed/0x2c0
        netlink_sendmsg+0x210/0x450
        __sock_sendmsg+0x38/0x60
        __sys_sendto+0x119/0x180
        __x64_sys_sendto+0x20/0x30
        do_syscall_64+0x70/0xd00
        entry_SYSCALL_64_after_hwframe+0x4b/0x53

 -> #1 (&dmn->info.tx.mutex){+.+.}-{4:4}:
        __mutex_lock+0x91/0x1060
        mlx5dr_table_create+0x11d/0x530 [mlx5_core]
        mlx5_cmd_dr_create_flow_table+0x62/0x140 [mlx5_core]
        __mlx5_create_flow_table+0x46f/0x960 [mlx5_core]
        mlx5_create_flow_table+0x16/0x20 [mlx5_core]
        esw_create_offloads_fdb_tables+0x136/0x240 [mlx5_core]
        esw_offloads_enable+0x76f/0xd20 [mlx5_core]
        mlx5_eswitch_enable_locked+0x35a/0x500 [mlx5_core]
        mlx5_devlink_eswitch_mode_set+0x561/0x950 [mlx5_core]
        devlink_nl_eswitch_set_doit+0x67/0xe0
        genl_family_rcv_msg_doit+0xe0/0x130
        genl_rcv_msg+0x188/0x290
        netlink_rcv_skb+0x4b/0xf0
        genl_rcv+0x24/0x40
        netlink_unicast+0x1ed/0x2c0
        netlink_sendmsg+0x210/0x450
        __sock_sendmsg+0x38/0x60
        __sys_sendto+0x119/0x180
        __x64_sys_sendto+0x20/0x30
        do_syscall_64+0x70/0xd00
        entry_SYSCALL_64_after_hwframe+0x4b/0x53

 -> #0 (&dmn->info.rx.mutex){+.+.}-{4:4}:
        __lock_acquire+0x18b6/0x2eb0
        lock_acquire+0xd3/0x2c0
        __mutex_lock+0x91/0x1060
        dr_dump_start+0x131/0x450 [mlx5_core]
        seq_read_iter+0xe3/0x410
        seq_read+0xfb/0x130
        full_proxy_read+0x53/0x80
        vfs_read+0xba/0x330
        ksys_read+0x65/0xe0
        do_syscall_64+0x70/0xd00
        entry_SYSCALL_64_after_hwframe+0x4b/0x53

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&dmn->dump_info.dbg_mutex);
                                lock(&dmn->info.tx.mutex);
                                lock(&dmn->dump_info.dbg_mutex);
   lock(&dmn->info.rx.mutex);

                   *** DEADLOCK ***

Fixes: 9222f0b27da2 ("net/mlx5: DR, Add support for dumping steering info")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260224114652.1787431-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoMerge tag 'wireless-2026-02-25' of https://git.kernel.org/pub/scm/linux/kernel/git...
Jakub Kicinski [Thu, 26 Feb 2026 03:54:28 +0000 (19:54 -0800)] 
Merge tag 'wireless-2026-02-25' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes Berg says:

====================
A good number of fixes:
 - cfg80211:
   - cancel rfkill work appropriately
   - fix radiotap parsing to correctly reject field 18
   - fix wext (yes...) off-by-one for IGTK key ID
 - mac80211:
   - fix for mesh NULL pointer dereference
   - fix for stack out-of-bounds (2 bytes) write on
     specific multi-link action frames
   - set default WMM parameters for all links
 - mwifiex: check dev_alloc_name() return value correctly
 - libertas: fix potential timer use-after-free
 - brcmfmac: fix crash on probe failure

* tag 'wireless-2026-02-25' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: mac80211: fix NULL pointer dereference in mesh_rx_csa_frame()
  wifi: mac80211: bounds-check link_id in ieee80211_ml_reconfiguration
  wifi: mac80211: set default WMM parameters on all links
  wifi: libertas: fix use-after-free in lbs_free_adapter()
  wifi: mwifiex: Fix dev_alloc_name() return value check
  wifi: brcmfmac: Fix potential kernel oops when probe fails
  wifi: radiotap: reject radiotap with unknown bits
  wifi: cfg80211: cancel rfkill_block work in wiphy_unregister()
  wifi: cfg80211: wext: fix IGTK key ID off-by-one
====================

Link: https://patch.msgid.link/20260225113159.360574-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoMerge branch 'add-selftests-helper-to-get-n-unique-ports'
Jakub Kicinski [Thu, 26 Feb 2026 03:42:05 +0000 (19:42 -0800)] 
Merge branch 'add-selftests-helper-to-get-n-unique-ports'

Dimitri Daskalakis says:

====================
Add selftests helper to get N unique ports

The rss_ctx.py tests would occasionally flake. I found that the successive
calls to rand_port would occasionally return duplicate ports, breaking the
tests invariants.

Add a new helper that guarantees generated ports are unique.
====================

Link: https://patch.msgid.link/20260224224659.1507082-1-dimitri.daskalakis1@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoselftests: drv-net: rss: Generate unique ports for RSS context tests
Dimitri Daskalakis [Tue, 24 Feb 2026 22:46:59 +0000 (14:46 -0800)] 
selftests: drv-net: rss: Generate unique ports for RSS context tests

The RSS ctx tests rely on NFC rules with unique ports to steer packets
to the correct ctx. This updates the test to use the new rand_ports()
helper to guarantee the ports are unique.

Manual testing shows that generating 32 ports with the existing method
would result in at least one duplicate 4% of the time.

Signed-off-by: Dimitri Daskalakis <dimitri.daskalakis1@gmail.com>
Link: https://patch.msgid.link/20260224224659.1507082-3-dimitri.daskalakis1@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoselftests: net: py: Add rand_ports helper method
Dimitri Daskalakis [Tue, 24 Feb 2026 22:46:58 +0000 (14:46 -0800)] 
selftests: net: py: Add rand_ports helper method

Certain tests need a unique set of ports. Successive calls to the
existing rand_port method may return a duplicate port, resulting in test
flakiness. The new helper keeps sockets open while building a list of
ephemeral ports, thus the kernel enforces their uniqueness.

Signed-off-by: Dimitri Daskalakis <dimitri.daskalakis1@gmail.com>
Link: https://patch.msgid.link/20260224224659.1507082-2-dimitri.daskalakis1@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoMerge branch 'netfilter-updates-for-net-next'
Jakub Kicinski [Thu, 26 Feb 2026 03:36:28 +0000 (19:36 -0800)] 
Merge branch 'netfilter-updates-for-net-next'

Florian Westphal says:

====================
netfilter: updates for net-next

including IPVS updates from and via Julian Anastasov.

First updates for IPVS. From Julians cover-letter:

* Convert the global __ip_vs_mutex to per-net service_mutex and
  switch the service tables to be per-net, cowork by Jiejian Wu and
  Dust Li

* Convert some code that walks the service lists to use RCU instead of
  the service_mutex

* We used two tables for services (non-fwmark and fwmark), merge them
  into single svc_table

* The list for unavailable destinations (dest_trash) holds dsts and
  thus dev references causing extra work for the ip_vs_dst_event() dev
  notifier handler. Change this by dropping the reference when dest
  is removed and saved into dest_trash. The dest_trash will need more
  changes to make it light for lookups. TODO.

* On new connection we can do multiple lookups for services by trying
  different fallback options. Add more counters for service types, so
  that we can avoid unneeded lookups for services.

* The no_cport and dropentry counters can be per-net and also we can
  avoid extra conn lookups

Then, a few cleanups for nf_tables:

* keep BH enabled during nft_set_rbtree inserts, this is possible because
  the root lock is now only taken from control plane.
* toss a few EXPORT_SYMBOLs from nf_tables; these were historic
  leftovers from back in the day when e.g. set backends were still
  residing in their own modules.
* remove the register tracking infra from nftables.  It was disabled
  years ago in 5.18 and there are no plans to salvage this work; the
  idea was good (remove redundant register stores), but there is just
  one too many pitfalls, and better rule structuring (verdict maps)
  largely avoids the scenarios where this would have helped.
====================

Link: https://patch.msgid.link/20260224205048.4718-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonetfilter: nf_tables: remove register tracking infrastructure
Florian Westphal [Tue, 24 Feb 2026 20:50:48 +0000 (21:50 +0100)] 
netfilter: nf_tables: remove register tracking infrastructure

This facility was disabled in commit
9e539c5b6d9c ("netfilter: nf_tables: disable expression reduction infra"),
because not all nft_exprs guarantee they will update the destination
register: some may set NFT_BREAK instead to cancel evaluation of the
rule.

This has been dead code ever since.
There are no plans to salvage this at this time, so remove this.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-10-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonetfilter: nf_tables: drop obsolete EXPORT_SYMBOLs
Florian Westphal [Tue, 24 Feb 2026 20:50:47 +0000 (21:50 +0100)] 
netfilter: nf_tables: drop obsolete EXPORT_SYMBOLs

These are no longer required, calling objects are nowadays
baked into nf_tables.ko itself.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-9-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonetfilter: nft_set_rbtree: don't disable bh when acquiring tree lock
Florian Westphal [Tue, 24 Feb 2026 20:50:46 +0000 (21:50 +0100)] 
netfilter: nft_set_rbtree: don't disable bh when acquiring tree lock

As of commit 7e43e0a1141d
("netfilter: nft_set_rbtree: translate rbtree to array for binary search")
the lock is only taken from control plane, no need to disable BH anymore.

Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-8-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoipvs: no_cport and dropentry counters can be per-net
Julian Anastasov [Tue, 24 Feb 2026 20:50:45 +0000 (21:50 +0100)] 
ipvs: no_cport and dropentry counters can be per-net

Change the no_cport counters to be per-net and address family.
This should reduce the extra conn lookups done during present
NO_CPORT connections.

By changing from global to per-net dropentry counters, one net
will not affect the drop rate of another net.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-7-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoipvs: use more counters to avoid service lookups
Julian Anastasov [Tue, 24 Feb 2026 20:50:44 +0000 (21:50 +0100)] 
ipvs: use more counters to avoid service lookups

When new connection is created we can lookup for services multiple
times to support fallback options. We already have some counters
to skip specific lookups because it costs CPU cycles for hash
calculation, etc.

Add more counters for fwmark/non-fwmark services (fwm_services and
nonfwm_services) and make all counters per address family.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-6-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoipvs: do not keep dest_dst after dest is removed
Julian Anastasov [Tue, 24 Feb 2026 20:50:43 +0000 (21:50 +0100)] 
ipvs: do not keep dest_dst after dest is removed

Before now dest->dest_dst is not released when server is moved into
dest_trash list after removal. As result, we can keep dst/dev
references for long time without actively using them.

It is better to avoid walking the dest_trash list when
ip_vs_dst_event() receives dev events. So, make sure we do not
hold dev references in dest_trash list. As packets can be flying
while server is being removed, check the IP_VS_DEST_F_AVAILABLE
flag in slow path to ensure we do not save new dev references to
removed servers.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-5-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoipvs: use single svc table
Julian Anastasov [Tue, 24 Feb 2026 20:50:42 +0000 (21:50 +0100)] 
ipvs: use single svc table

fwmark based services and non-fwmark based services can be hashed
in same service table. This reduces the burden of working with two
tables.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-4-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoipvs: some service readers can use RCU
Julian Anastasov [Tue, 24 Feb 2026 20:50:41 +0000 (21:50 +0100)] 
ipvs: some service readers can use RCU

Some places walk the services under mutex but they can just use RCU:

* ip_vs_dst_event() uses ip_vs_forget_dev() which uses its own lock
  to modify dest
* ip_vs_genl_dump_services(): ip_vs_genl_fill_service() just fills skb
* ip_vs_genl_parse_service(): move RCU lock to callers
  ip_vs_genl_set_cmd(), ip_vs_genl_dump_dests() and ip_vs_genl_get_cmd()
* ip_vs_genl_dump_dests(): just fill skb

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-3-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agoipvs: make ip_vs_svc_table and ip_vs_svc_fwm_table per netns
Jiejian Wu [Tue, 24 Feb 2026 20:50:40 +0000 (21:50 +0100)] 
ipvs: make ip_vs_svc_table and ip_vs_svc_fwm_table per netns

Current ipvs uses one global mutex "__ip_vs_mutex" to keep the global
"ip_vs_svc_table" and "ip_vs_svc_fwm_table" safe. But when there are
tens of thousands of services from different netns in the table, it
takes a long time to look up the table, for example, using "ipvsadm
-ln" from different netns simultaneously.

We make "ip_vs_svc_table" and "ip_vs_svc_fwm_table" per netns, and we
add "service_mutex" per netns to keep these two tables safe instead of
the global "__ip_vs_mutex" in current version. To this end, looking up
services from different netns simultaneously will not get stuck,
shortening the time consumption in large-scale deployment. It can be
reproduced using the simple scripts below.

init.sh: #!/bin/bash
for((i=1;i<=4;i++));do
        ip netns add ns$i
        ip netns exec ns$i ip link set dev lo up
        ip netns exec ns$i sh add-services.sh
done

add-services.sh: #!/bin/bash
for((i=0;i<30000;i++)); do
        ipvsadm -A  -t 10.10.10.10:$((80+$i)) -s rr
done

runtest.sh: #!/bin/bash
for((i=1;i<4;i++));do
        ip netns exec ns$i ipvsadm -ln > /dev/null &
done
ip netns exec ns4 ipvsadm -ln > /dev/null

Run "sh init.sh" to initiate the network environment. Then run "time
./runtest.sh" to evaluate the time consumption. Our testbed is a 4-core
Intel Xeon ECS. The result of the original version is around 8 seconds,
while the result of the modified version is only 0.8 seconds.

Signed-off-by: Jiejian Wu <jiejian@linux.alibaba.com>
Co-developed-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-2-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: pppoe: avoid zero-length arrays in struct pppoe_hdr
Eric Woudstra [Tue, 24 Feb 2026 15:50:30 +0000 (16:50 +0100)] 
net: pppoe: avoid zero-length arrays in struct pppoe_hdr

Jakub Kicinski reported following issue in upcoming patches:

W=1 C=1 GCC build gives us:

net/bridge/netfilter/nf_conntrack_bridge.c: note: in included file (through
../include/linux/if_pppox.h, ../include/uapi/linux/netfilter_bridge.h,
../include/linux/netfilter_bridge.h): include/uapi/linux/if_pppox.h:
153:29: warning: array of flexible structures

sparse doesn't like that hdr has a zero-length array which overlaps
proto. The kernel code doesn't currently need those arrays.

PPPoE connection is functional after applying this patch.

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Eric Woudstra <ericwouds@gmail.com>
Link: https://patch.msgid.link/20260224155030.106918-1-ericwouds@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet/ibmveth: fix comment typos in ibmveth.c
Abhilekh Deka [Tue, 24 Feb 2026 15:36:01 +0000 (21:06 +0530)] 
net/ibmveth: fix comment typos in ibmveth.c

Correct spelling mistakes in comments:
- Fix misspelling of gro_receive
- Fix misspelling of Partition

Signed-off-by: Abhilekh Deka <abhindeka@gmail.com>
Reviewed-by: Dave Marquardt <davemarq@linux.ibm.com>
Link: https://patch.msgid.link/20260224153601.17534-1-abhindeka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 weeks agonet: cadence: macb: add ethtool nway_reset support
Nicolai Buchwitz [Tue, 24 Feb 2026 14:57:23 +0000 (15:57 +0100)] 
net: cadence: macb: add ethtool nway_reset support

Wire phy_ethtool_nway_reset() as the .nway_reset ethtool operation,
allowing userspace to restart PHY autonegotiation via 'ethtool -r'.

Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>
Link: https://patch.msgid.link/20260224145723.49450-1-nb@tipi-net.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>