git.ipfire.org Git - thirdparty/kernel/linux.git/log
6 weeks ago  ice: switch to Page Pool
Michal Kubiak [Thu, 25 Sep 2025 09:22:53 +0000 (11:22 +0200)] 
ice: switch to Page Pool

This patch completes the transition of the ice driver to use the Page Pool
and libeth APIs, following the same direction as commit 5fa4caff59f2
("iavf: switch to Page Pool"). With the legacy page splitting and recycling
logic already removed, the driver is now in a clean state to adopt the
modern memory model.

The Page Pool integration simplifies buffer management by offloading
DMA mapping and recycling to the core infrastructure. This eliminates
the need for driver-specific handling of headroom, buffer sizing, and
page order. The libeth helper is used for CPU-side processing, while
DMA-for-device is handled by the Page Pool core.

Additionally, this patch extends the conversion to cover XDP support.
The driver now uses libeth_xdp helpers for Rx buffer processing,
and optimizes XDP_TX by skipping per-frame DMA mapping. Instead, all
buffers are mapped as bi-directional up front, leveraging Page Pool's
lifecycle management. This significantly reduces overhead in virtualized
environments.
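
As an illustration, the kind of Page Pool setup described above could
look roughly like this (a sketch only; the ring fields and the use of
XDP_PACKET_HEADROOM here are assumptions, not code from the driver):

  struct page_pool_params pp_params = {
          .flags     = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
          .order     = 0,
          .pool_size = ring->count,        /* assumed ring field */
          .nid       = NUMA_NO_NODE,
          .dev       = ring->dev,          /* assumed ring field */
          .dma_dir   = DMA_BIDIRECTIONAL,  /* lets XDP_TX skip per-frame mapping */
          .offset    = XDP_PACKET_HEADROOM,
          .max_len   = PAGE_SIZE - XDP_PACKET_HEADROOM,
  };
  struct page_pool *pool = page_pool_create(&pp_params);

  if (IS_ERR(pool))
          return PTR_ERR(pool);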

Performance observations:
- In typical scenarios (netperf, XDP_PASS, XDP_DROP), performance remains
  on par with the previous implementation.
- In XDP_TX mode:
  * With IOMMU enabled, performance improves dramatically - over 5x
    increase - due to reduced DMA mapping overhead and better memory reuse.
  * With IOMMU disabled, performance remains comparable to the previous
    implementation, with no significant changes observed.
- In XDP_DROP mode:
  * For small MTUs (where multiple buffers can be allocated on a single
    memory page), a performance drop of approximately 20% is observed.
    According to 'perf top' analysis, the bottleneck is caused by atomic
    reference counter increments in the Page Pool.
  * For normal MTUs (where only one buffer can be allocated within a
    single memory page), performance remains comparable to baseline
    levels.

This change is also a step toward a more modular and unified XDP
implementation across Intel Ethernet drivers, aligning with ongoing
efforts to consolidate and streamline feature support.

Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Tested-by: Alexander Nowlin <alexander.nowlin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 weeks ago  ice: drop page splitting and recycling
Michal Kubiak [Thu, 25 Sep 2025 09:22:52 +0000 (11:22 +0200)] 
ice: drop page splitting and recycling

As part of the transition toward Page Pool integration, remove the
legacy page splitting and recycling logic from the ice driver. This
mirrors the approach taken in commit 920d86f3c552 ("iavf: drop page
splitting and recycling").

The previous model attempted to reuse partially consumed pages by
splitting them and tracking their usage across descriptors. While
this was once a memory optimization, it introduced significant
complexity and overhead in the Rx path, including:
- Manual refcount management and page reuse heuristics;
- Per-descriptor buffer shuffling, which could involve moving dozens
  of `ice_rx_buf` structures per NAPI cycle;
- Increased branching and cache pressure in the hotpath.

This change simplifies the Rx logic by always allocating fresh pages
and letting the networking stack handle their lifecycle. Although this
may temporarily reduce performance (up to ~98% in some XDP cases), it
greatly improves maintainability and paves the way for Page Pool,
which will restore and exceed previous performance levels.

The `ice_rx_buf` array is retained for now to minimize diffstat and
ease future replacement with a shared buffer abstraction.

Co-developed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Tested-by: Alexander Nowlin <alexander.nowlin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 weeks ago  ice: remove legacy Rx and construct SKB
Michal Kubiak [Thu, 25 Sep 2025 09:22:51 +0000 (11:22 +0200)] 
ice: remove legacy Rx and construct SKB

Commit 53844673d555 ("iavf: kill 'legacy-rx' for good") removed
the legacy Rx path in the iavf driver. This change applies the same
rationale to the ice driver.

The legacy Rx path relied on manual skb allocation and header copying,
which has become increasingly inefficient and difficult to maintain.
With the stabilization of build_skb() and the growing adoption of
features like XDP, page_pool, and multi-buffer support, the legacy
approach is no longer viable.

Key drawbacks of the legacy path included:
- Higher memory pressure due to direct page allocations and splitting;
- Redundant memcpy() operations for packet headers;
- CPU overhead from eth_get_headlen() and Flow Dissector usage;
- Compatibility issues with XDP, which imposes strict headroom and
  tailroom requirements.

The ice driver, like iavf, does not benefit from the minimal headroom
savings that legacy Rx once offered, as it already splits pages into
fixed halves. Removing this path simplifies the Rx logic, eliminates
unnecessary branches in the hotpath, and prepares the driver for
upcoming enhancements.

In addition to removing the legacy Rx path, this change also eliminates
the custom construct_skb() functions from both the standard and
zero-copy (ZC) Rx paths. These are replaced with the build_skb()
and standardized xdp_build_skb_from_zc() helpers, aligning the driver
with the modern XDP infrastructure and reducing code duplication.

This cleanup also reduces code complexity and improves maintainability
as we move toward a more unified and modern Rx model across drivers.

Co-developed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Michal Kubiak <michal.kubiak@intel.com>
Tested-by: Alexander Nowlin <alexander.nowlin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 weeks ago  net: phy: motorcomm: Add support for PHY LEDs on YT8531
Tianling Shen [Sun, 26 Oct 2025 13:36:52 +0000 (21:36 +0800)] 
net: phy: motorcomm: Add support for PHY LEDs on YT8531

The LED registers on the YT8531 are exactly the same as on the YT8521,
so simply reuse the yt8521_led_hw_* functions.

Tested on OrangePi R1 Plus LTS and Zero3.

Signed-off-by: Tianling Shen <cnsztl@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Jijie Shao <shaojijie@huawei.com>
Link: https://patch.msgid.link/20251026133652.1288732-1-cnsztl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  net: phy: realtek: Add RTL8224 cable testing support
Issam Hamdi [Fri, 24 Oct 2025 09:49:00 +0000 (11:49 +0200)] 
net: phy: realtek: Add RTL8224 cable testing support

The RTL8224 can detect open pairs and short types (within the same pair
or to another pair). The distance to the fault can be estimated. This is
done for each of the 4 pairs separately.

It is not meant to be run while there is an active link partner because
this interferes with the active test pulses.

Output with open 50 m cable:

  Pair A code Open Circuit, source: TDR
  Pair A, fault length: 51.79m, source: TDR
  Pair B code Open Circuit, source: TDR
  Pair B, fault length: 51.28m, source: TDR
  Pair C code Open Circuit, source: TDR
  Pair C, fault length: 50.46m, source: TDR
  Pair D code Open Circuit, source: TDR
  Pair D, fault length: 51.12m, source: TDR

Terminated cable:

  Pair A code OK, source: TDR
  Pair B code OK, source: TDR
  Pair C code OK, source: TDR
  Pair D code OK, source: TDR

Shorted cable (both short types are at roughly the same distance):

  Pair A code Short to another pair, source: TDR
  Pair A, fault length: 2.35m, source: TDR
  Pair B code Short to another pair, source: TDR
  Pair B, fault length: 2.15m, source: TDR
  Pair C code OK, source: TDR
  Pair D code Short within Pair, source: TDR
  Pair D, fault length: 1.94m, source: TDR

Signed-off-by: Issam Hamdi <ih@simonwunderlich.de>
Co-developed-by: Sven Eckelmann <se@simonwunderlich.de>
Signed-off-by: Sven Eckelmann <se@simonwunderlich.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251024-rtl8224-cable-test-v1-1-e3cda89ac98f@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next...
Jakub Kicinski [Wed, 29 Oct 2025 01:12:07 +0000 (18:12 -0700)] 
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
ice: postpone service task disabling

Przemek Kitszel says:

Move service task shutdown to the very end of driver teardown procedure.
This is needed (or at least beneficial) for all unwinding functions that
talk to FW/HW via Admin Queue (so, most of top-level functions, like
ice_deinit_hw()).

Most of the patches move stuff around (I believe it makes it much easier
to review/proof when kept separate) in preparation to defer stopping the
service task to the very end of ice_remove() (and other unwinding flows).
Then the last patch fixes a duplicate call to ice_deinit_hw() (an actual
bug, but unlikely to be hit, hence targeted at -next given the size of
the changes).

The first patch is not closely related; it is included only because it
was developed together with the rest of the series.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  ice: remove duplicate call to ice_deinit_hw() on error paths
  ice: move ice_deinit_dev() to the end of deinit paths
  ice: extract ice_init_dev() from ice_init()
  ice: move ice_init_pf() out of ice_init_dev()
  ice: move udp_tunnel_nic and misc IRQ setup into ice_init_pf()
  ice: ice_init_pf: destroy mutexes and xarrays on memory alloc failure
  ice: move ice_init_interrupt_scheme() prior ice_init_pf()
  ice: move service task start out of ice_init_pf()
  ice: enforce RTNL assumption of queue NAPI manipulation
====================

Link: https://patch.msgid.link/20251024204746.3092277-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  net: tcp_lp: fix kernel-doc warnings and update outdated reference links
Rakuram Eswaran [Sat, 25 Oct 2025 12:05:18 +0000 (17:35 +0530)] 
net: tcp_lp: fix kernel-doc warnings and update outdated reference links

Fix kernel-doc warnings in tcp_lp.c by adding missing parameter
descriptions for tcp_lp_cong_avoid() and tcp_lp_pkts_acked() when
building with W=1.

Also replace invalid URLs in the file header comment with the currently
valid links to the TCP-LP paper and implementation page.

No functional changes.
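
For reference, the kernel-doc being added has this shape (the parameter
names match the upstream function signature, but the descriptions here
are illustrative, not quoted from the patch):

  /**
   * tcp_lp_cong_avoid - congestion avoidance handler for TCP-LP
   * @sk: socket being processed
   * @ack: sequence number acknowledged by the incoming ACK
   * @acked: number of packets newly acknowledged
   */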

Signed-off-by: Rakuram Eswaran <rakuram.e96@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251025-net_ipv4_tcp_lp_c-v1-1-058cc221499e@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  sctp: Constify struct sctp_sched_ops
Christophe JAILLET [Sat, 25 Oct 2025 07:40:59 +0000 (09:40 +0200)] 
sctp: Constify struct sctp_sched_ops

'struct sctp_sched_ops' is not modified in these drivers.

Constifying this structure moves some data to a read-only section, so
increases overall security, especially when the structure holds some
function pointers.

On a x86_64, with allmodconfig, as an example:
Before:
======
   text    data     bss     dec     hex filename
   8019     568       0    8587    218b net/sctp/stream_sched_fc.o

After:
=====
   text    data     bss     dec     hex filename
   8275     312       0    8587    218b net/sctp/stream_sched_fc.o
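
The pattern itself is a one-word change (sketch, with the ops fields and
the exact symbol name elided):

  /* Before: the function-pointer table lives in writable .data. */
  static struct sctp_sched_ops sched_ops = { /* ... */ };

  /* After: it moves to read-only .rodata. */
  static const struct sctp_sched_ops sched_ops = { /* ... */ };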

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://patch.msgid.link/dce03527eb7b7cc8a3c26d5cdac12bafe3350135.1761377890.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  net: netmem: remove NET_IOV_MAX from net_iov_type enum
Bobby Eshleman [Fri, 24 Oct 2025 18:02:56 +0000 (11:02 -0700)] 
net: netmem: remove NET_IOV_MAX from net_iov_type enum

Remove the NET_IOV_MAX workaround from the net_iov_type enum. This entry
was previously added to force the enum size to unsigned long to satisfy
the NET_IOV_ASSERT_OFFSET static assertions.

After commit f3d85c9ee510 ("netmem: introduce struct netmem_desc
mirroring struct page") this workaround became unnecessary: net_iov_type
is now placed after netmem_desc, so its size has no effect on the
position or layout of the fields that mirror struct page.

The layout before this patch:

struct net_iov {
    union {
        struct netmem_desc desc;                     /*     0    48 */
        struct {
            long unsigned int  _flags;               /*     0     8 */
            long unsigned int  pp_magic;             /*     8     8 */
            struct page_pool * pp;                   /*    16     8 */
            long unsigned int  _pp_mapping_pad;      /*    24     8 */
            long unsigned int  dma_addr;             /*    32     8 */
            atomic_long_t      pp_ref_count;         /*    40     8 */
        };                                           /*     0    48 */
    };                                               /*     0    48 */
    struct net_iov_area *      owner;                /*    48     8 */
    enum net_iov_type          type;                 /*    56     8 */

    /* size: 64, cachelines: 1, members: 3 */
};

The layout after this patch:

struct net_iov {
    union {
        struct netmem_desc desc;                     /*     0    48 */
        struct {
            long unsigned int  _flags;               /*     0     8 */
            long unsigned int  pp_magic;             /*     8     8 */
            struct page_pool * pp;                   /*    16     8 */
            long unsigned int  _pp_mapping_pad;      /*    24     8 */
            long unsigned int  dma_addr;             /*    32     8 */
            atomic_long_t      pp_ref_count;         /*    40     8 */
        };                                           /*     0    48 */
    };                                               /*     0    48 */
    struct net_iov_area *      owner;                /*    48     8 */
    enum net_iov_type          type;                 /*    56     4 */

    /* size: 64, cachelines: 1, members: 3 */
    /* padding: 4 */
};

Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20251024-b4-devmem-remove-niov-max-v1-1-ba72c68bc869@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  net: rps: softnet_data reorg to make enqueue_to_backlog() fast
Eric Dumazet [Fri, 24 Oct 2025 09:12:40 +0000 (09:12 +0000)] 
net: rps: softnet_data reorg to make enqueue_to_backlog() fast

enqueue_to_backlog() is showing up in kernel profiles on hosts
with many cores, when RFS/RPS is used.

The following softnet_data fields need to be updated:

- input_queue_tail
- input_pkt_queue (next, prev, qlen, lock)
- backlog.state (if input_pkt_queue was empty)

Unfortunately they are currently using two cache lines:

/* --- cacheline 3 boundary (192 bytes) --- */
call_single_data_t         csd __attribute__((__aligned__(64))); /*  0xc0  0x20 */
struct softnet_data *      rps_ipi_next;         /*  0xe0   0x8 */
unsigned int               cpu;                  /*  0xe8   0x4 */
unsigned int               input_queue_tail;     /*  0xec   0x4 */
struct sk_buff_head        input_pkt_queue;      /*  0xf0  0x18 */

/* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */

struct napi_struct         backlog __attribute__((__aligned__(8))); /* 0x108 0x1f0 */

Add one ____cacheline_aligned_in_smp annotation to make sure they now
use a single cache line.

Also, because napi_struct has written fields, make @state its first field.

We want to make sure that cpus adding packets to sd->input_pkt_queue
are not slowing down cpus processing their backlog because of
false sharing.

After this patch new layout is:

/* --- cacheline 5 boundary (320 bytes) --- */
long int                   pad[3] __attribute__((__aligned__(64))); /* 0x140  0x18 */
unsigned int               input_queue_tail;     /* 0x158   0x4 */

/* XXX 4 bytes hole, try to pack */

struct sk_buff_head        input_pkt_queue;      /* 0x160  0x18 */
struct napi_struct         backlog __attribute__((__aligned__(8))); /* 0x178 0x1f0 */
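
Abridged, the resulting declaration looks like this (a sketch based on
the pahole output above, not the full netdevice.h definition):

  struct softnet_data {
          /* ... consumer-side fields ... */
          long pad[3] ____cacheline_aligned_in_smp; /* opens a fresh cache line */
          unsigned int input_queue_tail;
          struct sk_buff_head input_pkt_queue;
          struct napi_struct backlog; /* @state is now its first field */
          /* ... */
  };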

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251024091240.3292546-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  net: optimize enqueue_to_backlog() for the fast path
Eric Dumazet [Fri, 24 Oct 2025 09:05:17 +0000 (09:05 +0000)] 
net: optimize enqueue_to_backlog() for the fast path

Add likely() and unlikely() clauses for the common cases:

Device is running.
Queue is not full.
Queue is less than half capacity.

Add max_backlog parameter to skb_flow_limit() to avoid
a second READ_ONCE(net_hotdata.max_backlog).

skb_flow_limit() does not need the backlog_lock protection,
and can be called before we acquire the lock, for even better
resistance to attacks.
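
The shape of the resulting fast path, as a sketch (control flow
condensed; the three-argument skb_flow_limit() is the new signature
described above, the rest is assumed):

  unsigned int max_backlog = READ_ONCE(net_hotdata.max_backlog);
  unsigned int qlen = skb_queue_len(&sd->input_pkt_queue);

  if (likely(qlen <= max_backlog)) {
          /* flow limit only consulted above half capacity (rare) */
          if (unlikely(qlen > max_backlog >> 1) &&
              skb_flow_limit(skb, qlen, max_backlog))
                  goto drop;
          /* take the backlog lock and enqueue skb */
  }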

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251024090517.3289181-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  tools: ynl: rework the string representation of NlError
Jakub Kicinski [Mon, 27 Oct 2025 19:29:58 +0000 (12:29 -0700)] 
tools: ynl: rework the string representation of NlError

In the early days of YNL development, dumping the NlMsg on errors
was quite useful, as the library itself could have been buggy.
These days, the NlMsg increasingly just takes up screen space
and means nothing to a typical user. Try to format the errors
more in line with how YNL C formats its error strings.

Before:
  $ ynl --family ethtool  --do channels-set  --json '{}'
  Netlink error: Invalid argument
  nl_len = 44 (28) nl_flags = 0x300 nl_type = 2
error: -22
extack: {'miss-type': 'header'}

  $ ynl --family ethtool  --do channels-set  --json '{..., "tx-count": 999}'
  Netlink error: Invalid argument
  nl_len = 88 (72) nl_flags = 0x300 nl_type = 2
error: -22
extack: {'msg': 'requested channel count exceeds maximum', 'bad-attr': '.tx-count'}

After:
  $ ynl --family ethtool  --do channels-set  --json '{}'
  Netlink error: Invalid argument {'miss-type': 'header'}

  $ ynl --family ethtool  --do channels-set  --json '{..., "tx-count": 999}'
  Netlink error: requested channel count exceeds maximum: Invalid argument {'bad-attr': '.tx-count'}

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20251027192958.2058340-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  tools: ynl: fix indent issues in the main Python lib
Jakub Kicinski [Mon, 27 Oct 2025 19:29:57 +0000 (12:29 -0700)] 
tools: ynl: fix indent issues in the main Python lib

Class NlError() and operation_do_attributes() are indented by 2 spaces
rather than 4 spaces used by the rest of the file.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20251027192958.2058340-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  Merge branch 'net-stmmac-add-support-for-coarse-timestamping'
Paolo Abeni [Tue, 28 Oct 2025 14:34:36 +0000 (15:34 +0100)] 
Merge branch 'net-stmmac-add-support-for-coarse-timestamping'

Maxime Chevallier says:

====================
net: stmmac: Add support for coarse timestamping

This is V2 of coarse timestamping support in stmmac. This version uses a
dedicated devlink param "ts_coarse" to control this mode.

This doesn't conflict with Russell's cleanup of hwif.

Maxime

[1] : https://lore.kernel.org/netdev/20200514102808.31163-1-olivier.dautricourt@orolia.com/

V1: https://lore.kernel.org/netdev/20251015102725.1297985-1-maxime.chevallier@bootlin.com/
====================

Link: https://patch.msgid.link/20251024070720.71174-1-maxime.chevallier@bootlin.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks ago  net: stmmac: Add a devlink attribute to control timestamping mode
Maxime Chevallier [Fri, 24 Oct 2025 07:07:18 +0000 (09:07 +0200)] 
net: stmmac: Add a devlink attribute to control timestamping mode

The DWMAC1000 supports two timestamping configurations that control how
frequency adjustments are made to the ptp_clock, as well as the reported
timestamp values.

There was a previous attempt at upstreaming support for configuring this
mode by Olivier Dautricourt and Julien Beraud a few years back [1]

In a nutshell, the timestamping can be either set in fine mode or in
coarse mode.

In fine mode, which is the default, we use the overflow of an accumulator to
trigger frequency adjustments, but by doing so we lose precision on the
timestamps that are produced by the timestamping unit. The main drawback
is that the sub-second increment value, used to generate timestamps, can't be
set to lower than (2 / ptp_clock_freq).

The "fine" qualification comes from the frequent frequency adjustments we are
able to do, which is perfect for a PTP follower usecase.

In Coarse mode, we don't do frequency adjustments based on an
accumulator overflow. We can therefore have very fine subsecond
increment values, allowing for better timestamping precision. However
this mode works best when the ptp clock frequency is adjusted based on
an external signal, such as a PPS input produced by a GPS clock. This
mode is therefore perfect for a Grand-master usecase.
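
To put numbers on it (a worked example, assuming a 200 MHz ptp_clock):
fine mode cannot use a sub-second increment below 2 / 200 MHz = 10 ns,
while coarse mode can go down to the clock period itself,
1 / 200 MHz = 5 ns, halving the timestamp granularity.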

Introduce a driver-specific devlink parameter "ts_coarse" to enable or
disable coarse mode, keeping the "fine" mode as a default.

This can then be changed with:

  devlink dev param set <dev> name ts_coarse value true cmode runtime

The associated documentation is also added.

[1] : https://lore.kernel.org/netdev/20200514102808.31163-1-olivier.dautricourt@orolia.com/

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20251024070720.71174-3-maxime.chevallier@bootlin.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks ago  net: stmmac: Move subsecond increment configuration in dedicated helper
Maxime Chevallier [Fri, 24 Oct 2025 07:07:17 +0000 (09:07 +0200)] 
net: stmmac: Move subsecond increment configuration in dedicated helper

In preparation for fine/coarse support, let's move the subsecond increment
and addend configuration in a dedicated helper.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20251024070720.71174-2-maxime.chevallier@bootlin.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks ago  Merge branch 'net-macb-eyeq5-support'
Paolo Abeni [Tue, 28 Oct 2025 14:17:55 +0000 (15:17 +0100)] 
Merge branch 'net-macb-eyeq5-support'

Théo Lebrun says:

====================
net: macb: EyeQ5 support

This series' goal is adding support to the MACB driver for EyeQ5 GEM.
The specifics for this compatible are:

 - HW cannot add dummy bytes at the start of IP packets for alignment
   purposes. The behavior can be detected using DCFG6 so it isn't
   attached to compatible data.

 - The hardware LSO/TSO is known to be buggy: add a compatible
   capability flag to force disable it.

 - At init, we have to wiggle two syscon registers that configure the
   PHY integration.

   In past attempts [0] we did it in macb_config->init() using a syscon
   regmap. That was far from ideal so now a generic PHY driver
   abstracts that away. We reuse the bp->sgmii_phy field used by some
   compatibles.

   We have to add a phy_set_mode() call as the PHY power on sequence
   depends on whether we do RGMII or SGMII.

[0]: https://lore.kernel.org/lkml/20250627-macb-v2-15-ff8207d0bb77@bootlin.com/

Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com>
---
Changes in v3:
- Drop Fixes: trailer on [2/5]. We don't fix any platform using the
  driver currently.
- Improve [5/5] commit message; add info about how an unconditional
  phy_set_mode_ext() won't break existing platforms.
- Hard-break an 82-character line in [2/5]; warned by patchwork.
- Trailers:
  - 1x Acked-by: Conor Dooley on [1/5].
  - 2x Reviewed-by: Andrew Lunn on [1/5] and [4/5].
  - 2x Reviewed-by: Maxime Chevallier on [4/5] and [5/5].
- Link to v2: https://lore.kernel.org/r/20251022-macb-eyeq5-v2-0-7c140abb0581@bootlin.com

Changes in v2:
- Drop non net-next patches.
- Re-run get_maintainers.pl to shorten the To/Cc list.
- Rebase upon latest net-next; no changes. Tested on HW.
- Link to v1: https://lore.kernel.org/r/20251021-macb-eyeq5-v1-0-3b0b5a9d2f85@bootlin.com

Past versions of the MACB EyeQ5 patches:
 - March 2025: [PATCH net-next 00/13] Support the Cadence MACB/GEM
   instances on Mobileye EyeQ5 SoCs
   https://lore.kernel.org/lkml/20250321-macb-v1-0-537b7e37971d@bootlin.com/
 - June 2025: [PATCH net-next v2 00/18] Support the Cadence MACB/GEM
   instances on Mobileye EyeQ5 SoCs
   https://lore.kernel.org/lkml/20250627-macb-v2-0-ff8207d0bb77@bootlin.com/
 - August 2025: [PATCH net v3 00/16] net: macb: various fixes & cleanup
   https://lore.kernel.org/lkml/20250808-macb-fixes-v3-0-08f1fcb5179f@bootlin.com/

---
Théo Lebrun (5):
      dt-bindings: net: cdns,macb: add Mobileye EyeQ5 ethernet interface
      net: macb: match skb_reserve(skb, NET_IP_ALIGN) with HW alignment
      net: macb: add no LSO capability (MACB_CAPS_NO_LSO)
      net: macb: rename bp->sgmii_phy field to bp->phy
      net: macb: Add "mobileye,eyeq5-gem" compatible

 .../devicetree/bindings/net/cdns,macb.yaml         | 10 +++
 drivers/net/ethernet/cadence/macb.h                |  6 +-
 drivers/net/ethernet/cadence/macb_main.c           | 94 +++++++++++++++++-----
 3 files changed, 91 insertions(+), 19 deletions(-)
---
base-commit: 61b7ade9ba8c3b16867e25411b5f7cf1abe35879
change-id: 20251020-macb-eyeq5-fe2c0d1edc75

Best regards,
====================

Link: https://patch.msgid.link/20251023-macb-eyeq5-v3-0-af509422c204@bootlin.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks ago  net: macb: Add "mobileye,eyeq5-gem" compatible
Théo Lebrun [Thu, 23 Oct 2025 16:22:55 +0000 (18:22 +0200)] 
net: macb: Add "mobileye,eyeq5-gem" compatible

Add support for the two GEM instances inside Mobileye EyeQ5 SoCs, using
compatible "mobileye,eyeq5-gem". With it, add a custom init sequence
that must grab a generic PHY and initialise it.

We use bp->phy in both RGMII and SGMII cases. Tell the PHY our mode by
adding a phy_set_mode_ext() call during macb_open(), before
phy_power_on(). We are the first user of bp->phy in non-SGMII cases.

The phy_set_mode_ext() call is made unconditionally. It cannot cause
issues on platforms where !bp->phy or !bp->phy->ops->set_mode as, in
those cases, the call is a no-op (returning zero). From reading
upstream DTS, we can figure out that no platform has a bp->phy and a
PHY driver that has a .set_mode() implementation:
 - cdns,zynqmp-gem: no DTS upstream.
 - microchip,mpfs-macb: microchip/mpfs.dtsi, &mac0..1, no PHY attached.
 - xlnx,versal-gem: xilinx/versal-net.dtsi, &gem0..1, no PHY attached.
 - xlnx,zynqmp-gem: xilinx/zynqmp.dtsi, &gem0..3, PHY attached to
   drivers/phy/xilinx/phy-zynqmp.c which has no .set_mode().

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251023-macb-eyeq5-v3-5-af509422c204@bootlin.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks ago  net: macb: rename bp->sgmii_phy field to bp->phy
Théo Lebrun [Thu, 23 Oct 2025 16:22:54 +0000 (18:22 +0200)] 
net: macb: rename bp->sgmii_phy field to bp->phy

The bp->sgmii_phy field is initialised at probe by init_reset_optional()
if bp->phy_interface == PHY_INTERFACE_MODE_SGMII. It gets used by:
 - zynqmp_config: "cdns,zynqmp-gem" or "xlnx,zynqmp-gem" compatibles.
 - mpfs_config: "microchip,mpfs-macb" compatible.
 - versal_config: "xlnx,versal-gem" compatible.

Make the name more generic, as EyeQ5 requires the PHY in both SGMII and RGMII cases.

Drop "for ZynqMP SGMII mode" comment that is already a lie, as it gets
used on Microchip platforms as well. And soon it won't be SGMII-only.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com>
Link: https://patch.msgid.link/20251023-macb-eyeq5-v3-4-af509422c204@bootlin.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks ago  net: macb: add no LSO capability (MACB_CAPS_NO_LSO)
Théo Lebrun [Thu, 23 Oct 2025 16:22:53 +0000 (18:22 +0200)] 
net: macb: add no LSO capability (MACB_CAPS_NO_LSO)

LSO is runtime-detected using the PBUF_LSO field inside register DCFG6.
Allow disabling that feature if it is broken by using bp->caps coming
from match data.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com>
Link: https://patch.msgid.link/20251023-macb-eyeq5-v3-3-af509422c204@bootlin.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks ago  net: macb: match skb_reserve(skb, NET_IP_ALIGN) with HW alignment
Théo Lebrun [Thu, 23 Oct 2025 16:22:52 +0000 (18:22 +0200)] 
net: macb: match skb_reserve(skb, NET_IP_ALIGN) with HW alignment

If HW is RSC capable, it cannot add dummy bytes at the start of IP
packets. Alignment (i.e. the number of dummy bytes) is configured using
the RBOF field inside the NCFGR register.

On the software side, the skb_reserve(skb, NET_IP_ALIGN) call must only
be done if those dummy bytes are added by the hardware; notice the
skb_reserve() is done AFTER writing the address to the device.

We cannot do the skb_reserve() call BEFORE writing the address because
the address field ignores the low 2/3 bits. Conclusion: in some cases,
we risk not being able to respect the NET_IP_ALIGN value (which is
picked based on unaligned CPU access performance).
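
In other words, the Rx path conceptually becomes (sketch; the condition
name is illustrative, the real check is derived from the DCFG6-based
detection mentioned in the cover letter):

  /* Only reserve the dummy bytes when HW actually inserts them (RBOF). */
  if (hw_inserts_dummy_bytes)
          skb_reserve(skb, NET_IP_ALIGN);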

Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251023-macb-eyeq5-v3-2-af509422c204@bootlin.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks ago  dt-bindings: net: cdns,macb: add Mobileye EyeQ5 ethernet interface
Théo Lebrun [Thu, 23 Oct 2025 16:22:51 +0000 (18:22 +0200)] 
dt-bindings: net: cdns,macb: add Mobileye EyeQ5 ethernet interface

Add "cdns,eyeq5-gem" as compatible for the integrated GEM block inside
Mobileye EyeQ5 SoCs. It is different from other compatibles in two main
ways: (1) it requires a generic PHY and (2) it is better to keep TCP
Segmentation Offload (TSO) disabled.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com>
Link: https://patch.msgid.link/20251023-macb-eyeq5-v3-1-af509422c204@bootlin.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks ago  dibs: Use subsys_initcall()
Alexandra Winter [Thu, 23 Oct 2025 15:06:36 +0000 (17:06 +0200)] 
dibs: Use subsys_initcall()

In the case of built-in modules, the order of module_init() calls are
derived from the Makefiles.

Use subsys_initcall() for the dibs module, to make sure dibs_init() is
executed before dibs clients like smc and dibs devices like ism are
initialized. So future dibs client or dibs device modules can use
module_init() without the risk of getting the order in the Makefiles wrong.
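
The change itself is just the initcall level (sketch, body elided):

  static int __init dibs_init(void)
  {
          /* register the dibs class, etc. */
          return 0;
  }
  /* was module_init(dibs_init), i.e. device_initcall level */
  subsys_initcall(dibs_init);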

Reported-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20251023150636.3995476-2-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks ago  dibs: Remove reset of static vars in dibs_init()
Alexandra Winter [Thu, 23 Oct 2025 15:06:35 +0000 (17:06 +0200)] 
dibs: Remove reset of static vars in dibs_init()

'clients' and 'max_client' are static variables and therefore don't need to
be initialized.

Reported-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Link: https://patch.msgid.link/20251023150636.3995476-1-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks ago  Merge branch 'net-mlx5-add-balance-id-support-for-lag-multiplane-groups'
Paolo Abeni [Tue, 28 Oct 2025 10:11:38 +0000 (11:11 +0100)] 
Merge branch 'net-mlx5-add-balance-id-support-for-lag-multiplane-groups'

Tariq Toukan says:

====================
net/mlx5: Add balance ID support for LAG multiplane groups

This series adds balance ID support for MLX5 LAG in multiplane
configurations.

See detailed description by Mark below [1].

[1]
The problem: In complex multiplane LAG setups, we need finer control over LAG
groups. Currently, devices with the same system image GUID are treated
identically, but hardware now supports per-multiplane-group balance IDs that
let us differentiate between them. On such systems the system image GUID
isn't enough to decide which devices should be part of which LAG.

The solution: Extend the system image GUID with a balance ID byte when the
hardware supports it. This gives us the granularity we need without breaking
existing deployments.

What this series does:

1. Clean up some duplicate code while we're here
2. Rework the system image GUID infrastructure to handle variable lengths
3. Update PTP clock pairing to use the new approach
4. Restructure capability setting to make room for the new feature
5. Actually implement the balance ID support

The key insight is in patch 5: we only append the balance ID when both
capabilities are present, so older hardware and software continue to work
exactly as before. For newer setups, you get the extra byte that enables
per-multiplane-group load balancing.

This has been tested with both old and new hardware configurations.
====================

Link: https://patch.msgid.link/1761211020-925651-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks ago  net/mlx5: Add balance ID support for LAG multiplane groups
Mark Bloch [Thu, 23 Oct 2025 09:17:00 +0000 (12:17 +0300)] 
net/mlx5: Add balance ID support for LAG multiplane groups

Implement balance ID support for multiplane LAG configurations. This
feature enables per-multiplane group load balancing by extending the
software system image GUID with a balance ID component.

Key implementations:
- Enable lag_per_mp_group capability when supported by hardware.
- Append load_balance_id to software system image GUID when conditions
  are met.
- Increase MLX5_SW_IMAGE_GUID_MAX_BYTES from 8 to 9 to accommodate the
  extra byte.

The balance ID is appended to the system image GUID only when both
load_balance_id and lag_per_mp_group capabilities are available, ensuring
backward compatibility while enabling enhanced LAG functionality.

This enhancement allows for more granular load balancing control in complex
multi-plane LAG deployments, improving network performance and flexibility.
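
A sketch of the resulting GUID construction (the capability field
accessors and buffer handling below are assumptions based on the
description, not code from the driver):

  u64 guid = mlx5_query_nic_system_image_guid(dev);
  size_t len = sizeof(guid);                      /* 8 bytes */

  memcpy(buf, &guid, len);
  if (MLX5_CAP_GEN_2(dev, load_balance_id) &&
      MLX5_CAP_GEN_2(dev, lag_per_mp_group))
          buf[len++] = MLX5_CAP_GEN_2(dev, load_balance_id);
  /* len is now 8 or 9, bounded by MLX5_SW_IMAGE_GUID_MAX_BYTES */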

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1761211020-925651-6-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks ago  net/mlx5: Refactor HCA cap 2 setting
Mark Bloch [Thu, 23 Oct 2025 09:16:59 +0000 (12:16 +0300)] 
net/mlx5: Refactor HCA cap 2 setting

Refactor HCA capability 2 setting logic to be more structured and
conditional. Move the sw_vhca_id_valid setting inside proper conditional
checks and prepare the function for additional capability settings.

The refactoring:
- Always copy current capabilities to set_hca_cap buffer.
- Apply sw_vhca_id_valid setting only when conditions are met.
- Improve code readability and maintainability.

This cleanup prepares the handle_hca_cap_2() function for the upcoming
balance ID capability setting.

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1761211020-925651-5-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks ago  net/mlx5: Refactor PTP clock devcom pairing
Mark Bloch [Thu, 23 Oct 2025 09:16:58 +0000 (12:16 +0300)] 
net/mlx5: Refactor PTP clock devcom pairing

Refactor PTP clock device component pairing to use the clock identity
buffer instead of casting it to a u64 key. This change leverages the new
software system image GUID infrastructure.

Changes include:
- Pass identity buffer to mlx5_shared_clock_register().
- Use memcpy for identity buffer in devcom matching attributes.
- Remove intermediate u64 key conversion.
- Add BUILD_BUG_ON to ensure identity size fits in match key.

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1761211020-925651-4-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks ago  net/mlx5: Add software system image GUID infrastructure
Mark Bloch [Thu, 23 Oct 2025 09:16:57 +0000 (12:16 +0300)] 
net/mlx5: Add software system image GUID infrastructure

Replace direct hardware system image GUID usage with a new software
system image GUID function that supports variable-length identifiers.

Key changes:
- Add mlx5_query_nic_sw_system_image_guid() function with length parameter.
- Update all callsites to use the new function and buffer/length approach.
- Modify mapping contexts to use byte arrays instead of u64 keys.
- Update devcom matching to support variable-length keys.
- Change mlx5_same_hw_devs() to use buffer comparison instead of u64.

This refactoring prepares the infrastructure for balance ID support,
which requires extending the system image GUID with additional data.
The change maintains backward compatibility while enabling future
enhancements.

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1761211020-925651-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks ago  net/mlx5: Use common mlx5_same_hw_devs function
Mark Bloch [Thu, 23 Oct 2025 09:16:56 +0000 (12:16 +0300)] 
net/mlx5: Use common mlx5_same_hw_devs function

Refactor duplicate hardware device comparison code to use the common
mlx5_same_hw_devs() function instead of reimplementing system GUID
comparison logic in multiple places.

This cleanup eliminates code duplication in:
- Bridge representor device comparison.
- TC hardware device comparison.

Using the centralized function improves maintainability and ensures
consistent behavior across the driver.

Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1761211020-925651-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks ago  Merge branch 'implement-more-features-for-txgbe-devices'
Paolo Abeni [Tue, 28 Oct 2025 09:44:19 +0000 (10:44 +0100)] 
Merge branch 'implement-more-features-for-txgbe-devices'

Jiawen Wu says:

====================
Implement more features for txgbe devices

Based on the features of hardware support, implement RX desc merge and
TX head write-back for AML devices, support RSC offload for AML and SP
devices.
====================

Link: https://patch.msgid.link/20251023014538.12644-1-jiawenwu@trustnetic.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks ago  net: txgbe: support RSC offload
Jiawen Wu [Thu, 23 Oct 2025 01:45:38 +0000 (09:45 +0800)] 
net: txgbe: support RSC offload

Support to enable and disable RSC for txgbe devices.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20251023014538.12644-4-jiawenwu@trustnetic.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks ago  net: txgbe: support TX head write-back mode
Jiawen Wu [Thu, 23 Oct 2025 01:45:37 +0000 (09:45 +0800)] 
net: txgbe: support TX head write-back mode

TX head write-back mode is supported on AML devices. When it is enabled,
the hardware no longer writes the DD bit of each descriptor one by one,
but instead writes back the pointer of the completion descriptor to the
head_wb address.
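
Conceptually, Tx cleanup then becomes (sketch; the ring field names are
assumed):

  u32 head = READ_ONCE(*ring->head_wb);  /* written back by HW via DMA */

  while (ring->next_to_clean != head) {
          /* reclaim the buffer; no per-descriptor DD bit to poll */
          ring->next_to_clean = (ring->next_to_clean + 1) & (ring->count - 1);
  }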

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20251023014538.12644-3-jiawenwu@trustnetic.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks ago  net: txgbe: support RX desc merge mode
Jiawen Wu [Thu, 23 Oct 2025 01:45:36 +0000 (09:45 +0800)] 
net: txgbe: support RX desc merge mode

RX descriptor merge mode is supported on AML devices. When it is
enabled, the hardware processes the RX descriptors in batches.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20251023014538.12644-2-jiawenwu@trustnetic.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks ago  dt-bindings: net: phy: vsc8531: Convert to DT schema
Lad Prabhakar [Sat, 25 Oct 2025 06:48:50 +0000 (07:48 +0100)] 
dt-bindings: net: phy: vsc8531: Convert to DT schema

Convert the VSC8531 Gigabit Ethernet PHY binding to DT schema format.
While at it, add a compatible string for the VSC8541 PHY, which is very
similar to the VSC8531 PHY and is already supported in the kernel. The
VSC8541 PHY is present on the Renesas RZ/T2H EVK.

Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20251025064850.393797-1-prabhakar.mahadev-lad.rj@bp.renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  tcp: remove one ktime_get() from recvmsg() fast path
Eric Dumazet [Fri, 24 Oct 2025 12:07:07 +0000 (12:07 +0000)] 
tcp: remove one ktime_get() from recvmsg() fast path

Each time some payload is consumed by user space (recvmsg() and friends),
TCP calls tcp_rcv_space_adjust() to run DRS algorithm to check
if an increase of sk->sk_rcvbuf is needed.

This function is based on time sampling, and currently calls
tcp_mstamp_refresh(tp), which is a wrapper around ktime_get_ns().

ktime_get_ns() has a high cost on some platforms.
100+ cycles for rdtscp on AMD EPYC Turin for instance.

We do not have to refresh tp->tcp_mstamp; using the last cached value
is enough. We only need to refresh it from __tcp_cleanup_rbuf()
if an ACK must be sent (this is a rare event).
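
Sketch of the idea in __tcp_cleanup_rbuf() (condition condensed):

  if (time_to_ack) {
          /* rare path: pay for ktime_get_ns() only when ACKing */
          tcp_mstamp_refresh(tcp_sk(sk));
          tcp_send_ack(sk);
  }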

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251024120707.3516550-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  net/sched: Remove unused typedef psched_tdiff_t
Yue Haibing [Fri, 24 Oct 2025 02:51:45 +0000 (10:51 +0800)] 
net/sched: Remove unused typedef psched_tdiff_t

Since commit 051d44209842 ("net/sched: Retire CBQ qdisc")
this is not used anymore.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://patch.msgid.link/20251024025145.4069583-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  Merge branch 'sctp-avoid-redundant-initialisation-in-sctp_accept-and-sctp_do_peeloff'
Jakub Kicinski [Tue, 28 Oct 2025 01:05:01 +0000 (18:05 -0700)] 
Merge branch 'sctp-avoid-redundant-initialisation-in-sctp_accept-and-sctp_do_peeloff'

Kuniyuki Iwashima says:

====================
sctp: Avoid redundant initialisation in sctp_accept() and sctp_do_peeloff().

When sctp_accept() and sctp_do_peeloff() allocate a new socket,
somehow sk_alloc() is used, and the new socket goes through full
initialisation, but most of the fields are overwritten later.

  1)
  sctp_accept()
  |- sctp_v[46]_create_accept_sk()
  |  |- sk_alloc()
  |  |- sock_init_data()
  |  |- sctp_copy_sock()
  |  `- newsk->sk_prot->init() / sctp_init_sock()
  |
  `- sctp_sock_migrate()
     `- sctp_copy_descendant(newsk, oldsk)

  sock_init_data() initialises struct sock, but many fields are
  overwritten by sctp_copy_sock(), which inherits fields of struct
  sock and inet_sock from the parent socket.

  sctp_init_sock() fully initialises struct sctp_sock, but later
  sctp_copy_descendant() inherits most fields from the parent's
  struct sctp_sock by memcpy().

  2)
  sctp_do_peeloff()
  |- sock_create()
  |  |
  |  ...
  |      |- sk_alloc()
  |      |- sock_init_data()
  |  ...
  |    `- newsk->sk_prot->init() / sctp_init_sock()
  |
  |- sctp_copy_sock()
  `- sctp_sock_migrate()
     `- sctp_copy_descendant(newsk, oldsk)

  sock_create() creates a brand new socket, but sctp_copy_sock()
  and sctp_sock_migrate() overwrite most of the fields.

So, sk_alloc(), sock_init_data(), sctp_copy_sock(), and
sctp_copy_descendant() can be replaced with a single function
like sk_clone_lock().

This series does the conversion and removes TODO comment added
by commit 4a997d49d92ad ("tcp: Save lock_sock() for memcg in
inet_csk_accept().").

Tested accept() and SCTP_SOCKOPT_PEELOFF and both work properly.

  socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP) = 3
  bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
  listen(3, -1)                           = 0
  getsockname(3, {sa_family=AF_INET, sin_port=htons(49460), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
  socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP) = 4
  connect(4, {sa_family=AF_INET, sin_port=htons(49460), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
  accept(3, NULL, NULL)                   = 5
  ...
  socket(AF_INET, SOCK_SEQPACKET, IPPROTO_SCTP) = 3
  bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
  listen(3, -1)                           = 0
  getsockname(3, {sa_family=AF_INET, sin_port=htons(48240), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
  socket(AF_INET, SOCK_SEQPACKET, IPPROTO_SCTP) = 4
  connect(4, {sa_family=AF_INET, sin_port=htons(48240), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
  getsockopt(3, SOL_SCTP, SCTP_SOCKOPT_PEELOFF, "*\0\0\0\5\0\0\0", [8]) = 5

v1: https://lore.kernel.org/20251021214422.1941691-1-kuniyu@google.com
====================

Link: https://patch.msgid.link/20251023231751.4168390-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  sctp: Remove sctp_copy_sock() and sctp_copy_descendant().
Kuniyuki Iwashima [Thu, 23 Oct 2025 23:16:57 +0000 (23:16 +0000)] 
sctp: Remove sctp_copy_sock() and sctp_copy_descendant().

Now, sctp_accept() and sctp_do_peeloff() use sk_clone(), and
we no longer need sctp_copy_sock() and sctp_copy_descendant().

Let's remove them.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20251023231751.4168390-9-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  sctp: Use sctp_clone_sock() in sctp_do_peeloff().
Kuniyuki Iwashima [Thu, 23 Oct 2025 23:16:56 +0000 (23:16 +0000)] 
sctp: Use sctp_clone_sock() in sctp_do_peeloff().

sctp_do_peeloff() calls sock_create() to allocate and initialise
struct sock, inet_sock, and sctp_sock, but later sctp_copy_sock()
and sctp_sock_migrate() overwrite most fields.

What sctp_do_peeloff() does is more like accept().

Let's use sock_create_lite() and sctp_clone_sock().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20251023231751.4168390-8-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  sctp: Remove sctp_pf.create_accept_sk().
Kuniyuki Iwashima [Thu, 23 Oct 2025 23:16:55 +0000 (23:16 +0000)] 
sctp: Remove sctp_pf.create_accept_sk().

sctp_v[46]_create_accept_sk() are no longer used.

Let's remove sctp_pf.create_accept_sk().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20251023231751.4168390-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  sctp: Use sk_clone() in sctp_accept().
Kuniyuki Iwashima [Thu, 23 Oct 2025 23:16:54 +0000 (23:16 +0000)] 
sctp: Use sk_clone() in sctp_accept().

sctp_accept() calls sctp_v[46]_create_accept_sk() to allocate a new
socket and calls sctp_sock_migrate() to copy fields from the parent
socket to the new socket.

sctp_v4_create_accept_sk() allocates sk by sk_alloc(), initialises
it by sock_init_data(), and copies a bunch of fields from the parent
socket by sctp_copy_sock().

sctp_sock_migrate() calls sctp_copy_descendant() to copy most fields
in sctp_sock from the parent socket by memcpy().

These can be simply replaced by sk_clone().

Let's consolidate sctp_v[46]_create_accept_sk() to sctp_clone_sock()
with sk_clone().

We will reuse sctp_clone_sock() for sctp_do_peeloff() and then remove
sctp_copy_descendant().

Note that sock_reset_flag(newsk, SOCK_ZAPPED) is not copied to
sctp_clone_sock() as sctp does not use SOCK_ZAPPED at all.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20251023231751.4168390-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  net: Add sk_clone().
Kuniyuki Iwashima [Thu, 23 Oct 2025 23:16:53 +0000 (23:16 +0000)] 
net: Add sk_clone().

sctp_accept() will use sk_clone_lock(), but it will be called
with the parent socket locked, and sctp_sock_migrate() acquires the
child lock later.

Let's add a no-lock version of sk_clone_lock().

Note that lockdep complains if we simply use bh_lock_sock_nested().
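
A minimal sketch of the intended split (assumed shape; the actual
refactor may differ):

  struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
  {
          struct sock *newsk = sk_clone(sk, priority);

          if (newsk)
                  bh_lock_sock(newsk);
          return newsk;
  }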

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20251023231751.4168390-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  sctp: Don't call sk->sk_prot->init() in sctp_v[46]_create_accept_sk().
Kuniyuki Iwashima [Thu, 23 Oct 2025 23:16:52 +0000 (23:16 +0000)] 
sctp: Don't call sk->sk_prot->init() in sctp_v[46]_create_accept_sk().

sctp_accept() calls sctp_v[46]_create_accept_sk() to allocate a new
socket and calls sctp_sock_migrate() to copy fields from the parent
socket to the new socket.

sctp_v[46]_create_accept_sk() calls sctp_init_sock() to initialise
sctp_sock, but most fields are overwritten by sctp_copy_descendant()
called from sctp_sock_migrate().

Things done in sctp_init_sock() but not in sctp_sock_migrate() are
the following:

  1. Copy sk->sk_gso
  2. Copy sk->sk_destruct (sctp_v6_init_sock())
  3. Allocate sctp_sock.ep
  4. Initialise sctp_sock.pd_lobby
  5. Count sk_sockets_allocated_inc(), sock_prot_inuse_add(),
     and SCTP_DBG_OBJCNT_INC()

Let's do these in sctp_copy_sock() and sctp_sock_migrate() and avoid
calling sk->sk_prot->init() in sctp_v[46]_create_accept_sk().

Note that sk->sk_destruct is already copied in sctp_copy_sock().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20251023231751.4168390-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  sctp: Don't copy sk_sndbuf and sk_rcvbuf in sctp_sock_migrate().
Kuniyuki Iwashima [Thu, 23 Oct 2025 23:16:51 +0000 (23:16 +0000)] 
sctp: Don't copy sk_sndbuf and sk_rcvbuf in sctp_sock_migrate().

sctp_sock_migrate() is called from 2 places.

1) sctp_accept() calls sp->pf->create_accept_sk() before
   sctp_sock_migrate(), and sp->pf->create_accept_sk() calls
   sctp_copy_sock().

2) sctp_do_peeloff() also calls sctp_copy_sock() before
   sctp_sock_migrate().

sctp_copy_sock() copies sk_sndbuf and sk_rcvbuf from the
parent socket.

Let's not copy the two fields in sctp_sock_migrate().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20251023231751.4168390-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  sctp: Defer SCTP_DBG_OBJCNT_DEC() to sctp_destroy_sock().
Kuniyuki Iwashima [Thu, 23 Oct 2025 23:16:50 +0000 (23:16 +0000)] 
sctp: Defer SCTP_DBG_OBJCNT_DEC() to sctp_destroy_sock().

SCTP_DBG_OBJCNT_INC() is called only when sctp_init_sock()
returns 0 after successfully allocating sctp_sk(sk)->ep.

OTOH, SCTP_DBG_OBJCNT_DEC() is called in sctp_close().

The code seems to expect that the socket is always exposed
to userspace once SCTP_DBG_OBJCNT_INC() is incremented, but
there is a path where the assumption is not true.

In sctp_accept(), sctp_sock_migrate() could fail after
sctp_init_sock().

Then, sk_common_release() does not call inet_release() nor
sctp_close().  Instead, it calls sk->sk_prot->destroy().

Let's move SCTP_DBG_OBJCNT_DEC() from sctp_close() to
sctp_destroy_sock().

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20251023231751.4168390-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  Merge branch 'convert-net-drivers-to-ndo_hwtstamp-api-part-2'
Jakub Kicinski [Tue, 28 Oct 2025 01:04:38 +0000 (18:04 -0700)] 
Merge branch 'convert-net-drivers-to-ndo_hwtstamp-api-part-2'

Vadim Fedorenko says:

====================
convert net drivers to ndo_hwtstamp API part 2

This is part 2 of the patchset to convert drivers which support HW
timestamping to use the .ndo_hwtstamp_get()/.ndo_hwtstamp_set()
callbacks. The new API uses netlink to communicate with user-space and
has some test coverage.
====================

Link: https://patch.msgid.link/20251023220457.3201122-1-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  net: hns3: add hwtstamp_get/hwtstamp_set ops
Vadim Fedorenko [Thu, 23 Oct 2025 22:04:57 +0000 (22:04 +0000)] 
net: hns3: add hwtstamp_get/hwtstamp_set ops

Add .ndo_hwtstamp_get()/.ndo_hwtstamp_set() callbacks to the HNS3
framework to support HW timestamp configuration via netlink, and adapt
hns3pf to use them.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20251023220457.3201122-7-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  net: renesas: rswitch: convert to ndo_hwtstamp API
Vadim Fedorenko [Thu, 23 Oct 2025 22:04:56 +0000 (22:04 +0000)] 
net: renesas: rswitch: convert to ndo_hwtstamp API

Convert the driver to use the .ndo_hwtstamp_set()/.ndo_hwtstamp_get()
callbacks. rswitch_eth_ioctl() becomes pure phy_do_ioctl_running(), so
remove it and replace the .ndo_eth_ioctl callback with
phy_do_ioctl_running().

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20251023220457.3201122-6-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  net: ravb: convert to ndo_hwtstamp API
Vadim Fedorenko [Thu, 23 Oct 2025 22:04:55 +0000 (22:04 +0000)] 
net: ravb: convert to ndo_hwtstamp API

Convert the driver to use the .ndo_hwtstamp_set()/.ndo_hwtstamp_get()
callbacks. ravb_do_ioctl() becomes pure phy_do_ioctl_running(), so
remove it and use phy_do_ioctl_running() in the callbacks.

Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20251023220457.3201122-5-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  ionic: convert to ndo_hwtstamp API
Vadim Fedorenko [Thu, 23 Oct 2025 22:04:54 +0000 (22:04 +0000)] 
ionic: convert to ndo_hwtstamp API

Convert driver to use .ndo_hwtstamp_get()/.ndo_hwtstamp_set() callbacks.
ionic_eth_ioctl() becomes empty, remove it.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20251023220457.3201122-4-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks ago  mlx4: convert to ndo_hwtstamp API
Vadim Fedorenko [Thu, 23 Oct 2025 22:04:53 +0000 (22:04 +0000)] 
mlx4: convert to ndo_hwtstamp API

Convert driver to use .ndo_hwtstamp_get()/.ndo_hwtstamp_set() callbacks.
mlx4_en_ioctl() becomes empty, remove it.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20251023220457.3201122-3-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoocteontx2: convert to ndo_hwtstamp API
Vadim Fedorenko [Thu, 23 Oct 2025 22:04:52 +0000 (22:04 +0000)] 
octeontx2: convert to ndo_hwtstamp API

Convert driver to use .ndo_hwtstamp_get()/.ndo_hwtstamp_set() callbacks.
otx2_ioctl() becomes empty, remove it.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20251023220457.3201122-2-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: airoha: Fix a copy and paste bug in probe()
Dan Carpenter [Fri, 24 Oct 2025 11:23:35 +0000 (14:23 +0300)] 
net: airoha: Fix a copy and paste bug in probe()

This code has a copy and paste bug where it accidentally checks
"if (err)" instead of checking whether "xsi_rsts" is NULL.  Also, as a
free bonus, I changed the allocation from kzalloc() to kcalloc(), which
is a kernel hardening measure to protect against integer overflows.
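
The bug pattern, as a minimal sketch (the variable name is from this
commit; the surrounding shape is assumed, not copied from the driver):

  xsi_rsts = kcalloc(n, sizeof(*xsi_rsts), GFP_KERNEL);
  if (!xsi_rsts)        /* was "if (err)", copied from an earlier check */
          return -ENOMEM;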

Fixes: 5863b4e065e2 ("net: airoha: Add airoha_eth_soc_data struct")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/aPtht6y5DRokn9zv@stanley.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge tag 'batadv-next-pullrequest-20251024' of https://git.open-mesh.org/linux-merge
Jakub Kicinski [Tue, 28 Oct 2025 01:02:38 +0000 (18:02 -0700)] 
Merge tag 'batadv-next-pullrequest-20251024' of https://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This cleanup patchset includes the following patches:

 - bump version strings, by Simon Wunderlich

 - use skb_crc32c() instead of skb_seq_read(), by Sven Eckelmann

* tag 'batadv-next-pullrequest-20251024' of https://git.open-mesh.org/linux-merge:
  batman-adv: use skb_crc32c() instead of skb_seq_read()
  batman-adv: Start new development cycle
====================

Link: https://patch.msgid.link/20251024092315.232636-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'phy-mscc-fix-ptp-for-vsc8574-and-vsc8572'
Jakub Kicinski [Tue, 28 Oct 2025 00:58:05 +0000 (17:58 -0700)] 
Merge branch 'phy-mscc-fix-ptp-for-vsc8574-and-vsc8572'

Horatiu Vultur says:

====================
phy: mscc: Fix PTP for VSC8574 and VSC8572

The first patch updates the PHYs VSC8584, VSC8582, VSC8575 and VSC856X
to use PHY_ID_MATCH_EXACT, because only rev B exists for these PHYs.
For the PHYs VSC8574 and VSC8572, however, revisions A, B, C, D and E
exist. This is just a preparation for the second patch, allowing VSC8574
and VSC8572 to use the function vsc8584_probe().

We want to use vsc8584_probe() for VSC8574 and VSC8572 because this
function does the correct PTP initialization. This change is in the second
patch.
====================

Link: https://patch.msgid.link/20251023191350.190940-1-horatiu.vultur@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agophy: mscc: Fix PTP for VSC8574 and VSC8572
Horatiu Vultur [Thu, 23 Oct 2025 19:13:50 +0000 (21:13 +0200)] 
phy: mscc: Fix PTP for VSC8574 and VSC8572

The PTP initialization is two-step. The first part consists of the
functions vsc8584_ptp_probe_once() and vsc8584_ptp_probe(), run at probe
time, which initialize the locks and queues and create the PTP device.
The second part is the function vsc8584_ptp_init(), run at config_init()
time, which initializes PTP in the HW.

For VSC8574 and VSC8572, the PTP initialization is incomplete: it is
missing the first part but performs the second part. This means that
ptp_clock_register() is never called.

Without the first part there is no crash when enabling PTP, but the
behaviour is unexpected, because some PHYs have their PTP functionality
exposed by the driver and some don't, even though they share the same
PTP clock.

Fixes: 774626fa440e ("net: phy: mscc: Add PTP support for 2 more VSC PHYs")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://patch.msgid.link/20251023191350.190940-3-horatiu.vultur@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agophy: mscc: Use PHY_ID_MATCH_EXACT for VSC8584, VSC8582, VSC8575, VSC856X
Horatiu Vultur [Thu, 23 Oct 2025 19:13:49 +0000 (21:13 +0200)] 
phy: mscc: Use PHY_ID_MATCH_EXACT for VSC8584, VSC8582, VSC8575, VSC856X

As the PHYs VSC8584, VSC8582, VSC8575 and VSC856X exist only as rev B,
we can use PHY_ID_MATCH_EXACT to match exactly on revision B of the PHY.
With this change there is no need to check in vsc8584_probe() whether
the PHY is a revision other than rev B, as we already know that this can
never happen.
These changes are a preparation for the next patch, in which the PHYs
VSC8574 and VSC8572 will be made to use vsc8584_probe(); those PHYs have
multiple revisions.

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://patch.msgid.link/20251023191350.190940-2-horatiu.vultur@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoselftests: bridge_mdb: Add a test for MDB flush on snooping disable
Petr Machata [Thu, 23 Oct 2025 14:45:38 +0000 (16:45 +0200)] 
selftests: bridge_mdb: Add a test for MDB flush on snooping disable

Check that non-permanent MDB entries are removed as IGMP / MLD snooping is
disabled.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/9420dfbcf26c8e1134d31244e9e7d6a49d677a69.1761228273.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: bridge: Flush multicast groups when snooping is disabled
Petr Machata [Thu, 23 Oct 2025 14:45:37 +0000 (16:45 +0200)] 
net: bridge: Flush multicast groups when snooping is disabled

When forwarding multicast packets, the bridge takes MDB into account when
IGMP / MLD snooping is enabled. Currently, when snooping is disabled, the
MDB is retained, even though it is not used anymore.

At the same time, while snooping is disabled, the IGMP / MLD control
packets are obviously ignored, and after snooping is reenabled, the
administrator has to assume it is out of sync. In particular, missed
join and leave messages would lead to traffic being forwarded to wrong
interfaces.

Keeping the MDB entries around thus serves no purpose, and just takes
memory. Note also that disabling per-VLAN snooping does actually flush the
relevant MDB entries.

This patch flushes non-permanent MDB entries as global snooping is
disabled.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/5e992df1bb93b88e19c0ea5819e23b669e3dde5d.1761228273.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoselftests: tls: add tls record_size_limit test
Wilfred Mallawa [Wed, 22 Oct 2025 00:19:37 +0000 (10:19 +1000)] 
selftests: tls: add tls record_size_limit test

Test that outgoing plaintext records respect the TLS_TX_MAX_PAYLOAD_LEN
limit set using setsockopt(). The limit is set to 128, so the plaintext
in every received record must not exceed this amount.

Also test that setting a new record size limit whilst a pending open
record exists is handled correctly by discarding the request.

Suggested-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20251022001937.20155-2-wilfred.opensource@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet/tls: support setting the maximum payload size
Wilfred Mallawa [Wed, 22 Oct 2025 00:19:36 +0000 (10:19 +1000)] 
net/tls: support setting the maximum payload size

During a handshake, an endpoint may specify a maximum record size limit.
Currently, the kernel defaults to TLS_MAX_PAYLOAD_SIZE (16KB) for the
maximum record size. This means that outgoing records from the kernel
can exceed a lower size negotiated during the handshake. In such a case,
the TLS endpoint must send a fatal "record_overflow" alert [1], and the
record is thus discarded.

Upcoming Western Digital NVMe-TCP hardware controllers implement TLS
support. For these devices, supporting TLS record size negotiation is
necessary because the maximum TLS record size supported by the controller
is less than the default 16KB currently used by the kernel.

Currently, there is no way to inform the kernel of such a limit. This patch
adds support for a new setsockopt() option `TLS_TX_MAX_PAYLOAD_LEN` that
allows for setting the maximum plaintext fragment size. Once set, outgoing
records are no larger than the size specified. This option can be used to
specify the record size limit.

[1] https://www.rfc-editor.org/rfc/rfc8449
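
A minimal user-space sketch of applying the new option (hypothetical
usage: it assumes a socket that already has the TLS ULP and TLS_TX
crypto state installed, and headers that define SOL_TLS and the new
option):

  #include <sys/socket.h>
  #include <linux/tls.h>

  unsigned int limit = 128; /* plaintext bytes per record, as in the selftest */

  if (setsockopt(sock, SOL_TLS, TLS_TX_MAX_PAYLOAD_LEN,
                 &limit, sizeof(limit)))
          perror("setsockopt(TLS_TX_MAX_PAYLOAD_LEN)");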

Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20251022001937.20155-1-wilfred.opensource@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoMerge branch 'dwmac-support-for-rockchip-rk3506'
Jakub Kicinski [Sat, 25 Oct 2025 02:07:48 +0000 (19:07 -0700)] 
Merge branch 'dwmac-support-for-rockchip-rk3506'

Heiko Stuebner says:

====================
DWMAC support for Rockchip RK3506

Some cleanups to the DT binding for Rockchip variants of the dwmac,
adding the RK3506 support on top, as well as the driver glue needed
for setting up the correct RMII speed settings.
====================

Link: https://patch.msgid.link/20251023111213.298860-1-heiko@sntech.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoMAINTAINERS: add dwmac-rk glue driver to the main Rockchip entry
Heiko Stuebner [Thu, 23 Oct 2025 11:12:12 +0000 (13:12 +0200)] 
MAINTAINERS: add dwmac-rk glue driver to the main Rockchip entry

The dwmac-rk glue driver is currently not caught by the general maintainer
entry for Rockchip SoCs, so add it explicitly, similar to the i2c driver.

The binding document in net/rockchip-dwmac.yaml already gets caught by
the wildcard match.

Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251023111213.298860-6-heiko@sntech.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoethernet: stmmac: dwmac-rk: Add RK3506 GMAC support
David Wu [Thu, 23 Oct 2025 11:12:11 +0000 (13:12 +0200)] 
ethernet: stmmac: dwmac-rk: Add RK3506 GMAC support

Add the needed glue blocks for the RK3506-specific setup.

The RK3506 dwmac only supports up to 100Mbit with an RMII PHY,
not RGMII.

Signed-off-by: David Wu <david.wu@rock-chips.com>
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Link: https://patch.msgid.link/20251023111213.298860-5-heiko@sntech.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agodt-bindings: net: rockchip-dwmac: Add compatible string for RK3506
Heiko Stuebner [Thu, 23 Oct 2025 11:12:10 +0000 (13:12 +0200)] 
dt-bindings: net: rockchip-dwmac: Add compatible string for RK3506

Rockchip RK3506 has two Ethernet controllers based on Synopsys DWC
Ethernet QoS IP.

Add compatible string for the RK3506 variant.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Link: https://patch.msgid.link/20251023111213.298860-4-heiko@sntech.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agodt-bindings: net: snps,dwmac: Sync list of Rockchip compatibles
Heiko Stuebner [Thu, 23 Oct 2025 11:12:09 +0000 (13:12 +0200)] 
dt-bindings: net: snps,dwmac: Sync list of Rockchip compatibles

A number of dwmac variants from Rockchip SoCs have turned up in the
Rockchip-specific binding, but not in the main list in snps,dwmac.yaml,
which, as the comment there indicates, is needed for accurate matching.

So add the missing rk3528, rk3568 and rv1126 to the main list.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Link: https://patch.msgid.link/20251023111213.298860-3-heiko@sntech.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agodt-bindings: net: snps,dwmac: move rk3399 line to its correct position
Heiko Stuebner [Thu, 23 Oct 2025 11:12:08 +0000 (13:12 +0200)] 
dt-bindings: net: snps,dwmac: move rk3399 line to its correct position

Move the rk3399 compatible to its alphabetically correct position.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Link: https://patch.msgid.link/20251023111213.298860-2-heiko@sntech.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoMerge branch 'net-ravb-soc-specific-configuration'
Jakub Kicinski [Sat, 25 Oct 2025 02:04:36 +0000 (19:04 -0700)] 
Merge branch 'net-ravb-soc-specific-configuration'

Lad Prabhakar says:

====================
net: ravb: SoC-specific configuration

This series addresses several issues in the Renesas Ethernet AVB (ravb)
driver related to SoC-specific resource configuration.

The series includes the following changes:

- Make DBAT entry count configurable per SoC
The number of descriptor base address table (DBAT) entries is not uniform
across all SoCs. Pass this information via the hardware info structure and
allocate resources accordingly.

- Allocate correct number of queues based on SoC support
Use the per-SoC configuration to determine whether a network control queue
is available, and allocate queues dynamically to match the SoC's
capability.

v2: https://lore.kernel.org/20251017151830.171062-1-prabhakar.mahadev-lad.rj@bp.renesas.com
====================

Link: https://patch.msgid.link/20251023112111.215198-1-prabhakar.mahadev-lad.rj@bp.renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: ravb: Allocate correct number of queues based on SoC support
Lad Prabhakar [Thu, 23 Oct 2025 11:21:11 +0000 (12:21 +0100)] 
net: ravb: Allocate correct number of queues based on SoC support

Use the per-SoC match data flag `nc_queues` to decide how many TX/RX
queues to allocate. If the SoC does not provide a network-control queue,
fall back to a single TX/RX queue. Obtain the match data before calling
alloc_etherdev_mqs() so the allocation is sized correctly.
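
A rough sketch of the described flow (field and constant names assumed,
not taken verbatim from the driver):

  info = of_device_get_match_data(&pdev->dev);
  num_q = info->nc_queues ? NUM_RX_QUEUE : 1;
  ndev = alloc_etherdev_mqs(sizeof(struct ravb_private), num_q, num_q);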

Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Link: https://patch.msgid.link/20251023112111.215198-3-prabhakar.mahadev-lad.rj@bp.renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: ravb: Make DBAT entry count configurable per-SoC
Lad Prabhakar [Thu, 23 Oct 2025 11:21:10 +0000 (12:21 +0100)] 
net: ravb: Make DBAT entry count configurable per-SoC

Avoid wasting coherent DMA memory by allocating the descriptor base
address table sized for the actual number of DBAT/CDARq entries supported
by the SoC. Some platforms (for example GBETH) only provide two CDARq
entries; previously the driver always allocated space for 22 entries which
needlessly consumed memory on those systems.

Pass the per-SoC dbat_entry_num via struct ravb_hw_info and use it for
allocation and initialization in probe. This sizes the table correctly and
removes the unnecessary memory overhead on SoCs with fewer DBAT entries.
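
A condensed sketch of the allocation change (names follow the commit
description; the exact code may differ):

  /* Size the table for this SoC's DBAT entry count (e.g. 2 on GBETH)
   * instead of a fixed 22 entries. */
  priv->desc_bat_size = sizeof(struct ravb_desc) * info->dbat_entry_num;
  priv->desc_bat = dma_alloc_coherent(ndev->dev.parent,
                                      priv->desc_bat_size,
                                      &priv->desc_bat_dma, GFP_KERNEL);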

Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Link: https://patch.msgid.link/20251023112111.215198-2-prabhakar.mahadev-lad.rj@bp.renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: usb: usbnet: coding style for functions
Oliver Neukum [Thu, 23 Oct 2025 10:00:19 +0000 (12:00 +0200)] 
net: usb: usbnet: coding style for functions

Functions are not to have blanks between names
and parameter lists. Remove them.

Signed-off-by: Oliver Neukum <oneukum@suse.com>
Link: https://patch.msgid.link/20251023100136.909118-1-oneukum@suse.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoMerge branch 'net-stmmac-pcs-support-part-2'
Jakub Kicinski [Sat, 25 Oct 2025 01:56:37 +0000 (18:56 -0700)] 
Merge branch 'net-stmmac-pcs-support-part-2'

Russell King says:

====================
net: stmmac: pcs support part 2

This is the next part of stmmac PCS support. Not much here, other than
dealing with what remains of the interrupts: the PCS AN complete and
PCS Link interrupts, which are simply cleared and accounted for.

Currently, they are enabled at core init time, but if we have an
implementation that supports multiple PHY interfaces, we want to
enable only the appropriate interrupts.

I also noticed that stmmac_fpe_configure_pmac() also modifies the
interrupt mask at run time. As a prerequisite, we need a way to
ensure that we don't have different threads modifying the interrupt
settings at the same time. So, the first patch introduces a new
function and a spinlock which must be held when manipulating the
interrupt enable/mask state.

The second patch adds the PCS bits for enabling the PCS AN and PCS
link interrupts when the PCS is in-use.
====================

Link: https://patch.msgid.link/aPn5YVeUcWo4CW3c@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: stmmac: add support for controlling PCS interrupts
Russell King (Oracle) [Thu, 23 Oct 2025 09:46:25 +0000 (10:46 +0100)] 
net: stmmac: add support for controlling PCS interrupts

Add support to the PCS instance for controlling the PCS interrupts
depending on whether the PCS is used.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vBrtp-0000000BMYs-3bhI@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: stmmac: add stmmac_mac_irq_modify()
Russell King (Oracle) [Thu, 23 Oct 2025 09:46:20 +0000 (10:46 +0100)] 
net: stmmac: add stmmac_mac_irq_modify()

Add a function to allow interrupts to be enabled and disabled in a
core-independent manner.
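
The shape of such a helper might look like the following sketch (all
names other than stmmac_mac_irq_modify() are assumed for illustration):

  void stmmac_mac_irq_modify(struct stmmac_priv *priv, u32 disable,
                             u32 enable)
  {
          unsigned long flags;

          /* Serialize against run-time callers, e.g. the FPE code,
           * that touch the same interrupt-enable state. */
          spin_lock_irqsave(&priv->irq_en_lock, flags);
          priv->irq_en = (priv->irq_en & ~disable) | enable;
          /* ...write priv->irq_en to the core's interrupt-enable
           * register via the core-specific callback... */
          spin_unlock_irqrestore(&priv->irq_en_lock, flags);
  }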

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vBrtk-0000000BMYm-3CV5@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoMerge branch 'net-add-phylink-managed-wol-and-convert-stmmac'
Jakub Kicinski [Sat, 25 Oct 2025 01:52:09 +0000 (18:52 -0700)] 
Merge branch 'net-add-phylink-managed-wol-and-convert-stmmac'

Russell King says:

====================
net: add phylink managed WoL and convert stmmac

This series implements the thoughts of Andrew, Florian and myself
to improve the quality of Wake-on-LAN (WoL) implementations.

This changes nothing for MAC drivers that do not wish to participate in
this, but if they do, then they gain the benefit of phylink configuring
WoL at the point closest to the media as possible.

We first need to solve the problem that the multitude of PHY drivers
report their device supports WoL, but are not capable of waking the
system. Correcting this is fundamental to choosing where WoL should be
enabled - a mis-reported WoL support can render WoL completely
ineffective.

The only PHY driver which uses the driver model's wakeup support is
drivers/net/phy/broadcom.c (until recently, realtek did too). This means
we have the opportunity for PHY drivers to be _correctly_ converted
to use this method of signalling wake-up capability only when they can
actually wake the system, and thus providing a way for phylink to
know whether to use PHY-based WoL at all.

However, a PHY driver not implementing that logic doesn't become a
blocker to MACs wanting to convert. In full, the logic is:

- phylink supports a flag, wol_phy_legacy, which forces phylink to use
  the PHY-based WoL even if the MDIO device is not marked as wake-up
  capable.

- when wol_phy_legacy is not set, we check whether the PHY MDIO device
  is wake-up capable. If it is, we offer the WoL request to the PHY.

- if wol_phy_legacy is not set and the PHY is not wake-up capable,
  we do not offer the WoL request to the PHY.

In both cases, after setting any PHY based WoL, we remove the options
that the PHY now reports are enabled from the options mask, and offer
these (if any) to the MAC. The MAC will get a .mac_wol_set() method
call when any settings change.

Phylink maintains the WoL state for the MAC, so there's no need for
a .mac_wol_get() method. There may be a need to set the initial
state, but this is not supported at present.

I've also added support for doing the PHY speed-up/speed-down at
suspend/resume time depending on the WoL state, which takes another
issue from the MAC authors.

Lastly, with phylink now having the full picture for WoL, the
"mac_wol" argument for phylink_suspend() becomes redundant, and for
MAC drivers that implement .mac_wol_set(), the value passed becomes
irrelevant.
====================

Link: https://patch.msgid.link/aPnyW54J80h9DmhB@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: stmmac: convert to phylink managed WoL PHY speed
Russell King (Oracle) [Thu, 23 Oct 2025 09:16:55 +0000 (10:16 +0100)] 
net: stmmac: convert to phylink managed WoL PHY speed

Convert stmmac to use phylink's management of the PHY speed when
Wake-on-LAN is enabled.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vBrRH-0000000BLzm-3JjF@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: stmmac: convert to phylink-managed Wake-on-Lan
Russell King (Oracle) [Thu, 23 Oct 2025 09:16:50 +0000 (10:16 +0100)] 
net: stmmac: convert to phylink-managed Wake-on-Lan

Convert stmmac to use phylink-managed Wake-on-LAN support. To achieve
this, we implement the .mac_wol_set() method, which simply configures
the driver model's struct device wakeup for stmmac, and sets the
priv->wolopts appropriately.

When STMMAC_FLAG_USE_PHY_WOL is set, in the stmmac world this means to
only use the PHY's WoL support and ignore the MAC's WoL capabilities.
To preserve this behaviour, we enable phylink's legacy mode, and avoid
telling phylink that the MAC has any WoL support. This achieves the
same functionality for this case.

When STMMAC_FLAG_USE_PHY_WOL is not set, we provide the MAC's WoL
capabilities to phylink, which then allows phylink to choose between
the PHY and MAC for WoL depending on their individual capabilities
as described in the phylink commit. This only augments the WoL
functionality with PHYs that declare to the driver model that they are
wake-up capable. Currently, very few PHY drivers support this.
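
A minimal sketch of the described .mac_wol_set() implementation (the
exact signature and field names are assumed for illustration):

  static int stmmac_mac_wol_set(struct phylink_config *config, u32 wolopts)
  {
          struct stmmac_priv *priv = netdev_priv(to_net_dev(config->dev));

          /* Record the requested WoL state and arm driver-model wakeup. */
          device_set_wakeup_enable(priv->device, !!wolopts);
          priv->wolopts = wolopts;

          return 0;
  }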

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vBrRC-0000000BLzg-2tA4@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: phylink: add phylink managed wake-on-lan PHY speed control
Russell King (Oracle) [Thu, 23 Oct 2025 09:16:45 +0000 (10:16 +0100)] 
net: phylink: add phylink managed wake-on-lan PHY speed control

Some drivers, e.g. stmmac, use the speed_up()/speed_down() APIs to
gain additional power saving during Wake-on-LAN where the PHY is
managing the state.

Add support to phylink for this, which can be enabled by the MAC
driver. Only change the PHY speed if the PHY is configured for
wake-up and there is no wake-up on the MAC side, as a MAC-side
wake-up would mean changing the configuration once the negotiation
has completed.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vBrR7-0000000BLza-2PjK@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: phylink: add phylink managed MAC Wake-on-Lan support
Russell King (Oracle) [Thu, 23 Oct 2025 09:16:40 +0000 (10:16 +0100)] 
net: phylink: add phylink managed MAC Wake-on-Lan support

Add core phylink managed Wake-on-LAN support, which is enabled when the
MAC driver fills in the new .mac_wol_set() method that this commit
creates.

When this feature is disabled, phylink acts as it has in the past,
merely passing the ethtool WoL calls to phylib whenever a PHY exists.
No other new functionality provided by this commit is enabled.

When this feature is enabled, a more intelligent approach is used.
Phylink will first pass WoL options to the PHY, read them back, and
attempt to set at the MAC any options that were not set at the PHY.

Since we have PHY drivers that report they support WoL, and accept WoL
configuration even though they aren't wired up to be capable of waking
the system, we need a way to differentiate between PHYs that think
they support WoL and those which actually do. As PHY drivers do not
make use of the driver model's wake-up infrastructure, but could, we
use this to determine whether PHY drivers can participate. This gives
a path forward where, as MAC drivers are converted to this, it
encourages PHY drivers to also be converted.

Phylink will also ignore the mac_wol argument to phylink_suspend() as
it now knows the WoL state at the MAC.

MAC drivers are expected to record/configure the Wake-on-LAN state in
their .mac_wol_set() method, and deal appropriately with it in their
suspend/resume methods. The driver model provides assistance to set the
IRQ wake support which may assist driver authors in achieving the
necessary configuration.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vBrR2-0000000BLzU-1xYL@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: phy: add phy_may_wakeup()
Russell King (Oracle) [Thu, 23 Oct 2025 09:16:35 +0000 (10:16 +0100)] 
net: phy: add phy_may_wakeup()

Add phy_may_wakeup(), which uses the driver model's device_may_wakeup()
when the PHY driver has marked the device as wakeup capable in the
driver model, and otherwise falls back to phy_drv_wol_enabled().

Replace the sites that used to call phy_drv_wol_enabled() with this, as
checking the driver model is more efficient than checking the WoL
state.

Export phy_may_wakeup() so that phylink can use it.
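
In sketch form (behaviour as described above; the exact body may
differ):

  bool phy_may_wakeup(struct phy_device *phydev)
  {
          /* Prefer the driver model's view when the driver declared
           * wake-up capability; otherwise fall back to the WoL state. */
          if (phy_can_wakeup(phydev))
                  return device_may_wakeup(&phydev->mdio.dev);

          return phy_drv_wol_enabled(phydev);
  }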

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vBrQx-0000000BLzO-1RLt@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: phy: add phy_can_wakeup()
Russell King (Oracle) [Thu, 23 Oct 2025 09:16:30 +0000 (10:16 +0100)] 
net: phy: add phy_can_wakeup()

Add phy_can_wakeup() to report whether the PHY driver has marked the
PHY device as being wake-up capable as far as the driver model is
concerned.
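
Conceptually this is a thin wrapper over the driver model (a sketch):

  bool phy_can_wakeup(struct phy_device *phydev)
  {
          return device_can_wakeup(&phydev->mdio.dev);
  }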

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vBrQs-0000000BLzI-0w3U@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agosmc: rename smc_find_ism_store_rc to reflect broader usage
Dust Li [Thu, 23 Oct 2025 02:00:12 +0000 (10:00 +0800)] 
smc: rename smc_find_ism_store_rc to reflect broader usage

The function smc_find_ism_store_rc() is used to record the reason
why a suitable device (either ISM or RDMA) could not be found.
However, its name suggests it is ISM-specific, which is misleading.

Rename it to better reflect its actual usage.

No functional changes.

Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251023020012.69609-1-dust.li@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agostrparser: fix typo in comment
Julia Lawall [Thu, 23 Oct 2025 01:30:51 +0000 (03:30 +0200)] 
strparser: fix typo in comment

The name frags_list doesn't appear in the kernel.
It should be frag_list as in the next sentence.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Link: https://patch.msgid.link/20251023013051.1728388-1-Julia.Lawall@inria.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoselftest: net: prevent use of uninitialized variable
Alessandro Zanni [Thu, 23 Oct 2025 20:53:52 +0000 (22:53 +0200)] 
selftest: net: prevent use of uninitialized variable

Avoid using the `ret` variable uninitialized in the following macro
expansions.

It solves the following warning:

In file included from netlink-dumps.c:21:
netlink-dumps.c: In function ‘dump_extack’:
../kselftest_harness.h:788:35: warning: ‘ret’ may be used uninitialized [-Wmaybe-uninitialized]
  788 |                         intmax_t  __exp_print = (intmax_t)__exp; \
      |                                   ^~~~~~~~~~~
../kselftest_harness.h:631:9: note: in expansion of macro ‘__EXPECT’
  631 |         __EXPECT(expected, #expected, seen, #seen, ==, 0)
      |         ^~~~~~~~
netlink-dumps.c:169:9: note: in expansion of macro ‘EXPECT_EQ’
  169 |         EXPECT_EQ(ret, FOUND_EXTACK);
      |         ^~~~~~~~~

The issue can be reproduced by building the tests with the command:
make -C tools/testing/selftests TARGETS=net

Signed-off-by: Alessandro Zanni <alessandro.zanni87@gmail.com>
Link: https://patch.msgid.link/20251023205354.28249-1-alessandro.zanni87@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoMerge branch 'neighbour-convert-rtm_getneightbl-and-rtm_setneightbl-to-rcu'
Jakub Kicinski [Sat, 25 Oct 2025 00:57:27 +0000 (17:57 -0700)] 
Merge branch 'neighbour-convert-rtm_getneightbl-and-rtm_setneightbl-to-rcu'

Kuniyuki Iwashima says:

====================
neighbour: Convert RTM_GETNEIGHTBL and RTM_SETNEIGHTBL to RCU.

Patches 1 & 2 are prep for the RCU conversion of RTM_GETNEIGHTBL.

Patches 3 & 4 convert RTM_GETNEIGHTBL and RTM_SETNEIGHTBL to RCU.

Patch 5 converts the neighbour table rwlock to a plain spinlock.
====================

Link: https://patch.msgid.link/20251022054004.2514876-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoneighbour: Convert rwlock of struct neigh_table to spinlock.
Kuniyuki Iwashima [Wed, 22 Oct 2025 05:39:49 +0000 (05:39 +0000)] 
neighbour: Convert rwlock of struct neigh_table to spinlock.

Only neigh_for_each() and neigh_seq_start/stop() are on the
reader side of neigh_table.lock.

Let's convert the rwlock to a plain spinlock.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251022054004.2514876-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoneighbour: Convert RTM_SETNEIGHTBL to RCU.
Kuniyuki Iwashima [Wed, 22 Oct 2025 05:39:48 +0000 (05:39 +0000)] 
neighbour: Convert RTM_SETNEIGHTBL to RCU.

neightbl_set() fetches neigh_tables[] and updates attributes under
write_lock_bh(&tbl->lock), so RTNL is not needed.

neigh_table_clear() synchronises RCU only, and rcu_dereference_rtnl()
protects nothing here.

If we released RCU after fetching neigh_tables[], there would be no
synchronisation to block neigh_table_clear() further, so RCU is held
until the end of the function.

Another option would be to protect neigh_tables[] users with SRCU
and add synchronize_srcu() in neigh_table_clear().

But, holding RCU should be fine as we hold write_lock_bh() for the
rest of neightbl_set() anyway.

Let's perform RTM_SETNEIGHTBL under RCU and drop RTNL.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251022054004.2514876-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoneighbour: Convert RTM_GETNEIGHTBL to RCU.
Kuniyuki Iwashima [Wed, 22 Oct 2025 05:39:47 +0000 (05:39 +0000)] 
neighbour: Convert RTM_GETNEIGHTBL to RCU.

neightbl_dump_info() calls these functions for each neigh_tables[]
entry:

  1. neightbl_fill_info() for tbl->parms
  2. neightbl_fill_param_info() for tbl->parms_list (except tbl->parms)

Both functions rely on the table lock (read_lock_bh(&tbl->lock))
and RTNL is not needed.

Let's fetch the table under RCU and convert RTM_GETNEIGHTBL to RCU.

Note that the first entry of tbl->parms_list is tbl->parms.list and
embedded in neigh_table, so list_next_entry() is safe.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251022054004.2514876-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoneighbour: Annotate access to neigh_parms fields.
Kuniyuki Iwashima [Wed, 22 Oct 2025 05:39:46 +0000 (05:39 +0000)] 
neighbour: Annotate access to neigh_parms fields.

NEIGH_VAR() is read locklessly in the fast path, and IPv6 ndisc uses
NEIGH_VAR_SET() locklessly.

The next patch will convert neightbl_dump_info() to RCU.

Let's annotate accesses to neigh_parms with READ_ONCE() and WRITE_ONCE().

Note that ndisc_ifinfo_sysctl_change() uses &NEIGH_VAR() and we cannot
use '&' with READ_ONCE(), so NEIGH_VAR_PTR() is introduced.

Note also that NEIGH_VAR_INIT() does not need WRITE_ONCE() as it is before
parms is published.  Also, the only user hippi_neigh_setup_dev() is no
longer called since commit e3804cbebb67 ("net: remove COMPAT_NET_DEV_OPS"),
which looks wrong, but probably no one uses HIPPI and RoadRunner.
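
The change can be pictured on the accessor macros (a sketch; see
include/net/neighbour.h for the real definitions):

  /* Lockless readers now get a tear-free load... */
  #define NEIGH_VAR(p, attr)     READ_ONCE((p)->data[NEIGH_VAR_ ## attr])
  /* ...and '&NEIGH_VAR()' callers take the pointer without READ_ONCE(). */
  #define NEIGH_VAR_PTR(p, attr) (&(p)->data[NEIGH_VAR_ ## attr])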

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251022054004.2514876-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoneighbour: Use RCU list helpers for neigh_parms.list writers.
Kuniyuki Iwashima [Wed, 22 Oct 2025 05:39:45 +0000 (05:39 +0000)] 
neighbour: Use RCU list helpers for neigh_parms.list writers.

We will convert RTM_GETNEIGHTBL to RCU soon, where we traverse
tbl->parms_list under RCU in neightbl_dump_info().

Let's use RCU list helper for neigh_parms in neigh_parms_alloc()
and neigh_parms_release().

neigh_table_init() uses the plain list_add() for the default
neigh_parm that is embedded in the table and not yet published.

Note that neigh_parms_release() already uses call_rcu() to free
neigh_parms.
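
In sketch form, the writer-side change amounts to:

  /* neigh_parms_alloc(): publish the new entry with an RCU helper. */
  list_add_rcu(&p->list, &tbl->parms_list);

  /* neigh_parms_release(): unlink; freeing already goes via call_rcu(). */
  list_del_rcu(&parms->list);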

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251022054004.2514876-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoice: remove duplicate call to ice_deinit_hw() on error paths
Przemek Kitszel [Fri, 12 Sep 2025 13:06:27 +0000 (15:06 +0200)] 
ice: remove duplicate call to ice_deinit_hw() on error paths

The current unwinding code on the error paths of ice_devlink_reinit_up()
and ice_probe() has a manual call to ice_deinit_hw() (which is good, as
there is also a manual call to ice_init_hw() there), which is then
duplicated (as it was even prior to this series) in ice_deinit_dev().

Fix the above by removing ice_deinit_hw() from ice_deinit_dev().
Add a (now missing) call in ice_remove().

Reported-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/intel-wired-lan/20250717-jk-ddp-safe-mode-issue-v1-1-e113b2baed79@intel.com/
Fixes: 4d3f59bfa2cd ("ice: split ice_init_hw() out from ice_init_dev()")
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
7 weeks agoice: move ice_deinit_dev() to the end of deinit paths
Przemek Kitszel [Fri, 12 Sep 2025 13:06:26 +0000 (15:06 +0200)] 
ice: move ice_deinit_dev() to the end of deinit paths

ice_deinit_dev() takes care of turning off adminq processing, which is
much needed during driver teardown (remove, reset, error path). Move it
to the very end where applicable.
For example, ice_deinit_hw() called after adminq deinit slows rmmod on
my two-card setup by about 60 seconds.

ice_init_dev() and ice_deinit_dev() scopes were reduced by previous
commits of the series, with a final touch of extracting ice_init_dev_hw()
out now (there is no deinit counterpart).

Note that the ice_service_task_stop() call removed from ice_remove() is
placed in ice_deinit_dev() (stopping twice makes no sense).

Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
7 weeks agoice: extract ice_init_dev() from ice_init()
Przemek Kitszel [Fri, 12 Sep 2025 13:06:25 +0000 (15:06 +0200)] 
ice: extract ice_init_dev() from ice_init()

Extract ice_init_dev() from ice_init(), to allow service task and IRQ
scheme teardown to be put after clearing SW constructs in the subsequent
commit.

Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
7 weeks agoice: move ice_init_pf() out of ice_init_dev()
Przemek Kitszel [Fri, 12 Sep 2025 13:06:24 +0000 (15:06 +0200)] 
ice: move ice_init_pf() out of ice_init_dev()

Move ice_init_pf() out of ice_init_dev().
Do the same for deinit counterpart.

Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
7 weeks agoice: move udp_tunnel_nic and misc IRQ setup into ice_init_pf()
Przemek Kitszel [Fri, 12 Sep 2025 13:06:23 +0000 (15:06 +0200)] 
ice: move udp_tunnel_nic and misc IRQ setup into ice_init_pf()

Move udp_tunnel_nic setup and ice_req_irq_msix_misc() call into
ice_init_pf(), remove some redundancy in the former while moving.

Move the ice_free_irq_msix_misc() call into ice_deinit_pf(), to mirror
the above in terms of needed cleanup. Guard it with an emptiness check
to keep allowing a half-initialized pf to be cleaned up.

Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
7 weeks agoice: ice_init_pf: destroy mutexes and xarrays on memory alloc failure
Przemek Kitszel [Fri, 12 Sep 2025 13:06:22 +0000 (15:06 +0200)] 
ice: ice_init_pf: destroy mutexes and xarrays on memory alloc failure

Unroll actions of ice_init_pf() when it fails.
ice_deinit_pf() happens to be perfect to call here.

Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
7 weeks agoice: move ice_init_interrupt_scheme() prior ice_init_pf()
Przemek Kitszel [Fri, 12 Sep 2025 13:06:21 +0000 (15:06 +0200)] 
ice: move ice_init_interrupt_scheme() prior ice_init_pf()

Move ice_init_interrupt_scheme() prior to ice_init_pf().
To enable the move, ice_set_pf_caps() was moved out of ice_init_pf()
to the caller (ice_init_dev()) and placed prior to the IRQ scheme init.

The move makes the deinit order of ice_deinit_dev() and the failure
path of ice_init_pf() match (at least in terms of not calling
ice_clear_interrupt_scheme() and ice_deinit_pf() in opposite orders).

The new order aligns with findings made by Jakub Buchocki in
the commit 24b454bc354a ("ice: Fix ice module unload").

Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
7 weeks agoice: move service task start out of ice_init_pf()
Przemek Kitszel [Fri, 12 Sep 2025 13:06:20 +0000 (15:06 +0200)] 
ice: move service task start out of ice_init_pf()

Move the service task start out of ice_init_pf(), and do the analogous
with deinit. The service task is needed up to the very end of driver
removal; a later commit of the series will move it later on the
execution timeline.

Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
7 weeks agoice: enforce RTNL assumption of queue NAPI manipulation
Przemek Kitszel [Fri, 12 Sep 2025 13:06:19 +0000 (15:06 +0200)] 
ice: enforce RTNL assumption of queue NAPI manipulation

Instead of making assumptions in comments, move them into code.
Also be more precise: RTNL must be locked only when there is a
NAPI, and we have VSIs without NAPI that call
ice_vsi_clear_napi_queues() during rmmod.

Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>