git.ipfire.org Git - thirdparty/linux.git/log

net/sched: fq_codel: local packets no longer count against memory limit

Commit 95b58430abe7 ("fq_codel: add memory limitation per queue")
claimed that the 32 MB default was "reasonable even for heavy duty usages."

In practice, this is not the case.

Packets that are already charged to a local socket's sk_wmem_alloc
do not really need additional memory control.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://patch.msgid.link/20260512094859.3673997-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: make is_skb_wmem() available to modules

The following patch will use is_skb_wmem() from fq_codel.

Provide __sock_wfree() only if CONFIG_INET=y.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://patch.msgid.link/20260512094859.3673997-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-mlx5e-improve-rss-indirection-table-sizing-and-resizing'

Tariq Toukan says:

====================
net/mlx5e: improve RSS indirection table sizing and resizing

This series by Yael improves mlx5e RSS indirection table handling around
channel count changes and large RSS configurations.

The series:
* removes the XOR8-specific channel count limitation,
* advertises the maximum supported RSS indirection table size,
* fixes resizing of non-default RSS contexts,
* allows resizing configured default RSS contexts during channel
changes,
* and increases the default RSS spread factor from 2x to 4x to improve
traffic distribution for large channel counts.

Together, these changes make RSS table sizing more flexible and robust,
while improving load balancing behavior on large systems.
====================

Link: https://patch.msgid.link/20260511172719.330490-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: increase RSS indirection table spread factor

Increase the RQT uniform spread factor from 2 to 4 so that each channel
gets more indirection table entries and traffic is spread more evenly.
For num_channels > 64 imbalance drops from up to ~50% to up to ~25%.
For 64 or fewer channels the 256 entry minimum already provides at least
4x coverage and the table size is unchanged by this commit.

This satisfies the minimum 4x coverage requirement validated by the
generic RSS selftest commit 9e3d4dae9832 ("selftests: drv-net: rss:
validate min RSS table size").

The 4x spread factor is best-effort and the table size is always capped by
the device's log_max_rqt_size capability.
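
The sizing rule above can be sketched as a small user-space model. This is only an illustration, not the driver code. The function names, the log_max_rqt_size default of 12, and the definition of imbalance as (most entries per channel - fewest) / fewest are assumptions. The 256-entry minimum, the power-of-two rounding and the spread factor are taken from the commit messages in this series.

```python
def roundup_pow_of_two(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def rqt_size(num_channels: int, spread_factor: int,
             min_size: int = 256, log_max_rqt_size: int = 12) -> int:
    """Best-effort table size: factor * channels, rounded up to a power
    of two, never below min_size, capped by the (assumed) device limit."""
    wanted = roundup_pow_of_two(num_channels * spread_factor)
    return min(max(wanted, min_size), 1 << log_max_rqt_size)


def imbalance(num_channels: int, spread_factor: int) -> float:
    """Relative gap between the busiest and the idlest channel when
    table entry i points at channel i % num_channels."""
    size = rqt_size(num_channels, spread_factor)
    fewest = size // num_channels
    most = -(-size // num_channels)  # ceiling division
    return (most - fewest) / fewest
```

Under these assumptions, 120 channels get a 256-entry table with factor 2 (2 or 3 entries per channel) and a 512-entry table with factor 4 (4 or 5 entries per channel), which is where the ~50% and ~25% bounds come from.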

Signed-off-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260511172719.330490-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: resize configured default RSS context table on channel change

mlx5e_ethtool_set_channels() rejected channel count changes that
required a different RQT size when the default context indirection
table was user-configured. This restriction was introduced by
commit ee3572409f74 ("net/mlx5e: RSS, Block changing channels number
when RXFH is configured").

Lift the restriction. Validate the resize upfront with
ethtool_rxfh_indir_can_resize(), then fold or unfold the table
in-place via ethtool_rxfh_indir_resize() inside state_lock, before
mlx5e_safe_switch_params(), so the preactivate callback sees the
correct table content when it programs the HW.

Signed-off-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260511172719.330490-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: resize non-default RSS indirection tables on channel change

When the channel count changes and the RQT size changes with it, a
problem arises for non-default RSS contexts. The driver-side indirection
table grows actual_table_size without filling the new entries; stale
entries from a prior larger configuration may be re-exposed, causing
mlx5e_calc_indir_rqns() to WARN on an out-of-range index.

Replace mlx5e_rss_params_indir_modify_actual_size() with
mlx5e_rss_ctx_resize(), which fills new entries by replicating
the existing pattern, matching what ethtool_rxfh_ctxs_resize() does
for the same case. Also restrict the loop to non-default contexts.
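
A minimal sketch of the fold/unfold idea, assuming power-of-two table sizes. The helper names below are hypothetical and only mirror the behaviour described in this message (new entries filled by replicating the existing pattern). The shrink rule, where a table may only be folded if it is already a repetition of its first new_size entries, is an assumption about how the resize check works. Channel-range validation is not modelled.

```python
def can_resize(table: list[int], new_size: int) -> bool:
    """Growing always works; shrinking only if no information is lost,
    i.e. the table is its first new_size entries repeated."""
    return all(table[i] == table[i % new_size] for i in range(len(table)))


def resize(table: list[int], new_size: int) -> list[int]:
    """Fold (shrink) or unfold (grow) an indirection table."""
    if not can_resize(table, new_size):
        raise ValueError("table is not periodic at the new size")
    # Unfold: replicate the existing pattern. Fold: keep the first period.
    return [table[i % len(table)] for i in range(new_size)]
```

Filling the grown part this way means no entry left over from an earlier, larger configuration can reappear in the table.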

Call ethtool_rxfh_ctxs_can_resize() before acquiring state_lock to
validate that all non-default contexts can be resized, and
ethtool_rxfh_ctxs_resize() after releasing it to fold or unfold their
indirection tables. Both functions acquire rss_lock internally and
cannot be called under state_lock. RTNL, held by all set_channels
callers, serialises context creation and deletion making the pre-lock
check safe.

Guard both ethtool calls on mlx5e_rx_res_rss_cnt() > 1: skip the
validation and resize when no non-default contexts exist. This
naturally covers representors and IPoIB, which share
mlx5e_ethtool_set_channels() but cannot have non-default RSS contexts.

Signed-off-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260511172719.330490-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: advertise max RSS indirection table size to ethtool

Set rxfh_indir_space to the maximum indirection table size the driver
can support: the next power of two above MLX5E_MAX_NUM_CHANNELS times
MLX5E_UNIFORM_SPREAD_RQT_FACTOR.

Without this, ethtool_rxfh_ctxs_can_resize() returns -EINVAL, blocking
non-default RSS contexts from tracking indirection table size changes
when the channel count changes.

Signed-off-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260511172719.330490-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: remove channel count limit for XOR8 RSS hash

mlx5e_ethtool_set_channels() and mlx5e_rxfh_hfunc_check() rejected
channel counts that would produce an indirection table larger than 256
entries when the XOR8 hash function was active. This check was
introduced in commit 49e6c9387051 ("net/mlx5e: RSS, Block XOR hash
with over 128 channels").

XOR8 yields an 8-bit hash, so in practice only up to 256 entries in the
indirection table can be reached due to limited entropy. However, this
does not provide a strong justification for prohibiting larger
indirection tables. Remove the limitation.
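
The entropy argument can be illustrated with a short model. It assumes the table index is the hash value modulo the table size. That mapping and the function name are assumptions for illustration, not taken from the driver.

```python
def reachable_entries(table_size: int, hash_bits: int = 8) -> int:
    """Count distinct table slots an n-bit hash can select, assuming
    index = hash % table_size."""
    return len({h % table_size for h in range(1 << hash_bits)})
```

With an 8-bit hash, a 1024-entry table still has only 256 reachable slots. The remaining slots are unused but harmless, which is why a larger table need not be rejected.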

Signed-off-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260511172719.330490-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'netpoll-move-out-netconsole-specific-functions'

Breno Leitao says:

====================
netpoll: move out netconsole-specific functions

netpoll and netconsole were created together and their code has
been intermixed in net/core/netpoll.c for decades. The result is
that netpoll exposes two send-side interfaces:

* a generic "give me an sk_buff" path used by every stacked-device
driver (bonding, team, vlan, bridge, macvlan, dsa),

* a second path that takes raw bytes and builds a UDP/IP/Ethernet
packet -- exclusively for netconsole.

The packet builder, an skb pool allocator, and several
netconsole-specific helpers all live next to the generic plumbing even
though no other consumer ever touches them.

Worse, every netpoll user pays for that overlap: struct netpoll carries
an skb_pool and a refill work_struct that only netconsole's find_skb()
ever reads from, and net-core has to review unrelated changes (TTL, hop
limit, IP ID generation, source MAC selection, pool sizing) just because
they happen to be coded inside netpoll.

For every netpoll user other than netconsole, those fields are wasted memory.

This series splits the netconsole-specific code out:

* netpoll_send_udp() and its private helpers (push_ipv6, push_ipv4,
push_eth, push_udp, netpoll_udp_checksum, find_skb) move into
drivers/net/netconsole.c, leaving netpoll with a single skb-only
send interface that is the same for every user.

The moves are one function per patch for reviewability; helpers are
temporarily EXPORT_SYMBOL_GPL'd while netpoll_send_udp() is still in
netpoll calling them, then those exports are dropped together once
netpoll_send_udp() itself moves.

The only new permanent export is zap_completion_queue() (renamed to
netpoll_zap_completion_queue() by this series), needed because
find_skb() still drains the per-CPU TX completion queue before
allocating.

struct netpoll is unchanged in this series; making the pool itself
netconsole-private (and reclaiming the skb_pool / refill_wq fields for
the rest of netpoll's users) is the natural follow-up, once this patchset
lands.
====================

Link: https://patch.msgid.link/20260512-netconsole_split-v2-0-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netconsole: move find_skb() from netpoll

find_skb() is the netconsole-specific entry into the netpoll skb
pool: every other netpoll consumer (bonding, team, vlan, bridge,
macvlan, dsa) builds its own sk_buff and never touches the pool.
With netpoll_send_udp() (its only caller) now living in netconsole,
find_skb() can join it.

Move find_skb() into drivers/net/netconsole.c as a file-static
helper, drop EXPORT_SYMBOL_GPL(find_skb) and remove its prototype
from include/linux/netpoll.h. find_skb() drains TX completions via
netpoll_zap_completion_queue(), which is already exported in the
NETDEV_INTERNAL namespace, so netconsole picks up
MODULE_IMPORT_NS("NETDEV_INTERNAL") to consume it.

The skb pool's lifecycle (np->skb_pool, np->refill_wq, refill_skbs(),
refill_skbs_work_handler(), skb_pool_flush()) stays in netpoll: it
is initialised in __netpoll_setup() and torn down in
__netpoll_cleanup(), both of which remain netpoll's responsibility.
The refill work queued via schedule_work(&np->refill_wq) from the
moved find_skb() runs refill_skbs_work_handler() in netpoll without
any further plumbing.

This is pure code motion: the function body is unchanged and its
sole caller (netpoll_send_udp(), already moved by an earlier patch)
keeps invoking it the same way. Pre-existing concerns about
find_skb() running from NMI/printk context (zap_completion_queue()
re-entry, skb_pool spinlocks, GFP_ATOMIC allocation, fallback skb
sizing vs. MAX_SKB_SIZE, PREEMPT_RT semantics of __kfree_skb()) are
inherited as-is and are not addressed here; they predate this
series and are out of scope. Fixing them is left for follow-up
work.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-9-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netpoll: rename and export netpoll_zap_completion_queue()

zap_completion_queue() drains the per-CPU softnet completion queue.
Rename it with the netpoll_ prefix shared by the rest of the
subsystem's public API, and promote it from file-static to
EXPORT_SYMBOL_NS_GPL in the NETDEV_INTERNAL namespace so the upcoming
netconsole-side find_skb() can call it once the function moves out.
A forward declaration is added to include/linux/netpoll.h, and the
old file-static forward declaration is dropped.

No functional change.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-8-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netconsole: move netpoll_udp_checksum() from netpoll

netpoll_udp_checksum() computes the UDP checksum for netconsole's
packets. Move it into drivers/net/netconsole.c as a file-static
helper; drop its EXPORT_SYMBOL_GPL and remove the prototype from
include/linux/netpoll.h.

This was the last csum_ipv6_magic() consumer in net/core/netpoll.c,
so drop the now-stale <net/ip6_checksum.h> include there. Pull it
into netconsole.c so the moved code keeps building.

It was also the last udp_hdr() consumer in net/core/netpoll.c. The
file no longer needs anything from <net/udp.h> (the UDP socket-layer
helpers); MAX_SKB_SIZE only needs struct udphdr, which is provided
by the lighter <linux/udp.h>. Swap the include accordingly.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-7-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netconsole: move push_udp() from netpoll

push_udp() builds the UDP header (and triggers the checksum) for
netconsole's UDP packets. Move it into drivers/net/netconsole.c as
a file-static helper; drop its EXPORT_SYMBOL_GPL and remove the
prototype from include/linux/netpoll.h.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-6-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netconsole: move push_eth() from netpoll

push_eth() builds the Ethernet header for netconsole's UDP packets.
Move it into drivers/net/netconsole.c as a file-static helper; drop
its EXPORT_SYMBOL_GPL and remove the prototype from
include/linux/netpoll.h.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-5-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netconsole: move push_ipv4() from netpoll

push_ipv4() builds the IPv4 header for netconsole's UDP packets.
Move it into drivers/net/netconsole.c as a file-static helper; drop
its EXPORT_SYMBOL_GPL and remove the prototype from
include/linux/netpoll.h.

put_unaligned() is no longer used in net/core/netpoll.c, so drop
the now-stale <linux/unaligned.h> include from there. Pull it into
netconsole.c so the moved code keeps building.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-4-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netconsole: move push_ipv6() from netpoll

push_ipv6() builds the IPv6 header for netconsole's UDP packets.
Its only caller, netpoll_send_udp(), now lives in netconsole, so
the helper can move there as a file-static function. Drop its
EXPORT_SYMBOL_GPL and remove the prototype from
include/linux/netpoll.h.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-3-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netconsole: move netpoll_send_udp() from netpoll

Move netpoll_send_udp() from net/core/netpoll.c into
drivers/net/netconsole.c as a static helper, drop EXPORT_SYMBOL(),
and remove the prototype from include/linux/netpoll.h.

netconsole was the only in-tree caller of this entry point. Every
other netpoll consumer (bonding, team, vlan, bridge, macvlan, dsa)
already builds its own sk_buff and hands it to netpoll_send_skb(),
so the netpoll send-side interface is now skb-only.

The helpers it depends on (find_skb(), push_ipv6(), push_ipv4(),
push_udp(), push_eth(), netpoll_udp_checksum()) were exposed in
the previous patches and stay in net/core/netpoll.c for now.
Subsequent patches move each of them into netconsole one at a time
and drop the corresponding EXPORT_SYMBOL_GPL.

Pull <linux/ip.h>, <linux/ipv6.h> and <linux/udp.h> into netconsole.c
so the moved code can name the header structures.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-2-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netpoll: expose UDP packet builder helpers for netconsole

Promote each of the UDP packet builder helpers used by
netpoll_send_udp() from file-static to EXPORT_SYMBOL_GPL and
forward-declare them in include/linux/netpoll.h so netconsole can call
them once netpoll_send_udp() moves out.

These exports are kept until the end of the series, when
all of them move into netconsole.

No functional change.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260512-netconsole_split-v2-1-1191d14ad66d@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: usb: pegasus: replace simple_strtoul with kstrtouint

simple_strtoul() is deprecated as it has no error checking. Replace it
with kstrtouint() which returns an error code on invalid input, and add
appropriate error handling.

Also add a NULL check before parsing flags, since strsep() can set id
to NULL if the input has fewer tokens than expected.

Preserve the original behavior for a trailing colon by checking *id
before parsing flags, so an empty string results in flags = 0 rather
than an error.
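
The behavioural difference can be modelled in user space. The sketch below is not the driver code. The function names, the base-16 default and the use of exceptions to signal rejection are assumptions; what the driver actually does on a parse error is not shown in this message.

```python
HEX_DIGITS = "0123456789abcdef"


def simple_strtoul_model(text: str, base: int = 16) -> int:
    """Lenient parse: consume the longest valid prefix, ignore the rest."""
    digits = HEX_DIGITS[:base]
    value = 0
    for ch in text.lower():
        if ch not in digits:
            break  # trailing garbage is silently dropped
        value = value * base + digits.index(ch)
    return value


def kstrtouint_model(text: str, base: int = 16) -> int:
    """Strict parse: the whole string must be a number that fits in 32 bits."""
    digits = HEX_DIGITS[:base]
    if not text or any(ch not in digits for ch in text.lower()):
        raise ValueError(f"invalid number: {text!r}")
    value = int(text, base)
    if value > 0xFFFFFFFF:
        raise OverflowError(f"out of range: {text!r}")
    return value


def parse_flags(token):
    """token is None when the input ran out of fields (strsep() case)."""
    if token is None:
        raise ValueError("missing flags field")
    if token == "":
        return 0  # trailing colon: keep the old flags = 0 behaviour
    return kstrtouint_model(token)
```

The lenient model accepts "12zz" as 0x12, while the strict model rejects it. The empty-string special case preserves the trailing-colon behaviour.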

Signed-off-by: Sajal Gupta <sajal2005gupta@gmail.com>
Link: https://patch.msgid.link/20260509095518.2640-1-sajal2005gupta@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipvlan: use netif_receive_skb() in ipvlan_process_multicast()

ipvlan_process_multicast() runs from process context, so there is no
risk of stack overflow if we call netif_receive_skb() instead
of netif_rx().

This avoids some overhead adding/removing skbs to/from a per-cpu
backlog and raising/processing NET_RX softirqs.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260512042019.3300975-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'tun-tap-vhost-net-apply-qdisc-backpressure-on-full-ptr_ring-to-reduce-tx-drops'

Simon Schippers says:

====================
tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops

This patch series deals with tun/tap & vhost-net which drop incoming
SKBs whenever their internal ptr_ring buffer is full. Instead, with this
patch series, the associated netdev queue is stopped - but only when a
qdisc is attached. If no qdisc is present the existing behavior is
preserved. The XDP transmit path is not affected. This patch series
touches tun/tap and vhost-net, as they share common logic and must be
updated together. Modifying only one of them would break the other.

By applying proper backpressure, this change allows the connected qdisc to
operate correctly, as reported in [1], and significantly improves
performance in real-world scenarios, as demonstrated in our paper [2]. For
example, we observed a 36% TCP throughput improvement for an OpenVPN
connection between Germany and the USA.

Synthetic pktgen benchmarks indicate a slight regression, and packet
loss is reduced to near zero. Pktgen benchmarks are provided per commit,
with the final commit showing the overall performance.

Link: https://unix.stackexchange.com/questions/762935/traffic-shaping-ineffective-on-tun-device [1]
Link: https://cni.etit.tu-dortmund.de/storages/cni-etit/r/Research/Publications/2025/Gebauer_2025_VTCFall/Gebauer_VTCFall2025_AuthorsVersion.pdf [2]
====================

Link: https://patch.msgid.link/20260510151529.43895-1-simon.schippers@tu-dortmund.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present

This commit prevents tail-drop when a qdisc is present and the ptr_ring
becomes full. Once the ring reaches capacity after a produce attempt,
the netdev queue is stopped instead of dropping subsequent packets.
If no qdisc is present, the previous tail-drop behavior is preserved.

If producing an entry fails anyway due to a race, tun_net_xmit() drops
the packet. Such races are expected because LLTX is enabled and the
transmit path operates without the usual locking.

The __tun_wake_queue() function of the consumer races with the producer
for waking/stopping the netdev queue, which could result in a stalled
queue. Therefore, an smp_mb__after_atomic() is introduced that pairs
with the smp_mb() of the consumer. It follows the principle of store
buffering described in tools/memory-model/Documentation/recipes.txt:

- The producer in tun_net_xmit() first sets __QUEUE_STATE_DRV_XOFF,
  followed by an smp_mb__after_atomic() (= smp_mb()), and then reads the
  ring with __ptr_ring_check_produce().

- The consumer in __tun_wake_queue() first writes zero to the ring in
  __ptr_ring_consume(), followed by an smp_mb(), and then reads the queue
  status with netif_tx_queue_stopped().

=> Following the aforementioned principle, it is impossible for the
   producer to see a full ring (and therefore not wake the queue on the
   re-check) while the consumer simultaneously fails to see a stopped
   queue (and therefore also does not wake it).
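
The store-buffering argument can be checked with a toy model. It treats the two full barriers as giving sequentially consistent ordering and enumerates every interleaving of the producer's two steps with the consumer's two steps. The variable and function names are made up for this sketch. It cannot show what weaker orderings would allow without the barriers.

```python
from itertools import combinations


def outcomes() -> set:
    """Return every (producer_saw_full, consumer_saw_stopped) pair that
    some interleaving of the four steps can produce."""
    results = set()
    for producer_slots in combinations(range(4), 2):
        stopped, ring_full = False, True  # queue running, ring full
        p_step = c_step = 0
        producer_saw_full = consumer_saw_stopped = None
        for slot in range(4):
            if slot in producer_slots:
                if p_step == 0:
                    stopped = True                  # set DRV_XOFF
                else:
                    producer_saw_full = ring_full   # re-check the ring
                p_step += 1
            else:
                if c_step == 0:
                    ring_full = False               # consume one entry
                else:
                    consumer_saw_stopped = stopped  # read queue state
                c_step += 1
        results.add((producer_saw_full, consumer_saw_stopped))
    return results
```

In this model the pair (True, False), where the producer still sees a full ring and the consumer does not see a stopped queue, never appears, so at least one side always gets to wake the queue.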

Benchmarks:
The benchmarks show a slight regression in raw transmission performance
when using two sending threads. With the patch applied, packet loss
occurs only in the two-thread sending case; no packet loss was observed
with a single sending thread.

Test setup:
AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
Average over 50 runs @ 100,000,000 packets. SRSO and spectre v2
mitigations disabled.

Note for tap+vhost-net:
XDP drop program active in VM -> ~2.5x faster; slower for tap due to
more syscalls (high utilization of entry_SYSRETQ_unsafe_stack in perf)

+--------------------------+--------------+----------------+----------+
| 1 thread                 | Stock        | Patched with   | diff     |
| sending                  |              | fq_codel qdisc |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Received    | 1.132 Mpps   | 1.123 Mpps     | -0.8%    |
|            +-------------+--------------+----------------+----------+
|            | Lost/s      | 3.765 Mpps   | 0 pps          |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Received    | 3.857 Mpps   | 3.901 Mpps     | +1.1%    |
|            +-------------+--------------+----------------+----------+
| +vhost-net | Lost/s      | 0.802 Mpps   | 0 pps          |          |
+------------+-------------+--------------+----------------+----------+

+--------------------------+--------------+----------------+----------+
| 2 threads                | Stock        | Patched with   | diff     |
| sending                  |              | fq_codel qdisc |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Received    | 1.115 Mpps   | 1.081 Mpps     | -3.0%    |
|            +-------------+--------------+----------------+----------+
|            | Lost/s      | 8.490 Mpps   | 391 pps        |          |
+------------+-------------+--------------+----------------+----------+
| TAP        | Received    | 3.664 Mpps   | 3.555 Mpps     | -3.0%    |
|            +-------------+--------------+----------------+----------+
| +vhost-net | Lost/s      | 5.330 Mpps   | 938 pps        |          |
+------------+-------------+--------------+----------------+----------+

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20260510151529.43895-5-simon.schippers@tu-dortmund.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ptr_ring: move free-space check into separate helper

This patch moves the check for available free space for a new entry into
a separate function. Existing callers that only check for a non-zero
return value are unaffected; __ptr_ring_produce() now returns -EINVAL
for a zero-size ring and -ENOSPC when full, whereas before both cases
returned -ENOSPC. The new helper allows callers to determine in advance
whether subsequent __ptr_ring_produce() calls will succeed. This
information can, for example, be used to temporarily stop producing until
__ptr_ring_check_produce() indicates that space is available again.
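
A rough user-space model of the split between checking and producing follows. The class and method names are invented for this sketch and locking is left out. As in ptr_ring, an empty slot is represented by None, so None itself cannot be queued.

```python
import errno


class PtrRingModel:
    def __init__(self, size: int):
        self.size = size
        self.queue = [None] * size
        self.producer = 0
        self.consumer = 0

    def check_produce(self) -> int:
        """0 if the next produce would succeed, else a negative errno."""
        if self.size == 0:
            return -errno.EINVAL   # zero-size ring
        if self.queue[self.producer] is not None:
            return -errno.ENOSPC   # next slot still occupied: ring full
        return 0

    def produce(self, item) -> int:
        ret = self.check_produce()
        if ret:
            return ret
        self.queue[self.producer] = item
        self.producer = (self.producer + 1) % self.size
        return 0

    def consume(self):
        if self.size == 0:
            return None
        item = self.queue[self.consumer]
        if item is not None:
            self.queue[self.consumer] = None
            self.consumer = (self.consumer + 1) % self.size
        return item
```

A producer can call check_produce() right after a successful produce() and pause itself when it reports -ENOSPC, instead of discovering the full ring only on the next packet.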

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20260510151529.43895-4-simon.schippers@tu-dortmund.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

vhost-net: wake queue of tun/tap after ptr_ring consume

Add tun_wake_queue() to tun.c and export it for use by vhost-net. The
function validates that the file belongs to a tun/tap device and that
the tfile exists, dereferences the tun_struct under RCU, and delegates
to __tun_wake_queue().

vhost_net_buf_produce() now calls tun_wake_queue() after a successful
batched consume of the ring to allow the netdev subqueue to be woken up.
The point is to allow the queue to be stopped when it gets full, which
is required for traffic shaping - implemented by the following
"avoid ptr_ring tail-drop when a qdisc is present".

Without the corresponding queue stopping, this patch alone causes no
throughput regression for a tap+vhost-net setup sending to a qemu VM:
3.857 Mpps to 3.891 Mpps.

Details: AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU
threads, XDP drop program active in VM, pktgen sender; Avg over
50 runs @ 100,000,000 packets. SRSO and spectre v2 mitigations disabled.

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20260510151529.43895-3-simon.schippers@tu-dortmund.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tun/tap: add ptr_ring consume helper with netdev queue wakeup

Introduce tun_ring_consume() that wraps ptr_ring_consume() and calls
__tun_wake_queue(). The latter wakes the stopped netdev subqueue once
half of the ring capacity has been consumed, tracked via the new
cons_cnt field in tun_file. As a safety net, the queue is also woken on
the last consumed entry if it leaves the ring empty. The point is to
allow the queue to be stopped when it gets full, which is required for
traffic shaping - implemented by the following "avoid ptr_ring tail-drop
when a qdisc is present".
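
The wake-up rule can be sketched as a toy model. Everything here is an assumption made for illustration: the class and method names, counting cons_cnt only while the queue is stopped, and resetting it on wake. The producer side is a simplified stand-in for the queue stopping added by the later patch.

```python
from collections import deque


class TunQueueModel:
    def __init__(self, ring_size: int):
        self.ring_size = ring_size
        self.ring = deque()
        self.stopped = False
        self.cons_cnt = 0

    def xmit(self, pkt) -> bool:
        """Producer: queue a packet, stop the queue once the ring is full."""
        if len(self.ring) >= self.ring_size:
            return False  # lost a race: drop
        self.ring.append(pkt)
        if len(self.ring) == self.ring_size:
            self.stopped = True
        return True

    def consume(self):
        """Consumer: take one packet and wake the queue when due."""
        if not self.ring:
            return None
        pkt = self.ring.popleft()
        if self.stopped:
            self.cons_cnt += 1
            half_consumed = self.cons_cnt >= self.ring_size // 2
            ring_empty = not self.ring  # safety net
            if half_consumed or ring_empty:
                self.stopped = False
                self.cons_cnt = 0
        return pkt
```

Waking at half capacity rather than on every consumed entry avoids toggling the queue state for each packet while the ring stays nearly full.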

Some implementation details:
- tun_ring_recv() replaces ptr_ring_consume() with tun_ring_consume()
  to properly wake the queue.
- __tun_detach() locks the tx_ring.consumer_lock to avoid races with
  the consumer on the queue_index.
- The ptr_ring_consume() call in tun_queue_purge() is not replaced with
  tun_ring_consume(). Instead, within the same tx_ring.consumer_lock
  in __tun_detach(), the netdev queue is woken for the ntfile taking
  it over, to avoid a possible stall. This does not matter for
  tun_detach_all(), as it is called during device teardown and no tfile
  takes over any queue.
- Reset cons_cnt in tun_attach() so the half-ring wake threshold is
  valid for the new ring size after ptr_ring_resize().
- tun_queue_resize() wakes all queues after resizing with the proper
  tx_ring.consumer_lock and resets the cons_cnt to avoid a possible
  stalled queue.
- The aforementioned upcoming patch explains the pairing of the smp_mb()
  of __tun_wake_queue().

Without the corresponding queue stopping, this patch alone causes no
regression for a tap setup sending to a qemu VM: 1.132 Mpps
to 1.134 Mpps.

Details: AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU
threads, pktgen sender; Avg over 50 runs @ 100,000,000 packets;
SRSO and spectre v2 mitigations disabled.

Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20260510151529.43895-2-simon.schippers@tu-dortmund.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'dpll-rework-fractional-frequency-offset-reporting'

Ivan Vecera says:

====================
dpll: rework fractional frequency offset reporting

Rework how the fractional frequency offset (FFO) is reported in
the DPLL subsystem.

Both fractional-frequency-offset (PPM) and
fractional-frequency-offset-ppt (PPT) attributes are now present at
the top level of a pin and inside each pin-parent-device nest. They
carry the same measurement at different precisions.

Introduce enum dpll_ffo_type and struct dpll_ffo_param to distinguish
FFO contexts: DPLL_FFO_PORT_RXTX_RATE for the RX vs TX symbol rate
offset at the top level, and DPLL_FFO_PIN_DEVICE for the pin vs
parent DPLL offset in the nest. Drivers declare which types they
support via the supported_ffo bitmask in dpll_pin_ops; the core only
calls ffo_get for opted-in types.

Patch 1 adds the type-safe FFO API, updates the YAML spec, netlink
handling, and documentation, and converts mlx5 and zl3073x drivers.

Patch 2 implements the nested FFO for zl3073x using the
dpll_df_offset_x register with ref_ofst=1, providing 2^-48
resolution. The old per-reference frequency measurement is removed
as it was redundant with measured-frequency.
====================

Link: https://patch.msgid.link/20260511155816.99936-1-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpll: zl3073x: report FFO as DPLL vs input reference offset

Replace the per-reference frequency offset measurement (which was
redundant with measured-frequency) with a direct read of the DPLL's
delta frequency offset vs its tracked input reference.

The new implementation uses the dpll_df_offset_x register with
ref_ofst=1 via the dpll_df_read_x semaphore mechanism. This
provides 2^-48 resolution (~3.55e-15) and reports the actual
frequency difference between the DPLL and its active input.
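
As a worked example of the resolution figure, the sketch below converts a raw offset to parts per trillion. It assumes the raw value is a signed integer in units of 2^-48 and that rounding truncates toward zero. The function name and both assumptions are illustrative and not taken from the driver.

```python
PPT = 10**12  # parts per trillion


def df_offset_to_ppt(raw: int) -> int:
    """Convert a signed offset in units of 2^-48 to PPT, truncating
    toward zero."""
    sign = -1 if raw < 0 else 1
    return sign * ((abs(raw) * PPT) >> 48)
```

One raw step is 2^-48, roughly 3.55e-15 or 0.0036 PPT, so about 282 raw steps are needed before the PPT value changes by one.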

Switch supported_ffo from DPLL_FFO_PORT_RXTX_RATE to
DPLL_FFO_PIN_DEVICE so FFO is reported only in the per-parent
context for the active input pin.

Use atomic64_t for freq_offset to prevent torn reads on 32-bit
architectures between the periodic worker and netlink callbacks.

Rewrite ffo_check to compare the cached df_offset converted to PPT
instead of using the old per-reference measurement. Remove the
ref_ffo_update periodic measurement and the ref ffo field since
they are no longer needed.

Changes v3 -> v4:
- Switch to DPLL_FFO_PIN_DEVICE, remove dpll=NULL guard
- Use atomic64_t for freq_offset (torn read on 32-bit)

Reviewed-by: Petr Oros <poros@redhat.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Link: https://patch.msgid.link/20260511155816.99936-3-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpll: add fractional frequency offset to pin-parent-device

Add both fractional-frequency-offset (PPM) and
fractional-frequency-offset-ppt (PPT) attributes to the
pin-parent-device nested attribute set, alongside the existing
top-level pin attributes. Both carry the same measurement at
different precisions.

Introduce enum dpll_ffo_type and struct dpll_ffo_param to
distinguish FFO contexts: DPLL_FFO_PORT_RXTX_RATE for the RX vs
TX symbol rate offset reported at the top level, and
DPLL_FFO_PIN_DEVICE for the pin vs parent DPLL offset reported
in the pin-parent-device nest.

Add a supported_ffo bitmask to struct dpll_pin_ops so drivers
declare which FFO types they support. The core only calls ffo_get
for types the driver has opted into, eliminating the need for
per-driver NULL pointer guards. Validate at pin registration time
that supported_ffo is not set without an ffo_get callback.
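
The opt-in scheme can be modelled in a few lines. The names below are hypothetical Python stand-ins for the kernel structures. Encoding supported_ffo as one bit per enum value and having registration fail, rather than only warn, are assumptions.

```python
from enum import IntEnum


class FfoType(IntEnum):
    PORT_RXTX_RATE = 0
    PIN_DEVICE = 1


class PinOps:
    def __init__(self, supported_ffo: int = 0, ffo_get=None):
        self.supported_ffo = supported_ffo  # bitmask of 1 << FfoType
        self.ffo_get = ffo_get              # callback taking an FfoType


def pin_register(ops: PinOps) -> bool:
    """Reject ops that advertise FFO types without a callback."""
    if ops.supported_ffo and ops.ffo_get is None:
        return False
    return True


def core_ffo_get(ops: PinOps, ffo_type: FfoType):
    """Call the driver only for types it has opted into."""
    if not ops.supported_ffo & (1 << ffo_type):
        return None  # attribute is simply not reported
    return ops.ffo_get(ffo_type)
```

Because the core filters on the bitmask first, a driver callback never sees a type it did not declare and needs no defensive checks of its own.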

Update mlx5 (DPLL_FFO_PORT_RXTX_RATE) and zl3073x
(DPLL_FFO_PORT_RXTX_RATE) drivers to use the new API.

Add documentation for both FFO types to dpll.rst.

Changes v3 -> v4:
- Replace dpll=NULL overloading with enum dpll_ffo_type and
  struct dpll_ffo_param (Jakub Kicinski)
- Add supported_ffo opt-in bitmask in dpll_pin_ops for fail-close
  driver validation (Jakub Kicinski)
- Add WARN_ON in dpll_pin_register for supported_ffo without
  ffo_get callback

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Link: https://patch.msgid.link/20260511155816.99936-2-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: realtek: rtl8365mb: add support for RTL8367SB

Add chip info entry for the Realtek RTL8367SB switch. This device has
chip ID 0x6367 and version 0x0010. It exposes two external interfaces:
port 6 supports MII, TMII, RMII, RGMII, SGMII and HSGMII, while port 7
supports MII, TMII, RMII and RGMII. Use the existing 8365MB-VC jam table
for initialization.

Reviewed-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Signed-off-by: Mieczyslaw Nalewaj <namiltd@yahoo.com>
Link: https://patch.msgid.link/3c6d822b-0e85-4173-86ba-2badb140bbf1@yahoo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-use-ip_outnoroutes-drop-reason'

Eric Dumazet says:

====================
net: use IP_OUTNOROUTES drop reason

First patch changes sk_skb_reason_drop() sock to be const.

Second and last patches add SKB_DROP_REASON_IP_OUTNOROUTES
to both tcp_v6_send_response() and inet6_csk_xmit().
====================

Link: https://patch.msgid.link/20260511072310.1094859-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: use SKB_DROP_REASON_IP_OUTNOROUTES in inet6_csk_xmit()

Replace a bare kfree_skb() with a modern sk_skb_reason_drop() call,
and provide IP_OUTNOROUTES drop reason.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260511072310.1094859-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: use SKB_DROP_REASON_IP_OUTNOROUTES in tcp_v6_send_response()

Replace a bare kfree_skb() with a modern sk_skb_reason_drop() call,
and provide IP_OUTNOROUTES drop reason.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260511072310.1094859-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: constify sk_skb_reason_drop() sock parameter

sk_skb_reason_drop() does not change sock parameter, make it
const so that we can call it from TCP stack without a cast
on a (const) listener socket.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260511072310.1094859-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

rtnetlink: add RTEXT_FILTER_NAME_ONLY support

iproute2 can spend considerable amount of time in ll_init_map()
or ll_link_get() to dump verbose netdev attributes, contributing
to RTNL pressure.

Add RTEXT_FILTER_NAME_ONLY new flag so that rtnl_fill_ifinfo()
limits its output to:

- struct nlmsghdr
- IFLA_IFNAME
- IFLA_PROP_LIST (alternate names)

We can later avoid using RTNL when RTEXT_FILTER_NAME_ONLY
is requested, as none of these attributes need RTNL.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260511070244.971028-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'rework-pci_device_id-initialisation'

Uwe Kleine-König says:

====================
Rework pci_device_id initialisation
====================

Link: https://patch.msgid.link/20260511090023.1634387-4-u.kleine-koenig@baylibre.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: Consistently define pci_device_ids using named initializers

... and PCI device helpers.

The various struct pci_device_id arrays were initialized mostly by one of
the PCI_DEVICE macros and then list expressions. The latter isn't easily
readable if you're not into PCI. Using named initializers is more
explicit and thus easier to parse.

Also use PCI_DEVICE* helper macros to assign .vendor, .device,
.subvendor and .subdevice where appropriate and skip explicit
assignments of 0 (which the compiler takes care of).

The secret plan is to make struct pci_device_id::driver_data an
anonymous union (similar to
https://lore.kernel.org/all/cover.1776579304.git.u.kleine-koenig@baylibre.com/)
and that requires named initializers. But it's also a nice cleanup on
its own.

This change doesn't introduce changes to the compiled pci_device_id
arrays. Tested on x86 and arm64.

Reviewed-by: Jijie Shao <shaojijie@huawei.com>
Acked-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Petr Machata <petrm@nvidia.com> # for mlxsw
Acked-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Forwarded: id:76da4f44d48bdde84580963862bf9616bee5c9e9.1778149923.git.u.kleine-koenig@baylibre.com (v2)
Reviewed-by: Michael Grzeschik <mgr@kernel.org>
Link: https://patch.msgid.link/20260511090023.1634387-6-u.kleine-koenig@baylibre.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: nfp: Drop PCI class entries with .class_mask = 0

With .class_mask being zero the value of .class doesn't matter because
to check if a pci_device_id entry matches a given device the expression

(id->class ^ dev->class) & id->class_mask

is checked for being zero (see pci_match_one_device()). So drop the
useless and irritating assignment for .class to match what (I think) all
other drivers are doing that don't need to match on .class, i.e. set
both members to zero.
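
As a quick illustration of why .class is irrelevant when .class_mask is
zero, the match expression quoted above can be evaluated in Python. The
function name and the sample class values are chosen for illustration
only.

```python
def class_matches(id_class, id_class_mask, dev_class):
    # Same expression as quoted above: a zero result means "match".
    return ((id_class ^ dev_class) & id_class_mask) == 0


# With a zero mask every device class matches, whatever .class holds.
print(class_matches(0x020000, 0x000000, 0x0c0330))
# With a non-zero mask the masked bits of .class must agree.
print(class_matches(0x020000, 0xffff00, 0x0c0330))
```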

Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Link: https://patch.msgid.link/20260511090023.1634387-5-u.kleine-koenig@baylibre.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: intel-xway: add PHY-level statistics via ethtool

Report PCS receive error counts for all supported PEF 7061, 7071, 7072 and
xRX200 PHYs.

Accumulate the vendor-specific PHY_ERRCNT read-clear counter
(SEL=RXERR) in .update_stats() and expose it as both IEEE 802.3
SymbolErrorDuringCarrier and generic rx_errors via
.get_phy_stats().

Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Link: https://patch.msgid.link/20260509205933.3965832-1-olek2@wp.pl
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: intel-xway: fix typo in Kconfig description

Replace "22E" with "22F" in the description.

Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Link: https://patch.msgid.link/20260509210900.3968447-1-olek2@wp.pl
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: drv-net: cope with slow env in so_txtime.py test

This test was converted from shell script to drv-net test.

The new version is flaky in dbg builds on the netdev.bots dashboard.
The previous shell script had more protections to avoid these, added
in commit a7ee79b9c455 ("selftests: net: cope with slow env in
so_txtime.sh test").

Add the same overall protection:

- Suppress so_txtime process failure if KSFT_MACHINE_SLOW

Also relax two timeouts to reduce the number of process failures
themselves

- Increase SO_RCVTIMEO to 2 seconds
- Increase process start-up stabilization to 2 seconds

Delays were experimentally arrived at while running with vng
built with kernel/configs/debug.config

Fixes: 5c6baef3885c ("selftests: drv-net: convert so_txtime to drv-net")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/netdev/20260510174219.74aeee6d@kernel.org/
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260511222138.2045551-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethtool: fix missing closing paren in rings_reply_size()

sizeof(u32) on the _RINGS_CQE_SIZE line is missing its closing
parenthesis, causing nla_total_size() to absorb the subsequent
_TX_PUSH and _RX_PUSH entries.

The resulting size estimate happens to be numerically identical
due to NLA alignment, so not treating this as a real fix.
But the nesting is wrong and misleading.
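
The arithmetic behind "numerically identical" can be checked with a small
Python sketch. It assumes the usual netlink constants (a 4-byte attribute
header and 4-byte alignment) and assumes that _TX_PUSH and _RX_PUSH are
u8 attributes; neither detail is stated in this message.

```python
NLA_ALIGNTO = 4
NLA_HDRLEN = 4  # aligned size of the netlink attribute header


def nla_align(n):
    return (n + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)


def nla_total_size(payload):
    # Header plus payload, padded up to the attribute alignment.
    return nla_align(NLA_HDRLEN + payload)


U32, U8 = 4, 1  # sizeof(u32), sizeof(u8); the u8 sizes are an assumption

# Intended nesting: three independent attributes.
intended = nla_total_size(U32) + nla_total_size(U8) + nla_total_size(U8)

# Mis-nested form: the first call absorbs the two following terms.
misnested = nla_total_size(U32 + nla_total_size(U8) + nla_total_size(U8))

print(intended, misnested)
```

Under these assumptions both expressions evaluate to the same size, which
is why the wrong nesting had no visible effect.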

Signed-off-by: Tao Cui <cuitao@kylinos.cn>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260508125412.189804-1-cuitao@kylinos.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-sched-prepare-lockless-qdisc-dumps'

Eric Dumazet says:

====================
net/sched: prepare lockless qdisc dumps

Goal is to no longer acquire RTNL in qdisc dumps.

This series annotates data-races, and changes mq and mq_prio to
no longer acquire children qdisc spinlocks.
====================

Link: https://patch.msgid.link/20260510091455.4039245-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: mq_prio: no longer acquire qdisc spinlocks in mqprio_dump_class_stats()

Prepare mqprio_dump_class_stats() for RTNL avoidance.

Use RCU instead of RTNL, and no longer acquire each child's spinlock.

As a bonus we no longer have to release/acquire d->lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://patch.msgid.link/20260510091455.4039245-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: mq_prio: no longer acquire qdisc spinlocks in mqprio_dump()

Prepare mqprio_dump() for RTNL avoidance.

Use RCU instead of RTNL, and no longer acquire each child's spinlock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://patch.msgid.link/20260510091455.4039245-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: mq: no longer acquire qdisc spinlocks in dump operations

Prepare mq_dump_common() for RTNL avoidance.

Use RCU instead of RTNL, and no longer acquire each child's spinlock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://patch.msgid.link/20260510091455.4039245-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: add const qualifiers to gnet_stats helpers

In preparation of lockless qdisc dumps, add const qualifiers to:

- gnet_stats_add_basic()
- gnet_stats_copy_basic()
- gnet_stats_copy_basic_hw()
- gnet_stats_copy_queue()
- gnet_stats_read_basic()
- ___gnet_stats_copy_basic()
- qdisc_qstats_copy()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://patch.msgid.link/20260510091455.4039245-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: add qdisc_qlen_lockless() helper

Used in contexts where the qdisc spinlock is not held.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://patch.msgid.link/20260510091455.4039245-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: annotate data-races around sch->qstats.backlog

Add qstats_backlog_sub() and qstats_backlog_add() helpers
and use them instead of open-coding them.

These helpers use WRITE_ONCE() to prevent store-tearing.

Also use WRITE_ONCE() in fq_reset() and qdisc_reset()
when sch->qstats.backlog is cleared.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://patch.msgid.link/20260510091455.4039245-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: add qdisc_qlen_inc() and qdisc_qlen_dec()

Helpers to increment or decrement sch->q.qlen, with appropriate
WRITE_ONCE() to prevent store tearing.

Add other WRITE_ONCE() when sch->q.qlen is changed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://patch.msgid.link/20260510091455.4039245-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: add READ_ONCE() in gnet_stats_add_queue[_cpu]

Stats are read locklessly, add READ_ONCE() to prevent load-tearing.

Write side will be handled in separate patches.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://patch.msgid.link/20260510091455.4039245-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-phy-motorcomm-add-acpi-_dsd-property-support'

chunzhi.lin says:

====================
net: phy: motorcomm: add ACPI _DSD property support

This series makes the Motorcomm PHY driver parse firmware properties via
device_property_*() so the same property set can be provided by either
Devicetree or ACPI _DSD.

Patch 1 switches drivers/net/phy/motorcomm.c from of_property_*() to
device_property_*() on &phydev->mdio.dev.

Patch 2 documents Motorcomm yt8xxx PHY ACPI _DSD properties under
Documentation/firmware-guide/acpi/dsd and links the new document from
the ACPI index.
====================

Link: https://patch.msgid.link/20260507040221.3679454-1-linchunzhi0@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

docs: acpi: dsd: add Motorcomm yt8xxx PHY properties

Document Motorcomm yt8xxx PHY ACPI _DSD properties consumed by
the motorcomm PHY driver.

Describe property placement, UUID usage, and reference the DT binding
for value constraints and defaults.

Signed-off-by: chunzhi.lin <linchunzhi0@gmail.com>
Changes in v2:
- Keep dsd/ entries sorted in Documentation/firmware-guide/acpi/index.rst

Link: https://patch.msgid.link/20260507040221.3679454-3-linchunzhi0@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: motorcomm: use device properties for firmware tuning

The Motorcomm PHY driver reads optional firmware properties via
of_property_read_*() from phydev->mdio.dev.of_node. This works for
Device Tree based systems, but causes ACPI platforms to ignore the same
properties when they are supplied through _DSD.

As a result, ACPI-described Motorcomm PHY devices fall back to default
settings instead of applying firmware-provided tuning such as
rx/tx internal delay, drive strength, clock output frequency, and
optional boolean controls like auto-sleep-disabled,
keep-pll-enabled, and tx clock inversion.

Switch these lookups to device_property_read_*() so the driver uses the
generic firmware node interface and can consume the same property names
from either Device Tree or ACPI.

This keeps the existing DT behavior unchanged while allowing ACPI
platforms to honor PHY configuration from firmware.

We have completed testing on Sophgo RISC-V architecture server SD3-10.
This server has a 64-core Thead C920 CPU whose DWMAC is connected to
Motorcomm's PHY YT8531. This server supports UEFI boot and it would like
to use the ACPI table.

Signed-off-by: chunzhi.lin <linchunzhi0@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260507040221.3679454-2-linchunzhi0@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mptcp-pm-in-kernel-increase-limits'

Matthieu Baerts says:

====================
mptcp: pm: in-kernel: increase limits

Allow switching from 8 to 64 for the maximum number of subflows and
accepted ADD_ADDR, and from 8 to 255 for the number of MPTCP endpoints.

The previous limit of 8 subflows makes sense in most cases. Using more
subflows will very likely *not* improve the situation, and could even
decrease the performances. But there are no technical limitations nor
performance impact to raise this limit, so let's do it: this will allow
people with very specific use-cases, and researchers to easily create
more subflows, and measure the performance impact by themselves.

- Patches 1-2: increase subflows and accepted ADD_ADDR limits.

- Patches 3-4: increase endpoints limit.

- Patches 5-7: validate the new limits: 64 subflows, 255 endpoints.

- Patch 8: selftests: use send()/recv() instead of sendto()/recvfrom().
====================

Link: https://patch.msgid.link/20260508-net-next-mptcp-pm-inc-limits-v1-0-c84e3fdf9b6a@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: pm: use simpler send/recv forms

Use send() and recv() instead of sendto() and recvfrom(), which take the
NL address that was already provided before.

Just simpler and easier to read without the to/from variants.

While at it, fix a checkpatch warning by removing multiple assignments.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260508-net-next-mptcp-pm-inc-limits-v1-8-c84e3fdf9b6a@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: pm: validate new limits

These limits have been recently updated, from 8 to:

- 64 for the subflows and accepted add_addr

- 255 for the MPTCP endpoints

These modifications validate the new limits, but are also compatible
with the previous ones, to be able to continue to validate stable kernel
using the last version of the selftests. That's why new variables are
now used instead of hard-coded values.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260508-net-next-mptcp-pm-inc-limits-v1-7-c84e3fdf9b6a@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: join: validate 8x8 subflows

The limits have been recently increased, so it is required to validate
that having 64 subflows is allowed.

Here, both the client and the server have 8 network interfaces. The
server has 8 endpoints marked as 'signal' to announce all its v4
addresses. The client also has 8 endpoints, but marked as 'subflow' and
'fullmesh' in order to create 8 subflows to each address announced by
the server. This means 63 additional subflows will be created after the
initial one.
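
The subflow count for this topology can be written out as a short Python
sketch. The variable names are illustrative only; the numbers come from
the description above.

```python
server_addresses = 8   # announced via the server's 'signal' endpoints
client_endpoints = 8   # 'subflow' + 'fullmesh' endpoints on the client

# Fullmesh: every client endpoint connects to every server address.
pairs = [(c, s) for c in range(client_endpoints)
         for s in range(server_addresses)]

total_subflows = len(pairs)
additional_subflows = total_subflows - 1  # one pair is the initial subflow

print(total_subflows, additional_subflows)
```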

If it is not possible to increase the limits to 64, it means an older
kernel version is being used, and the test is skipped.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260508-net-next-mptcp-pm-inc-limits-v1-6-c84e3fdf9b6a@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: join: allow changing ifaces nr per test

By default, 4 network interfaces are created per subtest in a dedicated
net namespace. Each netns has a dedicated pair of v4 and v6 addresses.
Future tests will need more.

Simply always creating more network interfaces per test will increase
the execution time for all other tests, for no other benefits. So now it
is possible to change this number only when needed, by setting ifaces_nr
when calling 'reset' and 'init_shapers', e.g.

ifaces_nr=8 reset "Subtest title"
ifaces_nr=8 init_shapers

Note that it might also be interesting to decrease the default value to
2 to reduce the setup time, especially when a debug kernel config is
being used.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260508-net-next-mptcp-pm-inc-limits-v1-5-c84e3fdf9b6a@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: in-kernel: increase endpoints limit

The endpoints are managed in a list which was limited to 8 entries.

This limit can be too small in some cases: by having the same limit as
the number of subflows, it might not allow creating all expected
subflows when having a mix of v4 and v6 addresses that can all use MPTCP
on v4/v6 only networks.

While increasing the limit above the new subflows one, why not use the
technical limit: 255? Indeed, the endpoints will each have an ID that
will be used on the wire, limited to u8, and the ID 0 is reserved to the
initial subflow.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260508-net-next-mptcp-pm-inc-limits-v1-4-c84e3fdf9b6a@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: kernel: allow flushing more than 8 endpoints

The mptcp_rm_list structure contains an array of IDs of 8 entries: to be
able to send a RM_ADDR with 8 IDs. This limitation was OK so far because
there could be a maximum of 8 endpoints.

But this is going to change in the next commit. To cope with that, if
one of the arrays is full, the iteration stops, the lists are processed,
then the iteration continues where it previously stopped.
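
The stop, process, resume pattern described above can be sketched in
Python. This is a simplified model with a single list; the function name
is hypothetical and only the array size of 8 is taken from the text.

```python
RM_LIST_MAX = 8  # array size mentioned in the text


def flush_in_batches(endpoint_ids, batch_max=RM_LIST_MAX):
    """Split the IDs to remove into lists that each fit in one array."""
    batches = []
    current = []
    for ep_id in endpoint_ids:
        current.append(ep_id)
        if len(current) == batch_max:
            # Array is full: hand it over for processing, then continue
            # the iteration where it stopped.
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


print([len(b) for b in flush_in_batches(range(1, 21))])
```

Each returned list would correspond to one round of processing, so
flushing 20 endpoints takes three rounds in this model.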

Note that if there are many endpoints to remove, and multiple RM_ADDR to
send, it might be more likely that some of these RM_ADDRs are dropped or
lost. This is a known limitation: RM_ADDR are not retransmitted in
MPTCPv1.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260508-net-next-mptcp-pm-inc-limits-v1-3-c84e3fdf9b6a@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: in-kernel: increase all limits to 64

This means switching the maximum from 8 to 64 for the number of subflows
and accepted ADD_ADDR.

The previous limit of 8 subflows makes sense in most cases. Using more
subflows will very likely *not* improve the situation, and could even
decrease the performances. But there are no technical limitations nor
performance impact to raise this limit, so let's do it: this will allow
people with very specific use-cases, and researchers to easily create
more subflows, and measure the performance impact by themselves.

The theoretical limit is 255 -- the ID is written in a u8 on the wire --
but 64 is more than enough. With so many subflows, it will be costly to
iterate over all of them when operations are done in bottom half.

Note that the in-kernel PM will continue to create subflows in reply to
ADD_ADDR with a single batch of maximum 8 subflows. Same when adding new
"subflow" endpoints with the fullmesh flag. Increasing those batch
limits would have a memory impact, and it looks fine not to cover these
cases with larger batches for the moment. If more is needed later, the
position of the last subflow from the list could be remembered, and the
list iteration could continue later.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/434
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260508-net-next-mptcp-pm-inc-limits-v1-2-c84e3fdf9b6a@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: in-kernel: explicitly limit batches to array size

The in-kernel PM can create subflows in reply to ADD_ADDR by batch of
maximum 8 subflows for the moment. Same when adding new "subflow"
endpoints with the fullmesh flag. This limit is linked to the arrays
used during these steps.

There was no explicit limit to the arrays size (8), because the limit of
extra subflows is the same (8). It seems safer to use an explicit limit,
but also these two sizes are going to be different in the next commit.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260508-net-next-mptcp-pm-inc-limits-v1-1-c84e3fdf9b6a@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mention the convention for .ndo_setup_tc()

qdisc_offload_dump_helper(), originated from commit 602f3baf2218
("net_sch: red: Add offload ability to RED qdisc"), is designed so that

    Whether RED is being offloaded is being determined every time dump
    action is being called because parent change of this qdisc could
    change its offload state but doesn't require any RED function to be
    called.

and returning -EOPNOTSUPP (for dump queries) does not mean "I don't have
any statistics", but "I don't offload this qdisc anymore". At least two
existing drivers did it wrong, so it is worth mentioning.
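
A minimal Python model of the convention may help. It is not the kernel
helper: Qdisc and offload_dump() are hypothetical names, and treating
every return value other than -EOPNOTSUPP as "still offloaded" is an
assumption of this sketch.

```python
import errno


class Qdisc:
    def __init__(self):
        self.offloaded = True


def offload_dump(qdisc, driver_dump):
    """Re-evaluate the offload state on every dump query."""
    qdisc.offloaded = False
    err = driver_dump()
    if err == -errno.EOPNOTSUPP:
        # Driver says "I don't offload this qdisc anymore".
        return
    # Assumption: any other result means the qdisc is still offloaded.
    qdisc.offloaded = True
```

A driver that merely has no statistics to report would therefore return
success from the dump query in this model, not -EOPNOTSUPP.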

Signed-off-by: David Yang <mmyangfl@gmail.com>
Link: https://patch.msgid.link/20260507214054.2539790-1-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-mlx5-steering-misc-enhancements'

Tariq Toukan says:

====================
net/mlx5: Steering misc enhancements

This small series by Yevgeny contains a few steering enhancements /
cleanups.
====================

Link: https://patch.msgid.link/20260507173443.320465-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: DR, Remove unused field of struct mlx5dr_matcher_rx_tx

Remove a field that was never used.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Erez Shitrit <erezsh@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260507173443.320465-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: HWS, Handle destroying table that has a miss table

If a table has a miss table that was created by
'mlx5hws_table_set_default_miss' API function, its miss_tbl
keeps the table that points to it in a list.
If such table is deleted, we need to also remove it from the
miss_tbl list, otherwise the node in miss_tbl list will contain
garbage.

Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260507173443.320465-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: HWS, Check if device is down while polling for completion

In case the device is down for any reason (e.g. FLR),
the HW will no longer generate completions - no point
polling and waiting for timeout.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Erez Shitrit <erezsh@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260507173443.320465-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'log-clean-up-and-tap-follow-ups'

Allison Henderson says:

====================
Log clean up and TAP follow ups

This is a follow-up series to the "Log collection, TAP compliance and
cleanups" set.  The sashiko report had made some points that I thought
were worth addressing.  This patch set fixes a few more TAP compliance
prints in the check_gcov* routines.  Also, since the user must now pass
in the log folder to collect logs, log clean up is tightened to only
remove rds* prefixed artifacts instead of the entire folder.  Lastly,
the signal handler alarm should be disarmed after the test completes to
avoid multiple calls to the stop_pcaps routine.
====================

Link: https://patch.msgid.link/20260507233213.556182-1-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: rds: Disarm signal alarm on test completion

A race in stop_pcaps is possible if the test completes and then
times out while waiting for the tcpdump process to exit.  The
signal handler may fire again and needlessly call stop_pcaps a
second time.  Fix this by disabling the alarm after normal
test completion.
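
The disarm step can be sketched with Python's signal module. This is an
illustrative model, not the selftest code: on_timeout() and run_test()
are hypothetical names, and the handler only records that it ran.

```python
import signal

calls = []


def on_timeout(signum, frame):
    # Stand-in for the handler that would stop the packet captures.
    calls.append("stop_pcaps")


def run_test(timeout_s, body):
    signal.signal(signal.SIGALRM, on_timeout)
    signal.alarm(timeout_s)   # arm the timeout
    body()                    # the test itself
    # Normal completion: cancel the pending alarm so the handler cannot
    # fire during the cleanup that follows. alarm(0) returns the number
    # of seconds that were left on the cancelled alarm.
    return signal.alarm(0)
```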

Also if there are no tcpdump processes to wait on, stop_pcaps can
just exit.  This avoids misleading prints when there are no procs
to collect dumps from.

Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260507233213.556182-4-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: rds: Fix TAP-prefixed prints in check_gcov*

This patch adds the # prefix to info and warning prints in
the check_gcov* routines. Since these routines do not exit,
as the other check_* routines do, the output here should be
kept TAP compliant.

Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260507233213.556182-3-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: rds: Fix stale log clean up

Since rds self tests no longer have a default folder, users must
specify a log collection folder if they want to collect logs.
Currently the log folder is deleted and recreated, but this can
be dangerous if the user exports RDS_LOG_DIR=/tmp or /var/log.
This patch corrects the clean up to delete only rds log artifacts
from the log folder, and further prefixes rds specific logs as rds*
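
The difference between removing the folder and removing only rds*
artifacts can be sketched in Python. The helper names and file names are
made up for illustration; only the rds* prefix comes from the text.

```python
import glob
import os
import tempfile


def clean_rds_logs(log_dir):
    """Delete only rds* files; leave the folder and other files alone."""
    removed = []
    for path in glob.glob(os.path.join(log_dir, "rds*")):
        if os.path.isfile(path):
            os.remove(path)
            removed.append(os.path.basename(path))
    return sorted(removed)


def demo():
    # Simulate a shared log folder that also holds an unrelated file.
    with tempfile.TemporaryDirectory() as log_dir:
        for name in ("rds-test.log", "rds.pcap", "unrelated.txt"):
            open(os.path.join(log_dir, name), "w").close()
        removed = clean_rds_logs(log_dir)
        return removed, sorted(os.listdir(log_dir))
```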

Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260507233213.556182-2-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'ipv4-flush-the-fib-once-on-multiple-nexthop-removal'

Cosmin Ratiu says:

====================
ipv4: Flush the FIB once on multiple nexthop removal

This series optimizes multiple nexthop removal performance from having
to do a FIB flush for each nexthop being removed to only doing a single
FIB flush after all nexthops are removed.

This dramatically improves performance in scenarios where there are
many nexthops and many ipv4 routes. Please see individual patches for
more details and for a test scenario.
====================

Link: https://patch.msgid.link/20260507075606.322405-1-cratiu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: Add __must_check to nexthop removal functions

These functions return a signal indicating whether FIB flushing is
required, which must not be ignored. Use the compiler to help with
enforcing this requirement in the future.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260507075606.322405-4-cratiu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: Flush the FIB once on multiple nexthop removal

When a device is going down or when a net namespace is deleted, all
nexthops on it are removed, and for each nexthop being removed the FIB
table is flushed, which does a full trie traversal looking for entries
marked RTNH_F_DEAD and removing them. This is O(N x R), with N being
number of dev nexthops and R being number of IPv4 routes.

The RTNL is held the entire time.

When there are many nexthops to be removed and many routing entries,
this can result in the RTNL being held for multiple minutes, which
causes unhappiness in other processes trying to acquire the RTNL (e.g.
systemd-networkd for DHCP renewals).

In a complicated deployment with multiple vxlan devices, each having
16K nexthops and a total of 128K ipv4 routes, this is exactly what
happens:

nexthop_flush_dev()                # loops over 16K nexthops
  -> remove_nexthop()
    -> __remove_nexthop()
      -> __remove_nexthop_fib()    # marks fi->fib_flags |= RTNH_F_DEAD
        -> fib_flush()             # for EACH nexthop!
          -> fib_table_flush()     # walks the ENTIRE FIB, 128K entries

This patch makes use of the previously added FIB flushing signal to only
do a single FIB flush after all nexthops to be removed are marked as
RTNH_F_DEAD:
- __remove_nexthop_fib() no longer flushes the FIB.
- nexthop_flush_dev() and flush_all_nexthops() now keep track whether
  any nexthop was removed and trigger a FIB flush at the end.
- a new wrapper is defined, remove_one_nexthop() which calls
  remove_nexthop() and flushes if necessary. This is intended for places
  which must remove a single nexthop and shouldn't worry about the need
  to trigger a FIB flush. For now, the only caller is rtm_del_nexthop().
- The two direct callers of __remove_nexthop() get a WARN_ON_ONCE, since
  the nh about to be removed should not have any FIB entries referencing
  it when replacing or inserting a new one.

This dramatically improves performance from O(N x R) to O(N + R).
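
The difference between the two strategies can be seen in a toy Python
model that counts how many FIB entries are walked. It is a sketch only:
routes are a plain dict from prefix to nexthop id, and remove_nexthops()
is a hypothetical name, not a kernel function.

```python
def remove_nexthops(nhids, routes, flush_each):
    """Remove nexthops and flush dead routes; return entries walked."""
    dead = set()
    walked = 0

    def fib_flush():
        nonlocal walked
        for prefix in list(routes):      # full walk of the table
            walked += 1
            if routes[prefix] in dead:   # models the RTNH_F_DEAD check
                del routes[prefix]

    for nhid in nhids:
        dead.add(nhid)                   # mark only
        if flush_each:
            fib_flush()                  # old behaviour: flush per nexthop
    if not flush_each and dead:
        fib_flush()                      # new behaviour: one flush at the end
    return walked


old = remove_nexthops(range(100), {i: i for i in range(100)}, True)
new = remove_nexthops(range(100), {i: i for i in range(100)}, False)
print(old, new)
```

Both variants leave the table empty, but the per-nexthop variant walks
the table once per removed nexthop while the other walks it once.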

Releasing a nexthop reference in remove_nexthop() now no longer frees
it. Instead, it is deleted when the last fib_info pointing to it gets
freed via free_fib_info_rcu(). All routing code is already careful not
to take into consideration routes marked with RTNH_F_DEAD.

Tested with:
DEV=eth2
ip link set up dev $DEV
ip link add testnh0 link $DEV type macvlan mode bridge
ip addr add 198.51.100.1/24 dev testnh0
ip link set testnh0 up

seq 1 65536 | \
sed 's/.*/nexthop add id & via 198.51.100.2 dev testnh0/' | \
ip -batch -

i=1
for a in $(seq 0 255); do
  for b in $(seq 0 255); do
    echo "route add 10.${a}.${b}.0/32 nhid $i"
    i=$((i + 1))
  done
done | ip -batch -

time ip link set testnh0 down
ip link del testnh0

Without this patch:
real 0m32.601s
user 0m0.000s
sys 0m32.511s

With this patch:
real 0m0.209s
user 0m0.000s
sys 0m0.153s

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260507075606.322405-3-cratiu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: Provide a FIB flushing signal from nexthop removal functions

Plumb a bool value throughout the various nexthop removal functions,
determined in the innermost __remove_nexthop_fib() (which still does the
FIB flushing) and propagated up all callers.

The next patch will make use of this signal to optimize the removal of
multiple nexthops by moving the FIB flushing up the call hierarchy.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260507075606.322405-2-cratiu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-convert-four-more-protocols-to-getsockopt_iter'

Breno Leitao says:

====================
net: convert four more protocols to getsockopt_iter

Continue the work to convert protocols to the new getsockopt_iter API.

Convert four additional getsockopt implementations to the new
sockopt_t/getsockopt_iter callback:

  - MCTP
  - LLC
  - X.25
  - KCM

These are mechanical, ABI-preserving conversions following the same
pattern as the previously converted protocols (af_packet, can/raw,
af_netlink, af_vsock): the (char __user *optval, int __user *optlen)
pair is replaced with a single sockopt_t *opt that carries the buffer
length on input and the returned size on output, and exposes an iov_iter
for the copy-out path. put_user()/copy_to_user() pairs are replaced with
a single copy_to_iter() per option, and the wrapper in
do_sock_getsockopt() handles writing optlen back to userspace.
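
The calling convention described above can be modelled in a few lines of
Python. This is only a sketch of the idea, not the kernel API: SockOpt,
get_u32_option() and do_getsockopt() are hypothetical names, and the
undersize -> -EINVAL policy is just one possible per-option behaviour.

```python
import errno
import struct
from dataclasses import dataclass, field


@dataclass
class SockOpt:
    """Stand-in for sockopt_t: length in on entry, returned size on exit."""
    optlen: int
    buf: bytearray = field(default_factory=bytearray)


def get_u32_option(opt, value):
    """Hypothetical per-protocol handler for a 4-byte option."""
    size = struct.calcsize("=I")
    if opt.optlen < size:
        return -errno.EINVAL                # optlen left untouched on error
    opt.optlen = size                       # clamp oversize buffers
    opt.buf[:] = struct.pack("=I", value)   # one copy-out per option
    return 0


def do_getsockopt(handler, user_optlen, value):
    """Stand-in for the wrapper: it alone writes optlen back to the caller."""
    opt = SockOpt(optlen=user_optlen)
    ret = handler(opt, value)
    return ret, opt.optlen, bytes(opt.buf)
```

The handler never touches a user pointer for the length; it only updates
opt.optlen, and the wrapper reports that value back whether or not the
handler succeeded.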

I picked these four because each is small and self-contained.
====================

Link: https://patch.msgid.link/20260507-getsock_two-v2-0-5873111d9c12@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: getsockopt_iter: cleanup

Apply two cleanups suggested by Stanislav and Bobby on the original
selftest series:

- Reorder local variable declarations into reverse christmas-tree
  order (longest line first). Because that ordering puts socklen_t
  optlen before the variable whose size it stores, the
  "optlen = sizeof(...)" initializer is moved out of the declaration
  to a plain assignment in the test body, as Stanislav suggested.

- Add ASSERT_EQ(optlen, ...) on every error path so the value the
  kernel writes back to the userspace optlen is pinned down even
  when the syscall returns -1. With do_sock_getsockopt() now writing
  opt->optlen back to userspace unconditionally, asserting that the
  netlink/vsock error paths leave the original input length untouched
  guards against future regressions.

Bobby Eshleman pointed out that
SO_VM_SOCKETS_CONNECT_TIMEOUT_NEW/OLD return a sock_timeval-shaped
payload (16 bytes on 64-bit), which is wider than the u64 case
already covered. Add four tests that exercise this path:

- connect_timeout_new_exact             exact-size buffer
- connect_timeout_new_oversize_clamped  oversize buffer, clamped
- connect_timeout_new_undersize         undersize -> -EINVAL, optlen
                                        untouched
- connect_timeout_old_exact             exact-size buffer for OLD optname

Suggested-by: Stanislav Fomichev <sdf@fomichev.me>
Suggested-by: Bobby Eshleman <bobbyeshleman@meta.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260507-getsock_two-v2-5-5873111d9c12@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

kcm: convert to getsockopt_iter

Convert KCM socket's getsockopt implementation to use the new
getsockopt_iter callback with sockopt_t.

Key changes:
- Replace (char __user *optval, int __user *optlen) with sockopt_t *opt
- Use opt->optlen for buffer length (input) and returned size (output)
- Use copy_to_iter() instead of put_user()/copy_to_user()
- Add linux/uio.h for copy_to_iter()

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260507-getsock_two-v2-4-5873111d9c12@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

x25: convert to getsockopt_iter

Convert X.25 socket's getsockopt implementation to use the new
getsockopt_iter callback with sockopt_t.

Key changes:
- Replace (char __user *optval, int __user *optlen) with sockopt_t *opt
- Use opt->optlen for buffer length (input) and returned size (output)
- Use copy_to_iter() instead of put_user()/copy_to_user()
- Add linux/uio.h for copy_to_iter()

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260507-getsock_two-v2-3-5873111d9c12@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

llc: convert to getsockopt_iter

Convert LLC socket's getsockopt implementation to use the new
getsockopt_iter callback with sockopt_t.

Key changes:
- Replace (char __user *optval, int __user *optlen) with sockopt_t *opt
- Use opt->optlen for buffer length (input) and returned size (output)
- Use copy_to_iter() instead of put_user()/copy_to_user()
- Add linux/uio.h for copy_to_iter()

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260507-getsock_two-v2-2-5873111d9c12@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mctp: convert to getsockopt_iter

Convert MCTP socket's getsockopt implementation to use the new
getsockopt_iter callback with sockopt_t.

Key changes:
- Replace (char __user *optval, int __user *optlen) with sockopt_t *opt
- Use opt->optlen for buffer length (input)
- Use copy_to_iter() instead of copy_to_user()
- Add linux/uio.h for copy_to_iter()

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260507-getsock_two-v2-1-5873111d9c12@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: eth: fbnic: Fix addr validation in pcs write

The DW IP has two distinct PCS address ranges corresponding
to the C45 PCS registers.

The shim translates the PCS addr/regno into specific CSR writes
into one of those two zero-relative ranges.

This patch fixes an off-by-one error in the range check that could allow
an invalid CSR write if the function were called with addr == 2.

As of yet, the bug has no real impact, as no PCS writes are present.
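
The message does not show the check itself, so the following Python
sketch only illustrates the class of bug under an assumption: with two
zero-relative ranges, the valid indices are 0 and 1, and an inclusive
upper bound lets index 2 through. The constant and function names are
hypothetical.

```python
NUM_PCS_RANGES = 2  # assumption: valid range indices are 0 and 1


def addr_ok_off_by_one(addr):
    """Bound check with the off-by-one: index 2 slips through."""
    return 0 <= addr <= NUM_PCS_RANGES


def addr_ok(addr):
    """Corrected bound check: only indices 0 and 1 are accepted."""
    return 0 <= addr < NUM_PCS_RANGES
```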

Signed-off-by: Mike Marciniszyn (Meta) <mike.marciniszyn@gmail.com>
Link: https://patch.msgid.link/20260507154203.3667-1-mike.marciniszyn@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-dsa-microchip-remove-one-indirection-layer'

Bastien Curutchet says:

====================
net: dsa: microchip: Remove one indirection layer

This series follows the discussions we had on a previous series that
aimed to add PTP support for the KSZ8463 (cf [1]).

The KSZ driver got way too convoluted over time because it uses a common
framework to handle more than 20 switches split in 5 families (see the
table below):

+----------+---------+---------+---------+---------+---------+
| Family   | KSZ8463 | KSZ87xx | KSZ88xx | KSZ9477 | LAN937X |
+----------+---------+---------+---------+---------+---------+
| Switches | KSZ8463 | KSZ8795 | KSZ88X3 | KSZ8563 | LAN9370 |
|          |         | KSZ8794 | KSZ8864 | KSZ9477 | LAN9371 |
|          |         | KSZ8765 | KSZ8895 | KSZ9896 | LAN9372 |
|          |         |         |         | KSZ9897 | LAN9373 |
|          |         |         |         | KSZ9893 | LAN9374 |
|          |         |         |         | KSZ9563 |         |
|          |         |         |         | KSZ8567 |         |
|          |         |         |         | KSZ9567 |         |
|          |         |         |         | LAN9646 |         |
+----------+---------+---------+---------+---------+---------+

A unique struct dsa_switch_ops is used by all the switches. Next to it,
each switch family has its own struct ksz_dev_ops with family-specific
callbacks. So the dsa_switch_ops operations handle the specificities of
each family through these ksz_dev_ops callbacks and/or conditional
branches based on the chip ID.

Vladimir initiated a rework of the driver ([2]) which I carried on. On
top of the rework I added PTP and periodic output support for the
KSZ8463 (which was my first goal). There are more than 60 patches for
all this, so this series will be followed by several others. If you want
to see the full picture, you can check my GitHub ([3]).

This first series aims to split the unique struct dsa_switch_ops into
5 so each switch family will be able to implement its own set of DSA
operations.

I haven't yet finished grouping all the patches into meaningful series,
but here is more or less what I plan to do next:

- A series will remove from the struct ksz_dev_ops the callbacks
  that have an equivalent in dsa_switch_ops to remove one level of
  indirection.
- A series will split again some operations to get rid of the
  if (is_kszXYZ) branches.
- Maybe a fourth one will be needed to completely move out of
  ksz_common.c everything that isn't truly common to all the switches
- A series will add PTP support for the KSZ8463
- A final series will add periodic output support for the KSZ8463

[1]: https://lore.kernel.org/r/20260304-ksz8463-ptp-v6-0-3f4c47954c71@bootlin.com
[2]: https://github.com/vladimiroltean/linux/tree/ksz_separate_dsa_switch_ops
[3]: https://github.com/bastien-curutchet/linux/tree/ksz_rework
====================

Link: https://patch.msgid.link/20260505-clean-ksz-driver-v1-0-05d70fa42461@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: microchip: split ksz_connect_tag_protocol()

All the KSZ switches use the same ksz_connect_tag_protocol(), even though
no switch supports all the KSZ tag protocols. So if, for some reason, a
given switch tries to connect a KSZ tag protocol it doesn't support, the
call won't fail.

Split the common ksz_connect_tag_protocol() into switch-specific
operations. This way, each switch will only accept the tag protocol it
supports.
Remove the no longer used common operation.

Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20260505-clean-ksz-driver-v1-9-05d70fa42461@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: microchip: split ksz_get_tag_protocol()

All the switch families use a common function to implement
.get_tag_protocol(). This function then returns the relevant protocol
depending on the chip ID.

Make the protocol to dsa_switch_ops association a little bit more
obvious by having separate implementations.

Change made by manually checking which chip id has which dsa_switch_ops
assigned to it, then filtering the common ksz_get_tag_protocol() for
just those chip IDs pertaining to it.

As an important benefit, we no longer have that weird-looking
DSA_TAG_PROTO_NONE fallback which was never actually returned.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20260505-clean-ksz-driver-v1-8-05d70fa42461@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: microchip: hook up ksz_switch_alloc() to chip-specific dsa_switch_ops

Now that each switch driver has its own dsa_switch_ops (currently a copy
of ksz_switch_ops), we no longer need ksz_switch_ops and can remove it.

Get to the driver-specific dsa_switch_ops through the ksz_chip_data
structure.
Reorder the alloc()/get_match_data() calls so that this pointer is
available.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20260505-clean-ksz-driver-v1-7-05d70fa42461@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: microchip: ensure each ksz_dev_ops has its own dsa_switch_ops

Currently we have a single dsa_switch_ops for 4 very distinct families
of switches, and many dsa_switch_ops methods simply dispatch through
ksz_dev_ops. That creates an avoidable level of indirection.

As a preparation for removing that indirection layer, create a separate
dsa_switch_ops structure wherever we have a ksz_dev_ops. These
structures are not yet used - ksz_switch_ops from ksz_common.c still is.
However, this reduces the noise from subsequent changes.

All new dsa_switch_ops are exact copies of ksz_switch_ops. But we need
to export function prototypes from ksz_common.c so that they are
callable from individual drivers.

Note that "individual drivers" are not actual separate kernel modules.
All of ksz8.c, ksz9477.c and lan937x_main.c are part of the same
ksz_switch.ko. Only the "register interface" drivers are different
modules (ksz9477_i2c.o for I2C, ksz_spi.o for SPI, ksz8863_smi.o for
MDIO). So we don't need to export any symbol.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20260505-clean-ksz-driver-v1-6-05d70fa42461@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: microchip: move phylink_mac_ops to individual drivers

Similar to ksz_dev_ops, struct phylink_mac_ops shouldn't be part of
the common code. Instead, the common code should provide callable
functionality.

Invert the paradigm and export the common aspects from ksz_common.c, and
move the chip-specific stuff into individual drivers.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20260505-clean-ksz-driver-v1-5-05d70fa42461@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: microchip: move KSZ9477 and LAN937 ksz_dev_ops to individual drivers

The ksz_dev_ops() are specific to each switch family so they should
belong to the individual drivers instead of the common section.

Move the ksz_dev_ops() definitions of the KSZ9477 and the LAN937 to
their individual drivers.
Make static the functions that are no longer exported.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20260505-clean-ksz-driver-v1-4-05d70fa42461@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: microchip: move KSZ8 ksz_dev_ops to ksz8.c

The ksz_dev_ops() are specific to each switch family so they should
belong to the individual drivers instead of the common section.

Move the ksz_dev_ops() definitions of the KSZ8xxx to ksz8.c.
Make static the functions that are no longer exported.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20260505-clean-ksz-driver-v1-3-05d70fa42461@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: microchip: remove unused port_cleanup() callback

ksz_dev_ops :: port_cleanup() isn't used anywhere.

Remove it.

Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20260505-clean-ksz-driver-v1-2-05d70fa42461@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: microchip: Remove unused ksz8_all_queues_split()

ksz8_all_queues_split() isn't used anywhere.

Remove it.

Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20260505-clean-ksz-driver-v1-1-05d70fa42461@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-mlx5-icm-page-management-in-vhca_id-mode'

Tariq Toukan says:

====================
net/mlx5: ICM page management in VHCA_ID mode

This series adds driver support for the VHCA_ID page management mode.
When firmware and driver support this mode, ICM (Interconnect Context
Memory) page management uses the device vhca_id as the function
identifier in MANAGE_PAGES, QUERY_PAGES, and page request events instead
of the legacy function_id + ec_function pair.

Background
Firmware can operate page management in two modes:
- FUNC_ID mode (current): Function identity is (function_id,
  ec_function). This remains the default and is used for boot pages and
  when the new mode capability is not set.
- VHCA_ID mode (new): Function identity is vhca_id only; ec_function is
  ignored. This aligns page management with the vhca_id-based model used
  by other firmware commands and simplifies identification on SmartNIC
  and multi-function setups.
====================

Link: https://patch.msgid.link/20260506133239.276237-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Add VHCA_ID page management mode support

Add support for VHCA_ID-based page management mode. When the device
firmware advertises the icm_mng_function_id_mode capability with
MLX5_ID_MODE_FUNCTION_VHCA_ID, page management operations between the
driver and firmware may use vhca_id instead of function_id as the
effective function identifier, and the ec_function field is ignored.

Update page management commands to conditionally set ec_function field
only in FUNC_ID mode. Boot page allocation always uses FUNC_ID mode
semantics for backward compatibility, as the capability bit is only
available after set_hca_cap(). If after set_hca_cap() VHCA_ID mode was
set, modify the tracking of the boot pages in page_root_xa to use
vhca_id too.
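
A minimal Python sketch of the re-keying step described above, assuming
the tracking structure can be modelled as a dict from a function key to
a list of pages. IdMode, page_key() and rekey_boot_pages() are
hypothetical names used only for illustration; the driver's actual key
encoding in page_root_xa is not shown in this message.

```python
from enum import Enum


class IdMode(Enum):
    FUNC_ID = "func_id"
    VHCA_ID = "vhca_id"


def page_key(mode, function_id, ec_function, vhca_id):
    """Build the identity used to track a function's pages."""
    if mode is IdMode.VHCA_ID:
        return ("vhca", vhca_id)          # ec_function is ignored
    return ("func", function_id, bool(ec_function))


def rekey_boot_pages(tracking, function_id, ec_function, vhca_id):
    """Move pages tracked under the FUNC_ID key to the vhca_id key."""
    old = page_key(IdMode.FUNC_ID, function_id, ec_function, vhca_id)
    new = page_key(IdMode.VHCA_ID, function_id, ec_function, vhca_id)
    if old in tracking:
        tracking.setdefault(new, []).extend(tracking.pop(old))
    return tracking
```

Boot pages are first recorded under the FUNC_ID key because the mode is
not yet known at that point; once the capability is read, a single move
keeps later reclaim lookups consistent with the new key.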

Add mlx5_esw_vhca_id_to_func_type() to resolve the function type in
VHCA_ID mode, enabling per-type debugfs counters. Use a dedicated
vhca_type_map xarray to provide lockless lookup. Store the resolved
type on each fw_page at allocation time so reclaim and release paths
read it directly without any lookup.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Akiva Goldberger <agoldberger@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260506133239.276237-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Make debugfs page counters by function type dynamic

Make the per function type debugfs page counters dynamically added after
mlx5_eswitch_init(). When page management operates in vhca_id mode, only
the function acting as either eSwitch or vport manager can initialize
the eSwitch structure and translate the vhca_id to function type for the
functions to which it supplies pages. The next patch will add support
for page management in vhca_id mode.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Akiva Goldberger <agoldberger@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260506133239.276237-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Relax capability check for eswitch query paths

Several eswitch functions that only query other functions' HCA
capabilities or read cached vport state are guarded by the
vhca_resource_manager capability. This capability is required for
set_hca_cap operations but query_hca_cap of other functions only
requires the vport_group_manager capability.

Relax the capability check from vhca_resource_manager to
vport_group_manager in the following query-only paths:
- mlx5_esw_vport_caps_get() - queries other function general caps
- esw_ipsec_vf_query_generic() - queries other function ipsec cap
- mlx5_devlink_port_fn_migratable_get() - reads cached vport state
- mlx5_devlink_port_fn_roce_get() - reads cached vport state
- mlx5_devlink_port_fn_max_io_eqs_get() - queries other function caps
- mlx5_esw_vport_enable/disable() - vhca_id map/unmap

Functions that also perform set_hca_cap (migratable_set, roce_set,
max_io_eqs_set, esw_ipsec_vf_set_generic, esw_ipsec_vf_set_bytype)
retain the vhca_resource_manager requirement.

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Akiva Goldberger <agoldberger@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260506133239.276237-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ixgbe: E610: do not fill EEE lp_advertised from local PHY caps

ixgbe_get_eee_e610() fills kedata->lp_advertised from pcaps.eee_cap
returned by ixgbe_aci_get_phy_caps() with IXGBE_ACI_REPORT_ACTIVE_CFG.
That report mode (and the other IXGBE_ACI_REPORT_* modes) describe the
local PHY only, not the link partner. The X550 path uses a separate
FW_PHY_ACT_UD_2 activity for partner data; the E610 ACI has no
equivalent.

Leave lp_advertised zeroed via the existing linkmode_zero() and drop
the now-unused ixgbe_eee_cap_map[]. eee_active/eee_enabled are
unaffected (sourced from link.eee_status).

Fixes: b61dbdeff3a9 ("ixgbe: E610: add EEE support")
Signed-off-by: David Carlier <devnexen@gmail.com>
Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260507-jk-iwl-next-fix-eee-ixgbe-v1-1-62bc1d197d1d@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sctp: Fix typo in comment

Fix a typo in a comment in sctp_endpoint_destroy(): "releated" should
be "related".

Signed-off-by: Md Shofiqul Islam <shofiqtest@gmail.com>
Link: https://patch.msgid.link/20260507105758.25728-1-shofiqtest@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-fix-protodown-with-macvlan'

Ido Schimmel says:

====================
net: Fix protodown with macvlan

When protodown is enabled on a macvlan, two bugs cause the macvlan to
incorrectly gain carrier:

1. Toggling the lower device's carrier while protodown is enabled on the
macvlan causes the macvlan to gain carrier, effectively bypassing the
protodown mechanism.

2. Toggling protodown on and then off on the macvlan while the lower
device has no carrier causes the macvlan to gain carrier, since
netif_change_proto_down() unconditionally turns the carrier on.

Patch #1 is a preparation.

Patch #2 solves the first problem by making netif_carrier_on() return
early when protodown is on.

Patch #3 solves the second problem by only calling netif_carrier_on()
when protodown is turned off if there is no linked net device or if the
linked net device has a carrier.

Patch #4 adds a selftest covering both bugs and the basic protodown
functionality.

Targeting net-next since these are not regressions (i.e., never
worked).

Note that while these changes are in the core, they should only affect
macvlan as protodown is only supported by macvlan and vxlan and only
the former has a linked net device.
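
The intended behaviour after patches #2 and #3 can be sketched with a
small Python state model. This is not the kernel implementation: NetDev,
carrier_on(), change_proto_down() and propagate_lower_carrier() are
simplified stand-ins, and the propagation helper assumes the upper
device simply mirrors its lower device's carrier.

```python
class NetDev:
    def __init__(self, name, lower=None):
        self.name = name
        self.lower = lower        # linked (lower) device, if any
        self.carrier = False
        self.proto_down = False


def carrier_on(dev):
    if dev.proto_down:            # patch #2: no carrier while protodown
        return
    dev.carrier = True


def carrier_off(dev):
    dev.carrier = False


def change_proto_down(dev, on):
    dev.proto_down = on
    if on:
        carrier_off(dev)
    elif dev.lower is None or dev.lower.carrier:
        carrier_on(dev)           # patch #3: only if the lower has carrier


def propagate_lower_carrier(upper):
    """Simplified macvlan-style reaction to a lower carrier change."""
    if upper.lower.carrier:
        carrier_on(upper)
    else:
        carrier_off(upper)
```

For example, toggling the lower device's carrier off and on while
protodown is set leaves the upper device without carrier, and clearing
protodown while the lower device has no carrier does the same.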
====================

Link: https://patch.msgid.link/20260507105906.891817-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: Add protodown tests

Add a selftest for the protodown mechanism.

Five test cases are included:

1. Basic protodown toggling: Verify that setting protodown on macvlan
   results in DOWN operational state and clearing it restores UP.

2. Same as the previous test case, but with vxlan.

3. Protodown reasons: Verify that protodown cannot be cleared while
   there are active protodown reasons, but can be cleared once all
   reasons are removed.

4. Protodown with lower device being toggled: Verify that toggling the
   lower device's carrier while protodown is on does not cause the
   macvlan to gain carrier.

5. Protodown with lower device down: Verify that toggling protodown
   while the lower device has no carrier does not cause the macvlan to
   gain carrier.

Note that the last two test cases fail without "net: Do not turn on
carrier when protodown is on" and "net: Do not unconditionally turn on
carrier when turning off protodown":

# ./protodown.sh
TEST: Basic protodown on/off with macvlan                           [ OK ]
TEST: Basic protodown on/off with vxlan                             [ OK ]
TEST: Protodown reasons                                             [ OK ]
TEST: Protodown with lower device toggled                           [FAIL]
         Macvlan operational state is not DOWN despite protodown
TEST: Protodown with lower device down                              [FAIL]
         Macvlan is not LOWERLAYERDOWN after clearing protodown

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260507105906.891817-5-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>