]> git.ipfire.org Git - thirdparty/kernel/linux.git/log
thirdparty/kernel/linux.git
5 weeks agoxsk: use a smaller new lock for shared pool case
Jason Xing [Thu, 30 Oct 2025 00:06:46 +0000 (08:06 +0800)] 
xsk: use a smaller new lock for shared pool case

- Split cq_lock into two smaller locks: cq_prod_lock and
  cq_cached_prod_lock
- Avoid disabling/enabling interrupts in the hot xmit path

In either xsk_cq_cancel_locked() or xsk_cq_reserve_locked() function,
the race condition is only between multiple xsks sharing the same
pool. They are all in the process context rather than interrupt context,
so now the small lock named cq_cached_prod_lock can be used without
handling interrupts.

While cq_cached_prod_lock ensures the exclusive modification of
@cached_prod, cq_prod_lock in xsk_cq_submit_addr_locked() only cares
about @producer and corresponding @desc. Both of them don't necessarily
be consistent with @cached_prod protected by cq_cached_prod_lock.
That's the reason why the previous big lock can be split into two
smaller ones. Please note that SPSC rule is all about the global state
of producer and consumer that can affect both layers instead of local
or cached ones.

Frequently disabling and enabling interrupt are very time consuming
in some cases, especially in a per-descriptor granularity, which now
can be avoided after this optimization, even when the pool is shared by
multiple xsks.

With this patch, the performance number[1] could go from 1,872,565 pps
to 1,961,009 pps. It's a minor rise of around 5%.

[1]: taskset -c 1 ./xdpsock -i enp2s0f1 -q 0 -t -S -s 64

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20251030000646.18859-3-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agoxsk: do not enable/disable irq when grabbing/releasing xsk_tx_list_lock
Jason Xing [Thu, 30 Oct 2025 00:06:45 +0000 (08:06 +0800)] 
xsk: do not enable/disable irq when grabbing/releasing xsk_tx_list_lock

The commit ac98d8aab61b ("xsk: wire upp Tx zero-copy functions")
originally introducing this lock put the deletion process in the
sk_destruct which can run in irq context obviously, so the
xxx_irqsave()/xxx_irqrestore() pair was used. But later another
commit 541d7fdd7694 ("xsk: proper AF_XDP socket teardown ordering")
moved the deletion into xsk_release() that only happens in process
context. It means that since this commit, it doesn't necessarily
need that pair.

Now, there are two places that use this xsk_tx_list_lock and only
run in the process context. So avoid manipulating the irq then.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20251030000646.18859-2-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agonet: add net cookie for net device trace events
Tonghao Zhang [Tue, 28 Oct 2025 04:32:44 +0000 (12:32 +0800)] 
net: add net cookie for net device trace events

In a multi-network card or container environment, this is needed in order
to differentiate between trace events relating to net devices that exist
in different network namespaces and share the same name.

for xmit_timeout trace events:
[002] ..s1.  1838.311662: net_dev_xmit_timeout: dev=eth0 driver=virtio_net queue=10 net_cookie=3
[007] ..s1.  1839.335650: net_dev_xmit_timeout: dev=eth0 driver=virtio_net queue=10 net_cookie=4100
[007] ..s1.  1844.455659: net_dev_xmit_timeout: dev=eth0 driver=virtio_net queue=10 net_cookie=3
[002] ..s1.  1850.087647: net_dev_xmit_timeout: dev=eth0 driver=virtio_net queue=10 net_cookie=3

Cc: Eran Ben Elisha <eranbe@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Suggested-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20251028043244.82288-1-tonghao@bamaicloud.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
5 weeks agoMerge branch 'ethtool-introduce-phy-mse-diagnostics-uapi-and-drivers'
Jakub Kicinski [Tue, 4 Nov 2025 02:32:33 +0000 (18:32 -0800)] 
Merge branch 'ethtool-introduce-phy-mse-diagnostics-uapi-and-drivers'

Oleksij Rempel says:

====================
ethtool: introduce PHY MSE diagnostics UAPI and drivers

This series introduces a generic kernel-userspace API for retrieving PHY
Mean Square Error (MSE) diagnostics, together with netlink integration,
a fast-path reporting hook in LINKSTATE_GET, and initial driver
implementations for the KSZ9477 and DP83TD510E PHYs.

MSE is defined by the OPEN Alliance "Advanced diagnostic features for
100BASE-T1 automotive Ethernet PHYs" specification [1] as a measure of
slicer error rate, typically used internally to derive the Signal
Quality Indicator (SQI). While SQI is useful as a normalized quality
index, it hides raw measurement data, varies in scaling and thresholds
between vendors, and may not indicate certain failure modes - for
example, cases where autonegotiation would fail even though SQI reports
a good link. In practice, such scenarios can only be investigated in
fixed-link mode; here, MSE can provide an empirically estimated value
indicating conditions under which autonegotiation would not succeed.

Example output with current implementation:
root@DistroKit:~ ethtool lan1
Settings for lan1:
...
        Speed: 1000Mb/s
        Duplex: Full
...
        Link detected: yes
        SQI: 5/7
        MSE: 3/127 (channel: worst)

root@DistroKit:~ ethtool --show-mse lan1
MSE diagnostics for lan1:
MSE Configuration:
        Max Average MSE: 127
        Refresh Rate: 2000000 ps
        Symbols per Sample: 250
        Supported capabilities: average channel-a channel-b channel-c
                                channel-d worst

MSE Snapshot (Channel: a):
        Average MSE: 4

MSE Snapshot (Channel: b):
        Average MSE: 3

MSE Snapshot (Channel: c):
        Average MSE: 2

MSE Snapshot (Channel: d):
        Average MSE: 3

[1] https://opensig.org/wp-content/uploads/2024/01/Advanced_PHY_features_for_automotive_Ethernet_V1.0.pdf
====================

Link: https://patch.msgid.link/20251027122801.982364-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: phy: dp83td510: add MSE interface support for 10BASE-T1L
Oleksij Rempel [Mon, 27 Oct 2025 12:28:01 +0000 (13:28 +0100)] 
net: phy: dp83td510: add MSE interface support for 10BASE-T1L

Implement get_mse_capability() and get_mse_snapshot() for the DP83TD510E
to expose its Mean Square Error (MSE) register via the new PHY MSE
UAPI.

The DP83TD510E does not document any peak MSE values; it only exposes
a single average MSE register used internally to derive SQI. This
implementation therefore advertises only PHY_MSE_CAP_AVG, along with
LINK and channel-A selectors. Scaling is fixed to 0xFFFF, and the
refresh interval/number of symbols are estimated from 10BASE-T1L
symbol rate (7.5 MBd) and typical diagnostic intervals (~1 ms).

For 10BASE-T1L deployments, SQI is a reliable indicator of link
modulation quality once the link is established, but it does not
indicate whether autonegotiation pulses will be correctly received
in marginal conditions. MSE provides a direct measurement of slicer
error rate that can be used to evaluate if autonegotiation is likely
to succeed under a given cable length and condition. In practice,
testing such scenarios often requires forcing a fixed-link setup to
isolate MSE behaviour from the autonegotiation process.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20251027122801.982364-5-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: phy: micrel: add MSE interface support for KSZ9477 family
Oleksij Rempel [Mon, 27 Oct 2025 12:28:00 +0000 (13:28 +0100)] 
net: phy: micrel: add MSE interface support for KSZ9477 family

Implement the get_mse_capability() and get_mse_snapshot() PHY driver ops
for KSZ9477-series integrated PHYs to demonstrate the new PHY MSE
UAPI.

These PHYs do not expose a documented direct MSE register, but the
Signal Quality Indicator (SQI) registers are derived from the
internal MSE computation. This hook maps SQI readings into the MSE
interface so that tooling can retrieve the raw value together with
metadata for correct interpretation in userspace.

Behaviour:
  - For 1000BASE-T, report per-channel (A–D) values and support a
    WORST channel selector.
  - For 100BASE-TX, only LINK-wide measurements are available.
  - Report average MSE only, with a max scale based on
    KSZ9477_MMD_SQI_MASK and a fixed refresh rate of 2 µs.

This mapping differs from the OPEN Alliance SQI definition, which
assigns thresholds such as pre-fail indices; the MSE interface
instead provides the raw measurement, leaving interpretation to
userspace.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20251027122801.982364-4-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agoethtool: netlink: add ETHTOOL_MSG_MSE_GET and wire up PHY MSE access
Oleksij Rempel [Mon, 27 Oct 2025 12:27:59 +0000 (13:27 +0100)] 
ethtool: netlink: add ETHTOOL_MSG_MSE_GET and wire up PHY MSE access

Introduce the userspace entry point for PHY MSE diagnostics via
ethtool netlink. This exposes the core API added previously and
returns both capability information and one or more snapshots.

Userspace sends ETHTOOL_MSG_MSE_GET. The reply carries:
- ETHTOOL_A_MSE_CAPABILITIES: scale limits and timing information
- ETHTOOL_A_MSE_CHANNEL_* nests: one or more snapshots (per-channel
  if available, otherwise WORST, otherwise LINK)

Link down returns -ENETDOWN.

Changes:
  - YAML: add attribute sets (mse, mse-capabilities, mse-snapshot)
    and the mse-get operation
  - UAPI (generated): add ETHTOOL_A_MSE_* enums and message IDs,
    ETHTOOL_MSG_MSE_GET/REPLY
  - ethtool core: add net/ethtool/mse.c implementing the request,
    register genl op, and hook into ethnl dispatch
  - docs: document MSE_GET in ethtool-netlink.rst

The include/uapi/linux/ethtool_netlink_generated.h is generated
from Documentation/netlink/specs/ethtool.yaml.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20251027122801.982364-3-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: phy: introduce internal API for PHY MSE diagnostics
Oleksij Rempel [Mon, 27 Oct 2025 12:27:58 +0000 (13:27 +0100)] 
net: phy: introduce internal API for PHY MSE diagnostics

Add the base infrastructure for Mean Square Error (MSE) diagnostics,
as proposed by the OPEN Alliance "Advanced diagnostic features for
100BASE-T1 automotive Ethernet PHYs" [1] specification.

The OPEN Alliance spec defines only average MSE and average peak MSE
over a fixed number of symbols. However, other PHYs, such as the
KSZ9131, additionally expose a worst-peak MSE value latched since the
last channel capture. This API accounts for such vendor extensions by
adding a distinct capability bit and snapshot field.

Channel-to-pair mapping is normally straightforward, but in some cases
(e.g. 100BASE-TX with MDI-X resolution unknown) the mapping is ambiguous.
If hardware does not expose MDI-X status, the exact pair cannot be
determined. To avoid returning misleading per-channel data in this case,
a LINK selector is defined for aggregate MSE measurements.

All investigated devices differ in MSE capabilities, such
as sample rate, number of analyzed symbols, and scaling factors.
For example, the KSZ9131 uses different scaling for MSE and pMSE.
To make this visible to callers, scale limits and timing information
are returned via get_mse_capability().

Some PHYs sample very few symbols at high frequency (e.g. 2 us update
rate). To cover such cases and allow for future high-speed PHYs with
even shorter intervals, the refresh rate is reported as u64 in
picoseconds.

This patch introduces the internal PHY API for Mean Square Error
diagnostics. It defines new kernel-side data types and driver hooks:

  - struct phy_mse_capability: describes supported metrics, scale
    limits, update interval, and sampling length.
  - struct phy_mse_snapshot: holds one correlated measurement set.
  - New phy_driver ops: `get_mse_capability()` and `get_mse_snapshot()`.

These definitions form the core kernel API. No user-visible interfaces
are added in this commit.

Standardization notes:
OPEN Alliance defines presence and interpretation of some metrics but does
not fix numeric scales or sampling internals:

- SQI (3-bit, 0..7) is mandatory; correlation to SNR/BER is informative
  (OA 100BASE-T1 TC1 v1.0 6.1.2; OA 1000BASE-T1 TC12 v2.2 6.1.2).
- MSE is optional; OA recommends 2^16 symbols and scaling to 0..511,
  with a worst-case latch since last read (OA 100BASE-T1 TC1 v1.0 6.1.1; OA
  1000BASE-T1 TC12 v2.2 6.1.1). Refresh is recommended (~0.8-2.0 ms for
  100BASE-T1; ~80-200 us for 1000BASE-T1). Exact scaling/time windows
  are vendor-specific.
- Peak MSE (pMSE) is defined only for 100BASE-T1 as optional, e.g.
  128-symbol sliding window with 8-bit range and worst-case latch (OA
  100BASE-T1 TC1 v1.0 6.1.3).

Therefore this API exposes which measures and selectors a PHY supports,
and documents where behavior is standard-referenced vs vendor-specific.

[1] <https://opensig.org/wp-content/uploads/2024/01/
     Advanced_PHY_features_for_automotive_Ethernet_V1.0.pdf>

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20251027122801.982364-2-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agoMerge branch 'add-support-to-do-threaded-napi-busy-poll'
Jakub Kicinski [Tue, 4 Nov 2025 02:11:43 +0000 (18:11 -0800)] 
Merge branch 'add-support-to-do-threaded-napi-busy-poll'

Samiullah Khawaja says:

====================
Add support to do threaded napi busy poll

Extend the already existing support of threaded napi poll to do continuous
busy polling.

This is used for doing continuous polling of napi to fetch descriptors
from backing RX/TX queues for low latency applications. Allow enabling
of threaded busypoll using netlink so this can be enabled on a set of
dedicated napis for low latency applications.

Once enabled user can fetch the PID of the kthread doing NAPI polling
and set affinity, priority and scheduler for it depending on the
low-latency requirements.

Extend the netlink interface to allow enabling/disabling threaded
busypolling at individual napi level.

We use this for our AF_XDP based hard low-latency usecase with usecs
level latency requirement. For our usecase we want low jitter and stable
latency at P99.

Following is an analysis and comparison of available (and compatible)
busy poll interfaces for a low latency usecase with stable P99. This can
be suitable for applications that want very low latency at the expense
of cpu usage and efficiency.

Already existing APIs (SO_BUSYPOLL and epoll) allow busy polling a NAPI
backing a socket, but the missing piece is a mechanism to busy poll a
NAPI instance in a dedicated thread while ignoring available events or
packets, regardless of the userspace API. Most existing mechanisms are
designed to work in a pattern where you poll until new packets or events
are received, after which userspace is expected to handle them.

As a result, one has to hack together a solution using a mechanism
intended to receive packets or events, not to simply NAPI poll. NAPI
threaded busy polling, on the other hand, provides this capability
natively, independent of any userspace API. This makes it really easy to
setup and manage.

For analysis we use an AF_XDP based benchmarking tool `xsk_rr`. The
description of the tool and how it tries to simulate the real workload
is following,

- It sends UDP packets between 2 machines.
- The client machine sends packets at a fixed frequency. To maintain the
  frequency of the packet being sent, we use open-loop sampling. That is
  the packets are sent in a separate thread.
- The server replies to the packet inline by reading the pkt from the
  recv ring and replies using the tx ring.
- To simulate the application processing time, we use a configurable
  delay in usecs on the client side after a reply is received from the
  server.

The xsk_rr tool is posted separately as an RFC for tools/testing/selftest.

We use this tool with following napi polling configurations,

- Interrupts only
- SO_BUSYPOLL (inline in the same thread where the client receives the
  packet).
- SO_BUSYPOLL (separate thread and separate core)
- Threaded NAPI busypoll

System is configured using following script in all 4 cases,

```
echo 0 | sudo tee /sys/class/net/eth0/threaded
echo 0 | sudo tee /proc/sys/kernel/timer_migration
echo off | sudo tee  /sys/devices/system/cpu/smt/control

sudo ethtool -L eth0 rx 1 tx 1
sudo ethtool -G eth0 rx 1024

echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus

 # pin IRQs on CPU 2
IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
print arr[0]}' < /proc/interrupts)"
for irq in "${IRQS}"; \
do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done

echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us

for i in /sys/devices/virtual/workqueue/*/cpumask; \
do echo $i; echo 1,2,3,4,5,6 > $i; done

if [[ -z "$1" ]]; then
  echo 400 | sudo tee /proc/sys/net/core/busy_read
  echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
  echo 15000   | sudo tee /sys/class/net/eth0/gro_flush_timeout
fi

sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0

if [[ "$1" == "enable_threaded" ]]; then
  echo 0 | sudo tee /proc/sys/net/core/busy_poll
  echo 0 | sudo tee /proc/sys/net/core/busy_read
  echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
  echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
  NAPI_ID=$(ynl --family netdev --output-json --do queue-get \
    --json '{"ifindex": '${IFINDEX}', "id": '0', "type": "rx"}' | jq '."napi-id"')

  ynl --family netdev --json '{"id": "'${NAPI_ID}'", "threaded": "busy-poll"}'

  NAPI_T=$(ynl --family netdev --output-json --do napi-get \
    --json '{"id": "'$NAPI_ID'"}' | jq '."pid"')

  sudo chrt -f  -p 50 $NAPI_T

  # pin threaded poll thread to CPU 2
  sudo taskset -pc 2 $NAPI_T
fi

if [[ "$1" == "enable_interrupt" ]]; then
  echo 0 | sudo tee /proc/sys/net/core/busy_read
  echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
  echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
fi
```

To enable various configurations, script can be run as following,

- Interrupt Only
  ```
  <script> enable_interrupt
  ```

- SO_BUSYPOLL (no arguments to script)
  ```
  <script>
  ```

- NAPI threaded busypoll
  ```
  <script> enable_threaded
  ```

Once configured, the workload is run with various configurations using
following commands. Set period (1/frequency) and delay in usecs to
produce results for packet frequency and application processing delay.

 ## Interrupt Only and SO_BUSYPOLL (inline)

- Server
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
-D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v
```

- Client
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
-S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
-P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v
```

 ## SO_BUSYPOLL(done in separate core using recvfrom)

Argument -t spawns a separate thread and continuously calls recvfrom.

- Server
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
-D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
-h -v -t
```

- Client
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
-S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
-P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -t
```

 ## NAPI Threaded Busy Poll

Argument -n skips the recvfrom call as there is no recv kick needed.

- Server
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
-D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
-h -v -n
```

- Client
```
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
-S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
-P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -n
```

| Experiment | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
|---|---|---|---|---|
| 12 Kpkt/s + 0us delay | | | | |
|  | p5: 12700 | p5: 12900 | p5: 13300 | p5: 12800 |
|  | p50: 13100 | p50: 13600 | p50: 14100 | p50: 13000 |
|  | p95: 13200 | p95: 13800 | p95: 14400 | p95: 13000 |
|  | p99: 13200 | p99: 13800 | p99: 14400 | p99: 13000 |
| 32 Kpkt/s + 30us delay | | | | |
|  | p5: 19900 | p5: 16600 | p5: 13100 | p5: 12800 |
|  | p50: 21100 | p50: 17000 | p50: 13700 | p50: 13000 |
|  | p95: 21200 | p95: 17100 | p95: 14000 | p95: 13000 |
|  | p99: 21200 | p99: 17100 | p99: 14000 | p99: 13000 |
| 125 Kpkt/s + 6us delay | | | | |
|  | p5: 14600 | p5: 17100 | p5: 13300 | p5: 12900 |
|  | p50: 15400 | p50: 17400 | p50: 13800 | p50: 13100 |
|  | p95: 15600 | p95: 17600 | p95: 14000 | p95: 13100 |
|  | p99: 15600 | p99: 17600 | p99: 14000 | p99: 13100 |
| 12 Kpkt/s + 78us delay | | | | |
|  | p5: 14100 | p5: 16700 | p5: 13200 | p5: 12600 |
|  | p50: 14300 | p50: 17100 | p50: 13900 | p50: 12800 |
|  | p95: 14300 | p95: 17200 | p95: 14200 | p95: 12800 |
|  | p99: 14300 | p99: 17200 | p99: 14200 | p99: 12800 |
| 25 Kpkt/s + 38us delay | | | | |
|  | p5: 19900 | p5: 16600 | p5: 13000 | p5: 12700 |
|  | p50: 21000 | p50: 17100 | p50: 13800 | p50: 12900 |
|  | p95: 21100 | p95: 17100 | p95: 14100 | p95: 12900 |
|  | p99: 21100 | p99: 17100 | p99: 14100 | p99: 12900 |

 ## Observations

- Here without application processing all the approaches give the same
  latency within 1usecs range and NAPI threaded gives minimum latency.
- With application processing the latency increases by 3-4usecs when
  doing inline polling.
- Using a dedicated core to drive napi polling keeps the latency same
  even with application processing. This is observed both in userspace
  and threaded napi (in kernel).
- Using napi threaded polling in kernel gives lower latency by
  1-2usecs as compared to userspace driven polling in separate core.
- Even on a dedicated core, SO_BUSYPOLL adds around 1-2usecs of latency.
  This is because it doesn't continuously busy poll until events are
  ready. Instead, it returns after polling only once, requiring the
  process to re-invoke the syscall for each poll, which requires a new
  enter/leave kernel cycle and the setup/teardown of the busy poll for
  every single poll attempt.
- With application processing userspace will get the packet from recv
  ring and spend some time doing application processing and then do napi
  polling. While application processing is happening a dedicated core
  doing napi polling can pull the packet of the NAPI RX queue and
  populate the AF_XDP recv ring. This means that when the application
  thread is done with application processing it has new packets ready to
  recv and process in recv ring.
- Napi threaded busy polling in the kernel with a dedicated core gives
  the consistent P5-P99 latency.

Note well that threaded napi busy-polling has not been shown to yield
efficiency or throughput benefits. In contrast, dedicating an entire
core to busy-polling one NAPI (NIC queue) is rather inefficient.
However, in certain specific use cases, this mechanism results in lower
packet processing latency. The experimental testing reported here only
covers those use cases and does not present a comprehensive evaluation
of threaded napi busy-polling.

Following histogram is generated to measure the time spent in recvfrom
while using inline thread with SO_BUSYPOLL. The histogram is generated
using the following bpftrace command. In this experiment there are 32K
packets per second and the application processing delay is 30usecs. This
is to measure whether there is significant time spent pulling packets
from the descriptor queue that it will affect the overall latency if
done inline.

```
bpftrace -e '
        kprobe:xsk_recvmsg {
                @start[tid] = nsecs;
        }
        kretprobe:xsk_recvmsg {
                if (@start[tid]) {
                        $sample = (nsecs - @start[tid]);
                        @xsk_recvfrom_hist = hist($sample);
                        delete(@start[tid]);
                }
        }
        END { clear(@start);}'
```

Here in case of inline busypolling around 35 percent of calls are taking
1-2usecs and around 50 percent are taking 0.5-2usecs.

@xsk_recvfrom_hist:
[128, 256)         24073 |@@@@@@@@@@@@@@@@@@@@@@                              |
[256, 512)         55633 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K)          20974 |@@@@@@@@@@@@@@@@@@@                                 |
[1K, 2K)           34234 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     |
[2K, 4K)            3266 |@@@                                                 |
[4K, 8K)              19 |                                                    |
====================

Link: https://patch.msgid.link/20251028203007.575686-1-skhawaja@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agoselftests: Add napi threaded busy poll test in `busy_poller`
Samiullah Khawaja [Tue, 28 Oct 2025 20:30:06 +0000 (20:30 +0000)] 
selftests: Add napi threaded busy poll test in `busy_poller`

Add testcase to run busy poll test with threaded napi busy poll enabled.

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Martin Karsten <mkarsten@uwaterloo.ca>
Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
Link: https://patch.msgid.link/20251028203007.575686-3-skhawaja@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: Extend NAPI threaded polling to allow kthread based busy polling
Samiullah Khawaja [Tue, 28 Oct 2025 20:30:05 +0000 (20:30 +0000)] 
net: Extend NAPI threaded polling to allow kthread based busy polling

Add a new state NAPI_STATE_THREADED_BUSY_POLL to the NAPI state enum to
enable and disable threaded busy polling.

When threaded busy polling is enabled for a NAPI, enable
NAPI_STATE_THREADED also.

When the threaded NAPI is scheduled, set NAPI_STATE_IN_BUSY_POLL to
signal napi_complete_done not to rearm interrupts.

Whenever NAPI_STATE_THREADED_BUSY_POLL is unset, the
NAPI_STATE_IN_BUSY_POLL will be unset, napi_complete_done unsets the
NAPI_STATE_SCHED_THREADED bit also, which in turn will make the kthread
go to sleep.

Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Martin Karsten <mkarsten@uwaterloo.ca>
Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
Link: https://patch.msgid.link/20251028203007.575686-2-skhawaja@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agoMerge branch 'mpls-remove-rtnl-dependency'
Jakub Kicinski [Tue, 4 Nov 2025 01:40:59 +0000 (17:40 -0800)] 
Merge branch 'mpls-remove-rtnl-dependency'

Kuniyuki Iwashima says:

====================
mpls: Remove RTNL dependency.

MPLS uses RTNL

  1) to guarantee the lifetime of struct mpls_nh.nh_dev
  2) to protect net->mpls.platform_label

, but neither actually requires RTNL.

If struct mpls_nh holds a refcnt for nh_dev, we do not need RTNL,
and it can be replaced with a dedicated mutex.

The series removes RTNL from net/mpls/.

Overview:

  Patch 1 is misc cleanup.

  Patch 2 - 9 are prep to drop RTNL for RTM_{NEW,DEL,GET}ROUTE
  handlers.

  Patch 10 & 11 converts mpls_dump_routes() and RTM_GETNETCONF to RCU.

  Patch 12 replaces RTNL with a new per-netns mutex.

  Patch 13 drops RTNL from RTM_{NEW,DEL,GET}ROUTE.
====================

Link: https://patch.msgid.link/20251029173344.2934622-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agompls: Drop RTNL for RTM_NEWROUTE, RTM_DELROUTE, and RTM_GETROUTE.
Kuniyuki Iwashima [Wed, 29 Oct 2025 17:33:05 +0000 (17:33 +0000)] 
mpls: Drop RTNL for RTM_NEWROUTE, RTM_DELROUTE, and RTM_GETROUTE.

RTM_NEWROUTE looks up dev under RCU (ip_route_output(),
ipv6_stub->ipv6_dst_lookup_flow(), netdev_get_by_index()),
and each neighbour holds the refcnt of its dev.

Also, net->mpls.platform_label is protected by a dedicated
per-netns mutex.

Now, no MPLS code depends on RTNL.

Let's drop RTNL for RTM_NEWROUTE, RTM_DELROUTE, and RTM_GETROUTE.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20251029173344.2934622-14-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agompls: Protect net->mpls.platform_label with a per-netns mutex.
Kuniyuki Iwashima [Wed, 29 Oct 2025 17:33:04 +0000 (17:33 +0000)] 
mpls: Protect net->mpls.platform_label with a per-netns mutex.

MPLS (re)uses RTNL to protect net->mpls.platform_label,
but the lock does not need to be RTNL at all.

Let's protect net->mpls.platform_label with a dedicated
per-netns mutex.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20251029173344.2934622-13-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agompls: Convert RTM_GETNETCONF to RCU.
Kuniyuki Iwashima [Wed, 29 Oct 2025 17:33:03 +0000 (17:33 +0000)] 
mpls: Convert RTM_GETNETCONF to RCU.

mpls_netconf_get_devconf() calls __dev_get_by_index(),
and this only depends on RTNL.

Let's convert mpls_netconf_get_devconf() to RCU and use
dev_get_by_index_rcu().

Note that nlmsg_new() is moved ahead to use GFP_KERNEL.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20251029173344.2934622-12-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agompls: Convert mpls_dump_routes() to RCU.
Kuniyuki Iwashima [Wed, 29 Oct 2025 17:33:02 +0000 (17:33 +0000)] 
mpls: Convert mpls_dump_routes() to RCU.

mpls_dump_routes() sets fib_dump_filter.rtnl_held to true and
calls __dev_get_by_index() in mpls_valid_fib_dump_req().

This is the only RTNL dependant in mpls_dump_routes().

Also, synchronize_rcu() in resize_platform_label_table()
guarantees that net->mpls.platform_label is alive under RCU.

Let's convert mpls_dump_routes() to RCU and use dev_get_by_index_rcu().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20251029173344.2934622-11-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agompls: Use mpls_route_input() where appropriate.
Kuniyuki Iwashima [Wed, 29 Oct 2025 17:33:01 +0000 (17:33 +0000)] 
mpls: Use mpls_route_input() where appropriate.

In many places, we uses rtnl_dereference() twice for
net->mpls.platform_label and net->mpls.platform_label[index].

Let's replace the code with mpls_route_input().

We do not use mpls_route_input() in mpls_dump_routes() since
we will rely on RCU there.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20251029173344.2934622-10-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agompls: Add mpls_route_input().
Kuniyuki Iwashima [Wed, 29 Oct 2025 17:33:00 +0000 (17:33 +0000)] 
mpls: Add mpls_route_input().

mpls_route_input_rcu() is called from mpls_forward() and
mpls_getroute().

The former is under RCU, and the latter is under RTNL, so
mpls_route_input_rcu() uses rcu_dereference_rtnl().

Let's use rcu_dereference() in mpls_route_input_rcu() and
add an RTNL variant for mpls_getroute().

Later, we will remove rtnl_dereference() there.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20251029173344.2934622-9-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agompls: Pass net to mpls_dev_get().
Kuniyuki Iwashima [Wed, 29 Oct 2025 17:32:59 +0000 (17:32 +0000)] 
mpls: Pass net to mpls_dev_get().

We will replace RTNL with a per-netns mutex to protect dev->mpls_ptr.

Then, we will use rcu_dereference_protected() with the lockdep_is_held()
annotation, which requires net to access the per-netns mutex.

However, dev_net(dev) is not safe without RTNL.

Let's pass net to mpls_dev_get().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20251029173344.2934622-8-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agompls: Add mpls_dev_rcu().
Kuniyuki Iwashima [Wed, 29 Oct 2025 17:32:58 +0000 (17:32 +0000)] 
mpls: Add mpls_dev_rcu().

mpls_dev_get() uses rcu_dereference_rtnl() to fetch dev->mpls_ptr.

We will replace RTNL with a dedicated mutex to protect the field.

Then, we will use rcu_dereference_protected() for clarity.

Let's add mpls_dev_rcu() for the RCU reader.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20251029173344.2934622-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agompls: Use in6_dev_rcu() and dev_net_rcu() in mpls_forward() and mpls_xmit().
Kuniyuki Iwashima [Wed, 29 Oct 2025 17:32:57 +0000 (17:32 +0000)] 
mpls: Use in6_dev_rcu() and dev_net_rcu() in mpls_forward() and mpls_xmit().

mpls_forward() and mpls_xmit() are called under RCU.

Let's use in6_dev_rcu() and dev_net_rcu() there to annotate
as such.

Now we pass net to mpls_stats_inc_outucastpkts() not to read
dev_net_rcu() twice.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20251029173344.2934622-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agoipv6: Add in6_dev_rcu().
Kuniyuki Iwashima [Wed, 29 Oct 2025 17:32:56 +0000 (17:32 +0000)] 
ipv6: Add in6_dev_rcu().

rcu_dereference_rtnl() does not clearly tell whether the caller
is under RCU or RTNL.

Let's add in6_dev_rcu() to make it easy to remove __in6_dev_get()
in the future.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20251029173344.2934622-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agompls: Unify return paths in mpls_dev_notify().
Kuniyuki Iwashima [Wed, 29 Oct 2025 17:32:55 +0000 (17:32 +0000)] 
mpls: Unify return paths in mpls_dev_notify().

We will protect net->mpls.platform_label by a dedicated mutex.

Then, we need to wrap functions called from mpls_dev_notify()
with the mutex.

As a prep, let's unify the return paths.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20251029173344.2934622-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agompls: Hold dev refcnt for mpls_nh.
Kuniyuki Iwashima [Wed, 29 Oct 2025 17:32:54 +0000 (17:32 +0000)] 
mpls: Hold dev refcnt for mpls_nh.

MPLS uses RTNL

  1) to guarantee the lifetime of struct mpls_nh.nh_dev
  2) to protect net->mpls.platform_label

, but neither actually requires RTNL.

If we do not call dev_put() in find_outdev() and call it
just before freeing struct mpls_route, we can drop RTNL for 1).

Let's hold the refcnt of mpls_nh.nh_dev and track it with
netdevice_tracker.

Two notable changes:

If mpls_nh_build_multi() fails to set up a neighbour, we need
to call netdev_put() for successfully created neighbours in
mpls_rt_free_rcu(), so the number of neighbours (rt->rt_nhn)
is now updated in each iteration.

When a dev is unregistered, mpls_ifdown() clones mpls_route
and replaces it with the clone, so the clone requires extra
netdev_hold().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20251029173344.2934622-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agompls: Return early in mpls_label_ok().
Kuniyuki Iwashima [Wed, 29 Oct 2025 17:32:53 +0000 (17:32 +0000)] 
mpls: Return early in mpls_label_ok().

When mpls_label_ok() returns false, it does not need to update *index.

Let's remove is_ok and return early.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20251029173344.2934622-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: stmmac: rename devlink parameter ts_coarse into phc_coarse_adj
Maxime Chevallier [Thu, 30 Oct 2025 18:24:53 +0000 (19:24 +0100)] 
net: stmmac: rename devlink parameter ts_coarse into phc_coarse_adj

The devlink param "ts_coarse" doesn't indicate that we get coarse
timestamps, but rather that the PHC clock adjusments are coarse as the
frequency won't be continuously adjusted. Adjust the devlink parameter
name to reflect that.

The Coarse terminlogy comes from the dwmac register naming, update the
documentation to better explain what the parameter is about.

With this change, the parameter can now be adjusted using:
  devlink dev param set <dev> name phc_coarse_adj value true cmode runtime

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20251030182454.182406-1-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: dsa: yt921x: Fix spelling mistake "stucked" -> "stuck"
Colin Ian King [Sat, 1 Nov 2025 18:34:46 +0000 (18:34 +0000)] 
net: dsa: yt921x: Fix spelling mistake "stucked" -> "stuck"

There is a spelling mistake in a dev_err message. Fix it.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251101183446.32134-1-colin.i.king@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: phy: realtek: add interrupt support for RTL8221B
Jianhui Zhao [Sun, 2 Nov 2025 15:26:37 +0000 (16:26 +0100)] 
net: phy: realtek: add interrupt support for RTL8221B

This commit introduces interrupt support for RTL8221B (C45 mode).
Interrupts are mapped on the VEND2 page. VEND2 registers are only
accessible via C45 reads and cannot be accessed by C45 over C22.

Signed-off-by: Jianhui Zhao <zhaojh329@gmail.com>
[Enable only link state change interrupts]
Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251102152644.1676482-1-olek2@wp.pl
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agohinic3: fix misleading error message in hinic3_open_channel()
Alok Tiwari [Fri, 31 Oct 2025 11:26:44 +0000 (04:26 -0700)] 
hinic3: fix misleading error message in hinic3_open_channel()

The error message printed when hinic3_configure() fails incorrectly
reports "Failed to init txrxq irq", which does not match the actual
operation performed. The hinic3_configure() function sets up various
device resources such as MTU and RSS parameters , not IRQ initialization.

Update the log to "Failed to configure device resources" to make the
message accurate and clearer for debugging.

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Fan Gong <gongfan1@huawei.com>
Link: https://patch.msgid.link/20251031112654.46187-1-alok.a.tiwari@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoDocumentation: ARCnet: Update obsolete contact info
Bagas Sanjaya [Tue, 28 Oct 2025 01:44:52 +0000 (08:44 +0700)] 
Documentation: ARCnet: Update obsolete contact info

ARCnet docs states that inquiries on the subsystem should be emailed to
Avery Pennarun <apenwarr@worldvisions.ca>, for whom has been in CREDITS
since the beginning of kernel git history and her email address is
unreachable (bounce). The subsystem is now maintained by Michael
Grzeschik since c38f6ac74c9980 ("MAINTAINERS: add arcnet and take
maintainership").

In addition, there used to be a dedicated ARCnet mailing list but its
archive at epistolary.org has been shut down. ARCnet discussion nowadays
take place in netdev list. The arcnet.com domain mentioned has become
AIoT (Artificial Intelligence of Things) related Typeform page and
ARCnet info now resides on arcnet.cc (ARCnet Resource Center) instead.

Update contact information.

Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20251028014451.10521-2-bagasdotme@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'dpll-add-support-for-phase-adjustment-granularity'
Jakub Kicinski [Sat, 1 Nov 2025 00:59:21 +0000 (17:59 -0700)] 
Merge branch 'dpll-add-support-for-phase-adjustment-granularity'

Ivan Vecera says:

====================
dpll: Add support for phase adjustment granularity

Phase-adjust values are currently limited only by a min-max range. Some
hardware requires, for certain pin types, that values be multiples of
a specific granularity, as in the zl3073x driver.

Patch 1: Adds 'phase-adjust-gran' pin attribute and an appropriate
         handling
Patch 2: Adds a support for this attribute into zl3073x driver
====================

Link: https://patch.msgid.link/20251029153207.178448-1-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agodpll: zl3073x: Specify phase adjustment granularity for pins
Ivan Vecera [Wed, 29 Oct 2025 15:32:07 +0000 (16:32 +0100)] 
dpll: zl3073x: Specify phase adjustment granularity for pins

Output pins phase adjustment values in the device are expressed
in half synth clock cycles. Use this number of cycles as output
pins' phase adjust granularity and simplify both get/set callbacks.

Reviewed-by: Michal Schmidt <mschmidt@redhat.com>
Reviewed-by: Petr Oros <poros@redhat.com>
Tested-by: Prathosh Satish <Prathosh.Satish@microchip.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Link: https://patch.msgid.link/20251029153207.178448-3-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agodpll: add phase-adjust-gran pin attribute
Ivan Vecera [Wed, 29 Oct 2025 15:32:06 +0000 (16:32 +0100)] 
dpll: add phase-adjust-gran pin attribute

Phase-adjust values are currently limited by a min-max range. Some
hardware requires, for certain pin types, that values be multiples of
a specific granularity, as in the zl3073x driver.

Add a `phase-adjust-gran` pin attribute and an appropriate field in
dpll_pin_properties. If set by the driver, use its value to validate
user-provided phase-adjust values.

Reviewed-by: Michal Schmidt <mschmidt@redhat.com>
Reviewed-by: Petr Oros <poros@redhat.com>
Tested-by: Prathosh Satish <Prathosh.Satish@microchip.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Link: https://patch.msgid.link/20251029153207.178448-2-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'net-pse-pd-add-tps23881b-support'
Jakub Kicinski [Sat, 1 Nov 2025 00:56:33 +0000 (17:56 -0700)] 
Merge branch 'net-pse-pd-add-tps23881b-support'

Thomas Wismer says:

====================
net: pse-pd: Add TPS23881B support

This patch series aims at adding support for the TI TPS23881B PoE
PSE controller.
====================

Link: https://patch.msgid.link/20251029212312.108749-1-thomas@wismer.xyz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agodt-bindings: pse-pd: ti,tps23881: Add TPS23881B
Thomas Wismer [Wed, 29 Oct 2025 21:23:10 +0000 (22:23 +0100)] 
dt-bindings: pse-pd: ti,tps23881: Add TPS23881B

Add the TPS23881B I2C power sourcing equipment controller to the list of
supported devices.

Falling back to the TPS23881 predecessor device is not suitable as firmware
loading needs to handled differently by the driver. The TPS23881 and
TPS23881B devices require different firmware. Trying to load the TPS23881
firmware on a TPS23881B device fails and must therefore be omitted.

Signed-off-by: Thomas Wismer <thomas.wismer@scs.ch>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20251029212312.108749-3-thomas@wismer.xyz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: pse-pd: tps23881: Add support for TPS23881B
Thomas Wismer [Wed, 29 Oct 2025 21:23:09 +0000 (22:23 +0100)] 
net: pse-pd: tps23881: Add support for TPS23881B

The TPS23881B uses different firmware than the TPS23881. Trying to load the
TPS23881 firmware on a TPS23881B device fails and must be omitted.

The TPS23881B ships with a more recent ROM firmware. Moreover, no updated
firmware has been released yet and so the firmware loading step must be
skipped. As of today, the TPS23881B is intended to use its ROM firmware.

Signed-off-by: Thomas Wismer <thomas.wismer@scs.ch>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20251029212312.108749-2-thomas@wismer.xyz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoDocumentation: netconsole: Separate literal code blocks for full and short netcat...
Bagas Sanjaya [Thu, 30 Oct 2025 07:50:13 +0000 (14:50 +0700)] 
Documentation: netconsole: Separate literal code blocks for full and short netcat command name versions

Both full and short (abbreviated) command name versions of netcat
example are combined in single literal code block due to 'or::'
paragraph being indented one more space than the preceding paragraph
(before the short version example).

Unindent it to separate the versions.

Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20251030075013.40418-1-bagasdotme@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'net-phy-microchip_t1s-add-support-for-lan867x-rev-d0-phy'
Jakub Kicinski [Fri, 31 Oct 2025 23:52:08 +0000 (16:52 -0700)] 
Merge branch 'net-phy-microchip_t1s-add-support-for-lan867x-rev-d0-phy'

Parthiban Veerasooran says:

====================
net: phy: microchip_t1s: Add support for LAN867x Rev.D0 PHY

This patch series adds support for the latest Microchip LAN8670/1/2 Rev.D0
10BASE-T1S PHYs to the microchip_t1s driver.

The new Rev.D0 silicon introduces updated initialization requirements and
link status handling behavior compared to earlier revisions (Rev.C2 and
below). These updates are necessary for full compliance with the OPEN
Alliance 10BASE-T1S specification and are documented in Microchip
Application Note AN1699 Revision G (DS60001699G – October 2025).

Summary of changes:
- Implements Rev.D0-specific configuration sequence as described in AN1699
  Rev.G.
- Introduces link status control configuration for LAN867x Rev.D0.
====================

Link: https://patch.msgid.link/20251030102258.180061-1-parthiban.veerasooran@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: phy: microchip_t1s: configure link status control for LAN867x Rev.D0
Parthiban Veerasooran [Thu, 30 Oct 2025 10:22:58 +0000 (15:52 +0530)] 
net: phy: microchip_t1s: configure link status control for LAN867x Rev.D0

Configure the link status in the Link Status Control register for
LAN8670/1/2 Rev.D0 PHYs, depending on whether PLCA or CSMA/CD mode
is enabled. When PLCA is enabled, the link status reflects the PLCA
status. When PLCA is disabled (CSMA/CD mode), the PHY does not support
autonegotiation, so the link status is forced active by setting
the LINK_STATUS_SEMAPHORE bit.

The link status control is configured:
- During PHY initialization, for default CSMA/CD mode.
- Whenever PLCA configuration is updated.

This ensures correct link reporting and consistent behavior for
LAN867x Rev.D0 devices.

Signed-off-by: Parthiban Veerasooran <parthiban.veerasooran@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251030102258.180061-3-parthiban.veerasooran@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: phy: microchip_t1s: add support for Microchip LAN867X Rev.D0 PHY
Parthiban Veerasooran [Thu, 30 Oct 2025 10:22:57 +0000 (15:52 +0530)] 
net: phy: microchip_t1s: add support for Microchip LAN867X Rev.D0 PHY

Add support for the LAN8670/1/2 Rev.D0 10BASE-T1S PHYs from Microchip.
The new Rev.D0 silicon requires a specific set of initialization
settings to be configured for optimal performance and compliance with
OPEN Alliance specifications, as described in Microchip Application Note
AN1699 (Revision G, DS60001699G – October 2025).
https://www.microchip.com/en-us/application-notes/an1699

Signed-off-by: Parthiban Veerasooran <parthiban.veerasooran@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251030102258.180061-2-parthiban.veerasooran@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: stmmac: qcom-ethqos: remove MAC_CTRL_REG modification
Russell King (Oracle) [Thu, 30 Oct 2025 10:20:32 +0000 (10:20 +0000)] 
net: stmmac: qcom-ethqos: remove MAC_CTRL_REG modification

When operating in "SGMII" mode (Cisco SGMII or 2500BASE-X), qcom-ethqos
modifies the MAC control register in its ethqos_configure_sgmii()
function, which is only called from one path:

stmmac_mac_link_up()
+- reads MAC_CTRL_REG
+- masks out priv->hw->link.speed_mask
+- sets bits according to speed (2500, 1000, 100, 10) from priv->hw.link.speed*
+- ethqos_fix_mac_speed()
|  +- qcom_ethqos_set_sgmii_loopback(false)
|  +- ethqos_update_link_clk(speed)
|  `- ethqos_configure(speed)
|     `- ethqos_configure_sgmii(speed)
|        +- reads MAC_CTRL_REG,
|        +- configures PS/FES bits according to speed
|        `- writes MAC_CTRL_REG as the last operation
+- sets duplex bit(s)
+- stmmac_mac_flow_ctrl()
+- writes MAC_CTRL_REG if changed from original read
...

As can be seen, the modification of the control register that
stmmac_mac_link_up() overwrites the changes that ethqos_fix_mac_speed()
does to the register. This makes ethqos_configure_sgmii()'s
modification questionable at best.

Analysing the values written, GMAC4 sets the speed bits as:
speed_mask = GMAC_CONFIG_FES | GMAC_CONFIG_PS
speed2500 = GMAC_CONFIG_FES                     B14=1 B15=0
speed1000 = 0                                   B14=0 B15=0
speed100 = GMAC_CONFIG_FES | GMAC_CONFIG_PS     B14=1 B15=1
speed10 = GMAC_CONFIG_PS                        B14=0 B15=1

Whereas ethqos_configure_sgmii():
2500: clears ETHQOS_MAC_CTRL_PORT_SEL           B14=X B15=0
1000: clears ETHQOS_MAC_CTRL_PORT_SEL           B14=X B15=0
100: sets ETHQOS_MAC_CTRL_PORT_SEL |            B14=1 B15=1
          ETHQOS_MAC_CTRL_SPEED_MODE
10: sets ETHQOS_MAC_CTRL_PORT_SEL               B14=0 B15=1
    clears ETHQOS_MAC_CTRL_SPEED_MODE

Thus, they appear to be doing very similar, with the exception of the
FES bit (bit 14) for 1G and 2.5G speeds.

Given that stmmac_mac_link_up() will write the MAC_CTRL_REG after
ethqos_configure_sgmii(), remove the unnecessary update in the
glue driver's ethqos_configure_sgmii() method, simplifying the code.

Konrad states:

Without any additional knowledge, the register description says:

2500: B14=1 B15=0
1000: B14=0 B15=0
 100: B14=1 B15=1
  10: B14=0 B15=1

Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vEPlg-0000000CFHY-282A@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'convert-mlx5e-and-ipoib-to-ndo_hwtstamp_get-set'
Jakub Kicinski [Fri, 31 Oct 2025 23:41:12 +0000 (16:41 -0700)] 
Merge branch 'convert-mlx5e-and-ipoib-to-ndo_hwtstamp_get-set'

Tariq Toukan says:

====================
Convert mlx5e and IPoIB to ndo_hwtstamp_get/set

This series by Carolina migrates mlx5e and IPoIB to the
ndo_hwtstamp_get/set interface and removes legacy hardware timestamp
ioctl handling.  While doing so, it also cleans up naming and removes
redundant code.

No functional change in timestamp behavior.

Cleanup patches:
- net/mlx5e: Remove redundant tstamp pointer from channel structures
- net/mlx5e: Remove unnecessary tstamp local variable in mlx5i_complete_rx_cqe
- net/mlx5e: Rename hwstamp functions to hwtstamp
- net/mlx5e: Rename timestamp fields to hwtstamp_config

Add suppport in ipoib:
- IB/IPoIB: Add support for hwtstamp get/set ndos

Convert mlx5:
- net/mlx5e: Convert to new hwtstamp_get/set interface
====================

Link: https://patch.msgid.link/1761819910-1011051-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet/mlx5e: Convert to new hwtstamp_get/set interface
Carolina Jubran [Thu, 30 Oct 2025 10:25:10 +0000 (12:25 +0200)] 
net/mlx5e: Convert to new hwtstamp_get/set interface

Migrate from the legacy ioctl hardware timestamping interface to the
ndo_hwtstamp_get/set operations.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1761819910-1011051-7-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoIB/IPoIB: Add support for hwtstamp get/set ndos
Carolina Jubran [Thu, 30 Oct 2025 10:25:09 +0000 (12:25 +0200)] 
IB/IPoIB: Add support for hwtstamp get/set ndos

Add support for the ndo_hwtstamp_get and ndo_hwtstamp_set operations in
IPoIB. This allows lower devices to handle hardware timestamp
configuration through the new ndos instead of the legacy ioctls.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1761819910-1011051-6-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet/mlx5e: Rename timestamp fields to hwtstamp_config
Carolina Jubran [Thu, 30 Oct 2025 10:25:08 +0000 (12:25 +0200)] 
net/mlx5e: Rename timestamp fields to hwtstamp_config

Rename hardware timestamp-related fields from 'tstamp' to
'hwtstamp_config' throughout the MLX5 driver. The new name is more
descriptive as it clearly indicates these fields contain hardware
timestamp configuration.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1761819910-1011051-5-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet/mlx5e: Rename hwstamp functions to hwtstamp
Carolina Jubran [Thu, 30 Oct 2025 10:25:07 +0000 (12:25 +0200)] 
net/mlx5e: Rename hwstamp functions to hwtstamp

Rename mlx5e_hwstamp_set/get() functions to mlx5e_hwtstamp_set/get()
to better reflect that these functions handle hardware timestamping,
not just hardware stamping.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1761819910-1011051-4-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet/mlx5e: Remove unnecessary tstamp local variable in mlx5i_complete_rx_cqe
Carolina Jubran [Thu, 30 Oct 2025 10:25:06 +0000 (12:25 +0200)] 
net/mlx5e: Remove unnecessary tstamp local variable in mlx5i_complete_rx_cqe

Remove the tstamp local variable in mlx5i_complete_rx_cqe() and directly
pass the tstamp field from priv to mlx5e_rx_hw_stamp(). The local variable
was only used once and provided no additional value.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1761819910-1011051-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet/mlx5e: Remove redundant tstamp pointer from channel structures
Carolina Jubran [Thu, 30 Oct 2025 10:25:05 +0000 (12:25 +0200)] 
net/mlx5e: Remove redundant tstamp pointer from channel structures

Remove the tstamp pointer field from mlx5e_channel, mlx5e_ptp, and
mlx5e_trap structures, since it was only used to reference the tstamp
field in the priv structure. Instead, directly use the tstamp field
from priv when initializing RQ structures.

Also remove the unused hwtstamp_config field from mlx5_clock structure
as part of the cleanup.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1761819910-1011051-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: mana: Support HW link state events
Haiyang Zhang [Wed, 29 Oct 2025 20:43:10 +0000 (13:43 -0700)] 
net: mana: Support HW link state events

Handle the NIC hardware link state events received from the HW
channel, then set the proper link state accordingly.

And, add a feature bit, GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE,
to inform the NIC hardware this handler exists.

Our MANA NIC only sends out the link state down/up messages
when we need to let the VM rerun DHCP client and change IP
address. So, add netif_carrier_on() in the probe(), let the NIC
show the right initial state in /sys/class/net/ethX/operstate.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1761770601-16920-1-git-send-email-haiyangz@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Jakub Kicinski [Thu, 16 Oct 2025 17:53:13 +0000 (10:53 -0700)] 
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-6.18-rc4).

No conflicts, adjacent changes:

drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
  ded9813d17d3 ("net: stmmac: Consider Tx VLAN offload tag length for maxSDU")
  26ab9830beab ("net: stmmac: replace has_xxxx with core_type")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agocxgb4: flower: add support for fragmentation
Harshita V Rajput [Tue, 28 Oct 2025 07:52:55 +0000 (13:22 +0530)] 
cxgb4: flower: add support for fragmentation

This patch adds support for matching fragmented packets in tc flower
filters.

Previously, commit 93a8540aac72 ("cxgb4: flower: validate control flags")
added a check using flow_rule_match_has_control_flags() to reject
any rules with control flags, as the driver did not support
fragmentation at that time.

Now, with this patch, support for FLOW_DIS_IS_FRAGMENT is added:
- The driver checks for control flags using
  flow_rule_is_supp_control_flags(), as recommended in
  commit d11e63119432 ("flow_offload: add control flag checking helpers").
- If the fragmentation flag is present, the driver sets `fs->val.frag` and
  `fs->mask.frag` accordingly in the filter specification.

Since fragmentation is now supported, the earlier check that rejected all
control flags (flow_rule_match_has_control_flags()) has been removed.

Signed-off-by: Harshita V Rajput <harshitha.vr@chelsio.com>
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251028075255.1391596-1-harshitha.vr@chelsio.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge tag 'net-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Fri, 31 Oct 2025 01:35:35 +0000 (18:35 -0700)] 
Merge tag 'net-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from wireless, Bluetooth and netfilter.

  Current release - regressions:

    - tcp: fix too slow tcp_rcvbuf_grow() action

    - bluetooth: fix corruption in h4_recv_buf() after cleanup

  Previous releases - regressions:

    - mptcp: restore window probe

    - bluetooth:
       - fix connection cleanup with BIG with 2 or more BIS
       - fix crash in set_mesh_sync and set_mesh_complete

    - batman-adv: release references to inactive interfaces

    - nic:
       - ice: fix usage of logical PF id
       - sfc: fix potential memory leak in efx_mae_process_mport()

  Previous releases - always broken:

    - devmem: refresh devmem TX dst in case of route invalidation

    - netfilter: add seqadj extension for natted connections

    - wifi:
       - iwlwifi: fix potential use after free in iwl_mld_remove_link()
       - brcmfmac: fix crash while sending action frames in standalone AP Mode

    - eth:
       - mlx5e: cancel tls RX async resync request in error flows
       - ixgbe: fix memory leak and use-after-free in ixgbe_recovery_probe()
       - hibmcge: fix rx buf avl irq is not re-enabled in irq_handle issue
       - cxgb4: fix potential use-after-free in ipsec callback
       - nfp: fix memory leak in nfp_net_alloc()"

* tag 'net-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (75 commits)
  net: sctp: fix KMSAN uninit-value in sctp_inq_pop
  net: devmem: refresh devmem TX dst in case of route invalidation
  net: stmmac: est: Fix GCL bounds checks
  net: stmmac: Consider Tx VLAN offload tag length for maxSDU
  net: stmmac: vlan: Disable 802.1AD tag insertion offload
  net/mlx5e: kTLS, Cancel RX async resync request in error flows
  net: tls: Cancel RX async resync request on rcd_delta overflow
  net: tls: Change async resync helpers argument
  net: phy: dp83869: fix STRAP_OPMODE bitmask
  selftests: net: use BASH for bareudp testing
  net: mctp: Fix tx queue stall
  net/mlx5: Don't zero user_count when destroying FDB tables
  net: usb: asix_devices: Check return value of usbnet_get_endpoints
  mptcp: zero window probe mib
  mptcp: restore window probe
  mptcp: fix MSG_PEEK stream corruption
  mptcp: drop bogus optimization in __mptcp_check_push()
  netconsole: Fix race condition in between reader and writer of userdata
  Documentation: netconsole: Remove obsolete contact people
  nfp: xsk: fix memory leak in nfp_net_alloc()
  ...

6 weeks agoMerge tag 'nf-next-25-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/netfi...
Jakub Kicinski [Fri, 31 Oct 2025 00:57:07 +0000 (17:57 -0700)] 
Merge tag 'nf-next-25-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Florian Westphal says:

====================
netfilter: updates for net-next

1) Convert nf_tables 'nft_set_iter' usage to use C99 struct
   initialization, from Fernando Fernandez Mancera.
2) Disallow nf_conntrack_max=0.  This was an (undocumented)
   historic inheritance from ip_conntrack (ipv4 only nf_conntrack
   predecessor).  Doing so will simplify future changes to make
   this pernet-tuneable.
3) Fix a typo in conntrack.h comment, from Weibiao Tu.

* tag 'nf-next-25-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: fix typo in nf_conntrack_l4proto.h comment
  netfilter: conntrack: disable 0 value for conntrack_max setting
  netfilter: nf_tables: use C99 struct initializer for nft_set_iter
====================

Link: https://patch.msgid.link/20251030121954.29175-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge tag 'wireless-next-2025-10-30' of https://git.kernel.org/pub/scm/linux/kernel...
Jakub Kicinski [Fri, 31 Oct 2025 00:38:37 +0000 (17:38 -0700)] 
Merge tag 'wireless-next-2025-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Johannes Berg says:

====================
Not that many changes this time:

 - mac80211:
   - improved VHT radiotap reporting
   - S1G improvements
   - multi-radio monitor improvements
   - HT action frame handling on 6 GHz
   - mesh rate tracking improvements
   - CSA handling improvements
 - cfg80211: multi-radio debugfs
 - rt2x00: improvements for embedded platforms

* tag 'wireless-next-2025-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next:
  wifi: mac80211: Allow HT Action frame processing on 6 GHz when HE is supported
  wifi: rt2x00: add nvmem eeprom support
  wifi: mac80211: add RX flag to report radiotap VHT information
  net: wireless: Remove redundant pm_runtime_mark_last_busy() calls
  wifi: cfg80211: Add parameters to radio-specific debugfs directories
  wifi: cfg80211: Add debugfs support for multi-radio wiphy
  wifi: mac80211: fix missing RX bitrate update for mesh forwarding path
  wifi: cfg80211: default S1G chandef width to 1MHz
  wifi: mac80211: get probe response chan via ieee80211_get_channel_khz
  wifi: mac80211: reset CRC valid after CSA
  wifi: mac80211_hwsim: advertise puncturing feature support
  wifi: cfg80211/mac80211: validate radio frequency range for monitor mode
  wifi: rt2x00: check retval for of_get_mac_address
====================

Link: https://patch.msgid.link/20251030105355.13216-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoselftests: drv-net: replace the nsim ring test with a drv-net one
Jakub Kicinski [Wed, 29 Oct 2025 16:49:30 +0000 (09:49 -0700)] 
selftests: drv-net: replace the nsim ring test with a drv-net one

We are trying to move away from netdevsim-only tests and towards
tests which can be run both against netdevsim and real drivers.

Replace the simple bash script we have for checking ethtool -g/-G
on netdevsim with a Python test tweaking those params as well
as channel count.

The new test is not exactly equivalent to the netdevsim one,
but real drivers don't often support random ring sizes,
let alone modifying max values via debugfs.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251029164930.2923448-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next...
Jakub Kicinski [Fri, 31 Oct 2025 00:24:56 +0000 (17:24 -0700)] 
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2025-10-29 (ice, i40e, idpf, ixgbe, igbvf)

For ice:
Michal converts driver to utilize Page Pool and libeth APIs. Conversion
is based on similar changes done for iavf in order to simplify buffer
management, improve maintainability, and increase code reuse across
Intel Ethernet drivers.

Additional details:
https://lore.kernel.org/20250925092253.1306476-1-michal.kubiak@intel.com

Alexander adds support for header split, configurable via ethtool.

Grzegorz allows for use of 100Mbps on E825C SGMII devices.

For i40e:
Jay Vosburgh avoids sending link state changes to VF if it is already in
the requested state.

For idpf:
Sreedevi removes duplicated defines.

For ixgbe:
Alok Tiwari fixes some typos.

For igbvf:
Alok Tiwari fixes output of VLAN warning message.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  igbvf: fix misplaced newline in VLAN add warning message
  ixgbe: fix typos in ixgbe driver comments
  idpf: remove duplicate defines in IDPF_CAP_RSS
  i40e: avoid redundant VF link state updates
  ice: Allow 100M speed for E825C SGMII device
  ice: implement configurable header split for regular Rx
  ice: switch to Page Pool
  ice: drop page splitting and recycling
  ice: remove legacy Rx and construct SKB
====================

Link: https://patch.msgid.link/20251029231218.1277233-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'net-smc-make-wr-buffer-count-configurable'
Paolo Abeni [Thu, 30 Oct 2025 12:31:45 +0000 (13:31 +0100)] 
Merge branch 'net-smc-make-wr-buffer-count-configurable'

Halil Pasic says:

====================
net/smc: make wr buffer count configurable

The current value of SMC_WR_BUF_CNT is 16 which leads to heavy
contention on the wr_tx_wait workqueue of the SMC-R linkgroup and its
spinlock when many connections are competing for the work request
buffers. Currently up to 256 connections per linkgroup are supported.

To make things worse when finally a buffer becomes available and
smc_wr_tx_put_slot() signals the linkgroup's wr_tx_wait wq, because
WQ_FLAG_EXCLUSIVE is not used all the waiters get woken up, most of the
time a single one can proceed, and the rest is contending on the
spinlock of the wq to go to sleep again.

Addressing this by simply bumping SMC_WR_BUF_CNT to 256 was deemed
risky, because the large-ish physically continuous allocation could fail
and lead to TCP fall-backs. For reference see this discussion thread on
"[PATCH net-next] net/smc: increase SMC_WR_BUF_CNT" (in archive
https://lists.openwall.net/netdev/2024/11/05/186), which concludes with
the agreement to try to come up with something smarter, which is what
this series aims for.

Additionally if for some reason it is known that heavy contention is not
to be expected going with something like 256 work request buffers is
wasteful. To address these concerns make the number of work requests
configurable, and introduce a back-off logic with handles -ENOMEM form
smc_wr_alloc_link_mem() gracefully.

v5: https://lore.kernel.org/netdev/20250929000001.1752206-1-pasic@linux.ibm.com/
v4: https://lore.kernel.org/netdev/20250927232144.3478161-1-pasic@linux.ibm.com/
v3: https://lore.kernel.org/netdev/20250921214440.325325-1-pasic@linux.ibm.com/
v2: https://lore.kernel.org/netdev/20250908220150.3329433-1-pasic@linux.ibm.com/
v1: https://lore.kernel.org/all/20250904211254.1057445-1-pasic@linux.ibm.com/
====================

Link: https://patch.msgid.link/20251027224856.2970019-1-pasic@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agonet/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully
Halil Pasic [Mon, 27 Oct 2025 22:48:56 +0000 (23:48 +0100)] 
net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully

Currently if a -ENOMEM from smc_wr_alloc_link_mem() is handled by
giving up and going the way of a TCP fallback. This was reasonable
before the sizes of the allocations there were compile time constants
and reasonably small. But now those are actually configurable.

So instead of giving up, keep retrying with half of the requested size
unless we dip below the old static sizes -- then give up! In terms of
numbers that means we give up when it is certain that we at best would
end up allocating less than 16 send WR buffers or less than 48 recv WR
buffers. This is to avoid regressions due to having fewer buffers
compared the static values of the past.

Please note that SMC-R is supposed to be an optimisation over TCP, and
falling back to TCP is superior to establishing an SMC connection that
is going to perform worse. If the memory allocation fails (and we
propagate -ENOMEM), we fall back to TCP.

Preserve (modulo truncation) the ratio of send/recv WR buffer counts.

Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Tested-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Link: https://patch.msgid.link/20251027224856.2970019-3-pasic@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agonet/smc: make wr buffer count configurable
Halil Pasic [Mon, 27 Oct 2025 22:48:55 +0000 (23:48 +0100)] 
net/smc: make wr buffer count configurable

Think SMC_WR_BUF_CNT_SEND := SMC_WR_BUF_CNT used in send context and
SMC_WR_BUF_CNT_RECV := 3 * SMC_WR_BUF_CNT used in recv context. Those
get replaced with lgr->max_send_wr and lgr->max_recv_wr respective.

Please note that although with the default sysctl values
qp_attr.cap.max_send_wr ==  qp_attr.cap.max_recv_wr is maintained but
can not be assumed to be generally true any more. I see no downside to
that, but my confidence level is rather modest.

Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Tested-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Link: https://patch.msgid.link/20251027224856.2970019-2-pasic@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agonetfilter: fix typo in nf_conntrack_l4proto.h comment
caivive (Weibiao Tu) [Thu, 28 Nov 2024 12:52:04 +0000 (20:52 +0800)] 
netfilter: fix typo in nf_conntrack_l4proto.h comment

In the comment for nf_conntrack_l4proto.h, the word "nfnetink" was
incorrectly spelled. It has been corrected to "nfnetlink".

Fixes a typo to enhance readability and ensure consistency.

Signed-off-by: caivive (Weibiao Tu) <cavivie@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
6 weeks agonetfilter: conntrack: disable 0 value for conntrack_max setting
Florian Westphal [Sun, 21 Sep 2025 15:45:30 +0000 (17:45 +0200)] 
netfilter: conntrack: disable 0 value for conntrack_max setting

Undocumented historical artifact inherited from ip_conntrack.
If value is 0, then no limit is applied at all, conntrack table
can grow to huge value, only limited by size of conntrack hashes and
the kernel-internal upper limit on the hash chain lengths.

This feature makes no sense; users can just set
conntrack_max=2147483647 (INT_MAX).

Disallow a 0 value.  This will make it slightly easier to allow
per-netns constraints for this value in a future patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
6 weeks agonetfilter: nf_tables: use C99 struct initializer for nft_set_iter
Fernando Fernandez Mancera [Mon, 13 Oct 2025 08:59:48 +0000 (10:59 +0200)] 
netfilter: nf_tables: use C99 struct initializer for nft_set_iter

Use C99 struct initializer for nft_set_iter, simplifying the code and
preventing future errors due to uninitialized fields if new fields are
added to the struct.

Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
6 weeks agonet: sctp: fix KMSAN uninit-value in sctp_inq_pop
Ranganath V N [Sun, 26 Oct 2025 16:33:12 +0000 (22:03 +0530)] 
net: sctp: fix KMSAN uninit-value in sctp_inq_pop

Fix an issue detected by syzbot:

KMSAN reported an uninitialized-value access in sctp_inq_pop
BUG: KMSAN: uninit-value in sctp_inq_pop

The issue is actually caused by skb trimming via sk_filter() in sctp_rcv().
In the reproducer, skb->len becomes 1 after sk_filter(), which bypassed the
original check:

        if (skb->len < sizeof(struct sctphdr) + sizeof(struct sctp_chunkhdr) +
                       skb_transport_offset(skb))
To handle this safely, a new check should be performed after sk_filter().

Reported-by: syzbot+d101e12bccd4095460e7@syzkaller.appspotmail.com
Tested-by: syzbot+d101e12bccd4095460e7@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d101e12bccd4095460e7
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Suggested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Ranganath V N <vnranganath.20@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251026-kmsan_fix-v3-1-2634a409fa5f@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agoMerge branch 'add-cn20k-nix-and-npa-contexts'
Paolo Abeni [Thu, 30 Oct 2025 09:44:12 +0000 (10:44 +0100)] 
Merge branch 'add-cn20k-nix-and-npa-contexts'

Subbaraya Sundeep says:

====================
Add CN20K NIX and NPA contexts

The hardware contexts of blocks NIX and NPA in CN20K silicon are
different than that of previous silicons CN10K and CN9XK. This
patchset adds the new contexts of CN20K in AF and PF drivers.
A new mailbox for enqueuing contexts to hardware is added.

Patch 1 simplifies context writing and reading by using max context
size supported by hardware instead of using each context size.
Patch 2 and 3 adds NIX block contexts in AF driver and extends
debugfs to display those new contexts
Patch 4 and 5 adds NPA block contexts in AF driver and extends
debugfs to display those new contexts
Patch 6 omits NDC configuration since CN20K NPA does not use NDC
for caching its contexts
Patch 7 and 8 uses the new NIX and NPA contexts in PF/VF driver.
Patch 9, 10 and 11 are to support more bandwidth profiles present in
CN20K for RX ratelimiting and to display new profiles in debugfs

v3: https://lore.kernel.org/all/1752772063-6160-1-git-send-email-sbhatta@marvell.com/
====================

Link: https://patch.msgid.link/1761388367-16579-1-git-send-email-sbhatta@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agoocteontx2-pf: Use new bandwidth profiles in receive queue
Subbaraya Sundeep [Sat, 25 Oct 2025 10:32:47 +0000 (16:02 +0530)] 
octeontx2-pf: Use new bandwidth profiles in receive queue

Receive queue points to a bandwidth profile for rate limiting.
Since cn20k has additional bandwidth profiles use them
too while mapping receive queue to bandwidth profile.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://patch.msgid.link/1761388367-16579-12-git-send-email-sbhatta@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agoocteontx2-af: Display new bandwidth profiles too in debugfs
Subbaraya Sundeep [Sat, 25 Oct 2025 10:32:46 +0000 (16:02 +0530)] 
octeontx2-af: Display new bandwidth profiles too in debugfs

Consider the new profiles of cn20k too while displaying
bandwidth profile contexts in debugfs.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://patch.msgid.link/1761388367-16579-11-git-send-email-sbhatta@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agoocteontx2-af: Accommodate more bandwidth profiles for cn20k
Subbaraya Sundeep [Sat, 25 Oct 2025 10:32:45 +0000 (16:02 +0530)] 
octeontx2-af: Accommodate more bandwidth profiles for cn20k

CN20K has 16k of leaf profiles, 2k of middle profiles and
256 of top profiles. This patch modifies existing receive
queue and bandwidth profile context structures to accommodate
additional profiles of cn20k.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://patch.msgid.link/1761388367-16579-10-git-send-email-sbhatta@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agoocteontx2-pf: Initialize new NIX SQ context for cn20k
Subbaraya Sundeep [Sat, 25 Oct 2025 10:32:44 +0000 (16:02 +0530)] 
octeontx2-pf: Initialize new NIX SQ context for cn20k

cn20k has different NIX context for send queue hence use
the new cn20k mailbox to init SQ context.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://patch.msgid.link/1761388367-16579-9-git-send-email-sbhatta@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agoocteontx2-pf: Initialize cn20k specific aura and pool contexts
Linu Cherian [Sat, 25 Oct 2025 10:32:43 +0000 (16:02 +0530)] 
octeontx2-pf: Initialize cn20k specific aura and pool contexts

With new CN20K NPA pool and aura contexts supported in AF
driver this patch modifies PF driver to use new NPA contexts.
Implement new hw_ops for intializing aura and pool contexts
for all the silicons.

Signed-off-by: Linu Cherian <lcherian@marvell.com>
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://patch.msgid.link/1761388367-16579-8-git-send-email-sbhatta@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agoocteontx2-af: Skip NDC operations for cn20k
Linu Cherian [Sat, 25 Oct 2025 10:32:42 +0000 (16:02 +0530)] 
octeontx2-af: Skip NDC operations for cn20k

For cn20k, NPA block doesn't use the general purpose
NDC (Near Coprocessor Bus Data cache Unit) for caching,
hence skip the NDC related operations.
Also refactor NDC configuration code to a helper function.

Signed-off-by: Linu Cherian <lcherian@marvell.com>
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://patch.msgid.link/1761388367-16579-7-git-send-email-sbhatta@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agoocteontx2-af: Extend debugfs support for cn20k NPA
Linu Cherian [Sat, 25 Oct 2025 10:32:41 +0000 (16:02 +0530)] 
octeontx2-af: Extend debugfs support for cn20k NPA

Extend debugfs to display CN20K NPA aura and pool contexts.

Signed-off-by: Linu Cherian <lcherian@marvell.com>
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://patch.msgid.link/1761388367-16579-6-git-send-email-sbhatta@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agoocteontx2-af: Add cn20k NPA block contexts
Linu Cherian [Sat, 25 Oct 2025 10:32:40 +0000 (16:02 +0530)] 
octeontx2-af: Add cn20k NPA block contexts

New CN20K silicon has NPA hardware context structures different from
previous silicons. Add NPA aura and pool context definitions for cn20k.
Extend NPA context handling support to cn20k.

Signed-off-by: Linu Cherian <lcherian@marvell.com>
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://patch.msgid.link/1761388367-16579-5-git-send-email-sbhatta@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agoocteontx2-af: Extend debugfs support for cn20k NIX
Subbaraya Sundeep [Sat, 25 Oct 2025 10:32:39 +0000 (16:02 +0530)] 
octeontx2-af: Extend debugfs support for cn20k NIX

Extend debugfs to display CN20K NIX send, receive and
completion queue contexts.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://patch.msgid.link/1761388367-16579-4-git-send-email-sbhatta@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agoocteontx2-af: Add cn20k NIX block contexts
Subbaraya Sundeep [Sat, 25 Oct 2025 10:32:38 +0000 (16:02 +0530)] 
octeontx2-af: Add cn20k NIX block contexts

New CN20K silicon has NIX hardware context structures different from
previous silicons. Add NIX send and completion queue context
definitions for cn20k. Extend NIX context handling support to cn20k.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://patch.msgid.link/1761388367-16579-3-git-send-email-sbhatta@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agoocteontx2-af: Simplify context writing and reading to hardware
Subbaraya Sundeep [Sat, 25 Oct 2025 10:32:37 +0000 (16:02 +0530)] 
octeontx2-af: Simplify context writing and reading to hardware

Simplify NIX context reading and writing by using hardware
maximum context size instead of using individual sizes of
each context type.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://patch.msgid.link/1761388367-16579-2-git-send-email-sbhatta@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agowifi: mac80211: Allow HT Action frame processing on 6 GHz when HE is supported
Thomas Wu [Tue, 28 Oct 2025 04:34:42 +0000 (10:04 +0530)] 
wifi: mac80211: Allow HT Action frame processing on 6 GHz when HE is supported

Management frames on 6 GHz do not include HT Capabilities, causing HT
Action frames to be dropped in ieee80211_rx_h_action(). The current logic
checks only ht_cap.ht_supported, which fails for 6 GHz radios that support
only HE and EHT.

Update the condition to also allow HT Action frame processing when
he_cap.has_he is true. This enables support for HE dynamic SM power save
as defined in IEEE Std 802.11ax-2021, section 26.14.4.

Signed-off-by: Thomas Wu <quic_wthomas@quicinc.com>
Signed-off-by: Aaradhana Sahu <aaradhana.sahu@oss.qualcomm.com>
Link: https://patch.msgid.link/20251028043442.523647-1-aaradhana.sahu@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
6 weeks agowifi: rt2x00: add nvmem eeprom support
Rosen Penev [Mon, 27 Oct 2025 18:06:38 +0000 (11:06 -0700)] 
wifi: rt2x00: add nvmem eeprom support

Some embedded platforms have eeproms located in flash. Add nvmem support
to handle this. Support is added for PCI and SOC backends.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
Link: https://patch.msgid.link/20251027180639.3797-1-rosenp@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
6 weeks agowifi: mac80211: add RX flag to report radiotap VHT information
Benjamin Berg [Mon, 27 Oct 2025 12:22:14 +0000 (14:22 +0200)] 
wifi: mac80211: add RX flag to report radiotap VHT information

mac80211 already reports some basic information in the radiotap header
with the known fields declared by the driver. However, drivers may want
to report more accurate information and in that case the full VHT
radiotap structure needs to be provided.

Add a new RX_FLAG_RADIOTAP_VHT which is set when the VHT information
should be pulled from the skb. Update the code to fill in the VHT fields
to only do so when requested by the driver or if the information has not
yet been set. This way the driver can fully control the information if
it chooses so.

Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20251027142118.0bad1c307a21.I2cf285c20a822698039603f2af00ed9c548f2ee0@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
6 weeks agonet: wireless: Remove redundant pm_runtime_mark_last_busy() calls
Sakari Ailus [Mon, 27 Oct 2025 11:50:21 +0000 (13:50 +0200)] 
net: wireless: Remove redundant pm_runtime_mark_last_busy() calls

pm_runtime_put_autosuspend(), pm_runtime_put_sync_autosuspend(),
pm_runtime_autosuspend() and pm_request_autosuspend() now include a call
to pm_runtime_mark_last_busy(). Remove the now-reduntant explicit call to
pm_runtime_mark_last_busy().

Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Link: https://patch.msgid.link/20251027115022.390997-3-sakari.ailus@linux.intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
6 weeks agonet: devmem: refresh devmem TX dst in case of route invalidation
Shivaji Kant [Wed, 29 Oct 2025 06:54:19 +0000 (06:54 +0000)] 
net: devmem: refresh devmem TX dst in case of route invalidation

The zero-copy Device Memory (Devmem) transmit path
relies on the socket's route cache (`dst_entry`) to
validate that the packet is being sent via the network
device to which the DMA buffer was bound.

However, this check incorrectly fails and returns `-ENODEV`
if the socket's route cache entry (`dst`) is merely missing
or expired (`dst == NULL`). This scenario is observed during
network events, such as when flow steering rules are deleted,
leading to a temporary route cache invalidation.

This patch fixes -ENODEV error for `net_devmem_get_binding()`
by doing the following:

1.  It attempts to rebuild the route via `rebuild_header()`
if the route is initially missing (`dst == NULL`). This
allows the TCP/IP stack to recover from transient route
cache misses.
2.  It uses `rcu_read_lock()` and `dst_dev_rcu()` to safely
access the network device pointer (`dst_dev`) from the
route, preventing use-after-free conditions if the
device is concurrently removed.
3.  It maintains the critical safety check by validating
that the retrieved destination device (`dst_dev`) is
exactly the device registered in the Devmem binding
(`binding->dev`).

These changes prevent unnecessary ENODEV failures while
maintaining the critical safety requirement that the
Devmem resources are only used on the bound network device.

Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Reported-by: Vedant Mathur <vedantmathur@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Fixes: bd61848900bf ("net: devmem: Implement TX path")
Signed-off-by: Shivaji Kant <shivajikant@google.com>
Link: https://patch.msgid.link/20251029065420.3489943-1-shivajikant@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'net-phy-add-iterator-mdiobus_for_each_phy'
Jakub Kicinski [Thu, 30 Oct 2025 02:00:37 +0000 (19:00 -0700)] 
Merge branch 'net-phy-add-iterator-mdiobus_for_each_phy'

Heiner Kallweit says:

====================
net: phy: add iterator mdiobus_for_each_phy

Add and use an iterator for all PHY's on a MII bus, and phy_find_next()
as a prerequisite.
====================

Link: https://patch.msgid.link/07fc63e8-53fd-46aa-853e-96187bba9d44@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: phy: use new iterator mdiobus_for_each_phy in mdiobus_prevent_c45_scan
Heiner Kallweit [Sat, 25 Oct 2025 18:52:56 +0000 (20:52 +0200)] 
net: phy: use new iterator mdiobus_for_each_phy in mdiobus_prevent_c45_scan

Use new iterator mdiobus_for_each_phy() to simplify the code.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/6d792b1e-d23d-4b7e-a94f-89c6617b620f@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: davinci_mdio: use new iterator mdiobus_for_each_phy
Heiner Kallweit [Sat, 25 Oct 2025 18:52:09 +0000 (20:52 +0200)] 
net: davinci_mdio: use new iterator mdiobus_for_each_phy

Use new iterator mdiobus_for_each_phy() to simplify the code.

Reviewed-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/326d1337-2c22-42e3-a152-046ac5c43095@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: fec: use new iterator mdiobus_for_each_phy
Heiner Kallweit [Sat, 25 Oct 2025 18:50:41 +0000 (20:50 +0200)] 
net: fec: use new iterator mdiobus_for_each_phy

Use new iterator mdiobus_for_each_phy() to simplify the code.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/65eb9490-5666-4b4a-8d26-3fca738b1315@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: phy: add iterator mdiobus_for_each_phy
Heiner Kallweit [Sat, 25 Oct 2025 18:49:50 +0000 (20:49 +0200)] 
net: phy: add iterator mdiobus_for_each_phy

Add an iterator for all PHY's on a MII bus, and phy_find_next()
as a prerequisite.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/cd112f15-401a-43d9-8525-9ff0965a68cd@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: stmmac: mdio: fix incorrect phy address check
Heiner Kallweit [Sat, 25 Oct 2025 18:35:47 +0000 (20:35 +0200)] 
net: stmmac: mdio: fix incorrect phy address check

max_addr is the max number of addresses, not the highest possible address,
therefore check phydev->mdio.addr > max_addr isn't correct.
To fix this change the semantics of max_addr, so that it represents
the highest possible address. IMO this is also a little bit more intuitive
wrt name max_addr.

Fixes: 4a107a0e8361 ("net: stmmac: mdio: use phy_find_first to simplify stmmac_mdio_register")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Reported-by: Simon Horman <horms@kernel.org>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/e869999b-2d4b-4dc1-9890-c2d3d1e8d0f8@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: wwan: Remove redundant pm_runtime_mark_last_busy() calls
Sakari Ailus [Mon, 27 Oct 2025 11:50:22 +0000 (13:50 +0200)] 
net: wwan: Remove redundant pm_runtime_mark_last_busy() calls

pm_runtime_put_autosuspend(), pm_runtime_put_sync_autosuspend(),
pm_runtime_autosuspend() and pm_request_autosuspend() now include a call
to pm_runtime_mark_last_busy(). Remove the now-reduntant explicit call to
pm_runtime_mark_last_busy().

Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Link: https://patch.msgid.link/20251027115022.390997-4-sakari.ailus@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: ipa: Remove redundant pm_runtime_mark_last_busy() calls
Sakari Ailus [Mon, 27 Oct 2025 11:50:20 +0000 (13:50 +0200)] 
net: ipa: Remove redundant pm_runtime_mark_last_busy() calls

pm_runtime_put_autosuspend(), pm_runtime_put_sync_autosuspend(),
pm_runtime_autosuspend() and pm_request_autosuspend() now include a call
to pm_runtime_mark_last_busy(). Remove the now-reduntant explicit call to
pm_runtime_mark_last_busy().

Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Link: https://patch.msgid.link/20251027115022.390997-2-sakari.ailus@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: ethernet: Remove redundant pm_runtime_mark_last_busy() calls
Sakari Ailus [Mon, 27 Oct 2025 11:50:19 +0000 (13:50 +0200)] 
net: ethernet: Remove redundant pm_runtime_mark_last_busy() calls

pm_runtime_put_autosuspend(), pm_runtime_put_sync_autosuspend(),
pm_runtime_autosuspend() and pm_request_autosuspend() now include a call
to pm_runtime_mark_last_busy(). Remove the now-reduntant explicit call to
pm_runtime_mark_last_busy().

Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Link: https://patch.msgid.link/20251027115022.390997-1-sakari.ailus@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'net-stmmac-fixes-for-stmmac-tx-vlan-insert-and-est'
Jakub Kicinski [Thu, 30 Oct 2025 01:49:26 +0000 (18:49 -0700)] 
Merge branch 'net-stmmac-fixes-for-stmmac-tx-vlan-insert-and-est'

Rohan G Thomas says:

====================
net: stmmac: Fixes for stmmac Tx VLAN insert and EST

This patchset includes following fixes for stmmac Tx VLAN insert and
EST implementations:
   1. Disable STAG insertion offloading, as DWMAC IPs doesn't support
      offload of STAG for double VLAN packets and CTAG for single VLAN
      packets when using the same register configuration. The current
      configuration in the driver is undocumented and is adding an
      additional 802.1Q tag with VLAN ID 0 for double VLAN packets.
   2. Consider Tx VLAN offload tag length for maxSDU estimation.
   3. Fix GCL bounds check
====================

Link: https://patch.msgid.link/20251028-qbv-fixes-v4-0-26481c7634e3@altera.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: stmmac: est: Fix GCL bounds checks
Rohan G Thomas [Tue, 28 Oct 2025 03:18:45 +0000 (11:18 +0800)] 
net: stmmac: est: Fix GCL bounds checks

Fix the bounds checks for the hw supported maximum GCL entry
count and gate interval time.

Fixes: b60189e0392f ("net: stmmac: Integrate EST with TAPRIO scheduler API")
Signed-off-by: Rohan G Thomas <rohan.g.thomas@altera.com>
Reviewed-by: Matthew Gerlach <matthew.gerlach@altera.com>
Link: https://patch.msgid.link/20251028-qbv-fixes-v4-3-26481c7634e3@altera.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: stmmac: Consider Tx VLAN offload tag length for maxSDU
Rohan G Thomas [Tue, 28 Oct 2025 03:18:44 +0000 (11:18 +0800)] 
net: stmmac: Consider Tx VLAN offload tag length for maxSDU

Queue maxSDU requirement of 802.1 Qbv standard requires mac to drop
packets that exceeds maxSDU length and maxSDU doesn't include
preamble, destination and source address, or FCS but includes
ethernet type and VLAN header.

On hardware with Tx VLAN offload enabled, VLAN header length is not
included in the skb->len, when Tx VLAN offload is requested. This
leads to incorrect length checks and allows transmission of
oversized packets. Add the VLAN_HLEN to the skb->len before checking
the Qbv maxSDU if Tx VLAN offload is requested for the packet.

Fixes: c5c3e1bfc9e0 ("net: stmmac: Offload queueMaxSDU from tc-taprio")
Signed-off-by: Rohan G Thomas <rohan.g.thomas@altera.com>
Reviewed-by: Matthew Gerlach <matthew.gerlach@altera.com>
Link: https://patch.msgid.link/20251028-qbv-fixes-v4-2-26481c7634e3@altera.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: stmmac: vlan: Disable 802.1AD tag insertion offload
Rohan G Thomas [Tue, 28 Oct 2025 03:18:43 +0000 (11:18 +0800)] 
net: stmmac: vlan: Disable 802.1AD tag insertion offload

The DWMAC IP's VLAN tag insertion offload does not support inserting
STAG (802.1AD) and CTAG (802.1Q) types in bytes 13 and 14 using the
same MAC_VLAN_Incl and MAC_VLAN_Inner_Incl register configurations.

Currently, MAC_VLAN_Incl is configured to offload only STAG type
insertion. However, the DWMAC IP inserts a CTAG type when the inner
VLAN ID field of the descriptor is not configured, and a STAG type
when it is configured. This behavior is not documented and leads to
inconsistent double VLAN tagging.

Additionally, an unexpected CTAG with VLAN ID 0 is inserted, resulting
in frames like:

Frame 1: 110 bytes on wire (880 bits), 110 bytes captured (880 bits)
Ethernet II, Src: <src> (<src>), Dst: <dst> (<dst>)
IEEE 802.1ad, ID: 100
802.1Q Virtual LAN, PRI: 0, DEI: 0, ID: 0 (unexpected)
802.1Q Virtual LAN, PRI: 0, DEI: 0, ID: 200
Internet Protocol Version 4, Src: 192.168.4.10, Dst: 192.168.4.11
Internet Control Message Protocol

To avoid this undocumented and incorrect behavior, disable 802.1AD tag
insertion offload. Also, don't set CSVL bit. As per the data book,
when this bit is set, S-VLAN type (0x88A8) is inserted in the 13th and
14th bytes of transmitted packets and when this bit is reset, C-VLAN
type (0x8100) is inserted in the 13th and 14th bytes of transmitted
packets.

Fixes: 30d932279dc2 ("net: stmmac: Add support for VLAN Insertion Offload")
Fixes: e94e3f3b51ce ("net: stmmac: Add support for VLAN Insertion Offload in GMAC4+")
Fixes: 1d2c7a5fee31 ("net: stmmac: Refactor VLAN implementation")
Signed-off-by: Rohan G Thomas <rohan.g.thomas@altera.com>
Reviewed-by: Boon Khai Ng <boon.khai.ng@altera.com>
Link: https://patch.msgid.link/20251028-qbv-fixes-v4-1-26481c7634e3@altera.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'net-enetc-add-i-mx94-enetc-support'
Jakub Kicinski [Thu, 30 Oct 2025 01:44:21 +0000 (18:44 -0700)] 
Merge branch 'net-enetc-add-i-mx94-enetc-support'

Wei Fang says:

====================
net: enetc: Add i.MX94 ENETC support

i.MX94 NETC has two kinds of ENETCs, one is the same as i.MX95, which
can be used as a standalone network port. The other one is an internal
ENETC, it connects to the CPU port of NETC switch through the pseudo
MAC. Also, i.MX94 have multiple PTP Timers, which is different from
i.MX95. Any PTP Timer can be bound to a specified standalone ENETC by
the IERB ETBCR registers. Currently, this patch only add ENETC support
and Timer support for i.MX94. The switch will be added by a separate
patch set.

In addition, note that i.MX94 SoC is launched after i.MX95, its NETC
has a higher version, so the driver support is added after i.MX95.
====================

Link: https://patch.msgid.link/20251029013900.407583-1-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: enetc: add standalone ENETC support for i.MX94
Wei Fang [Wed, 29 Oct 2025 01:39:00 +0000 (09:39 +0800)] 
net: enetc: add standalone ENETC support for i.MX94

The revision of i.MX94 ENETC is changed to v4.3, so add this revision to
enetc_info to support i.MX94 ENETC. And add PTP suspport for i.MX94.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20251029013900.407583-7-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: enetc: add basic support for the ENETC with pseudo MAC for i.MX94
Wei Fang [Wed, 29 Oct 2025 01:38:59 +0000 (09:38 +0800)] 
net: enetc: add basic support for the ENETC with pseudo MAC for i.MX94

The ENETC with pseudo MAC is an internal port which connects to the CPU
port of the switch. The switch CPU/host ENETC is fully integrated with
the switch and does not require a back-to-back MAC, instead a light
weight "pseudo MAC" provides the delineation between switch and ENETC.
This translates to lower power (less logic and memory) and lower delay
(as there is no serialization delay across this link).

Different from the standalone ENETC which is used as the external port,
the internal ENETC has a different PCIe device ID, and it does not have
Ethernet MAC port registers, instead, it has a small number of pseudo
MAC port registers, so some features are not supported by pseudo MAC,
such as loopback, half duplex, one-step timestamping and so on.

Therefore, the configuration of this internal ENETC is also somewhat
different from that of the standalone ENETC. So add the basic support
for ENETC with pseudo MAC. More supports will be added in the future.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Link: https://patch.msgid.link/20251029013900.407583-6-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: enetc: add ptp timer binding support for i.MX94
Clark Wang [Wed, 29 Oct 2025 01:38:58 +0000 (09:38 +0800)] 
net: enetc: add ptp timer binding support for i.MX94

The i.MX94 has three PTP timers, and all standalone ENETCs can select
one of them to bind to as their PHC. The 'ptp-timer' property is used
to represent the PTP device of the Ethernet controller. So users can
add 'ptp-timer' to the ENETC node to specify the PTP timer. The driver
parses this property to bind the two hardware devices.

If the "ptp-timer" property is not present, the first timer of the PCIe
bus where the ENETC is located is used as the default bound PTP timer.

Signed-off-by: Clark Wang <xiaoning.wang@nxp.com>
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20251029013900.407583-5-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: enetc: add preliminary i.MX94 NETC blocks control support
Wei Fang [Wed, 29 Oct 2025 01:38:57 +0000 (09:38 +0800)] 
net: enetc: add preliminary i.MX94 NETC blocks control support

NETC blocks control is used for warm reset and pre-boot initialization.
Different versions of NETC blocks control are not exactly the same. We
need to add corresponding netc_devinfo data for each version. i.MX94
series are launched after i.MX95, so its NETC version (v4.3) is higher
than i.MX95 NETC (v4.1). Currently, the patch adds the following
configurations for ENETCs.

1. Set the link's MII protocol.
2. ENETC 0 (MAC 3) and the switch port 2 (MAC 2) share the same parallel
interface, but due to SoC constraint, they cannot be used simultaneously.
Since the switch is not supported yet, so the interface is assigned to
ENETC 0 by default.

The switch configuration will be added separately in a subsequent patch.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20251029013900.407583-4-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agodt-bindings: net: enetc: add compatible string for ENETC with pseduo MAC
Wei Fang [Wed, 29 Oct 2025 01:38:56 +0000 (09:38 +0800)] 
dt-bindings: net: enetc: add compatible string for ENETC with pseduo MAC

The ENETC with pseudo MAC is used to connect to the CPU port of the NETC
switch. This ENETC has a different PCI device ID, so add a standard PCI
device compatible string to it.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20251029013900.407583-3-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agodt-bindings: net: netc-blk-ctrl: add compatible string for i.MX94 platforms
Wei Fang [Wed, 29 Oct 2025 01:38:55 +0000 (09:38 +0800)] 
dt-bindings: net: netc-blk-ctrl: add compatible string for i.MX94 platforms

Add the compatible string "nxp,imx94-netc-blk-ctrl" for i.MX94 platforms.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20251029013900.407583-2-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>