Oleksij Rempel [Mon, 27 Oct 2025 12:27:58 +0000 (13:27 +0100)]
net: phy: introduce internal API for PHY MSE diagnostics
Add the base infrastructure for Mean Square Error (MSE) diagnostics,
as proposed by the OPEN Alliance "Advanced diagnostic features for
100BASE-T1 automotive Ethernet PHYs" [1] specification.
The OPEN Alliance spec defines only average MSE and average peak MSE
over a fixed number of symbols. However, other PHYs, such as the
KSZ9131, additionally expose a worst-peak MSE value latched since the
last channel capture. This API accounts for such vendor extensions by
adding a distinct capability bit and snapshot field.
Channel-to-pair mapping is normally straightforward, but in some cases
(e.g. 100BASE-TX with MDI-X resolution unknown) the mapping is ambiguous.
If hardware does not expose MDI-X status, the exact pair cannot be
determined. To avoid returning misleading per-channel data in this case,
a LINK selector is defined for aggregate MSE measurements.
All investigated devices differ in MSE capabilities, such
as sample rate, number of analyzed symbols, and scaling factors.
For example, the KSZ9131 uses different scaling for MSE and pMSE.
To make this visible to callers, scale limits and timing information
are returned via get_mse_capability().
Some PHYs sample very few symbols at high frequency (e.g. 2 us update
rate). To cover such cases and allow for future high-speed PHYs with
even shorter intervals, the refresh rate is reported as u64 in
picoseconds.
This patch introduces the internal PHY API for Mean Square Error
diagnostics. It defines new kernel-side data types and driver hooks:
- struct phy_mse_capability: describes supported metrics, scale
limits, update interval, and sampling length.
- struct phy_mse_snapshot: holds one correlated measurement set.
- New phy_driver ops: `get_mse_capability()` and `get_mse_snapshot()`.
These definitions form the core kernel API. No user-visible interfaces
are added in this commit.
Standardization notes:
OPEN Alliance defines presence and interpretation of some metrics but does
not fix numeric scales or sampling internals:
- SQI (3-bit, 0..7) is mandatory; correlation to SNR/BER is informative
(OA 100BASE-T1 TC1 v1.0 6.1.2; OA 1000BASE-T1 TC12 v2.2 6.1.2).
- MSE is optional; OA recommends 2^16 symbols and scaling to 0..511,
with a worst-case latch since last read (OA 100BASE-T1 TC1 v1.0 6.1.1; OA
1000BASE-T1 TC12 v2.2 6.1.1). Refresh is recommended (~0.8-2.0 ms for
100BASE-T1; ~80-200 us for 1000BASE-T1). Exact scaling/time windows
are vendor-specific.
- Peak MSE (pMSE) is defined only for 100BASE-T1 as optional, e.g.
128-symbol sliding window with 8-bit range and worst-case latch (OA
100BASE-T1 TC1 v1.0 6.1.3).
Therefore this API exposes which measures and selectors a PHY supports,
and documents where behavior is standard-referenced vs vendor-specific.
====================
Add support to do threaded napi busy poll
Extend the already existing support of threaded napi poll to do continuous
busy polling.
This is used for doing continuous polling of napi to fetch descriptors
from backing RX/TX queues for low latency applications. Allow enabling
of threaded busypoll using netlink so this can be enabled on a set of
dedicated napis for low latency applications.
Once enabled user can fetch the PID of the kthread doing NAPI polling
and set affinity, priority and scheduler for it depending on the
low-latency requirements.
Extend the netlink interface to allow enabling/disabling threaded
busypolling at individual napi level.
We use this for our AF_XDP based hard low-latency usecase with usecs
level latency requirement. For our usecase we want low jitter and stable
latency at P99.
Following is an analysis and comparison of available (and compatible)
busy poll interfaces for a low latency usecase with stable P99. This can
be suitable for applications that want very low latency at the expense
of cpu usage and efficiency.
Already existing APIs (SO_BUSYPOLL and epoll) allow busy polling a NAPI
backing a socket, but the missing piece is a mechanism to busy poll a
NAPI instance in a dedicated thread while ignoring available events or
packets, regardless of the userspace API. Most existing mechanisms are
designed to work in a pattern where you poll until new packets or events
are received, after which userspace is expected to handle them.
As a result, one has to hack together a solution using a mechanism
intended to receive packets or events, not to simply NAPI poll. NAPI
threaded busy polling, on the other hand, provides this capability
natively, independent of any userspace API. This makes it really easy to
setup and manage.
For analysis we use an AF_XDP based benchmarking tool `xsk_rr`. The
description of the tool and how it tries to simulate the real workload
is following,
- It sends UDP packets between 2 machines.
- The client machine sends packets at a fixed frequency. To maintain the
frequency of the packet being sent, we use open-loop sampling. That is
the packets are sent in a separate thread.
- The server replies to the packet inline by reading the pkt from the
recv ring and replies using the tx ring.
- To simulate the application processing time, we use a configurable
delay in usecs on the client side after a reply is received from the
server.
The xsk_rr tool is posted separately as an RFC for tools/testing/selftest.
We use this tool with following napi polling configurations,
- Interrupts only
- SO_BUSYPOLL (inline in the same thread where the client receives the
packet).
- SO_BUSYPOLL (separate thread and separate core)
- Threaded NAPI busypoll
System is configured using following script in all 4 cases,
```
echo 0 | sudo tee /sys/class/net/eth0/threaded
echo 0 | sudo tee /proc/sys/kernel/timer_migration
echo off | sudo tee /sys/devices/system/cpu/smt/control
echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus
# pin IRQs on CPU 2
IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
print arr[0]}' < /proc/interrupts)"
for irq in "${IRQS}"; \
do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done
echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us
for i in /sys/devices/virtual/workqueue/*/cpumask; \
do echo $i; echo 1,2,3,4,5,6 > $i; done
if [[ -z "$1" ]]; then
echo 400 | sudo tee /proc/sys/net/core/busy_read
echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
fi
sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0
if [[ "$1" == "enable_threaded" ]]; then
echo 0 | sudo tee /proc/sys/net/core/busy_poll
echo 0 | sudo tee /proc/sys/net/core/busy_read
echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
NAPI_ID=$(ynl --family netdev --output-json --do queue-get \
--json '{"ifindex": '${IFINDEX}', "id": '0', "type": "rx"}' | jq '."napi-id"')
# pin threaded poll thread to CPU 2
sudo taskset -pc 2 $NAPI_T
fi
if [[ "$1" == "enable_interrupt" ]]; then
echo 0 | sudo tee /proc/sys/net/core/busy_read
echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
fi
```
To enable various configurations, script can be run as following,
- Interrupt Only
```
<script> enable_interrupt
```
- SO_BUSYPOLL (no arguments to script)
```
<script>
```
- NAPI threaded busypoll
```
<script> enable_threaded
```
Once configured, the workload is run with various configurations using
following commands. Set period (1/frequency) and delay in usecs to
produce results for packet frequency and application processing delay.
- Here without application processing all the approaches give the same
latency within 1usecs range and NAPI threaded gives minimum latency.
- With application processing the latency increases by 3-4usecs when
doing inline polling.
- Using a dedicated core to drive napi polling keeps the latency same
even with application processing. This is observed both in userspace
and threaded napi (in kernel).
- Using napi threaded polling in kernel gives lower latency by
1-2usecs as compared to userspace driven polling in separate core.
- Even on a dedicated core, SO_BUSYPOLL adds around 1-2usecs of latency.
This is because it doesn't continuously busy poll until events are
ready. Instead, it returns after polling only once, requiring the
process to re-invoke the syscall for each poll, which requires a new
enter/leave kernel cycle and the setup/teardown of the busy poll for
every single poll attempt.
- With application processing userspace will get the packet from recv
ring and spend some time doing application processing and then do napi
polling. While application processing is happening a dedicated core
doing napi polling can pull the packet of the NAPI RX queue and
populate the AF_XDP recv ring. This means that when the application
thread is done with application processing it has new packets ready to
recv and process in recv ring.
- Napi threaded busy polling in the kernel with a dedicated core gives
the consistent P5-P99 latency.
Note well that threaded napi busy-polling has not been shown to yield
efficiency or throughput benefits. In contrast, dedicating an entire
core to busy-polling one NAPI (NIC queue) is rather inefficient.
However, in certain specific use cases, this mechanism results in lower
packet processing latency. The experimental testing reported here only
covers those use cases and does not present a comprehensive evaluation
of threaded napi busy-polling.
Following histogram is generated to measure the time spent in recvfrom
while using inline thread with SO_BUSYPOLL. The histogram is generated
using the following bpftrace command. In this experiment there are 32K
packets per second and the application processing delay is 30usecs. This
is to measure whether there is significant time spent pulling packets
from the descriptor queue that it will affect the overall latency if
done inline.
selftests: Add napi threaded busy poll test in `busy_poller`
Add testcase to run busy poll test with threaded napi busy poll enabled.
Signed-off-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Martin Karsten <mkarsten@uwaterloo.ca> Tested-by: Martin Karsten <mkarsten@uwaterloo.ca> Link: https://patch.msgid.link/20251028203007.575686-3-skhawaja@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: Extend NAPI threaded polling to allow kthread based busy polling
Add a new state NAPI_STATE_THREADED_BUSY_POLL to the NAPI state enum to
enable and disable threaded busy polling.
When threaded busy polling is enabled for a NAPI, enable
NAPI_STATE_THREADED also.
When the threaded NAPI is scheduled, set NAPI_STATE_IN_BUSY_POLL to
signal napi_complete_done not to rearm interrupts.
Whenever NAPI_STATE_THREADED_BUSY_POLL is unset, the
NAPI_STATE_IN_BUSY_POLL will be unset, napi_complete_done unsets the
NAPI_STATE_SCHED_THREADED bit also, which in turn will make the kthread
go to sleep.
Signed-off-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Martin Karsten <mkarsten@uwaterloo.ca> Tested-by: Martin Karsten <mkarsten@uwaterloo.ca> Link: https://patch.msgid.link/20251028203007.575686-2-skhawaja@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
mpls: Drop RTNL for RTM_NEWROUTE, RTM_DELROUTE, and RTM_GETROUTE.
RTM_NEWROUTE looks up dev under RCU (ip_route_output(),
ipv6_stub->ipv6_dst_lookup_flow(), netdev_get_by_index()),
and each neighbour holds the refcnt of its dev.
Also, net->mpls.platform_label is protected by a dedicated
per-netns mutex.
Now, no MPLS code depends on RTNL.
Let's drop RTNL for RTM_NEWROUTE, RTM_DELROUTE, and RTM_GETROUTE.
1) to guarantee the lifetime of struct mpls_nh.nh_dev
2) to protect net->mpls.platform_label
, but neither actually requires RTNL.
If we do not call dev_put() in find_outdev() and call it
just before freeing struct mpls_route, we can drop RTNL for 1).
Let's hold the refcnt of mpls_nh.nh_dev and track it with
netdevice_tracker.
Two notable changes:
If mpls_nh_build_multi() fails to set up a neighbour, we need
to call netdev_put() for successfully created neighbours in
mpls_rt_free_rcu(), so the number of neighbours (rt->rt_nhn)
is now updated in each iteration.
When a dev is unregistered, mpls_ifdown() clones mpls_route
and replaces it with the clone, so the clone requires extra
netdev_hold().
net: stmmac: rename devlink parameter ts_coarse into phc_coarse_adj
The devlink param "ts_coarse" doesn't indicate that we get coarse
timestamps, but rather that the PHC clock adjusments are coarse as the
frequency won't be continuously adjusted. Adjust the devlink parameter
name to reflect that.
The Coarse terminlogy comes from the dwmac register naming, update the
documentation to better explain what the parameter is about.
With this change, the parameter can now be adjusted using:
devlink dev param set <dev> name phc_coarse_adj value true cmode runtime
Jianhui Zhao [Sun, 2 Nov 2025 15:26:37 +0000 (16:26 +0100)]
net: phy: realtek: add interrupt support for RTL8221B
This commit introduces interrupt support for RTL8221B (C45 mode).
Interrupts are mapped on the VEND2 page. VEND2 registers are only
accessible via C45 reads and cannot be accessed by C45 over C22.
Signed-off-by: Jianhui Zhao <zhaojh329@gmail.com>
[Enable only link state change interrupts] Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20251102152644.1676482-1-olek2@wp.pl Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alok Tiwari [Fri, 31 Oct 2025 11:26:44 +0000 (04:26 -0700)]
hinic3: fix misleading error message in hinic3_open_channel()
The error message printed when hinic3_configure() fails incorrectly
reports "Failed to init txrxq irq", which does not match the actual
operation performed. The hinic3_configure() function sets up various
device resources such as MTU and RSS parameters , not IRQ initialization.
Update the log to "Failed to configure device resources" to make the
message accurate and clearer for debugging.
Bagas Sanjaya [Tue, 28 Oct 2025 01:44:52 +0000 (08:44 +0700)]
Documentation: ARCnet: Update obsolete contact info
ARCnet docs states that inquiries on the subsystem should be emailed to
Avery Pennarun <apenwarr@worldvisions.ca>, for whom has been in CREDITS
since the beginning of kernel git history and her email address is
unreachable (bounce). The subsystem is now maintained by Michael
Grzeschik since c38f6ac74c9980 ("MAINTAINERS: add arcnet and take
maintainership").
In addition, there used to be a dedicated ARCnet mailing list but its
archive at epistolary.org has been shut down. ARCnet discussion nowadays
take place in netdev list. The arcnet.com domain mentioned has become
AIoT (Artificial Intelligence of Things) related Typeform page and
ARCnet info now resides on arcnet.cc (ARCnet Resource Center) instead.
Update contact information.
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Link: https://patch.msgid.link/20251028014451.10521-2-bagasdotme@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
dpll: Add support for phase adjustment granularity
Phase-adjust values are currently limited only by a min-max range. Some
hardware requires, for certain pin types, that values be multiples of
a specific granularity, as in the zl3073x driver.
Patch 1: Adds 'phase-adjust-gran' pin attribute and an appropriate
handling
Patch 2: Adds a support for this attribute into zl3073x driver
====================
Ivan Vecera [Wed, 29 Oct 2025 15:32:07 +0000 (16:32 +0100)]
dpll: zl3073x: Specify phase adjustment granularity for pins
Output pins phase adjustment values in the device are expressed
in half synth clock cycles. Use this number of cycles as output
pins' phase adjust granularity and simplify both get/set callbacks.
Reviewed-by: Michal Schmidt <mschmidt@redhat.com> Reviewed-by: Petr Oros <poros@redhat.com> Tested-by: Prathosh Satish <Prathosh.Satish@microchip.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Link: https://patch.msgid.link/20251029153207.178448-3-ivecera@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ivan Vecera [Wed, 29 Oct 2025 15:32:06 +0000 (16:32 +0100)]
dpll: add phase-adjust-gran pin attribute
Phase-adjust values are currently limited by a min-max range. Some
hardware requires, for certain pin types, that values be multiples of
a specific granularity, as in the zl3073x driver.
Add a `phase-adjust-gran` pin attribute and an appropriate field in
dpll_pin_properties. If set by the driver, use its value to validate
user-provided phase-adjust values.
Reviewed-by: Michal Schmidt <mschmidt@redhat.com> Reviewed-by: Petr Oros <poros@redhat.com> Tested-by: Prathosh Satish <Prathosh.Satish@microchip.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Link: https://patch.msgid.link/20251029153207.178448-2-ivecera@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Thomas Wismer [Wed, 29 Oct 2025 21:23:10 +0000 (22:23 +0100)]
dt-bindings: pse-pd: ti,tps23881: Add TPS23881B
Add the TPS23881B I2C power sourcing equipment controller to the list of
supported devices.
Falling back to the TPS23881 predecessor device is not suitable as firmware
loading needs to handled differently by the driver. The TPS23881 and
TPS23881B devices require different firmware. Trying to load the TPS23881
firmware on a TPS23881B device fails and must therefore be omitted.
Signed-off-by: Thomas Wismer <thomas.wismer@scs.ch> Acked-by: Conor Dooley <conor.dooley@microchip.com> Reviewed-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20251029212312.108749-3-thomas@wismer.xyz Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Thomas Wismer [Wed, 29 Oct 2025 21:23:09 +0000 (22:23 +0100)]
net: pse-pd: tps23881: Add support for TPS23881B
The TPS23881B uses different firmware than the TPS23881. Trying to load the
TPS23881 firmware on a TPS23881B device fails and must be omitted.
The TPS23881B ships with a more recent ROM firmware. Moreover, no updated
firmware has been released yet and so the firmware loading step must be
skipped. As of today, the TPS23881B is intended to use its ROM firmware.
Signed-off-by: Thomas Wismer <thomas.wismer@scs.ch> Reviewed-by: Kory Maincent <kory.maincent@bootlin.com> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20251029212312.108749-2-thomas@wismer.xyz Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Bagas Sanjaya [Thu, 30 Oct 2025 07:50:13 +0000 (14:50 +0700)]
Documentation: netconsole: Separate literal code blocks for full and short netcat command name versions
Both full and short (abbreviated) command name versions of netcat
example are combined in single literal code block due to 'or::'
paragraph being indented one more space than the preceding paragraph
(before the short version example).
Unindent it to separate the versions.
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Link: https://patch.msgid.link/20251030075013.40418-1-bagasdotme@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
net: phy: microchip_t1s: Add support for LAN867x Rev.D0 PHY
This patch series adds support for the latest Microchip LAN8670/1/2 Rev.D0
10BASE-T1S PHYs to the microchip_t1s driver.
The new Rev.D0 silicon introduces updated initialization requirements and
link status handling behavior compared to earlier revisions (Rev.C2 and
below). These updates are necessary for full compliance with the OPEN
Alliance 10BASE-T1S specification and are documented in Microchip
Application Note AN1699 Revision G (DS60001699G – October 2025).
Summary of changes:
- Implements Rev.D0-specific configuration sequence as described in AN1699
Rev.G.
- Introduces link status control configuration for LAN867x Rev.D0.
====================
net: phy: microchip_t1s: configure link status control for LAN867x Rev.D0
Configure the link status in the Link Status Control register for
LAN8670/1/2 Rev.D0 PHYs, depending on whether PLCA or CSMA/CD mode
is enabled. When PLCA is enabled, the link status reflects the PLCA
status. When PLCA is disabled (CSMA/CD mode), the PHY does not support
autonegotiation, so the link status is forced active by setting
the LINK_STATUS_SEMAPHORE bit.
The link status control is configured:
- During PHY initialization, for default CSMA/CD mode.
- Whenever PLCA configuration is updated.
This ensures correct link reporting and consistent behavior for
LAN867x Rev.D0 devices.
net: phy: microchip_t1s: add support for Microchip LAN867X Rev.D0 PHY
Add support for the LAN8670/1/2 Rev.D0 10BASE-T1S PHYs from Microchip.
The new Rev.D0 silicon requires a specific set of initialization
settings to be configured for optimal performance and compliance with
OPEN Alliance specifications, as described in Microchip Application Note
AN1699 (Revision G, DS60001699G – October 2025).
https://www.microchip.com/en-us/application-notes/an1699
When operating in "SGMII" mode (Cisco SGMII or 2500BASE-X), qcom-ethqos
modifies the MAC control register in its ethqos_configure_sgmii()
function, which is only called from one path:
stmmac_mac_link_up()
+- reads MAC_CTRL_REG
+- masks out priv->hw->link.speed_mask
+- sets bits according to speed (2500, 1000, 100, 10) from priv->hw.link.speed*
+- ethqos_fix_mac_speed()
| +- qcom_ethqos_set_sgmii_loopback(false)
| +- ethqos_update_link_clk(speed)
| `- ethqos_configure(speed)
| `- ethqos_configure_sgmii(speed)
| +- reads MAC_CTRL_REG,
| +- configures PS/FES bits according to speed
| `- writes MAC_CTRL_REG as the last operation
+- sets duplex bit(s)
+- stmmac_mac_flow_ctrl()
+- writes MAC_CTRL_REG if changed from original read
...
As can be seen, the modification of the control register that
stmmac_mac_link_up() overwrites the changes that ethqos_fix_mac_speed()
does to the register. This makes ethqos_configure_sgmii()'s
modification questionable at best.
Thus, they appear to be doing very similar, with the exception of the
FES bit (bit 14) for 1G and 2.5G speeds.
Given that stmmac_mac_link_up() will write the MAC_CTRL_REG after
ethqos_configure_sgmii(), remove the unnecessary update in the
glue driver's ethqos_configure_sgmii() method, simplifying the code.
Konrad states:
Without any additional knowledge, the register description says:
Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com> Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1vEPlg-0000000CFHY-282A@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
Convert mlx5e and IPoIB to ndo_hwtstamp_get/set
This series by Carolina migrates mlx5e and IPoIB to the
ndo_hwtstamp_get/set interface and removes legacy hardware timestamp
ioctl handling. While doing so, it also cleans up naming and removes
redundant code.
No functional change in timestamp behavior.
Cleanup patches:
- net/mlx5e: Remove redundant tstamp pointer from channel structures
- net/mlx5e: Remove unnecessary tstamp local variable in mlx5i_complete_rx_cqe
- net/mlx5e: Rename hwstamp functions to hwtstamp
- net/mlx5e: Rename timestamp fields to hwtstamp_config
Add suppport in ipoib:
- IB/IPoIB: Add support for hwtstamp get/set ndos
Convert mlx5:
- net/mlx5e: Convert to new hwtstamp_get/set interface
====================
Carolina Jubran [Thu, 30 Oct 2025 10:25:09 +0000 (12:25 +0200)]
IB/IPoIB: Add support for hwtstamp get/set ndos
Add support for the ndo_hwtstamp_get and ndo_hwtstamp_set operations in
IPoIB. This allows lower devices to handle hardware timestamp
configuration through the new ndos instead of the legacy ioctls.
Carolina Jubran [Thu, 30 Oct 2025 10:25:08 +0000 (12:25 +0200)]
net/mlx5e: Rename timestamp fields to hwtstamp_config
Rename hardware timestamp-related fields from 'tstamp' to
'hwtstamp_config' throughout the MLX5 driver. The new name is more
descriptive as it clearly indicates these fields contain hardware
timestamp configuration.
Carolina Jubran [Thu, 30 Oct 2025 10:25:07 +0000 (12:25 +0200)]
net/mlx5e: Rename hwstamp functions to hwtstamp
Rename mlx5e_hwstamp_set/get() functions to mlx5e_hwtstamp_set/get()
to better reflect that these functions handle hardware timestamping,
not just hardware stamping.
Carolina Jubran [Thu, 30 Oct 2025 10:25:06 +0000 (12:25 +0200)]
net/mlx5e: Remove unnecessary tstamp local variable in mlx5i_complete_rx_cqe
Remove the tstamp local variable in mlx5i_complete_rx_cqe() and directly
pass the tstamp field from priv to mlx5e_rx_hw_stamp(). The local variable
was only used once and provided no additional value.
Carolina Jubran [Thu, 30 Oct 2025 10:25:05 +0000 (12:25 +0200)]
net/mlx5e: Remove redundant tstamp pointer from channel structures
Remove the tstamp pointer field from mlx5e_channel, mlx5e_ptp, and
mlx5e_trap structures, since it was only used to reference the tstamp
field in the priv structure. Instead, directly use the tstamp field
from priv when initializing RQ structures.
Also remove the unused hwtstamp_config field from mlx5_clock structure
as part of the cleanup.
Haiyang Zhang [Wed, 29 Oct 2025 20:43:10 +0000 (13:43 -0700)]
net: mana: Support HW link state events
Handle the NIC hardware link state events received from the HW
channel, then set the proper link state accordingly.
And, add a feature bit, GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE,
to inform the NIC hardware this handler exists.
Our MANA NIC only sends out the link state down/up messages
when we need to let the VM rerun DHCP client and change IP
address. So, add netif_carrier_on() in the probe(), let the NIC
show the right initial state in /sys/class/net/ethX/operstate.
This patch adds support for matching fragmented packets in tc flower
filters.
Previously, commit 93a8540aac72 ("cxgb4: flower: validate control flags")
added a check using flow_rule_match_has_control_flags() to reject
any rules with control flags, as the driver did not support
fragmentation at that time.
Now, with this patch, support for FLOW_DIS_IS_FRAGMENT is added:
- The driver checks for control flags using
flow_rule_is_supp_control_flags(), as recommended in
commit d11e63119432 ("flow_offload: add control flag checking helpers").
- If the fragmentation flag is present, the driver sets `fs->val.frag` and
`fs->mask.frag` accordingly in the filter specification.
Since fragmentation is now supported, the earlier check that rejected all
control flags (flow_rule_match_has_control_flags()) has been removed.
Jakub Kicinski [Fri, 31 Oct 2025 00:57:07 +0000 (17:57 -0700)]
Merge tag 'nf-next-25-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
netfilter: updates for net-next
1) Convert nf_tables 'nft_set_iter' usage to use C99 struct
initialization, from Fernando Fernandez Mancera.
2) Disallow nf_conntrack_max=0. This was an (undocumented)
historic inheritance from ip_conntrack (ipv4 only nf_conntrack
predecessor). Doing so will simplify future changes to make
this pernet-tuneable.
3) Fix a typo in conntrack.h comment, from Weibiao Tu.
* tag 'nf-next-25-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: fix typo in nf_conntrack_l4proto.h comment
netfilter: conntrack: disable 0 value for conntrack_max setting
netfilter: nf_tables: use C99 struct initializer for nft_set_iter
====================
* tag 'wireless-next-2025-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next:
wifi: mac80211: Allow HT Action frame processing on 6 GHz when HE is supported
wifi: rt2x00: add nvmem eeprom support
wifi: mac80211: add RX flag to report radiotap VHT information
net: wireless: Remove redundant pm_runtime_mark_last_busy() calls
wifi: cfg80211: Add parameters to radio-specific debugfs directories
wifi: cfg80211: Add debugfs support for multi-radio wiphy
wifi: mac80211: fix missing RX bitrate update for mesh forwarding path
wifi: cfg80211: default S1G chandef width to 1MHz
wifi: mac80211: get probe response chan via ieee80211_get_channel_khz
wifi: mac80211: reset CRC valid after CSA
wifi: mac80211_hwsim: advertise puncturing feature support
wifi: cfg80211/mac80211: validate radio frequency range for monitor mode
wifi: rt2x00: check retval for of_get_mac_address
====================
Jakub Kicinski [Wed, 29 Oct 2025 16:49:30 +0000 (09:49 -0700)]
selftests: drv-net: replace the nsim ring test with a drv-net one
We are trying to move away from netdevsim-only tests and towards
tests which can be run both against netdevsim and real drivers.
Replace the simple bash script we have for checking ethtool -g/-G
on netdevsim with a Python test tweaking those params as well
as channel count.
The new test is not exactly equivalent to the netdevsim one,
but real drivers don't often support random ring sizes,
let alone modifying max values via debugfs.
For ice:
Michal converts driver to utilize Page Pool and libeth APIs. Conversion
is based on similar changes done for iavf in order to simplify buffer
management, improve maintainability, and increase code reuse across
Intel Ethernet drivers.
Alexander adds support for header split, configurable via ethtool.
Grzegorz allows for use of 100Mbps on E825C SGMII devices.
For i40e:
Jay Vosburgh avoids sending link state changes to VF if it is already in
the requested state.
For idpf:
Sreedevi removes duplicated defines.
For ixgbe:
Alok Tiwari fixes some typos.
For igbvf:
Alok Tiwari fixes output of VLAN warning message.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
igbvf: fix misplaced newline in VLAN add warning message
ixgbe: fix typos in ixgbe driver comments
idpf: remove duplicate defines in IDPF_CAP_RSS
i40e: avoid redundant VF link state updates
ice: Allow 100M speed for E825C SGMII device
ice: implement configurable header split for regular Rx
ice: switch to Page Pool
ice: drop page splitting and recycling
ice: remove legacy Rx and construct SKB
====================
====================
net/smc: make wr buffer count configurable
The current value of SMC_WR_BUF_CNT is 16 which leads to heavy
contention on the wr_tx_wait workqueue of the SMC-R linkgroup and its
spinlock when many connections are competing for the work request
buffers. Currently up to 256 connections per linkgroup are supported.
To make things worse when finally a buffer becomes available and
smc_wr_tx_put_slot() signals the linkgroup's wr_tx_wait wq, because
WQ_FLAG_EXCLUSIVE is not used all the waiters get woken up, most of the
time a single one can proceed, and the rest is contending on the
spinlock of the wq to go to sleep again.
Addressing this by simply bumping SMC_WR_BUF_CNT to 256 was deemed
risky, because the large-ish physically continuous allocation could fail
and lead to TCP fall-backs. For reference see this discussion thread on
"[PATCH net-next] net/smc: increase SMC_WR_BUF_CNT" (in archive
https://lists.openwall.net/netdev/2024/11/05/186), which concludes with
the agreement to try to come up with something smarter, which is what
this series aims for.
Additionally if for some reason it is known that heavy contention is not
to be expected going with something like 256 work request buffers is
wasteful. To address these concerns make the number of work requests
configurable, and introduce a back-off logic with handles -ENOMEM form
smc_wr_alloc_link_mem() gracefully.
Halil Pasic [Mon, 27 Oct 2025 22:48:56 +0000 (23:48 +0100)]
net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully
Currently if a -ENOMEM from smc_wr_alloc_link_mem() is handled by
giving up and going the way of a TCP fallback. This was reasonable
before the sizes of the allocations there were compile time constants
and reasonably small. But now those are actually configurable.
So instead of giving up, keep retrying with half of the requested size
unless we dip below the old static sizes -- then give up! In terms of
numbers that means we give up when it is certain that we at best would
end up allocating less than 16 send WR buffers or less than 48 recv WR
buffers. This is to avoid regressions due to having fewer buffers
compared the static values of the past.
Please note that SMC-R is supposed to be an optimisation over TCP, and
falling back to TCP is superior to establishing an SMC connection that
is going to perform worse. If the memory allocation fails (and we
propagate -ENOMEM), we fall back to TCP.
Preserve (modulo truncation) the ratio of send/recv WR buffer counts.
Signed-off-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Tested-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Link: https://patch.msgid.link/20251027224856.2970019-3-pasic@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Halil Pasic [Mon, 27 Oct 2025 22:48:55 +0000 (23:48 +0100)]
net/smc: make wr buffer count configurable
Think SMC_WR_BUF_CNT_SEND := SMC_WR_BUF_CNT used in send context and
SMC_WR_BUF_CNT_RECV := 3 * SMC_WR_BUF_CNT used in recv context. Those
get replaced with lgr->max_send_wr and lgr->max_recv_wr respective.
Please note that although with the default sysctl values
qp_attr.cap.max_send_wr == qp_attr.cap.max_recv_wr is maintained but
can not be assumed to be generally true any more. I see no downside to
that, but my confidence level is rather modest.
Signed-off-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Tested-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Link: https://patch.msgid.link/20251027224856.2970019-2-pasic@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
netfilter: conntrack: disable 0 value for conntrack_max setting
Undocumented historical artifact inherited from ip_conntrack.
If value is 0, then no limit is applied at all, conntrack table
can grow to huge value, only limited by size of conntrack hashes and
the kernel-internal upper limit on the hash chain lengths.
This feature makes no sense; users can just set
conntrack_max=2147483647 (INT_MAX).
Disallow a 0 value. This will make it slightly easier to allow
per-netns constraints for this value in a future patch.
netfilter: nf_tables: use C99 struct initializer for nft_set_iter
Use C99 struct initializer for nft_set_iter, simplifying the code and
preventing future errors due to uninitialized fields if new fields are
added to the struct.
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Florian Westphal <fw@strlen.de>
Ranganath V N [Sun, 26 Oct 2025 16:33:12 +0000 (22:03 +0530)]
net: sctp: fix KMSAN uninit-value in sctp_inq_pop
Fix an issue detected by syzbot:
KMSAN reported an uninitialized-value access in sctp_inq_pop
BUG: KMSAN: uninit-value in sctp_inq_pop
The issue is actually caused by skb trimming via sk_filter() in sctp_rcv().
In the reproducer, skb->len becomes 1 after sk_filter(), which bypassed the
original check:
if (skb->len < sizeof(struct sctphdr) + sizeof(struct sctp_chunkhdr) +
skb_transport_offset(skb))
To handle this safely, a new check should be performed after sk_filter().
Reported-by: syzbot+d101e12bccd4095460e7@syzkaller.appspotmail.com Tested-by: syzbot+d101e12bccd4095460e7@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d101e12bccd4095460e7 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Suggested-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Ranganath V N <vnranganath.20@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20251026-kmsan_fix-v3-1-2634a409fa5f@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Thu, 30 Oct 2025 09:44:12 +0000 (10:44 +0100)]
Merge branch 'add-cn20k-nix-and-npa-contexts'
Subbaraya Sundeep says:
====================
Add CN20K NIX and NPA contexts
The hardware contexts of blocks NIX and NPA in CN20K silicon are
different than that of previous silicons CN10K and CN9XK. This
patchset adds the new contexts of CN20K in AF and PF drivers.
A new mailbox for enqueuing contexts to hardware is added.
Patch 1 simplifies context writing and reading by using max context
size supported by hardware instead of using each context size.
Patch 2 and 3 adds NIX block contexts in AF driver and extends
debugfs to display those new contexts
Patch 4 and 5 adds NPA block contexts in AF driver and extends
debugfs to display those new contexts
Patch 6 omits NDC configuration since CN20K NPA does not use NDC
for caching its contexts
Patch 7 and 8 uses the new NIX and NPA contexts in PF/VF driver.
Patch 9, 10 and 11 are to support more bandwidth profiles present in
CN20K for RX ratelimiting and to display new profiles in debugfs
octeontx2-pf: Use new bandwidth profiles in receive queue
Receive queue points to a bandwidth profile for rate limiting.
Since cn20k has additional bandwidth profiles use them
too while mapping receive queue to bandwidth profile.
octeontx2-af: Accommodate more bandwidth profiles for cn20k
CN20K has 16k of leaf profiles, 2k of middle profiles and
256 of top profiles. This patch modifies existing receive
queue and bandwidth profile context structures to accommodate
additional profiles of cn20k.
Linu Cherian [Sat, 25 Oct 2025 10:32:43 +0000 (16:02 +0530)]
octeontx2-pf: Initialize cn20k specific aura and pool contexts
With new CN20K NPA pool and aura contexts supported in AF
driver this patch modifies PF driver to use new NPA contexts.
Implement new hw_ops for intializing aura and pool contexts
for all the silicons.
Linu Cherian [Sat, 25 Oct 2025 10:32:42 +0000 (16:02 +0530)]
octeontx2-af: Skip NDC operations for cn20k
For cn20k, NPA block doesn't use the general purpose
NDC (Near Coprocessor Bus Data cache Unit) for caching,
hence skip the NDC related operations.
Also refactor NDC configuration code to a helper function.
Linu Cherian [Sat, 25 Oct 2025 10:32:40 +0000 (16:02 +0530)]
octeontx2-af: Add cn20k NPA block contexts
New CN20K silicon has NPA hardware context structures different from
previous silicons. Add NPA aura and pool context definitions for cn20k.
Extend NPA context handling support to cn20k.
New CN20K silicon has NIX hardware context structures different from
previous silicons. Add NIX send and completion queue context
definitions for cn20k. Extend NIX context handling support to cn20k.
Thomas Wu [Tue, 28 Oct 2025 04:34:42 +0000 (10:04 +0530)]
wifi: mac80211: Allow HT Action frame processing on 6 GHz when HE is supported
Management frames on 6 GHz do not include HT Capabilities, causing HT
Action frames to be dropped in ieee80211_rx_h_action(). The current logic
checks only ht_cap.ht_supported, which fails for 6 GHz radios that support
only HE and EHT.
Update the condition to also allow HT Action frame processing when
he_cap.has_he is true. This enables support for HE dynamic SM power save
as defined in IEEE Std 802.11ax-2021, section 26.14.4.
Benjamin Berg [Mon, 27 Oct 2025 12:22:14 +0000 (14:22 +0200)]
wifi: mac80211: add RX flag to report radiotap VHT information
mac80211 already reports some basic information in the radiotap header
with the known fields declared by the driver. However, drivers may want
to report more accurate information and in that case the full VHT
radiotap structure needs to be provided.
Add a new RX_FLAG_RADIOTAP_VHT which is set when the VHT information
should be pulled from the skb. Update the code to fill in the VHT fields
to only do so when requested by the driver or if the information has not
yet been set. This way the driver can fully control the information if
it chooses so.
pm_runtime_put_autosuspend(), pm_runtime_put_sync_autosuspend(),
pm_runtime_autosuspend() and pm_request_autosuspend() now include a call
to pm_runtime_mark_last_busy(). Remove the now-reduntant explicit call to
pm_runtime_mark_last_busy().
Shivaji Kant [Wed, 29 Oct 2025 06:54:19 +0000 (06:54 +0000)]
net: devmem: refresh devmem TX dst in case of route invalidation
The zero-copy Device Memory (Devmem) transmit path
relies on the socket's route cache (`dst_entry`) to
validate that the packet is being sent via the network
device to which the DMA buffer was bound.
However, this check incorrectly fails and returns `-ENODEV`
if the socket's route cache entry (`dst`) is merely missing
or expired (`dst == NULL`). This scenario is observed during
network events, such as when flow steering rules are deleted,
leading to a temporary route cache invalidation.
This patch fixes -ENODEV error for `net_devmem_get_binding()`
by doing the following:
1. It attempts to rebuild the route via `rebuild_header()`
if the route is initially missing (`dst == NULL`). This
allows the TCP/IP stack to recover from transient route
cache misses.
2. It uses `rcu_read_lock()` and `dst_dev_rcu()` to safely
access the network device pointer (`dst_dev`) from the
route, preventing use-after-free conditions if the
device is concurrently removed.
3. It maintains the critical safety check by validating
that the retrieved destination device (`dst_dev`) is
exactly the device registered in the Devmem binding
(`binding->dev`).
These changes prevent unnecessary ENODEV failures while
maintaining the critical safety requirement that the
Devmem resources are only used on the bound network device.
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reported-by: Eric Dumazet <edumazet@google.com> Reported-by: Vedant Mathur <vedantmathur@google.com> Suggested-by: Eric Dumazet <edumazet@google.com> Fixes: bd61848900bf ("net: devmem: Implement TX path") Signed-off-by: Shivaji Kant <shivajikant@google.com> Link: https://patch.msgid.link/20251029065420.3489943-1-shivajikant@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
max_addr is the max number of addresses, not the highest possible address,
therefore check phydev->mdio.addr > max_addr isn't correct.
To fix this change the semantics of max_addr, so that it represents
the highest possible address. IMO this is also a little bit more intuitive
wrt name max_addr.
Fixes: 4a107a0e8361 ("net: stmmac: mdio: use phy_find_first to simplify stmmac_mdio_register") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Reported-by: Simon Horman <horms@kernel.org> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/e869999b-2d4b-4dc1-9890-c2d3d1e8d0f8@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
pm_runtime_put_autosuspend(), pm_runtime_put_sync_autosuspend(),
pm_runtime_autosuspend() and pm_request_autosuspend() now include a call
to pm_runtime_mark_last_busy(). Remove the now-reduntant explicit call to
pm_runtime_mark_last_busy().
pm_runtime_put_autosuspend(), pm_runtime_put_sync_autosuspend(),
pm_runtime_autosuspend() and pm_request_autosuspend() now include a call
to pm_runtime_mark_last_busy(). Remove the now-reduntant explicit call to
pm_runtime_mark_last_busy().
pm_runtime_put_autosuspend(), pm_runtime_put_sync_autosuspend(),
pm_runtime_autosuspend() and pm_request_autosuspend() now include a call
to pm_runtime_mark_last_busy(). Remove the now-reduntant explicit call to
pm_runtime_mark_last_busy().
====================
net: stmmac: Fixes for stmmac Tx VLAN insert and EST
This patchset includes following fixes for stmmac Tx VLAN insert and
EST implementations:
1. Disable STAG insertion offloading, as DWMAC IPs doesn't support
offload of STAG for double VLAN packets and CTAG for single VLAN
packets when using the same register configuration. The current
configuration in the driver is undocumented and is adding an
additional 802.1Q tag with VLAN ID 0 for double VLAN packets.
2. Consider Tx VLAN offload tag length for maxSDU estimation.
3. Fix GCL bounds check
====================
Rohan G Thomas [Tue, 28 Oct 2025 03:18:45 +0000 (11:18 +0800)]
net: stmmac: est: Fix GCL bounds checks
Fix the bounds checks for the hw supported maximum GCL entry
count and gate interval time.
Fixes: b60189e0392f ("net: stmmac: Integrate EST with TAPRIO scheduler API") Signed-off-by: Rohan G Thomas <rohan.g.thomas@altera.com> Reviewed-by: Matthew Gerlach <matthew.gerlach@altera.com> Link: https://patch.msgid.link/20251028-qbv-fixes-v4-3-26481c7634e3@altera.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rohan G Thomas [Tue, 28 Oct 2025 03:18:44 +0000 (11:18 +0800)]
net: stmmac: Consider Tx VLAN offload tag length for maxSDU
Queue maxSDU requirement of 802.1 Qbv standard requires mac to drop
packets that exceeds maxSDU length and maxSDU doesn't include
preamble, destination and source address, or FCS but includes
ethernet type and VLAN header.
On hardware with Tx VLAN offload enabled, VLAN header length is not
included in the skb->len, when Tx VLAN offload is requested. This
leads to incorrect length checks and allows transmission of
oversized packets. Add the VLAN_HLEN to the skb->len before checking
the Qbv maxSDU if Tx VLAN offload is requested for the packet.
Fixes: c5c3e1bfc9e0 ("net: stmmac: Offload queueMaxSDU from tc-taprio") Signed-off-by: Rohan G Thomas <rohan.g.thomas@altera.com> Reviewed-by: Matthew Gerlach <matthew.gerlach@altera.com> Link: https://patch.msgid.link/20251028-qbv-fixes-v4-2-26481c7634e3@altera.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rohan G Thomas [Tue, 28 Oct 2025 03:18:43 +0000 (11:18 +0800)]
net: stmmac: vlan: Disable 802.1AD tag insertion offload
The DWMAC IP's VLAN tag insertion offload does not support inserting
STAG (802.1AD) and CTAG (802.1Q) types in bytes 13 and 14 using the
same MAC_VLAN_Incl and MAC_VLAN_Inner_Incl register configurations.
Currently, MAC_VLAN_Incl is configured to offload only STAG type
insertion. However, the DWMAC IP inserts a CTAG type when the inner
VLAN ID field of the descriptor is not configured, and a STAG type
when it is configured. This behavior is not documented and leads to
inconsistent double VLAN tagging.
Additionally, an unexpected CTAG with VLAN ID 0 is inserted, resulting
in frames like:
Frame 1: 110 bytes on wire (880 bits), 110 bytes captured (880 bits)
Ethernet II, Src: <src> (<src>), Dst: <dst> (<dst>)
IEEE 802.1ad, ID: 100
802.1Q Virtual LAN, PRI: 0, DEI: 0, ID: 0 (unexpected)
802.1Q Virtual LAN, PRI: 0, DEI: 0, ID: 200
Internet Protocol Version 4, Src: 192.168.4.10, Dst: 192.168.4.11
Internet Control Message Protocol
To avoid this undocumented and incorrect behavior, disable 802.1AD tag
insertion offload. Also, don't set CSVL bit. As per the data book,
when this bit is set, S-VLAN type (0x88A8) is inserted in the 13th and
14th bytes of transmitted packets and when this bit is reset, C-VLAN
type (0x8100) is inserted in the 13th and 14th bytes of transmitted
packets.
Fixes: 30d932279dc2 ("net: stmmac: Add support for VLAN Insertion Offload") Fixes: e94e3f3b51ce ("net: stmmac: Add support for VLAN Insertion Offload in GMAC4+") Fixes: 1d2c7a5fee31 ("net: stmmac: Refactor VLAN implementation") Signed-off-by: Rohan G Thomas <rohan.g.thomas@altera.com> Reviewed-by: Boon Khai Ng <boon.khai.ng@altera.com> Link: https://patch.msgid.link/20251028-qbv-fixes-v4-1-26481c7634e3@altera.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 30 Oct 2025 01:44:21 +0000 (18:44 -0700)]
Merge branch 'net-enetc-add-i-mx94-enetc-support'
Wei Fang says:
====================
net: enetc: Add i.MX94 ENETC support
i.MX94 NETC has two kinds of ENETCs, one is the same as i.MX95, which
can be used as a standalone network port. The other one is an internal
ENETC, it connects to the CPU port of NETC switch through the pseudo
MAC. Also, i.MX94 have multiple PTP Timers, which is different from
i.MX95. Any PTP Timer can be bound to a specified standalone ENETC by
the IERB ETBCR registers. Currently, this patch only add ENETC support
and Timer support for i.MX94. The switch will be added by a separate
patch set.
In addition, note that i.MX94 SoC is launched after i.MX95, its NETC
has a higher version, so the driver support is added after i.MX95.
====================
Wei Fang [Wed, 29 Oct 2025 01:38:59 +0000 (09:38 +0800)]
net: enetc: add basic support for the ENETC with pseudo MAC for i.MX94
The ENETC with pseudo MAC is an internal port which connects to the CPU
port of the switch. The switch CPU/host ENETC is fully integrated with
the switch and does not require a back-to-back MAC, instead a light
weight "pseudo MAC" provides the delineation between switch and ENETC.
This translates to lower power (less logic and memory) and lower delay
(as there is no serialization delay across this link).
Different from the standalone ENETC which is used as the external port,
the internal ENETC has a different PCIe device ID, and it does not have
Ethernet MAC port registers, instead, it has a small number of pseudo
MAC port registers, so some features are not supported by pseudo MAC,
such as loopback, half duplex, one-step timestamping and so on.
Therefore, the configuration of this internal ENETC is also somewhat
different from that of the standalone ENETC. So add the basic support
for ENETC with pseudo MAC. More supports will be added in the future.
Clark Wang [Wed, 29 Oct 2025 01:38:58 +0000 (09:38 +0800)]
net: enetc: add ptp timer binding support for i.MX94
The i.MX94 has three PTP timers, and all standalone ENETCs can select
one of them to bind to as their PHC. The 'ptp-timer' property is used
to represent the PTP device of the Ethernet controller. So users can
add 'ptp-timer' to the ENETC node to specify the PTP timer. The driver
parses this property to bind the two hardware devices.
If the "ptp-timer" property is not present, the first timer of the PCIe
bus where the ENETC is located is used as the default bound PTP timer.
Wei Fang [Wed, 29 Oct 2025 01:38:57 +0000 (09:38 +0800)]
net: enetc: add preliminary i.MX94 NETC blocks control support
NETC blocks control is used for warm reset and pre-boot initialization.
Different versions of NETC blocks control are not exactly the same. We
need to add corresponding netc_devinfo data for each version. i.MX94
series are launched after i.MX95, so its NETC version (v4.3) is higher
than i.MX95 NETC (v4.1). Currently, the patch adds the following
configurations for ENETCs.
1. Set the link's MII protocol.
2. ENETC 0 (MAC 3) and the switch port 2 (MAC 2) share the same parallel
interface, but due to SoC constraint, they cannot be used simultaneously.
Since the switch is not supported yet, so the interface is assigned to
ENETC 0 by default.
The switch configuration will be added separately in a subsequent patch.
Wei Fang [Wed, 29 Oct 2025 01:38:56 +0000 (09:38 +0800)]
dt-bindings: net: enetc: add compatible string for ENETC with pseduo MAC
The ENETC with pseudo MAC is used to connect to the CPU port of the NETC
switch. This ENETC has a different PCI device ID, so add a standard PCI
device compatible string to it.
====================
tls: Introduce and use RX async resync request cancel function
This series by Shahar introduces RX async resync request cancel function
in tls module, and uses it in mlx5e driver.
For a device-offloaded TLS RX connection, the TLS module increments
rcd_delta each time a new TLS record is received, tracking the distance
from the original resync request. In the meanwhile, the device is
queried and is expected to respond, asynchronously.
However, if the device response is delayed or fails (e.g due to unstable
connection and device getting out of tracking, hardware errors, resource
exhaustion etc.), the TLS module keeps logging and incrementing
rcd_delta, which can lead to a WARN() when rcd_delta exceeds the
threshold.
This series improves this code area by canceling the resync request when
spotting an issue with the device response.
====================
Shahar Shitrit [Sun, 26 Oct 2025 20:03:03 +0000 (22:03 +0200)]
net/mlx5e: kTLS, Cancel RX async resync request in error flows
When device loses track of TLS records, it attempts to resync by
monitoring records and requests an asynchronous resynchronization
from software for this TLS connection.
The TLS module handles such device RX resync requests by logging record
headers and comparing them with the record tcp_sn when provided by the
device. It also increments rcd_delta to track how far the current
record tcp_sn is from the tcp_sn of the original resync request.
If the device later responds with a matching tcp_sn, the TLS module
approves the tcp_sn for resync.
However, the device response may be delayed or never arrive,
particularly due to traffic-related issues such as packet drops or
reordering. In such cases, the TLS module remains unaware that resync
will not complete, and continues performing unnecessary work by logging
headers and incrementing rcd_delta, which can eventually exceed the
threshold and trigger a WARN(). For example, this was observed when the
device got out of tracking, causing
mlx5e_ktls_handle_get_psv_completion() to fail and ultimately leading
to the rcd_delta warning.
To address this, call tls_offload_rx_resync_async_request_cancel()
to cancel the resync request and stop resync tracking in such error
cases. Also, increment the tls_resync_req_skip counter to track these
cancellations.
Shahar Shitrit [Sun, 26 Oct 2025 20:03:02 +0000 (22:03 +0200)]
net: tls: Cancel RX async resync request on rcd_delta overflow
When a netdev issues a RX async resync request for a TLS connection,
the TLS module handles it by logging record headers and attempting to
match them to the tcp_sn provided by the device. If a match is found,
the TLS module approves the tcp_sn for resynchronization.
While waiting for a device response, the TLS module also increments
rcd_delta each time a new TLS record is received, tracking the distance
from the original resync request.
However, if the device response is delayed or fails (e.g due to
unstable connection and device getting out of tracking, hardware
errors, resource exhaustion etc.), the TLS module keeps logging and
incrementing, which can lead to a WARN() when rcd_delta exceeds the
threshold.
To address this, introduce tls_offload_rx_resync_async_request_cancel()
to explicitly cancel resync requests when a device response failure is
detected. Call this helper also as a final safeguard when rcd_delta
crosses its threshold, as reaching this point implies that earlier
cancellation did not occur.
Shahar Shitrit [Sun, 26 Oct 2025 20:03:01 +0000 (22:03 +0200)]
net: tls: Change async resync helpers argument
Update tls_offload_rx_resync_async_request_start() and
tls_offload_rx_resync_async_request_end() to get a struct
tls_offload_resync_async parameter directly, rather than
extracting it from struct sock.
This change aligns the function signatures with the upcoming
tls_offload_rx_resync_async_request_cancel() helper, which
will be introduced in a subsequent patch.
Jakub Kicinski [Thu, 30 Oct 2025 01:28:32 +0000 (18:28 -0700)]
Merge branch 'icmp-add-rfc-5837-support'
Ido Schimmel says:
====================
icmp: Add RFC 5837 support
tl;dr
=====
This patchset extends certain ICMP error messages (e.g., "Time
Exceeded") with incoming interface information in accordance with RFC
5837 [1]. This is required for more meaningful traceroute results in
unnumbered networks. Like other ICMP settings, the feature is controlled
via a per-{netns, address family} sysctl. The interface and the
implementation are designed to support more ICMP extensions.
Motivation
==========
Over the years, the kernel was extended with the ability to derive the
source IP of ICMP error messages from the interface that received the
datagram which elicited the ICMP error [2][3][4]. This is especially
important for "Time Exceeded" messages as it allows traceroute users to
trace the actual packet path along the network.
The above scheme does not work in unnumbered networks. In these
networks, only the loopback / VRF interface is assigned a global IP
address while router interfaces are assigned IPv6 link-local addresses.
As such, ICMP error messages are generated with a source IP derived from
the loopback / VRF interface, making it impossible to trace the actual
packet path when parallel links exist between routers.
The problem can be solved by implementing the solution proposed by RFC
4884 [5] and RFC 5837. The former defines an ICMP extension structure
that can be appended to selected ICMP messages and carry extension
objects. The latter defines an extension object called the "Interface
Information Object" (IIO) that can carry interface information (e.g.,
name, index, MTU) about interfaces with certain roles such as the
interface that received the datagram which elicited the ICMP error.
The payload of the datagram that elicited the error (potentially padded
/ trimmed) along with the ICMP extension structure will be queued to the
error queue of the originating socket, thereby allowing traceroute
applications to parse and display the information encoded in the ICMP
extension structure. Example:
# traceroute6 -e 2001:db8:1::3
traceroute to 2001:db8:1::3 (2001:db8:1::3), 30 hops max, 80 byte packets
1 2001:db8:1::2 (2001:db8:1::2) <INC:11,"eth1",mtu=1500> 0.214 ms 0.171 ms 0.162 ms
2 2001:db8:1::3 (2001:db8:1::3) <INC:12,"eth2",mtu=1500> 0.154 ms 0.135 ms 0.127 ms
# traceroute -e 192.0.2.3
traceroute to 192.0.2.3 (192.0.2.3), 30 hops max, 60 byte packets
1 192.0.2.2 (192.0.2.2) <INC:11,"eth1",mtu=1500> 0.191 ms 0.148 ms 0.144 ms
2 192.0.2.3 (192.0.2.3) <INC:12,"eth2",mtu=1500> 0.137 ms 0.122 ms 0.114 ms
Implementation
==============
As previously stated, the feature is controlled via a per-{netns,
address} sysctl. Specifically, a bit mask where each bit controls the
addition of a different ICMP extension to ICMP error messages.
Currently, only a single value is supported, to append the incoming
interface information.
Key points:
1. Global knob vs finer control. I am not aware of users who require
finer control, but it is possible that some users will want to avoid
appending ICMP extensions when the packet is sent out of a specific
interface (e.g., the management interface) or to a specific subnet. This
can be accomplished via a tc-bpf program that trims the ICMP extension
structure. An example program can be found here [6].
2. Split implementation between IPv4 / IPv6. While the implementation is
currently similar, there are some differences between both address
families. In addition, some extensions (e.g., RFC 8883 [7]) are
IPv6-specific. Given the above and given that the implementation is not
very complex, it makes sense to keep both implementations separate.
3. Compatibility with legacy applications. RFC 4884 from 2007 extended
certain ICMP messages with a length field that encodes the length of the
"original datagram" field, so that applications will be able to tell
where the "original datagram" ends and where the ICMP extension
structure starts.
Before the introduction of the IP{,6}_RECVERR_RFC4884 socket options
[8][9] in 2020 it was impossible for applications to know where the ICMP
extension structure starts and to this day some applications assume that
it starts at offset 128, which is the minimum length of the "original
datagram" field as specified by RFC 4884.
Therefore, in order to be compatible with both legacy and modern
applications, the datagram that elicited the ICMP error is trimmed /
padded to 128 bytes before appending the ICMP extension structure.
This behavior is specifically called out by RFC 4884: "Those wishing to
be backward compatible with non-compliant TRACEROUTE implementations
will include exactly 128 octets" [10].
Note that in 128 bytes we should be able to include enough headers for
the originating node to match the ICMP error message with the relevant
socket. For example, the following headers will be present in the
"original datagram" field when a VXLAN encapsulated IPv6 packet elicits
an ICMP error in an IPv6 underlay: IPv6 (40) | UDP (8) | VXLAN (8) | Eth
(14) | IPv6 (40) | UDP (8). Overall, 118 bytes.
If the 128 bytes limit proves to be insufficient for some use case, we
can consider dedicating a new bit in the previously mentioned sysctl to
allow for more bytes to be included in the "original datagram" field.
4. Extensibility. This patchset adds partial support for a single ICMP
extension. However, the interface and the implementation should be able
to support more extensions, if needed. Examples:
* More interface information objects as part of RFC 5837. We should be
able to derive the outgoing interface information and nexthop IP from
the dst entry attached to the packet that elicited the error.
* Extended Information object which encodes aggregate header limits as
part of RFC 8883.
A previous proposal from Ishaan Gandhi and Ron Bonica is available here
[12].
Testing
=======
The existing traceroute selftest is extended to test that ICMP
extensions are reported correctly when enabled. Both address families
are tested and with different packet sizes in order to make sure that
trimming / padding works correctly. Tested that packets are parsed
correctly by the IP{,6}_RECVERR_RFC4884 socket options using Willem's
selftest [13].
Changelog
=========
Changes since v1 [14]:
* Patches #1-#2: Added a comment about field ordering and review tags.
* Patch #3: Converted "sysctl" to "echo" when testing the return value.
Added a check to skip the test if traceroute version is older
than 2.1.5.
Ido Schimmel [Mon, 27 Oct 2025 08:22:32 +0000 (10:22 +0200)]
selftests: traceroute: Add ICMP extensions tests
Test that ICMP extensions are reported correctly when enabled and not
reported when disabled. Test both IPv4 and IPv6 and using different
packet sizes, to make sure trimming / padding works correctly.
Disable ICMP rate limiting (defaults to 1 per-second per-target) so that
the kernel will always generate ICMP errors when needed.
Ido Schimmel [Mon, 27 Oct 2025 08:22:31 +0000 (10:22 +0200)]
ipv6: icmp: Add RFC 5837 support
Add the ability to append the incoming IP interface information to
ICMPv6 error messages in accordance with RFC 5837 and RFC 4884. This is
required for more meaningful traceroute results in unnumbered networks.
The feature is disabled by default and controlled via a new sysctl
("net.ipv6.icmp.errors_extension_mask") which accepts a bitmask of ICMP
extensions to append to ICMP error messages. Currently, only a single
value is supported, but the interface and the implementation should be
able to support more extensions, if needed.
Clone the skb and copy the relevant data portions before modifying the
skb as the caller of icmp6_send() still owns the skb after the function
returns. This should be fine since by default ICMP error messages are
rate limited to 1000 per second and no more than 1 per second per
specific host.
Trim or pad the packet to 128 bytes before appending the ICMP extension
structure in order to be compatible with legacy applications that assume
that the ICMP extension structure always starts at this offset (the
minimum length specified by RFC 4884).
Since commit 20e1954fe238 ("ipv6: RFC 4884 partial support for SIT/GRE
tunnels") it is possible for icmp6_send() to be called with an skb that
already contains ICMP extensions. This can happen when we receive an
ICMPv4 message with extensions from a tunnel and translate it to an
ICMPv6 message towards an IPv6 host in the overlay network. I could not
find an RFC that supports this behavior, but it makes sense to not
overwrite the original extensions that were appended to the packet.
Therefore, avoid appending extensions if the length field in the
provided ICMPv6 header is already filled.
Export netdev_copy_name() using EXPORT_IPV6_MOD_GPL() to make it
available to IPv6 when it is built as a module.
Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251027082232.232571-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>