Bagas Sanjaya [Thu, 30 Oct 2025 07:50:13 +0000 (14:50 +0700)]
Documentation: netconsole: Separate literal code blocks for full and short netcat command name versions
Both full and short (abbreviated) command name versions of netcat
example are combined in single literal code block due to 'or::'
paragraph being indented one more space than the preceding paragraph
(before the short version example).
Unindent it to separate the versions.
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Link: https://patch.msgid.link/20251030075013.40418-1-bagasdotme@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
net: phy: microchip_t1s: Add support for LAN867x Rev.D0 PHY
This patch series adds support for the latest Microchip LAN8670/1/2 Rev.D0
10BASE-T1S PHYs to the microchip_t1s driver.
The new Rev.D0 silicon introduces updated initialization requirements and
link status handling behavior compared to earlier revisions (Rev.C2 and
below). These updates are necessary for full compliance with the OPEN
Alliance 10BASE-T1S specification and are documented in Microchip
Application Note AN1699 Revision G (DS60001699G – October 2025).
Summary of changes:
- Implements Rev.D0-specific configuration sequence as described in AN1699
Rev.G.
- Introduces link status control configuration for LAN867x Rev.D0.
====================
net: phy: microchip_t1s: configure link status control for LAN867x Rev.D0
Configure the link status in the Link Status Control register for
LAN8670/1/2 Rev.D0 PHYs, depending on whether PLCA or CSMA/CD mode
is enabled. When PLCA is enabled, the link status reflects the PLCA
status. When PLCA is disabled (CSMA/CD mode), the PHY does not support
autonegotiation, so the link status is forced active by setting
the LINK_STATUS_SEMAPHORE bit.
The link status control is configured:
- During PHY initialization, for default CSMA/CD mode.
- Whenever PLCA configuration is updated.
This ensures correct link reporting and consistent behavior for
LAN867x Rev.D0 devices.
net: phy: microchip_t1s: add support for Microchip LAN867X Rev.D0 PHY
Add support for the LAN8670/1/2 Rev.D0 10BASE-T1S PHYs from Microchip.
The new Rev.D0 silicon requires a specific set of initialization
settings to be configured for optimal performance and compliance with
OPEN Alliance specifications, as described in Microchip Application Note
AN1699 (Revision G, DS60001699G – October 2025).
https://www.microchip.com/en-us/application-notes/an1699
When operating in "SGMII" mode (Cisco SGMII or 2500BASE-X), qcom-ethqos
modifies the MAC control register in its ethqos_configure_sgmii()
function, which is only called from one path:
stmmac_mac_link_up()
+- reads MAC_CTRL_REG
+- masks out priv->hw->link.speed_mask
+- sets bits according to speed (2500, 1000, 100, 10) from priv->hw.link.speed*
+- ethqos_fix_mac_speed()
| +- qcom_ethqos_set_sgmii_loopback(false)
| +- ethqos_update_link_clk(speed)
| `- ethqos_configure(speed)
| `- ethqos_configure_sgmii(speed)
| +- reads MAC_CTRL_REG,
| +- configures PS/FES bits according to speed
| `- writes MAC_CTRL_REG as the last operation
+- sets duplex bit(s)
+- stmmac_mac_flow_ctrl()
+- writes MAC_CTRL_REG if changed from original read
...
As can be seen, the modification of the control register that
stmmac_mac_link_up() overwrites the changes that ethqos_fix_mac_speed()
does to the register. This makes ethqos_configure_sgmii()'s
modification questionable at best.
Thus, they appear to be doing very similar, with the exception of the
FES bit (bit 14) for 1G and 2.5G speeds.
Given that stmmac_mac_link_up() will write the MAC_CTRL_REG after
ethqos_configure_sgmii(), remove the unnecessary update in the
glue driver's ethqos_configure_sgmii() method, simplifying the code.
Konrad states:
Without any additional knowledge, the register description says:
Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com> Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1vEPlg-0000000CFHY-282A@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
Convert mlx5e and IPoIB to ndo_hwtstamp_get/set
This series by Carolina migrates mlx5e and IPoIB to the
ndo_hwtstamp_get/set interface and removes legacy hardware timestamp
ioctl handling. While doing so, it also cleans up naming and removes
redundant code.
No functional change in timestamp behavior.
Cleanup patches:
- net/mlx5e: Remove redundant tstamp pointer from channel structures
- net/mlx5e: Remove unnecessary tstamp local variable in mlx5i_complete_rx_cqe
- net/mlx5e: Rename hwstamp functions to hwtstamp
- net/mlx5e: Rename timestamp fields to hwtstamp_config
Add suppport in ipoib:
- IB/IPoIB: Add support for hwtstamp get/set ndos
Convert mlx5:
- net/mlx5e: Convert to new hwtstamp_get/set interface
====================
Carolina Jubran [Thu, 30 Oct 2025 10:25:09 +0000 (12:25 +0200)]
IB/IPoIB: Add support for hwtstamp get/set ndos
Add support for the ndo_hwtstamp_get and ndo_hwtstamp_set operations in
IPoIB. This allows lower devices to handle hardware timestamp
configuration through the new ndos instead of the legacy ioctls.
Carolina Jubran [Thu, 30 Oct 2025 10:25:08 +0000 (12:25 +0200)]
net/mlx5e: Rename timestamp fields to hwtstamp_config
Rename hardware timestamp-related fields from 'tstamp' to
'hwtstamp_config' throughout the MLX5 driver. The new name is more
descriptive as it clearly indicates these fields contain hardware
timestamp configuration.
Carolina Jubran [Thu, 30 Oct 2025 10:25:07 +0000 (12:25 +0200)]
net/mlx5e: Rename hwstamp functions to hwtstamp
Rename mlx5e_hwstamp_set/get() functions to mlx5e_hwtstamp_set/get()
to better reflect that these functions handle hardware timestamping,
not just hardware stamping.
Carolina Jubran [Thu, 30 Oct 2025 10:25:06 +0000 (12:25 +0200)]
net/mlx5e: Remove unnecessary tstamp local variable in mlx5i_complete_rx_cqe
Remove the tstamp local variable in mlx5i_complete_rx_cqe() and directly
pass the tstamp field from priv to mlx5e_rx_hw_stamp(). The local variable
was only used once and provided no additional value.
Carolina Jubran [Thu, 30 Oct 2025 10:25:05 +0000 (12:25 +0200)]
net/mlx5e: Remove redundant tstamp pointer from channel structures
Remove the tstamp pointer field from mlx5e_channel, mlx5e_ptp, and
mlx5e_trap structures, since it was only used to reference the tstamp
field in the priv structure. Instead, directly use the tstamp field
from priv when initializing RQ structures.
Also remove the unused hwtstamp_config field from mlx5_clock structure
as part of the cleanup.
Haiyang Zhang [Wed, 29 Oct 2025 20:43:10 +0000 (13:43 -0700)]
net: mana: Support HW link state events
Handle the NIC hardware link state events received from the HW
channel, then set the proper link state accordingly.
And, add a feature bit, GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE,
to inform the NIC hardware this handler exists.
Our MANA NIC only sends out the link state down/up messages
when we need to let the VM rerun DHCP client and change IP
address. So, add netif_carrier_on() in the probe(), let the NIC
show the right initial state in /sys/class/net/ethX/operstate.
This patch adds support for matching fragmented packets in tc flower
filters.
Previously, commit 93a8540aac72 ("cxgb4: flower: validate control flags")
added a check using flow_rule_match_has_control_flags() to reject
any rules with control flags, as the driver did not support
fragmentation at that time.
Now, with this patch, support for FLOW_DIS_IS_FRAGMENT is added:
- The driver checks for control flags using
flow_rule_is_supp_control_flags(), as recommended in
commit d11e63119432 ("flow_offload: add control flag checking helpers").
- If the fragmentation flag is present, the driver sets `fs->val.frag` and
`fs->mask.frag` accordingly in the filter specification.
Since fragmentation is now supported, the earlier check that rejected all
control flags (flow_rule_match_has_control_flags()) has been removed.
Jakub Kicinski [Fri, 31 Oct 2025 00:57:07 +0000 (17:57 -0700)]
Merge tag 'nf-next-25-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
netfilter: updates for net-next
1) Convert nf_tables 'nft_set_iter' usage to use C99 struct
initialization, from Fernando Fernandez Mancera.
2) Disallow nf_conntrack_max=0. This was an (undocumented)
historic inheritance from ip_conntrack (ipv4 only nf_conntrack
predecessor). Doing so will simplify future changes to make
this pernet-tuneable.
3) Fix a typo in conntrack.h comment, from Weibiao Tu.
* tag 'nf-next-25-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: fix typo in nf_conntrack_l4proto.h comment
netfilter: conntrack: disable 0 value for conntrack_max setting
netfilter: nf_tables: use C99 struct initializer for nft_set_iter
====================
* tag 'wireless-next-2025-10-30' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next:
wifi: mac80211: Allow HT Action frame processing on 6 GHz when HE is supported
wifi: rt2x00: add nvmem eeprom support
wifi: mac80211: add RX flag to report radiotap VHT information
net: wireless: Remove redundant pm_runtime_mark_last_busy() calls
wifi: cfg80211: Add parameters to radio-specific debugfs directories
wifi: cfg80211: Add debugfs support for multi-radio wiphy
wifi: mac80211: fix missing RX bitrate update for mesh forwarding path
wifi: cfg80211: default S1G chandef width to 1MHz
wifi: mac80211: get probe response chan via ieee80211_get_channel_khz
wifi: mac80211: reset CRC valid after CSA
wifi: mac80211_hwsim: advertise puncturing feature support
wifi: cfg80211/mac80211: validate radio frequency range for monitor mode
wifi: rt2x00: check retval for of_get_mac_address
====================
Jakub Kicinski [Wed, 29 Oct 2025 16:49:30 +0000 (09:49 -0700)]
selftests: drv-net: replace the nsim ring test with a drv-net one
We are trying to move away from netdevsim-only tests and towards
tests which can be run both against netdevsim and real drivers.
Replace the simple bash script we have for checking ethtool -g/-G
on netdevsim with a Python test tweaking those params as well
as channel count.
The new test is not exactly equivalent to the netdevsim one,
but real drivers don't often support random ring sizes,
let alone modifying max values via debugfs.
For ice:
Michal converts driver to utilize Page Pool and libeth APIs. Conversion
is based on similar changes done for iavf in order to simplify buffer
management, improve maintainability, and increase code reuse across
Intel Ethernet drivers.
Alexander adds support for header split, configurable via ethtool.
Grzegorz allows for use of 100Mbps on E825C SGMII devices.
For i40e:
Jay Vosburgh avoids sending link state changes to VF if it is already in
the requested state.
For idpf:
Sreedevi removes duplicated defines.
For ixgbe:
Alok Tiwari fixes some typos.
For igbvf:
Alok Tiwari fixes output of VLAN warning message.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
igbvf: fix misplaced newline in VLAN add warning message
ixgbe: fix typos in ixgbe driver comments
idpf: remove duplicate defines in IDPF_CAP_RSS
i40e: avoid redundant VF link state updates
ice: Allow 100M speed for E825C SGMII device
ice: implement configurable header split for regular Rx
ice: switch to Page Pool
ice: drop page splitting and recycling
ice: remove legacy Rx and construct SKB
====================
====================
net/smc: make wr buffer count configurable
The current value of SMC_WR_BUF_CNT is 16 which leads to heavy
contention on the wr_tx_wait workqueue of the SMC-R linkgroup and its
spinlock when many connections are competing for the work request
buffers. Currently up to 256 connections per linkgroup are supported.
To make things worse when finally a buffer becomes available and
smc_wr_tx_put_slot() signals the linkgroup's wr_tx_wait wq, because
WQ_FLAG_EXCLUSIVE is not used all the waiters get woken up, most of the
time a single one can proceed, and the rest is contending on the
spinlock of the wq to go to sleep again.
Addressing this by simply bumping SMC_WR_BUF_CNT to 256 was deemed
risky, because the large-ish physically continuous allocation could fail
and lead to TCP fall-backs. For reference see this discussion thread on
"[PATCH net-next] net/smc: increase SMC_WR_BUF_CNT" (in archive
https://lists.openwall.net/netdev/2024/11/05/186), which concludes with
the agreement to try to come up with something smarter, which is what
this series aims for.
Additionally if for some reason it is known that heavy contention is not
to be expected going with something like 256 work request buffers is
wasteful. To address these concerns make the number of work requests
configurable, and introduce a back-off logic with handles -ENOMEM form
smc_wr_alloc_link_mem() gracefully.
Halil Pasic [Mon, 27 Oct 2025 22:48:56 +0000 (23:48 +0100)]
net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully
Currently if a -ENOMEM from smc_wr_alloc_link_mem() is handled by
giving up and going the way of a TCP fallback. This was reasonable
before the sizes of the allocations there were compile time constants
and reasonably small. But now those are actually configurable.
So instead of giving up, keep retrying with half of the requested size
unless we dip below the old static sizes -- then give up! In terms of
numbers that means we give up when it is certain that we at best would
end up allocating less than 16 send WR buffers or less than 48 recv WR
buffers. This is to avoid regressions due to having fewer buffers
compared the static values of the past.
Please note that SMC-R is supposed to be an optimisation over TCP, and
falling back to TCP is superior to establishing an SMC connection that
is going to perform worse. If the memory allocation fails (and we
propagate -ENOMEM), we fall back to TCP.
Preserve (modulo truncation) the ratio of send/recv WR buffer counts.
Signed-off-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Tested-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Link: https://patch.msgid.link/20251027224856.2970019-3-pasic@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Halil Pasic [Mon, 27 Oct 2025 22:48:55 +0000 (23:48 +0100)]
net/smc: make wr buffer count configurable
Think SMC_WR_BUF_CNT_SEND := SMC_WR_BUF_CNT used in send context and
SMC_WR_BUF_CNT_RECV := 3 * SMC_WR_BUF_CNT used in recv context. Those
get replaced with lgr->max_send_wr and lgr->max_recv_wr respective.
Please note that although with the default sysctl values
qp_attr.cap.max_send_wr == qp_attr.cap.max_recv_wr is maintained but
can not be assumed to be generally true any more. I see no downside to
that, but my confidence level is rather modest.
Signed-off-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Tested-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Link: https://patch.msgid.link/20251027224856.2970019-2-pasic@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
netfilter: conntrack: disable 0 value for conntrack_max setting
Undocumented historical artifact inherited from ip_conntrack.
If value is 0, then no limit is applied at all, conntrack table
can grow to huge value, only limited by size of conntrack hashes and
the kernel-internal upper limit on the hash chain lengths.
This feature makes no sense; users can just set
conntrack_max=2147483647 (INT_MAX).
Disallow a 0 value. This will make it slightly easier to allow
per-netns constraints for this value in a future patch.
netfilter: nf_tables: use C99 struct initializer for nft_set_iter
Use C99 struct initializer for nft_set_iter, simplifying the code and
preventing future errors due to uninitialized fields if new fields are
added to the struct.
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Florian Westphal <fw@strlen.de>
Ranganath V N [Sun, 26 Oct 2025 16:33:12 +0000 (22:03 +0530)]
net: sctp: fix KMSAN uninit-value in sctp_inq_pop
Fix an issue detected by syzbot:
KMSAN reported an uninitialized-value access in sctp_inq_pop
BUG: KMSAN: uninit-value in sctp_inq_pop
The issue is actually caused by skb trimming via sk_filter() in sctp_rcv().
In the reproducer, skb->len becomes 1 after sk_filter(), which bypassed the
original check:
if (skb->len < sizeof(struct sctphdr) + sizeof(struct sctp_chunkhdr) +
skb_transport_offset(skb))
To handle this safely, a new check should be performed after sk_filter().
Reported-by: syzbot+d101e12bccd4095460e7@syzkaller.appspotmail.com Tested-by: syzbot+d101e12bccd4095460e7@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d101e12bccd4095460e7 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Suggested-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Ranganath V N <vnranganath.20@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20251026-kmsan_fix-v3-1-2634a409fa5f@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Thu, 30 Oct 2025 09:44:12 +0000 (10:44 +0100)]
Merge branch 'add-cn20k-nix-and-npa-contexts'
Subbaraya Sundeep says:
====================
Add CN20K NIX and NPA contexts
The hardware contexts of blocks NIX and NPA in CN20K silicon are
different than that of previous silicons CN10K and CN9XK. This
patchset adds the new contexts of CN20K in AF and PF drivers.
A new mailbox for enqueuing contexts to hardware is added.
Patch 1 simplifies context writing and reading by using max context
size supported by hardware instead of using each context size.
Patch 2 and 3 adds NIX block contexts in AF driver and extends
debugfs to display those new contexts
Patch 4 and 5 adds NPA block contexts in AF driver and extends
debugfs to display those new contexts
Patch 6 omits NDC configuration since CN20K NPA does not use NDC
for caching its contexts
Patch 7 and 8 uses the new NIX and NPA contexts in PF/VF driver.
Patch 9, 10 and 11 are to support more bandwidth profiles present in
CN20K for RX ratelimiting and to display new profiles in debugfs
octeontx2-pf: Use new bandwidth profiles in receive queue
Receive queue points to a bandwidth profile for rate limiting.
Since cn20k has additional bandwidth profiles use them
too while mapping receive queue to bandwidth profile.
octeontx2-af: Accommodate more bandwidth profiles for cn20k
CN20K has 16k of leaf profiles, 2k of middle profiles and
256 of top profiles. This patch modifies existing receive
queue and bandwidth profile context structures to accommodate
additional profiles of cn20k.
Linu Cherian [Sat, 25 Oct 2025 10:32:43 +0000 (16:02 +0530)]
octeontx2-pf: Initialize cn20k specific aura and pool contexts
With new CN20K NPA pool and aura contexts supported in AF
driver this patch modifies PF driver to use new NPA contexts.
Implement new hw_ops for intializing aura and pool contexts
for all the silicons.
Linu Cherian [Sat, 25 Oct 2025 10:32:42 +0000 (16:02 +0530)]
octeontx2-af: Skip NDC operations for cn20k
For cn20k, NPA block doesn't use the general purpose
NDC (Near Coprocessor Bus Data cache Unit) for caching,
hence skip the NDC related operations.
Also refactor NDC configuration code to a helper function.
Linu Cherian [Sat, 25 Oct 2025 10:32:40 +0000 (16:02 +0530)]
octeontx2-af: Add cn20k NPA block contexts
New CN20K silicon has NPA hardware context structures different from
previous silicons. Add NPA aura and pool context definitions for cn20k.
Extend NPA context handling support to cn20k.
New CN20K silicon has NIX hardware context structures different from
previous silicons. Add NIX send and completion queue context
definitions for cn20k. Extend NIX context handling support to cn20k.
Thomas Wu [Tue, 28 Oct 2025 04:34:42 +0000 (10:04 +0530)]
wifi: mac80211: Allow HT Action frame processing on 6 GHz when HE is supported
Management frames on 6 GHz do not include HT Capabilities, causing HT
Action frames to be dropped in ieee80211_rx_h_action(). The current logic
checks only ht_cap.ht_supported, which fails for 6 GHz radios that support
only HE and EHT.
Update the condition to also allow HT Action frame processing when
he_cap.has_he is true. This enables support for HE dynamic SM power save
as defined in IEEE Std 802.11ax-2021, section 26.14.4.
Benjamin Berg [Mon, 27 Oct 2025 12:22:14 +0000 (14:22 +0200)]
wifi: mac80211: add RX flag to report radiotap VHT information
mac80211 already reports some basic information in the radiotap header
with the known fields declared by the driver. However, drivers may want
to report more accurate information and in that case the full VHT
radiotap structure needs to be provided.
Add a new RX_FLAG_RADIOTAP_VHT which is set when the VHT information
should be pulled from the skb. Update the code to fill in the VHT fields
to only do so when requested by the driver or if the information has not
yet been set. This way the driver can fully control the information if
it chooses so.
pm_runtime_put_autosuspend(), pm_runtime_put_sync_autosuspend(),
pm_runtime_autosuspend() and pm_request_autosuspend() now include a call
to pm_runtime_mark_last_busy(). Remove the now-reduntant explicit call to
pm_runtime_mark_last_busy().
Shivaji Kant [Wed, 29 Oct 2025 06:54:19 +0000 (06:54 +0000)]
net: devmem: refresh devmem TX dst in case of route invalidation
The zero-copy Device Memory (Devmem) transmit path
relies on the socket's route cache (`dst_entry`) to
validate that the packet is being sent via the network
device to which the DMA buffer was bound.
However, this check incorrectly fails and returns `-ENODEV`
if the socket's route cache entry (`dst`) is merely missing
or expired (`dst == NULL`). This scenario is observed during
network events, such as when flow steering rules are deleted,
leading to a temporary route cache invalidation.
This patch fixes -ENODEV error for `net_devmem_get_binding()`
by doing the following:
1. It attempts to rebuild the route via `rebuild_header()`
if the route is initially missing (`dst == NULL`). This
allows the TCP/IP stack to recover from transient route
cache misses.
2. It uses `rcu_read_lock()` and `dst_dev_rcu()` to safely
access the network device pointer (`dst_dev`) from the
route, preventing use-after-free conditions if the
device is concurrently removed.
3. It maintains the critical safety check by validating
that the retrieved destination device (`dst_dev`) is
exactly the device registered in the Devmem binding
(`binding->dev`).
These changes prevent unnecessary ENODEV failures while
maintaining the critical safety requirement that the
Devmem resources are only used on the bound network device.
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reported-by: Eric Dumazet <edumazet@google.com> Reported-by: Vedant Mathur <vedantmathur@google.com> Suggested-by: Eric Dumazet <edumazet@google.com> Fixes: bd61848900bf ("net: devmem: Implement TX path") Signed-off-by: Shivaji Kant <shivajikant@google.com> Link: https://patch.msgid.link/20251029065420.3489943-1-shivajikant@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
max_addr is the max number of addresses, not the highest possible address,
therefore check phydev->mdio.addr > max_addr isn't correct.
To fix this change the semantics of max_addr, so that it represents
the highest possible address. IMO this is also a little bit more intuitive
wrt name max_addr.
Fixes: 4a107a0e8361 ("net: stmmac: mdio: use phy_find_first to simplify stmmac_mdio_register") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Reported-by: Simon Horman <horms@kernel.org> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/e869999b-2d4b-4dc1-9890-c2d3d1e8d0f8@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
pm_runtime_put_autosuspend(), pm_runtime_put_sync_autosuspend(),
pm_runtime_autosuspend() and pm_request_autosuspend() now include a call
to pm_runtime_mark_last_busy(). Remove the now-reduntant explicit call to
pm_runtime_mark_last_busy().
pm_runtime_put_autosuspend(), pm_runtime_put_sync_autosuspend(),
pm_runtime_autosuspend() and pm_request_autosuspend() now include a call
to pm_runtime_mark_last_busy(). Remove the now-reduntant explicit call to
pm_runtime_mark_last_busy().
pm_runtime_put_autosuspend(), pm_runtime_put_sync_autosuspend(),
pm_runtime_autosuspend() and pm_request_autosuspend() now include a call
to pm_runtime_mark_last_busy(). Remove the now-reduntant explicit call to
pm_runtime_mark_last_busy().
====================
net: stmmac: Fixes for stmmac Tx VLAN insert and EST
This patchset includes following fixes for stmmac Tx VLAN insert and
EST implementations:
1. Disable STAG insertion offloading, as DWMAC IPs doesn't support
offload of STAG for double VLAN packets and CTAG for single VLAN
packets when using the same register configuration. The current
configuration in the driver is undocumented and is adding an
additional 802.1Q tag with VLAN ID 0 for double VLAN packets.
2. Consider Tx VLAN offload tag length for maxSDU estimation.
3. Fix GCL bounds check
====================
Rohan G Thomas [Tue, 28 Oct 2025 03:18:45 +0000 (11:18 +0800)]
net: stmmac: est: Fix GCL bounds checks
Fix the bounds checks for the hw supported maximum GCL entry
count and gate interval time.
Fixes: b60189e0392f ("net: stmmac: Integrate EST with TAPRIO scheduler API") Signed-off-by: Rohan G Thomas <rohan.g.thomas@altera.com> Reviewed-by: Matthew Gerlach <matthew.gerlach@altera.com> Link: https://patch.msgid.link/20251028-qbv-fixes-v4-3-26481c7634e3@altera.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rohan G Thomas [Tue, 28 Oct 2025 03:18:44 +0000 (11:18 +0800)]
net: stmmac: Consider Tx VLAN offload tag length for maxSDU
Queue maxSDU requirement of 802.1 Qbv standard requires mac to drop
packets that exceeds maxSDU length and maxSDU doesn't include
preamble, destination and source address, or FCS but includes
ethernet type and VLAN header.
On hardware with Tx VLAN offload enabled, VLAN header length is not
included in the skb->len, when Tx VLAN offload is requested. This
leads to incorrect length checks and allows transmission of
oversized packets. Add the VLAN_HLEN to the skb->len before checking
the Qbv maxSDU if Tx VLAN offload is requested for the packet.
Fixes: c5c3e1bfc9e0 ("net: stmmac: Offload queueMaxSDU from tc-taprio") Signed-off-by: Rohan G Thomas <rohan.g.thomas@altera.com> Reviewed-by: Matthew Gerlach <matthew.gerlach@altera.com> Link: https://patch.msgid.link/20251028-qbv-fixes-v4-2-26481c7634e3@altera.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rohan G Thomas [Tue, 28 Oct 2025 03:18:43 +0000 (11:18 +0800)]
net: stmmac: vlan: Disable 802.1AD tag insertion offload
The DWMAC IP's VLAN tag insertion offload does not support inserting
STAG (802.1AD) and CTAG (802.1Q) types in bytes 13 and 14 using the
same MAC_VLAN_Incl and MAC_VLAN_Inner_Incl register configurations.
Currently, MAC_VLAN_Incl is configured to offload only STAG type
insertion. However, the DWMAC IP inserts a CTAG type when the inner
VLAN ID field of the descriptor is not configured, and a STAG type
when it is configured. This behavior is not documented and leads to
inconsistent double VLAN tagging.
Additionally, an unexpected CTAG with VLAN ID 0 is inserted, resulting
in frames like:
Frame 1: 110 bytes on wire (880 bits), 110 bytes captured (880 bits)
Ethernet II, Src: <src> (<src>), Dst: <dst> (<dst>)
IEEE 802.1ad, ID: 100
802.1Q Virtual LAN, PRI: 0, DEI: 0, ID: 0 (unexpected)
802.1Q Virtual LAN, PRI: 0, DEI: 0, ID: 200
Internet Protocol Version 4, Src: 192.168.4.10, Dst: 192.168.4.11
Internet Control Message Protocol
To avoid this undocumented and incorrect behavior, disable 802.1AD tag
insertion offload. Also, don't set CSVL bit. As per the data book,
when this bit is set, S-VLAN type (0x88A8) is inserted in the 13th and
14th bytes of transmitted packets and when this bit is reset, C-VLAN
type (0x8100) is inserted in the 13th and 14th bytes of transmitted
packets.
Fixes: 30d932279dc2 ("net: stmmac: Add support for VLAN Insertion Offload") Fixes: e94e3f3b51ce ("net: stmmac: Add support for VLAN Insertion Offload in GMAC4+") Fixes: 1d2c7a5fee31 ("net: stmmac: Refactor VLAN implementation") Signed-off-by: Rohan G Thomas <rohan.g.thomas@altera.com> Reviewed-by: Boon Khai Ng <boon.khai.ng@altera.com> Link: https://patch.msgid.link/20251028-qbv-fixes-v4-1-26481c7634e3@altera.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 30 Oct 2025 01:44:21 +0000 (18:44 -0700)]
Merge branch 'net-enetc-add-i-mx94-enetc-support'
Wei Fang says:
====================
net: enetc: Add i.MX94 ENETC support
i.MX94 NETC has two kinds of ENETCs, one is the same as i.MX95, which
can be used as a standalone network port. The other one is an internal
ENETC, it connects to the CPU port of NETC switch through the pseudo
MAC. Also, i.MX94 have multiple PTP Timers, which is different from
i.MX95. Any PTP Timer can be bound to a specified standalone ENETC by
the IERB ETBCR registers. Currently, this patch only add ENETC support
and Timer support for i.MX94. The switch will be added by a separate
patch set.
In addition, note that i.MX94 SoC is launched after i.MX95, its NETC
has a higher version, so the driver support is added after i.MX95.
====================
Wei Fang [Wed, 29 Oct 2025 01:38:59 +0000 (09:38 +0800)]
net: enetc: add basic support for the ENETC with pseudo MAC for i.MX94
The ENETC with pseudo MAC is an internal port which connects to the CPU
port of the switch. The switch CPU/host ENETC is fully integrated with
the switch and does not require a back-to-back MAC, instead a light
weight "pseudo MAC" provides the delineation between switch and ENETC.
This translates to lower power (less logic and memory) and lower delay
(as there is no serialization delay across this link).
Different from the standalone ENETC which is used as the external port,
the internal ENETC has a different PCIe device ID, and it does not have
Ethernet MAC port registers, instead, it has a small number of pseudo
MAC port registers, so some features are not supported by pseudo MAC,
such as loopback, half duplex, one-step timestamping and so on.
Therefore, the configuration of this internal ENETC is also somewhat
different from that of the standalone ENETC. So add the basic support
for ENETC with pseudo MAC. More supports will be added in the future.
Clark Wang [Wed, 29 Oct 2025 01:38:58 +0000 (09:38 +0800)]
net: enetc: add ptp timer binding support for i.MX94
The i.MX94 has three PTP timers, and all standalone ENETCs can select
one of them to bind to as their PHC. The 'ptp-timer' property is used
to represent the PTP device of the Ethernet controller. So users can
add 'ptp-timer' to the ENETC node to specify the PTP timer. The driver
parses this property to bind the two hardware devices.
If the "ptp-timer" property is not present, the first timer of the PCIe
bus where the ENETC is located is used as the default bound PTP timer.
Wei Fang [Wed, 29 Oct 2025 01:38:57 +0000 (09:38 +0800)]
net: enetc: add preliminary i.MX94 NETC blocks control support
NETC blocks control is used for warm reset and pre-boot initialization.
Different versions of NETC blocks control are not exactly the same. We
need to add corresponding netc_devinfo data for each version. i.MX94
series are launched after i.MX95, so its NETC version (v4.3) is higher
than i.MX95 NETC (v4.1). Currently, the patch adds the following
configurations for ENETCs.
1. Set the link's MII protocol.
2. ENETC 0 (MAC 3) and the switch port 2 (MAC 2) share the same parallel
interface, but due to SoC constraint, they cannot be used simultaneously.
Since the switch is not supported yet, so the interface is assigned to
ENETC 0 by default.
The switch configuration will be added separately in a subsequent patch.
Wei Fang [Wed, 29 Oct 2025 01:38:56 +0000 (09:38 +0800)]
dt-bindings: net: enetc: add compatible string for ENETC with pseduo MAC
The ENETC with pseudo MAC is used to connect to the CPU port of the NETC
switch. This ENETC has a different PCI device ID, so add a standard PCI
device compatible string to it.
====================
tls: Introduce and use RX async resync request cancel function
This series by Shahar introduces RX async resync request cancel function
in tls module, and uses it in mlx5e driver.
For a device-offloaded TLS RX connection, the TLS module increments
rcd_delta each time a new TLS record is received, tracking the distance
from the original resync request. In the meanwhile, the device is
queried and is expected to respond, asynchronously.
However, if the device response is delayed or fails (e.g due to unstable
connection and device getting out of tracking, hardware errors, resource
exhaustion etc.), the TLS module keeps logging and incrementing
rcd_delta, which can lead to a WARN() when rcd_delta exceeds the
threshold.
This series improves this code area by canceling the resync request when
spotting an issue with the device response.
====================
Shahar Shitrit [Sun, 26 Oct 2025 20:03:03 +0000 (22:03 +0200)]
net/mlx5e: kTLS, Cancel RX async resync request in error flows
When device loses track of TLS records, it attempts to resync by
monitoring records and requests an asynchronous resynchronization
from software for this TLS connection.
The TLS module handles such device RX resync requests by logging record
headers and comparing them with the record tcp_sn when provided by the
device. It also increments rcd_delta to track how far the current
record tcp_sn is from the tcp_sn of the original resync request.
If the device later responds with a matching tcp_sn, the TLS module
approves the tcp_sn for resync.
However, the device response may be delayed or never arrive,
particularly due to traffic-related issues such as packet drops or
reordering. In such cases, the TLS module remains unaware that resync
will not complete, and continues performing unnecessary work by logging
headers and incrementing rcd_delta, which can eventually exceed the
threshold and trigger a WARN(). For example, this was observed when the
device got out of tracking, causing
mlx5e_ktls_handle_get_psv_completion() to fail and ultimately leading
to the rcd_delta warning.
To address this, call tls_offload_rx_resync_async_request_cancel()
to cancel the resync request and stop resync tracking in such error
cases. Also, increment the tls_resync_req_skip counter to track these
cancellations.
Shahar Shitrit [Sun, 26 Oct 2025 20:03:02 +0000 (22:03 +0200)]
net: tls: Cancel RX async resync request on rcd_delta overflow
When a netdev issues a RX async resync request for a TLS connection,
the TLS module handles it by logging record headers and attempting to
match them to the tcp_sn provided by the device. If a match is found,
the TLS module approves the tcp_sn for resynchronization.
While waiting for a device response, the TLS module also increments
rcd_delta each time a new TLS record is received, tracking the distance
from the original resync request.
However, if the device response is delayed or fails (e.g due to
unstable connection and device getting out of tracking, hardware
errors, resource exhaustion etc.), the TLS module keeps logging and
incrementing, which can lead to a WARN() when rcd_delta exceeds the
threshold.
To address this, introduce tls_offload_rx_resync_async_request_cancel()
to explicitly cancel resync requests when a device response failure is
detected. Call this helper also as a final safeguard when rcd_delta
crosses its threshold, as reaching this point implies that earlier
cancellation did not occur.
Shahar Shitrit [Sun, 26 Oct 2025 20:03:01 +0000 (22:03 +0200)]
net: tls: Change async resync helpers argument
Update tls_offload_rx_resync_async_request_start() and
tls_offload_rx_resync_async_request_end() to get a struct
tls_offload_resync_async parameter directly, rather than
extracting it from struct sock.
This change aligns the function signatures with the upcoming
tls_offload_rx_resync_async_request_cancel() helper, which
will be introduced in a subsequent patch.
Jakub Kicinski [Thu, 30 Oct 2025 01:28:32 +0000 (18:28 -0700)]
Merge branch 'icmp-add-rfc-5837-support'
Ido Schimmel says:
====================
icmp: Add RFC 5837 support
tl;dr
=====
This patchset extends certain ICMP error messages (e.g., "Time
Exceeded") with incoming interface information in accordance with RFC
5837 [1]. This is required for more meaningful traceroute results in
unnumbered networks. Like other ICMP settings, the feature is controlled
via a per-{netns, address family} sysctl. The interface and the
implementation are designed to support more ICMP extensions.
Motivation
==========
Over the years, the kernel was extended with the ability to derive the
source IP of ICMP error messages from the interface that received the
datagram which elicited the ICMP error [2][3][4]. This is especially
important for "Time Exceeded" messages as it allows traceroute users to
trace the actual packet path along the network.
The above scheme does not work in unnumbered networks. In these
networks, only the loopback / VRF interface is assigned a global IP
address while router interfaces are assigned IPv6 link-local addresses.
As such, ICMP error messages are generated with a source IP derived from
the loopback / VRF interface, making it impossible to trace the actual
packet path when parallel links exist between routers.
The problem can be solved by implementing the solution proposed by RFC
4884 [5] and RFC 5837. The former defines an ICMP extension structure
that can be appended to selected ICMP messages and carry extension
objects. The latter defines an extension object called the "Interface
Information Object" (IIO) that can carry interface information (e.g.,
name, index, MTU) about interfaces with certain roles such as the
interface that received the datagram which elicited the ICMP error.
The payload of the datagram that elicited the error (potentially padded
/ trimmed) along with the ICMP extension structure will be queued to the
error queue of the originating socket, thereby allowing traceroute
applications to parse and display the information encoded in the ICMP
extension structure. Example:
# traceroute6 -e 2001:db8:1::3
traceroute to 2001:db8:1::3 (2001:db8:1::3), 30 hops max, 80 byte packets
1 2001:db8:1::2 (2001:db8:1::2) <INC:11,"eth1",mtu=1500> 0.214 ms 0.171 ms 0.162 ms
2 2001:db8:1::3 (2001:db8:1::3) <INC:12,"eth2",mtu=1500> 0.154 ms 0.135 ms 0.127 ms
# traceroute -e 192.0.2.3
traceroute to 192.0.2.3 (192.0.2.3), 30 hops max, 60 byte packets
1 192.0.2.2 (192.0.2.2) <INC:11,"eth1",mtu=1500> 0.191 ms 0.148 ms 0.144 ms
2 192.0.2.3 (192.0.2.3) <INC:12,"eth2",mtu=1500> 0.137 ms 0.122 ms 0.114 ms
Implementation
==============
As previously stated, the feature is controlled via a per-{netns,
address} sysctl. Specifically, a bit mask where each bit controls the
addition of a different ICMP extension to ICMP error messages.
Currently, only a single value is supported, to append the incoming
interface information.
Key points:
1. Global knob vs finer control. I am not aware of users who require
finer control, but it is possible that some users will want to avoid
appending ICMP extensions when the packet is sent out of a specific
interface (e.g., the management interface) or to a specific subnet. This
can be accomplished via a tc-bpf program that trims the ICMP extension
structure. An example program can be found here [6].
2. Split implementation between IPv4 / IPv6. While the implementation is
currently similar, there are some differences between both address
families. In addition, some extensions (e.g., RFC 8883 [7]) are
IPv6-specific. Given the above and given that the implementation is not
very complex, it makes sense to keep both implementations separate.
3. Compatibility with legacy applications. RFC 4884 from 2007 extended
certain ICMP messages with a length field that encodes the length of the
"original datagram" field, so that applications will be able to tell
where the "original datagram" ends and where the ICMP extension
structure starts.
Before the introduction of the IP{,6}_RECVERR_RFC4884 socket options
[8][9] in 2020 it was impossible for applications to know where the ICMP
extension structure starts and to this day some applications assume that
it starts at offset 128, which is the minimum length of the "original
datagram" field as specified by RFC 4884.
Therefore, in order to be compatible with both legacy and modern
applications, the datagram that elicited the ICMP error is trimmed /
padded to 128 bytes before appending the ICMP extension structure.
This behavior is specifically called out by RFC 4884: "Those wishing to
be backward compatible with non-compliant TRACEROUTE implementations
will include exactly 128 octets" [10].
Note that in 128 bytes we should be able to include enough headers for
the originating node to match the ICMP error message with the relevant
socket. For example, the following headers will be present in the
"original datagram" field when a VXLAN encapsulated IPv6 packet elicits
an ICMP error in an IPv6 underlay: IPv6 (40) | UDP (8) | VXLAN (8) | Eth
(14) | IPv6 (40) | UDP (8). Overall, 118 bytes.
If the 128 bytes limit proves to be insufficient for some use case, we
can consider dedicating a new bit in the previously mentioned sysctl to
allow for more bytes to be included in the "original datagram" field.
4. Extensibility. This patchset adds partial support for a single ICMP
extension. However, the interface and the implementation should be able
to support more extensions, if needed. Examples:
* More interface information objects as part of RFC 5837. We should be
able to derive the outgoing interface information and nexthop IP from
the dst entry attached to the packet that elicited the error.
* Extended Information object which encodes aggregate header limits as
part of RFC 8883.
A previous proposal from Ishaan Gandhi and Ron Bonica is available here
[12].
Testing
=======
The existing traceroute selftest is extended to test that ICMP
extensions are reported correctly when enabled. Both address families
are tested and with different packet sizes in order to make sure that
trimming / padding works correctly. Tested that packets are parsed
correctly by the IP{,6}_RECVERR_RFC4884 socket options using Willem's
selftest [13].
Changelog
=========
Changes since v1 [14]:
* Patches #1-#2: Added a comment about field ordering and review tags.
* Patch #3: Converted "sysctl" to "echo" when testing the return value.
Added a check to skip the test if traceroute version is older
than 2.1.5.
Ido Schimmel [Mon, 27 Oct 2025 08:22:32 +0000 (10:22 +0200)]
selftests: traceroute: Add ICMP extensions tests
Test that ICMP extensions are reported correctly when enabled and not
reported when disabled. Test both IPv4 and IPv6 and using different
packet sizes, to make sure trimming / padding works correctly.
Disable ICMP rate limiting (defaults to 1 per-second per-target) so that
the kernel will always generate ICMP errors when needed.
Ido Schimmel [Mon, 27 Oct 2025 08:22:31 +0000 (10:22 +0200)]
ipv6: icmp: Add RFC 5837 support
Add the ability to append the incoming IP interface information to
ICMPv6 error messages in accordance with RFC 5837 and RFC 4884. This is
required for more meaningful traceroute results in unnumbered networks.
The feature is disabled by default and controlled via a new sysctl
("net.ipv6.icmp.errors_extension_mask") which accepts a bitmask of ICMP
extensions to append to ICMP error messages. Currently, only a single
value is supported, but the interface and the implementation should be
able to support more extensions, if needed.
Clone the skb and copy the relevant data portions before modifying the
skb as the caller of icmp6_send() still owns the skb after the function
returns. This should be fine since by default ICMP error messages are
rate limited to 1000 per second and no more than 1 per second per
specific host.
Trim or pad the packet to 128 bytes before appending the ICMP extension
structure in order to be compatible with legacy applications that assume
that the ICMP extension structure always starts at this offset (the
minimum length specified by RFC 4884).
Since commit 20e1954fe238 ("ipv6: RFC 4884 partial support for SIT/GRE
tunnels") it is possible for icmp6_send() to be called with an skb that
already contains ICMP extensions. This can happen when we receive an
ICMPv4 message with extensions from a tunnel and translate it to an
ICMPv6 message towards an IPv6 host in the overlay network. I could not
find an RFC that supports this behavior, but it makes sense to not
overwrite the original extensions that were appended to the packet.
Therefore, avoid appending extensions if the length field in the
provided ICMPv6 header is already filled.
Export netdev_copy_name() using EXPORT_IPV6_MOD_GPL() to make it
available to IPv6 when it is built as a module.
Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251027082232.232571-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Mon, 27 Oct 2025 08:22:30 +0000 (10:22 +0200)]
ipv4: icmp: Add RFC 5837 support
Add the ability to append the incoming IP interface information to
ICMPv4 error messages in accordance with RFC 5837 and RFC 4884. This is
required for more meaningful traceroute results in unnumbered networks.
The feature is disabled by default and controlled via a new sysctl
("net.ipv4.icmp_errors_extension_mask") which accepts a bitmask of ICMP
extensions to append to ICMP error messages. Currently, only a single
value is supported, but the interface and the implementation should be
able to support more extensions, if needed.
Clone the skb and copy the relevant data portions before modifying the
skb as the caller of __icmp_send() still owns the skb after the function
returns. This should be fine since by default ICMP error messages are
rate limited to 1000 per second and no more than 1 per second per
specific host.
Trim or pad the packet to 128 bytes before appending the ICMP extension
structure in order to be compatible with legacy applications that assume
that the ICMP extension structure always starts at this offset (the
minimum length specified by RFC 4884).
Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251027082232.232571-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 30 Oct 2025 01:25:12 +0000 (18:25 -0700)]
Merge tag 'nf-25-10-29' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Florian Westphal says:
====================
netfilter: updates for net
1) its not possible to attach conntrack labels via ctnetlink
unless one creates a dummy 'ct labels set' rule in nftables.
This is an oversight, the 'ruleset tests presence, userspace
(netlink) sets' use-case is valid and should 'just work'.
Always broken since this got added in Linux 4.7.
2) nft_connlimit reads count value without holding the relevant
lock, add a READ_ONCE annotation. From Fernando Fernandez Mancera.
3) There is a long-standing bug (since 4.12) in nftables helper infra
when NAT is in use: if the helper gets assigned after the nat binding
was set up, we fail to initialise the 'seqadj' extension, which is
needed in case NAT payload rewrites need to add (or remove) from the
packet payload. Fix from Andrii Melnychenko.
* tag 'nf-25-10-29' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nft_ct: add seqadj extension for natted connections
netfilter: nft_connlimit: fix possible data race on connection count
netfilter: nft_ct: enable labels for get case too
====================
Thanh Quan [Mon, 27 Oct 2025 14:02:43 +0000 (15:02 +0100)]
net: phy: dp83869: fix STRAP_OPMODE bitmask
According to the TI DP83869HM datasheet Revision D (June 2025), section
7.6.1.41 STRAP_STS Register, the STRAP_OPMODE bitmask is bit [11:9].
Fix this.
In case the PHY is auto-detected via PHY ID registers, or not described
in DT, or, in case the PHY is described in DT but the optional DT property
"ti,op-mode" is not present, then the driver reads out the PHY functional
mode (RGMII, SGMII, ...) from hardware straps.
Currently, all upstream users of this PHY specify both DT compatible string
"ethernet-phy-id2000.a0f1" and ti,op-mode = <DP83869_RGMII_COPPER_ETHERNET>
property, therefore it seems no upstream users are affected by this bug.
The driver currently interprets bits [2:0] of STRAP_STS register as PHY
functional mode. Those bits are controlled by ANEG_DIS, ANEGSEL_0 straps
and an always-zero reserved bit. Systems that use RGMII-to-Copper functional
mode are unlikely to disable auto-negotiation via ANEG_DIS strap, or change
auto-negotiation behavior via ANEGSEL_0 strap. Therefore, even with this bug
in place, the STRAP_STS register content is likely going to be interpreted
by the driver as RGMII-to-Copper mode.
However, for a system with PHY functional mode strapping set to other mode
than RGMII-to-Copper, the driver is likely to misinterpret the strapping
as RGMII-to-Copper and misconfigure the PHY.
For example, on a system with SGMII-to-Copper strapping, the STRAP_STS
register reads as 0x0c20, but the PHY ends up being configured for
incompatible RGMII-to-Copper mode.
Fixes: 0eaf8ccf2047 ("net: phy: dp83869: Set opmode from straps") Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Thanh Quan <thanh.quan.xn@renesas.com> Signed-off-by: Hai Pham <hai.pham.ud@renesas.com> Signed-off-by: Marek Vasut <marek.vasut+renesas@mailbox.org> # Port from U-Boot to Linux Link: https://patch.msgid.link/20251027140320.8996-1-marek.vasut+renesas@mailbox.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Po-Hsu Lin [Mon, 27 Oct 2025 09:57:10 +0000 (17:57 +0800)]
selftests: net: use BASH for bareudp testing
In bareudp.sh, this script uses /bin/sh and it will load another lib.sh
BASH script at the very beginning.
But on some operating systems like Ubuntu, /bin/sh is actually pointed to
DASH, thus it will try to run BASH commands with DASH and consequently
leads to syntax issues:
# ./bareudp.sh: 4: ./lib.sh: Bad substitution
# ./bareudp.sh: 5: ./lib.sh: source: not found
# ./bareudp.sh: 24: ./lib.sh: Syntax error: "(" unexpected
Fix this by explicitly using BASH for bareudp.sh. This fixes test
execution failures on systems where /bin/sh is not BASH.
Jinliang Wang [Mon, 27 Oct 2025 06:55:30 +0000 (23:55 -0700)]
net: mctp: Fix tx queue stall
The tx queue can become permanently stuck in a stopped state due to a
race condition between the URB submission path and its completion
callback.
The URB completion callback can run immediately after usb_submit_urb()
returns, before the submitting function calls netif_stop_queue(). If
this occurs, the queue state management becomes desynchronized, leading
to a stall where the queue is never woken.
Fix this by moving the netif_stop_queue() call to before submitting the
URB. This closes the race window by ensuring the network stack is aware
the queue is stopped before the URB completion can possibly run.
Fixes: 0791c0327a6e ("net: mctp: Add MCTP USB transport driver") Signed-off-by: Jinliang Wang <jinliangw@google.com> Acked-by: Jeremy Kerr <jk@codeconstruct.com.au> Link: https://patch.msgid.link/20251027065530.2045724-1-jinliangw@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cosmin Ratiu [Sun, 26 Oct 2025 20:20:19 +0000 (22:20 +0200)]
net/mlx5: Don't zero user_count when destroying FDB tables
esw->user_count tracks how many TC rules are added on an esw via
mlx5e_configure_flower -> mlx5_esw_get -> atomic64_inc(&esw->user_count)
esw.user_count was unconditionally set to 0 in
esw_destroy_legacy_fdb_table and esw_destroy_offloads_fdb_tables.
These two together can lead to the following sequence of events:
1. echo 1 > /sys/class/net/eth2/device/sriov_numvfs
- mlx5_core_sriov_configure -...-> esw_create_legacy_table ->
atomic64_set(&esw->user_count, 0)
2. tc qdisc add dev eth2 ingress && \
tc filter replace dev eth2 pref 1 protocol ip chain 0 ingress \
handle 1 flower action ct nat zone 64000 pipe
- mlx5e_configure_flower -> mlx5_esw_get ->
atomic64_inc(&esw->user_count)
3. echo 0 > /sys/class/net/eth2/device/sriov_numvfs
- mlx5_core_sriov_configure -..-> esw_destroy_legacy_fdb_table
-> atomic64_set(&esw->user_count, 0)
4. devlink dev eswitch set pci/0000:08:00.0 mode switchdev
- mlx5_devlink_eswitch_mode_set -> mlx5_esw_try_lock ->
atomic64_read(&esw->user_count) == 0
- then proceed to a WARN_ON in:
esw_offloads_start -> mlx5_eswitch_enable_locke -> esw_offloads_enable
-> mlx5_esw_offloads_rep_load -> mlx5e_vport_rep_load ->
mlx5e_netdev_change_profile -> mlx5e_detach_netdev ->
mlx5e_cleanup_nic_rx -> mlx5e_tc_nic_cleanup ->
mlx5e_mod_hdr_tbl_destroy
Fix this by not clearing out the user_count when destroying FDB tables,
so that the check in mlx5_esw_try_lock can prevent the mode change when
there are TC rules configured, as originally intended.
Miaoqian Lin [Sun, 26 Oct 2025 16:43:16 +0000 (00:43 +0800)]
net: usb: asix_devices: Check return value of usbnet_get_endpoints
The code did not check the return value of usbnet_get_endpoints.
Add checks and return the error if it fails to transfer the error.
Found via static anlaysis and this is similar to
commit 07161b2416f7 ("sr9800: Add check for usbnet_get_endpoints").
Fixes: 933a27d39e0e ("USB: asix - Add AX88178 support and many other changes") Fixes: 2e55cc7210fe ("[PATCH] USB: usbnet (3/9) module for ASIX Ethernet adapters") Cc: stable@vger.kernel.org Signed-off-by: Miaoqian Lin <linmq006@gmail.com> Link: https://patch.msgid.link/20251026164318.57624-1-linmq006@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 30 Oct 2025 00:44:30 +0000 (17:44 -0700)]
Merge branch 'mptcp-various-rare-sending-issues'
Matthieu Baerts says:
====================
mptcp: various rare sending issues
Here are various fixes from Paolo, addressing very occasional issues on
the sending side:
- Patch 1: drop an optimisation that could lead to timeout in case of
race conditions. A fix for up to v5.11.
- Patch 2: fix stream corruption under very specific conditions.
A fix for up to v5.13.
- Patch 3: restore MPTCP-level zero window probe after a recent fix.
A fix for up to v5.16.
- Patch 4: new MIB counter to track MPTCP-level zero windows probe to
help catching issues similar to the one fixed by the previous patch.
====================
Paolo Abeni [Tue, 28 Oct 2025 08:16:54 +0000 (09:16 +0100)]
mptcp: restore window probe
Since commit 72377ab2d671 ("mptcp: more conservative check for zero
probes") the MPTCP-level zero window probe check is always disabled, as
the TCP-level write queue always contains at least the newly allocated
skb.
Refine the relevant check tacking in account that the above condition
and that such skb can have zero length.
Fixes: 72377ab2d671 ("mptcp: more conservative check for zero probes") Cc: stable@vger.kernel.org Reported-by: Geliang Tang <geliang@kernel.org> Closes: https://lore.kernel.org/d0a814c364e744ca6b836ccd5b6e9146882e8d42.camel@kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Tested-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251028-net-mptcp-send-timeout-v1-3-38ffff5a9ec8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Tue, 28 Oct 2025 08:16:53 +0000 (09:16 +0100)]
mptcp: fix MSG_PEEK stream corruption
If a MSG_PEEK | MSG_WAITALL read operation consumes all the bytes in the
receive queue and recvmsg() need to waits for more data - i.e. it's a
blocking one - upon arrival of the next packet the MPTCP protocol will
start again copying the oldest data present in the receive queue,
corrupting the data stream.
Address the issue explicitly tracking the peeked sequence number,
restarting from the last peeked byte.
Paolo Abeni [Tue, 28 Oct 2025 08:16:52 +0000 (09:16 +0100)]
mptcp: drop bogus optimization in __mptcp_check_push()
Accessing the transmit queue without owning the msk socket lock is
inherently racy, hence __mptcp_check_push() could actually quit early
even when there is pending data.
That in turn could cause unexpected tx lock and timeout.
Dropping the early check avoids the race, implicitly relaying on later
tests under the relevant lock. With such change, all the other
mptcp_send_head() call sites are now under the msk socket lock and we
can additionally drop the now unneeded annotation on the transmit head
pointer accesses.
Fixes: 6e628cd3a8f7 ("mptcp: use mptcp release_cb for delayed tasks") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251028-net-mptcp-send-timeout-v1-1-38ffff5a9ec8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
netconsole: Fix race condition in between reader and writer of userdata
The update_userdata() function constructs the complete userdata string
in nt->extradata_complete and updates nt->userdata_length. This data
is then read by write_msg() and write_ext_msg() when sending netconsole
messages. However, update_userdata() was not holding target_list_lock
during this process, allowing concurrent message transmission to read
partially updated userdata.
This race condition could result in netconsole messages containing
incomplete or inconsistent userdata - for example, reading the old
userdata_length with new extradata_complete content, or vice versa,
leading to truncated or corrupted output.
Fix this by acquiring target_list_lock with spin_lock_irqsave() before
updating extradata_complete and userdata_length, and releasing it after
both fields are fully updated. This ensures that readers see a
consistent view of the userdata, preventing corruption during concurrent
access.
The fix aligns with the existing locking pattern used throughout the
netconsole code, where target_list_lock protects access to target
fields including buf[] and msgcounter that are accessed during message
transmission.
Also get rid of the unnecessary variable complete_idx, which makes it
easier to bail out of update_userdata().
Bagas Sanjaya [Tue, 28 Oct 2025 13:20:27 +0000 (20:20 +0700)]
Documentation: netconsole: Remove obsolete contact people
Breno Leitao has been listed in MAINTAINERS as netconsole maintainer
since 7c938e438c56db ("MAINTAINERS: make Breno the netconsole
maintainer"), but the documentation says otherwise that bug reports
should be sent to original netconsole authors.
Remove obsolate contact info.
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Link: https://patch.msgid.link/20251028132027.48102-1-bagasdotme@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ankit Khushwaha [Tue, 28 Oct 2025 17:29:47 +0000 (22:59 +0530)]
selftest: net: fix socklen_t type mismatch in sctp_collision test
Socket APIs like recvfrom(), accept(), and getsockname() expect socklen_t*
arg, but tests were using int variables. This causes -Wpointer-sign
warnings on platforms where socklen_t is unsigned.
Change the variable type from int to socklen_t to resolve the warning and
ensure type safety across platforms.
warning fixed:
sctp_collision.c:62:70: warning: passing 'int *' to parameter of
type 'socklen_t *' (aka 'unsigned int *') converts between pointers to
integer types with different sign [-Wpointer-sign]
62 | ret = recvfrom(sd, buf, sizeof(buf),
0, (struct sockaddr *)&daddr, &len);
| ^~~~
/usr/include/sys/socket.h:165:27: note: passing argument to
parameter '__addr_len' here
165 | socklen_t *__restrict __addr_len);
| ^
Abdun Nihaal [Tue, 28 Oct 2025 16:08:41 +0000 (21:38 +0530)]
nfp: xsk: fix memory leak in nfp_net_alloc()
In nfp_net_alloc(), the memory allocated for xsk_pools is not freed in
the subsequent error paths, leading to a memory leak. Fix that by
freeing it in the error path.
Jakub Kicinski [Thu, 30 Oct 2025 00:30:45 +0000 (17:30 -0700)]
Merge branch 'tcp-fix-receive-autotune-again'
Matthieu Baerts says:
====================
tcp: fix receive autotune again
Neal Cardwell found that recent kernels were having RWIN limited
issues, even when net.ipv4.tcp_rmem[2] was set to a very big value like
512MB.
He suspected that tcp_stream default buffer size (64KB) was triggering
heuristic added in ea33537d8292 ("tcp: add receive queue awareness
in tcp_rcv_space_adjust()").
After more testing, it turns out the bug was added earlier
with commit 65c5287892e9 ("tcp: fix sk_rcvbuf overshoot").
I forgot once again that DRS has one RTT latency.
MPTCP also got the same issue.
This series :
- Prevents calling tcp_rcvbuf_grow() on some MPTCP subflows.
- adds rcv_ssthresh, window_clamp and rcv_wnd to trace_tcp_rcvbuf_grow().
- Refactors code in a patch with no functional changes.
- Fixes the issue in the final patch.
====================
Eric Dumazet [Tue, 28 Oct 2025 11:58:01 +0000 (12:58 +0100)]
tcp: add newval parameter to tcp_rcvbuf_grow()
This patch has no functional change, and prepares the following one.
tcp_rcvbuf_grow() will need to have access to tp->rcvq_space.space
old and new values.
Change mptcp_rcvbuf_grow() in a similar way.
Signed-off-by: Eric Dumazet <edumazet@google.com>
[ Moved 'oldval' declaration to the next patch to avoid warnings at
build time. ] Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20251028-net-tcp-recv-autotune-v3-3-74b43ba4c84c@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Tue, 28 Oct 2025 11:57:59 +0000 (12:57 +0100)]
mptcp: fix subflow rcvbuf adjust
The mptcp PM can add subflow to the conn_list before tcp_init_transfer().
Calling tcp_rcvbuf_grow() on such subflow is not correct as later
init will overwrite the update.
Fix the issue calling tcp_rcvbuf_grow() only after init buffer
initialization.
For ice, Grzegorz fixes setting of PHY lane number and logical PF ID for
E82x devices. He also corrects access of CGU (Clock Generation Unit) on
dual complex devices.
Kohei Enju resolves issues with error path cleanup for probe when in
recovery mode on ixgbe and ensures PHY is powered on for link testing
on igc. Lastly, he converts incorrect use of -ENOTSUPP to -EOPNOTSUPP
on igb, igc, and ixgbe.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
ixgbe: use EOPNOTSUPP instead of ENOTSUPP in ixgbe_ptp_feature_enable()
igc: use EOPNOTSUPP instead of ENOTSUPP in igc_ethtool_get_sset_count()
igb: use EOPNOTSUPP instead of ENOTSUPP in igb_get_sset_count()
igc: power up the PHY before the link test
ixgbe: fix memory leak and use-after-free in ixgbe_recovery_probe()
ice: fix usage of logical PF id
ice: fix destination CGU for dual complex E825
ice: fix lane number calculation
====================
Jakub Kicinski [Thu, 30 Oct 2025 00:18:24 +0000 (17:18 -0700)]
Merge branch 'net-stmmac-hwif-c-cleanups'
Russell King says:
====================
net: stmmac: hwif.c cleanups
This series cleans up hwif.c:
- move the reading of the version information out of stmmac_hwif_init()
into its own function, stmmac_get_version(), storing the result in a
new struct.
- simplify stmmac_get_version().
- read the version register once, passing it to stmmac_get_id() and
stmmac_get_dev_id().
- move stmmac_get_id() and stmmac_get_dev_id() into
stmmac_get_version()
- define version register fields and use FIELD_GET() to decode
- start tackling the big loop in stmmac_hwif_init() - provide a
function, stmmac_hwif_find(), which looks up the hwif entry, thus
making a much smaller loop, which improves readability of this code.
- change the use of '^' to '!=' when comparing the dev_id, which is
what is really meant here.
- reorganise the test after calling stmmac_hwif_init() so that we
handle the error case in the indented code, and the success case
with no indent, which is the classical arrangement.
====================
Reorganise stmmac_hwif_init() to handle the error case of
stmmac_hwif_find() in the indented block, which follows normal
programming pattern.
Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1vDtfG-0000000CCCX-2YwQ@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: stmmac: use != rather than ^ for comparing dev_id
Use the more usual not-equals rather than exclusive-or operator when
comparing the dev_id in stmmac_hwif_find().
Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1vDtfB-0000000CCCR-25rr@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Provide a function to lookup the hwif entry given the core type,
Synopsys version, and device ID (used for XGMAC cores).
Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1vDtf6-0000000CCCL-1cQA@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Provide field definitions in common.h, and use these with FIELD_GET()
to extract the fields from the version register.
Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1vDtf1-0000000CCCF-0uUV@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: stmmac: move stmmac_get_*id() into stmmac_get_version()
Move the contents of both stmmac_get_id() and stmmac_get_dev_id() into
stmmac_get_version() as it no longer makes sense for these to be
separate functions.
Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1vDtew-0000000CCC9-0KeM@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>