Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Mon, 30 Jun 2025 12:19:29 +0000 (12:19 +0000)]
net: dst: annotate data-races around dst->output
dst_dev_put() can overwrite dst->output while other
cpus might read this field (for instance from dst_output())
Add READ_ONCE()/WRITE_ONCE() annotations to suppress
potential issues.
We will likely need RCU protection in the future.
Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Mon, 30 Jun 2025 12:19:28 +0000 (12:19 +0000)]
net: dst: annotate data-races around dst->input
dst_dev_put() can overwrite dst->input while other
cpus might read this field (for instance from dst_input())
Add READ_ONCE()/WRITE_ONCE() annotations to suppress
potential issues.
We will likely need full RCU protection later.
Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
____cacheline_aligned_in_smp on small fields like
tcp_memory_allocated and udp_memory_allocated is not good enough.
It makes sure to put these fields at the start of a cache line,
but does not prevent the linker from using the cache line for other
fields, with potential performance impact.
Eric Dumazet [Mon, 30 Jun 2025 09:35:40 +0000 (09:35 +0000)]
udp: move udp_memory_allocated into net_aligned_data
____cacheline_aligned_in_smp attribute only makes sure to align
a field to a cache line. It does not prevent the linker to use
the remaining of the cache line for other variables, causing
potential false sharing.
Eric Dumazet [Mon, 30 Jun 2025 09:35:39 +0000 (09:35 +0000)]
tcp: move tcp_memory_allocated into net_aligned_data
____cacheline_aligned_in_smp attribute only makes sure to align
a field to a cache line. It does not prevent the linker to use
the remaining of the cache line for other variables, causing
potential false sharing.
Move tcp_memory_allocated into a dedicated cache line.
zhangjianrong [Sat, 28 Jun 2025 09:38:13 +0000 (17:38 +0800)]
net: thunderbolt: Enable end-to-end flow control also in transmit
According to USB4 specification, if E2E flow control is disabled for
the Transmit Descriptor Ring, the Host Interface Adapter Layer shall
not require any credits to be available before transmitting a Tunneled
Packet from this Transmit Descriptor Ring, so e2e flow control should
be enabled in both directions.
zhangjianrong [Sat, 28 Jun 2025 09:49:20 +0000 (17:49 +0800)]
net: thunderbolt: Fix the parameter passing of tb_xdomain_enable_paths()/tb_xdomain_disable_paths()
According to the description of tb_xdomain_enable_paths(), the third
parameter represents the transmit ring and the fifth parameter represents
the receive ring. tb_xdomain_disable_paths() is the same case.
[Jakub] Mika says: it works now because both rings ->hop is the same
Uwe Kleine-König [Mon, 30 Jun 2025 16:44:07 +0000 (18:44 +0200)]
net: atlantic: Rename PCI driver struct to end in _driver
This is not only a cosmetic change because the section mismatch checks
(implemented in scripts/mod/modpost.c) also depend on the object's name
and for drivers the checks are stricter than for ops.
However aq_pci_driver also passes the stricter checks just fine, so no
further changes needed.
The cheating^Wmisleading name was introduced in commit 97bde5c4f909
("net: ethernet: aquantia: Support for NIC-specific code")
Lucien.Jheng [Mon, 30 Jun 2025 15:41:47 +0000 (23:41 +0800)]
net: phy: air_en8811h: Introduce resume/suspend and clk_restore_context to ensure correct CKO settings after network interface reinitialization.
If the user reinitializes the network interface, the PHY will reinitialize,
and the CKO settings will revert to their initial configuration(be enabled).
To prevent CKO from being re-enabled,
en8811h_clk_restore_context and en8811h_resume were added
to ensure the CKO settings remain correct.
net: dsa: hellcreek: Constify struct devlink_region_ops and struct hellcreek_fdb_entry
'struct devlink_region_ops' and 'struct hellcreek_fdb_entry' are not
modified in this driver.
Constifying these structures moves some data to a read-only section, so
increases overall security, especially when the structure holds some
function pointers.
On a x86_64, with allmodconfig:
Before:
======
text data bss dec hex filename
55320 19216 320 74856 12468 drivers/net/dsa/hirschmann/hellcreek.o
After:
=====
text data bss dec hex filename
55960 18576 320 74856 12468 drivers/net/dsa/hirschmann/hellcreek.o
====================
seg6: fix typos in comments within the SRv6 subsystem
In this patchset, we correct some typos found both in the SRv6 Endpoints
implementation (i.e., seg6local) and in some SRv6 selftests, using
codespell.
====================
Use kcalloc() instead of hand writing it. This is less verbose.
Also move the initialization of 'count' to save some LoC.
On a x86_64, with allmodconfig, as an example:
Before:
======
text data bss dec hex filename
18652 5920 64 24636 603c drivers/net/dsa/mv88e6xxx/devlink.o
After:
=====
text data bss dec hex filename
18498 5920 64 24482 5fa2 drivers/net/dsa/mv88e6xxx/devlink.o
net: dsa: mv88e6xxx: Constify struct devlink_region_ops and struct mv88e6xxx_region
'struct devlink_region_ops' and 'struct mv88e6xxx_region' are not modified
in this driver.
Constifying these structures moves some data to a read-only section, so
increases overall security, especially when the structure holds some
function pointers.
On a x86_64, with allmodconfig, as an example:
Before:
======
text data bss dec hex filename
18076 6496 64 24636 603c drivers/net/dsa/mv88e6xxx/devlink.o
After:
=====
text data bss dec hex filename
18652 5920 64 24636 603c drivers/net/dsa/mv88e6xxx/devlink.o
Eric Work [Sun, 29 Jun 2025 05:15:28 +0000 (22:15 -0700)]
net: atlantic: add set_power to fw_ops for atl2 to fix wol
Aquantia AQC113(C) using ATL2FW doesn't properly prepare the NIC for
enabling wake-on-lan. The FW operation `set_power` was only implemented
for `hw_atl` and not `hw_atl2`. Implement the `set_power` functionality
for `hw_atl2`.
Tested with both AQC113 and AQC113C devices. Confirmed you can shutdown
the system and wake from S5 using magic packets. NIC was previously
powered off when entering S5. If the NIC was configured for WOL by the
Windows driver, loading the atlantic driver would disable WOL.
Partially cherry-picks changes from commit,
https://github.com/Aquantia/AQtion/commit/37bd5cc
Attributing original authors from Marvell for the referenced commit.
Closes: https://github.com/Aquantia/AQtion/issues/70 Co-developed-by: Igor Russkikh <irusskikh@marvell.com> Co-developed-by: Mark Starovoitov <mstarovoitov@marvell.com> Co-developed-by: Dmitry Bogdanov <dbogdanov@marvell.com> Co-developed-by: Pavel Belous <pbelous@marvell.com> Co-developed-by: Nikita Danilov <ndanilov@marvell.com> Signed-off-by: Eric Work <work.eric@gmail.com> Reviewed-by: Igor Russkikh <irusskikh@marvell.com> Link: https://patch.msgid.link/20250629051535.5172-1-work.eric@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Haiyang Zhang [Fri, 27 Jun 2025 20:26:23 +0000 (13:26 -0700)]
net: mana: Handle Reset Request from MANA NIC
Upon receiving the Reset Request, pause the connection and clean up
queues, wait for the specified period, then resume the NIC.
In the cleanup phase, the HWC is no longer responding, so set hwc_timeout
to zero to skip waiting on the response.
Oleksij Rempel [Fri, 27 Jun 2025 11:25:39 +0000 (13:25 +0200)]
phy: micrel: add Signal Quality Indicator (SQI) support for KSZ9477 switch PHYs
Add support for the Signal Quality Indicator (SQI) feature on KSZ9477
family switches, providing a relative measure of receive signal quality.
The hardware exposes separate SQI readings per channel. For 1000BASE-T,
all four channels are read. For 100BASE-TX, only one channel is reported,
but which receive pair is active depends on Auto MDI-X negotiation, which
is not exposed by the hardware. Therefore, it is not possible to reliably
map the measured channel to a specific wire pair.
This resolves an earlier discussion about how to handle multi-channel
SQI. Originally, the plan was to expose all channels individually.
However, since pair mapping is sometimes unavailable, this
implementation treats SQI as a per-link metric instead. This fallback
avoids ambiguity and ensures consistent behavior. The existing get_sqi()
UAPI was designed for single-pair Ethernet (SPE), where per-pair and
per-link are effectively equivalent. Restricting its use to per-link
metrics does not introduce regressions for existing users.
The raw 7-bit SQI value (0–127, lower is better) is converted to the
standard 0–7 (high is better) scale. Empirical testing showed that the
link becomes unstable around a raw value of 8.
The SQI raw value remains zero if no data is received, even if noise is
present. This confirms that the measurement reflects the "quality" during
active data reception rather than the passive line state. User space
must ensure that traffic is present on the link to obtain valid SQI
readings.
Jakub Kicinski [Mon, 30 Jun 2025 15:40:53 +0000 (08:40 -0700)]
net: ethtool: fix leaking netdev ref if ethnl_default_parse() failed
Ido spotted that I made a mistake in commit under Fixes,
ethnl_default_parse() may acquire a dev reference even when it returns
an error. This may have been driven by the code structure in dumps
(which unconditionally release dev before handling errors), but it's
too much of a trap. Functions should undo what they did before returning
an error, rather than expecting caller to clean up.
Rather than fixing ethnl_default_set_doit() directly make
ethnl_default_parse() clean up errors.
Fushuai Wang [Sat, 28 Jun 2025 05:10:33 +0000 (13:10 +0800)]
sfc: siena: eliminate xdp_rxq_info_valid using XDP base API
Commit d48523cb88e0 ("sfc: Copy shared files needed for Siena (part 2)")
use xdp_rxq_info_valid to track failures of xdp_rxq_info_reg().
However, this driver-maintained state becomes redundant since the XDP
framework already provides xdp_rxq_info_is_reg() for checking registration
status.
Signed-off-by: Fushuai Wang <wangfushuai@baidu.com> Acked-by: Edward Cree <ecree.xilinx@gmail.com> Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://patch.msgid.link/20250628051033.51133-1-wangfushuai@baidu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fushuai Wang [Sat, 28 Jun 2025 05:10:16 +0000 (13:10 +0800)]
sfc: eliminate xdp_rxq_info_valid using XDP base API
Commit eb9a36be7f3e ("sfc: perform XDP processing on received packets")
use xdp_rxq_info_valid to track failures of xdp_rxq_info_reg().
However, this driver-maintained state becomes redundant since the XDP
framework already provides xdp_rxq_info_is_reg() for checking registration
status.
Signed-off-by: Fushuai Wang <wangfushuai@baidu.com> Acked-by: Edward Cree <ecree.xilinx@gmail.com> Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://patch.msgid.link/20250628051016.51022-1-wangfushuai@baidu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
_Static_assert(ARRAY_SIZE(map) != IEEE8021Q_TT_MAX - 1) rejects only a
length of 7 and allows any other mismatch. Replace it with a strict
equality test via a helper macro so that every mapping table must have
exactly IEEE8021Q_TT_MAX (8) entries.
Oleksij Rempel [Thu, 26 Jun 2025 10:37:31 +0000 (12:37 +0200)]
net: usb: lan78xx: fix possible NULL pointer dereference in lan78xx_phy_init()
If no PHY device is found (e.g., for LAN7801 in fixed-link mode),
lan78xx_phy_init() may proceed to dereference a NULL phydev pointer,
leading to a crash.
Update the logic to perform MAC configuration first, then check for the presence
of a PHY. For the fixed-link case, set up the fixed link and return early,
bypassing any code that assumes a valid phydev pointer.
It is safe to move lan78xx_mac_prepare_for_phy() earlier because this function
only uses information from dev->interface, which is configured by
lan78xx_get_phy() beforehand. The function does not access phydev or any data
set up by later steps.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Fixes: e110bc825897 ("net: usb: lan78xx: Convert to PHYLINK for improved PHY and MAC management") Link: https://patch.msgid.link/20250626103731.3986545-1-o.rempel@pengutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Tue, 1 Jul 2025 07:49:21 +0000 (09:49 +0200)]
Merge branch 'clean-up-usage-of-ffi-types'
Tamir Duberstein says:
====================
Clean up usage of ffi types
Remove qualification of ffi types which are included in the prelude and
change `as` casts to target the proper ffi type alias rather than the
underlying primitive.
Eric Dumazet [Fri, 27 Jun 2025 16:32:42 +0000 (16:32 +0000)]
net: net->nsid_lock does not need BH safety
At the time of commit bc51dddf98c9 ("netns: avoid disabling irq
for netns id") peernet2id() was not yet using RCU.
Commit 2dce224f469f ("netns: protect netns
ID lookups with RCU") changed peernet2id() to no longer
acquire net->nsid_lock (potentially from BH context).
We do not need to block soft interrupts when acquiring
net->nsid_lock anymore.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Guillaume Nault <gnault@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250627163242.230866-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
net: enetc: change some statistics to 64-bit
The port MAC counters of ENETC are 64-bit registers and the statistics
of ethtool are also u64 type, so add enetc_port_rd64() helper function
to read 64-bit statistics from these registers, and also change the
statistics of ring to unsigned long type to be consistent with the
statistics type in struct net_device_stats.
Wei Fang [Fri, 27 Jun 2025 02:11:08 +0000 (10:11 +0800)]
net: enetc: read 64-bit statistics from port MAC counters
The counters of port MAC are all 64-bit registers, and the statistics of
ethtool are u64 type, so replace enetc_port_rd() with enetc_port_rd64()
to read 64-bit statistics.
Signed-off-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250627021108.3359642-4-wei.fang@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Wei Fang [Fri, 27 Jun 2025 02:11:07 +0000 (10:11 +0800)]
net: enetc: separate 64-bit counters from enetc_port_counters
Some counters in enetc_port_counters are 32-bit registers, and some are
64-bit registers. But in the current driver, they are all read through
enetc_port_rd(), which can only read a 32-bit value. Therefore, separate
64-bit counters (enetc_pm_counters) from enetc_port_counters and use
enetc_port_rd64() to read the 64-bit statistics.
Signed-off-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250627021108.3359642-3-wei.fang@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Wei Fang [Fri, 27 Jun 2025 02:11:06 +0000 (10:11 +0800)]
net: enetc: change the statistics of ring to unsigned long type
The statistics of the ring are all unsigned int type, so the statistics
will overflow quickly under heavy traffic. In addition, the statistics
of struct net_device_stats are obtained from struct enetc_ring_stats,
but the statistics of net_device_stats are unsigned long type. So it is
better to keep the statistics types consistent in these two structures.
Considering these two factors, and the fact that both LS1028A and i.MX95
are arm64 architecture, the statistics of enetc_ring_stats are changed
to unsigned long type. Note that unsigned int and unsigned long are the
same thing on some systems, and on such systems there is no overflow
advantage of one over the other.
Signed-off-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250627021108.3359642-2-wei.fang@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jonas Rebmann [Thu, 26 Jun 2025 13:44:02 +0000 (15:44 +0200)]
net: fec: allow disable coalescing
In the current implementation, IP coalescing is always enabled and
cannot be disabled.
As setting maximum frames to 0 or 1, or setting delay to zero implies
immediate delivery of single packets/IRQs, disable coalescing in
hardware in these cases.
This also guarantees that coalescing is never enabled with ICFT or ICTT
set to zero, a configuration that could lead to unpredictable behaviour
according to i.MX8MP reference manual.
====================
Add support for externally validated neighbor entries
Patch #1 adds a new neighbor flag ("extern_valid") that prevents the
kernel from invalidating or removing a neighbor entry, while allowing
the kernel to notify user space when the entry becomes reachable. See
motivation and implementation details in the commit message.
Ido Schimmel [Thu, 26 Jun 2025 07:31:11 +0000 (10:31 +0300)]
selftests: net: Add a selftest for externally validated neighbor entries
Add test cases for externally validated neighbor entries, testing both
IPv4 and IPv6. Name the file "test_neigh.sh" so that it could be
possibly extended in the future with more neighbor test cases.
Example output:
# ./test_neigh.sh
TEST: IPv4 "extern_valid" flag: Add entry [ OK ]
TEST: IPv4 "extern_valid" flag: Add with an invalid state [ OK ]
TEST: IPv4 "extern_valid" flag: Add with "use" flag [ OK ]
TEST: IPv4 "extern_valid" flag: Replace entry [ OK ]
TEST: IPv4 "extern_valid" flag: Replace entry with "managed" flag [ OK ]
TEST: IPv4 "extern_valid" flag: Replace with an invalid state [ OK ]
TEST: IPv4 "extern_valid" flag: Interface down [ OK ]
TEST: IPv4 "extern_valid" flag: Carrier down [ OK ]
TEST: IPv4 "extern_valid" flag: Transition to "reachable" state [ OK ]
TEST: IPv4 "extern_valid" flag: Transition back to "stale" state [ OK ]
TEST: IPv4 "extern_valid" flag: Forced garbage collection [ OK ]
TEST: IPv4 "extern_valid" flag: Periodic garbage collection [ OK ]
TEST: IPv6 "extern_valid" flag: Add entry [ OK ]
TEST: IPv6 "extern_valid" flag: Add with an invalid state [ OK ]
TEST: IPv6 "extern_valid" flag: Add with "use" flag [ OK ]
TEST: IPv6 "extern_valid" flag: Replace entry [ OK ]
TEST: IPv6 "extern_valid" flag: Replace entry with "managed" flag [ OK ]
TEST: IPv6 "extern_valid" flag: Replace with an invalid state [ OK ]
TEST: IPv6 "extern_valid" flag: Interface down [ OK ]
TEST: IPv6 "extern_valid" flag: Carrier down [ OK ]
TEST: IPv6 "extern_valid" flag: Transition to "reachable" state [ OK ]
TEST: IPv6 "extern_valid" flag: Transition back to "stale" state [ OK ]
TEST: IPv6 "extern_valid" flag: Forced garbage collection [ OK ]
TEST: IPv6 "extern_valid" flag: Periodic garbage collection [ OK ]
Ido Schimmel [Thu, 26 Jun 2025 07:31:10 +0000 (10:31 +0300)]
neighbor: Add NTF_EXT_VALIDATED flag for externally validated entries
tl;dr
=====
Add a new neighbor flag ("extern_valid") that can be used to indicate to
the kernel that a neighbor entry was learned and determined to be valid
externally. The kernel will not try to remove or invalidate such an
entry, leaving these decisions to the user space control plane. This is
needed for EVPN multi-homing where a neighbor entry for a multi-homed
host needs to be synced across all the VTEPs among which the host is
multi-homed.
Background
==========
In a typical EVPN multi-homing setup each host is multi-homed using a
set of links called ES (Ethernet Segment, i.e., LAG) to multiple leaf
switches (VTEPs). VTEPs that are connected to the same ES are called ES
peers.
When a neighbor entry is learned on a VTEP, it is distributed to both ES
peers and remote VTEPs using EVPN MAC/IP advertisement routes. ES peers
use the neighbor entry when routing traffic towards the multi-homed host
and remote VTEPs use it for ARP/NS suppression.
Motivation
==========
If the ES link between a host and the VTEP on which the neighbor entry
was locally learned goes down, the EVPN MAC/IP advertisement route will
be withdrawn and the neighbor entries will be removed from both ES peers
and remote VTEPs. Routing towards the multi-homed host and ARP/NS
suppression can fail until another ES peer locally learns the neighbor
entry and distributes it via an EVPN MAC/IP advertisement route.
"draft-rbickhart-evpn-ip-mac-proxy-adv-03" [1] suggests avoiding these
intermittent failures by having the ES peers install the neighbor
entries as before, but also injecting EVPN MAC/IP advertisement routes
with a proxy indication. When the previously mentioned ES link goes down
and the original EVPN MAC/IP advertisement route is withdrawn, the ES
peers will not withdraw their neighbor entries, but instead start aging
timers for the proxy indication.
If an ES peer locally learns the neighbor entry (i.e., it becomes
"reachable"), it will restart its aging timer for the entry and emit an
EVPN MAC/IP advertisement route without a proxy indication. An ES peer
will stop its aging timer for the proxy indication if it observes the
removal of the proxy indication from at least one of the ES peers
advertising the entry.
In the event that the aging timer for the proxy indication expired, an
ES peer will withdraw its EVPN MAC/IP advertisement route. If the timer
expired on all ES peers and they all withdrew their proxy
advertisements, the neighbor entry will be completely removed from the
EVPN fabric.
Implementation
==============
In the above scheme, when the control plane (e.g., FRR) advertises a
neighbor entry with a proxy indication, it expects the corresponding
entry in the data plane (i.e., the kernel) to remain valid and not be
removed due to garbage collection or loss of carrier. The control plane
also expects the kernel to notify it if the entry was learned locally
(i.e., became "reachable") so that it will remove the proxy indication
from the EVPN MAC/IP advertisement route. That is why these entries
cannot be programmed with dummy states such as "permanent" or "noarp".
Instead, add a new neighbor flag ("extern_valid") which indicates that
the entry was learned and determined to be valid externally and should
not be removed or invalidated by the kernel. The kernel can probe the
entry and notify user space when it becomes "reachable" (it is initially
installed as "stale"). However, if the kernel does not receive a
confirmation, have it return the entry to the "stale" state instead of
the "failed" state.
In other words, an entry marked with the "extern_valid" flag behaves
like any other dynamically learned entry other than the fact that the
kernel cannot remove or invalidate it.
One can argue that the "extern_valid" flag should not prevent garbage
collection and that instead a neighbor entry should be programmed with
both the "extern_valid" and "extern_learn" flags. There are two reasons
for not doing that:
1. Unclear why a control plane would like to program an entry that the
kernel cannot invalidate but can completely remove.
2. The "extern_learn" flag is used by FRR for neighbor entries learned
on remote VTEPs (for ARP/NS suppression) whereas here we are
concerned with local entries. This distinction is currently irrelevant
for the kernel, but might be relevant in the future.
Given that the flag only makes sense when the neighbor has a valid
state, reject attempts to add a neighbor with an invalid state and with
this flag set. For example:
# ip neigh add 192.0.2.1 nud none dev br0.10 extern_valid
Error: Cannot create externally validated neighbor with an invalid state.
# ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid
# ip neigh replace 192.0.2.1 nud failed dev br0.10 extern_valid
Error: Cannot mark neighbor as externally validated with an invalid state.
The above means that a neighbor cannot be created with the
"extern_valid" flag and flags such as "use" or "managed" as they result
in a neighbor being created with an invalid state ("none") and
immediately getting probed:
# ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
Error: Cannot create externally validated neighbor with an invalid state.
However, these flags can be used together with "extern_valid" after the
neighbor was created with a valid state:
# ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid
# ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use
One consequence of preventing the kernel from invalidating a neighbor
entry is that by default it will only try to determine reachability
using unicast probes. This can be changed using the "mcast_resolicit"
sysctl:
====================
net: ethtool: consistently take rss_lock for all rxfh ops
I'd like to bring RXFH and RXFHINDIR ioctls under a single set of
Netlink ops. It appears that while core takes the ethtool->rss_lock
around some of the RXFHINDIR ops, drivers (sfc) take it internally
for the RXFH.
Consistently take the lock around all ops and accesses to the XArray
within the core. This should hopefully make the rss_lock a lot less
confusing.
====================
Jakub Kicinski [Thu, 26 Jun 2025 20:28:48 +0000 (13:28 -0700)]
net: ethtool: move get_rxfh callback under the rss_lock
We already call get_rxfh under the rss_lock when we read back
context state after changes. Let's be consistent and always
hold the lock. The existing callers are all under rtnl_lock
so this should make no difference in practice, but it makes
the locking rules far less confusing IMHO. Any RSS callback
and any access to the RSS XArray should hold the lock.
Jakub Kicinski [Thu, 26 Jun 2025 20:28:47 +0000 (13:28 -0700)]
net: ethtool: move rxfh_fields callbacks under the rss_lock
Netlink code will want to perform the RSS_SET operation atomically
under the rss_lock. sfc wants to hold the rss_lock in rxfh_fields_get,
which makes that difficult. Lets move the locking up to the core
so that for all driver-facing callbacks rss_lock is taken consistently
by the core.
Jakub Kicinski [Thu, 26 Jun 2025 20:28:46 +0000 (13:28 -0700)]
net: ethtool: take rss_lock for all rxfh changes
Always take the rss_lock in ethtool_set_rxfh(). We will want to
make a similar change in ethtool_set_rxfh_fields() and some
drivers lock that callback regardless of rss context ID being set.
Having some callbacks locked unconditionally and some only if
context ID is set would be very confusing.
ethtool handling is under rtnl_lock, so rss_lock is very unlikely
to ever be congested.
Jakub Kicinski [Thu, 26 Jun 2025 23:39:26 +0000 (16:39 -0700)]
net: ethtool: avoid OOB accesses in PAUSE_SET
We now reuse .parse_request() from GET on SET, so we need to make sure
that the policies for both cover the attributes used for .parse_request().
genetlink will only allocate space in info->attrs for ARRAY_SIZE(policy).
Reported-by: syzbot+430f9f76633641a62217@syzkaller.appspotmail.com Fixes: 963781bdfe20 ("net: ethtool: call .parse_request for SET handlers") Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250626233926.199801-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fushuai Wang [Thu, 26 Jun 2025 05:30:03 +0000 (13:30 +0800)]
net/mlx5e: Fix error handling in RQ memory model registration
Currently when xdp_rxq_info_reg_mem_model() fails in the XSK path, the
error handling incorrectly jumps to err_destroy_page_pool. While this
may not cause errors, we should make it jump to the correct location.
Signed-off-by: Fushuai Wang <wangfushuai@baidu.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Acked-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Carpenter [Wed, 25 Jun 2025 15:23:05 +0000 (10:23 -0500)]
octeontx2-af: Fix error code in rvu_mbox_init()
The error code was intended to be -EINVAL here, but it was accidentally
changed to returning success. Set the error code.
Fixes: e53ee4acb220 ("octeontx2-af: CN20k basic mbox operations and structures") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
Octeontx2-pf: extend link modes support
This series of patches adds multi advertise mode support along with
other improvements in link mode management code flow.
Patch1: Currently all SGMII modes 10/100/1000baseT are mapped with
single firmware mode. This patch updates these link modes
with corresponding firmware modes.
Patch2: Due to limitation in current kernel <-> firmware communication,
link modes are divided into multiple groups, and identified
with their group index.
Patch3: Adds support for multi advertise mode.
====================
Hariprasad Kelam [Wed, 25 Jun 2025 09:21:07 +0000 (14:51 +0530)]
Octeontx2-pf: ethtool: support multi advertise mode
Current implementation considers only first advertise
mode and passes the same to firmware to process.
This patch extends code such that user can advertise
multiple modes on the given interface.
Below are high level changes:
1. Remove unnecessary speed/duplex/autoneg validation as its
already verified as part of "set_link_ksettings"
2. Since scratch csr framework designed to support single mode at a time,
use "shared firmware data" for multi mode support.
The existing MODE_ID bitmap can only support up to 42 modes.
To resolve the issue, the unused port field is modified as below
uint64_t reserved2:6;
uint64_t mode_group_idx:2;
'mode_group_idx' categorize the mode ID range to accommodate more modes.
To specify mode ID range of 0 - 41, this field will be 0.
To specify mode ID range of 42 - 83, this field will be 1.
mode ID will be still mentioned as 1 << (0 - 41). But the mode_group_idx
decides the actual mode range
Hariprasad Kelam [Wed, 25 Jun 2025 09:21:05 +0000 (14:51 +0530)]
Octeontx-pf: Update SGMII mode mapping
Current implementation maps ethtool link modes 10baseT/100baseT/1000baseT
to single firmware mode SGMII. This create a problem for end users who want
to advertise only one speed among them.
This patch addresses the issue by mapping each ethtool link mode
to a corresponding firmware mode also updates new modes supported
by firmware.
The device may support the Reference SYNC feature, which allows the
combination of two inputs into a input pair. In this configuration,
clock signals from both inputs are used to synchronize the DPLL device.
The higher frequency signal is utilized for the loop bandwidth of the DPLL,
while the lower frequency signal is used to syntonize the output signal of
the DPLL device. This feature enables the provision of a high-quality loop
bandwidth signal from an external source.
A capable input provides a list of inputs that can be bound with to create
Reference SYNC. To control this feature, the user must request a
desired state for a target pin: use ``DPLL_PIN_STATE_CONNECTED`` to
enable or ``DPLL_PIN_STATE_DISCONNECTED`` to disable the feature. An input
pin can be bound to only one other pin at any given time.
Add new netlink attribute to allow user space configuration of reference
sync pin pairs, where both pins are used to provide one clock signal
consisting of both: base frequency and sync signal.
These are the remaining patches from the "ice: Separate TSPLL from PTP
and cleanup" series [1] with control flow macros removed. What remains
are cleanups and some minor improvements.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
ice: default to TIME_REF instead of TXCO on E825-C
ice: move TSPLL init calls to ice_ptp.c
ice: fall back to TCXO on TSPLL lock fail
ice: wait before enabling TSPLL
ice: add multiple TSPLL helpers
ice: use bitfields instead of unions for CGU regs
ice: read TSPLL registers again before reporting status
ice: clear time_sync_en field for E825-C during reprogramming
====================
Jakub Kicinski [Thu, 26 Jun 2025 16:54:41 +0000 (09:54 -0700)]
eth: bnxt: take page size into account for page pool recycling rings
The Rx rings are filled with Rx buffers. Which are supposed to fit
packet headers (or MTU if HW-GRO is disabled). The aggregation buffers
are filled with "device pages". Adjust the sizes of the page pool
recycling ring appropriately, based on ratio of the size of the
buffer on given ring vs system page size. Otherwise on a system
with 64kB pages we end up with >700MB of memory sitting in every
single page pool cache.
Correct the size calculation for the head_pool. Since the buffers
there are always small I'm pretty sure I meant to cap the size
at 1k, rather than make it the lowest possible size. With 64k pages
1k cache with a 1k ring is 64x larger than we need.
xin.guo [Thu, 26 Jun 2025 12:34:19 +0000 (12:34 +0000)]
tcp: fix tcp_ofo_queue() to avoid including too much DUP SACK range
If the new coming segment covers more than one skbs in the ofo queue,
and which seq is equal to rcv_nxt, then the sequence range
that is duplicated will be sent as DUP SACK, the detail as below,
in step6, the {501,2001} range is clearly including too much
DUP SACK range, in violation of RFC 2883 rules.
Jakub Kicinski [Fri, 27 Jun 2025 22:14:58 +0000 (15:14 -0700)]
Merge branch 'net-dsa-ks8995-fix-up-bindings'
Linus Walleij says:
====================
net: dsa: ks8995: Fix up bindings
After looking at the datasheets for KS8995 I realized this is
a DSA switch and need to have DT bindings as such and be implemented
as such.
This series just fixes up the bindings and the offending device tree.
The existing kernel driver which is in drivers/net/phy/spi_ks8995.c
does not implement DSA. It can be forgiven for this because it was
merged in 2011 and the DSA framework was not widely established
back then. It continues to probe fine but needs to be rewritten
to use the special DSA tag and moved to drivers/net/dsa as time
permits. (I hope I can do this.)
It's fine for the networking tree to merge both patches, I maintain
ixp4xx as well. But I can also carry the second patch through the
SoC tree if so desired.
Linus Walleij [Wed, 25 Jun 2025 06:51:25 +0000 (08:51 +0200)]
ARM: dts: Fix up wrv54g device tree
Fix up the KS8995 switch and PHYs the way that is most likely:
- Phy 1-4 is certainly the PHYs of the KS8995 (mask 0x1e in
the outoftree code masks PHYs 1,2,3,4).
- Phy 5 is the MII-P5 separate WAN phy of the KS8995 directly
connected to EthC.
- The EthB MII is probably connected as CPU interface to the
KS8995.
Properly integrate the KS8995 switch using the new bindings.
Linus Walleij [Wed, 25 Jun 2025 06:51:24 +0000 (08:51 +0200)]
dt-bindings: dsa: Rewrite Micrel KS8995 in schema
After studying the datasheets for some of the KS8995 variants
it becomes pretty obvious that this is a straight-forward
and simple MII DSA switch with one port in (CPU) and four outgoing
ports, and it even supports custom tags by setting a bit in
a special register, and elaborate VLAN handling as all DSA
switches do.
What is a bit odd with KS8995 is that it uses an extra MII-P5
port to access one of the PHYs separately, on the side of the
switch fabric, such as when using a WAN port separately from
a LAN switch in a home router.
Rewrite the terse bindings to YAML, and move to the proper
subdirectory. Include a verbose example to make things clear.
The Allwinner A100/A133 has an Ethernet MAC (EMAC) controller that is
compatible with the A64 one. It features the same syscon registers for
control of the top-level integration of the unit.
====================
NFC: trf7970a: Add option to reduce antenna gain
The TRF7970a device is sensitive to RF disturbances, which can make it
hard to pass some EMC immunity tests. By reducing the RX antenna gain,
the device becomes less sensitive to EMC disturbances, as a trade-off
against antenna performance.
====================
Paul Geurts [Thu, 26 Jun 2025 14:12:42 +0000 (16:12 +0200)]
NFC: trf7970a: Create device-tree parameter for RX gain reduction
The TRF7970a device is sensitive to RF disturbances, which can make it
hard to pass some EMC immunity tests. By reducing the RX antenna gain,
the device becomes less sensitive to EMC disturbances, as a trade-off
against antenna performance.
Add a device tree option to select RX gain reduction to improve EMC
performance.
Selecting a communication standard in the ISO control register resets
the RX antenna gain settings. Therefore set the RX gain reduction
everytime the ISO control register changes, when the option is used.
Jeff Layton [Thu, 26 Jun 2025 12:52:14 +0000 (08:52 -0400)]
ref_tracker: do xarray and workqueue job initializations earlier
The kernel test robot reported an oops that occurred when attempting to
deregister a dentry from the xarray during subsys_initcall().
The ref_tracker xarrays and workqueue job are being initialized in
late_initcall() which is too late. Move those to postcore_initcall()
instead.
Fixes: 65b584f53611 ("ref_tracker: automatically register a file in debugfs for a ref_tracker_dir") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202506251406.c28f2adb-lkp@intel.com Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250626-reftrack-dbgfs-v1-1-812102e2a394@kernel.org
Simon Horman [Wed, 25 Jun 2025 12:52:10 +0000 (13:52 +0100)]
tg3: spelling corrections
Correct spelling as flagged by codespell.
Signed-off-by: Simon Horman <horms@kernel.org> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: mdio: Add MDIO bus controller for Airoha AN7583
Airoha AN7583 SoC have 2 dedicated MDIO bus controller in the SCU
register map. To driver register an MDIO controller based on the DT
reg property and access the register by accessing the parent syscon.
The MDIO bus logic is similar to the MT7530 internal MDIO bus but
deviates of some setting and some HW bug.
On Airoha AN7583 the MDIO clock is set to 25MHz by default and needs to
be correctly setup to 2.5MHz to correctly work (by setting the divisor
to 10x).
There seems to be Hardware bug where AN7583_MII_RWDATA
is not wiped in the context of unconnected PHY and the
previous read value is returned.
Example: (only one PHY on the BUS at 0x1f)
- read at 0x1f report at 0x2 0x7500
- read at 0x0 report 0x7500 on every address
To workaround this, we reset the Mdio BUS at every read
to have consistent values on read operation.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
dt-bindings: net: Document support for Airoha AN7583 MDIO Controller
Airoha AN7583 SoC have 3 different MDIO Controller. One comes from
the intergated Switch based on MT7530. The other 2 live under the SCU
register and expose 2 dedicated MDIO controller.
Document the schema for the 2 dedicated MDIO controller.
Each MDIO controller can be independently reset with the SoC reset line.
Each MDIO controller have a dedicated clock configured to 2.5MHz by
default to follow MDIO bus IEEE 802.3 standard.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
ptp: Belated spring cleaning of the chardev driver
When looking into supporting auxiliary clocks in the PTP ioctl, the
inpenetrable ptp_ioctl() letter soup bothered me enough to clean it up.
The code (~400 lines!) is really hard to follow due to a gazillion of
local variables, which are only used in certain case scopes, and a
mixture of gotos, breaks and direct error return paths.
Clean it up by splitting out the IOCTL functionality into seperate
functions, which contain only the required local variables and are trivial
to follow. Complete the cleanup by converting the code to lock guards and
get rid of all gotos.
That reduces the code size by 48 lines and also the binary text size is
80 bytes smaller than the current maze.
The series is split up into one patch per IOCTL command group for easy
review.
Thomas Gleixner [Wed, 25 Jun 2025 11:52:39 +0000 (13:52 +0200)]
ptp: Simplify ptp_read()
The mixture of gotos and direct return codes is inconsistent and just makes
the code harder to read. Let it consistently return error codes directly and
tidy the code flow up accordingly.
Thomas Gleixner [Wed, 25 Jun 2025 11:52:38 +0000 (13:52 +0200)]
ptp: Convert chardev code to lock guards
Convert the various spin_lock_irqsave() protected critical regions to
scoped guards. Use spinlock_irq instead of spinlock_irqsave as all the
functions are invoked in thread context with interrupts enabled.
Thomas Gleixner [Wed, 25 Jun 2025 11:52:34 +0000 (13:52 +0200)]
ptp: Split out PTP_PIN_SETFUNC ioctl code
Continue the ptp_ioctl() cleanup by splitting out the PTP_PIN_SETFUNC ioctl
code into a helper function. Convert to lock guard while at it and remove
the pointless memset of the pd::rsv because nothing uses it.
Thomas Gleixner [Wed, 25 Jun 2025 11:52:33 +0000 (13:52 +0200)]
ptp: Split out PTP_PIN_GETFUNC ioctl code
Continue the ptp_ioctl() cleanup by splitting out the PTP_PIN_GETFUNC ioctl
code into a helper function. Convert to lock guard while at it and remove
the pointless memset of the pd::rsv because nothing uses it.