====================
net: dsa: mxl-gsw1xx: setup polarities and validate chip
Now that common PHY properties make it easy to configure the SerDes RX
and TX polarities, use that for the SGMII/1000Base-X/2500Base-X port of
the MaxLinear GSW1xx switches.
Also, validate hardware in probe() function to make sure the switch is
actually present and MDIO communication works properly.
====================
Daniel Golle [Sun, 1 Feb 2026 03:42:18 +0000 (03:42 +0000)]
net: dsa: mxl-gsw1xx: validate chip ID
No check for actually present hardware is being performed in the probe
function of the mxl-gsw1xx switch driver. So even if the switch isn't
present at the configured MDIO bus address the driver wrongly tells the
user that a "GSWIP version 0 mod 0" was found, outputting errors about
PHY capabilities not matching.
Read and validate the chip MANU_ID and PNUM_ID registers and output
information while probing, but return an error and abort probing in case
the hardware is not actually present.
Daniel Golle [Sun, 1 Feb 2026 03:42:00 +0000 (03:42 +0000)]
net: dsa: mxl-gsw1xx: configure SerDes port polarities
Configure SerDes (port 4) RX and TX polarities using the newly
introduced generic properties. The polarities are described at the port
level which equals the polarities of the external pins of the chip.
Note that the RX lane is inverted internally and the vendor driver
simply always sets bit GSW1XX_SGMII_PHY_RX0_CFG2_INVERT unconditionally
to end up with the correct (ie. as documented in datasheets) polarity at
the external pins.
In this sense, PHY_POLARITY_NORMAL denotes normal polarity for pins as
documented for the MRQFN 105-pin package (GSW120, GSW125, GSW140, GSW141
and GSW145 all use the same package and have identical pin layouts
except for TP port 2 and 3 being N/C on GSW12x):
pin B18 (TX0_P) positive signal of the differential SGMII data output pair
pin B19 (TX0_M) negative signal of the differential SGMII data output pair
pin B20 (RX0_P) positive signal of the differential SGMII data input pair
pin B21 (RX0_M) negative signal of the differential SGMII data input pair
====================
net: stats, tools, driver tests for HW GRO [part]
Add miscellaneous pieces related to production use of HW-GRO:
- report standard stats from drivers (bnxt included here,
Gal recently posted patches for mlx5 which is great)
- CLI tool for calculating HW GRO savings / effectiveness
====================
Jakub Kicinski [Sat, 7 Feb 2026 00:35:03 +0000 (16:35 -0800)]
tools: ynltool: add qstats analysis for HW-GRO efficiency / savings
Extend ynltool to compute HW GRO savings metric - how many
packets has HW GRO been able to save the kernel from seeing.
Note that this definition does not actually take into account
whether the segments were or weren't eligible for HW GRO.
If a machine is receiving all-UDP traffic - new metric will show
HW-GRO savings of 0%. Conversely since the super-packet still
counts as a received packet, savings of 100% is not achievable.
Perfect HW-GRO on a machine with 4k MTU and 64kB super-frames
would show ~93.75% savings. With 1.5k MTU we may see up to
~97.8% savings (if my math is right).
Example after 10 sec of iperf on a freshly booted machine
with 1.5k MTU:
None of the NICs I have access to can report "missed" HW-GRO
opportunities so computing a true "effectiveness" metric
is not possible. One could also argue that effectiveness metric
is inferior in environments where we control both senders and
receivers, the savings metrics will capture both regressions
in receiver's HW GRO effectiveness but also regressions in senders
sending smaller TSO trains. And we care about both. The main
downside is that it's hard to tell at a glance how well the NIC
is doing because the savings will be dependent on traffic patterns.
Jakub Kicinski [Sat, 7 Feb 2026 00:35:01 +0000 (16:35 -0800)]
eth: bnxt: gather and report HW-GRO stats
Count and report HW-GRO stats as seen by the kernel.
The device stats for GRO seem to not reflect the reality,
perhaps they count sessions which did not actually result
in any aggregation. Also they count wire packets, so we
have to count super-frames, anyway.
Jakub Kicinski [Sat, 7 Feb 2026 04:48:35 +0000 (20:48 -0800)]
Merge branch 'big-tcp-without-hbh-in-ipv6'
Alice Mikityanska says:
====================
BIG TCP without HBH in IPv6
Resubmitting after the grace period.
This series is part 1 of "BIG TCP for UDP tunnels". Due to the number of
patches, I'm splitting it into two logical parts:
* Remove hop-by-hop header for BIG TCP IPv6 to align with BIG TCP IPv4.
* Fix up things that prevent BIG TCP from working with UDP tunnels.
The current BIG TCP IPv6 code inserts a hop-by-hop extension header with
32-bit length of the packet. When the packet is encapsulated, and either
the outer or the inner protocol is IPv6, or both are IPv6, there will be
1 or 2 HBH headers that need to be dealt with. The issues that arise:
1. The drivers don't strip it, and they'd all need to know the structure
of each tunnel protocol in order to strip it correctly, also taking into
account all combinations of IPv4/IPv6 inner/outer protocols.
2. Even if (1) is implemented, it would be an additional performance
penalty per aggregated packet.
3. The skb_gso_validate_network_len check is skipped in
ip6_finish_output_gso when IP6SKB_FAKEJUMBO is set, but it seems that it
would make sense to do the actual validation, just taking into account
the length of the HBH header. When the support for tunnels is added, it
becomes trickier, because there may be one or two HBH headers, depending
on whether it's IPv6 in IPv6 or not.
At the same time, having an HBH header to store the 32-bit length is not
strictly necessary, as BIG TCP IPv4 doesn't do anything like this and
just restores the length from skb->len. The same thing can be done for
BIG TCP IPv6. Removing HBH from BIG TCP would allow to simplify the
implementation significantly, and align it with BIG TCP IPv4, which has
been a long-standing goal.
====================
Now that the kernel doesn't insert HBH for BIG TCP IPv6 packets, remove
unnecessary steps from the bnxt_en TX path, that used to check and
remove HBH.
Signed-off-by: Alice Mikityanska <alice@isovalent.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20260205133925.526371-9-alice.kernel@fastmail.im Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that the kernel doesn't insert HBH for BIG TCP IPv6 packets, remove
unnecessary steps from the mlx5e and mlx5i TX path, that used to check
and remove HBH.
BIG TCP IPv6 inserts a hop-by-hop extension header to indicate the real
IPv6 payload length when it doesn't fit into the 16-bit field in the
IPv6 header itself. While it helps tools parse the packet, it also
requires every driver that supports TSO and BIG TCP to remove this
8-byte extension header. It might not sound that bad until we try to
apply it to tunneled traffic. Currently, the drivers don't attempt to
strip HBH if skb->encapsulation = 1. Moreover, trying to do so would
require dissecting different tunnel protocols and making corresponding
adjustments on case-by-case basis, which would slow down the fastpath
(potentially also requiring adjusting checksums in outer headers).
At the same time, BIG TCP IPv4 doesn't insert any extra headers and just
calculates the payload length from skb->len, significantly simplifying
implementing BIG TCP for tunnels.
Stop inserting HBH when building BIG TCP GSO SKBs.
The next commits will transition away from using the hop-by-hop
extension header to encode packet length for BIG TCP. Add wrappers
around ip6->payload_len that return the actual value if it's non-zero,
and calculate it from skb->len if payload_len is set to zero (and a
symmetrical setter).
The new helpers are used wherever the surrounding code supports the
hop-by-hop jumbo header for BIG TCP IPv6, or the corresponding IPv4 code
uses skb_ip_totlen (e.g., in include/net/netfilter/nf_tables_ipv6.h).
====================
dpll: zl3073x: Include current frequency in supported frequencies list
This series ensures that the current operating frequency of a DPLL pin
is always reported in its supported frequencies list.
Problem:
When a ZL3073x DPLL pin is registered, its supported frequencies are
read from the firmware node's "supported-frequencies-hz" property.
However, if the firmware node is missing, or doesn't include the
current operating frequency, the pin reports a frequency that isn't
in its supported list. This inconsistency can confuse userspace tools
that expect the current frequency to be among the supported values.
Solution:
Always include the current pin frequency as the first entry in the
supported frequencies list, followed by any additional frequencies
from the firmware node (with duplicates filtered out).
Patch 1 refactors the output pin frequency calculation into a reusable
helper function zl3073x_dev_output_pin_freq_get(), which mirrors the
existing zl3073x_dev_ref_freq_get() for input pins.
Patch 2 modifies zl3073x_pin_props_get() to obtain the current
frequency early and place it at index 0 of the supported frequencies
array, ensuring it is always present regardless of firmware node
contents.
====================
Ivan Vecera [Thu, 5 Feb 2026 15:43:50 +0000 (16:43 +0100)]
dpll: zl3073x: Include current frequency in supported frequencies list
Ensure the current pin frequency is always present in the list of
supported frequencies reported to userspace. Previously, if the
firmware node was missing or didn't include the current operating
frequency in the supported-frequencies-hz property, the pin would
report a frequency that wasn't in its supported list.
Get the current frequency early in zl3073x_pin_props_get():
- For input pins: use zl3073x_dev_ref_freq_get()
- For output pins: use zl3073x_dev_output_pin_freq_get()
Place the current frequency at index 0 of the supported frequencies
array, then append frequencies from the firmware node (if present),
skipping any duplicate of the current frequency.
Ivan Vecera [Thu, 5 Feb 2026 15:43:49 +0000 (16:43 +0100)]
dpll: zl3073x: Add output pin frequency helper
Introduce zl3073x_dev_output_pin_freq_get() helper function to compute
the output pin frequency based on synthesizer frequency, output divisor,
and signal format. For N-div signal formats, the N-pin frequency is
additionally divided by esync_n_period.
Add zl3073x_out_is_ndiv() helper to check if an output is configured
in N-div mode (2_NDIV or 2_NDIV_INV signal formats).
Refactor zl3073x_dpll_output_pin_frequency_get() callback to use the
new helper, reducing code duplication and enabling reuse of the
frequency calculation logic in other contexts.
This is a preparatory change for adding current frequency to the
supported frequencies list in pin properties.
Kevin Hao [Thu, 5 Feb 2026 06:25:09 +0000 (14:25 +0800)]
net: ti: icssg: Remove dedicated workqueue for ndo_set_rx_mode callback
Currently, both the icssg-prueth and icssg-prueth-sr1 drivers create
a dedicated 'emac->cmd_wq' workqueue.
In the icssg-prueth-sr1 driver, this workqueue is not utilized at all.
In the icssg-prueth driver, the workqueue is only used to execute the
actual processing of ndo_set_rx_mode. However, creating a dedicated
workqueue for such a simple use case is unnecessary. To simplify the
code, switch to using the system default workqueue instead.
Arnd Bergmann [Thu, 5 Feb 2026 16:28:09 +0000 (17:28 +0100)]
myri10ge: avoid uninitialized variable use
While compile testing on less common architectures, I noticed that gcc-10 on
s390 finds a bug that all other configurations seem to miss:
drivers/net/ethernet/myricom/myri10ge/myri10ge.c: In function 'myri10ge_set_multicast_list':
drivers/net/ethernet/myricom/myri10ge/myri10ge.c:391:25: error: 'cmd.data0' is used uninitialized in this function [-Werror=uninitialized]
391 | buf->data0 = htonl(data->data0);
| ^~
drivers/net/ethernet/myricom/myri10ge/myri10ge.c:392:25: error: '*((void *)&cmd+4)' is used uninitialized in this function [-Werror=uninitialized]
392 | buf->data1 = htonl(data->data1);
| ^~
drivers/net/ethernet/myricom/myri10ge/myri10ge.c: In function 'myri10ge_allocate_rings':
drivers/net/ethernet/myricom/myri10ge/myri10ge.c:392:13: error: 'cmd.data1' is used uninitialized in this function [-Werror=uninitialized]
392 | buf->data1 = htonl(data->data1);
drivers/net/ethernet/myricom/myri10ge/myri10ge.c:1939:22: note: 'cmd.data1' was declared here
1939 | struct myri10ge_cmd cmd;
| ^~~
drivers/net/ethernet/myricom/myri10ge/myri10ge.c:393:13: error: 'cmd.data2' is used uninitialized in this function [-Werror=uninitialized]
393 | buf->data2 = htonl(data->data2);
drivers/net/ethernet/myricom/myri10ge/myri10ge.c:1939:22: note: 'cmd.data2' was declared here
1939 | struct myri10ge_cmd cmd;
It would be nice to understand how to make other compilers catch this as
well, but for the moment I'll just shut up the warning by fixing the
undefined behavior in this driver.
Arnd Bergmann [Thu, 5 Feb 2026 16:13:48 +0000 (17:13 +0100)]
hinic3: select CONFIG_DIMLIB
The driver started using dimlib but fails to select the corresponding
symbol, which results in a link failure:
x86_64-linux-ld: drivers/net/ethernet/huawei/hinic3/hinic3_irq.o: in function `hinic3_poll':
hinic3_irq.c:(.text+0x179): undefined reference to `net_dim'
x86_64-linux-ld: drivers/net/ethernet/huawei/hinic3/hinic3_irq.o: in function `hinic3_rx_dim_work':
hinic3_irq.c:(.text+0x1fb): undefined reference to `net_dim_get_rx_moderation'
Oliver Hartkopp [Thu, 5 Feb 2026 14:44:05 +0000 (15:44 +0100)]
net: skb: allow up to 8 skb extension ids
The skb extension ids range from 0 .. 7 to fit their bits as flags into
a single byte. The ids are automatically enumnerated in enum skb_ext_id
in skbuff.h, where SKB_EXT_NUM is defined as the last value.
When having 8 skb extension ids (0 .. 7), SKB_EXT_NUM becomes 8 which is
a valid value for SKB_EXT_NUM.
Alok Tiwari [Thu, 5 Feb 2026 09:19:55 +0000 (01:19 -0800)]
net: marvell: prestera: fix FEC error message for SFP ports
In prestera_ethtool_set_fecparam(), the error message is opposite of
the condition checking PRESTERA_PORT_TCVR_SFP. FEC configuration is
not allowed on SFP ports, but the message says "non-SFP ports", which
does not match the condition. However, FEC may be required depending on
the transceiver, cable, or mode, and firmware already validates invalid
combinations.
Remove the SFP transceiver check and let firmware handle validation.
Qiliang Yuan [Wed, 4 Feb 2026 07:48:42 +0000 (02:48 -0500)]
netns: optimize netns cleaning by batching unhash_nsid calls
Currently, unhash_nsid() scans the entire system for each netns being
killed, leading to O(L_dying_net * M_alive_net * N_id) complexity, as
__peernet2id() also performs a linear search in the IDR.
Optimize this to O(M_alive_net * N_id) by batching unhash operations. Move
unhash_nsid() out of the per-netns loop in cleanup_net() to perform a
single-pass traversal over survivor namespaces.
Identify dying peers by an 'is_dying' flag, which is set under net_rwsem
write lock after the netns is removed from the global list. This batches
the unhashing work and eliminates the O(L_dying_net) multiplier.
To minimize the impact on struct net size, 'is_dying' is placed in an
existing hole after 'hash_mix' in struct net.
Use a restartable idr_get_next() loop for iteration. This avoids the
unsafe modification issue inherent to idr_for_each() callbacks and allows
dropping the nsid_lock to safely call sleepy rtnl_net_notifyid().
Clean up redundant nsid_lock and simplify the destruction loop now that
unhashing is centralized.
Dragos Tatulea [Wed, 4 Feb 2026 20:03:45 +0000 (22:03 +0200)]
net/mlx5e: SHAMPO, Switch to header memcpy
Previously the HW-GRO code was using a separate page_pool for the header
buffer. The pages of the header buffer were replenished via UMR. This
mechanism has some drawbacks:
- Reference counting on the page_pool page frags is not cheap.
- UMRs have HW overhead for updating and also for access. Especially for
the KLM type which was previously used.
- UMR code for headers is complex.
This patch switches to using a static memory area (static MTT MKEY) for
the header buffer and does a header memcpy. This happens only once per
GRO session. The SKB is allocated from the per-cpu NAPI SKB cache.
Notes on test:
- System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
- oncpu: NAPI and application running on same CPU
- offcpu: NAPI and application running on different CPUs
- MTU: 1500
- iperf3 tests are single stream, 60s with IPv6 (for slightly larger
headers)
- kperf version [1]
[1] git://git.kernel.dk/kperf.git
Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260204200345.1724098-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Davide Caratti [Wed, 4 Feb 2026 16:02:31 +0000 (17:02 +0100)]
net/sched: don't use dynamic lockdep keys with clsact/ingress/noqueue
Currently we are registering one dynamic lockdep key for each allocated
qdisc, to avoid false deadlock reports when mirred (or TC eBPF) redirects
packets to another device while the root lock is acquired [1].
Since dynamic keys are a limited resource, we can save them at least for
qdiscs that are not meant to acquire the root lock in the traffic path,
or to carry traffic at all, like:
- clsact
- ingress
- noqueue
Don't register dynamic keys for the above schedulers, so that we hit
MAX_LOCKDEP_KEYS later in our tests.
When looking at the iMX93 documentation, the definitions in the driver
do not correspond with the documentation, which makes the driver
confusing.
The driver, for example, re-uses a definition for bit 0 for two
different registers, where this bit have completely different purposes.
Fix this by renaming the second register, and adding a definition that
reflects the true purpose of bit 0 in the first register (EQOS enable.)
Replace MX93_GPR_ENET_QOS_INTF_MODE_MASK with MX93_GPR_ENET_QOS_ENABLE
and MX93_GPR_ENET_QOS_INTF_SEL_MASK as MX93_GPR_ENET_QOS_INTF_MODE_MASK
is not a register field.
net: stmmac: rk: rk3506, rk3528 and rk3588 have rmii_mode in clock register
rk3506, rk3528 and rk3588 have the rmii_mode bit in the clock GRF
register rather than the gmac GRF register. Provide a mask for this
field in the clock register, and convert these SoCs to use this.
Add the necessary code in rk_gmac_powerup() to write this field.
This allows us to get rid of these SoCs set_to_rmii() function. As
such, we need to mark these SoCs as supporting RMII mode.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Heiko Stuebner <heiko@sntech.de> Tested-by: Heiko Stuebner <heiko@sntech.de> #px30,rk3328,rk3568,rk3588 Link: https://patch.msgid.link/E1vnYyB-00000007hpF-1dwK@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: stmmac: rk: use rk_encode_wm16() for clock selection
Use rk_encode_wm16() for RMII clock gating control, and also for the
io_clksel bit used to select the transmit clock between CRU-derived
and IO-derived clock sources.
Both of these were configured via the "set_clock_selection" method in
the SoC specific operations, but there is no requirement to change the
io_clksel except when enabling clocks.
It is also possible that we don't need to ungate the RMII clock if we
are operating in RGMII mode, but this commit makes no change there.
Split up the configuration of these as separate functions, and remove
the set_clock_selection() method. Since these clocking bits are in the
same register that we call the "speed" register, move the logic for
writing that register into rk_write_speed_grf_reg().
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Heiko Stuebner <heiko@sntech.de> Tested-by: Heiko Stuebner <heiko@sntech.de> #px30,rk3328,rk3568,rk3588 Link: https://patch.msgid.link/E1vnYy6-00000007hp9-1AJM@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This follows the same pattern as rk3328, where this gmac instance
only supports RMII. Disable RGMII in phylink's supported_interfaces
mask for this gmac instance.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Heiko Stuebner <heiko@sntech.de> Link: https://patch.msgid.link/E1vnYy1-00000007hp3-0hKm@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: stmmac: rk: rk3328: gmac2phy only supports RMII
As detailed in a previous commit ("net: stmmac: rk: convert rk3328 to
use bsp_priv->id") rk3328 gmac2phy only supports RMII, whereas gmac2io
supports both RMII and RGMII. Clear supports_rgmii for gmac2phy.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Heiko Stuebner <heiko@sntech.de> Tested-by: Heiko Stuebner <heiko@sntech.de> #px30,rk3328 gmac2io,rk3568,rk3588 Link: https://patch.msgid.link/E1vnYxw-00000007hox-0DqH@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: stmmac: rk: introduce flags indicating support for RGMII/RMII
Introduce two boolean flags into struct rk_priv_data indicating
whether RGMII and/or RMII is supported for this instance. Use these
to configure the supported_interfaces mask for phylink, validate the
interface mode. Initialise these from equivalent flags in the
rk_gmac_ops or depending on the presence of the ops->set_to_rgmii and
ops->set_to_mii methods. Finally, make ops->set_to_* optional.
This will allow us to get rid of empty set_to_rmii() methods.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Heiko Stuebner <heiko@sntech.de> Tested-by: Heiko Stuebner <heiko@sntech.de> #px30,rk3328,rk3568,rk3588 Link: https://patch.msgid.link/E1vnYxl-00000007hol-3XiH@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Shigeru Yoshida [Wed, 4 Feb 2026 09:58:37 +0000 (18:58 +0900)]
ipv6: Fix ECMP sibling count mismatch when clearing RTF_ADDRCONF
syzbot reported a kernel BUG in fib6_add_rt2node() when adding an IPv6
route. [0]
Commit f72514b3c569 ("ipv6: clear RA flags when adding a static
route") introduced logic to clear RTF_ADDRCONF from existing routes
when a static route with the same nexthop is added. However, this
causes a problem when the existing route has a gateway.
When RTF_ADDRCONF is cleared from a route that has a gateway, that
route becomes eligible for ECMP, i.e. rt6_qualify_for_ecmp() returns
true. The issue is that this route was never added to the
fib6_siblings list.
This leads to a mismatch between the following counts:
- The sibling count computed by iterating fib6_next chain, which
includes the newly ECMP-eligible route
- The actual siblings in fib6_siblings list, which does not include
that route
When a subsequent ECMP route is added, fib6_add_rt2node() hits
BUG_ON(sibling->fib6_nsiblings != rt->fib6_nsiblings) because the
counts don't match.
Fix this by only clearing RTF_ADDRCONF when the existing route does
not have a gateway. Routes without a gateway cannot qualify for ECMP
anyway (rt6_qualify_for_ecmp() requires fib_nh_gw_family), so clearing
RTF_ADDRCONF on them is safe and matches the original intent of the
commit.
Jakub Kicinski [Thu, 5 Feb 2026 16:38:02 +0000 (08:38 -0800)]
Merge tag 'nf-26-02-05' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Florian Westphal says:
====================
netfilter: update for net
This is one last-minute crash fix for nf_tables, from Andrew Fasano:
Logical check is inverted, this makes kernel fail to correctly undo
the transaction, leading to a use-after-free.
* tag 'nf-26-02-05' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: fix inverted genmask check in nft_map_catchall_activate()
====================
David Yang [Wed, 4 Feb 2026 05:28:35 +0000 (13:28 +0800)]
flow_offload: add const qualifiers to function arguments
Some functions do not modify the pointed-to data, but lack const
qualifiers. Add const qualifiers to the arguments of
flow_rule_match_has_control_flags() and flow_cls_offload_flow_rule().
====================
dpll: Core improvements and ice E825-C SyncE support
This series introduces Synchronous Ethernet (SyncE) support for the Intel
E825-C Ethernet controller. Unlike previous generations where DPLL
connections were implicitly assumed, the E825-C architecture relies
on the platform firmware (ACPI) to describe the physical connections
between the Ethernet controller and external DPLLs (such as the ZL3073x).
To accommodate this, the series extends the DPLL subsystem to support
firmware node (fwnode) associations, asynchronous discovery via notifiers,
and dynamic pin management. Additionally, a significant refactor of
the DPLL reference counting logic is included to ensure robustness and
debuggability.
DPLL Core Extensions:
* Firmware Node Association: Pins can now be associated with a struct
fwnode_handle after allocation via dpll_pin_fwnode_set(). This allows
drivers to link pin objects with their corresponding DT/ACPI nodes.
* Asynchronous Notifiers: A raw notifier chain is added to the DPLL core.
This allows the Ethernet driver to subscribe to events and react when
the platform DPLL driver registers the parent pins, resolving probe
ordering dependencies.
* Dynamic Indexing: Drivers can now request DPLL_PIN_IDX_UNSPEC to have
the core automatically allocate a unique pin index.
Reference Counting & Debugging:
* Refactor: The reference counting logic in the core is consolidated.
Internal list management helpers now automatically handle hold/put
operations, removing fragile open-coded logic in the registration paths.
* Reference Tracking: A new Kconfig option DPLL_REFCNT_TRACKER is added.
This allows developers to instrument and debug reference leaks by
recording stack traces for every get/put operation.
Driver Updates:
* zl3073x: Updated to associate pins with fwnode handles using the new
setter and support the 'mux' pin type.
* ice: Implements the E825-C specific hardware configuration for SyncE
(CGU registers). It utilizes the new notifier and fwnode APIs to
dynamically discover and attach to the platform DPLLs.
ice: dpll: Support E825-C SyncE and dynamic pin discovery
Implement SyncE support for the E825-C Ethernet controller using the
DPLL subsystem. Unlike E810, the E825-C architecture relies on platform
firmware (ACPI) to describe connections between the NIC's recovered clock
outputs and external DPLL inputs.
Implement the following mechanisms to support this architecture:
1. Discovery Mechanism: The driver parses the 'dpll-pins' and 'dpll-pin names'
firmware properties to identify the external DPLL pins (parents)
corresponding to its RCLK outputs ("rclk0", "rclk1"). It uses
fwnode_dpll_pin_find() to locate these parent pins in the DPLL core.
2. Asynchronous Registration: Since the platform DPLL driver (e.g.
zl3073x) may probe independently of the network driver, utilize
the DPLL notifier chain The driver listens for DPLL_PIN_CREATED
events to detect when the parent MUX pins become available, then
registers its own Recovered Clock (RCLK) pins as children of those
parents.
3. Hardware Configuration: Implement the specific register access logic
for E825-C CGU (Clock Generation Unit) registers (R10, R11). This
includes configuring the bypass MUXes and clock dividers required to
drive SyncE signals.
4. Split Initialization: Refactor `ice_dpll_init()` to separate the
static initialization path of E810 from the dynamic, firmware-driven
path required for E825-C.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Co-developed-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Co-developed-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Link: https://patch.msgid.link/20260203174002.705176-10-ivecera@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Ivan Vecera [Tue, 3 Feb 2026 17:40:01 +0000 (18:40 +0100)]
drivers: Add support for DPLL reference count tracking
Update existing DPLL drivers to utilize the DPLL reference count
tracking infrastructure.
Add dpll_tracker fields to the drivers' internal device and pin
structures. Pass pointers to these trackers when calling
dpll_device_get/put() and dpll_pin_get/put().
This allows developers to inspect the specific references held by this
driver via debugfs when CONFIG_DPLL_REFCNT_TRACKER is enabled, aiding
in the debugging of resource leaks.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Link: https://patch.msgid.link/20260203174002.705176-9-ivecera@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Ivan Vecera [Tue, 3 Feb 2026 17:40:00 +0000 (18:40 +0100)]
dpll: Add reference count tracking support
Add support for the REF_TRACKER infrastructure to the DPLL subsystem.
When enabled, this allows developers to track and debug reference counting
leaks or imbalances for dpll_device and dpll_pin objects. It records stack
traces for every get/put operation and exposes this information via
debugfs at:
/sys/kernel/debug/ref_tracker/dpll_device_*
/sys/kernel/debug/ref_tracker/dpll_pin_*
The following API changes are made to support this:
1. dpll_device_get() / dpll_device_put() now accept a 'dpll_tracker *'
(which is a typedef to 'struct ref_tracker *' when enabled, or an empty
struct otherwise).
2. dpll_pin_get() / dpll_pin_put() and fwnode_dpll_pin_find() similarly
accept the tracker argument.
3. Internal registration structures now hold a tracker to associate the
reference held by the registration with the specific owner.
All existing in-tree drivers (ice, mlx5, ptp_ocp, zl3073x) are updated
to pass NULL for the new tracker argument, maintaining current behavior
while enabling future debugging capabilities.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Co-developed-by: Petr Oros <poros@redhat.com> Signed-off-by: Petr Oros <poros@redhat.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Link: https://patch.msgid.link/20260203174002.705176-8-ivecera@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Ivan Vecera [Tue, 3 Feb 2026 17:39:59 +0000 (18:39 +0100)]
dpll: Enhance and consolidate reference counting logic
Refactor the reference counting mechanism for DPLL devices and pins to
improve consistency and prevent potential lifetime issues.
Introduce internal helpers __dpll_{device,pin}_{hold,put}() to
centralize reference management.
Update the internal XArray reference helpers (dpll_xa_ref_*) to
automatically grab a reference to the target object when it is added to
a list, and release it when removed. This ensures that objects linked
internally (e.g., pins referenced by parent pins) are properly kept
alive without relying on the caller to manually manage the count.
Consequently, remove the now redundant manual `refcount_inc/dec` calls
in dpll_pin_on_pin_{,un}register()`, as ownership is now correctly handled
by the dpll_xa_ref_* functions.
Additionally, ensure that dpll_device_{,un}register()` takes/releases
a reference to the device, ensuring the device object remains valid for
the duration of its registration.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Link: https://patch.msgid.link/20260203174002.705176-7-ivecera@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Ivan Vecera [Tue, 3 Feb 2026 17:39:58 +0000 (18:39 +0100)]
dpll: zl3073x: Add support for mux pin type
Add parsing for the "mux" string in the 'connection-type' pin property
mapping it to DPLL_PIN_TYPE_MUX.
Recognizing this type in the driver allows these pins to be taken as
parent pins for pin-on-pin pins coming from different modules (e.g.
network drivers).
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Link: https://patch.msgid.link/20260203174002.705176-6-ivecera@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Ivan Vecera [Tue, 3 Feb 2026 17:39:57 +0000 (18:39 +0100)]
dpll: Support dynamic pin index allocation
Allow drivers to register DPLL pins without manually specifying a pin
index.
Currently, drivers must provide a unique pin index when calling
dpll_pin_get(). This works well for hardware-mapped pins but creates
friction for drivers handling virtual pins or those without a strict
hardware indexing scheme.
Introduce DPLL_PIN_IDX_UNSPEC (U32_MAX). When a driver passes this
value as the pin index:
1. The core allocates a unique index using an IDA
2. The allocated index is mapped to a range starting above `INT_MAX`
This separation ensures that dynamically allocated indices never collide
with standard driver-provided hardware indices, which are assumed to be
within the `0` to `INT_MAX` range. The index is automatically freed when
the pin is released in dpll_pin_put().
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Link: https://patch.msgid.link/20260203174002.705176-5-ivecera@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Petr Oros [Tue, 3 Feb 2026 17:39:56 +0000 (18:39 +0100)]
dpll: Add notifier chain for dpll events
Currently, the DPLL subsystem reports events (creation, deletion, changes)
to userspace via Netlink. However, there is no mechanism for other kernel
components to be notified of these events directly.
Add a raw notifier chain to the DPLL core protected by dpll_lock. This
allows other kernel subsystems or drivers to register callbacks and
receive notifications when DPLL devices or pins are created, deleted,
or modified.
Define the following:
- Registration helpers: {,un}register_dpll_notifier()
- Event types: DPLL_DEVICE_CREATED, DPLL_PIN_CREATED, etc.
- Context structures: dpll_{device,pin}_notifier_info to pass relevant
data to the listeners.
The notification chain is invoked alongside the existing Netlink event
generation to ensure in-kernel listeners are kept in sync with the
subsystem state.
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Co-developed-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Link: https://patch.msgid.link/20260203174002.705176-4-ivecera@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Ivan Vecera [Tue, 3 Feb 2026 17:39:55 +0000 (18:39 +0100)]
dpll: zl3073x: Associate pin with fwnode handle
Associate the registered DPLL pin with its firmware node by calling
dpll_pin_fwnode_set().
This links the created pin object to its corresponding DT/ACPI node
in the DPLL core. Consequently, this enables consumer drivers (such as
network drivers) to locate and request this specific pin using the
fwnode_dpll_pin_find() helper.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Link: https://patch.msgid.link/20260203174002.705176-3-ivecera@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Ivan Vecera [Tue, 3 Feb 2026 17:39:54 +0000 (18:39 +0100)]
dpll: Allow associating dpll pin with a firmware node
Extend the DPLL core to support associating a DPLL pin with a firmware
node. This association is required to allow other subsystems (such as
network drivers) to locate and request specific DPLL pins defined in
the Device Tree or ACPI.
* Add a .fwnode field to the struct dpll_pin
* Introduce dpll_pin_fwnode_set() helper to allow the provider driver
to associate a pin with a fwnode after the pin has been allocated
* Introduce fwnode_dpll_pin_find() helper to allow consumers to search
for a registered DPLL pin using its associated fwnode handle
* Ensure the fwnode reference is properly released in dpll_pin_put()
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Link: https://patch.msgid.link/20260203174002.705176-2-ivecera@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Moshe Shemesh [Tue, 3 Feb 2026 10:24:02 +0000 (12:24 +0200)]
net/mlx5: Support devlink port state for host PF
Add support for devlink port function state get/set operations for the
host physical function (PF). Until now, mlx5 only allowed state get/set
for subfunctions (SFs) ports. This change enables an administrator with
eSwitch manager privileges to query or modify the host PF’s function
state, allowing it to be explicitly inactivated or activated. While
inactivated, the administrator can modify the functions attributes, such
as enable/disable roce.
$ devlink port show pci/0000:03:00.0/196608
pci/0000:03:00.0/196608: type eth netdev eth1 flavour pcipf controller 1 pfnum 0 external true splittable false
function:
hw_addr a0:88:c2:45:17:7c state active opstate attached roce enable max_io_eqs 120
$ devlink port function set pci/0000:03:00.0/196608 state inactive
$ devlink port show pci/0000:03:00.0/196608
pci/0000:03:00.0/196608: type eth netdev eth1 flavour pcipf controller 1 pfnum 0 external true splittable false
function:
hw_addr a0:88:c2:45:17:7c state inactive opstate detached roce enable max_io_eqs 120
====================
move CAN skb headroom content to skb extensions
CAN bus related skbuffs (ETH_P_CAN/ETH_P_CANFD/ETH_P_CANXL) simply contain
CAN frame structs for CAN CC/FD/XL of skb->len length at skb->data. Those
CAN skbs do not have network/mac/transport headers nor other such
references for encapsulated protocols like ethernet/IP protocols.
To store data for CAN specific use-cases all CAN bus related skbuffs are
created with a 16 byte private skb headroom (struct can_skb_priv). Using
the skb headroom and accessing skb->head for this private data led to
several problems in the past likely due to "The struct can_skb_priv
business is highly unconventional for the networking stack." [1]
This patch set aims to remove the unconventional skb headroom usage for CAN
bus related skbuffs and use the common skb extensions instead.
Oliver Hartkopp [Sun, 1 Feb 2026 14:33:21 +0000 (15:33 +0100)]
can: gw: use can_gw_hops instead of sk_buff::csum_start
As CAN skbs don't use IP checksums the skb->csum_start variable was used to
store the can-gw CAN frame time-to-live counter together with
skb->ip_summed set to CHECKSUM_UNNECESSARY.
Remove the 'hack' using the skb->csum_start variable and move the content
to can_skb_ext::can_gw_hops of the CAN skb extensions.
The module parameter 'max_hops' has been reduced to a single byte to fit
can_skb_ext::can_gw_hops as the maximum value to be stored is 6.
Oliver Hartkopp [Sun, 1 Feb 2026 14:33:18 +0000 (15:33 +0100)]
can: move ifindex to CAN skb extensions
When routing CAN frames over different CAN interfaces the interface index
skb->iif is overwritten with every single hop. To prevent sending a CAN
frame back to its originating (first) incoming CAN interface another
ifindex variable is needed, which was stored in can_skb_priv::ifindex.
Move the can_skb_priv::ifindex content to can_skb_ext::can_iif.
Oliver Hartkopp [Sun, 1 Feb 2026 14:33:17 +0000 (15:33 +0100)]
can: add CAN skb extension infrastructure
To remove the private CAN bus skb headroom infrastructure 8 bytes need to
be stored in the skb. The skb extensions are a common pattern and an easy
and efficient way to hold private data travelling along with the skb. We
only need the skb_ext_add() and skb_ext_find() functions to allocate and
access CAN specific content as the skb helpers to copy/clone/free skbs
automatically take care of skb extensions and their final removal.
This patch introduces the complete CAN skb extensions infrastructure:
- add struct can_skb_ext in new file include/net/can.h
- add include/net/can.h in MAINTAINERS
- add SKB_EXT_CAN to skbuff.c and skbuff.h
- select SKB_EXTENSIONS in Kconfig when CONFIG_CAN is enabled
- check for existing CAN skb extensions in can_rcv() in af_can.c
- add CAN skb extensions allocation at every skb_alloc() location
- duplicate the skb extensions if cloning outgoing skbs (framelen/gw_hops)
- introduce can_skb_ext_add() and can_skb_ext_find() helpers
The patch also corrects an indention issue in the original code from 2018: Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202602010426.PnGrYAk3-lkp@intel.com/ Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20260201-can_skb_ext-v8-2-3635d790fe8b@hartkopp.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Oliver Hartkopp [Sun, 1 Feb 2026 14:33:16 +0000 (15:33 +0100)]
can: use skb hash instead of private variable in headroom
The can_skb_priv::skbcnt variable is used to identify CAN skbs in the RX
path analogue to the skb->hash.
As the skb hash is not filled in CAN skbs move the private skbcnt value to
skb->hash and set skb->sw_hash accordingly. The skb->hash is a value used
for RPS to identify skbs. Use it as intended.
Cong Wang [Fri, 30 Jan 2026 21:20:21 +0000 (13:20 -0800)]
MAINTAINERS: Remove myself from TC maintainers
Recently TC maintainer Jamal intentionally a broke reasonable
use case:
https://lore.kernel.org/netdev/aG10rqwjX6elG1Gx@pop-os.localdomain/
Although I tried my best to help by:
1) Strongly objecting this breakage from the very beginning
2) Reverting it and offering a much better solution
3) Offering Jamal for video chat on 8 Jul 2025 and 26 Nov 2025
None of them worked.
So it makes no sense for me to continue caring about this subsystem.
Most importantly, intentionally breaking reasonable use cases is
against my moral, I don't want to get ashamed.
Qingfang Deng [Thu, 29 Jan 2026 01:29:02 +0000 (09:29 +0800)]
ppp: enable TX scatter-gather
PPP channels using chan->direct_xmit prepend the PPP header to a skb and
call dev_queue_xmit() directly. In this mode the skb does not need to be
linear, but the PPP netdevice currently does not advertise
scatter-gather features, causing unnecessary linearization and
preventing GSO.
Enable NETIF_F_SG and NETIF_F_FRAGLIST on PPP devices. In case a linear
buffer is required (PPP compression, multilink, and channels without
direct_xmit), call skb_linearize() explicitly.
Andrew Fasano [Wed, 4 Feb 2026 16:46:58 +0000 (17:46 +0100)]
netfilter: nf_tables: fix inverted genmask check in nft_map_catchall_activate()
nft_map_catchall_activate() has an inverted element activity check
compared to its non-catchall counterpart nft_mapelem_activate() and
compared to what is logically required.
nft_map_catchall_activate() is called from the abort path to re-activate
catchall map elements that were deactivated during a failed transaction.
It should skip elements that are already active (they don't need
re-activation) and process elements that are inactive (they need to be
restored). Instead, the current code does the opposite: it skips inactive
elements and processes active ones.
Compare the non-catchall activate callback, which is correct:
nft_mapelem_activate():
if (nft_set_elem_active(ext, iter->genmask))
return 0; /* skip active, process inactive */
With the buggy catchall version:
nft_map_catchall_activate():
if (!nft_set_elem_active(ext, genmask))
continue; /* skip inactive, process active */
The consequence is that when a DELSET operation is aborted,
nft_setelem_data_activate() is never called for the catchall element.
For NFT_GOTO verdict elements, this means nft_data_hold() is never
called to restore the chain->use reference count. Each abort cycle
permanently decrements chain->use. Once chain->use reaches zero,
DELCHAIN succeeds and frees the chain while catchall verdict elements
still reference it, resulting in a use-after-free.
This is exploitable for local privilege escalation from an unprivileged
user via user namespaces + nftables on distributions that enable
CONFIG_USER_NS and CONFIG_NF_TABLES.
Fix by removing the negation so the check matches nft_mapelem_activate():
skip active elements, process inactive ones.
Fixes: 628bd3e49cba ("netfilter: nf_tables: drop map element references from preparation phase") Signed-off-by: Andrew Fasano <andrew.fasano@nist.gov> Signed-off-by: Florian Westphal <fw@strlen.de>
Alexei Lazar [Tue, 3 Feb 2026 07:30:21 +0000 (09:30 +0200)]
net/mlx5e: Extend TC max ratelimit using max_bw_value_msb
The per-TC rate limit was restricted to 255 Gbps due to the 8-bit
max_bw_value field in the QETC register.
This limit is insufficient for newer, higher-bandwidth NICs.
Extend the rate limit by using the full 16-bit max_bw_value field.
This allows the finer 100Mbps granularity to be used for rates up to
~6.5 Tbps, instead of switching to 1Gbps granularity at higher rates.
The extended range is only used when the device advertises support
via the qetcr_qshr_max_bw_val_msb capability bit in the QCAM register.
Dragos Tatulea [Tue, 3 Feb 2026 07:21:30 +0000 (09:21 +0200)]
net/mlx5e: SHAMPO, Improve allocation recovery
When memory providers are used, there is a disconnect between the
page_pool size and the available memory in the provider. This means
that the page_pool can run out of memory if the user didn't provision
a large enough buffer.
Under these conditions, mlx5 gets stuck trying to allocate new
buffers without being able to release existing buffers. This happens due
to the optimization introduced in commit 4c2a13236807
("net/mlx5e: RX, Defer page release in striding rq for better recycling")
which delays WQE releases to increase the chance of page_pool direct
recycling. The optimization was developed before memory providers
existed and this circumstance was not considered.
This patch unblocks the queue by reclaiming pages from WQEs that can be
freed and doing a one-shot retry. A WQE can be freed when:
1) All its strides have been consumed (WQE is no longer in linked list).
2) The WQE pages/netmems have not been previously released.
This reclaim mechanism is useful for regular pages as well.
Note that provisioning memory that can't fill even one MPWQE (64
4K pages) will still render the queue unusable. Same when
the application doesn't release its buffers for various reasons.
Or a combination of the two: a very small buffer is provisioned,
application releases buffers in bulk, bulk size never reached
=> queue is stuck.
Dragos Tatulea [Tue, 3 Feb 2026 07:21:29 +0000 (09:21 +0200)]
net/mlx5e: RX, Drop oversized packets in non-linear mode
Currently the driver has an inconsistent behaviour between modes when it
comes to oversized packets that are not dropped through the physical MTU
check in HW. This can happen for Multi Host configurations where each
port has a different MTU.
Current behavior:
1) Striding RQ in linear mode drops the packet in SW and counts it
with oversize_pkts_sw_drop.
2) Striding RQ in non-linear mode allows it like a normal packet.
3) Legacy RQ can't receive oversized packets by design:
the RX WQE uses MTU sized packet buffers.
This inconsistency is not a violation of the netdev policy [1]
but it is better to be consistent across modes.
This patch aligns (2) with (1) and (3). One exception is added for
LRO: don't drop the oversized packet if it is an LRO packet.
As now rq->hw_mtu always needs to be updated during the MTU change flow,
drop the reset avoidance optimization from mlx5e_change_mtu().
Extract the CQE LRO segments reading into a helper function as it
is used twice now.
[1] Documentation/networking/netdevices.rst#L205
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260203072130.1710255-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The dwmac databook for v3.74a states that lpi_intr_o is a sideband
signal which should be used to ungate the application clock, and this
signal is synchronous to the receive clock. The receive clock can run
at 2.5, 25 or 125MHz depending on the media speed, and can stop under
the control of the link partner. This means that the time it takes to
clear is dependent on the negotiated media speed, and thus can be 8,
40, or 400ns after reading the LPI control and status register.
It has been observed with some aggressive link partners, this clock
can stop while lpi_intr_o is still asserted, meaning that the signal
remains asserted for an indefinite period that the local system has
no direct control over.
The LPI interrupts will still be signalled through the main interrupt
path in any case, and this path is not dependent on the receive clock.
This, since we do not gate the application clock, and the chances of
adding clock gating in the future are slim due to the clocks being
ill-defined, lpi_intr_o serves no useful purpose. Remove the code which
requests the interrupt, and all associated code.
Reported-by: Ovidiu Panait <ovidiu.panait.rb@renesas.com> Tested-by: Ovidiu Panait <ovidiu.panait.rb@renesas.com> # Renesas RZ/V2H board Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1vnJbt-00000007YYN-28nm@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
net: stmmac: fix serdes power methods
The stmmac serdes powerup/powerdown methods are not guaranteed to be
called in a balancing fashion, but these are used to call the generic
PHY subsystem's phy_power_up() and phy_power_down() methods which do
require balanced calls.
This series addresses this by making the stmmac serdes methods balanced.
====================
net: stmmac: move serdes power methods to stmmac_[open|release]()
Move the SerDes power up and down calls for the non-"after linkup"
case out of __stmmac_open() and __stmmac_release() into the
stmmac_open() and stmmac_release() methods, which means the SerDes
will only change power state on administrative changes or suspend/
resume, not while changing the interface MTU.
net: stmmac: add state tracking for legacy serdes power state
Avoid calling the serdes_powerdown() method if we have not had a
preceeding successful call to the serdes_powerup() method. This
avoids unbalancing refcounted resources that may be used in the
these platform glue serdes methods.
====================
net/rds: RDS-TCP protocol and extension improvements
This is subset 3 of the larger RDS-TCP patch series I posted last
Oct. The greater series aims to correct multiple rds-tcp issues that
can cause dropped or out of sequence messages. I've broken it down into
smaller sets to make reviews more manageable.
In this set, we introduce extension headers for byte accounting
and fix several RDS/TCP protocol issues including message preservation
during connection transitions and multipath lane handling.
The entire set can be viewed in the rfc here:
https://lore.kernel.org/netdev/20251022191715.157755-1-achender@kernel.org/
====================
Gerd Rausch [Tue, 3 Feb 2026 05:57:23 +0000 (22:57 -0700)]
net/rds: Trigger rds_send_ping() more than once
Even though a peer may have already received a
non-zero value for "RDS_EXTHDR_NPATHS" from a node in the past,
the current peer may not.
Therefore it is important to initiate another rds_send_ping()
after a re-connect to any peer:
It is unknown at that time if we're still talking to the same
instance of RDS kernel modules on the other side.
Otherwise, the peer may just operate on a single lane
("c_npaths == 0"), not knowing that more lanes are available.
However, if "c_with_sport_idx" is supported,
we also need to check that the connection we accepted on lane#0
meets the proper source port modulo requirement, as we fan out:
Since the exchange of "RDS_EXTHDR_NPATHS" and "RDS_EXTHDR_SPORT_IDX"
is asynchronous, initially we have no choice but to accept an incoming
connection (via "accept") in the first slot ("cp_index == 0")
for backwards compatibility.
But that very connection may have come from a different lane
with "cp_index != 0", since the peer thought that we already understood
and handled "c_with_sport_idx" properly, as indicated by a previous
exchange before a module was reloaded.
In short:
If a module gets reloaded, we recover from that, but do *not*
allow a downgrade to support fewer lanes.
Downgrades would require us to merge messages from separate lanes,
which is rather tricky with the current RDS design.
Each lane has its own sequence number space and all messages
would need to be re-sequenced as we merge, all while
handling "RDS_FLAG_RETRANSMITTED" and "cp_retrans" properly.
Gerd Rausch [Tue, 3 Feb 2026 05:57:22 +0000 (22:57 -0700)]
net/rds: Use the first lane until RDS_EXTHDR_NPATHS arrives
Instead of just blocking the sender until "c_npaths" is known
(it gets updated upon the receipt of a MPRDS PONG message),
simply use the first lane (cp_index#0).
But just using the first lane isn't enough.
As soon as we enqueue messages on a different lane, we'd run the risk
of out-of-order delivery of RDS messages.
Earlier messages enqueued on "cp_index == 0" could be delivered later
than more recent messages enqueued on "cp_index > 0", mostly because of
possible head of line blocking issues causing the first lane to be
slower.
To avoid that, we simply take a snapshot of "cp_next_tx_seq" at the
time we're about to fan-out to more lanes.
Then we delay the transmission of messages enqueued on other lanes
with "cp_index > 0" until cp_index#0 caught up with the delivery of
new messages (from "cp_send_queue") as well as in-flight
messages (from "cp_retrans") that haven't been acknowledged yet
by the receiver.
We also add a new counter "mprds_catchup_tx0_retries" to keep track
of how many times "rds_send_xmit" had to suspend activities,
because it was waiting for the first lane to catch up.
Håkon Bugge [Tue, 3 Feb 2026 05:57:20 +0000 (22:57 -0700)]
net/rds: Clear reconnect pending bit
When canceling the reconnect worker, care must be taken to reset the
reconnect-pending bit. If the reconnect worker has not yet been
scheduled before it is canceled, the reconnect-pending bit will stay
on forever.
Gerd Rausch [Tue, 3 Feb 2026 05:57:19 +0000 (22:57 -0700)]
net/rds: Kick-start TCP receiver after accept
In cases where the server (the node with the higher IP-address)
in an RDS/TCP connection is overwhelmed it is possible that the
socket that was just accepted is chock-full of messages, up to
the limit of what the socket receive buffer permits.
Subsequently, "rds_tcp_data_ready" won't be called anymore,
because there is no more space to receive additional messages.
Nor was it called prior to the point of calling "rds_tcp_set_callbacks",
because the "sk_data_ready" pointer didn't even point to
"rds_tcp_data_ready" yet.
We fix this by simply kick-starting the receive-worker
for all cases where the socket state is neither
"TCP_CLOSE_WAIT" nor "TCP_CLOSE".
Gerd Rausch [Tue, 3 Feb 2026 05:57:18 +0000 (22:57 -0700)]
net/rds: rds_tcp_conn_path_shutdown must not discard messages
RDS/TCP differs from RDS/RDMA in that message acknowledgment
is done based on TCP sequence numbers:
As soon as the last byte of a message has been acknowledged by the
TCP stack of a peer, rds_tcp_write_space() goes on to discard
prior messages from the send queue.
Which is fine, for as long as the receiver never throws any messages
away.
The dequeuing of messages in RDS/TCP is done either from the
"sk_data_ready" callback pointing to rds_tcp_data_ready()
(the most common case), or from the receive worker pointing
to rds_tcp_recv_path() which is called for as long as the
connection is "RDS_CONN_UP".
However, as soon as rds_conn_path_drop() is called for whatever reason,
including "DR_USER_RESET", "cp_state" transitions to "RDS_CONN_ERROR",
and rds_tcp_restore_callbacks() ends up restoring the callbacks
and thereby disabling message receipt.
So messages already acknowledged to the sender were dropped.
Furthermore, the "->shutdown" callback was always called
with an invalid parameter ("RCV_SHUTDOWN | SEND_SHUTDOWN == 3"),
instead of the correct pre-increment value ("SHUT_RDWR == 2").
inet_shutdown() returns "-EINVAL" in such cases, rendering
this call a NOOP.
So we change rds_tcp_conn_path_shutdown() to do the proper
"->shutdown(SHUT_WR)" call in order to signal EOF to the peer
and make it transition to "TCP_CLOSE_WAIT" (RFC 793).
This should make the peer also enter rds_tcp_conn_path_shutdown()
and do the same.
This allows us to dequeue all messages already received
and acknowledged to the peer.
We do so, until we know that the receive queue no longer has data
(skb_queue_empty()) and that we couldn't have any data
in flight anymore, because the socket transitioned to
any of the states "CLOSING", "TIME_WAIT", "CLOSE_WAIT",
"LAST_ACK", or "CLOSE" (RFC 793).
However, if we do just that, we suddenly see duplicate RDS
messages being delivered to the application.
So what gives?
Turns out that with MPRDS and its multitude of backend connections,
retransmitted messages ("RDS_FLAG_RETRANSMITTED") can outrace
the dequeuing of their original counterparts.
And the duplicate check implemented in rds_recv_local() only
discards duplicates if flag "RDS_FLAG_RETRANSMITTED" is set.
Rather curious, because a duplicate is a duplicate; it shouldn't
matter which copy is looked at and delivered first.
To avoid this entire situation, we simply make the sender discard
messages from the send-queue right from within
rds_tcp_conn_path_shutdown(). Just like rds_tcp_write_space() would
have done, were it called in time or still called.
This makes sure that we no longer have messages that we know
the receiver already dequeued sitting in our send-queue,
and therefore avoid the entire "RDS_FLAG_RETRANSMITTED" fiasco.
Now we got rid of the duplicate RDS message delivery, but we
still run into cases where RDS messages are dropped.
This time it is due to the delayed setting of the socket-callbacks
in rds_tcp_accept_one() via either rds_tcp_reset_callbacks()
or rds_tcp_set_callbacks().
By the time rds_tcp_accept_one() gets there, the socket
may already have transitioned into state "TCP_CLOSE_WAIT",
but rds_tcp_state_change() was never called.
Subsequently, "->shutdown(SHUT_WR)" did not happen either.
So the peer ends up getting stuck in state "TCP_FIN_WAIT2".
We fix that by checking for states "TCP_CLOSE_WAIT", "TCP_LAST_ACK",
or "TCP_CLOSE" and drop the freshly accepted socket in that case.
This problem is observable by running "rds-stress --reset"
frequently on either of the two sides of a RDS connection,
or both while other "rds-stress" processes are exchanging data.
Those "rds-stress" processes reported out-of-sequence
errors, with the expected sequence number being smaller
than the one actually received (due to the dropped messages).
Gerd Rausch [Tue, 3 Feb 2026 05:57:17 +0000 (22:57 -0700)]
net/rds: Encode cp_index in TCP source port
Upon "sendmsg", RDS/TCP selects a backend connection based
on a hash calculated from the source-port ("RDS_MPATH_HASH").
However, "rds_tcp_accept_one" accepts connections
in the order they arrive, which is non-deterministic.
Therefore the mapping of the sender's "cp->cp_index"
to that of the receiver changes if the backend
connections are dropped and reconnected.
However, connection state that's preserved across reconnects
(e.g. "cp_next_rx_seq") relies on that sender<->receiver
mapping to never change.
So we make sure that client and server of the TCP connection
have the exact same "cp->cp_index" across reconnects by
encoding "cp->cp_index" in the lower three bits of the
client's TCP source port.
A new extension "RDS_EXTHDR_SPORT_IDX" is introduced,
that allows the server to tell the difference between
clients that do the "cp->cp_index" encoding, and
legacy clients that pick source ports randomly.
Introduce a new extension header type RDSV3_EXTHDR_RDMA_BYTES for
an RDMA initiator to exchange rdma byte counts to its target.
Currently, RDMA operations cannot precisely account how many bytes a
peer just transferred via RDMA, which limits per-connection statistics
and future policy (e.g., monitoring or rate/cgroup accounting of RDMA
traffic).
In this patch we expand rds_message_add_extension to accept multiple
extensions, and add new flag to RDS header: RDS_FLAG_EXTHDR_EXTENSION,
along with a new extension to RDS header: rds_ext_header_rdma_bytes.
Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com> Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20260203055723.1085751-2-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Tue, 3 Feb 2026 21:47:16 +0000 (21:47 +0000)]
net_sched: sch_fq: tweak unlikely() hints in fq_dequeue()
After 076433bd78d7 ("net_sched: sch_fq: add fast path
for mostly idle qdisc") we need to remove one unlikely()
because q->internal holds all the fast path packets.
skb = fq_peek(&q->internal);
if (unlikely(skb)) {
q->internal.qlen--;
Calling INET_ECN_set_ce() is very unlikely.
These changes allow fq_dequeue_skb() to be (auto)inlined,
thus making fq_dequeue() faster.
Marco Crivellari [Wed, 24 Dec 2025 15:50:06 +0000 (16:50 +0100)]
ovpn: Replace use of system_wq with system_percpu_wq
This patch continues the effort to refactor workqueue APIs, which has begun
with the changes introducing new workqueues and a new alloc_workqueue flag:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
The point of the refactoring is to eventually alter the default behavior of
workqueues to become unbound by default so that their workload placement is
optimized by the scheduler.
Before that to happen after a careful review and conversion of each individual
case, workqueue users must be converted to the better named new workqueues with
no intended behaviour changes:
Randy Dunlap [Tue, 3 Feb 2026 07:52:48 +0000 (23:52 -0800)]
net/iucv: clean up iucv kernel-doc warnings
Fix numerous (many) kernel-doc warnings in iucv.[ch]:
- convert function documentation comments to a common (kernel-doc) look,
even for static functions (without "/**")
- use matching parameter and parameter description names
- use better wording in function descriptions (Jakub & AI)
- remove duplicate kernel-doc comments from the header file (Jakub)
Examples:
Warning: include/net/iucv/iucv.h:210 missing initial short description
on line: * iucv_unregister
Warning: include/net/iucv/iucv.h:216 function parameter 'handle' not
described in 'iucv_unregister'
Warning: include/net/iucv/iucv.h:467 function parameter 'answer' not
described in 'iucv_message_send2way'
Warning: net/iucv/iucv.c:727 missing initial short description on line:
* iucv_cleanup_queue
Build-tested with both "make htmldocs" and "make ARCH=s390 defconfig all".