====================
net/mlx5: HWS single flow counter support
This small series refactors the flow counter bulk initialization code
and extends it so that single flow counters are also usable by hardware
steering (HWS) rules.
Patches 1-2 refactor the bulk init path: first by factoring out common
flow counter bulk initialization into mlx5_fc_bulk_init(), then by
splitting the bitmap allocation into mlx5_fs_bulk_bitmap_alloc(), with
no functional changes.
Patch 3 initializes bulk data for counters allocated via
mlx5_fc_single_alloc(), so they can be safely used by HWS rules.
====================
Mark Bloch [Mon, 12 Jan 2026 09:40:24 +0000 (11:40 +0200)]
net/mlx5: fs, split bulk init
Refactor mlx5_fs_bulk_init() by moving bitmap allocation logic into a
new helper function mlx5_fs_bulk_bitmap_alloc(). This change does not
alter any logic.
Mark Bloch [Mon, 12 Jan 2026 09:40:23 +0000 (11:40 +0200)]
net/mlx5: fs, factor out flow counter bulk init
Add mlx5_fc_bulk_init() to handle bulk initialization of flow counters.
This change does not alter any logic, but refactors the code to remove
duplicate initialization logic by centralizing it in a single function.
====================
Introduce and use netif_xmit_timeout_ms() helper
This is V2, find V1 here:
https://lore.kernel.org/all/1764054776-1308696-1-git-send-email-tariqt@nvidia.com/
This series by Shahar introduces a new helper function
netif_xmit_timeout_ms() to check if a TX queue has timed out and report
the timeout duration.
It also encapsulates the check for whether the TX queue is stopped.
Replace duplicated open-coded timeout check in hns3 driver with the new
helper.
For mlx5e, refine the TX timeout recovery flow to act only on SQs whose
transmit timestamp indicates an actual timeout, as determined by the
helper. This prevents unnecessary channel reopen events caused by
attempting recovery on queues that are merely stopped but not truly
timed out.
====================
Shahar Shitrit [Mon, 12 Jan 2026 09:16:23 +0000 (11:16 +0200)]
net/mlx5e: Refine TX timeout handling to skip non-timed-out SQ
mlx5e_tx_timeout_work() is invoked when the dev_watchdog reports a
timed-out TX queue. Currently, the recovery flow is triggered for all
stopped SQs, which is not always correct — some SQs may be temporarily
stopped without actually timing out. Attempting to recover such SQs
results in no EQE being polled (since no real timeout occurred), which
the driver misinterprets as a recovery failure, unnecessarily causing
channel reopening.
Improve the logic to initiate recovery only for SQs that are both
stopped and timed out. Utilize the helper introduced in the previous
patch to determine whether the netdevice watchdog timeout period has
elapsed since the SQ’s last transmit timestamp.
Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Yael Chemla <ychemla@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1768209383-1546791-4-git-send-email-tariqt@nvidia.com Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Shahar Shitrit [Mon, 12 Jan 2026 09:16:21 +0000 (11:16 +0200)]
net: Introduce netif_xmit_timeout_ms() helper
Introduce a new helper function netif_xmit_timeout_ms() to check
if a TX queue is stopped and has timed out and report the timeout
duration. This makes the timeout logic reusable, and will be used
in several places in subsequent patches.
Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Yael Chemla <ychemla@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1768209383-1546791-2-git-send-email-tariqt@nvidia.com Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jason Xing [Sun, 4 Jan 2026 01:21:25 +0000 (09:21 +0800)]
xsk: move cq_cached_prod_lock to avoid touching a cacheline in sending path
We (Paolo and I) noticed that in the sending path touching an extra
cacheline due to cq_cached_prod_lock will impact the performance. After
moving the lock from struct xsk_buff_pool to struct xsk_queue, the
performance is increased by ~5% which can be observed by xdpsock.
An alternative approach [1] can be using atomic_try_cmpxchg() to have the
same effect. But unfortunately I don't have evident performance numbers to
prove the atomic approach is better than the current patch. The advantage
is to save the contention time among multiple xsks sharing the same pool
while the disadvantage is losing good maintenance. The full discussion can
be found at the following link.
Jason Xing [Sun, 4 Jan 2026 01:21:24 +0000 (09:21 +0800)]
xsk: advance cq/fq check when shared umem is used
In the shared umem mode with different queues or devices, either
uninitialized cq or fq is not allowed which was previously done in
xp_assign_dev_shared(). The patch advances the check at the beginning
so that 1) we can avoid a few memory allocation and stuff if cq or fq
is NULL, 2) it can be regarded as preparation for the next patch in
the series.
Dipayaan Roy [Mon, 12 Jan 2026 13:05:52 +0000 (05:05 -0800)]
net: mana: Implement ndo_tx_timeout and serialize queue resets per port.
Implement .ndo_tx_timeout for MANA so any stalled TX queue can be detected
and a device-controlled port reset for all queues can be scheduled to a
ordered workqueue. The reset for all queues on stall detection is
recomended by hardware team.
====================
net: phy: fixed_phy: replace list of fixed PHYs with static array
Due to max 32 PHY addresses being available per mii bus, using a list
can't support more fixed PHY's. And there's no known use case for as
much as 32 fixed PHY's on a system. 8 should be plenty of fixed PHY's,
so use an array of that size instead of a list. This allows to
significantly reduce the code size and complexity.
In addition replace heavy-weight IDA with a simple bitmap.
====================
Heiner Kallweit [Sun, 11 Jan 2026 12:43:22 +0000 (13:43 +0100)]
net: phy: fixed_phy: replace IDA with a bitmap
Size of array fmb_fixed_phys is small, so we can use a simple bitmap
instead of an IDA to manage dynamic allocation of fixed PHY's.
find_first_zero_bit() isn't atomic, so we need the loop to rule out
double allocation of a PHY address.
Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/d4614463-d532-41fc-92e9-ef97107aceb5@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Heiner Kallweit [Sun, 11 Jan 2026 12:41:39 +0000 (13:41 +0100)]
net: phy: fixed_phy: replace list of fixed PHYs with static array
Due to max 32 PHY addresses being available per mii bus, using a list
can't support more fixed PHY's. And there's no known use case for as
much as 32 fixed PHY's on a system. 8 should be plenty of fixed PHY's,
so use an array of that size instead of a list. This allows to
significantly reduce the code size and complexity.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/8610d30c-eac7-4100-9008-d3b6cee6a5cd@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
ipv6: Allow for nexthop device mismatch with "onlink"
This patchset aligns IPv6 with IPv4 with respect to the "onlink" keyword
and allows IPv6 routes to be configured with a gateway address that is
resolved out of a different interface than the one specified.
Patches #1-#3 are small preparations in the existing "onlink" selftest.
Patch #4 is the actual change. See the commit message for detailed
description and motivation.
Patch #5 adds test cases for both address families, to make sure that
this use case does not regress.
====================
Ido Schimmel [Sun, 11 Jan 2026 12:08:13 +0000 (14:08 +0200)]
selftests: fib-onlink: Add test cases for nexthop device mismatch
Add test cases that verify that when the "onlink" keyword is specified,
both address families (with and without VRF) accept routes with a
gateway address that is reachable via a different interface than the one
specified.
Output without "ipv6: Allow for nexthop device mismatch with "onlink"":
Ido Schimmel [Sun, 11 Jan 2026 12:08:12 +0000 (14:08 +0200)]
ipv6: Allow for nexthop device mismatch with "onlink"
IPv4 allows for a nexthop device mismatch when the "onlink" keyword is
specified:
# ip link add name dummy1 up type dummy
# ip address add 192.0.2.1/24 dev dummy1
# ip link add name dummy2 up type dummy
# ip route add 198.51.100.0/24 nexthop via 192.0.2.2 dev dummy2
Error: Nexthop has invalid gateway.
# ip route add 198.51.100.0/24 nexthop via 192.0.2.2 dev dummy2 onlink
# echo $?
0
This seems to be consistent with the description of "onlink" in the
ip-route man page: "Pretend that the nexthop is directly attached to
this link, even if it does not match any interface prefix".
On the other hand, IPv6 rejects a nexthop device mismatch, even when
"onlink" is specified:
# ip link add name dummy1 up type dummy
# ip address add 2001:db8:1::1/64 dev dummy1
# ip link add name dummy2 up type dummy
# ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2
RTNETLINK answers: No route to host
# ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2 onlink
Error: Nexthop has invalid gateway or device mismatch.
This is intentional according to commit fc1e64e1092f ("net/ipv6: Add
support for onlink flag") which added IPv6 "onlink" support and states
that "any unicast gateway is allowed as long as the gateway is not a
local address and if it resolves it must match the given device".
The condition was later relaxed in commit 4ed591c8ab44 ("net/ipv6: Allow
onlink routes to have a device mismatch if it is the default route") to
allow for a nexthop device mismatch if the gateway address is resolved
via the default route:
# ip link add name dummy1 up type dummy
# ip route add ::/0 dev dummy1
# ip link add name dummy2 up type dummy
# ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2
RTNETLINK answers: No route to host
# ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2 onlink
# echo $?
0
While the decision to forbid a nexthop device mismatch in IPv6 seems to
be intentional, it is unclear why it was made. Especially when it
differs from IPv4 and seems to go against the intended behavior of
"onlink".
Therefore, relax the condition further and allow for a nexthop device
mismatch when "onlink" is specified:
# ip link add name dummy1 up type dummy
# ip address add 2001:db8:1::1/64 dev dummy1
# ip link add name dummy2 up type dummy
# ip route add 2001:db8:10::/64 nexthop via 2001:db8:1::2 dev dummy2 onlink
# echo $?
0
The motivating use case is the fact that FRR would like to be able to
configure overlay routes of the following form:
# ip route add <host-Z> vrf <VRF> encap ip id <ID> src <VTEP-A> dst <VTEP-Z> via <VTEP-Z> dev vxlan0 onlink
Where vxlan0 is in the default VRF in which "VTEP-Z" is reachable via
one of the underlay routes (e.g., via swpX). Without this patch, the
above only works with IPv4, but not with IPv6.
Ido Schimmel [Sun, 11 Jan 2026 12:08:11 +0000 (14:08 +0200)]
selftests: fib-onlink: Add a test case for IPv4 multicast gateway
A multicast gateway address should be rejected when "onlink" is
specified, but it is only tested as part of the IPv6 tests. Add an
equivalent IPv4 test.
# ./fib-onlink-tests.sh -v
[...]
COMMAND: ip ro add table 254 169.254.101.12/32 via 233.252.0.1 dev veth1 onlink
Error: Nexthop has invalid gateway.
TEST: Invalid gw - multicast address [ OK ]
[...]
COMMAND: ip ro add table 1101 169.254.102.12/32 via 233.252.0.1 dev veth5 onlink
Error: Nexthop has invalid gateway.
The command in the test fails as expected because IPv6 forbids a nexthop
device mismatch:
# ./fib-onlink-tests.sh -v
[...]
COMMAND: ip -6 ro add table 1101 2001:db8:102::103/128 via 2001:db8:701::64 dev veth5 onlink
Error: Nexthop has invalid gateway or device mismatch.
TEST: Gateway resolves to wrong nexthop device - VRF [ OK ]
[...]
Where:
# ip route get 2001:db8:701::64 vrf lisa
2001:db8:701::64 dev veth7 table 1101 proto kernel src 2001:db8:701::1 metric 256 pref medium
This is in contrast to IPv4 where a nexthop device mismatch is allowed
when "onlink" is specified:
# ip route get 169.254.7.2 vrf lisa
169.254.7.2 dev veth7 table 1101 src 169.254.7.1 uid 0
# ip ro add table 1101 169.254.102.103/32 via 169.254.7.2 dev veth5 onlink
# echo $?
0
Remove these tests in preparation for aligning IPv6 with IPv4 and
allowing nexthop device mismatch when "onlink" is specified.
A subsequent patch will add tests that verify that both address families
allow a nexthop device mismatch with "onlink".
According to the test description, these tests fail because of a wrong
nexthop device:
# ./fib-onlink-tests.sh -v
[...]
COMMAND: ip ro add table 254 169.254.101.102/32 via 169.254.3.1 dev veth1 onlink
Error: Nexthop has invalid gateway.
TEST: Gateway resolves to wrong nexthop device [ OK ]
COMMAND: ip ro add table 1101 169.254.102.103/32 via 169.254.7.1 dev veth5 onlink
Error: Nexthop has invalid gateway.
TEST: Gateway resolves to wrong nexthop device - VRF [ OK ]
[...]
But this is incorrect. They fail because the gateway addresses are local
addresses:
# ip -4 address show
[...]
28: veth3@if27: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 link-netns peer_ns-Urqh3o
inet 169.254.3.1/24 scope global veth3
[...]
32: veth7@if31: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master lisa state UP group default qlen 1000 link-netns peer_ns-Urqh3o
inet 169.254.7.1/24 scope global veth7
Therefore, using a local address that matches the nexthop device fails
as well:
# ip ro add table 254 169.254.101.102/32 via 169.254.3.1 dev veth3 onlink
Error: Nexthop has invalid gateway.
Using a gateway address with a "wrong" nexthop device is actually valid
and allowed:
# ip route get 169.254.1.2
169.254.1.2 dev veth1 src 169.254.1.1 uid 0
# ip ro add table 254 169.254.101.102/32 via 169.254.1.2 dev veth3 onlink
# echo $?
0
Remove these tests given that their output is confusing and that the
scenario that they are testing is already covered by other tests.
A subsequent patch will add tests for the nexthop device mismatch
scenario.
- This is only a first phase. It instantiates the port, and leverage
that to make the MAC <-> PHY <-> SFP usecase simpler.
- Next phase will deal with controlling the port state, as well as the
netlink uAPI for that.
- The end-goal is to enable support for complex port MUX. This
preliminary work focuses on PHY-driven ports, but this will be
extended to support muxing at the MII level (Multi-phy, or compo PHY
+ SFP as found on Turris Omnia for example).
- The naming is definitely not set in stone. I named that "phy_port",
but this may convey the false sense that this is phylib-specific.
Even the word "port" is not that great, as it already has several
different meanings in the net world (switch port, devlink port,
etc.). I used the term "connector" in the binding.
A bit of history on that work :
The end goal that I personnaly want to achieve is :
+ PHY - RJ45
|
MAC - MUX -+ PHY - RJ45
After many discussions here on netdev@, but also at netdevconf[1] and
LPC[2], there appears to be several analoguous designs that exist out
there.
[1] : https://netdevconf.info/0x17/sessions/talk/improving-multi-phy-and-multi-port-interfaces.html
[2] : https://lpc.events/event/18/contributions/1964/ (video isn't the
right one)
Take the MAchiatobin, it has 2 interfaces that looks like this :
MAC - PHY -+ RJ45
|
+ SFP - Whatever the module does
Now, looking at the Turris Omnia, we have :
MAC - MUX -+ PHY - RJ45
|
+ SFP - Whatever the module does
We can find more example of this kind of designs, the common part is
that we expose multiple front-facing media ports. This is what this
current work aims at supporting. As of right now, it does'nt add any
support for muxing, but this will come later on.
This first phase focuses on phy-driven ports only, but there are already
quite some challenges already. For one, we can't really autodetect how
many ports are sitting behind a PHY. That's why this series introduces a
new binding. Describing ports in DT should however be a last-resort
thing when we need to clear some ambiguity about the PHY media-side.
The only use-cases that we have today for multi-port PHYs are combo PHYs
that drive both a Copper port and an SFP (the Macchiatobin case). This
in itself is challenging and this series only addresses part of this
support, by registering a phy_port for the PHY <-> SFP connection. The
SFP module should in the end be considered as a port as well, but that's
not yet the case.
However, because now PHYs can register phy_ports for every media-side
interface they have, they can register the capabilities of their ports,
which allows making the PHY-driver SFP case much more generic.
====================
net: phy: Only rely on phy_port for PHY-driven SFP
Now that all PHY drivers that support downstream SFP have been converted
to phy_port serdes handling, we can make the generic PHY SFP handling
mandatory, thus making all phylib sfp helpers static.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-14-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: phy: qca807x: Support SFP through phy_port interface
QCA8072/8075 may be used as combo-port PHYs, with Serdes (100/1000BaseX)
and Copper interfaces. The PHY has the ability to read the configuration
it's in. If the configuration indicates the PHY is in combo mode, allow
registering up to 2 ports.
Register a dedicated set of port ops to handle the serdes port, and rely
on generic phylib SFP support for the SFP handling.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-13-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: phy: at803x: Support SFP through phy_port interface
Convert the at803x driver to use the generic phylib SFP handling, via a
dedicated .attach_port() callback, populating the supported interfaces.
As these devices are limited to 1000BaseX, a workaround is used to also
support, in a very limited way, copper modules. This is done by
supporting SGMII but limiting it to 1G full duplex (in which case it's
somewhat compatible with 1000BaseX).
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-12-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: phy: marvell10g: Support SFP through phy_port
Convert the Marvell10G driver to use the generic SFP handling, through a
dedicated .attach_port() handler to populate the port's supported
interfaces.
As the 88x3310 supports multiple MDI, the .attach_port() logic handles
both SFP attach with 10GBaseR support, and support for the "regular"
port that usually is a BaseT port.
Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-11-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: phy: marvell: Support SFP through phy_port interface
Convert the Marvell driver (especially the 88e1512 driver) to use the
phy_port interface to handle SFPs. This means registering a
.attach_port() handler to detect when a serdes line interface is used
(most likely, and SFP module).
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-10-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: phy: Introduce generic SFP handling for PHY drivers
There are currently 4 PHY drivers that can drive downstream SFPs:
marvell.c, marvell10g.c, at803x.c and marvell-88x2222.c. Most of the
logic is boilerplate, either calling into generic phylib helpers (for
SFP PHY attach, bus attach, etc.) or performing the same tasks with a
bit of validation :
- Getting the module's expected interface mode
- Making sure the PHY supports it
- Optionaly perform some configuration to make sure the PHY outputs
the right mode
This can be made more generic by leveraging the phy_port, and its
configure_mii() callback which allows setting a port's interfaces when
the port is a serdes.
Introduce a generic PHY SFP support. If a driver doesn't probe the SFP
bus itself, but an SFP phandle is found in devicetree/firmware, then the
generic PHY SFP support will be used, relying on port ops.
PHY driver need to :
- Register a .attach_port() callback
- When a serdes port is registered to the PHY, drivers must set
port->interfaces to the set of PHY_INTERFACE_MODE the port can output
- If the port has limitations regarding speed, duplex and aneg, the
port can also fine-tune the final linkmodes that can be supported
- The port may register a set of ops, including .configure_mii(), that
will be called at module_insert time to adjust the interface based on
the module detected.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-8-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Some PHY devices may be used as media-converters to drive SFP ports (for
example, to allow using SFP when the SoC can only output RGMII). This is
already supported to some extend by allowing PHY drivers to registers
themselves as being SFP upstream.
However, the logic to drive the SFP can actually be split to a per-port
control logic, allowing support for multi-port PHYs, or PHYs that can
either drive SFPs or Copper.
To that extent, create a phy_port when registering an SFP bus onto a
PHY. This port is considered a "serdes" port, in that it can feed data
to another entity on the link. The PHY driver needs to specify the
various PHY_INTERFACE_MODE_XXX that this port supports.
Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-7-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The newly added ethernet-connector binding allows describing an Ethernet
connector with greater precision, and in a more generic manner, than
ti,fiber-mode. Deprecate this property.
Reviewed-by: Rob Herring (Arm) <robh@kernel.org> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-6-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: phy: dp83822: Add support for phy_port representation
With the phy_port representation introduced, we can use .attach_port to
populate the port information based on either the straps or the
ti,fiber-mode property. This allows simplifying the probe function and
allow users to override the strapping configuration.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260108080041.553250-5-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ethernet provides a wide variety of layer 1 protocols and standards for
data transmission. The front-facing ports of an interface have their own
complexity and configurability.
Introduce a representation of these front-facing ports. The current code
is minimalistic and only support ports controlled by PHY devices, but
the plan is to extend that to SFP as well as raw Ethernet MACs that
don't use PHY devices.
This minimal port representation allows describing the media and number
of pairs of a BaseT port. From that information, we can derive the
linkmodes usable on the port, which can be used to limit the
capabilities of an interface.
For now, the port pairs and medium is derived from devicetree, defined
by the PHY driver, or populated with default values (as we assume that
all PHYs expose at least one port).
The typical example is 100M ethernet. 100BaseTX works using only 2
pairs on a Cat 5 cables. However, in the situation where a 10/100/1000
capable PHY is wired to its RJ45 port through 2 pairs only, we have no
way of detecting that. The "max-speed" DT property can be used, but a
more accurate representation can be used :
In an effort to have a better representation of Ethernet ports,
introduce enumeration values representing the various ethernet Mediums.
This is part of the 802.3 naming convention, for example :
1000 Base T 4
| | | |
| | | \_ pairs (4)
| | \___ Medium (T == Twisted Copper Pairs)
| \_______ Baseband transmission
\____________ Speed
Other example :
10000 Base K X 4
| | \_ lanes (4)
| \___ encoding (BaseX is 8b/10b while BaseR is 66b/64b)
\_____ Medium (K is backplane ethernet)
In the case of representing a physical port, only the medium and number
of pairs should be relevant. One exception would be 1000BaseX, which is
currently also used as a medium in what appears to be any of 1000BaseSX,
1000BaseCX, 1000BaseLX, 1000BaseEX, 1000BaseBX10 and some other.
This was reflected in the mediums associated with the 1000BaseX linkmode.
These mediums are set in the net/ethtool/common.c lookup table that
maintains a list of all linkmodes with their number of pairs, medium,
encoding, speed and duplex.
One notable exception to this is 100BaseT Ethernet. It emcompasses 100BaseTX,
which is a 2-pairs protocol but also 100BaseT4, that will also work on 4-pairs
cables. As we don't make a disctinction between these, the lookup table
contains 2 sets of pair numbers, indicating the min number of pairs for a
protocol to work and the "nominal" number of pairs as well.
Another set of exceptions are linkmodes such 100000baseLR4_ER4, where
the same link mode seems to represent 100GBaseLR4 and 100GBaseER4. The
macro __DEFINE_LINK_MODE_PARAMS_MEDIUMS is here used to populate the
.mediums bitfield with all appropriate mediums.
dt-bindings: net: Introduce the ethernet-connector description
The ability to describe the physical ports of Ethernet devices is useful
to describe multi-port devices, as well as to remove any ambiguity with
regard to the nature of the port.
Moreover, describing ports allows for a better description of features
that are tied to connectors, such as PoE through the PSE-PD devices.
Introduce a binding to allow describing the ports, for now with 2
attributes :
- The number of pairs, which is a quite generic property that allows
differentating between multiple similar technologies such as BaseT1
and "regular" BaseT (which usually means BaseT4).
- The media that can be used on that port, such as BaseT for Twisted
Copper, BaseC for coax copper, BaseS/L for Fiber, BaseK for backplane
ethernet, etc. This allows defining the nature of the port, and
therefore avoids the need for vendor-specific properties such as
"micrel,fiber-mode" or "ti,fiber-mode".
The port description lives in its own file, as it is intended in the
future to allow describing the ports for phy-less devices.
====================
selftests: drv-net: gro: enable HW GRO and LRO testing
Add support for running our existing GRO test against HW GRO
and LRO implementation. The first 3 patches are just ksft lib
nice-to-haves, and patch 4 cleans up the existing gro Python.
Patches 5 and 6 are of most practical interest. The support
reconfiguring the NIC to disable SW GRO and enable HW GRO and LRO.
Additionally last patch breaks up the existing GRO cases to
track HW compliance at finer granularity.
====================
Jakub Kicinski [Tue, 13 Jan 2026 00:07:40 +0000 (16:07 -0800)]
selftests: drv-net: gro: break out all individual test cases
GRO test groups the cases into categories, e.g. "tcp" case
checks coalescing in presence of:
- packets with bad csum,
- sequence number mismatch,
- timestamp option value mismatch,
- different TCP options.
Since we now have TAP support grouping the cases like that
lowers our reporting granularity. This matters even more for
NICs performing HW GRO and LRO since it appears that most
implementation have _some_ bugs. Flagging the whole group
of tests as failed prevents us from catching regressions
in the things that work today.
Jakub Kicinski [Tue, 13 Jan 2026 00:07:38 +0000 (16:07 -0800)]
selftests: drv-net: gro: improve feature config
We'll need to do a lot more feature handling to test HW-GRO and LRO.
Clean up the feature handling for SW GRO a bit to let the next commit
focus on the new test cases, only.
Make sure HW GRO-like features are not enabled for the SW tests.
Be more careful about changing features as "nothing changed"
situations may result in non-zero error code from ethtool.
Don't disable TSO on the local interface (receiver) when running over
netdevsim, we just want GSO to break up the segments on the sender.
Bobby Eshleman [Tue, 13 Jan 2026 03:56:31 +0000 (19:56 -0800)]
tools/net/ynl: suppress jobserver warning in ynltool version detection
When building ynltool with parallel make (-jN), a warning is emitted:
make[1]: warning: jobserver unavailable: using -j1.
Add '+' to parent make rule.
The warning trips up local runs of NIPA's ingest_mdir.py, which
correctly fails on make warnings.
This occurs because SRC_VERSION uses $(shell make ...) to make
kernelversion. The $(shell) function inherits make's MAKEFLAGS env var
which specifies "--jobserver-auth=R,W" pointing to file descriptors that
the invoked make sub-shell does not have access to.
Observed with:
$ make --version | head -1
GNU Make 4.3
Instead of suppressing MAKEFLAGS and foregoing all future MAKEFLAGS
(some of which may be desirable, such as variable overrides) or
introducing a new make target, we instead just ignore the warning by
piping stderr to /dev/null. If 'make kernelversion' fails, the ' || echo
"unknown"' phrase will catch the failure.
Before:
NIPA ingest_mdir.py:
ynl
Full series FAIL (1)
Generated files up to date; build has 1 warnings/errors; no diff in
generated;
Jakub Kicinski [Wed, 14 Jan 2026 01:46:19 +0000 (17:46 -0800)]
Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Tariq Toukan says:
====================
mlx5-next updates 2026-01-13
* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Add IFC bits for extended ETS rate limit bandwidth value
net/mlx5: Add support for querying bond speed
net/mlx5: Handle port and vport speed change events in MPESW
net/mlx5: Propagate LAG effective max_tx_speed to vports
net/mlx5: Add max_tx_speed and its CAP bit to IFC
====================
This is subset 1 of the RDS-TCP bug fix collection series I posted last
Oct. The greater series aims to correct multiple rds-tcp bugs that
can cause dropped or out of sequence messages. The set was starting to
get a bit large, so I've broken it down into smaller sets to make
reviews more manageable.
In this subset, we focus on work queue scalability. Messages queues
are refactored to operate in parallel across multiple connections,
which improves response times and avoids timeouts.
The entire set can be viewed in the rfc here:
https://lore.kernel.org/netdev/20251022191715.157755-1-achender@kernel.org/
net/rds: Give each connection path its own workqueue
RDS was written to require ordered workqueues for "cp->cp_wq":
Work is executed in the order scheduled, one item at a time.
If these workqueues are shared across connections,
then work executed on behalf of one connection blocks work
scheduled for a different and unrelated connection.
Luckily we don't need to share these workqueues.
While it obviously makes sense to limit the number of
workers (processes) that ought to be allocated on a system,
a workqueue that doesn't have a rescue worker attached,
has a tiny footprint compared to the connection as a whole:
A workqueue costs ~900 bytes, including the workqueue_struct,
pool_workqueue, workqueue_attrs, wq_node_nr_active and the
node_nr_active flex array. Each connection can have up to 8
(RDS_MPATH_WORKERS) paths for a worst case of ~7 KBytes per
connection. While an RDS/IB connection totals only ~5 MBytes.
So we're getting a signficant performance gain
(90% of connections fail over under 3 seconds vs. 40%)
for a less than 0.02% overhead.
RDS doesn't even benefit from the additional rescue workers:
of all the reasons that RDS blocks workers, allocation under
memory pressue is the least of our concerns. And even if RDS
was stalling due to the memory-reclaim process, the work
executed by the rescue workers are highly unlikely to free up
any memory. If anything, they might try to allocate even more.
By giving each connection path its own workqueues, we allow
RDS to better utilize the unbound workers that the system
has available.
This patch adds a per connection workqueue which can be initialized
and used independently of the globally shared rds_wq.
This patch is the first in a series that aims to address tcp ack
timeouts during the tcp socket shutdown sequence.
This initial refactoring lays the ground work needed to alleviate
queue congestion during heavy reads and writes. The independently
managed queues will allow shutdowns and reconnects respond more quickly
before the peer(s) timeout waiting for the proper acks.
Paolo Abeni [Tue, 13 Jan 2026 10:54:31 +0000 (11:54 +0100)]
Merge branch 'multi-queue-aware-sch_cake'
says:
====================
Multi-queue aware sch_cake
This series adds a multi-queue aware variant of the sch_cake scheduler,
called 'cake_mq'. Using this makes it possible to scale the rate shaper
of sch_cake across multiple CPUs, while still enforcing a single global
rate on the interface.
The approach taken in this patch series is to implement a separate qdisc
called 'cake_mq', which is based on the existing 'mq' qdisc, but differs
in a couple of aspects:
- It will always install a cake instance on each hardware queue (instead
of using the default qdisc for each queue like 'mq' does).
- The cake instances on the queues will share their configuration, which
can only be modified through the parent cake_mq instance.
Doing things this way simplifies user configuration by centralising
all configuration through the cake_mq qdisc (which also serves as an
obvious way of opting into the multi-queue aware behaviour). The cake_mq
qdisc takes all the same configuration parameters as the cake qdisc.
An earlier version of this work was presented at this year's Netdevconf:
https://netdevconf.info/0x19/sessions/talk/mq-cake-scaling-software-rate-limiting-across-cpu-cores.html
The patch series is structured as follows:
- Patch 1 exports the mq qdisc functions for reuse.
- Patch 2 factors out the sch_cake configuration variables into a
separate struct that can be shared between instances.
- Patch 3 adds the basic cake_mq qdisc, reusing the exported mq code
- Patch 4 adds configuration sharing across the cake instances installed
under cake_mq
- Patch 5 adds the shared shaper state that enables the multi-core rate
shaping
- Patch 6 adds selftests for cake_mq
A patch to iproute2 to make it aware of the cake_mq qdisc were submitted
separately with a previous patch version:
Jonas Köppeler [Fri, 9 Jan 2026 13:15:35 +0000 (14:15 +0100)]
selftests/tc-testing: add selftests for cake_mq qdisc
Test 684b: Create CAKE_MQ with default setting (4 queues)
Test 7ee8: Create CAKE_MQ with bandwidth limit (4 queues)
Test 1f87: Create CAKE_MQ with rtt time (4 queues)
Test e9cf: Create CAKE_MQ with besteffort flag (4 queues)
Test 7c05: Create CAKE_MQ with diffserv8 flag (4 queues)
Test 5a77: Create CAKE_MQ with diffserv4 flag (4 queues)
Test 8f7a: Create CAKE_MQ with flowblind flag (4 queues)
Test 7ef7: Create CAKE_MQ with dsthost and nat flag (4 queues)
Test 2e4d: Create CAKE_MQ with wash flag (4 queues)
Test b3e6: Create CAKE_MQ with flowblind and no-split-gso flag (4 queues)
Test 62cd: Create CAKE_MQ with dual-srchost and ack-filter flag (4 queues)
Test 0df3: Create CAKE_MQ with dual-dsthost and ack-filter-aggressive flag (4 queues)
Test 9a75: Create CAKE_MQ with memlimit and ptm flag (4 queues)
Test cdef: Create CAKE_MQ with fwmark and atm flag (4 queues)
Test 93dd: Create CAKE_MQ with overhead 0 and mpu (4 queues)
Test 1475: Create CAKE_MQ with conservative and ingress flag (4 queues)
Test 7bf1: Delete CAKE_MQ with conservative and ingress flag (4 queues)
Test ee55: Replace CAKE_MQ with mpu (4 queues)
Test 6df9: Change CAKE_MQ with mpu (4 queues)
Test 67e2: Show CAKE_MQ class (4 queues)
Test 2de4: Change bandwidth of CAKE_MQ (4 queues)
Test 5f62: Fail to create CAKE_MQ with autorate-ingress flag (4 queues)
Test 038e: Fail to change setting of sub-qdisc under CAKE_MQ
Test 7bdc: Fail to replace sub-qdisc under CAKE_MQ
Test 18e0: Fail to install CAKE_MQ on single queue device
Jonas Köppeler [Fri, 9 Jan 2026 13:15:34 +0000 (14:15 +0100)]
net/sched: sch_cake: share shaper state across sub-instances of cake_mq
This commit adds shared shaper state across the cake instances beneath a
cake_mq qdisc. It works by periodically tracking the number of active
instances, and scaling the configured rate by the number of active
queues.
The scan is lockless and simply reads the qlen and the last_active state
variable of each of the instances configured beneath the parent cake_mq
instance. Locking is not required since the values are only updated by
the owning instance, and eventual consistency is sufficient for the
purpose of estimating the number of active queues.
The interval for scanning the number of active queues is set to 200 us.
We found this to be a good tradeoff between overhead and response time.
For a detailed analysis of this aspect see the Netdevconf talk:
net/sched: sch_cake: Add cake_mq qdisc for using cake on mq devices
Add a cake_mq qdisc which installs cake instances on each hardware
queue on a multi-queue device.
This is just a copy of sch_mq that installs cake instead of the default
qdisc on each queue. Subsequent commits will add sharing of the config
between cake instances, as well as a multi-queue aware shaper algorithm.
net/sched: sch_cake: Factor out config variables into separate struct
Factor out all the user-configurable variables into a separate struct
and embed it into struct cake_sched_data. This is done in preparation
for sharing the configuration across multiple instances of cake in an mq
setup.
To enable the cake_mq qdisc to reuse code from the mq qdisc, export a
bunch of functions from sch_mq. Split common functionality out from some
functions so it can be composed with other code, and export other
functions wholesale. To discourage wanton reuse, put the symbols into a
new NET_SCHED_INTERNAL namespace, and a sch_priv.h header file.
Javen Xu [Fri, 9 Jan 2026 07:04:14 +0000 (15:04 +0800)]
r8169: add DASH support for RTL8127AP
This adds DASH support for chip RTL8127AP. Its mac version is
RTL_GIGA_MAC_VER_80 and revision id is 0x04. DASH is a standard for
remote management of network device, allowing out-of-band control.
Alexei Lazar [Mon, 12 Jan 2026 06:50:08 +0000 (08:50 +0200)]
net/mlx5: Add IFC bits for extended ETS rate limit bandwidth value
Add hardware interface definitions to support extended bandwidth rate
limiting in the QoS Enhanced Transmission Selection (ETS) configuration.
The new fields include:
- max_bw_value: extended from 8-bit to 16-bit in ets_tcn_config_reg,
simplifying the implementation by using a single field instead of
separate MSB/LSB fields.
- qetcr_qshr_max_bw_val_msb: capability bit in qcam_qos_feature_cap_mask
indicating device support for the extended 16-bit max_bw_value field.
These interface additions are prerequisites for increasing the per-TC
rate limit beyond 255 Gbps to support higher-bandwidth NICs.
====================
net: stmmac: pcs: clean up pcs interrupt handling
Clean up the stmmac PCS interrupt handling:
- Avoid promotion to unsigned long from unsigned int by defining PCS
register bits/fields using u32 macros.
- Pass struct stmmac_priv into the host_irq_status MAC core method.
- Move the existing PCS interrupt handler (dwmac_pcs_isr) into
stmmac_pcs.c, change it's arguments, use dev_info() rather than
pr_info()
- arrange to call phylink_pcs_change() on link state changes.
====================
Report PCS link changes to phylink, which will allow phylink's inband
support to respoind to link events once the PCS is appropriately
configured.
An expected behavioural change is that should the PCS report that its
link has failed, but phylink is operating in outband mode and the PHY
reports that link is up, this event will cause the netdev's link to
momentarily drop, making the event more noticable, rather than just
producing a "stmmac_pcs: Link Down" message.
net: stmmac: change arguments to PCS handler and use dev_info()
Change the arguments to the PCS handler so that it can access the
struct device pointer and integrated PCS pointers.
This allows us to use the PCS register offset stored in struct
stmmac_pcs rather than passing it into the function, and also allows
the messages to be printed using dev_info() rather than pr_info(),
thereby allowing the stmmac instance to be identified.
Finally, as dev_info() identifies the driver/device, prefixing with
"stmmac_pcs: " is now redundant, so replace this with just "PCS ".
net: stmmac: pass struct stmmac_priv to host_irq_status() method
Rather than passing struct mac_device_info to the host_irq_status()
method, pass struct stmmac_priv so that we can pass the integrated
PCS to the PCS interrupt handler.
dwmac_pcs_isr() doesn't need to be inlined into the MAC's
host_irq_status method, as handling PCS interrupts isn't performance
critical. However, there is little point calling this function unless
an interrupt is pending for the PCS.
Rename it to stmmac_integrated_pcs_irq() while moving it.
net: stmmac: use BIT_U32() and GENMASK_U32() for PCS registers
stmmac registers a 32-bit. u32 is unsigned int. The use of BIT() and
GENMASK() leads to integer promotion to unsigned long in expressions
such as:
u32 old = foo;
dev_info(dev, "%08x %08x\n", old, old & BIT(1));
resulting in arg2 being accepted as compatible with the format string
and arg3 warning that the argument does not match (because the former
is unsigned int, and the latter is unsigned long.)
Fix this by defining 32-bit register bits using BIT_U32() and
GENMASK_U32() macros.
====================
r8169: add support for RTL8127ATF (10G Fiber SFP)
RTL8127ATF supports a SFP+ port for fiber modules (10GBASE-SR/LR/ER/ZR and
DAC). The list of supported modes was provided by Realtek. According to the
r8127 vendor driver also 1G modules are supported, but this needs some more
complexity in the driver, and only 10G mode has been tested so far.
Therefore mainline support will be limited to 10G for now.
The SFP port signals are hidden in the chip IP and driven by firmware.
Therefore mainline SFP support can't be used here.
The PHY driver is used by the RTL8127ATF support in r8169.
RTL8127ATF reports the same PHY ID as the TP version. Therefore use a dummy
PHY ID.
====================
Heiner Kallweit [Sat, 10 Jan 2026 15:15:32 +0000 (16:15 +0100)]
r8169: add support for RTL8127ATF (Fiber SFP)
RTL8127ATF supports a SFP+ port for fiber modules (10GBASE-SR/LR/ER/ZR and
DAC). The list of supported modes was provided by Realtek. According to the
r8127 vendor driver also 1G modules are supported, but this needs some more
complexity in the driver, and only 10G mode has been tested so far.
Therefore mainline support will be limited to 10G for now.
The SFP port signals are hidden in the chip IP and driven by firmware.
Therefore mainline SFP support can't be used here.
Heiner Kallweit [Sat, 10 Jan 2026 15:14:05 +0000 (16:14 +0100)]
net: phy: realtek: add dummy PHY driver for RTL8127ATF
RTL8127ATF supports a SFP+ port for fiber modules (10GBASE-SR/LR/ER/ZR and
DAC). The list of supported modes was provided by Realtek. According to the
r8127 vendor driver also 1G modules are supported, but this needs some more
complexity in the driver, and only 10G mode has been tested so far.
Therefore mainline support will be limited to 10G for now.
The SFP port signals are hidden in the chip IP and driven by firmware.
Therefore mainline SFP support can't be used here.
This PHY driver is used by the RTL8127ATF support in r8169.
RTL8127ATF reports the same PHY ID as the TP version. Therefore use a dummy
PHY ID. This PHY driver is used by the RTL8127ATF support in r8169.
Jian Zhang [Thu, 8 Jan 2026 10:18:29 +0000 (18:18 +0800)]
net: mctp-i2c: fix duplicate reception of old data
The MCTP I2C slave callback did not handle I2C_SLAVE_READ_REQUESTED
events. As a result, i2c read event will trigger repeated reception of
old data, reset rx_pos when a read request is received.
====================
Add DWMAC glue driver for Motorcomm YT6801
This series adds glue driver for Motorcomm YT6801 PCIe ethernet
controller, which is considered mostly compatible with DWMAC-4 IP by
inspecting the register layout[1]. It integrates a Motorcomm YT8531S PHY
(confirmed by reading PHY ID) and GMII is used to connect the PHY to
MAC[2].
The initialization logic of the MAC is mostly based on previous upstream
effort for the controller[3] and the Deepin-maintained downstream Linux
driver[4] licensed under GPL-2.0 according to its SPDX headers. However,
this series is a completely re-write of the previous patch series,
utilizing the existing DWMAC4 driver and introducing a glue driver only.
This series only aims to add basic networking functions for the
controller, features like WoL, RSS and LED control are omitted for now.
Testing is done on i3-4170, it reaches 939Mbps (TX)/933Mbps (RX) on
average,
Yao Zi [Fri, 9 Jan 2026 09:34:45 +0000 (09:34 +0000)]
net: stmmac: Add glue driver for Motorcomm YT6801 ethernet controller
Motorcomm YT6801 is a PCIe ethernet controller based on DWMAC4 IP. It
integrates an GbE phy, supporting WOL, VLAN tagging and various types
of offloading. It ships an on-chip eFuse for storing various vendor
configuration, including MAC address.
This patch adds basic glue code for the controller, allowing it to be
set up and transmit data at a reasonable speed. Features like WOL could
be implemented in the future.
Signed-off-by: Yao Zi <me@ziyao.cc> Tested-by: Mingcong Bai <jeffbai@aosc.io> Tested-by: Runhua He <hua@aosc.io> Tested-by: Xi Ruoyao <xry111@xry111.site> Reviewed-by: Sai Krishna <saikrishnag@marvell.com> Link: https://patch.msgid.link/20260109093445.46791-4-me@ziyao.cc Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yao Zi [Fri, 9 Jan 2026 09:34:44 +0000 (09:34 +0000)]
net: phy: motorcomm: Support YT8531S PHY in YT6801 Ethernet controller
YT6801's internal PHY is confirmed as a GMII-capable variant of YT8531S
by a previous series[1] and reading PHY ID. Add support for
PHY_INTERFACE_MODE_GMII for YT8531S to allow the Ethernet driver to
reuse the PHY code for its internal PHY.
Ankit Khushwaha [Fri, 9 Jan 2026 15:22:01 +0000 (20:52 +0530)]
selftests/net/ipsec: Fix variable size type not at the end of struct
The "struct alg" object contains a union of 3 xfrm structures:
union {
struct xfrm_algo;
struct xfrm_algo_aead;
struct xfrm_algo_auth;
}
All of them end with a flexible array member used to store key material,
but the flexible array appears at *different offsets* in each struct.
bcz of this, union itself is of variable-sized & Placing it above
char buf[...] triggers:
ipsec.c:835:5: warning: field 'u' with variable sized type 'union
(unnamed union at ipsec.c:831:3)' not at the end of a struct or class
is a GNU extension [-Wgnu-variable-sized-type-not-at-end]
835 | } u;
| ^
one fix is to use "TRAILING_OVERLAP()" which works with one flexible
array member only.
But In "struct alg" flexible array member exists in all union members,
but not at the same offset, so TRAILING_OVERLAP cannot be applied.
so the fix is to explicitly overlay the key buffer at the correct offset
for the largest union member (xfrm_algo_auth). This ensures that the
flexible-array region and the fixed buffer line up.
====================
net: stmmac: cleanups and low priority fixes
Further cleanups and a few low priority fixes:
- Remove duplicated register definitions from header files
- Fix harmless wrong definition used for PTP message type in
descriptors
- Fix norm_set_tx_desc_len_on_ring() off-by-one error (and make
enh_set_tx_desc_len_on_ring() follow a similar pattern.)
Document the buffer size limits. I believe we never call
norm_set_tx_desc_len_on_ring() with 2KiB lengths.
- use u32 rather than unsigned int for 32-bit quantities in
descriptors
- modernise: convert to use FIELD_PREP() rather than separate mask
and shift definitions.
- Reorganise register and register field definitions: registers
defined in address offset order followed by their register field
definitions.
- Remove lots of unused register definitions.
====================
net: stmmac: cores: remove many xxx_SHIFT definitions
We have many xxx_SHIFT definitions along side their corresponding
xxx_MASK definitions for the various cores. Manually using the
shift and mask can be error prone, as shown with the dwmac4 RXFSTS
fix patch.
Convert sites that use xxx_SHIFT and xxx_MASK directly to use
FIELD_GET(), FIELD_PREP(), and u32_replace_bits() as appropriate.
net: stmmac: descs: remove many xxx_SHIFT definitions
Remove many xxx_SHIFT definitions for descriptors, isntead using
FIELD_PREP(), FIELD_GET(), and u32_replace_bits() as appropriate to
manipulate the bitfields. This avoids potential errors where an
incorrect shift is used with a mask.
Use u32 rather than unsigned int for 32-bit descriptor variables.
This will allow the u32 bitfield helpers to be used. Note, we use
__le32 for the in-memory descriptor structures.
norm_set_tx_desc_len_on_ring() incorrectly tests the buffer length,
leading to a length of 2048 being squeezed into a bitfield covering
bits 10:0 - which results in the buffer 1 size being zero.
If this field is zero, buffer 1 is ignored, and thus is equivalent to
transmitting a zero length buffer.
The path to norm_set_tx_desc_len_on_ring() is only possible when the
hardware does not support enhanced descriptors (plat->enh_desc clear)
which is dependent on the hardware.
The correct definition is RDES1_PTP_MSG_TYPE_MASK, which is also
defined as:
#define RDES1_PTP_MSG_TYPE_MASK GENMASK(11, 8)
Use the correct definition, converting to use FIELD_GET() to extract
it without needing an open-coded shift right that is dependent on the
mask definition.
As this change has no effect on the generated code, there is no need
to treat this as a bug fix.
where rxfsts is tested against small integers 1 .. 3. This results in
the tests always failing, causing the "mtl_rx_fifo__fill_level_empty"
statistic counter to always be incremented no matter what the fill
level actually is.
Fix this by using FIELD_GET() and remove the unnecessary
MTL_DEBUG_RXFSTS_SHIFT definition as FIELD_GET() will shift according
to the least siginificant set bit in the supplied field mask.
Bobby Eshleman [Thu, 8 Jan 2026 01:29:38 +0000 (17:29 -0800)]
net: devmem: convert binding refcount to percpu_ref
Convert net_devmem_dmabuf_binding refcount from refcount_t to percpu_ref
to optimize common-case reference counting on the hot path.
The typical devmem workflow involves binding a dmabuf to a queue
(acquiring the initial reference on binding->ref), followed by
high-volume traffic where every skb fragment acquires a reference.
Eventually traffic stops and the unbind operation releases the initial
reference. Additionally, the high traffic hot path is often multi-core.
This access pattern is ideal for percpu_ref as the first and last
reference during bind/unbind normally book-ends activity in the hot
path.
__net_devmem_dmabuf_binding_free becomes the percpu_ref callback invoked
when the last reference is dropped.
kperf test:
- 4MB message sizes
- 60s of workload each run
- 5 runs
- 4 flows
Jakub Kicinski [Tue, 13 Jan 2026 01:26:45 +0000 (17:26 -0800)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2026-01-09 (ice, ixgbe, idpf)
For ice:
Grzegorz commonizes firmware loading process across all ice devices.
Michal adjusts default queue allocation to be based on
netif_get_num_default_rss_queues() rather than num_online_cpus().
For ixgbe:
Birger Koblitz adds support for 10G-BX modules.
For idpf:
Sreedevi converts always successful function to return void.
Andy Shevchenko fixes kdocs for missing 'Return:' in idpf_txrx.c file.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
idpf: Fix kernel-doc descriptions to avoid warnings
idpf: update idpf_up_complete() return type to void
ice: use netif_get_num_default_rss_queues()
ixgbe: Add 10G-BX support
ice: unify PHY FW loading status handler for E800 devices
====================
Jakub Kicinski [Tue, 13 Jan 2026 01:02:02 +0000 (17:02 -0800)]
Merge tag 'wireless-next-2026-01-12' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
First set of changes for the current -next cycle, of note:
- ath12k gets an overhaul to support multi-wiphy device
wiphy and pave the way for future device support in
the same driver (rather than splitting to ath13k)
- mac80211 gets some better iteration macros
* tag 'wireless-next-2026-01-12' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (120 commits)
wifi: mac80211: remove width argument from ieee80211_parse_bitrates
wifi: mac80211_hwsim: remove NAN by default
wifi: mac80211: improve station iteration ergonomics
wifi: mac80211: improve interface iteration ergonomics
wifi: cfg80211: include S1G_NO_PRIMARY flag when sending channel
wifi: mac80211: unexport ieee80211_get_bssid()
wl1251: Replace strncpy with strscpy in wl1251_acx_fw_version
wifi: iwlegacy: 3945-rs: remove redundant pointer check in il3945_rs_tx_status() and il3945_rs_get_rate()
wifi: mac80211: don't send an unused argument to ieee80211_check_combinations
wifi: libertas: fix WARNING in usb_tx_block
wifi: mwifiex: Allocate dev name earlier for interface workqueue name
wifi: wlcore: sdio: Use pm_ptr instead of #ifdef CONFIG_PM
wifi: cfg80211: Fix use_for flag update on BSS refresh
wifi: brcmfmac: rename function that frees vif
wifi: brcmfmac: fix/add kernel-doc comments
wifi: mac80211: Update csa_finalize to use link_id
wifi: cfg80211: add cfg80211_stop_link() for per-link teardown
wifi: ath12k: Skip DP peer creation for scan vdev
wifi: ath12k: move firmware stats request outside of atomic context
wifi: ath12k: add the missing RCU lock in ath12k_dp_tx_free_txbuf()
...
====================
====================
tools: ynl: cli: improve the help and doc
I had some time on the plane to LPC, so here are improvements
to the --help and --list-attrs handling of YNL CLI which seem
in order given growing use of YNL as a real CLI tool.
====================
Jakub Kicinski [Sat, 10 Jan 2026 23:31:42 +0000 (15:31 -0800)]
tools: ynl: cli: print reply in combined format if possible
As pointed out during review of the --list-attrs support the GET
ops very often return the same attrs from do and dump. Make the
output more readable by combining the reply information, from:
Do request attributes:
- ifindex: u32
netdev ifindex
Do reply attributes:
- ifindex: u32
netdev ifindex
[ .. other attrs .. ]
Do request attributes:
- ifindex: u32
netdev ifindex
Do and Dump reply attributes:
- ifindex: u32
netdev ifindex
[ .. other attrs .. ]
Tested-by: Gal Pressman <gal@nvidia.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20260110233142.3921386-8-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 10 Jan 2026 23:31:41 +0000 (15:31 -0800)]
tools: ynl: cli: extract the event/notify handling in --list-attrs
Event and notify handling is quite different from do / dump
handling. Forcing it into print_mode_attrs() doesn't really
buy us anything as events and notifications do not have requests.
Call print_attr_list() directly. Apart form subjective code
clarity this also removes the word "reply" from the output:
Before:
Event reply attributes:
Now:
Event attributes:
Tested-by: Gal Pressman <gal@nvidia.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20260110233142.3921386-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 10 Jan 2026 23:31:40 +0000 (15:31 -0800)]
tools: ynl: cli: factor out --list-attrs / --doc handling
We'll soon add more code to the --doc handling. Factor it out
to avoid making main() too long.
Tested-by: Gal Pressman <gal@nvidia.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20260110233142.3921386-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 10 Jan 2026 23:31:39 +0000 (15:31 -0800)]
tools: ynl: cli: add --doc as alias to --list-attrs
--list-attrs also provides information about the operation itself.
So --doc seems more appropriate. Add an alias.
Tested-by: Gal Pressman <gal@nvidia.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20260110233142.3921386-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 10 Jan 2026 23:31:38 +0000 (15:31 -0800)]
tools: ynl: cli: improve --help
Improve the clarity of --help. Reorder, provide some grouping and
add help messages to most of the options.
No functional changes intended.
Tested-by: Gal Pressman <gal@nvidia.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20260110233142.3921386-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 10 Jan 2026 23:31:37 +0000 (15:31 -0800)]
tools: ynl: cli: wrap the doc text if it's long
We already use textwrap when printing "doc" section about an attribute,
but only to indent the text. Switch to using fill() to split and indent
all the lines. While at it indent the text by 2 more spaces, so that it
doesn't align with the name of the attribute.
Before (I'm drawing a "box" at ~60 cols here, in an attempt for clarity):
| - irq-suspend-timeout: uint |
| The timeout, in nanoseconds, of how long to suspend irq|
|processing, if event polling finds events |
After:
| - irq-suspend-timeout: uint |
| The timeout, in nanoseconds, of how long to suspend |
| irq processing, if event polling finds events |
Tested-by: Gal Pressman <gal@nvidia.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20260110233142.3921386-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 10 Jan 2026 23:31:36 +0000 (15:31 -0800)]
tools: ynl: cli: introduce formatting for attr names in --list-attrs
It's a little hard to make sense of the output of --list-attrs,
it looks like a wall of text. Sprinkle a little bit of formatting -
make op and attr names bold, and Enum: / Flags: keywords italics.
Tested-by: Gal Pressman <gal@nvidia.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20260110233142.3921386-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>