Eric Dumazet [Fri, 21 Nov 2025 08:32:55 +0000 (08:32 +0000)]
net_sched: add qdisc_dequeue_drop() helper
Some qdisc like cake, codel, fq_codel might drop packets
in their dequeue() method.
This is currently problematic because dequeue() runs with
the qdisc spinlock held. Freeing skbs can be extremely expensive.
Add qdisc_dequeue_drop() method and a new TCQ_F_DEQUEUE_DROPS
so that these qdiscs can opt-in to defer the skb frees
after the socket spinlock is released.
TCQ_F_DEQUEUE_DROPS is an attempt to not penalize other qdiscs
with an extra cache line miss.
Eric Dumazet [Fri, 21 Nov 2025 08:32:49 +0000 (08:32 +0000)]
net_sched: add Qdisc_read_mostly and Qdisc_write groups
It is possible to reorg Qdisc to avoid always dirtying 2 cache lines in
fast path by reducing this to a single dirtied cache line.
In current layout, we change only four/six fields in the first cache line:
- q.spinlock
- q.qlen
- bstats.bytes
- bstats.packets
- some Qdisc also change q.next/q.prev
In the second cache line we change in the fast path:
- running
- state
- qstats.backlog
Add "aspeed,ast2700-mdio" compatible to the binding schema with a fallback
to "aspeed,ast2600-mdio".
Although the MDIO controller on AST2700 is functionally the same as the
one on AST2600, it's good practice to add a SoC-specific compatible for
new silicon. This allows future driver updates to handle any 2700-specific
integration issues without requiring devicetree changes or complex
runtime detection logic.
For now, the driver continues to bind via the existing
"aspeed,ast2600-mdio" compatible, so no driver changes are needed.
====================
mptcp: memcg accounting for passive sockets & backlog processing
This series is split in two: the 4 first patches are linked to memcg
accounting for passive sockets, and the rest introduce the backlog
processing. They are sent together, because the first one appeared to be
needed to get the second one fully working.
The second part includes RX path improvement built around backlog
processing. The main goals are improving the RX performances _and_
increase the long term maintainability.
- Patches 1-3: preparation work to ease the introduction of the next
patch.
- Patch 4: fix memcg accounting for passive sockets. Note that this is a
(non-urgent) fix, but it depends on material that is currently only in
net-next, e.g. commit 4a997d49d92a ("tcp: Save lock_sock() for memcg
in inet_csk_accept().").
- Patches 5-6: preparation of the stack for backlog processing, removing
assumptions that will not hold true any more after the backlog
introduction.
- Patches 7,8,10,11,12 are more cleanups that will make the backlog
patch a little less huge.
- Patch 9: somewhat an unrelated cleanup, included here not to forget
about it.
- Patches 13-14: The real work is done by them. Patch 13 introduces the
helpers needed to manipulate the msk-level backlog, and the data
struct itself, without any actual functional change. Patch 14 finally
uses the backlog for RX skb processing. Note that MPTCP can't use the
sk_backlog, as the MPTCP release callback can also release and
re-acquire the msk-level spinlock and core backlog processing works
under the assumption that such event is not possible.
A relevant point is memory accounts for skbs in the backlog. It's
somewhat "original" due to MPTCP constraints. Such skbs use space from
the incoming subflow receive buffer, do not use explicitly any forward
allocated memory, as we can't update the msk fwd mem while enqueuing,
nor we want to acquire again the ssk socket lock while processing the
skbs. Instead the msk borrows memory from the subflow and reserve it
for the backlog, see patch 5 and 14 for the gory details.
====================
Paolo Abeni [Fri, 21 Nov 2025 17:02:13 +0000 (18:02 +0100)]
mptcp: leverage the backlog for RX packet processing
When the msk socket is owned or the msk receive buffer is full,
move the incoming skbs in a msk level backlog list. This avoid
traversing the joined subflows and acquiring the subflow level
socket lock at reception time, improving the RX performances.
When processing the backlog, use the fwd alloc memory borrowed from
the incoming subflow. skbs exceeding the msk receive space are
not dropped; instead they are kept into the backlog until the receive
buffer is freed. Dropping packets already acked at the TCP level is
explicitly discouraged by the RFC and would corrupt the data stream
for fallback sockets.
Special care is needed to avoid adding skbs to the backlog of a closed
msk and to avoid leaving dangling references into the backlog
at subflow closing time.
Paolo Abeni [Fri, 21 Nov 2025 17:02:12 +0000 (18:02 +0100)]
mptcp: introduce mptcp-level backlog
We are soon using it for incoming data processing.
MPTCP can't leverage the sk_backlog, as the latter is processed
before the release callback, and such callback for MPTCP releases
and re-acquire the socket spinlock, breaking the sk_backlog processing
assumption.
Add a skb backlog list inside the mptcp sock struct, and implement
basic helper to transfer packet to and purge such list.
Packets in the backlog are memory accounted and still use the incoming
subflow receive memory, to allow back-pressure. The backlog size is
implicitly bounded to the sum of subflows rcvbuf.
When a subflow is closed, references from the backlog to such sock
are removed.
No packet is currently added to the backlog, so no functional changes
intended here.
Paolo Abeni [Fri, 21 Nov 2025 17:02:11 +0000 (18:02 +0100)]
mptcp: borrow forward memory from subflow
In the MPTCP receive path, we release the subflow allocated fwd
memory just to allocate it again shortly after for the msk.
That could increases the failures chances, especially when we will
add backlog processing, with other actions could consume the just
released memory before the msk socket has a chance to do the
rcv allocation.
Replace the skb_orphan() call with an open-coded variant that
explicitly borrows, the fwd memory from the subflow socket instead
of releasing it.
The borrowed memory does not have PAGE_SIZE granularity; rounding to
the page size will make the fwd allocated memory higher than what is
strictly required and could make the incoming subflow fwd mem
consistently negative. Instead, keep track of the accumulated frag and
borrow the full page at subflow close time.
This allow removing the last drop in the TCP to MPTCP transition and
the associated, now unused, MIB.
Paolo Abeni [Fri, 21 Nov 2025 17:02:10 +0000 (18:02 +0100)]
mptcp: handle first subflow closing consistently
Currently, as soon as the PM closes a subflow, the msk stops accepting
data from it, even if the TCP socket could be still formally open in the
incoming direction, with the notable exception of the first subflow.
The root cause of such behavior is that code currently piggy back two
separate semantic on the subflow->disposable bit: the subflow context
must be released and that the subflow must stop accepting incoming
data.
The first subflow is never disposed, so it also never stop accepting
incoming data. Use a separate bit to mark the latter status and set such
bit in __mptcp_close_ssk() for all subflows.
Beyond making per subflow behaviour more consistent this will also
simplify the next patch.
Paolo Abeni [Fri, 21 Nov 2025 17:02:07 +0000 (18:02 +0100)]
mptcp: do not miss early first subflow close event notification
The MPTCP protocol is not currently emitting the NL event when the first
subflow is closed before msk accept() time.
By replacing the in use close helper is such scenario, implicitly introduce
the missing notification. Note that in such scenario we want to be sure
that mptcp_close_ssk() will not trigger any PM work, move the msk state
change update earlier, so that the previous patch will offer such
guarantee.
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-8-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Fri, 21 Nov 2025 17:02:06 +0000 (18:02 +0100)]
mptcp: ensure the kernel PM does not take action too late
The PM hooks can currently take place when the msk is already shutting
down. Subflow creation will fail, thanks to the existing check at join
time, but we can entirely avoid starting the to be failed operations.
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-7-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Fri, 21 Nov 2025 17:02:05 +0000 (18:02 +0100)]
mptcp: cleanup fallback dummy mapping generation
MPTCP currently access ack_seq outside the msk socket log scope to
generate the dummy mapping for fallback socket. Soon we are going
to introduce backlog usage and even for fallback socket the ack_seq
value will be significantly off outside of the msk socket lock scope.
Avoid relying on ack_seq for dummy mapping generation, using instead
the subflow sequence number. Note that in case of disconnect() and
(re)connect() we must ensure that any previous state is re-set.
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-6-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Fri, 21 Nov 2025 17:02:04 +0000 (18:02 +0100)]
mptcp: cleanup fallback data fin reception
MPTCP currently generate a dummy data_fin for fallback socket
when the fallback subflow has completed data reception using
the current ack_seq.
We are going to introduce backlog usage for the msk soon, even
for fallback sockets: the ack_seq value will not match the most recent
sequence number seen by the fallback subflow socket, as it will ignore
data_seq sitting in the backlog.
Instead use the last map sequence number to set the data_fin,
as fallback (dummy) map sequences are always in sequence.
Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-5-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Fri, 21 Nov 2025 17:02:03 +0000 (18:02 +0100)]
mptcp: fix memcg accounting for passive sockets
The passive sockets never got proper memcg accounting: the msk
socket is associated with the memcg at accept time, but the
passive subflows never got it right.
At accept time, traverse the subflows list and associate each of them
with the msk memcg, and try to do the same at join completion time, if
the msk has been already accepted.
Paolo Abeni [Fri, 21 Nov 2025 17:02:00 +0000 (18:02 +0100)]
net: factor-out _sk_charge() helper
Move out of __inet_accept() the code dealing charging newly
accepted socket to memcg. MPTCP will soon use it to on a per
subflow basis, in different contexts.
ipvlan_core.c:56: warning: incorrect type in argument 1
(different base types) expected unsigned int [usertype] a
got restricted __be32 const [usertype] s_addr
Breno Leitao [Fri, 21 Nov 2025 17:02:36 +0000 (09:02 -0800)]
net: mvpp2: extract GRXRINGS from .get_rxnfc
Commit 84eaf4359c36 ("net: ethtool: add get_rx_ring_count callback to
optimize RX ring queries") added specific support for GRXRINGS callback,
simplifying .get_rxnfc.
Remove the handling of GRXRINGS in .get_rxnfc() by moving it to the new
.get_rx_ring_count() for the mvpp2 driver.
This simplifies the RX ring count retrieval and aligns mvpp2 with the new
ethtool API for querying RX ring parameters, while keeping the other
rxnfc handlers (GRXCLSRLCNT, GRXCLSRULE, GRXCLSRLALL) intact.
Breno Leitao [Fri, 21 Nov 2025 17:02:35 +0000 (09:02 -0800)]
net: mvneta: convert to use .get_rx_ring_count
Convert the mvneta driver to use the new .get_rx_ring_count ethtool
operation instead of implementing .get_rxnfc solely for handling
ETHTOOL_GRXRINGS command. This simplifies the code by removing the
switch statement and replacing it with a direct return of the queue
count.
The new callback provides the same functionality in a more direct way,
following the ongoing ethtool API modernization.
Breno Leitao [Fri, 21 Nov 2025 09:59:23 +0000 (01:59 -0800)]
net: hyperv: convert to use .get_rx_ring_count
Convert the hyperv netvsc driver to use the new .get_rx_ring_count
ethtool operation instead of implementing .get_rxnfc solely for handling
ETHTOOL_GRXRINGS command. This simplifies the code by replacing the
switch statement with a direct return of the queue count.
The new callback provides the same functionality in a more direct way,
following the ongoing ethtool API modernization.
Eric Dumazet [Fri, 21 Nov 2025 06:17:25 +0000 (06:17 +0000)]
net: optimize eth_type_trans() vs CONFIG_STACKPROTECTOR_STRONG=y
Some platforms exhibit very high costs with CONFIG_STACKPROTECTOR_STRONG=y
when a function needs to pass the address of a local variable to external
functions.
eth_type_trans() (and its callers) is showing this anomaly on AMD EPYC 7B12
platforms (and maybe others).
We could :
1) inline eth_type_trans()
This would help if its callers also has the same issue, and the canary cost
would be paid by the callers already.
This is a bit cumbersome because netdev_uses_dsa() is pulling
whole <net/dsa.h> definitions.
2) Compile net/ethernet/eth.c with -fno-stack-protector
This would weaken security.
3) Hack eth_type_trans() to temporarily use skb->dev as a place holder
if skb_header_pointer() needs to pull 2 bytes not present in skb->head.
This patch implements 3), and brings a 5% improvement on TX/RX intensive
workload (tcp_rr 10,000 flows) on AMD EPYC 7B12.
Removing CONFIG_STACKPROTECTOR_STRONG on this platform can improve
performance by 25 %.
This means eth_type_trans() issue is not an isolated artifact.
Jakub Kicinski [Sun, 23 Nov 2025 02:16:01 +0000 (18:16 -0800)]
selftests: af_unix: don't use SKIP for expected failures
netdev CI reserves SKIP in selftests for cases which can't be executed
due to setup issues, like missing or old commands. Tests which are
expected to fail must use XFAIL.
Andre Carvalho [Fri, 21 Nov 2025 15:00:22 +0000 (15:00 +0000)]
selftests: netconsole: ensure required log level is set on netcons_basic
This commit ensures that the required log level is set at the start of
the test iteration.
Part of the cleanup performed at the end of each test iteration resets
the log level (do_cleanup in lib_netcons.sh) to the values defined at the
time test script started. This may cause further test iterations to fail
if the default values are not sufficient.
====================
selftests: hw-net: toeplitz: read config from the NIC directly
First patch here tries to auto-disable building the iouring sample.
Our CI will still run the iouring test(s), of course, but it looks
like the liburing updates aren't very quick in distroes and having
to hack around it when developing unrelated tests is a bit annoying.
Remaining 4 patches iron out running the Toeplitz hash test against
real NICs. I tested mlx5, bnxt and fbnic, they all pass now.
I switched to using YNL directly in the C code, can't see a reason
to get the info in Python and pass it to C via argv. The old code
likely did this because it predates YNL.
====================
Jakub Kicinski [Fri, 21 Nov 2025 04:02:59 +0000 (20:02 -0800)]
selftests: hw-net: toeplitz: give the test up to 4 seconds
Increase the receiver timeout. When running between machines
in different geographic regions the test needs more than
a second to SSH across and send the frames.
The bkg() command that runs the receiver defaults to 5 sec timeout,
so using 4 sec sounds like a reasonable value for the receiver itself.
Jakub Kicinski [Fri, 21 Nov 2025 04:02:57 +0000 (20:02 -0800)]
selftests: hw-net: toeplitz: read the RSS key directly from C
Now that we have YNL support for RSS accessing the RSS info from
C is very easy. Instead of passing the RSS key from Python do it
directly in the C code.
Jakub Kicinski [Fri, 21 Nov 2025 04:02:55 +0000 (20:02 -0800)]
selftests: hw-net: auto-disable building the iouring C code
Looks like the liburing is not updated by distros very aggressively.
Presumably because a lot of packages depend on it. I just updated
to Fedora 43 and it's still on liburing 2.9. The test is 9mo old,
at this stage I think this warrants handling the build failure
more gracefully.
Detect if iouring is recent enough and if not print a warning
and exclude the C prog from build. The Python test will just
fail since the binary won't exist. But it removes the major
annoyance of having to update liburing from sources when
developing other tests.
Dan Carpenter [Fri, 21 Nov 2025 13:35:10 +0000 (16:35 +0300)]
i40e: delete a stray tab
This return statement is indented one tab too far. Delete a tab.
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Link: https://patch.msgid.link/aSBqjtA8oF25G1OG@stanley.mountain Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This series cleans up the "rgmii" accessors in qcom-ethqos.
readl() and writel() return and take a u32 for the value. Rather than
implicitly casting this to an int, keep it as a u32.
Add set/clear functions to reduce the code and make it easier to read.
Finally, convert the open-coded poll loops to use the iopoll helpers.
Note that patch 1 has a checkpatch warning concerning "volatile" -
I'm changing the type here, and the "volatile" is removed in patch 3.
I do not feel it is appropriate to remove it in patch 1.
====================
The driver has a lot of bit manipulation of the RGMII registers. Add
a pair of helpers to set bits and clear bits, converting the various
calls to rgmii_updatel() as appropriate.
net: stmmac: qcom-ethqos: use u32 for rgmii read/write/update
readl() returns a u32, and writel() takes a "u32" for the value. These
are used in rgmii_readl()() and rgmii_writel(), but the value and
return are "int". As these are 32-bit register values which are not
signed, use "u32".
These changes do not cause generated code changes.
Update rgmii_updatel() to use u32 for mask and val. Changing "mask"
to "u32" also does not cause generated code changes. However, changing
"val" causes the generated assembly to be re-ordered for aarch64.
Update the temporary variables used with the rgmii functions to use
u32.
====================
devlink: net/mlx5: implement swp_l4_csum_mode via devlink params
This series introduces a new devlink feature for querying param
default values, and resetting params to their default values. This
feature is then used to implement a new mlx5 driver param.
The series starts with two pure refactor patches: one that passes
through the extack to devlink_param::get() implementations. And a
second small refactor that prepares the netlink tlv handling code in
the devlink_param::get() path to better handle default parameter
values.
The third patch introduces the uapi and driver api for default
parameter values. The driver api is opt-in, and both the uapi and
driver api preserve existing behavior when not used by drivers or
userspace.
The fourth patch introduces a new mlx5 driver param, swp_l4_csum_mode,
for controlling tx csum behavior. The "l4_only" value of this param is
a dependency for PSP initialization on CX7 NICs.
Lastly, the series introduces a new driver param with cmode runtime to
netdevsim, and then uses this param in a new testcase for netdevsim
devlink params.
Here are some examples of using the default param uapi with the devlink
cli. Note the devlink cli binary I am using has changes which I am
posting in accompanying series targeting iproute2-next:
# netdevsim
./devlink dev param show netdevsim/netdevsim0
netdevsim/netdevsim0:
name max_macs type generic
values:
cmode driverinit value 32 default 32
name test1 type driver-specific
values:
cmode driverinit value true default true
# set to false
./devlink dev param set netdevsim/netdevsim0 name test1 value false cmode driverinit
./devlink dev param show netdevsim/netdevsim0
netdevsim/netdevsim0:
name max_macs type generic
values:
cmode driverinit value 32 default 32
name test1 type driver-specific
values:
cmode driverinit value false default true
# set back to default
./devlink dev param set netdevsim/netdevsim0 name test1 default cmode driverinit
./devlink dev param show netdevsim/netdevsim0
netdevsim/netdevsim0:
name max_macs type generic
values:
cmode driverinit value 32 default 32
name test1 type driver-specific
values:
cmode driverinit value true default true
# mlx5 params on cx7
./devlink dev param show pci/0000:01:00.0
pci/0000:01:00.0:
name max_macs type generic
values:
cmode driverinit value 128 default 128
...
name swp_l4_csum_mode type driver-specific
values:
cmode permanent value default default default
# set to l4_only
./devlink dev param set pci/0000:01:00.0 name swp_l4_csum_mode value l4_only cmode permanent
./devlink dev param show pci/0000:01:00.0 name swp_l4_csum_mode
pci/0000:01:00.0:
name swp_l4_csum_mode type driver-specific
values:
cmode permanent value l4_only default default
# reset to default
./devlink dev param set pci/0000:01:00.0 name swp_l4_csum_mode default cmode permanent
./devlink dev param show pci/0000:01:00.0 name swp_l4_csum_mode
pci/0000:01:00.0:
name swp_l4_csum_mode type driver-specific
values:
cmode permanent value default default default
====================
Daniel Zahka [Wed, 19 Nov 2025 02:50:36 +0000 (18:50 -0800)]
selftest: netdevsim: test devlink default params
Test querying default values and resetting to default values for
netdevsim devlink params.
This should cover the basic paths of interest: driverinit and
non-driverinit cmodes, as well as bool and non-bool value
type. Default param values of type bool are encoded with u8 netlink
type as opposed to flag type, so that userspace can distinguish
"not-present" from false.
Daniel Zahka [Wed, 19 Nov 2025 02:50:35 +0000 (18:50 -0800)]
netdevsim: register a new devlink param with default value interface
Create a new devlink param, test2, that supports default param actions
via the devlink_param::get_default() and
devlink_param::reset_default() functions.
Daniel Zahka [Wed, 19 Nov 2025 02:50:34 +0000 (18:50 -0800)]
net/mlx5: implement swp_l4_csum_mode via devlink params
swp_l4_csum_mode controls how L4 transmit checksums are computed when
using Software Parser (SWP) hints for header locations.
Supported values:
1. default: device will choose between full_csum or l4_only. Driver
will discover the device's choice during initialization.
2. full_csum: calculate L4 checksum with the pseudo-header.
3. l4_only: calculate L4 checksum without the pseudo-header. Only
available when swp_l4_csum_mode_l4_only is set in
mlx5_ifc_nv_sw_offload_cap_bits.
Note that 'default' might be returned from the device and passed to
userspace, and it might also be set during a
devlink_param::reset_default() call, but attempts to set a value of
default directly with param-set will be rejected.
The l4_only setting is a dependency for PSP initialization in
mlx5e_psp_init().
Daniel Zahka [Wed, 19 Nov 2025 02:50:33 +0000 (18:50 -0800)]
devlink: support default values for param-get and param-set
Support querying and resetting to default param values.
Introduce two new devlink netlink attrs:
DEVLINK_ATTR_PARAM_VALUE_DEFAULT and
DEVLINK_ATTR_PARAM_RESET_DEFAULT. The former is used to contain an
optional parameter value inside of the param_value nested
attribute. The latter is used in param-set requests from userspace to
indicate that the driver should reset the param to its default value.
To implement this, two new functions are added to the devlink driver
api: devlink_param::get_default() and
devlink_param::reset_default(). These callbacks allow drivers to
implement default param actions for runtime and permanent cmodes. For
driverinit params, the core latches the last value set by a driver via
devl_param_driverinit_value_set(), and uses that as the default value
for a param.
Because default parameter values are optional, it would be impossible
to discern whether or not a param of type bool has default value of
false or not provided if the default value is encoded using a netlink
flag type. For this reason, when a DEVLINK_PARAM_TYPE_BOOL has an
associated default value, the default value is encoded using a u8
type.
Lift the param type demux and value attr placement into a separate
function. This new function, devlink_nl_param_put(), can be used to
place additional types values in the value array, e.g., default,
current, next values. This commit has no functional change.
Daniel Zahka [Wed, 19 Nov 2025 02:50:31 +0000 (18:50 -0800)]
devlink: pass extack through to devlink_param::get()
Allow devlink_param::get() handlers to report error messages via
extack. This function is called in a few different contexts, but not
all of them will have an valid extack to use.
When devlink_param::get() is called from param_get_doit or
param_get_dumpit contexts, pass the extack through so that drivers can
report errors when retrieving param values. devlink_param::get() is
called from the context of devlink_param_notify(), pass NULL in for
the extack.
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Link: https://patch.msgid.link/20251119025038.651131-2-daniel.zahka@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
netconsole: Allow userdata buffer to grow dynamically
The current netconsole implementation allocates a static buffer for
extradata (userdata + sysdata) with a fixed size of
MAX_EXTRADATA_ENTRY_LEN * MAX_EXTRADATA_ITEMS bytes for every target,
regardless of whether userspace actually uses this feature. This forces
us to keep MAX_EXTRADATA_ITEMS small (16), which is restrictive for
users who need to attach more metadata to their log messages.
This patch series enables dynamic allocation of the userdata buffer,
allowing it to grow on-demand based on actual usage. The series:
1. Refactors send_fragmented_body() to simplify handling of separated
userdata and sysdata (patch 1/4)
2. Splits userdata and sysdata into separate buffers (patch 2/4)
3. Implements dynamic allocation for the userdata buffer (patch 3/4)
4. Increases MAX_USERDATA_ITEMS from 16 to 256 now that we can do so
without memory waste (patch 4/4)
Benefits:
- No memory waste when userdata is not used
- Targets that use userdata only consume what they need
- Users can attach significantly more metadata without impacting systems
that don't use this feature
====================
Increase MAX_USERDATA_ITEMS from 16 to 256 entries now that the userdata
buffer is allocated dynamically.
The previous limit of 16 was necessary because the buffer was statically
allocated for all targets. With dynamic allocation, we can support more
entries without wasting memory on targets that don't use userdata.
This allows users to attach more metadata to their netconsole messages,
which is useful for complex debugging and logging scenarios.
The userdata buffer in struct netconsole_target is currently statically
allocated with a size of MAX_USERDATA_ITEMS * MAX_EXTRADATA_ENTRY_LEN
(16 * 256 = 4096 bytes). This wastes memory when userdata entries are
not used or when only a few entries are configured, which is common in
typical usage scenarios. It also forces us to keep MAX_USERDATA_ITEMS
small to limit the memory wasted.
Change the userdata buffer from a static array to a dynamically
allocated pointer. The buffer is now allocated on-demand in
update_userdata() whenever userdata entries are added, modified, or
removed via configfs. The implementation calculates the exact size
needed for all current userdata entries, allocates a new buffer of that
size, formats the entries into it, and atomically swaps it with the old
buffer.
This approach provides several benefits:
- Memory efficiency: Targets with no userdata use zero bytes instead of
4KB, and targets with userdata only allocate what they need;
- Scalability: Makes it practical to increase MAX_USERDATA_ITEMS to a
much larger value without imposing a fixed memory cost on every
target;
- No hot-path overhead: Allocation occurs during configuration (write to
configfs), not during message transmission
If memory allocation fails during userdata update, -ENOMEM is returned
to userspace through the configfs attribute write operation.
The sysdata buffer remains statically allocated since it has a smaller
fixed size (MAX_SYSDATA_ITEMS * MAX_EXTRADATA_ENTRY_LEN = 4 * 256 = 1024
bytes) and its content length is less predictable.
Separate userdata and sysdata into distinct buffers to enable independent
management. Previously, both were stored in a single extradata_complete
buffer with a fixed size that accommodated both types of data.
This separation allows:
- userdata to grow dynamically (in subsequent patch)
- sysdata to remain in a small static buffer
- removal of complex entry counting logic that tracked both types together
The split also simplifies the code by eliminating the need to check total
entry count across both userdata and sysdata when enabling features,
which allows to drop holding su_mutex on sysdata_*_enabled_store().
No functional change in this patch, just structural preparation for
dynamic userdata allocation.
Refactor send_fragmented_body() to use separate offset tracking for
msgbody, and extradata instead of complex conditional logic.
The previous implementation used boolean flags and calculated offsets
which made the code harder to follow.
The new implementation maintains independent offset counters
(msgbody_offset, extradata_offset) and processes each section
sequentially, making the data flow more straightforward and the code
easier to maintain.
This is a preparatory refactoring with no functional changes, which will
allow easily splitting extradata_complete into separate userdata and
sysdata buffers in the next patch.
Wei Fang [Wed, 19 Nov 2025 02:51:48 +0000 (10:51 +0800)]
net: fec: remove duplicate macros of the BD status
There are two sets of macros used to define the status bits of TX and RX
BDs, one is the BD_SC_xx macros, the other one is the BD_ENET_xx macros.
For the BD_SC_xx macros, only BD_SC_WRAP is used in the driver. But the
BD_ENET_xx macros are more widely used in the driver, and they define
more bits of the BD status. Therefore, remove the BD_SC_xx macros from
now on.
Wei Fang [Wed, 19 Nov 2025 02:51:47 +0000 (10:51 +0800)]
net: fec: remove rx_align from fec_enet_private
The rx_align was introduced by the commit 41ef84ce4c72 ("net: fec: change
FEC alignment according to i.mx6 sx requirement"). Because the i.MX6 SX
requires RX buffer must be 64 bytes alignment.
Since the commit 95698ff6177b ("net: fec: using page pool to manage RX
buffers"), the address of the RX buffer is always the page address plus
FEC_ENET_XDP_HEADROOM which is 256 bytes, so the RX buffer is always
64-byte aligned. Therefore, rx_align has no effect since that commit,
and we can safely remove it.
In addition, to prevent future modifications to FEC_ENET_XDP_HEADROOM,
a BUILD_BUG_ON() test has been added to the driver, which ensures that
FEC_ENET_XDP_HEADROOM provides the required alignment.
Wei Fang [Wed, 19 Nov 2025 02:51:46 +0000 (10:51 +0800)]
net: fec: remove struct fec_enet_priv_txrx_info
The struct fec_enet_priv_txrx_info has three members: offset, page and
skb. The offset is only initialized in the driver and is not used, the
skb is never initialized and used in the driver. The both will not be
used in the future. Therefore, replace struct fec_enet_priv_txrx_info
directly with struct page.
Wei Fang [Wed, 19 Nov 2025 02:51:45 +0000 (10:51 +0800)]
net: fec: simplify the conditional preprocessor directives
From the Kconfig file, we can see CONFIG_FEC depends on the following
platform-related options.
ColdFire: M523x, M527x, M5272, M528x, M520x and M532x
S32: ARCH_S32 (ARM64)
i.MX: SOC_IMX28 and ARCH_MXC (ARM and ARM64)
Based on the code of fec driver, only some macro definitions on the
M5272 platform are different from those on other platforms. Therefore,
we can simplify the following complex preprocessor directives to
"if !defined(CONFIG_M5272)".
The conditional preprocessor directive was added to fix build errors on
the MCF5272 platform, see commit d13919301d9a ("net: fec: Fix build for
MCF5272"). The compilation errors were originally caused by some register
macros not being defined on that platform.
The driver now uses quirks to dynamically handle platform differences,
and for MCF5272, its quirks is 0, so it does not support RACC and GBIT
Ethernet. So these preprocessor directives are no longer required and
can be safely removed without causing build or functional issue.
Yael Chemla [Wed, 19 Nov 2025 20:48:15 +0000 (22:48 +0200)]
net: ethtool: Add support for 1600Gbps speed
Add support for 1600Gbps link modes based on 200Gbps per lane [1].
This includes the adopted IEEE 802.3dj copper and optical PMDs that use
200G/lane signaling [2].
Add the following PMD types:
- KR8 (backplane)
- CR8 (copper cable)
- DR8 (SMF 500m)
- DR8-2 (SMF 2km)
These modes are defined in the 802.3dj specifications.
References:
[1] https://www.ieee802.org/3/dj/public/23_03/opsasnick_3dj_01a_2303.pdf
[2] https://www.ieee802.org/3/dj/projdoc/objectives_P802d3dj_240314.pdf
Signed-off-by: Yael Chemla <ychemla@nvidia.com> Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/1763585297-1243980-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This verifies correct handling of tc action attributes for multiple
VLAN push actions. The tc action indexed arrays start from index 1,
and the index defines the action order. This behavior differs from
the YNL specification, which expects arrays to be zero-based. To
accommodate this, the example adds a dummy action at index 0, which
is ignored by the kernel.
====================
selftests: drv-net: convert GRO and Toeplitz tests to work for drivers in NIPA
Main objective of this series is to convert the gro.sh and toeplitz.sh
tests to be "NIPA-compatible" - meaning make use of the Python env,
which lets us run the tests against either netdevsim or a real device.
The tests seem to have been written with a different flow in mind.
Namely they source different bash "setup" scripts depending on arguments
passed to the test. While I have nothing against the use of bash and
the overall architecture - the existing code needs quite a bit of work
(don't assume MAC/IP addresses, support remote endpoint over SSH).
If I'm the one fixing it, I'd rather convert them to our "simplistic"
Python.
This series rewrites the tests in Python while addressing their
shortcomings. The functionality of running the test over loopback
on a real device is retained but with a different method of invocation
(see the last patch).
Once again we are dealing with a script which run over a variety of
protocols (combination of [ipv4, ipv6, ipip] x [tcp, udp]). The first
4 patches add support for test variants to our scripts. We use the
term "variant" in the same sense as the C kselftest_harness.h -
variant is just a set of static input arguments.
Note that neither GRO nor the Toeplitz test fully passes for me on
any HW I have access to. But this is unrelated to the conversion.
This series is not making any real functional changes to the tests,
it is limited to improving the "test harness" scripts.
====================
Jakub Kicinski [Thu, 20 Nov 2025 02:10:24 +0000 (18:10 -0800)]
selftests: net: remove old setup_* scripts
gro.sh and toeplitz.sh used to source in one of two setup scripts
depending on whether the test was expected to be run against
veth or a real device. veth testing is replaced by netdevsim
and existing "remote endpoint" support in our Python tests.
Add a script which sets up loopback mode.
The usage is a little bit more complicated than running
the scripts used to be. Testing used to work like this:
Jakub Kicinski [Thu, 20 Nov 2025 02:10:23 +0000 (18:10 -0800)]
netdevsim: add loopback support
Support device loopback. Apparently this mode has been historically
supported by the toeplitz test and I don't have any HW which lets
me test the conversion..
Jakub Kicinski [Thu, 20 Nov 2025 02:10:22 +0000 (18:10 -0800)]
selftests: drv-net: hw: convert the Toeplitz test to Python
Rewrite the existing toeplitz.sh test in Python. The conversion
is a lot less exact than the GRO one. We use Netlink APIs to
get the device RSS and IRQ information. We expect that the device
has neither RPS nor RFS configured, and set RPS up as part of
the test.
Jakub Kicinski [Thu, 20 Nov 2025 02:10:21 +0000 (18:10 -0800)]
selftests: drv-net: add a Python version of the GRO test
Rewrite the existing gro.sh test in Python. The conversion
not exact, the changes are related to integrating the test
with our "remote endpoint" paradigm. The test now reads
the IP addresses from the user config. It resolves the MAC
address (including running over Layer 3 networks).
Jakub Kicinski [Thu, 20 Nov 2025 02:10:20 +0000 (18:10 -0800)]
netdevsim: pass packets thru GRO on Rx
To replace veth in software GRO testing with netdevsim we need
GRO support in netdevsim. Luckily we already have NAPI support
so this change is trivial (compared to veth).
Jakub Kicinski [Thu, 20 Nov 2025 02:10:19 +0000 (18:10 -0800)]
selftests: net: py: read ip link info about remote dev
We're already saving the info about the local dev in env.dev
for the tests, save remote dev as well. This is more symmetric,
env generally provides the same info for local and remote end.
While at it make sure that we reliably get the detailed info
about the local dev. nsim used to read the dev info without -d.
Jakub Kicinski [Thu, 20 Nov 2025 02:10:18 +0000 (18:10 -0800)]
selftests: net: py: support ksft ready without wait
There's a common synchronization problem when a script (Python test)
uses a C program to set up some state (usually start a receiving
process for traffic). The script needs to know when the process
has fully initialized. The inverse of the problem exists for shutting
the process down - we need a reliable way to tell the process to exit.
We added helpers to do this safely in
commit 71477137994f ("selftests: drv-net: add a way to wait for a local process")
unfortunately the two operations (wait for init, and shutdown) are
controlled by a single parameter (ksft_wait). Add support for using
ksft_ready without using the second fd for exit.
This is useful for programs which wait for a specific number of packets
to rx so exit_wait is a good match, but we still need to wait for init.
Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20251120021024.2944527-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 20 Nov 2025 02:10:17 +0000 (18:10 -0800)]
selftests: net: relocate gro and toeplitz tests to drivers/net
The GRO test can run on a real device or a veth.
The Toeplitz hash test can only run on a real device.
Move them from net/ to drivers/net/ and drivers/net/hw/ respectively.
There are two scripts which set up the environment for these tests
setup_loopback.sh and setup_veth.sh. Move those scripts to net/lib.
The paths to the setup files are a little ugly but they will be
deleted shortly.
toeplitz_client.sh is not a test in itself, but rather a helper
to send traffic, so add it to TEST_FILES rather than TEST_PROGS.
Jakub Kicinski [Thu, 20 Nov 2025 02:10:15 +0000 (18:10 -0800)]
selftests: net: py: add test variants
There's a lot of cases where we try to re-run the same code with
different parameters. We currently need to either use a generator
method or create a "main" case implementation which then gets called
by trivial case functions:
def _test(x, y, z):
...
def case_int():
_test(1, 2, 3)
def case_str():
_test('a', 'b', 'c')
Add support for variants, similar to kselftests_harness.h and
a lot of other frameworks. Variants can be added as decorator
to test functions:
ksft_run() will auto-generate case names:
case.1_2_3
case.a_b_c
Because the names may not always be pretty (and to avoid forcing
classes to implement case-friendly __str__()) add a wrapper class
KsftNamedVariant which lets the user specify the name for the variant.
Note that ksft_run's args are still supported. ksft_run splices args
and variant params together.
Jakub Kicinski [Thu, 20 Nov 2025 02:10:13 +0000 (18:10 -0800)]
selftests: net: py: coding style improvements
We're about to add more features here and finding new issues with old
ones in place is hard. Address ruff checks:
- bare exceptions
- f-string with no params
- unused import
We need to use BaseException when handling defer(), as Petr points out.
This retains the old behavior of ignoring SIGTERM while running cleanups.
Heiner Kallweit [Wed, 19 Nov 2025 07:05:45 +0000 (08:05 +0100)]
net: phy: fixed_phy: fix missing initialization of fixed phy link
Original change remove the link initialization from the passed struct
fixed_phy_status, but @status is also passed to __fixed_phy_add(),
where it is saved. Make sure that copy also has link set to 1.
while building a new device around the ADIN1100 I noticed some errors in
kernel log when calling `ifdown` on the ethernet device. Series has a
straight forward fix and an obvious follow-up code simplification.
====================
Alexander Dahl [Wed, 19 Nov 2025 12:47:37 +0000 (13:47 +0100)]
net: phy: adin1100: Simplify register value passing
The additional use case for that variable is gone,
the expression is simple enough to pass it inline now.
Signed-off-by: Alexander Dahl <ada@thorsis.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Acked-by: Nuno Sá <nuno.sa@analog.com> Link: https://patch.msgid.link/20251119124737.280939-3-ada@thorsis.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Value CRSM_SFT_PD written to Software Power-Down Control Register
(CRSM_SFT_PD_CNTRL) is 0x01 and therefor different to value
CRSM_SFT_PD_RDY (0x02) read from System Status Register (CRSM_STAT) for
confirmation powerdown has been reached.
The condition could have only worked when disabling powerdown
(both 0x00), but never when enabling it (0x01 != 0x02).
Result is a timeout, like so:
$ ifdown eth0
macb f802c000.ethernet eth0: Link is Down
ADIN1100 f802c000.ethernet-ffffffff:01: adin_set_powerdown_mode failed: -110
ADIN1100 f802c000.ethernet-ffffffff:01: adin_set_powerdown_mode failed: -110
Fixes: 7eaf9132996a ("net: phy: adin1100: Add initial support for ADIN1100 industrial PHY") Signed-off-by: Alexander Dahl <ada@thorsis.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Acked-by: Nuno Sá <nuno.sa@analog.com> Link: https://patch.msgid.link/20251119124737.280939-2-ada@thorsis.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
stmmac's axi_blen (burst length) handling is very verbose and
unnecessary.
Firstly, the burst length register bitfield is the same across all
dwmac cores, so we can use common definitions for these bits which
platform glue can use.
We end up with platform glue:
- filling in the axi_blen[] array with the decimal burst lengths, e.g.
dwmac-intel.c, etc
- decoding a bitmap into burst lengths for this array, e.g.
dwmac-dwc-qos-eth.c
Other cases read the array from DT, placing it into the axi_blen
array, and converting later to the register bitfield.
This series removes all this complexity, ultimately ending up with
platform glue providing the register value containing the burst
length bitfield directly. Where necessary, platform glue calls
stmmac_axi_blen_to_mask() to convert a decimal array (e.g. from
DT) to the register value.
This also means that stmmac_axi_blen_to_mask() can issue a
diagnostic message at probe time if the burst length is incorrect.
====================
Remove the axi_blen array from struct stmmac_axi as we set this array,
and then immediately convert it ot the register value, never looking at
the array again. Thus, the array can be function local rather than part
of a run-time allocated long-lived struct.
net: stmmac: move stmmac_axi_blen_to_mask() to axi_blen init sites
Move stmmac_axi_blen_to_mask() to the axi->axi_blen array init sites
to prepare for the removal of axi_blen. For sites which initialise
axi->axi_blen with constant data, initialise axi->axi_blen_regval
using the DMA_AXI_BLENx constants.
net: stmmac: move stmmac_axi_blen_to_mask() to stmmac_main.c
Move the call to stmmac_axi_blen_to_mask() out of the individual
MAC version drivers into the main code in stmmac_init_dma_engine(),
passing the resulting value through a new member, axi_blen_regval,
in the struct stmmac_axi structure.
There is now no need for stmmac_axi_blen_to_dma_mask() to use
u32p_replace_bits(), so use FIELD_PREP() instead.
net: stmmac: provide common stmmac_axi_blen_to_mask()
Provide a common stmmac_axi_blen_to_mask() function to translate the
burst length array to the value for the AXI bus mode register, and use
it for dwmac, dwmac4 and dwxgmac2. Remove the now unnecessary
XGMAC_BLEN* definitions.
Note that stmmac_axi_blen_to_dma_mask() is coded to be more efficient
than the original three implementations, and verifies the contents of
the burst length array.
net: stmmac: move common DMA AXI register bits to common.h
Move the common DMA AXI register bits to common.h so they can be shared
and we can provide a common function to convert the axi->dma_blen[]
array to the format needed for this register.
net: stmmac: dwc-qos-eth: simplify switch() in dwc_eth_dwmac_config_dt()
Simplify the switch() statement in dwc_eth_dwmac_config_dt().
Although this is not speed-critical, simplifying it can make it more
readable. This also drastically improves the code emitted by the
compiler.
On aarch64, with the original code, the compiler loads registers with
every possible value, and then has a tree of test-and-branch statements
to work out which register to store. With the simplified code, the
compiler can load a register with '4' and shift it appropriately.
This shrinks the text size on aarch64 from 4289 bytes to 4153 bytes,
a reduction of 3%.