Aditya Garg [Sat, 2 May 2026 07:45:33 +0000 (00:45 -0700)]
net: mana: Use per-queue allocation for tx_qp to reduce allocation size
Convert tx_qp from a single contiguous array allocation to per-queue
individual allocations. Each mana_tx_qp struct is approximately 35KB.
With many queues (e.g., 32/64), the flat array requires a single
contiguous allocation that can fail under memory fragmentation.
Change mana_tx_qp *tx_qp to mana_tx_qp **tx_qp (array of pointers),
allocating each queue's mana_tx_qp individually via kvzalloc. This
reduces each allocation to ~35KB and provides vmalloc fallback,
avoiding allocation failure due to fragmentation.
====================
selftests: rds: Log collection, TAP compliance and cleanups
This series is a set of bug fixes and improvements for the rds
selftests.
Patch 1 bumps the kselftest timeout from 400s to 800s. The original
limit was developed against a lean config, but the kselftest harness
counts boot time and gcov log collection against the limit, so a
default config with gcov enabled needs more headroom.
Patch 2 corrects some typos in the run.sh USAGE string and removes an
unused "-g" flag.
Patch 3 silences a handful of pylint warnings in test.py: it adds a
module docstring, suppresses the warnings tied to the sys.path.append
import trick, marks the long lived tcpdump Popen with disable-next
consider-using-with, and drops unused exception variables from two
BlockingIOError except clauses.
Patch 4 adds a -t flag to run.sh so the timeout can be overridden
if needed.
Patch 5 adds a RDS_LOG_DIR environment variable that specifies where
logs should be stored, or skips log collection if left unset
Patch 6 adds a SUDO_USER environment variable that sets the user
for tcpdump --relinquish-privileges. This avoid the permissions
drop that would leave pcaps empty on 9pfs since 9p does not
support chown
Patch 7 removes the initial tmp tcpdumps and instead saves the pcaps
directly to the logdir if it is set.
Patch 8 hoists the tcpdump shutdown into a helper and calls it from the
timeout signal handler so that the processes are properly terminated
and dumps are flushed
Patch 9 fixes gcov collection by ensuring debugfs is mounted, and
specifying the --root folder so that gcov can still find the kernel
source when it is run from the ksft test directory.
Patch 10 makes the test output TAP compliant so the kselftest runner
parses results correctly.
====================
This patch updates the rds selftests output to be TAP compliant.
Use ksft_pr() to mark debug output with a leading '# ' so that TAP
parsers treat it as commentary, and convert all informational print()
calls to use ksft_pr(). sys.exit(0) is changed to os._exit(0) to
avoid duplicate prints from the buffered TAP output. The console
output from the tcpdump subprocess is silenced, and the gcov console
output is redirected to a gcovr.log.
Finally adjust the exit path so that the hash check loop sets a
return code instead exiting directly. Then print the TAP results
and totals lines before exiting.
debugfs is not mounted automatically in a virtme-ng guest, so the
gcov data copy from /sys/kernel/debug/gcov/ silently finds nothing
depending on whether debugfs is mounted by default on the host OS.
Fix this by mounting debugfs in run.sh before copying the gcda
files.
Finally when invoked through the kselftest runner, the working
directory is the test directory rather than the kernel source root.
gcovr defaults --root to the current working directory, which causes
it to filter out all coverage data for files under net/rds/ since
they are not under the test directory. Fix this by passing --root
to gcovr explicitly.
The timeout signal handler for the rds selftests currently just
exits when the time limit is exceeded, and forgets to stop the
network dumps. Fix this by hoisting the tcpdump terminate commands
into a helper function, and call it from the signal handler before
exiting
Bound proc.wait() with a timeout (and fall back to proc.kill())
so an unresponsive tcpdump cannot hang the timeout path itself.
We also pop() tcpdump_procs as we iterate, so stop_pcaps() is safe
to call from both the normal cleanup path and the signal handler,
since the second invocation simply has nothing to do
This patch modifies rds selftests to use the environment variable
SUDO_USER for tcpdumps if it is set. This is needed to avoid chown
operations on the vng 9pfs which is not supported. Passing a user
listed in sudoers avoids the tcpdump privilege drop which may
otherwise create empty pcaps
This patch modifies the rds selftest to look for an env variable
RDS_LOG_DIR, and log all traces, pcaps and gcov collections to
the folder specified in RDS_LOG_DIR. If RDS_LOG_DIR is unset,
logs are not collected.
Add a -t flag to run.sh to optionally override the default
timeout. The --timeout flag is already supported in test.py,
so just add the shorthand -t flag
This patch fixes a few pylint errors in test.py. Remove unused exception
variables from except blocks, and disable warnings for imports that cannot
appear at the start of the module. Also disable warnings for the
tcpdump processes. The suggestion to use a with block does not apply
here since the process needs to outlive the parent to collect the dumps.
Lastly add the module docstring at the top of the module.
The 400s time out was originally developed under a leaner
kernel config that booted much faster than a default config.
Boot up is included as part of the over all test runtime, as
well as any log collection done when the test is complete.
A slower config combined with the gcov enabled test means
we'll need more time to accommodate the boot up and log
collection. So, bump time out to 800s.
====================
Fixes for mv88e6xxx for 6320/6321 family
Five fixes for mv88e6xxx for 6320/6321 family, for net-next,
without Fixes tags, as per Andrew's request last year, see
https://lore.kernel.org/netdev/20250313134146.27087-1-kabel@kernel.org/
====================
Marek Behún [Mon, 4 May 2026 15:32:27 +0000 (17:32 +0200)]
net: dsa: mv88e6xxx: enable devlink ATU hash param for 6320 family
Commit 23e8b470c7788 ("net: dsa: mv88e6xxx: Add devlink param for ATU
hash algorithm.") introduced ATU hash algorithm access via devlink, but
did not enable it for the 6320 family. Do it now.
Marek Behún [Mon, 4 May 2026 15:32:25 +0000 (17:32 +0200)]
net: dsa: mv88e6xxx: define .pot_clear() for 6321
Commit 9e907d739cc3 ("net: dsa: mv88e6xxx: add POT operation") did not
add the .pot_clear() method to the 6321 switch operations structure.
Add them now.
====================
selftests: drv-net: convert so_txtime to drv-net
In preparation for extending to pacing hardware offload, convert the
so_txtime.sh test to a drv-net test that can be run against netdevsim
and real hardware.
Two preparatory patches
1. support negative tests, where tests are expected to fail
2. add a tc helper
See individual patches for details and detailed changelog
====================
In preparation for extending to pacing hardware offload, convert the
so_txtime.sh test to a drv-net test that can be run against netdevsim
and real hardware.
Also update so_txtime.c to not exit on first failure, but run to
completion and report exit code there. This helps with debugging
unexpected results, especially when processing multiple packets,
as happens in the "reverse_order" testcase.
Signed-off-by: Willem de Bruijn <willemb@google.com>
----
v6 -> v7
- update test to use new argument expect_fail
- v6 received Reviewed-by, but dropped due to above (minor) change
v5 -> v6
- fix order in tools/testing/selftests/drivers/net/config
v4 -> v5
- move qdisc setup/restore into each test
- add tc to utils.py (separate patch)
- test expected failure (separate patch)
- fix pylint
- convert fail to pass for timing errors if KSFT_MACHINE_SLOW
(cmd does not special case KSFT_SKIP process returncode yet)
Responses to sashiko review
- The test converts per packet failure to errors, to continue
testing other packets, but other error() cases are not in scope.
- The test starts sender and receiver at an absolute future time,
like the original test. This assumes ~msec scale sync'ed clocks.
- The tc qdisc replace command works fine with noqueue. Tested
manually.
v3 -> v4
- restore original qdisc after test
- drop unnecessary underscore in tap test names
v2 -> v3
- Makefile: so_txtime from YNL_GEN_FILES to TEST_GEN_FILES (Sashiko, NIPA)
v1 -> v2
- move so_txtime.c for net/lib to drivers/net (Jakub)
- fix drivers/net/config order (Jakub)
- detect passing when failure is expected (Jakub, Sashiko)
- pass pylint --disable=R (Jakub)
- only call ksft_run once (Jakub)
- do not sleep if waiting time is negative (Sashiko)
- add \n when converting error() to fprintf() (Sashiko)
- 4 space indentation, instead of 2 space
- increase sync delay from 100 to 200ms, to fix rare vng flakes
The .port_max_speed_mode() method is not used anymore since commit 40da0c32c3fc ("net: dsa: mv88e6xxx: remove handling for DSA and CPU ports").
Drop it.
====================
udp_tunnel: Speed up UDP tunnel device destruction (Part I)
Most of the UDP tunnel devices call synchronize_rcu() twice
during destruction, for example, vxlan has
1) synchronize_rcu() in udp_tunnel_sock_release()
2) synchronize_net() in vxlan_sock_release()
The goal of this series is to remove the former, and another
followup series removes the latter.
synchronize_rcu() was added in udp_tunnel_sock_release() by
commit 3cf7203ca620 ("net/tunnel: wait until all sk_user_data
reader finish before releasing the sock").
This was intended to protect the fast path of a dying vxlan
from dereferencing vxlan_sock->sock->sk after sock_orphan()
has set sock->sk to NULL.
Most of the UDP tunnel devices store struct socket to its
private struct, but it is NOT needed in the fast paths;
struct sock is used there, but struct socket is only used
for tunnel setup / teardown.
This is probably because UDP tunnel functions accept struct
socket, but even such functions do not need it, except for
udp_tunnel_sock_release(), which can safely access sk->sk_socket.
The overview of the series:
Patch 1 - 5 : Convert UDP tunnel helper to take struct sock
Patch 6 : Small fix for 10-years-old bug
Patch 7 - 14 : Store struct sock in tunnel devices
Patch 15 : Remove synchronize_rcu() in udp_tunnel_sock_release()
With this change, a script creating/upping vxlan in 4000 netns
runs 10x faster.
====================
udp_tunnel: Remove synchronize_rcu() in udp_tunnel_sock_release().
Commit 3cf7203ca620 ("net/tunnel: wait until all sk_user_data
reader finish before releasing the sock") added synchronize_rcu()
in udp_tunnel_sock_release().
This was intended to protect the fast path of a dying vxlan device
from dereferencing vxlan_sock->sock->sk after sock_orphan() has set
sock->sk to NULL.
However, vxlan does not need to access struct socket itself
in the fast path; it only reads struct sock, and struct socket
is only used for tunnel setup and teardown.
This applies to all other UDP tunnel users, and they have been
converted to access struct sock directly.
In addition, each device-specific struct used in their fast paths
is freed after one RCU grace period. Since this occurs after
udp_tunnel_sock_release(), the struct is guaranteed to be freed
after struct udp_sock.
Therefore, synchronize_rcu() in udp_tunnel_sock_release() is
now redundant.
Let's remove it.
Tested:
A script creating/upping vxlan devices in 4000 netns runs 10x
faster with this change. We can see the same improvement with
other UDP tunnel devices as well.
$ cat vxlan.sh
for i in `seq 1 40`
do
(for j in `seq 1 100` ; do
unshare -n bash -c "ip link add vxlan0 type vxlan id 100 local 127.0.0.1 dstport 4789 && ip link set vxlan0 up";
done) &
done
wait
With bpftrace, we can see vxlan_stop() is significantly faster.
tipc udp_bearer does not need to access struct socket itself in
the fast path; it only reads struct sock, and struct socket is
only used for tunnel setup and teardown.
Let's store struct sock directly in struct udp_bearer.
Note that cleanup_bearer() calls synchronize_net() after
udp_tunnel_sock_release(), so udp_bearer is not freed until
inflight fast paths finish.
Note also that synchronize_rcu() is added in the error path
of tipc_udp_enable() since udp_bearer will be kfree()d
immediately once we remove synchronize_rcu() in
udp_tunnel_sock_release().
pfcp does not need to access struct socket itself in the fast
path; it only reads struct sock, and struct socket is only used
for tunnel setup and teardown.
Let's store struct sock directly in struct pfcp_dev.
pfcp_del_sock() is called from dev->netdev_ops->ndo_uninit().
The 2nd synchronize_net() in unregister_netdevice_many_notify()
ensures that inflight pfcp RX fast paths finish before pfcp_dev
is freed.
Note that synchronize_rcu() is added in the error path of
pfcp_newlink() since free_netdev() will free pfcp_dev immediately
once we remove synchronize_rcu() in udp_tunnel_sock_release().
amt does not need to access struct socket itself in the fast path;
it only reads struct sock, and struct socket is only used for tunnel
setup and teardown.
Let's store struct sock directly in struct amt.
amt_dev_stop() is called as dev->netdev_ops->ndo_stop().
synchronize_net() in unregister_netdevice_many_notify() ensures
that inflight amt RX fast paths finish before amt_dev is freed.
amt no longer needs synchronize_rcu() in udp_tunnel_sock_release().
Note that amt_dev_stop() looks buggy; cancel_delayed_work_sync()
should be called after udp_tunnel_sock_release().
fou does not need to access struct socket itself in the fast
path; it only reads struct sock, and struct socket is only used
for tunnel setup and teardown.
Let's store struct sock directly in struct fou.
fou_release() frees struct fou with kfree_rcu(), so fou no
longer needs synchronize_rcu() in udp_tunnel_sock_release().
Note that the error path in fou_create() looks buggy; once the
tunnel is set up and fou_add_to_port_list() fails, struct fou
should be freed with kfree_rcu() _after_ udp_tunnel_sock_release().
bareudp does not need to access struct socket itself in the fast
path; it only reads struct sock, and struct socket is only used
for tunnel setup and teardown.
Let's store struct sock directly in struct bareudp_dev.
bareudp_sock_release() is called from dev->netdev_ops->ndo_stop().
synchronize_net() in unregister_netdevice_many_notify() ensures that
inflight bareudp RX fast paths finish before bareudp_dev is freed.
bareudp no longer needs synchronize_rcu() in udp_tunnel_sock_release().
geneve does not need to access struct socket itself in the fast
path; it only reads struct sock, and struct socket is only used for
tunnel setup and teardown.
Let's store struct sock directly in struct geneve_sock.
__geneve_sock_release() frees geneve_sock with kfree_rcu(), so
geneve no longer needs synchronize_rcu() in udp_tunnel_sock_release().
Commit 3cf7203ca620 ("net/tunnel: wait until all sk_user_data
reader finish before releasing the sock") added synchronize_rcu()
in udp_tunnel_sock_release().
This was intended to protect the fast path of a dying vxlan device
from dereferencing vxlan_sock->sock->sk after sock_orphan() has set
sock->sk to NULL.
However, vxlan does not need to access struct socket itself in the
fast path; it only reads struct sock, and struct socket is only
used for tunnel setup and teardown.
Let's store struct sock directly in struct vxlan_sock.
In the next patch, we will free vxlan_sock with kfree_rcu(), then
vxlan no longer needs synchronize_rcu() in udp_tunnel_sock_release().
udp_tunnel: Pass struct sock to udp_tunnel_sock_release().
None of the udp_tunnel users need struct socket in their
fast paths; it is only used for tunnel setup / teardown.
While the UDP tunnel interface accepts struct socket, this
encourages users to store the pointer unnecessarily. This
leads to extra dereferences when accessing struct sock fields
(e.g., sk->sk_user_data instead of sock->sk->sk_user_data).
Furthermore, these dereferences necessitate synchronize_rcu()
in udp_tunnel_sock_release() to protect the fast paths from
sock_orphan() setting sk->sk_socket to NULL.
This overhead can be avoided if users store the struct sock
pointer directly in their private structures.
As a prep, let's change udp_tunnel_sock_release() to take
struct sock instead of struct socket.
====================
first series for xpcs based rsfec configuration
The series:
- Fixes an addr validation error
- Adds MDIO defines associated with RS-FEC
- consolidates the handling of the boilerplat ID registers
into a routine to report id'ish registers and reduces the lines
of code across the entire set of c45 routines.
- adds PMA read/write routines
https://lore.kernel.org/all/20260428172810.175077-2-mike.marciniszyn@gmail.com/
has been removed from the series and submitted to net as
https://lore.kernel.org/all/20260429150049.1643-1-mike.marciniszyn@gmail.com/
pcs reads for DEVS1 and DEVS2 cleaned up 2/3
====================
Document the MDIO interface topology with an ASCII diagram
showing the MAC, PCS (MMD 3), FEC, Separated PMA (MMD 8), and PMD
(MMD 1) blocks and their interconnects. The diagram illustrates how
4 lanes connect the MAC through PCS, FEC, and PMA, then narrow to
2 lanes at the PMD.
The c45 read and write routines are enhanced to support
read and write of the separated PMA for the fbnic.
Co-developed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Mike Marciniszyn (Meta) <mike.marciniszyn@gmail.com> Link: https://patch.msgid.link/20260430150802.3521-4-mike.marciniszyn@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
net: eth: fbnic: Consolidate register reads for ids and devs
Consolidate the register reads for boiler plate registers
to reduce LOC and cleanup pcs reads for DEVS1 to
fetch overrides for reserved bits that the hardware does not
return.
Add airoha_fe_get() and airoha_qdma_get() as utility routines for reading
a masked field from a specified register.
This is a non-functional refactor, no logical changes are introduced to
the existing codebase.
net: phy: realtek: replace magic number with register bit macros
Replace magic number with register bit macros. The description of
the RTL8211B interrupt register is obtained from publicly available
datasheet (RTL8211B(L) Rev. 1.5 Datasheet)
net: mana: hardening: Reject zero max_num_queues from GDMA_QUERY_MAX_RESOURCES
In a CVM environment, hardware responses cannot be trusted. The
GDMA_QUERY_MAX_RESOURCES command returns resource limits used to
determine the maximum number of queues.
In mana_gd_query_max_resources(), gc->max_num_queues is initialized
from num_online_cpus() and successively clamped by the hardware-reported
max_eq, max_cq, max_sq, max_rq, and num_msix_usable values. If any of
these hardware values is zero, gc->max_num_queues becomes zero and the
function returns success. This leads to a confusing failure later when
alloc_etherdev_mq() is called with zero queues, returning NULL and
producing a misleading -ENOMEM error.
Add an explicit zero check for gc->max_num_queues after all clamping
steps and return -ENOSPC for a clear early failure, consistent with the
existing gc->num_msix_usable <= 1 guard.
net: mana: hardening: Reject zero max_num_queues from MANA_QUERY_VPORT_CONFIG
As a part of MANA hardening for CVM, validate that max_num_sq and
max_num_rq returned by MANA_QUERY_VPORT_CONFIG are not zero. These
values flow into apc->num_queues, which is used as an allocation count
and loop bound. A zero value would result in zero-size allocations and
incorrect driver behavior.
====================
net: bridge: mcast: support exponential field encoding
Description:
This series addresses a mismatch in how multicast query
intervals and response codes are handled across IPv4 (IGMPv3)
and IPv6 (MLDv2). While decoding logic currently exists,
the corresponding encoding logic is missing during query
packet generation. This leads to incorrect intervals being
transmitted when values exceed their linear thresholds.
The patches introduce a unified floating-point encoding
approach based on RFC3376 and RFC3810, ensuring that large
intervals are correctly represented in QQIC and MRC fields
using the exponent-mantissa format.
Key Changes:
* ipv4: igmp: get rid of IGMPV3_{QQIC,MRC} and simplify calculation
Removes legacy macros in favor of a cleaner, unified
calculation for retrieving intervals from encoded fields,
improving code maintainability.
* ipv6: mld: rename mldv2_mrc() and add mldv2_qqi()
Standardizes MLDv2 terminology by renaming mldv2_mrc()
to mldv2_mrd() (Maximum Response Delay) and introducing
a new API mldv2_qqi for QQI calculation, improving code
readability.
* ipv4: igmp: encode multicast exponential fields
Introduces the logic to dynamically calculate the exponent
and mantissa using bit-scan (fls). This ensures QQIC and
MRC fields (8-bit) are properly encoded when transmitting
query packets with intervals that exceed their respective
linear threshold value of 128 (for QQI/MRT).
* ipv6: mld: encode multicast exponential fields
Applies similar encoding logic for MLDv2. This ensures
QQIC (8-bit) and MRC (16-bit) fields are properly encoded
when transmitting query packets with intervals that exceed
their respective linear thresholds (128 for QQI; 32768
for MRD).
* selftests: net: bridge: add MRC and QQIC field encoding tests
Updates bridge selftests to validate both linear and non-linear
(exponential) encoding for MRC and QQIC fields, ensuring
protocol compliance across IGMPv3 and MLDv2.
Impact:
These changes ensure that multicast queriers and listeners
stay synchronized on timing intervals, preventing protocol
timeouts or premature group membership expiration caused
by incorrectly formatted packet headers.
====================
Ujjal Roy [Sat, 2 May 2026 13:19:06 +0000 (13:19 +0000)]
selftests: net: bridge: add MRC and QQIC field encoding tests
Enhance vlmc_query_intvl_test and vlmc_query_response_intvl_test in
bridge_vlan_mcast.sh to validate IGMPv3/MLDv2 protocol compliance for
MRC and QQIC field encoding across both linear and exponential ranges.
TEST: Vlan multicast snooping enable [ OK ]
TEST: Vlan mcast_query_interval global option default value [ OK ]
TEST: Number of tagged IGMPv2 general query [ OK ]
TEST: IGMPv3 QQIC linear value 60(s) [ OK ]
TEST: MLDv2 QQIC linear value 60(s) [ OK ]
TEST: IGMPv3 QQIC non linear value 160(s) [ OK ]
TEST: MLDv2 QQIC non linear value 160(s) [ OK ]
TEST: Vlan mcast_query_response_interval global option default value [ OK ]
TEST: IGMPv3 MRC linear value of 60(x0.1s) [ OK ]
TEST: MLDv2 MRC linear value of 24000(ms) [ OK ]
TEST: IGMPv3 MRC non linear value of 240(x0.1s) [ OK ]
TEST: MLDv2 MRC non linear value of 48000(ms) [ OK ]
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Ujjal Roy <royujjal@gmail.com> Link: https://patch.msgid.link/20260502131907.987-6-royujjal@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ujjal Roy [Sat, 2 May 2026 13:19:05 +0000 (13:19 +0000)]
ipv6: mld: encode multicast exponential fields
In MLD, MRC and QQIC fields are not correctly encoded when
generating query packets. Since the receiver of the query
interprets these fields using the MLDv2 floating-point
decoding logic, any value that exceeds the linear threshold
is incorrectly parsed as an exponential value, leading to
an incorrect interval calculation.
Encode and assign the corresponding protocol fields during
query generation. Introduce the logic to dynamically
calculate the exponent and mantissa using bit-scan (fls).
This ensures MRC (16-bit) and QQIC (8-bit) fields are
properly encoded when transmitting query packets with
intervals that exceed their respective linear thresholds
(32768 for MRD; 128 for QQI).
RFC3810: If Maximum Response Code >= 32768, the Maximum
Response Code field represents a floating-point value as
follows:
0 1 2 3 4 5 6 7 8 9 A B C D E F
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|1| exp | mant |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
RFC3810: If QQIC >= 128, the QQIC field represents a
floating-point value as follows:
0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
|1| exp | mant |
+-+-+-+-+-+-+-+-+
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Ujjal Roy <royujjal@gmail.com> Link: https://patch.msgid.link/20260502131907.987-5-royujjal@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ujjal Roy [Sat, 2 May 2026 13:19:04 +0000 (13:19 +0000)]
ipv4: igmp: encode multicast exponential fields
In IGMP, MRC and QQIC fields are not correctly encoded
when generating query packets. Since the receiver of the
query interprets these fields using the IGMPv3 floating-
point decoding logic, any value that exceeds the linear
threshold is incorrectly parsed as an exponential value,
leading to an incorrect interval calculation.
Encode and assign the corresponding protocol fields during
query generation. Introduce the logic to dynamically
calculate the exponent and mantissa using bit-scan (fls).
This ensures MRC and QQIC fields (8-bit) are properly
encoded when transmitting query packets with intervals
that exceed their respective linear threshold value of
128 (for MRT/QQI).
RFC3376: for both MRC and QQIC, values >= 128 represent
the same floating-point encoding as follows:
0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
|1| exp | mant |
+-+-+-+-+-+-+-+-+
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Ujjal Roy <royujjal@gmail.com> Link: https://patch.msgid.link/20260502131907.987-4-royujjal@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
net: Convert AF_NETLINK and AF_VSOCK to getsockopt_iter API
Continue the work to convert protocols to the new getsockopt_iter API.
Convert AF_NETLINK and AF_VSOCK getsockopt implementations to the new
sockopt_t/getsockopt_iter API, and add kselftests that verify the size
and errno semantics are preserved across the conversion.
I chose these two socket families because they are probably one of the
most used protocols,, ensuring that any potential bugs will be
discovered and reported quickly.
====================
Add a single kselftest covering the proto_ops getsockopt_iter
conversions for AF_NETLINK and AF_VSOCK, using one fixture per protocol:
netlink:
NETLINK_PKTINFO covers the flag-style int path (exact size, oversize
clamp, undersize -EINVAL); NETLINK_LIST_MEMBERSHIPS covers the
size-discovery path that always reports the required buffer length back
via optlen, even when the user buffer is too small to receive any group
bits.
Each fixture also exercises an unknown optname and a bogus level so
the returned-length / errno semantics preserved by the sockopt_t
conversion are pinned down.
Breno Leitao [Fri, 1 May 2026 15:52:52 +0000 (08:52 -0700)]
vsock: convert to getsockopt_iter
Convert AF_VSOCK's getsockopt implementation to use the new
getsockopt_iter callback with sockopt_t. The single
vsock_connectible_getsockopt() callback is shared by both
vsock_stream_ops and vsock_seqpacket_ops, so both proto_ops are
updated to use .getsockopt_iter.
Key changes:
- Replace (char __user *optval, int __user *optlen) with sockopt_t *opt
- Use opt->optlen for buffer length (input) and returned size (output)
- Use copy_to_iter() instead of put_user()/copy_to_user()
Breno Leitao [Fri, 1 May 2026 15:52:51 +0000 (08:52 -0700)]
netlink: convert to getsockopt_iter
Convert AF_NETLINK's getsockopt implementation to use the new
getsockopt_iter callback with sockopt_t.
Key changes:
- Replace (char __user *optval, int __user *optlen) with sockopt_t *opt
- Use opt->optlen for buffer length (input) and returned size (output)
- Use copy_to_iter() instead of put_user()/copy_to_user()
- For NETLINK_LIST_MEMBERSHIPS: walk the groups bitmap and emit each
u32 sequentially via copy_to_iter(), then set opt->optlen to the
total size required (ALIGN(BITS_TO_BYTES(ngroups), sizeof(u32))).
The wrapper writes opt->optlen back to userspace even on partial
failure, preserving the existing API that lets userspace discover
the needed allocation size.
====================
Intel Wired LAN Updates 2024-04-30 (ixgbe, i40e, ice)
This series includes updates to support Energy-Efficient Ethernet (EEE) on
E610 devices in the ixgbe driver, support for an unmanaged DPLL output on
E830, as well as some other minor cleanups and improvements across ixgbe,
i40e, and ice.
Jedrzej begins with the first six patches preparing the ixgbe driver to
support EEE, adding a EEE capability flag, updating the supported EEE
speeds, updating the ACI command structures with the fields related to
EEE, moving the EEE config validation out for re-use, and finally
implementing the EEE support for E610 hardware.
Aleksandr fixes the ixgbe_update_flash_X550() logic to prevent unaligned
access in ixgbe_host_interface_command(). Note: this has no functional
change on x86, and is being sent through net-next as it is considered a
minor cleanup.
Jacob (hi!) modifies the i40e driver to only timestamp PTP event packets,
instead of timestamping every V2 event frame. This avoids wasting the
limited number of timestamp slots for frames which the PTP protocol does
not care about.
Jacob also extends the devlink flash notification message reporting that
users can activate the new firmware via devlink reload to explicitly
indicate the required "fw_activate" action.
Byungchul Park fixes the ice_lbtest_receive_frames() function to use
netmem_desc instead of the page structure.
Przemyslaw Korba fixes a truncation warning in ice_dpll_init_fwnode_pins()
by increasing the allowed length of the pin_name string on the stack to 16.
Ivan Vecera adds some bounds checking to ice_dpll_rclk_state_on_pin_get/set()
and moves the CGU register macros to be under the header guard ifdef in
ice_dpll.h
====================
Introduced by commit ad1df4f2d591 ("ice: dpll: Support E825-C SyncE and
dynamic pin discovery"):
ice_dpll.c: In function ‘ice_dpll_init’:
ice_dpll.c:3588:59: error: ‘%u’ directive output may be truncated
writing between 1 and 10 bytes into a region of size 4
[-Werror=format-truncation=] snprintf(pin_name, sizeof(pin_name),
"rclk%u", i);
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Przemyslaw Korba <przemyslaw.korba@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-13-6f27ae1cd073@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Byungchul Park [Fri, 1 May 2026 06:37:23 +0000 (23:37 -0700)]
ice: access @pp through netmem_desc instead of page
To eliminate the use of struct page in page pool, the page pool users
should use netmem descriptor and APIs instead.
Make ice driver access @pp through netmem_desc instead of page.
Signed-off-by: Byungchul Park <byungchul@sk.com> Tested-by: Alexander Nowlin <alexander.nowlin@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-12-6f27ae1cd073@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jacob Keller [Fri, 1 May 2026 06:37:22 +0000 (23:37 -0700)]
ice: mention fw_activate action along with devlink reload
The ice driver reports a helpful status message when updating firmware
indicating what action is necessary to enable the new firmware. This is
done because some updates require power cycling or rebooting the machine
but some can be activated via devlink.
The ice driver only supports activating firmware with the specific action
of "fw_activate" a bare "devlink dev reload" will *not* update the
firmware, and will only perform driver reinitialization.
Update the status message to explicitly reflect that the reload must use
the fw_activate action.
I considered modifying the text to spell out the full command, but felt
that was both overkill and something that would belong better as part of
the user space program and not hard coded into the kernel driver output.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-11-6f27ae1cd073@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jacob Keller [Fri, 1 May 2026 06:37:21 +0000 (23:37 -0700)]
i40e: only timestamp PTP event packets
The i40e_ptp_set_timestamp_mode() function is responsible for configuring
hardware timestamping. When programming receive timestamping, the logic
must determine how to configure the PRTTSYN_CTL1 register for receive
timestamping.
The i40e hardware does not support timestamping all frames. Instead,
timestamps are captured into one of the four PRTTSYN_RXTIME registers.
Currently, the driver configures hardware to timestamp all V2 packets on
ports 319 and 320, including all message types. This timestamps
significantly more packets than is actually requested by the
HWTSTAMP_FILTER_PTP_V2_EVENT filter type.
The documentation for HWTSTAMP_FILTER_PTP_V2_EVENT indicates that it should
timestamp PTP v2 messages on any layer, including any kind of event
packets.
Timestamping other packets is acceptable, but not required by the filter.
Doing so wastes valuable slots in the Rx timestamp registers. For most
applications this doesn't cause a problem. However, for extremely high
rates of messages, it becomes possible that one of the critical event
packets is not timestamped.
The PTP protocol only requires timestamps for event messages on port 319,
but hardware is timestamping on both 319 and 320, and timestamping message
types which do not need a timestamp value.
The i40e hardware actually has a more strict filtering option. First, only
timestamp layer 4 messages on port 319 instead of both 319 and 320. Second,
note that hardware has a specific mode to timestamp only event packets
(those with message type < 8).
Update the configuration to use the strict mode that only timestamps event
messages, switching the TSYNTYPE field from 10b to 11b which limits the
timestamping only to eventpackets with a Message Type of < 8. Note that the
X700 series datasheet seems to indicate that the V2MSESTYPE field is no
longer relevant. However, we only tested and validated with leaving the
V2MESSTYPE field set to 0xF for the "wildcard" behavior it documents. This
might not be required but it in that case setting it appears harmless, so
leave it as is.
This avoids wasting the valuable Rx timestamp register slots on non-event
frames, and may reduce faults when operating under high event rates.
ixgbe: fix unaligned u32 access in ixgbe_update_flash_X550()
ixgbe_host_interface_command() treats its buffer as a u32 array. The
local buffer we pass in was a union of byte-sized fields, which gives
it 1-byte alignment on the stack. On strict-align architectures this
can cause unaligned 32-bit accesses.
Add a u32 member to union ixgbe_hic_hdr2 so the object is 4-byte
aligned, and pass the u32 member when calling
ixgbe_host_interface_command().
No functional change on x86; prevents unaligned accesses on
architectures that enforce natural alignment.
Fixes: 49425dfc7451 ("ixgbe: Add support for x550em_a 10G MAC type") Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Fixes: 6a14ee0cfb19 ("ixgbe: Add X550 support function pointers") Tested-by: Rinitha S <sx.rinitha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-9-6f27ae1cd073@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add E610 specific implementation of .get_eee() and .set_eee() ethtool
callbacks.
Introduce ixgbe_setup_eee_e610() which is used to set EEE config
on E610 device via ixgbe_aci_set_phy_cfg() (0x0601 ACI command).
Assign it to dedicated mac operation.
E610 devices support EEE feature specifically for 2.5, 5 and 10G link
speeds. When user try to set EEE for unsupported speeds log it.
Setting timer and setting EEE advertised speeds are not yet supported.
EEE shall be enabled by default for E610 devices.
Add EEE statuis logging during link watchdog run.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-6-6f27ae1cd073@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ixgbe: E610: update ACI command structs with EEE fields
There were recent changes in some of the ACI commands,
which have been extended with EEE related fields.
Set PHY Config, Get PHY Caps and Get Link Info have been
affected.
Align SW structs to the recent FW changes.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-4-6f27ae1cd073@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ixgbe: E610: use new version of 0x601 ACI command buffer
Since FW version 1.40, buffer size of the 0x601 cmd has been increased
by 2B - from 24 to 26B. Buffer has been extended with new field
which can be used to configure EEE entry delay.
Pre-1.40 FW versions still expect 24B buffer and throws error when
receipts 26B buffer. To keep compatibility, check whether EEE
device capability flag is set and basing on it use appropriate
size of the command buffer.
Additionally place Set PHY Config capabilities defines out of
structs definitions.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-3-6f27ae1cd073@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Despite there was no EEE (Energy Efficient Ethernet) feature
support for E610 adapters, eee_speeds_supported variable was
defined and even initialized with some EEE speeds.
As E610 adapter supports EEE only for 10G, 5G and 2.5G speeds,
update hw.phy.eee_speeds_supported. Remove unsupported speeds -
10M, 100M and 1G.
Add also entry for 5G speed in EEE speeds mapping array used
by ethtool callbacks.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-2-6f27ae1cd073@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Recently EEE functionality support has been introduced to E610 FW.
Currently ixgbe driver has no possibility to detect whether NVM
loaded on given adapter supports EEE.
There's dedicated device capability element reflecting FW support
for given EEE link speed.
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-1-6f27ae1cd073@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Yang [Thu, 30 Apr 2026 11:45:25 +0000 (19:45 +0800)]
net: dsa: yt921x: Refactor long register helpers
Dealing long registers with u64 is good, until you realize there are
longer 96-bit registers.
Refactor reg64 helpers to use u32 arrays instead of u64 values, in
preparation for 96-bit registers. We do not keep the separate u64
version for reg64 to avoid duplicated wrappers, although it looks better
when dealing with reg64 *only*.
Helpers for reg96 should be added when they are actually used to avoid
function unused warnings.
David Yang [Thu, 30 Apr 2026 11:45:24 +0000 (19:45 +0800)]
net: dsa: pass extack to dsa_switch_ops :: port_policer_add()
Drivers might have error messages to propagate to user space. Propagate
the netlink extack so that they can inform user space in a verbal way of
their limitations.
Make the according transformations to the two users (sja1105 and felix).
net/mlx5: Add vhca_id_type support to IPsec alias creation
When creating an alias FT for MPV IPsec, if alias creation with
sw_vhca_id is supported use it instead of using the hw_vhca_id.
This in turn allows IPsec to work properly after live migration,
in case a VF was live migrated and his hw_vhca_id changed due to
migration which can happen if you migrate to a VF with a different index
than yours, IPsec would fail to start post migration, this patch
resolves the issue by using sw_vhca_id instead which doesn't change post
migration.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260430061958.225245-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Biggers [Wed, 29 Apr 2026 21:08:56 +0000 (21:08 +0000)]
Documentation/tcp_ao: Document the supported MAC algorithms and lengths
Update the TCP-AO documentation to fix some incorrect terminology and
claims regarding the MAC algorithms, and document which MAC algorithms
and lengths the Linux implementation supports.
====================
net/mlx5: enable sub-page allocations for mlx5_frag_buf
This series aims to improve memory utilization for DMA-coherent
fragmented-buffer allocations on systems with large PAGE_SIZE.
Before this change, such allocations were page-granular, as they were
backed by full pages. On large-page systems this caused significant
internal waste for small objects. For example, a single 4K request
consumed an entire 64K page.
The common kernel solution for sub-page coherent DMA allocations is the
DMA pool API. However, those pools do not return pages to the system
until teardown. That behavior is not a good fit for mlx5_frag_buf
allocations, since they back interface resources (WQs and CQs).
Interfaces may be removed dynamically, so their memory footprint should
reflect live usage to avoid situations where large amounts of memory
remain tied up in pools.
This series introduces a lightweight mlx5-local pool implementation for
sub-page coherent DMA allocations, which immediately returns free
backing pages. It wires mlx5_frag_buf allocations to use these internal
pools, while keeping the mechanism reusable for other mlx5-internal
coherent DMA allocation users in follow-up work.
====================
net/mlx5: use internal dma pools for frag buf alloc
Add mlx5_dma_pool alloc/free paths, and wire mlx5_frag_buf allocation
and free paths to use them.
mlx5_frag_buf_alloc_node() now selects an mlx5_dma_pool to allocate
fragments from, instead of directly allocating full coherent pages.
mlx5_frag_buf_free() frees from the respective pool.
mlx5_dma_pool_alloc() keeps allocation fast by maintaining pages with
available indexes at the head of the list, so the common allocation path
can take a free index immediately. New backing pages are allocated only
when no free index is available.
mlx5_dma_pool_free() returns released indexes to the pool and frees a
backing page once all of its indexes become free. This avoids keeping
fully free pages for the lifetime of the pool and reduces coherent DMA
memory footprint.
Introduce mlx5 DMA pool and pool-page data structures, and add the
creation and teardown paths.
Each NUMA node owns a set of mlx5_dma_pool instances, each one with a
different block size. The sizes are defined as all powers of two
starting from MLX5_ADAPTER_PAGE_SHIFT and up to PAGE_SHIFT. Since
mlx5_frag_bufs are used to back objects whose sizes are encoded relative
to MLX5_ADAPTER_PAGE_SHIFT, a smaller block_shift value cannot be used.
Requests larger than PAGE_SIZE continue to be handled as page-sized
fragments, as in the existing frag-buf allocation model.
Wire mlx5_frag_buf pools init/cleanup hooks into
mlx5_mdev_init()/uninit() and the init unwind path.
Keep temporary no-op stubs in alloc.c so lifecycle ordering is in place
before the coherent DMA sub-page allocator implementation is added in
follow-up patches.
Qingfang Deng [Wed, 29 Apr 2026 02:38:46 +0000 (10:38 +0800)]
pppoe: optimize hash with word access
Currently, hash_item() processes the 6-byte Ethernet address and the
2-byte session ID byte-wise to compute a hash.
Optimize this by using 16-bit word operations: XOR three 16-bit words
from the Ethernet address and the 16-bit session ID, then fold the
result. This reduces the total number of loads and XORs. The Ethernet
addresses in a skb and struct pppoe_addr are both 2-byte aligned, so the
u16 pointer cast is safe.
Lorenzo Bianconi [Thu, 30 Apr 2026 08:47:38 +0000 (10:47 +0200)]
net: airoha: configure QoS channel for HW accelerated flowtable traffic
As done for the SW path, configure the QoS channel for HW accelerated
traffic according to the user port index when forwarding to a DSA port,
or rely on the GDM port identifier otherwise. This allows HTB shaping
to be applied to HW accelerated traffic.
selftests: drv-net: Enable ntuple-filters if supported
Certain devices which support ntuple-filters do not enable the feature
by default. The existing tests will skip (if they check for the feature),
or fail if they blindly attempt to install rules. Therefore, attempt to turn
on ntuple-filters if the device supports them.
Eric Dumazet [Thu, 30 Apr 2026 07:40:04 +0000 (07:40 +0000)]
ip6mr: plug drop_reason to ip6mr_cache_report()
- Check mrt->mroute_sk earlier in the function.
- Use sock_queue_rcv_skb_reason() instead of sock_queue_rcv_skb().
- Use sk_skb_reason_drop() instead of kfree_skb().
Note that we return -ENOMEM if sock_queue_rcv_skb_reason() failed,
as the precise error is not really needed for callers.
drivers/net/Space.c is the last remnant of the linux-2.4.x driver model
that required each subsystem and device driver init function to be called
from init/main.c explicitly, before the introduction of initcall levels.
In linux-7.0, this was only used for a handful of ISA network drivers,
with the ne2000 driver being the last one.
Fold the code into ne.c directly, with minimal changes to preserve
the existing command line parsing.
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260429145624.2948432-2-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The cs89x0 driver is really two in one, and they are mutually exclusive:
- the ISA driver was used on 486-era PCs. It likely has no remaining
users, like the other ethernet drivers that got removed in
linux-7.1. The DMA support in here is the last device driver use of
the deprecated isa_bus_to_virt() interface, all other users are either
x86 specific or or got converted to the normal dma-mapping interface.
The driver was maintained by Andrew Morton at the time, based on
the linux-2.2 vendor driver from Cirrus Logic.
- the platform_driver instance was used on some embedded Arm boards
around the same time, such as the EP7211 Development Kit. This
is the same chip, but uses modern devicetree based probing and no DMA.
This was added by Alexander Shiyan.
Remove the ISA driver as a cleanup, including all of the outdated
documentation referring to its configuration.
Jakub Kicinski [Wed, 29 Apr 2026 21:30:01 +0000 (14:30 -0700)]
net: tls: reshuffle the device ops check
We try to validate during registration that the netdev
has ops if it has features. This is currently somewhat sillily
written because we have a dereference before a NULL check
on the ops struct. Straighten this out.
No functional change intended other than saving ourselves
the very theoretical crash with a bad driver.
Note that we check earlier in the function that either ops
or TLS features are set for the device in question.