git.ipfire.org Git - thirdparty/linux.git/log

net: mana: Use per-queue allocation for tx_qp to reduce allocation size

Convert tx_qp from a single contiguous array allocation to per-queue
individual allocations. Each mana_tx_qp struct is approximately 35KB.
With many queues (e.g., 32/64), the flat array requires a single
contiguous allocation that can fail under memory fragmentation.

Change mana_tx_qp *tx_qp to mana_tx_qp **tx_qp (array of pointers),
allocating each queue's mana_tx_qp individually via kvzalloc. This
reduces each allocation to ~35KB and provides vmalloc fallback,
avoiding allocation failure due to fragmentation.

Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/20260502074552.23857-2-gargaditya@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'selftests-rds-log-collection-tap-compliance-and-cleanups'

Allison Henderson says:

====================
selftests: rds: Log collection, TAP compliance and cleanups

This series is a set of bug fixes and improvements for the rds
selftests.

Patch 1 bumps the kselftest timeout from 400s to 800s. The original
limit was developed against a lean config, but the kselftest harness
counts boot time and gcov log collection against the limit, so a
default config with gcov enabled needs more headroom.

Patch 2 corrects some typos in the run.sh USAGE string and removes an
unused "-g" flag.

Patch 3 silences a handful of pylint warnings in test.py: it adds a
module docstring, suppresses the warnings tied to the sys.path.append
import trick, marks the long lived tcpdump Popen with disable-next
consider-using-with, and drops unused exception variables from two
BlockingIOError except clauses.

Patch 4 adds a -t flag to run.sh so the timeout can be overridden
if needed.

Patch 5 adds a RDS_LOG_DIR environment variable that specifies where
logs should be stored, or skips log collection if left unset

Patch 6 adds a SUDO_USER environment variable that sets the user
for tcpdump --relinquish-privileges. This avoid the permissions
drop that would leave pcaps empty on 9pfs since 9p does not
support chown

Patch 7 removes the initial tmp tcpdumps and instead saves the pcaps
directly to the logdir if it is set.

Patch 8 hoists the tcpdump shutdown into a helper and calls it from the
timeout signal handler so that the processes are properly terminated
and dumps are flushed

Patch 9 fixes gcov collection by ensuring debugfs is mounted, and
specifying the --root folder so that gcov can still find the kernel
source when it is run from the ksft test directory.

Patch 10 makes the test output TAP compliant so the kselftest runner
parses results correctly.
====================

Link: https://patch.msgid.link/20260504054143.4027538-1-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: rds: Make rds selftests TAP compliant

This patch updates the rds selftests output to be TAP compliant.

Use ksft_pr() to mark debug output with a leading '# ' so that TAP
parsers treat it as commentary, and convert all informational print()
calls to use ksft_pr(). sys.exit(0) is changed to os._exit(0) to
avoid duplicate prints from the buffered TAP output. The console
output from the tcpdump subprocess is silenced, and the gcov console
output is redirected to a gcovr.log.

Finally adjust the exit path so that the hash check loop sets a
return code instead exiting directly. Then print the TAP results
and totals lines before exiting.

Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260504054143.4027538-11-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: rds: Fix gcov collection

debugfs is not mounted automatically in a virtme-ng guest, so the
gcov data copy from /sys/kernel/debug/gcov/ silently finds nothing
depending on whether debugfs is mounted by default on the host OS.
Fix this by mounting debugfs in run.sh before copying the gcda
files.

Finally when invoked through the kselftest runner, the working
directory is the test directory rather than the kernel source root.
gcovr defaults --root to the current working directory, which causes
it to filter out all coverage data for files under net/rds/ since
they are not under the test directory. Fix this by passing --root
to gcovr explicitly.

Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260504054143.4027538-10-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: rds: Stop tcpdump on timeout

The timeout signal handler for the rds selftests currently just
exits when the time limit is exceeded, and forgets to stop the
network dumps. Fix this by hoisting the tcpdump terminate commands
into a helper function, and call it from the signal handler before
exiting

Bound proc.wait() with a timeout (and fall back to proc.kill())
so an unresponsive tcpdump cannot hang the timeout path itself.

We also pop() tcpdump_procs as we iterate, so stop_pcaps() is safe
to call from both the normal cleanup path and the signal handler,
since the second invocation simply has nothing to do

Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260504054143.4027538-9-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: rds: Remove tmp pcaps

This patch removes the initial tmp tcpdumps and instead saves
the pcaps directly to the logdir if it is set.

Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260504054143.4027538-8-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: rds: Add SUDO_USER env variable

This patch modifies rds selftests to use the environment variable
SUDO_USER for tcpdumps if it is set. This is needed to avoid chown
operations on the vng 9pfs which is not supported. Passing a user
listed in sudoers avoids the tcpdump privilege drop which may
otherwise create empty pcaps

Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260504054143.4027538-7-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: rds: Add RDS_LOG_DIR env variable

This patch modifies the rds selftest to look for an env variable
RDS_LOG_DIR, and log all traces, pcaps and gcov collections to
the folder specified in RDS_LOG_DIR. If RDS_LOG_DIR is unset,
logs are not collected.

Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260504054143.4027538-6-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: rds: Add timeout flag to run.sh

Add a -t flag to run.sh to optionally override the default
timeout. The --timeout flag is already supported in test.py,
so just add the shorthand -t flag

Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260504054143.4027538-5-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: rds: Fix more pylint errors

This patch fixes a few pylint errors in test.py. Remove unused exception
variables from except blocks, and disable warnings for imports that cannot
appear at the start of the module. Also disable warnings for the
tcpdump processes. The suggestion to use a with block does not apply
here since the process needs to outlive the parent to collect the dumps.
Lastly add the module docstring at the top of the module.

Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260504054143.4027538-4-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: rds: Update USAGE string for run.sh

The run.sh script does not have a -g flag. Update USAGE string with
correct flags. Also fix typo packet_duplcate -> packet_duplicate

Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260504054143.4027538-3-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: rds: Increase selftest timeout

The 400s time out was originally developed under a leaner
kernel config that booted much faster than a default config.
Boot up is included as part of the over all test runtime, as
well as any log collection done when the test is complete.
A slower config combined with the gcov enabled test means
we'll need more time to accommodate the boot up and log
collection. So, bump time out to 800s.

Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260504054143.4027538-2-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'fixes-for-mv88e6xxx-for-6320-6321-family'

Marek Behún says:

====================
Fixes for mv88e6xxx for 6320/6321 family

Five fixes for mv88e6xxx for 6320/6321 family, for net-next,
without Fixes tags, as per Andrew's request last year, see
https://lore.kernel.org/netdev/20250313134146.27087-1-kabel@kernel.org/
====================

Link: https://patch.msgid.link/20260504153227.1390546-1-kabel@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: mv88e6xxx: enable devlink ATU hash param for 6320 family

Commit 23e8b470c7788 ("net: dsa: mv88e6xxx: Add devlink param for ATU
hash algorithm.") introduced ATU hash algorithm access via devlink, but
did not enable it for the 6320 family. Do it now.

Signed-off-by: Marek Behún <kabel@kernel.org>
Link: https://patch.msgid.link/20260504153227.1390546-6-kabel@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: mv88e6xxx: enable .rmu_disable() for 6320 family

Commit 9e5baf9b3636 ("net: dsa: mv88e6xxx: add RMU disable op") did not
add the .rmu_disable() method for the 6320 family. Add it now.

Signed-off-by: Marek Behún <kabel@kernel.org>
Link: https://patch.msgid.link/20260504153227.1390546-5-kabel@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: mv88e6xxx: define .pot_clear() for 6321

Commit 9e907d739cc3 ("net: dsa: mv88e6xxx: add POT operation") did not
add the .pot_clear() method to the 6321 switch operations structure.
Add them now.

Signed-off-by: Marek Behún <kabel@kernel.org>
Link: https://patch.msgid.link/20260504153227.1390546-4-kabel@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: mv88e6xxx: allow SPEED_200 for 6320 family on supported ports

The 6320 family supports the ALT_SPEED bit on ports 2, 5 and 6. Allow
this speed by implementing 6320 family specific .port_set_speed_duplex()
method.

Signed-off-by: Marek Behún <kabel@kernel.org>
Link: https://patch.msgid.link/20260504153227.1390546-3-kabel@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: mv88e6xxx: fix number of g1 interrupts for 6320 family

The 6320 family has 9 global1 interrupt, not 8. Fix it.

Signed-off-by: Marek Behún <kabel@kernel.org>
Link: https://patch.msgid.link/20260504153227.1390546-2-kabel@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'selftests-drv-net-convert-so_txtime-to-drv-net'

Willem de Bruijn says:

====================
selftests: drv-net: convert so_txtime to drv-net

In preparation for extending to pacing hardware offload, convert the
so_txtime.sh test to a drv-net test that can be run against netdevsim
and real hardware.

Two preparatory patches
1. support negative tests, where tests are expected to fail
2. add a tc helper

See individual patches for details and detailed changelog
====================

Link: https://patch.msgid.link/20260504174056.565319-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: drv-net: convert so_txtime to drv-net

In preparation for extending to pacing hardware offload, convert the
so_txtime.sh test to a drv-net test that can be run against netdevsim
and real hardware.

Also update so_txtime.c to not exit on first failure, but run to
completion and report exit code there. This helps with debugging
unexpected results, especially when processing multiple packets,
as happens in the "reverse_order" testcase.

Signed-off-by: Willem de Bruijn <willemb@google.com>
----

v6 -> v7

- update test to use new argument expect_fail
- v6 received Reviewed-by, but dropped due to above (minor) change

v5 -> v6

- fix order in tools/testing/selftests/drivers/net/config

v4 -> v5

- move qdisc setup/restore into each test
- add tc to utils.py (separate patch)
- test expected failure (separate patch)
- fix pylint
- convert fail to pass for timing errors if KSFT_MACHINE_SLOW
  (cmd does not special case KSFT_SKIP process returncode yet)

Responses to sashiko review

- The test converts per packet failure to errors, to continue
  testing other packets, but other error() cases are not in scope.
- The test starts sender and receiver at an absolute future time,
  like the original test. This assumes ~msec scale sync'ed clocks.
- The tc qdisc replace command works fine with noqueue. Tested
  manually.

v3 -> v4

- restore original qdisc after test
- drop unnecessary underscore in tap test names

v2 -> v3

- Makefile: so_txtime from YNL_GEN_FILES to TEST_GEN_FILES (Sashiko, NIPA)

v1 -> v2
- move so_txtime.c for net/lib to drivers/net (Jakub)
- fix drivers/net/config order (Jakub)
- detect passing when failure is expected (Jakub, Sashiko)
- pass pylint --disable=R (Jakub)
- only call ksft_run once (Jakub)
- do not sleep if waiting time is negative (Sashiko)
- add \n when converting error() to fprintf() (Sashiko)
- 4 space indentation, instead of 2 space
- increase sync delay from 100 to 200ms, to fix rare vng flakes

Link: https://patch.msgid.link/20260504174056.565319-4-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: py: add tc utility

Add a wrapper similar to existing ip, ethtool, ... commands.

Tc takes a slightly different syntax. Account for that.

The first user is the next patch in this series, converting so_txtime
to drv-net. Pacing offload is supported by selected qdiscs only.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260504174056.565319-3-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: py: support cmd verifying expected failure

Support negative tests, where cmd raises an exception if the command
succeeded.

Add optional argument expect_fail to cmd and bkg. Where fail fails the
test on unexpected error, expect_fail fails it on unexpected success.

Both fail on negative return code. Python subprocess may set a
negative return code on process crash or timeout. Those are never
anticipated failures.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260504174056.565319-2-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

atm: solos-pci: Simplify initialisation of pci_device_id array

Use the convenience macro PCI_DEVICE to initialize .vendor, .device,
.subvendor and .subdevice. Drop explicit zeros that the compiler also
fills in.

Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Link: https://patch.msgid.link/20260504151202.2139919-2-u.kleine-koenig@baylibre.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: mv88e6xxx: remove unused .port_max_speed_mode()

The .port_max_speed_mode() method is not used anymore since commit
40da0c32c3fc ("net: dsa: mv88e6xxx: remove handling for DSA and CPU ports").
Drop it.

Signed-off-by: Marek Behún <kabel@kernel.org>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20260504152653.1389394-1-kabel@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'udp_tunnel-speed-up-udp-tunnel-device-destruction-part-i'

Kuniyuki Iwashima says:

====================
udp_tunnel: Speed up UDP tunnel device destruction (Part I)

Most of the UDP tunnel devices call synchronize_rcu() twice
during destruction, for example, vxlan has

  1) synchronize_rcu() in udp_tunnel_sock_release()

  2) synchronize_net() in vxlan_sock_release()

The goal of this series is to remove the former, and another
followup series removes the latter.

synchronize_rcu() was added in udp_tunnel_sock_release() by
commit 3cf7203ca620 ("net/tunnel: wait until all sk_user_data
reader finish before releasing the sock").

This was intended to protect the fast path of a dying vxlan
from dereferencing vxlan_sock->sock->sk after sock_orphan()
has set sock->sk to NULL.

Most of the UDP tunnel devices store struct socket to its
private struct, but it is NOT needed in the fast paths;
struct sock is used there, but struct socket is only used
for tunnel setup / teardown.

This is probably because UDP tunnel functions accept struct
socket, but even such functions do not need it, except for
udp_tunnel_sock_release(), which can safely access sk->sk_socket.

The overview of the series:

  Patch 1 -  5 : Convert UDP tunnel helper to take struct sock
  Patch 6      : Small fix for 10-years-old bug
  Patch 7 - 14 : Store struct sock in tunnel devices
  Patch 15     : Remove synchronize_rcu() in udp_tunnel_sock_release()

With this change, a script creating/upping vxlan in 4000 netns
runs 10x faster.
====================

[See Link for benchmark results.]

Link: https://patch.msgid.link/20260502031401.3557229-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

udp_tunnel: Remove synchronize_rcu() in udp_tunnel_sock_release().

Commit 3cf7203ca620 ("net/tunnel: wait until all sk_user_data
reader finish before releasing the sock") added synchronize_rcu()
in udp_tunnel_sock_release().

This was intended to protect the fast path of a dying vxlan device
from dereferencing vxlan_sock->sock->sk after sock_orphan() has set
sock->sk to NULL.

However, vxlan does not need to access struct socket itself
in the fast path; it only reads struct sock, and struct socket
is only used for tunnel setup and teardown.

This applies to all other UDP tunnel users, and they have been
converted to access struct sock directly.

In addition, each device-specific struct used in their fast paths
is freed after one RCU grace period.  Since this occurs after
udp_tunnel_sock_release(), the struct is guaranteed to be freed
after struct udp_sock.

Therefore, synchronize_rcu() in udp_tunnel_sock_release() is
now redundant.

Let's remove it.

Tested:

A script creating/upping vxlan devices in 4000 netns runs 10x
faster with this change.  We can see the same improvement with
other UDP tunnel devices as well.

  $ cat vxlan.sh
  for i in `seq 1 40`
  do
      (for j in `seq 1 100` ; do
            unshare -n bash -c "ip link add vxlan0 type vxlan id 100 local 127.0.0.1 dstport 4789 && ip link set vxlan0 up";
       done) &
  done
  wait

With bpftrace, we can see vxlan_stop() is significantly faster.

  bpftrace -e '
  kprobe:vxlan_stop {
          @start[tid] = nsecs;
  }

  kretprobe:vxlan_stop /@start[tid]/ {
          @duration_us = hist((nsecs - @start[tid]) / 1000);
          delete(@start[tid]);
  }

  END {
          printf("\nExecution time of vxlan_stop (us):\n");
  }'

Before:

  # time ./vxlan.sh // without bpftrace
  real 0m50.615s
  user 0m8.171s
  sys 1m45.101s

  @duration_us:
  [4K, 8K)            1266 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                   |
  [8K, 16K)           1957 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
  [16K, 32K)           764 |@@@@@@@@@@@@@@@@@@@@                                |
  [32K, 64K)             6 |                                                    |
  [64K, 128K)            4 |                                                    |
  [128K, 256K)           3 |                                                    |

After:

  # time ./vxlan.sh // without bpftrace
  real 0m5.247s
  user 0m7.956s
  sys 1m47.404s

  @duration_us:
  [16, 32)            3411 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
  [32, 64)             383 |@@@@@                                               |
  [64, 128)            107 |@                                                   |
  [128, 256)            79 |@                                                   |
  [256, 512)            16 |                                                    |
  [512, 1K)              2 |                                                    |
  [1K, 2K)               2 |                                                    |

Next step is to remove another synchronize_net() in vxlan_stop()
and variants in other devices.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260502031401.3557229-16-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tipc: Store struct sock in struct udp_bearer.

tipc udp_bearer does not need to access struct socket itself in
the fast path; it only reads struct sock, and struct socket is
only used for tunnel setup and teardown.

Let's store struct sock directly in struct udp_bearer.

Note that cleanup_bearer() calls synchronize_net() after
udp_tunnel_sock_release(), so udp_bearer is not freed until
inflight fast paths finish.

Note also that synchronize_rcu() is added in the error path
of tipc_udp_enable() since udp_bearer will be kfree()d
immediately once we remove synchronize_rcu() in
udp_tunnel_sock_release().

This can be later converted to kfree_rcu().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260502031401.3557229-15-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

pfcp: Store struct sock in struct pfcp_dev.

pfcp does not need to access struct socket itself in the fast
path; it only reads struct sock, and struct socket is only used
for tunnel setup and teardown.

Let's store struct sock directly in struct pfcp_dev.

pfcp_del_sock() is called from dev->netdev_ops->ndo_uninit().
The 2nd synchronize_net() in unregister_netdevice_many_notify()
ensures that inflight pfcp RX fast paths finish before pfcp_dev
is freed.

Note that synchronize_rcu() is added in the error path of
pfcp_newlink() since free_netdev() will free pfcp_dev immediately
once we remove synchronize_rcu() in udp_tunnel_sock_release().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260502031401.3557229-14-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

amt: Store struct sock in struct amt_dev.

amt does not need to access struct socket itself in the fast path;
it only reads struct sock, and struct socket is only used for tunnel
setup and teardown.

Let's store struct sock directly in struct amt.

amt_dev_stop() is called as dev->netdev_ops->ndo_stop().
synchronize_net() in unregister_netdevice_many_notify() ensures
that inflight amt RX fast paths finish before amt_dev is freed.

amt no longer needs synchronize_rcu() in udp_tunnel_sock_release().

Note that amt_dev_stop() looks buggy; cancel_delayed_work_sync()
should be called after udp_tunnel_sock_release().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260502031401.3557229-13-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

fou: Store struct sock in struct fou.

fou does not need to access struct socket itself in the fast
path; it only reads struct sock, and struct socket is only used
for tunnel setup and teardown.

Let's store struct sock directly in struct fou.

fou_release() frees struct fou with kfree_rcu(), so fou no
longer needs synchronize_rcu() in udp_tunnel_sock_release().

Note that the error path in fou_create() looks buggy; once the
tunnel is set up and fou_add_to_port_list() fails, struct fou
should be freed with kfree_rcu() _after_ udp_tunnel_sock_release().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260502031401.3557229-12-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bareudp: Store struct sock in struct bareudp_dev.

bareudp does not need to access struct socket itself in the fast
path; it only reads struct sock, and struct socket is only used
for tunnel setup and teardown.

Let's store struct sock directly in struct bareudp_dev.

bareudp_sock_release() is called from dev->netdev_ops->ndo_stop().
synchronize_net() in unregister_netdevice_many_notify() ensures that
inflight bareudp RX fast paths finish before bareudp_dev is freed.

bareudp no longer needs synchronize_rcu() in udp_tunnel_sock_release().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260502031401.3557229-11-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

geneve: Store struct sock in struct geneve_sock.

geneve does not need to access struct socket itself in the fast
path; it only reads struct sock, and struct socket is only used for
tunnel setup and teardown.

Let's store struct sock directly in struct geneve_sock.

__geneve_sock_release() frees geneve_sock with kfree_rcu(), so
geneve no longer needs synchronize_rcu() in udp_tunnel_sock_release().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260502031401.3557229-10-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

vxlan: Free vxlan_sock with kfree_rcu().

We will remove synchronize_rcu() in udp_tunnel_sock_release().

We must ensure that vxlan_sock is freed after inflight RX fast path.

Let's free vxlan_sock with kfree_rcu().

Note that vxlan_sock.vni_list[] is 8K and struct rcu_head must
be placed before it.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260502031401.3557229-9-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

vxlan: Store struct sock in struct vxlan_sock.

Commit 3cf7203ca620 ("net/tunnel: wait until all sk_user_data
reader finish before releasing the sock") added synchronize_rcu()
in udp_tunnel_sock_release().

This was intended to protect the fast path of a dying vxlan device
from dereferencing vxlan_sock->sock->sk after sock_orphan() has set
sock->sk to NULL.

However, vxlan does not need to access struct socket itself in the
fast path; it only reads struct sock, and struct socket is only
used for tunnel setup and teardown.

Let's store struct sock directly in struct vxlan_sock.

In the next patch, we will free vxlan_sock with kfree_rcu(), then
vxlan no longer needs synchronize_rcu() in udp_tunnel_sock_release().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260502031401.3557229-8-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

vxlan: Fix potential null-ptr-deref in vxlan_gro_prepare_receive().

udp_tunnel_sock_release() could set sk->sk_user_data to NULL
while vxlan_gro_prepare_receive() is running.

Let's check if rcu_dereference_sk_user_data() is NULL after
skb_gro_remcsum_init().

Fixes: 5602c48cf875 ("vxlan: change vxlan to use UDP socket GRO")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260502031401.3557229-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

udp_tunnel: Pass struct sock to udp_tunnel_notify_{add,del}_rx_port().

None of the udp_tunnel users need struct socket in their
fast paths; it is only used for tunnel setup / teardown.

Even udp_tunnel_notify_{add,del}_rx_port() do not need
struct socket.

Let's change udp_tunnel_notify_{add,del}_rx_port() to take
struct sock instead of struct socket.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260502031401.3557229-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

udp_tunnel: Pass struct sock to udp_tunnel_{push,drop}_rx_port().

None of the udp_tunnel users need struct socket in their
fast paths; it is only used for tunnel setup / teardown.

Even udp_tunnel_{push,drop}_rx_port() do not need struct socket.

Let's change udp_tunnel_{push,drop}_rx_port() to take struct
sock instead of struct socket.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260502031401.3557229-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

udp_tunnel: Pass struct sock to udp_tunnel6_dst_lookup().

None of the udp_tunnel users need struct socket in their
fast paths; it is only used for tunnel setup / teardown.

Even udp_tunnel6_dst_lookup() does not need struct socket.

Let's change udp_tunnel6_dst_lookup() to take struct sock
instead of struct socket.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260502031401.3557229-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

udp_tunnel: Pass struct sock to setup_udp_tunnel_sock().

None of the udp_tunnel users need struct socket in their
fast paths; it is only used for tunnel setup / teardown.

Even setup_udp_tunnel_sock() does not need struct socket.

Let's change setup_udp_tunnel_sock() to take struct sock
instead of struct socket.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260502031401.3557229-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

udp_tunnel: Pass struct sock to udp_tunnel_sock_release().

None of the udp_tunnel users need struct socket in their
fast paths; it is only used for tunnel setup / teardown.

While the UDP tunnel interface accepts struct socket, this
encourages users to store the pointer unnecessarily. This
leads to extra dereferences when accessing struct sock fields
(e.g., sk->sk_user_data instead of sock->sk->sk_user_data).

Furthermore, these dereferences necessitate synchronize_rcu()
in udp_tunnel_sock_release() to protect the fast paths from
sock_orphan() setting sk->sk_socket to NULL.

This overhead can be avoided if users store the struct sock
pointer directly in their private structures.

As a prep, let's change udp_tunnel_sock_release() to take
struct sock instead of struct socket.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260502031401.3557229-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'first-series-for-xpcs-based-rsfec-configuration'

Mike Marciniszyn says:

====================
first series for xpcs based rsfec configuration

The series:
- Fixes an addr validation error
- Adds MDIO defines associated with RS-FEC
- consolidates the handling of the boilerplat ID registers
into a routine to report id'ish registers and reduces the lines
of code across the entire set of c45 routines.
- adds PMA read/write routines

https://lore.kernel.org/all/20260428172810.175077-2-mike.marciniszyn@gmail.com/
has been removed from the series and submitted to net as
https://lore.kernel.org/all/20260429150049.1643-1-mike.marciniszyn@gmail.com/

pcs reads for DEVS1 and DEVS2 cleaned up 2/3
====================

Link: https://patch.msgid.link/20260430150802.3521-1-mike.marciniszyn@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: eth: fbnic: Add pma read and write access

Document the MDIO interface topology with an ASCII diagram
showing the MAC, PCS (MMD 3), FEC, Separated PMA (MMD 8), and PMD
(MMD 1) blocks and their interconnects. The diagram illustrates how
4 lanes connect the MAC through PCS, FEC, and PMA, then narrow to
2 lanes at the PMD.

The c45 read and write routines are enhanced to support
read and write of the separated PMA for the fbnic.

Co-developed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Mike Marciniszyn (Meta) <mike.marciniszyn@gmail.com>
Link: https://patch.msgid.link/20260430150802.3521-4-mike.marciniszyn@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: eth: fbnic: Consolidate register reads for ids and devs

Consolidate the register reads for boiler plate registers
to reduce LOC and cleanup pcs reads for DEVS1 to
fetch overrides for reserved bits that the hardware does not
return.

Signed-off-by: Mike Marciniszyn (Meta) <mike.marciniszyn@gmail.com>
Link: https://patch.msgid.link/20260430150802.3521-3-mike.marciniszyn@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: mdio: Add support for RSFEC Control register for PMA

Add the constants associated with RS-FEC configuration
and status as well as the indicated separated bits for
DEVS1 to convey a separated PMA.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Mike Marciniszyn (Meta) <mike.marciniszyn@gmail.com>
Link: https://patch.msgid.link/20260430150802.3521-2-mike.marciniszyn@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/sched: speedup tc_dump_qdisc() when tcm_handle is provided

"tc qdisc show ... handle xxx" filtering can be done by the kernel.

A followup patch can do the same for tcm_parent.

iproute2/tc needs a small companion patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260503114515.2460477-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Introduce airoha_fe_get()/airoha_qdma_get() register read helpers

Add airoha_fe_get() and airoha_qdma_get() as utility routines for reading
a masked field from a specified register.
This is a non-functional refactor, no logical changes are introduced to
the existing codebase.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260501-airoha_fe_get-airoha_qdma_get-v3-1-126c6f647ccb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: realtek: replace magic number with register bit macros

Replace magic number with register bit macros. The description of
the RTL8211B interrupt register is obtained from publicly available
datasheet (RTL8211B(L) Rev. 1.5 Datasheet)

Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Reviewed-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/20260502092857.156831-1-olek2@wp.pl
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mana: hardening: Reject zero max_num_queues from GDMA_QUERY_MAX_RESOURCES

In a CVM environment, hardware responses cannot be trusted. The
GDMA_QUERY_MAX_RESOURCES command returns resource limits used to
determine the maximum number of queues.

In mana_gd_query_max_resources(), gc->max_num_queues is initialized
from num_online_cpus() and successively clamped by the hardware-reported
max_eq, max_cq, max_sq, max_rq, and num_msix_usable values. If any of
these hardware values is zero, gc->max_num_queues becomes zero and the
function returns success. This leads to a confusing failure later when
alloc_etherdev_mq() is called with zero queues, returning NULL and
producing a misleading -ENOMEM error.

Add an explicit zero check for gc->max_num_queues after all clamping
steps and return -ENOSPC for a clear early failure, consistent with the
existing gc->num_msix_usable <= 1 guard.

Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Link: https://patch.msgid.link/20260430083627.1873757-1-ernis@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mana: hardening: Reject zero max_num_queues from MANA_QUERY_VPORT_CONFIG

As a part of MANA hardening for CVM, validate that max_num_sq and
max_num_rq returned by MANA_QUERY_VPORT_CONFIG are not zero. These
values flow into apc->num_queues, which is used as an allocation count
and loop bound. A zero value would result in zero-size allocations and
incorrect driver behavior.

Return -EPROTO if either value is zero.

Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Link: https://patch.msgid.link/20260430085638.1875400-1-ernis@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-bridge-mcast-support-exponential-field-encoding'

Ujjal Roy says:

====================
net: bridge: mcast: support exponential field encoding

Description:
This series addresses a mismatch in how multicast query
intervals and response codes are handled across IPv4 (IGMPv3)
and IPv6 (MLDv2). While decoding logic currently exists,
the corresponding encoding logic is missing during query
packet generation. This leads to incorrect intervals being
transmitted when values exceed their linear thresholds.

The patches introduce a unified floating-point encoding
approach based on RFC3376 and RFC3810, ensuring that large
intervals are correctly represented in QQIC and MRC fields
using the exponent-mantissa format.

Key Changes:
* ipv4: igmp: get rid of IGMPV3_{QQIC,MRC} and simplify calculation
  Removes legacy macros in favor of a cleaner, unified
  calculation for retrieving intervals from encoded fields,
  improving code maintainability.

* ipv6: mld: rename mldv2_mrc() and add mldv2_qqi()
  Standardizes MLDv2 terminology by renaming mldv2_mrc()
  to mldv2_mrd() (Maximum Response Delay) and introducing
  a new API mldv2_qqi for QQI calculation, improving code
  readability.

* ipv4: igmp: encode multicast exponential fields
  Introduces the logic to dynamically calculate the exponent
  and mantissa using bit-scan (fls). This ensures QQIC and
  MRC fields (8-bit) are properly encoded when transmitting
  query packets with intervals that exceed their respective
  linear threshold value of 128 (for QQI/MRT).

* ipv6: mld: encode multicast exponential fields
  Applies similar encoding logic for MLDv2. This ensures
  QQIC (8-bit) and MRC (16-bit) fields are properly encoded
  when transmitting query packets with intervals that exceed
  their respective linear thresholds (128 for QQI; 32768
  for MRD).

* selftests: net: bridge: add MRC and QQIC field encoding tests
  Updates bridge selftests to validate both linear and non-linear
  (exponential) encoding for MRC and QQIC fields, ensuring
  protocol compliance across IGMPv3 and MLDv2.

Impact:
These changes ensure that multicast queriers and listeners
stay synchronized on timing intervals, preventing protocol
timeouts or premature group membership expiration caused
by incorrectly formatted packet headers.
====================

Link: https://patch.msgid.link/20260502131907.987-1-royujjal@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: bridge: add MRC and QQIC field encoding tests

Enhance vlmc_query_intvl_test and vlmc_query_response_intvl_test in
bridge_vlan_mcast.sh to validate IGMPv3/MLDv2 protocol compliance for
MRC and QQIC field encoding across both linear and exponential ranges.

TEST: Vlan multicast snooping enable                                [ OK ]
TEST: Vlan mcast_query_interval global option default value         [ OK ]
TEST: Number of tagged IGMPv2 general query                         [ OK ]
TEST: IGMPv3 QQIC linear value 60(s)                                [ OK ]
TEST: MLDv2 QQIC linear value 60(s)                                 [ OK ]
TEST: IGMPv3 QQIC non linear value 160(s)                           [ OK ]
TEST: MLDv2 QQIC non linear value 160(s)                            [ OK ]
TEST: Vlan mcast_query_response_interval global option default value   [ OK ]
TEST: IGMPv3 MRC linear value of 60(x0.1s)                          [ OK ]
TEST: MLDv2 MRC linear value of 24000(ms)                           [ OK ]
TEST: IGMPv3 MRC non linear value of 240(x0.1s)                     [ OK ]
TEST: MLDv2 MRC non linear value of 48000(ms)                       [ OK ]

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Ujjal Roy <royujjal@gmail.com>
Link: https://patch.msgid.link/20260502131907.987-6-royujjal@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: mld: encode multicast exponential fields

In MLD, MRC and QQIC fields are not correctly encoded when
generating query packets. Since the receiver of the query
interprets these fields using the MLDv2 floating-point
decoding logic, any value that exceeds the linear threshold
is incorrectly parsed as an exponential value, leading to
an incorrect interval calculation.

Encode and assign the corresponding protocol fields during
query generation. Introduce the logic to dynamically
calculate the exponent and mantissa using bit-scan (fls).
This ensures MRC (16-bit) and QQIC (8-bit) fields are
properly encoded when transmitting query packets with
intervals that exceed their respective linear thresholds
(32768 for MRD; 128 for QQI).

RFC3810: If Maximum Response Code >= 32768, the Maximum
Response Code field represents a floating-point value as
follows:
     0 1 2 3 4 5 6 7 8 9 A B C D E F
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |1| exp |          mant         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

RFC3810: If QQIC >= 128, the QQIC field represents a
floating-point value as follows:
     0 1 2 3 4 5 6 7
    +-+-+-+-+-+-+-+-+
    |1| exp | mant  |
    +-+-+-+-+-+-+-+-+

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Ujjal Roy <royujjal@gmail.com>
Link: https://patch.msgid.link/20260502131907.987-5-royujjal@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: igmp: encode multicast exponential fields

In IGMP, MRC and QQIC fields are not correctly encoded
when generating query packets. Since the receiver of the
query interprets these fields using the IGMPv3 floating-
point decoding logic, any value that exceeds the linear
threshold is incorrectly parsed as an exponential value,
leading to an incorrect interval calculation.

Encode and assign the corresponding protocol fields during
query generation. Introduce the logic to dynamically
calculate the exponent and mantissa using bit-scan (fls).
This ensures MRC and QQIC fields (8-bit) are properly
encoded when transmitting query packets with intervals
that exceed their respective linear threshold value of
128 (for MRT/QQI).

RFC3376: for both MRC and QQIC, values >= 128 represent
the same floating-point encoding as follows:
     0 1 2 3 4 5 6 7
    +-+-+-+-+-+-+-+-+
    |1| exp | mant  |
    +-+-+-+-+-+-+-+-+

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Ujjal Roy <royujjal@gmail.com>
Link: https://patch.msgid.link/20260502131907.987-4-royujjal@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: mld: rename mldv2_mrc() and add mldv2_qqi()

Rename mldv2_mrc() to mldv2_mrd() as it is used to calculate
the Maximum Response Delay from the Maximum Response Code.

Introduce a new API mldv2_qqi() to define the existing
calculation logic of QQI from QQIC. This also organizes
the existing mld_update_qi() API.

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Ujjal Roy <royujjal@gmail.com>
Link: https://patch.msgid.link/20260502131907.987-3-royujjal@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: igmp: get rid of IGMPV3_{QQIC,MRC} and simplify calculation

Get rid of the IGMPV3_MRC macro and use the igmpv3_mrt() API to
calculate the Max Resp Time from the Maximum Response Code.

Similarly, for IGMPV3_QQIC, use the igmpv3_qqi() API to calculate
the Querier's Query Interval from the QQIC field.

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Ujjal Roy <royujjal@gmail.com>
Link: https://patch.msgid.link/20260502131907.987-2-royujjal@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-convert-af_netlink-and-af_vsock-to-getsockopt_iter-api'

Breno Leitao says:

====================
net: Convert AF_NETLINK and AF_VSOCK to getsockopt_iter API

Continue the work to convert protocols to the new getsockopt_iter API.

Convert AF_NETLINK and AF_VSOCK getsockopt implementations to the new
sockopt_t/getsockopt_iter API, and add kselftests that verify the size
and errno semantics are preserved across the conversion.

I chose these two socket families because they are probably one of the
most used protocols,, ensuring that any potential bugs will be
discovered and reported quickly.
====================

Link: https://patch.msgid.link/20260501-getsock_one-v1-0-810ce23ea70e@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: selftests: add getsockopt_iter regression tests

Add a single kselftest covering the proto_ops getsockopt_iter
conversions for AF_NETLINK and AF_VSOCK, using one fixture per protocol:

netlink:

NETLINK_PKTINFO covers the flag-style int path (exact size, oversize
clamp, undersize -EINVAL); NETLINK_LIST_MEMBERSHIPS covers the
size-discovery path that always reports the required buffer length back
via optlen, even when the user buffer is too small to receive any group
bits.

vsock:
SO_VM_SOCKETS_BUFFER_SIZE covers the u64 path (exact size, oversize
clamp, undersize -EINVAL).

Each fixture also exercises an unknown optname and a bogus level so
the returned-length / errno semantics preserved by the sockopt_t
conversion are pinned down.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260501-getsock_one-v1-3-810ce23ea70e@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

vsock: convert to getsockopt_iter

Convert AF_VSOCK's getsockopt implementation to use the new
getsockopt_iter callback with sockopt_t. The single
vsock_connectible_getsockopt() callback is shared by both
vsock_stream_ops and vsock_seqpacket_ops, so both proto_ops are
updated to use .getsockopt_iter.

Key changes:
- Replace (char __user *optval, int __user *optlen) with sockopt_t *opt
- Use opt->optlen for buffer length (input) and returned size (output)
- Use copy_to_iter() instead of put_user()/copy_to_user()

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260501-getsock_one-v1-2-810ce23ea70e@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: convert to getsockopt_iter

Convert AF_NETLINK's getsockopt implementation to use the new
getsockopt_iter callback with sockopt_t.

Key changes:
- Replace (char __user *optval, int __user *optlen) with sockopt_t *opt
- Use opt->optlen for buffer length (input) and returned size (output)
- Use copy_to_iter() instead of put_user()/copy_to_user()
- For NETLINK_LIST_MEMBERSHIPS: walk the groups bitmap and emit each
  u32 sequentially via copy_to_iter(), then set opt->optlen to the
  total size required (ALIGN(BITS_TO_BYTES(ngroups), sizeof(u32))).
  The wrapper writes opt->optlen back to userspace even on partial
  failure, preserving the existing API that lets userspace discover
  the needed allocation size.

Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260501-getsock_one-v1-1-810ce23ea70e@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: add qstats_cpu_drop_inc() helper

1) Using this_cpu_inc() is better than going through this_cpu_ptr():

- Single instruction on x86.
- Store tearing prevention.

2) Change tcf_action_update_stats() to use this_cpu_add().

3) Add WRITE_ONCE() to __qdisc_qstats_drop() and qstats_drop_inc()
   in preparation for lockless "tc qdisc show".

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 3/17 up/down: 72/-216 (-144)
Function                                     old     new   delta
dualpi2_enqueue_skb                          462     511     +49
tcf_ife_act                                 1061    1077     +16
taprio_enqueue                               613     620      +7
codel_qdisc_enqueue                          149     143      -6
tcf_vlan_act                                 684     676      -8
tcf_skbedit_act                              626     618      -8
tcf_police_act                               725     717      -8
tcf_mpls_act                                1297    1289      -8
tcf_gate_act                                 310     302      -8
tcf_gact_act                                 222     214      -8
tcf_csum_act                                2438    2430      -8
tcf_bpf_act                                  709     701      -8
tcf_action_update_stats                      124     115      -9
pie_qdisc_enqueue                            865     856      -9
pfifo_enqueue                                116     107      -9
choke_enqueue                               2069    2059     -10
plug_enqueue                                 139     128     -11
bfifo_enqueue                                121     110     -11
tcf_nat_act                                 1501    1489     -12
gred_enqueue                                1743    1668     -75
Total: Before=24388609, After=24388465, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260501135916.2566766-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: realtek: Add support for PHY LEDs on RTL8221B

Realtek RTL8221B Ethernet PHY supports three LED pins which are used to
indicate link status and activity. Add netdev trigger support for them.

Signed-off-by: Chukun Pan <amadeus@jmu.edu.cn>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260501100002.755672-1-amadeus@jmu.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: taprio: prepare taprio_dump() for RTNL removal

We soon will no longer hold RTNL in qdisc dumps.

Add READ_ONCE()/WRITE_ONCE() annotations.

Note: taprio already uses RCU to protect most of its fields.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260501064247.2027688-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'intel-wired-lan-updates-2024-04-30-ixgbe-i40e-ice'

Jacob Keller says:

====================
Intel Wired LAN Updates 2024-04-30 (ixgbe, i40e, ice)

This series includes updates to support Energy-Efficient Ethernet (EEE) on
E610 devices in the ixgbe driver, support for an unmanaged DPLL output on
E830, as well as some other minor cleanups and improvements across ixgbe,
i40e, and ice.

Jedrzej begins with the first six patches preparing the ixgbe driver to
support EEE, adding a EEE capability flag, updating the supported EEE
speeds, updating the ACI command structures with the fields related to
EEE, moving the EEE config validation out for re-use, and finally
implementing the EEE support for E610 hardware.

Aleksandr fixes the ixgbe_update_flash_X550() logic to prevent unaligned
access in ixgbe_host_interface_command(). Note: this has no functional
change on x86, and is being sent through net-next as it is considered a
minor cleanup.

Jacob (hi!) modifies the i40e driver to only timestamp PTP event packets,
instead of timestamping every V2 event frame. This avoids wasting the
limited number of timestamp slots for frames which the PTP protocol does
not care about.

Jacob also extends the devlink flash notification message reporting that
users can activate the new firmware via devlink reload to explicitly
indicate the required "fw_activate" action.

Byungchul Park fixes the ice_lbtest_receive_frames() function to use
netmem_desc instead of the page structure.

Przemyslaw Korba fixes a truncation warning in ice_dpll_init_fwnode_pins()
by increasing the allowed length of the pin_name string on the stack to 16.

Ivan Vecera adds some bounds checking to ice_dpll_rclk_state_on_pin_get/set()
and moves the CGU register macros to be under the header guard ifdef in
ice_dpll.h
====================

Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-0-6f27ae1cd073@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ice: dpll: Fix compilation warning

Introduced by commit ad1df4f2d591 ("ice: dpll: Support E825-C SyncE and
dynamic pin discovery"):

ice_dpll.c: In function ‘ice_dpll_init’:
ice_dpll.c:3588:59: error: ‘%u’ directive output may be truncated
writing between 1 and 10 bytes into a region of size 4
[-Werror=format-truncation=] snprintf(pin_name, sizeof(pin_name),
"rclk%u", i);

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Przemyslaw Korba <przemyslaw.korba@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-13-6f27ae1cd073@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ice: access @pp through netmem_desc instead of page

To eliminate the use of struct page in page pool, the page pool users
should use netmem descriptor and APIs instead.

Make ice driver access @pp through netmem_desc instead of page.

Signed-off-by: Byungchul Park <byungchul@sk.com>
Tested-by: Alexander Nowlin <alexander.nowlin@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com>
Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-12-6f27ae1cd073@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ice: mention fw_activate action along with devlink reload

The ice driver reports a helpful status message when updating firmware
indicating what action is necessary to enable the new firmware. This is
done because some updates require power cycling or rebooting the machine
but some can be activated via devlink.

The ice driver only supports activating firmware with the specific action
of "fw_activate" a bare "devlink dev reload" will *not* update the
firmware, and will only perform driver reinitialization.

Update the status message to explicitly reflect that the reload must use
the fw_activate action.

I considered modifying the text to spell out the full command, but felt
that was both overkill and something that would belong better as part of
the user space program and not hard coded into the kernel driver output.

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-11-6f27ae1cd073@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

i40e: only timestamp PTP event packets

The i40e_ptp_set_timestamp_mode() function is responsible for configuring
hardware timestamping. When programming receive timestamping, the logic
must determine how to configure the PRTTSYN_CTL1 register for receive
timestamping.

The i40e hardware does not support timestamping all frames. Instead,
timestamps are captured into one of the four PRTTSYN_RXTIME registers.

Currently, the driver configures hardware to timestamp all V2 packets on
ports 319 and 320, including all message types. This timestamps
significantly more packets than is actually requested by the
HWTSTAMP_FILTER_PTP_V2_EVENT filter type.

The documentation for HWTSTAMP_FILTER_PTP_V2_EVENT indicates that it should
timestamp PTP v2 messages on any layer, including any kind of event
packets.

Timestamping other packets is acceptable, but not required by the filter.
Doing so wastes valuable slots in the Rx timestamp registers. For most
applications this doesn't cause a problem. However, for extremely high
rates of messages, it becomes possible that one of the critical event
packets is not timestamped.

The PTP protocol only requires timestamps for event messages on port 319,
but hardware is timestamping on both 319 and 320, and timestamping message
types which do not need a timestamp value.

The i40e hardware actually has a more strict filtering option. First, only
timestamp layer 4 messages on port 319 instead of both 319 and 320. Second,
note that hardware has a specific mode to timestamp only event packets
(those with message type < 8).

Update the configuration to use the strict mode that only timestamps event
messages, switching the TSYNTYPE field from 10b to 11b which limits the
timestamping only to eventpackets with a Message Type of < 8. Note that the
X700 series datasheet seems to indicate that the V2MSESTYPE field is no
longer relevant. However, we only tested and validated with leaving the
V2MESSTYPE field set to 0xF for the "wildcard" behavior it documents. This
might not be required but it in that case setting it appears harmless, so
leave it as is.

This avoids wasting the valuable Rx timestamp register slots on non-event
frames, and may reduce faults when operating under high event rates.

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-10-6f27ae1cd073@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ixgbe: fix unaligned u32 access in ixgbe_update_flash_X550()

ixgbe_host_interface_command() treats its buffer as a u32 array. The
local buffer we pass in was a union of byte-sized fields, which gives
it 1-byte alignment on the stack. On strict-align architectures this
can cause unaligned 32-bit accesses.

Add a u32 member to union ixgbe_hic_hdr2 so the object is 4-byte
aligned, and pass the u32 member when calling
ixgbe_host_interface_command().

No functional change on x86; prevents unaligned accesses on
architectures that enforce natural alignment.

Fixes: 49425dfc7451 ("ixgbe: Add support for x550em_a 10G MAC type")
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Fixes: 6a14ee0cfb19 ("ixgbe: Add X550 support function pointers")
Tested-by: Rinitha S <sx.rinitha@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-9-6f27ae1cd073@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ixgbe: E610: add EEE support

Add E610 specific implementation of .get_eee() and .set_eee() ethtool
callbacks.

Introduce ixgbe_setup_eee_e610() which is used to set EEE config
on E610 device via ixgbe_aci_set_phy_cfg() (0x0601 ACI command).
Assign it to dedicated mac operation.

E610 devices support EEE feature specifically for 2.5, 5 and 10G link
speeds. When user try to set EEE for unsupported speeds log it.

Setting timer and setting EEE advertised speeds are not yet supported.

EEE shall be enabled by default for E610 devices.

Add EEE statuis logging during link watchdog run.

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-6-6f27ae1cd073@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ixgbe: move EEE config validation out of ixgbe_set_eee()

To make this part of the code mode reusable move all
EEE input checks out of ixgbe_set_eee().

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-5-6f27ae1cd073@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ixgbe: E610: update ACI command structs with EEE fields

There were recent changes in some of the ACI commands,
which have been extended with EEE related fields.
Set PHY Config, Get PHY Caps and Get Link Info have been
affected.

Align SW structs to the recent FW changes.

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-4-6f27ae1cd073@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ixgbe: E610: use new version of 0x601 ACI command buffer

Since FW version 1.40, buffer size of the 0x601 cmd has been increased
by 2B - from 24 to 26B. Buffer has been extended with new field
which can be used to configure EEE entry delay.

Pre-1.40 FW versions still expect 24B buffer and throws error when
receipts 26B buffer. To keep compatibility, check whether EEE
device capability flag is set and basing on it use appropriate
size of the command buffer.

Additionally place Set PHY Config capabilities defines out of
structs definitions.

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-3-6f27ae1cd073@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ixgbe: E610: update EEE supported speeds

Despite there was no EEE (Energy Efficient Ethernet) feature
support for E610 adapters, eee_speeds_supported variable was
defined and even initialized with some EEE speeds.

As E610 adapter supports EEE only for 10G, 5G and 2.5G speeds,
update hw.phy.eee_speeds_supported. Remove unsupported speeds -
10M, 100M and 1G.

Add also entry for 5G speed in EEE speeds mapping array used
by ethtool callbacks.

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-2-6f27ae1cd073@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ixgbe: E610: add discovering EEE capability

Add detecting and parsing EEE device capability.

Recently EEE functionality support has been introduced to E610 FW.
Currently ixgbe driver has no possibility to detect whether NVM
loaded on given adapter supports EEE.

There's dedicated device capability element reflecting FW support
for given EEE link speed.

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260430-jk-iwl-net-next-2026-04-30-v1-1-6f27ae1cd073@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-dsa-yt921x-add-port-police-support'

David Yang says:

====================
net: dsa: yt921x: Add port police support
====================

Link: https://patch.msgid.link/20260430114529.3536911-1-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: yt921x: Add port police support

Enable rate meter ability and support limiting the rate of incoming
traffic.

Signed-off-by: David Yang <mmyangfl@gmail.com>
Link: https://patch.msgid.link/20260430114529.3536911-4-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: yt921x: Refactor long register helpers

Dealing long registers with u64 is good, until you realize there are
longer 96-bit registers.

Refactor reg64 helpers to use u32 arrays instead of u64 values, in
preparation for 96-bit registers. We do not keep the separate u64
version for reg64 to avoid duplicated wrappers, although it looks better
when dealing with reg64 *only*.

Helpers for reg96 should be added when they are actually used to avoid
function unused warnings.

Signed-off-by: David Yang <mmyangfl@gmail.com>
Link: https://patch.msgid.link/20260430114529.3536911-3-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: pass extack to dsa_switch_ops :: port_policer_add()

Drivers might have error messages to propagate to user space. Propagate
the netlink extack so that they can inform user space in a verbal way of
their limitations.

Make the according transformations to the two users (sja1105 and felix).

Signed-off-by: David Yang <mmyangfl@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260430114529.3536911-2-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Add vhca_id_type support to IPsec alias creation

When creating an alias FT for MPV IPsec, if alias creation with
sw_vhca_id is supported use it instead of using the hw_vhca_id.

This in turn allows IPsec to work properly after live migration,
in case a VF was live migrated and his hw_vhca_id changed due to
migration which can happen if you migrate to a VF with a different index
than yours, IPsec would fail to start post migration, this patch
resolves the issue by using sw_vhca_id instead which doesn't change post
migration.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260430061958.225245-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Documentation/tcp_ao: Document the supported MAC algorithms and lengths

Update the TCP-AO documentation to fix some incorrect terminology and
claims regarding the MAC algorithms, and document which MAC algorithms
and lengths the Linux implementation supports.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20260429210856.725667-1-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-mlx5-enable-sub-page-allocations-for-mlx5_frag_buf'

Tariq Toukan says:

====================
net/mlx5: enable sub-page allocations for mlx5_frag_buf

This series aims to improve memory utilization for DMA-coherent
fragmented-buffer allocations on systems with large PAGE_SIZE.

Before this change, such allocations were page-granular, as they were
backed by full pages. On large-page systems this caused significant
internal waste for small objects. For example, a single 4K request
consumed an entire 64K page.

The common kernel solution for sub-page coherent DMA allocations is the
DMA pool API. However, those pools do not return pages to the system
until teardown. That behavior is not a good fit for mlx5_frag_buf
allocations, since they back interface resources (WQs and CQs).
Interfaces may be removed dynamically, so their memory footprint should
reflect live usage to avoid situations where large amounts of memory
remain tied up in pools.

This series introduces a lightweight mlx5-local pool implementation for
sub-page coherent DMA allocations, which immediately returns free
backing pages. It wires mlx5_frag_buf allocations to use these internal
pools, while keeping the mechanism reusable for other mlx5-internal
coherent DMA allocation users in follow-up work.
====================

Link: https://patch.msgid.link/20260429201429.223809-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: use internal dma pools for frag buf alloc

Add mlx5_dma_pool alloc/free paths, and wire mlx5_frag_buf allocation
and free paths to use them.

mlx5_frag_buf_alloc_node() now selects an mlx5_dma_pool to allocate
fragments from, instead of directly allocating full coherent pages.

mlx5_frag_buf_free() frees from the respective pool.

mlx5_dma_pool_alloc() keeps allocation fast by maintaining pages with
available indexes at the head of the list, so the common allocation path
can take a free index immediately. New backing pages are allocated only
when no free index is available.

mlx5_dma_pool_free() returns released indexes to the pool and frees a
backing page once all of its indexes become free. This avoids keeping
fully free pages for the lifetime of the pool and reduces coherent DMA
memory footprint.

Signed-off-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260429201429.223809-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: add frag buf pools create/destroy paths

Introduce mlx5 DMA pool and pool-page data structures, and add the
creation and teardown paths.

Each NUMA node owns a set of mlx5_dma_pool instances, each one with a
different block size. The sizes are defined as all powers of two
starting from MLX5_ADAPTER_PAGE_SHIFT and up to PAGE_SHIFT. Since
mlx5_frag_bufs are used to back objects whose sizes are encoded relative
to MLX5_ADAPTER_PAGE_SHIFT, a smaller block_shift value cannot be used.
Requests larger than PAGE_SIZE continue to be handled as page-sized
fragments, as in the existing frag-buf allocation model.

Signed-off-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260429201429.223809-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: wire frag buf pools lifecycle hooks

Wire mlx5_frag_buf pools init/cleanup hooks into
mlx5_mdev_init()/uninit() and the init unwind path.

Keep temporary no-op stubs in alloc.c so lifecycle ordering is in place
before the coherent DMA sub-page allocator implementation is added in
follow-up patches.

Signed-off-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260429201429.223809-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

pppoe: optimize hash with word access

Currently, hash_item() processes the 6-byte Ethernet address and the
2-byte session ID byte-wise to compute a hash.

Optimize this by using 16-bit word operations: XOR three 16-bit words
from the Ethernet address and the 16-bit session ID, then fold the
result. This reduces the total number of loads and XORs. The Ethernet
addresses in a skb and struct pppoe_addr are both 2-byte aligned, so the
u16 pointer cast is safe.

Signed-off-by: Qingfang Deng <qingfang.deng@linux.dev>
Link: https://patch.msgid.link/20260429023848.153425-1-qingfang.deng@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: configure QoS channel for HW accelerated flowtable traffic

As done for the SW path, configure the QoS channel for HW accelerated
traffic according to the user port index when forwarding to a DSA port,
or rely on the GDM port identifier otherwise. This allows HTB shaping
to be applied to HW accelerated traffic.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260430-airoha-ppe-qos-channel-v1-1-5ef9221e85c1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'tcp-move-some-fastpath-fields-to-appropriate-groups'

Eric Dumazet says:

====================
tcp: move some fastpath fields to appropriate groups

Move following fields to better groups to increase data locality.

- delivered
- delivered_ce
- segs_in
- segs_out
- first_tx_mstamp
- delivered_mstamp
- max_packets_out
- cwnd_usage_seq
- rate_delivered
- rate_interval_us

No change in overall tcp_sock size.
====================

Link: https://patch.msgid.link/20260430100021.211139-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: move max_packets_out, cwnd_usage_seq, rate_delivered and rate_interval_us to tcp_sock_write_tx group

These fields are used in TX path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430100021.211139-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: move tp->bytes_acked to tcp_sock_write_tx group

tp->bytes_acked is touched in TX path only.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430100021.211139-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: move tp->first_tx_mstamp and tp->delivered_mstamp to tcp_sock_write_tx

These fields are touched in when payload is sent.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430100021.211139-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: move tp->segs_in and tp->segs_out to tcp_sock_write_txrx group

segs_in is changed for each incoming packet, including ACK packets.
segs_out is changed for each outgoing packet, including ACK packets.

They belong to tcp_sock_write_txrx group.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430100021.211139-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: move tp->delivered and tp->delivered_ce to tcp_sock_write_tx group

These counters are changed whenever sent data is acknowleged.

They do not belong to tcp_sock_write_txrx group, because TCP receivers
do not touch them.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430100021.211139-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: drv-net: Enable ntuple-filters if supported

Certain devices which support ntuple-filters do not enable the feature
by default. The existing tests will skip (if they check for the feature),
or fail if they blindly attempt to install rules. Therefore, attempt to turn
on ntuple-filters if the device supports them.

Signed-off-by: Dimitri Daskalakis <daskald@meta.com>
Reviewed-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260430165217.3700469-1-dimitri.daskalakis1@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ip6mr: plug drop_reason to ip6mr_cache_report()

- Check mrt->mroute_sk earlier in the function.

- Use sock_queue_rcv_skb_reason() instead of sock_queue_rcv_skb().
- Use sk_skb_reason_drop() instead of kfree_skb().
Note that we return -ENOMEM if sock_queue_rcv_skb_reason() failed,
as the precise error is not really needed for callers.

- Remove one net_warn_ratelimited().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260430074004.4133602-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ne2k: fold drivers/net/Space.c into ne.c

drivers/net/Space.c is the last remnant of the linux-2.4.x driver model
that required each subsystem and device driver init function to be called
from init/main.c explicitly, before the introduction of initcall levels.

In linux-7.0, this was only used for a handful of ISA network drivers,
with the ne2000 driver being the last one.

Fold the code into ne.c directly, with minimal changes to preserve
the existing command line parsing.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260429145624.2948432-2-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: cs89x0: remove ISA bus probing

The cs89x0 driver is really two in one, and they are mutually exclusive:

- the ISA driver was used on 486-era PCs. It likely has no remaining
   users, like the other ethernet drivers that got removed in
   linux-7.1. The DMA support in here is the last device driver use of
   the deprecated isa_bus_to_virt() interface, all other users are either
   x86 specific or or got converted to the normal dma-mapping interface.
   The driver was maintained by Andrew Morton at the time, based on
   the linux-2.2 vendor driver from Cirrus Logic.

- the platform_driver instance was used on some embedded Arm boards
   around the same time, such as the EP7211 Development Kit. This
   is the same chip, but uses modern devicetree based probing and no DMA.
   This was added by Alexander Shiyan.

Remove the ISA driver as a cleanup, including all of the outdated
documentation referring to its configuration.

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260429145624.2948432-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: tls: reshuffle the device ops check

We try to validate during registration that the netdev
has ops if it has features. This is currently somewhat sillily
written because we have a dereference before a NULL check
on the ops struct. Straighten this out.

No functional change intended other than saving ourselves
the very theoretical crash with a bad driver.

Note that we check earlier in the function that either ops
or TLS features are set for the device in question.

Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260429213001.1908235-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-sched-tc_dump_qdisc-optimizations'

Eric Dumazet says:

====================
net/sched: tc_dump_qdisc() optimizations

Before converting tc_dump_qdisc() to RCU, we make the following changes:

- Use for_each_netdev_dump() instead of for_each_netdev()

- Only dump qdiscs of a single device at user space request.
====================

Link: https://patch.msgid.link/20260430023628.3216283-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: speedup tc_dump_qdisc() when tcm_ifindex is provided

There is no point dumping qdiscs for all devices when user space
wants them for a single device:

tc -s -d qdisc show dev eth1

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430023628.3216283-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: switch tc_dump_qdisc() to for_each_netdev_dump()

Use for_each_netdev_dump() instead of for_each_netdev().

This is more scalable, and will ease RCU conversion.

This also offer better behavior when other threads
are adding or deleting netevices concurrently.

This enables dumping qdiscs for a single device
at user space request in the following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260430023628.3216283-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>