Vladimir Oltean [Wed, 15 Oct 2025 22:33:25 +0000 (23:33 +0100)]
net: dsa: lantiq_gswip: disallow changes to privately set up VID 0
User space can force changes to VID 0, even though it was privately set up by
this driver.
For example, when the port joins a VLAN-aware bridge,
dsa_user_manage_vlan_filtering() will set NETIF_F_HW_VLAN_CTAG_FILTER.
If the port is subsequently brought up and CONFIG_VLAN_8021Q is enabled,
the vlan_vid0_add() function will want to make sure we are capable of
accepting packets tagged with VID 0.
Generally, DSA/switchdev drivers want to suppress that bit of help from
the 8021q layer, and handle VID 0 filters themselves. The 8021q layer
might actually even be detrimental, because VLANs added through
vlan_vid_add() pass through dsa_user_vlan_rx_add_vid(), which is
documented as this:
/* This API only allows programming tagged, non-PVID VIDs */
.flags = 0,
so it will force VID 0 to be reconfigured as egress-tagged, non-PVID.
Whereas the driver configures it as PVID and egress-untagged, the exact
opposite.
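To make that concrete, a minimal sketch of the kind of guard this implies, assuming the check sits at the top of gswip_port_vlan_add() (placement and error code are illustrative, not the exact patch):

if (vlan->vid == GSWIP_VLAN_UNAWARE_PVID) {
        /* Refuse user space/8021q attempts to reconfigure the VID the
         * driver reserved for VLAN-unaware operation.
         */
        NL_SET_ERR_MSG_MOD(extack, "VID 0 is reserved by the driver");
        return -EBUSY;
}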
Vladimir Oltean [Wed, 15 Oct 2025 22:32:58 +0000 (23:32 +0100)]
net: dsa: lantiq_gswip: permit dynamic changes to VLAN filtering state
The driver should now tolerate these changes, given that the PVID is
automatically recalculated on a VLAN awareness state change.
The VLAN-unaware PVID must be installed to hardware even if the
joined bridge is currently VLAN-aware. Otherwise, when the bridge VLAN
filtering state dynamically changes to VLAN-unaware later, this PVID
will be missing.
This driver doesn't support dynamic VLAN filtering changes, for simplicity.
It expects that on a port, either gswip_vlan_add_unaware() or
gswip_vlan_add_aware() is called, but not both.
When !br_vlan_enabled(), the configure_vlan_while_not_filtering = false
option is exactly what will prevent calls to gswip_port_vlan_add() from
being issued by DSA.
In fact, at the time these features were submitted:
https://patchwork.ozlabs.org/project/netdev/patch/20190501204506.21579-3-hauke@hauke-m.de/
"configure_vlan_while_not_filtering = false" did not even have a name,
it was implicit behaviour. It only became legacy in commit 54a0ed0df496
("net: dsa: provide an option for drivers to always receive bridge
VLANs").
Section "Bridge VLAN filtering" of Documentation/networking/switchdev.rst
describes the exact set of rules. Notably, the PVID of the port must
follow the VLAN awareness state of the bridge port. A VLAN-unaware
bridge port should not respond to the addition of a bridge VLAN with the
PVID flag. In fact, the pvid_change() test in
tools/testing/selftests/net/forwarding/bridge_vlan_unaware.sh tests
exactly this.
The lantiq_gswip driver indeed does not respond to the addition of PVID
VLANs while VLAN-unaware in the way described above, but only because of
configure_vlan_while_not_filtering. Our purpose here is to get rid of
configure_vlan_while_not_filtering, so we must add more complex logic
which follows the VLAN awareness state and walks through the Active VLAN
table entries, to find the index of the PVID register that should be
committed to hardware on each port.
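A rough sketch of that walk (hypothetical helper code: vlan_aware, bridge and bridge_pvid stand in for state the real code derives from the port, and it assumes priv->vlans[] mirrors the Active VLAN table):

/* Recompute the PVID index for this port after an awareness change. */
u16 pvid = vlan_aware ? bridge_pvid : GSWIP_VLAN_UNAWARE_PVID;
int idx = -1;
int i;

for (i = 0; i < ARRAY_SIZE(priv->vlans); i++) {
        if (priv->vlans[i].bridge == bridge && priv->vlans[i].vid == pvid) {
                idx = i;
                break;
        }
}

if (idx >= 0)
        gswip_switch_w(priv, idx, GSWIP_PCE_DEFPVID(port));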
As a side effect of now having a proper implementation to assign the
PVID, all the "VLAN upper: ..." tests of the local_termination.sh
selftests, which would previously all FAIL, now all PASS (or XFAIL, but
that's ok).
Vladimir Oltean [Wed, 15 Oct 2025 22:32:41 +0000 (23:32 +0100)]
net: dsa: lantiq_gswip: merge gswip_vlan_add_unaware() and gswip_vlan_add_aware()
The two functions largely duplicate functionality. The differences
consist in:
- the "fid" passed to gswip_vlan_active_create(). The unaware variant
always passes -1, the aware variant passes fid = priv->vlans[i].fid,
where i is an index into priv->vlans[] for which priv->vlans[i].bridge
is equal to the given bridge.
- the "vid" is not passed to gswip_vlan_add_unaware(). It is implicitly
GSWIP_VLAN_UNAWARE_PVID (zero).
- The "untagged" is not passed to gswip_vlan_add_unaware(). It is
implicitly true. Also, the CPU port must not be a tag member of the
PVID used for VLAN-unaware bridging.
- The "pvid" is not passed to gswip_vlan_add_unaware(). It is implicitly
true.
- The GSWIP_PCE_DEFPVID(port) register is written by the aware variant
with an "idx", but with a hardcoded 0 by the unaware variant.
Merge the two functions into a single unified function without any
functional changes.
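For orientation, one possible shape of the unified helper and of its VLAN-unaware call site (signature and names below are assumptions, not the actual patch):

static int gswip_vlan_add(struct gswip_priv *priv, struct net_device *bridge,
                          int port, u16 vid, bool untagged, bool pvid);

/* VLAN-unaware bridging: fixed VID, egress-untagged, PVID */
err = gswip_vlan_add(priv, bridge, port, GSWIP_VLAN_UNAWARE_PVID, true, true);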
Vladimir Oltean [Wed, 15 Oct 2025 22:32:05 +0000 (23:32 +0100)]
net: dsa: lantiq_gswip: support bridge FDB entries on the CPU port
Currently, the driver takes the bridge from dsa_port_bridge_dev_get(),
which only works for user ports. This is why it has to ignore FDB
entries installed on the CPU port.
Commit c26933639b54 ("net: dsa: request drivers to perform FDB
isolation") introduced the possibility of getting the originating bridge
from the passed dsa_db argument, so let's do that instead. This way, we
can act on the local FDB entries coming from the bridge.
Note that we do not expect FDB events for the DSA_DB_PORT database,
because this driver doesn't fulfill the dsa_switch_supports_uc_filtering()
requirements. So we can just return -EOPNOTSUPP and expect it will never
be triggered.
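A hedged sketch of that pattern (the real gswip_port_fdb_add() differs in detail):

static int gswip_port_fdb_add(struct dsa_switch *ds, int port,
                              const unsigned char *addr, u16 vid,
                              struct dsa_db db)
{
        struct net_device *bridge;

        /* This driver only tracks bridge FDB entries. */
        if (db.type != DSA_DB_BRIDGE)
                return -EOPNOTSUPP;

        /* Valid for user ports and for the CPU port alike. */
        bridge = db.bridge.dev;

        /* ... program the MAC table entry against "bridge" ... */
        return 0;
}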
We've added 6 non-merge commits during the last 1 day(s) which contain
a total of 18 files changed, 577 insertions(+), 38 deletions(-).
The main changes are:
1) Bypass the global per-protocol memory accounting either by setting
a netns sysctl or using bpf_setsockopt in a bpf program,
from Kuniyuki Iwashima.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
selftests/bpf: Add test for sk->sk_bypass_prot_mem.
bpf: Introduce SK_BPF_BYPASS_PROT_MEM.
bpf: Support bpf_setsockopt() for BPF_CGROUP_INET_SOCK_CREATE.
net: Introduce net.core.bypass_prot_mem sysctl.
net: Allow opt-out from global protocol memory accounting.
tcp: Save lock_sock() for memcg in inet_csk_accept().
====================
Eric Biggers [Tue, 14 Oct 2025 21:58:36 +0000 (14:58 -0700)]
tcp: Convert tcp-md5 to use MD5 library instead of crypto_ahash
Make tcp-md5 use the MD5 library API (added in 6.18) instead of the
crypto_ahash API. This is much simpler and also more efficient:
- The library API just operates on struct md5_ctx. Just allocate this
struct on the stack instead of using a pool of pre-allocated
crypto_ahash and ahash_request objects.
- The library API accepts standard pointers and doesn't require
scatterlists. So, for hashing the headers just use an on-stack buffer
instead of a pool of pre-allocated kmalloc'ed scratch buffers.
- The library API never fails. Therefore, checking for MD5 hashing
errors is no longer necessary. Update tcp_v4_md5_hash_skb(),
tcp_v6_md5_hash_skb(), tcp_v4_md5_hash_hdr(), tcp_v6_md5_hash_hdr(),
tcp_md5_hash_key(), tcp_sock_af_ops::calc_md5_hash, and
tcp_request_sock_ops::calc_md5_hash to return void instead of int.
- The library API provides direct access to the MD5 code, eliminating
unnecessary overhead such as indirect function calls and scatterlist
management. Microbenchmarks of tcp_v4_md5_hash_skb() on x86_64 show a
speedup from 7518 to 7041 cycles (6% fewer) with skb->len == 1440, or
from 1020 to 678 cycles (33% fewer) with skb->len == 140.
Since tcp_sigpool_hash_skb_data() can no longer be used, add a function
tcp_md5_hash_skb_data() which is specialized to MD5. Of course, to the
extent that this duplicates any code, it's well worth it.
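For flavour, a sketch of the library style, assuming the usual init/update/final helpers around struct md5_ctx (hdr_buf, skb_data, key and md5_hash are placeholders; the exact call sequence in tcp_md5_hash_skb_data() differs):

struct md5_ctx ctx;     /* lives on the stack; no pool, no scatterlist */

md5_init(&ctx);
md5_update(&ctx, hdr_buf, hdr_len);       /* pseudo-header + TCP header */
md5_update(&ctx, skb_data, data_len);     /* payload */
md5_update(&ctx, key->key, key->keylen);
md5_final(&ctx, md5_hash);                /* cannot fail */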
To preserve the existing behavior of TCP-MD5 support being disabled when
the kernel is booted with "fips=1", make tcp_md5_do_add() check
fips_enabled itself. Previously it relied on the error from
crypto_alloc_ahash("md5") being bubbled up. I don't know for sure that
this is actually needed, but this preserves the existing behavior.
Tested with bidirectional TCP-MD5, both IPv4 and IPv6, between a kernel
that includes this commit and a kernel that doesn't include this commit.
(Side note: please don't use TCP-MD5! It's cryptographically weak. But
as long as Linux supports it, it might as well be implemented properly.)
The io_uring functions return negative error values, but error() expects
these to be positive to properly match them to an errno string. Fix this
to make sure the correct error descriptions are displayed upon failure.
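In practice the fix is to flip the sign before handing the value to error(), e.g. (illustrative, not the exact selftest code; QUEUE_DEPTH is a placeholder):

ret = io_uring_queue_init(QUEUE_DEPTH, &ring, 0);
if (ret < 0)
        error(1, -ret, "io_uring_queue_init"); /* error() wants a positive errno */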
Heiner Kallweit [Thu, 16 Oct 2025 19:25:28 +0000 (21:25 +0200)]
r8169: reconfigure rx unconditionally before chip reset when resuming
There's a good chance that more chip versions suffer from the same
hw issue. So let's reconfigure rx unconditionally before the chip reset
when resuming. This shouldn't have any side effect on unaffected chip
versions.
Florian Westphal [Thu, 16 Oct 2025 11:51:47 +0000 (13:51 +0200)]
net: Kconfig: discourage drop_monitor enablement
Quoting Eric Dumazet:
"I do not understand the fascination with net/core/drop_monitor.c [..]
misses all the features, flexibility, scalability that 'perf',
eBPF tracing, bpftrace, .... have today."
Reword the DROP_MONITOR kconfig help text to clearly state that it's not
related to perf-based drop monitoring and that it's safe to disable
this unless support for the older netlink-based tools is needed.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251016115147.18503-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After replacing R/W locks with RCU in commit 3ab5aee7fe84 ("net: Convert
TCP & DCCP hash tables to use RCU / hlist_nulls"), a race window emerged
during the switch from reqsk/sk to sk/tw.
Now that both timewait sock (tw) and full sock (sk) reside on the same
ehash chain, it is appropriate to introduce hlist_nulls replace
operations, to eliminate the race conditions caused by this window.
Before this series of patches, I previously sent another version of the
patch, attempting to avoid the issue using a lock mechanism. However, it
seems there are some problems with that approach now, so I've switched to
the "replace" method in the current patches to resolve the issue.
For details, refer to:
https://lore.kernel.org/netdev/20250903024406.2418362-1-xuanqiang.luo@linux.dev/
Before I encountered this type of issue recently, I found there had been
several historical discussions about it. Therefore, I'm adding this
background information for those interested to reference:
1. https://lore.kernel.org/lkml/20230118015941.1313-1-kerneljasonxing@gmail.com/
2. https://lore.kernel.org/netdev/20230606064306.9192-1-duanmuquan@baidu.com/
====================
Xuanqiang Luo [Wed, 15 Oct 2025 02:02:36 +0000 (10:02 +0800)]
inet: Avoid ehash lookup race in inet_twsk_hashdance_schedule()
Since ehash lookups are lockless, if another CPU is converting sk to tw
concurrently, fetching the newly inserted tw with tw->tw_refcnt == 0 causes
a lookup failure.
The call trace map is drawn as follows:
CPU 0                                     CPU 1
-----                                     -----
inet_twsk_hashdance_schedule()
  spin_lock()
  inet_twsk_add_node_rcu(tw, ...)
                                          __inet_lookup_established()
                                          (find tw, failure due to tw_refcnt = 0)
  __sk_nulls_del_node_init_rcu(sk)
  refcount_set(&tw->tw_refcnt, 3)
  spin_unlock()
By replacing sk with tw atomically via hlist_nulls_replace_init_rcu() after
setting tw_refcnt, we ensure that tw is either fully initialized or not
visible to other CPUs, eliminating the race.
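In code terms, the fixed ordering looks roughly like this (a sketch; the helper's argument order and the member names are assumptions):

refcount_set(&tw->tw_refcnt, 3);
/* Publish tw and unhash sk in one atomic step, only after the refcount
 * is valid, so lockless lookups never observe tw_refcnt == 0.
 */
hlist_nulls_replace_init_rcu(&sk->sk_nulls_node, &tw->tw_node);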
It's worth noting that we held lock_sock() before the replacement, so
there's no need to check if sk is hashed. Thanks to Kuniyuki Iwashima!
Fixes: 3ab5aee7fe84 ("net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls")
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251015020236.431822-4-xuanqiang.luo@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xuanqiang Luo [Wed, 15 Oct 2025 02:02:35 +0000 (10:02 +0800)]
inet: Avoid ehash lookup race in inet_ehash_insert()
Since ehash lookups are lockless, if one CPU performs a lookup while
another concurrently deletes and inserts (removing reqsk and inserting sk),
the lookup may fail to find the socket, and an RST may be sent.
The call trace map is drawn as follows:
CPU 0                                     CPU 1
-----                                     -----
inet_ehash_insert()
  spin_lock()
  sk_nulls_del_node_init_rcu(osk)
                                          __inet_lookup_established()
                                          (lookup failed)
  __sk_nulls_add_node_rcu(sk, list)
  spin_unlock()
As both deletion and insertion operate on the same ehash chain, this patch
introduces a new sk_nulls_replace_node_init_rcu() helper function to
implement atomic replacement.
Fixes: 5e0724d027f0 ("tcp/dccp: fix hashdance race for passive sessions")
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251015020236.431822-3-xuanqiang.luo@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xuanqiang Luo [Wed, 15 Oct 2025 02:02:34 +0000 (10:02 +0800)]
rculist: Add hlist_nulls_replace_rcu() and hlist_nulls_replace_init_rcu()
Add two functions to atomically replace RCU-protected hlist_nulls entries.
Keep using WRITE_ONCE() to assign values to ->next and ->pprev, as
mentioned in the commits below:
commit efd04f8a8b45 ("rcu: Use WRITE_ONCE() for assignments to ->next for
rculist_nulls")
commit 860c8802ace1 ("rcu: Use WRITE_ONCE() for assignments to ->pprev for
hlist_nulls")
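For reference, a hedged sketch of what such a replace helper can look like, modeled on hlist_replace_rcu() (the actual rculist_nulls.h code may differ, e.g. in how it poisons the old node):

static inline void
hlist_nulls_replace_init_rcu(struct hlist_nulls_node *old,
                             struct hlist_nulls_node *new)
{
        struct hlist_nulls_node *next = old->next;

        WRITE_ONCE(new->next, next);
        WRITE_ONCE(new->pprev, old->pprev);
        rcu_assign_pointer(*(struct hlist_nulls_node __rcu **)new->pprev, new);
        if (!is_a_nulls(next))
                WRITE_ONCE(next->pprev, &new->next);
        WRITE_ONCE(old->pprev, NULL);   /* mark "old" as unhashed */
}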
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
Link: https://patch.msgid.link/20251015020236.431822-2-xuanqiang.luo@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix many oddities inside the MACB driver. They accumulated in my
work-in-progress branch while working on MACB/GEM EyeQ5 support.
Part of this series has been seen on the lkml in March then June.
See below for a semblance of a changelog.
The initial goal was to post them alongside EyeQ5 support, but that
makes for too big of a series. It'll come afterwards, with new
features (interrupt coalescing, ethtool .set_channels() and XDP mostly).
net: macb: drop `count` local variable in macb_tx_map()
Local variable `count` is useless: it counts the number of DMA descriptors
used and returns it. But the return value is only checked for error.
Drop counting the number of DMA descriptors and return a usual
negative-if-error integer.
Whenever min(a, b) is used with a and b unsigned variables or literals,
`make W=2` complains. Change four min() calls into umin().
stderr extract (GCC 11.2.0, MIPS Codescape):
./include/linux/minmax.h:68:57: warning: comparison is always true due
to limited range of data type [-Wtype-limits]
68 | #define __is_nonneg(ux) statically_true((long long)(ux) >= 0)
| ^~
drivers/net/ethernet/cadence/macb_main.c:2299:26: note: in expansion of
macro ‘min’
2299 | hdrlen = min(skb_headlen(skb), bp->max_tx_length);
| ^~~
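The change is mechanical; for the call flagged above it amounts to:

-       hdrlen = min(skb_headlen(skb), bp->max_tx_length);
+       hdrlen = umin(skb_headlen(skb), bp->max_tx_length);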
The low 16 bits of GEM_DCFG6 tell us which queues are enabled in HW. In
theory, there could be holes in the bitfield. In practice, the macb
driver would fail if there were holes as most loops iterate upon
bp->num_queues. Only macb_init() iterated correctly.
- Drop bp->queue_mask field.
- Error out at probe if a hole is in the queue mask.
- Rely upon bp->num_queues for iteration.
- As we drop the queue_mask probe local variable, fix RCT.
- Compute queue_mask on the fly for TAPRIO using bp->num_queues.
net: macb: introduce DMA descriptor helpers (is 64bit? is PTP?)
Introduce macb_dma64() and macb_dma_ptp() helper functions.
Many codepaths are made simpler by dropping conditional compilation.
This implies two additional changes:
- Always compile related structure definitions inside <macb.h>.
- MACB_EXT_DESC can be dropped as it is useless now.
macb_dma_desc_get_size() does a switch on bp->hw_dma_cap and covers all
four cases: 0, 64B, PTP, 64B+PTP. It also covers the #ifndef
MACB_EXT_DESC separately, making it four codepaths.
Instead, notice the descriptor size grows with enabled features and use
plain if-statements on 64B and PTP flags.
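A sketch of where this ends up, using the structure names from macb.h (not the exact patch):

static unsigned int macb_dma_desc_get_size(struct macb *bp)
{
        unsigned int size = sizeof(struct macb_dma_desc);

        /* Each enabled feature appends its own chunk to the descriptor. */
        if (macb_dma64(bp))
                size += sizeof(struct macb_dma_desc_64);
        if (macb_dma_ptp(bp))
                size += sizeof(struct macb_dma_desc_ptp);

        return size;
}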
Remove NULL checks on macb_config as it is always valid:
- either it is its default value &default_gem_config,
- or it got overridden using match data.
====================
net: optimize TX throughput and efficiency
In this series, I replace the busylock spinlock we have in
__dev_queue_xmit() and use lockless list (llist) to reduce
spinlock contention to the minimum.
Idea is that only one cpu might spin on the qdisc spinlock,
while others simply add their skb in the llist.
After this series, we get a 300 % (4x) improvement on heavy TX workloads,
sending twice the number of packets per second, for half the cpu cycles.
====================
Eric Dumazet [Tue, 14 Oct 2025 17:19:07 +0000 (17:19 +0000)]
net: dev_queue_xmit() llist adoption
Remove busylock spinlock and use a lockless list (llist)
to reduce spinlock contention to the minimum.
Idea is that only one cpu might spin on the qdisc spinlock,
while others simply add their skb in the llist.
After this patch, we get a 300 % improvement on heavy TX workloads.
- Sending twice the number of packets per second.
- While consuming 50 % less cycles.
Note that this also allows, in the future, submitting batches
to various qdisc->enqueue() methods.
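The hand-off can be pictured roughly like this ("defer_list" is a hypothetical per-qdisc llist_head used only for illustration, not the field or code added by the patch):

/* Every sender publishes its skb; only the one that finds the list
 * empty goes on to take the qdisc lock and flush the whole batch.
 */
if (!llist_add(&skb->ll_node, &q->defer_list))
        return NET_XMIT_SUCCESS;

spin_lock(qdisc_lock(q));
llist_for_each_entry_safe(skb, next, llist_del_all(&q->defer_list), ll_node)
        rc = q->enqueue(skb, q, &to_free);
spin_unlock(qdisc_lock(q));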
Before: 16 Mpps (41 Mpps if each thread is pinned to a different cpu)
vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
243 0 0 2368988672 51036 1100852 0 0 146 1 242 60 0 9 91 0 0
244 0 0 2368988672 51036 1100852 0 0 536 10 487745 14718 0 52 48 0 0
244 0 0 2368988672 51036 1100852 0 0 512 0 503067 46033 0 52 48 0 0
244 0 0 2368988672 51036 1100852 0 0 512 0 494807 12107 0 52 48 0 0
244 0 0 2368988672 51036 1100852 0 0 702 26 492845 10110 0 52 48 0 0
Lock contention (1 second sample taken on 8 cores)
perf lock record -C0-7 sleep 1; perf lock contention
contended total wait max wait avg wait type caller
442111 6.79 s 162.47 ms 15.35 us spinlock dev_hard_start_xmit+0xcd
5961 9.57 ms 8.12 us 1.60 us spinlock __dev_queue_xmit+0x3a0
244 560.63 us 7.63 us 2.30 us spinlock do_softirq+0x5b
13 25.09 us 3.21 us 1.93 us spinlock net_tx_action+0xf8
If netperf threads are pinned, spinlock stress is very high.
perf lock record -C0-7 sleep 1; perf lock contention
contended total wait max wait avg wait type caller
964508 7.10 s 147.25 ms 7.36 us spinlock dev_hard_start_xmit+0xcd
201 268.05 us 4.65 us 1.33 us spinlock __dev_queue_xmit+0x3a0
12 26.05 us 3.84 us 2.17 us spinlock do_softirq+0x5b
After: 29 Mpps (57 Mpps if each thread is pinned to a different cpu)
vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
78 0 0 2369573632 32896 1350988 0 0 22 0 331 254 0 8 92 0 0
75 0 0 2369573632 32896 1350988 0 0 22 50 425713 280199 0 23 76 0 0
104 0 0 2369573632 32896 1350988 0 0 290 0 430238 298247 0 23 76 0 0
86 0 0 2369573632 32896 1350988 0 0 132 0 428019 291865 0 24 76 0 0
90 0 0 2369573632 32896 1350988 0 0 502 0 422498 278672 0 23 76 0 0
perf lock record -C0-7 sleep 1; perf lock contention
contended total wait max wait avg wait type caller
2524 116.15 ms 486.61 us 46.02 us spinlock __dev_queue_xmit+0x55b
5821 107.18 ms 371.67 us 18.41 us spinlock dev_hard_start_xmit+0xcd
2377 9.73 ms 35.86 us 4.09 us spinlock ___slab_alloc+0x4e0
923 5.74 ms 20.91 us 6.22 us spinlock ___slab_alloc+0x5c9
121 3.42 ms 193.05 us 28.24 us spinlock net_tx_action+0xf8
6 564.33 us 167.60 us 94.05 us spinlock do_softirq+0x5b
If netperf threads are pinned (~54 Mpps)
perf lock record -C0-7 sleep 1; perf lock contention
32907 316.98 ms 195.98 us 9.63 us spinlock dev_hard_start_xmit+0xcd
4507 61.83 ms 212.73 us 13.72 us spinlock __dev_queue_xmit+0x554
2781 23.53 ms 40.03 us 8.46 us spinlock ___slab_alloc+0x5c9
3554 18.94 ms 34.69 us 5.33 us spinlock ___slab_alloc+0x4e0
233 9.09 ms 215.70 us 38.99 us spinlock do_softirq+0x5b
153 930.66 us 48.67 us 6.08 us spinlock net_tx_action+0xfd
84 331.10 us 14.22 us 3.94 us spinlock ___slab_alloc+0x5c9
140 323.71 us 9.94 us 2.31 us spinlock ___slab_alloc+0x4e0
Eric Dumazet [Tue, 14 Oct 2025 17:19:03 +0000 (17:19 +0000)]
net: add indirect call wrapper in skb_release_head_state()
While stress testing UDP senders on a host with expensive indirect
calls, I found cpus processing TX completions were showing
a very high cost (20%) in sock_wfree() due to
CONFIG_MITIGATION_RETPOLINE=y.
Take care of TCP and UDP TX destructors and use INDIRECT_CALL_3() macro.
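For reference, the wrapper use has this shape (the exact destructor list in the patch is an assumption here):

/* Compare against the likely destructors first, so the common TCP/UDP
 * TX completion paths avoid a retpoline-guarded indirect call.
 */
INDIRECT_CALL_3(skb->destructor, tcp_wfree, __sock_wfree, sock_wfree, skb);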
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251014171907.3554413-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Shangjuan Wei [Wed, 15 Oct 2025 11:41:01 +0000 (19:41 +0800)]
net: stmmac: add Eswin EIC7700 glue driver
Add Ethernet controller support for Eswin's eic7700 SoC. The driver
implements hardware initialization, clock configuration, and delay
adjustment functions based on the DWC Ethernet controller, and supports
device tree configuration and platform driver integration.
Jakub Kicinski [Thu, 16 Oct 2025 22:58:24 +0000 (15:58 -0700)]
Merge branch 'net-stmmac-more-cleanups'
Russell King says:
====================
net: stmmac: more cleanups
The subject for the cover message is wearing thin as I've used it a
number of times, but the scope for cleaning up the driver continues,
and continue it will do, because this is just a small fraction of the
queue.
1. make a better job of one of my previous commits, moving the holding
of the lock into stmmac_mdio.c
2. move the mac_finish() method to be in-order with the layout of
struct phylink_mac_ops - this order was chosen because it reflects
the order that the methods are called, thus making the flow more
obvious when reading code.
3. continuing on the "removal of stuff that doesn't need to happen",
patch 3 removes the phylink_speed_(up|down) out of the path that
is used for MTU changes - we really don't need to fiddle with the
PHY advertisement when changing the MTU!
4. clean up tc_init()'s initialisation of flow_entries_max - this is
the sole place that this is written, and we might as well make the
code more easy to follow.
5. stmmac_phy_setup() really confuses me when I read the code: it's
not really about PHY setup, but about phylink setup. So, make its
name reflect its functionality.
====================
net: stmmac: rename stmmac_phy_setup() to include phylink
stmmac_phy_setup() does not set up any PHY, but does set up phylink.
Rename this function to stmmac_phylink_setup() to better reflect what
it is doing.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Gatien Chevallier <gatien.chevallier@foss.st.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1v945d-0000000Ameh-3Bs7@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To make future changes easier, rearrange the use of dma_cap->l3l4fnum
vs priv->flow_entries_max.
Always initialise priv->flow_entries_max from dma_cap->l3l4fnum, then
use priv->flow_entries_max to determine whether we allocate
priv->flow_entries and set it up.
This change is safe because tc_init() is only called once from
stmmac_dvr_probe().
net: stmmac: avoid PHY speed change when configuring MTU
There is no need to do the speed-down, speed-up dance when changing
the MTU as there is little power saving that can be gained from such
a brief interval between these, and the autonegotiation they cause
takes much longer.
Move the calls to phylink_speed_up() and phylink_speed_down() into
stmmac_open() and stmmac_release() respectively, reducing the work
done in the __-variants of these functions.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Gatien Chevallier <gatien.chevallier@foss.st.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1v945T-0000000AmeV-2BvU@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: stmmac: place .mac_finish() method more appropriately
Place the .mac_finish() initialiser and implementation after the
.mac_config() initialiser and method which reflects the order that
they appear in struct phylink_mac_ops, and the order in which they
are called. This keeps logically similar code together.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1v945O-0000000AmeP-1k0t@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: stmmac: dwc-qos-eth: move MDIO bus locking into stmmac_mdio
Rather than dwc-qos-eth manipulating the MDIO bus lock directly, add
helpers to the stmmac MDIO layer and use them in dwc-qos-eth. This
improves my commit 87f43e6f06a2 ("net: stmmac: dwc-qos: calibrate tegra
with mdio bus idle").
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1v945J-0000000AmeJ-1GOb@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When pci_alloc_irq_vectors() fails to allocate nvec vectors,
-ENOSPC is returned, so it would be safe to remove the check that
compares err with nvec.
Alok Tiwari [Wed, 15 Oct 2025 02:57:43 +0000 (19:57 -0700)]
net: amd-xgbe: use EOPNOTSUPP instead of ENOTSUPP in xgbe_phy_mii_read_c45
The MDIO read callback xgbe_phy_mii_read_c45() can propagate its return
value up through phylink_mii_ioctl() to user space via netdev ioctls such
as SIOCGMIIREG. Returning ENOTSUPP results in user space seeing
"Unknown error", since ENOTSUPP is not a standard errno value.
Replace ENOTSUPP with EOPNOTSUPP to align with the MDIO core’s
usage and ensure user space receives a proper "Operation not supported"
error instead of an unknown code.
====================
bpf: Allow opt-out from sk->sk_prot->memory_allocated.
This series allows opting out of the global per-protocol memory
accounting if a socket is configured as such by sysctl or a BPF prog.
This series is the successor of the series below [0], but the changes
now fall in net and bpf subsystems only.
I discussed this with Roman Gushchin offlist, and he suggested not mixing
two independent subsystems, saying it would be cleaner not to depend on
memcg.
So, sk->sk_memcg and memcg code are no longer touched, and instead we
use another hole near sk->sk_prot to store a flag for the pure net
opt-out feature.
Overview of the series:
patch 1 is misc cleanup
patch 2 allows opt-out from sk->sk_prot->memory_allocated
patch 3 introduces net.core.bypass_prot_mem
patch 4 & 5 supports flagging sk->sk_bypass_prot_mem via bpf_setsockopt()
patch 6 is selftest
Thank you very much for all your help, Shakeel, Roman, Martin, and Eric!
selftests/bpf: Add test for sk->sk_bypass_prot_mem.
The test does the following for IPv4/IPv6 x TCP/UDP sockets
with/without sk->sk_bypass_prot_mem, which can be turned on by
net.core.bypass_prot_mem or bpf_setsockopt(SK_BPF_BYPASS_PROT_MEM).
1. Create socket pairs
2. Send NR_PAGES (32) of data (TCP consumes around 35 pages,
and UDP consumes 66 pages due to skb overhead)
3. Read memory_allocated from sk->sk_prot->memory_allocated and
sk->sk_prot->memory_per_cpu_fw_alloc
4. Check if unread data is charged to memory_allocated
If sk->sk_bypass_prot_mem is set, memory_allocated should not be
changed, but we allow a small error (up to 10 pages) in case
other processes on the host use some amounts of TCP/UDP memory.
The amount of allocated pages is buffered in the per-cpu variable
{tcp,udp}_memory_per_cpu_fw_alloc up to +/- net.core.mem_pcpu_rsv
before being reported to {tcp,udp}_memory_allocated.
At 3., memory_allocated is calculated from the two variables at
the fentry of the socket create function.
We drain the receive queue only for UDP before close() because the UDP
recv queue is destroyed after an RCU grace period. When I printed
memory_allocated, UDP bypass cases sometimes saw the no-bypass
case's leftover, but it's still in the small error range (<10 pages).
As with net.core.bypass_prot_mem, this is inherited by child sockets,
and BPF always takes precedence over sysctl at socket(2) and accept(2).
SK_BPF_BYPASS_PROT_MEM is only supported at BPF_CGROUP_INET_SOCK_CREATE
and not supported on other hooks for the following reasons:
1. UDP charges memory under sk->sk_receive_queue.lock instead
of lock_sock()
2. Modifying the flag after skb is charged to sk requires such
adjustment during bpf_setsockopt() and complicates the logic
unnecessarily
We can support other hooks later if a real use case justifies that.
Most changes are inline and hard to trace, but a microbenchmark on
__sk_mem_raise_allocated() during neper/tcp_stream showed that more
samples completed faster with sk->sk_bypass_prot_mem == 1. This will
be more visible under tcp_mem pressure (but it's not a fair comparison).
With bpf prog in the next patch:
(must be attached before tcp_stream)
# bpftool prog load sk_bypass_prot_mem.bpf.o /sys/fs/bpf/test type cgroup/sock_create
# bpftool cgroup attach /sys/fs/cgroup/test cgroup_inet_sock_create pinned /sys/fs/bpf/test
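A minimal sketch of such a program fragment, assuming SK_BPF_BYPASS_PROT_MEM is accepted at the SOL_SOCKET level from this hook (the actual selftest may differ):

SEC("cgroup/sock_create")
int set_bypass_prot_mem(struct bpf_sock *ctx)
{
        int one = 1;

        /* Flag the new socket before any buffer is charged to it. */
        bpf_setsockopt(ctx, SOL_SOCKET, SK_BPF_BYPASS_PROT_MEM,
                       &one, sizeof(one));
        return 1;       /* allow the socket to be created */
}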
net: Allow opt-out from global protocol memory accounting.
Some protocols (e.g., TCP, UDP) implement memory accounting for socket
buffers and charge memory to per-protocol global counters pointed to by
sk->sk_prot->memory_allocated.
Sometimes, system processes do not want that limitation. For a similar
purpose, there is SO_RESERVE_MEM for sockets under memcg.
Also, by opting out of the per-protocol accounting, sockets under memcg
can avoid paying costs for two orthogonal memory accounting mechanisms.
A microbenchmark result is in the subsequent bpf patch.
Let's allow opt-out from the per-protocol memory accounting if
sk->sk_bypass_prot_mem is true.
sk->sk_bypass_prot_mem and sk->sk_prot are placed in the same cache
line, and sk_has_account() always fetches sk->sk_prot before accessing
sk->sk_bypass_prot_mem, so there is no extra cache miss for this patch.
The following patches will set sk->sk_bypass_prot_mem to true, and
then, the per-protocol memory accounting will be skipped.
Note that this does NOT disable memcg, but rather the per-protocol one.
Another option, instead of using the hole in struct sock_common, is to create
sk_prot variants like tcp_prot_bypass, but this would complicate the
SOCKMAP logic, tcp_bpf_prots, etc.
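In the accounting path, the opt-out boils down to a check of this shape (sketch only, not the actual diff):

/* Skip the global per-protocol counters when the socket opted out;
 * memcg charging, where enabled, still happens elsewhere.
 */
if (sk->sk_bypass_prot_mem)
        return 1;

sk_memory_allocated_add(sk, amt);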
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://patch.msgid.link/20251014235604.3057003-3-kuniyu@google.com
tcp: Save lock_sock() for memcg in inet_csk_accept().
If memcg is enabled, accept() acquires lock_sock() twice for each new
TCP/MPTCP socket in inet_csk_accept() and __inet_accept().
Let's move memcg operations from inet_csk_accept() to __inet_accept().
Note that SCTP somehow allocates a new socket by sk_alloc() in
sk->sk_prot->accept() and clones fields manually, instead of using
sk_clone_lock().
mem_cgroup_sk_alloc() is called for SCTP before __inet_accept(),
so I added the protocol check in __inet_accept(), but this can be
removed once SCTP uses sk_clone_lock().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://patch.msgid.link/20251014235604.3057003-2-kuniyu@google.com
Linus Torvalds [Thu, 16 Oct 2025 16:41:21 +0000 (09:41 -0700)]
Merge tag 'net-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from CAN
Current release - regressions:
- udp: do not use skb_release_head_state() before
skb_attempt_defer_free()
- gro_cells: use nested-BH locking for gro_cell
- dpll: zl3073x: increase maximum size of flash utility
Previous releases - regressions:
- core: fix lockdep splat on device unregister
- tcp: fix tcp_tso_should_defer() vs large RTT
- tls:
- don't rely on tx_work during send()
- wait for pending async decryptions if tls_strp_msg_hold fails
- can: j1939: add missing calls in NETDEV_UNREGISTER notification
handler
- eth: lan78xx: fix lost EEPROM write timeout in
lan78xx_write_raw_eeprom
Previous releases - always broken:
- ip6_tunnel: prevent perpetual tunnel growth
- dpll: zl3073x: handle missing or corrupted flash configuration
- can: m_can: fix pm_runtime and CAN state handling
- eth:
- ixgbe: fix too early devlink_free() in ixgbe_remove()
- ixgbevf: fix mailbox API compatibility
- gve: Check valid ts bit on RX descriptor before hw timestamping
- idpf: cleanup remaining SKBs in PTP flows
- r8169: fix packet truncation after S4 resume on RTL8168H/RTL8111H"
* tag 'net-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits)
udp: do not use skb_release_head_state() before skb_attempt_defer_free()
net: usb: lan78xx: fix use of improperly initialized dev->chipid in lan78xx_reset
netdevsim: set the carrier when the device goes up
selftests: tls: add test for short splice due to full skmsg
selftests: net: tls: add tests for cmsg vs MSG_MORE
tls: don't rely on tx_work during send()
tls: wait for pending async decryptions if tls_strp_msg_hold fails
tls: always set record_type in tls_process_cmsg
tls: wait for async encrypt in case of error during latter iterations of sendmsg
tls: trim encrypted message to match the plaintext on short splice
tg3: prevent use of uninitialized remote_adv and local_adv variables
MAINTAINERS: new entry for IPv6 IOAM
gve: Check valid ts bit on RX descriptor before hw timestamping
net: core: fix lockdep splat on device unregister
MAINTAINERS: add myself as maintainer for b53
selftests: net: check jq command is supported
net: airoha: Take into account out-of-order tx completions in airoha_dev_xmit()
tcp: fix tcp_tso_should_defer() vs large RTT
r8152: add error handling in rtl8152_driver_init
usbnet: Fix using smp_processor_id() in preemptible code warnings
...
Linus Torvalds [Thu, 16 Oct 2025 16:39:29 +0000 (09:39 -0700)]
Merge tag 'ata-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
Pull ata fix from Niklas Cassel:
- Do not print an error message (and assume that the General Purpose
Log Directory log page is not supported) for a device that reports a
bogus General Purpose Logging Version.
Unsurprisingly, many vendors fail to report the only valid General
Purpose Logging Version (Damien)
* tag 'ata-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: libata-core: relax checks in ata_read_log_directory()
Eric Dumazet [Wed, 15 Oct 2025 05:27:15 +0000 (05:27 +0000)]
udp: do not use skb_release_head_state() before skb_attempt_defer_free()
Michal reported and bisected an issue after recent adoption
of skb_attempt_defer_free() in UDP.
The issue here is that skb_release_head_state() is called twice per skb,
one time from skb_consume_udp(), then a second time from skb_defer_free_flush()
and napi_consume_skb().
As Sabrina suggested, remove skb_release_head_state() call from
skb_consume_udp().
Add a DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb)) in skb_attempt_defer_free().
Many thanks to Michal, Sabrina, Paolo and Florian for their help.
Fixes: 6471658dc66c ("udp: use skb_attempt_defer_free()")
Reported-and-bisected-by: Michal Kubecek <mkubecek@suse.cz>
Closes: https://lore.kernel.org/netdev/gpjh4lrotyephiqpuldtxxizrsg6job7cvhiqrw72saz2ubs3h@g6fgbvexgl3r/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Michal Kubecek <mkubecek@suse.cz>
Cc: Sabrina Dubroca <sd@queasysnail.net>
Cc: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20251015052715.4140493-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jiawen Wu [Tue, 14 Oct 2025 06:17:25 +0000 (14:17 +0800)]
net: txgbe: optimize the flow to setup PHY for AML devices
To adapt to new firmware for AML devices, the driver should send the
"SET_LINK_CMD" to the firmware only once when switching PHY interface
mode, and no longer needs to re-trigger PHY configuration based on the
RX signal interrupt (TXGBE_GPIOBIT_3).
In previous firmware versions, the PHY was configured only after receiving
"SET_LINK_CMD", and might remain incomplete if the RX signal was lost.
To handle this case, the driver used TXGBE_GPIOBIT_3 interrupt to resend
the command. This workaround is no longer necessary with the new firmware.
Also, an unknown link speed is now permitted in the mailbox buffer.
Recent firmware updates introduce additional fields in the mailbox message
to provide more information for identifying 40G and 100G QSFP modules.
To accommodate these new fields, expand the mailbox buffer size by 4 bytes.
Without this change, drivers built against the updated firmware cannot
properly identify modules due to mismatched mailbox message lengths.
The old firmware version that used the smaller mailbox buffer has never
been publicly released, so there are no backward-compatibility concerns.
net: fbnic: Fix page chunking logic when PAGE_SIZE > 4K
The HW always works on a 4K page size. When the OS supports larger
pages, we fragment them across multiple BDQ descriptors.
We were not properly incrementing the descriptor, which resulted in us
specifying the last chunk's id/addr and then 15 zero descriptors. This
would cause packet loss and driver crashes. This is not a fix since the
Kconfig prevents use outside of x86.
I Viswanath [Mon, 13 Oct 2025 18:16:48 +0000 (23:46 +0530)]
net: usb: lan78xx: fix use of improperly initialized dev->chipid in lan78xx_reset
dev->chipid is used in lan78xx_init_mac_address before it's initialized:
lan78xx_reset() {
  lan78xx_init_mac_address()
    lan78xx_read_eeprom()
      lan78xx_read_raw_eeprom()  <- dev->chipid is used here
  dev->chipid = ...              <- dev->chipid is initialized correctly here
}
Reorder initialization so that dev->chipid is set before calling
lan78xx_init_mac_address().
Fixes: a0db7d10b76e ("lan78xx: Add to handle mux control per chip id")
Signed-off-by: I Viswanath <viswanathiyyappan@gmail.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Khalid Aziz <khalid@kernel.org>
Link: https://patch.msgid.link/20251013181648.35153-1-viswanathiyyappan@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 16 Oct 2025 00:56:20 +0000 (17:56 -0700)]
Merge tag 'linux-can-fixes-for-6.18-20251014' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2025-10-14
The first 2 patches are by Celeste Liu and target the gs_usb driver.
The first patch removes the limitation of 3 CAN interfaces per USB
device. The second patch adds the missing population of
net_device->dev_port.
The next 4 patches are by me and fix the m_can driver. They add a
missing pm_runtime_disable(), fix the CAN state transition back to
Error Active and fix the state after ifup and suspend/resume.
Another patch by me targets the m_can driver, too, and replaces Dong
Aisheng's old email address.
The next 2 patches are by Vincent Mailhol and update the CAN
networking Documentation.
Tetsuo Handa contributes the last patch that adds missing cleanup calls
in the NETDEV_UNREGISTER notification handler.
* tag 'linux-can-fixes-for-6.18-20251014' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: j1939: add missing calls in NETDEV_UNREGISTER notification handler
can: add Transmitter Delay Compensation (TDC) documentation
can: remove false statement about 1:1 mapping between DLC and length
can: m_can: replace Dong Aisheng's old email address
can: m_can: fix CAN state in system PM
can: m_can: m_can_chip_config(): bring up interface in correct state
can: m_can: m_can_handle_state_errors(): fix CAN state transition to Error Active
can: m_can: m_can_plat_remove(): add missing pm_runtime_disable()
can: gs_usb: gs_make_candev(): populate net_device->dev_port
can: gs_usb: increase max interface to U8_MAX
====================
Lorenzo Bianconi [Mon, 13 Oct 2025 13:58:50 +0000 (15:58 +0200)]
net: airoha: npu: Add airoha_npu_soc_data struct
Introduce airoha_npu_soc_data structure in order to generalize per-SoC
NPU firmware info. Introduce airoha_npu_load_firmware utility routine.
This is a preliminary patch in order to introduce AN7583 NPU support.
====================
Preserve PSE PD692x0 configuration across reboots
Previously, the driver would always reconfigure the PSE hardware on
probe, causing a port matrix reflash that resulted in temporary power
loss to all connected devices. This change maintains power continuity
by preserving existing configuration when the PSE has been previously
initialized.
====================
net: pse-pd: pd692x0: Preserve PSE configuration across reboots
Detect when PSE hardware is already configured (user byte == 42) and
skip hardware initialization to prevent power interruption to connected
devices during system reboots.
Previously, the driver would always reconfigure the PSE hardware on
probe, causing a port matrix reflash that resulted in temporary power
loss to all connected devices. This change maintains power continuity
by preserving existing configuration when the PSE has been previously
initialized.
net: pse-pd: pd692x0: Separate configuration parsing from hardware setup
Cache the port matrix configuration in driver private data to enable
PSE controller reconfiguration. This refactoring separates device tree
parsing from hardware configuration application, allowing settings to be
reapplied without reparsing the device tree.
This refactoring is a prerequisite for preserving PSE configuration
across reboots to prevent power disruption to connected devices.
net: pse-pd: pd692x0: Replace __free macro with explicit kfree calls
Replace __free(kfree) with explicit kfree() calls to follow the net
subsystem policy of avoiding automatic cleanup macros as described in
the documentation.
Florian Fainelli [Mon, 13 Oct 2025 17:23:06 +0000 (10:23 -0700)]
net: bcmasp: Add support for PHY-based Wake-on-LAN
If available, interrogate the PHY to find out whether we can use it for
Wake-on-LAN. This can be a more power efficient way of implementing that
feature, especially when the MAC is powered off in low power states.
Breno Leitao [Tue, 14 Oct 2025 09:17:25 +0000 (02:17 -0700)]
netdevsim: set the carrier when the device goes up
Bringing a linked netdevsim device down and then up causes communication
failure because both interfaces lack carrier. Basically, an ifdown/ifup on
the interface breaks the link.
Commit 3762ec05a9fbda ("netdevsim: add NAPI support") added support
for NAPI, calling netif_carrier_off() in nsim_stop(). This patch
re-enables the carrier symmetrically on nsim_open(), in case the device
is linked and the peer is up.
Jakub Kicinski [Thu, 16 Oct 2025 00:41:47 +0000 (17:41 -0700)]
Merge branch 'tls-misc-bugfixes'
Sabrina Dubroca says:
====================
tls: misc bugfixes
Jann Horn reported multiple bugs in kTLS. This series addresses them,
and adds some corresponding selftests for those that are reproducible
(and without failure injection).
====================
Sabrina Dubroca [Tue, 14 Oct 2025 09:17:01 +0000 (11:17 +0200)]
selftests: net: tls: add tests for cmsg vs MSG_MORE
We don't have a test to check that MSG_MORE won't let us merge records
of different types across sendmsg calls.
Add new tests that check:
- MSG_MORE is only allowed for DATA records
- a pending DATA record gets closed and pushed before a non-DATA
record is processed
Sabrina Dubroca [Tue, 14 Oct 2025 09:17:00 +0000 (11:17 +0200)]
tls: don't rely on tx_work during send()
With async crypto, we rely on tx_work to actually transmit records
once encryption completes. But while send() is running, both the
tx_lock and socket lock are held, so tx_work_handler cannot process
the queue of encrypted records, and simply reschedules itself. During
a large send(), this could last a long time, and use a lot of memory.
Transmit any pending encrypted records before restarting the main
loop of tls_sw_sendmsg_locked.
Sabrina Dubroca [Tue, 14 Oct 2025 09:16:59 +0000 (11:16 +0200)]
tls: wait for pending async decryptions if tls_strp_msg_hold fails
Async decryption calls tls_strp_msg_hold to create a clone of the
input skb to hold references to the memory it uses. If we fail to
allocate that clone, proceeding with async decryption can lead to
various issues (UAF on the skb, writing into userspace memory after
the recv() call has returned).
In this case, wait for all pending decryption requests.
Sabrina Dubroca [Tue, 14 Oct 2025 09:16:58 +0000 (11:16 +0200)]
tls: always set record_type in tls_process_cmsg
When userspace wants to send a non-DATA record (via the
TLS_SET_RECORD_TYPE cmsg), we need to send any pending data from a
previous MSG_MORE send() as a separate DATA record. If that DATA record
is encrypted asynchronously, tls_handle_open_record will return
-EINPROGRESS. This is currently treated as an error by
tls_process_cmsg, and it will skip setting record_type to the correct
value, but the caller (tls_sw_sendmsg_locked) handles that return
value correctly and proceeds with sending the new message with an
incorrect record_type (DATA instead of whatever was requested in the
cmsg).
Always set record_type before handling the open record. If
tls_handle_open_record returns an error, record_type will be
ignored. If it succeeds, whether with synchronous crypto (returning 0)
or asynchronous (returning -EINPROGRESS), the caller will proceed
correctly.
Sabrina Dubroca [Tue, 14 Oct 2025 09:16:57 +0000 (11:16 +0200)]
tls: wait for async encrypt in case of error during latter iterations of sendmsg
If we hit an error during the main loop of tls_sw_sendmsg_locked (eg
failed allocation), we jump to send_end and immediately
return. Previous iterations may have queued async encryption requests
that are still pending. We should wait for those before returning, as
we could otherwise be reading from memory that userspace believes
we're not using anymore, which would be a sort of use-after-free.
This is similar to what tls_sw_recvmsg already does: failures during
the main loop jump to the "wait for async" code, not straight to the
unlock/return.
Sabrina Dubroca [Tue, 14 Oct 2025 09:16:56 +0000 (11:16 +0200)]
tls: trim encrypted message to match the plaintext on short splice
During tls_sw_sendmsg_locked, we pre-allocate the encrypted message
for the size we're expecting to send during the current iteration, but
we may end up sending less, for example when splicing: if we're
getting the data from small fragments of memory, we may fill up all
the slots in the skmsg with less data than expected.
In this case, we need to trim the encrypted message to only the length
we actually need, to avoid pushing uninitialized bytes down the
underlying TCP socket.
Justin Iurman [Tue, 14 Oct 2025 17:06:50 +0000 (19:06 +0200)]
MAINTAINERS: new entry for IPv6 IOAM
Create a maintainer entry for IPv6 IOAM. Add myself as I authored most
if not all of the IPv6 IOAM code in the kernel and actively participate
in the related IETF groups.
Heiner Kallweit [Tue, 14 Oct 2025 06:02:47 +0000 (08:02 +0200)]
net: bcmgenet: remove unused platform code
This effectively reverts b0ba512e25d7 ("net: bcmgenet: enable driver to
work without a device tree"). There has never been an in-tree user of
struct bcmgenet_platform_data; all devices use OF or ACPI.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/108b4e64-55d4-4b4e-9a11-3c810c319d66@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Abhishek Rawal [Tue, 14 Oct 2025 05:52:33 +0000 (11:22 +0530)]
r8152: Advertise software timestamp information.
Driver calls skb_tx_timestamp(skb) in rtl8152_start_xmit(), but does not advertise the capability in ethtool.
Advertise software timestamp capabilities on struct ethtool_ops.