git.ipfire.org Git - thirdparty/kernel/linux.git/log

netfilter: flowtable: dedicated slab for flow entry

The size of `struct flow_offload` has grown beyond 256 bytes on 64-bit
kernels (currently 280 bytes) because of the `flow_offload_tunnel`
member added recently. So kmalloc() allocates from the kmalloc-512 slab,
causing significant memory waste per entry.

Introduce a dedicated slab cache for flow entries to reduce memory
footprint. Results in a reduction from 512 bytes to 320 bytes per entry
on x86_64 kernels.
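
The arithmetic behind these numbers can be sketched in userspace C. This is only an illustration under two assumptions that the commit message does not state: that generic kmalloc size classes above 256 bytes are powers of two, and that the dedicated cache rounds each object up to a 64-byte cache line. The helper names `next_pow2()` and `align_up()` are hypothetical, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next power of two. This approximates how a generic
 * kmalloc() size class is chosen for objects larger than 256 bytes
 * (assumption, see the lead-in). */
static size_t next_pow2(size_t n)
{
	size_t p = 1;

	assert(n > 0);
	while (p < n)
		p <<= 1;
	return p;
}

/* Round n up to a multiple of align; align must be a power of two.
 * A 64-byte alignment is assumed here to model cache-line alignment. */
static size_t align_up(size_t n, size_t align)
{
	assert(align && (align & (align - 1)) == 0);
	return (n + align - 1) & ~(align - 1);
}
```

With a 280-byte object, `next_pow2(280)` gives 512 and `align_up(280, 64)` gives 320, which matches the per-entry figures quoted above, a difference of 192 bytes per flow entry.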

Signed-off-by: Qingfang Deng <dqfext@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>

selftests: netfilter: nft_queue.sh: add udp fraglist gro test case

Without the preceding patch, this fails with:

FAIL: test_udp_gro_ct: Expected udp conntrack entry
FAIL: test_udp_gro_ct: Expected software segmentation to occur, had 10 and 0

Signed-off-by: Florian Westphal <fw@strlen.de>

netfilter: nfnetlink_queue: do shared-unconfirmed check before segmentation

Ulrich reports a regression with nfqueue:

If an application did not set the 'F_GSO' capability flag and a GSO
packet with an unconfirmed nf_conn entry is received, all packets are
now dropped instead of queued, because the check happens after
skb_gso_segment().  In that case, we did have exclusive ownership
of the skb and its associated conntrack entry.  The elevated use
count is due to skb_clone happening via skb_gso_segment().

Move the check so that it is performed against the aggregated packet.

Then, annotate the individual segments except the first one so we
can do a 2nd check at reinject time.

For the normal case, where userspace does in-order reinjects, this avoids
packet drops: first reinjected segment continues traversal and confirms
entry, remaining segments observe the confirmed entry.
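
The effect of the reordering can be illustrated with a toy model. It is a simplified sketch, not the kernel code: `struct ct_model` and all function names are hypothetical, and the refcount handling assumes that segmentation adds one reference per extra segment.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a conntrack entry. */
struct ct_model {
	bool confirmed;
	int refcnt;
};

/* The "shared unconfirmed" condition: unconfirmed and refcnt > 1. */
static bool shared_unconfirmed(const struct ct_model *ct)
{
	return !ct->confirmed && ct->refcnt > 1;
}

/* Old order: segment first, then check every segment.
 * Returns how many segments get queued. */
static int queued_check_after_segmentation(int nsegs)
{
	struct ct_model ct = { .confirmed = false, .refcnt = 1 };
	int queued = 0;

	assert(nsegs >= 1);
	ct.refcnt += nsegs - 1;	/* assumed: one extra ref per clone */
	for (int i = 0; i < nsegs; i++)
		if (!shared_unconfirmed(&ct))
			queued++;
	return queued;
}

/* New order: check the aggregated packet, then segment. Segments other
 * than the first are rechecked at reinject time; in-order reinjection
 * is assumed. Returns how many segments continue traversal. */
static int accepted_check_before_segmentation(int nsegs)
{
	struct ct_model ct = { .confirmed = false, .refcnt = 1 };
	int accepted = 0;

	assert(nsegs >= 1);
	if (shared_unconfirmed(&ct))
		return 0;
	ct.refcnt += nsegs - 1;
	for (int i = 0; i < nsegs; i++) {
		if (i > 0 && shared_unconfirmed(&ct))
			continue;	/* would be dropped */
		ct.confirmed = true;	/* first segment confirms the entry */
		accepted++;
	}
	return accepted;
}
```

In this model a 10-segment packet yields 0 queued segments with the old ordering and 10 accepted segments with the new one.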

While at it, simplify nf_ct_drop_unconfirmed(): We only care about
unconfirmed entries with a refcnt > 1, there is no need to special-case
dying entries.

This only happens with UDP.  With TCP, the only unconfirmed packet will
be the TCP SYN, and SYN packets aren't aggregated by GRO.

Next patch adds a udpgro test case to cover this scenario.

Reported-by: Ulrich Weber <ulrich.weber@gmail.com>
Fixes: 7d8dc1c7be8d ("netfilter: nf_queue: drop packets with cloned unconfirmed conntracks")
Signed-off-by: Florian Westphal <fw@strlen.de>

netfilter: nft_set_rbtree: don't gc elements on insert

During insertion we can queue up expired elements for garbage
collection.

In case of later abort, the commit hook will never be called.
Packet path and 'get' requests will find free'd elements in the
binary search blob:

nft_set_ext_key include/net/netfilter/nf_tables.h:800 [inline]
nft_array_get_cmp+0x1f6/0x2a0 net/netfilter/nft_set_rbtree.c:133
__inline_bsearch include/linux/bsearch.h:15 [inline]
bsearch+0x50/0xc0 lib/bsearch.c:33
nft_rbtree_get+0x16b/0x400 net/netfilter/nft_set_rbtree.c:169
nft_setelem_get net/netfilter/nf_tables_api.c:6495 [inline]
nft_get_set_elem+0x420/0xaa0 net/netfilter/nf_tables_api.c:6543
nf_tables_getsetelem+0x448/0x5e0 net/netfilter/nf_tables_api.c:6632
nfnetlink_rcv_msg+0x8ae/0x12c0 net/netfilter/nfnetlink.c:290

Also, when we insert an element that triggers -EEXIST, and that insertion
happens to also zap a timed-out entry, we end up with same issue:
Neither commit nor abort hook is called.

Fix this by removing gc api usage during insertion.

The blamed commit also removes concurrency of the rbtree with the
packet path, so we can now safely rb_erase() the element and move
it to a new expired list that can be reaped in the commit hook
before building the next blob iteration.

This also avoids the need to rebuild the blob in the abort path:
Expired elements seen during insertion attempts are kept around
until a transaction passes.
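
The "park on insert, reap on commit" idea can be sketched with a plain linked list standing in for the rbtree. Everything below is a hypothetical model (`struct set_model`, `set_insert()`, `set_commit()` are not kernel names), and it parks every expired element on insert, whereas the real code only handles the elements it encounters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct elem_model {
	int key;
	bool expired;
	struct elem_model *next;
};

struct set_model {
	struct elem_model *live;	/* stands in for the rbtree */
	struct elem_model *expired;	/* parked until the next commit */
};

/* Unlink expired elements from the live list and park them. */
static void park_expired(struct set_model *s)
{
	struct elem_model **pp = &s->live;

	while (*pp) {
		struct elem_model *e = *pp;

		if (e->expired) {
			*pp = e->next;
			e->next = s->expired;
			s->expired = e;
		} else {
			pp = &e->next;
		}
	}
}

/* Returns 0 on success, -1 if the key exists (an -EEXIST stand-in).
 * Parked elements survive either outcome. */
static int set_insert(struct set_model *s, struct elem_model *new)
{
	assert(s && new);
	park_expired(s);
	for (struct elem_model *e = s->live; e; e = e->next)
		if (e->key == new->key)
			return -1;
	new->next = s->live;
	s->live = new;
	return 0;
}

/* Commit: reap the parked elements; returns how many were released. */
static size_t set_commit(struct set_model *s)
{
	size_t n = 0;

	while (s->expired) {
		struct elem_model *e = s->expired;

		s->expired = e->next;
		e->next = NULL;	/* real code would free the element here */
		n++;
	}
	return n;
}
```

An abort path in this model simply leaves `expired` untouched, so nothing that a lookup structure might still reference is released before a commit.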

Reported-by: syzbot+d417922a3e7935517ef6@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d417922a3e7935517ef6
Fixes: 7e43e0a1141d ("netfilter: nft_set_rbtree: translate rbtree to array for binary search")
Signed-off-by: Florian Westphal <fw@strlen.de>

net/mlx5e: SHAMPO, Switch to header memcpy

Previously the HW-GRO code was using a separate page_pool for the header
buffer. The pages of the header buffer were replenished via UMR. This
mechanism has some drawbacks:
- Reference counting on the page_pool page frags is not cheap.
- UMRs have HW overhead for updating and also for access. Especially for
  the KLM type which was previously used.
- UMR code for headers is complex.

This patch switches to using a static memory area (static MTT MKEY) for
the header buffer and does a header memcpy. This happens only once per
GRO session. The SKB is allocated from the per-cpu NAPI SKB cache.

Performance numbers for x86:
+---------------------------------------------------------+
| Test                | Baseline   | Header Copy | Change |
|---------------------+------------+-------------+--------|
| iperf3 oncpu        |  59.5 Gbps |  64.00 Gbps |   7 %  |
| iperf3 offcpu       | 102.5 Gbps | 104.20 Gbps |   2 %  |
| kperf oncpu         | 115.0 Gbps | 130.00 Gbps |  12 %  |
| XDP_DROP (skb mode) |   3.9 Mpps |   3.9 Mpps  |   0 %  |
+---------------------------------------------------------+

Notes on test:
- System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
- oncpu: NAPI and application running on same CPU
- offcpu: NAPI and application running on different CPUs
- MTU: 1500
- iperf3 tests are single stream, 60s with IPv6 (for slightly larger
  headers)
- kperf version [1]

[1] git://git.kernel.dk/kperf.git

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260204200345.1724098-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Fix 1600G link mode enum naming

Rename TAUI/TBASE to GAUI/GBASE in 1600G link mode identifier and its
usage in ethtool and link-info tables.

Reported-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
Signed-off-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://patch.msgid.link/20260204194324.1723534-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-6.19-rc9).

No adjacent changes, conflicts:

drivers/net/ethernet/spacemit/k1_emac.c
3125fc1701694 ("net: spacemit: k1-emac: fix jumbo frame support")
f66086798f91f ("net: spacemit: Remove broken flow control support")
https://lore.kernel.org/aYIysFIE9ooavWia@sirena.org.uk

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-6.19-rc9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Including fixes from wireless and Netfilter.

  Previous releases - regressions:

   - eth: stmmac: fix stm32 (and potentially others) resume regression

   - nf_tables: fix inverted genmask check in nft_map_catchall_activate()

   - usb: r8152: fix resume reset deadlock

   - fix reporting RXH_XFRM_NO_CHANGE as input_xfrm for RSS contexts

  Previous releases - always broken:

   - sched: cls_u32: use skb_header_pointer_careful() to avoid OOB reads
     with malicious u32 rules

   - eth: ice: timestamping related fixes"

* tag 'net-6.19-rc9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (38 commits)
  ipv6: Fix ECMP sibling count mismatch when clearing RTF_ADDRCONF
  netfilter: nf_tables: fix inverted genmask check in nft_map_catchall_activate()
  net: cpsw: Execute ndo_set_rx_mode callback in a work queue
  net: cpsw_new: Execute ndo_set_rx_mode callback in a work queue
  gve: Correct ethtool rx_dropped calculation
  gve: Fix stats report corruption on queue count change
  selftest: net: add a test-case for encap segmentation after GRO
  net: gro: fix outer network offset
  net: add proper RCU protection to /proc/net/ptype
  net: ethernet: adi: adin1110: Check return value of devm_gpiod_get_optional() in adin1110_check_spi()
  wifi: iwlwifi: mvm: pause TCM on fast resume
  wifi: iwlwifi: mld: cancel mlo_scan_start_wk
  net: spacemit: k1-emac: fix jumbo frame support
  net: enetc: Convert 16-bit register reads to 32-bit for ENETC v4
  net: enetc: Convert 16-bit register writes to 32-bit for ENETC v4
  net: enetc: Remove CBDR cacheability AXI settings for ENETC v4
  net: enetc: Remove SI/BDR cacheability AXI settings for ENETC v4
  tipc: use kfree_sensitive() for session key material
  net: stmmac: fix stm32 (and potentially others) resume regression
  net: rss: fix reporting RXH_XFRM_NO_CHANGE as input_xfrm for contexts
  ...

net/sched: don't use dynamic lockdep keys with clsact/ingress/noqueue

Currently we are registering one dynamic lockdep key for each allocated
qdisc, to avoid false deadlock reports when mirred (or TC eBPF) redirects
packets to another device while the root lock is acquired [1].
Since dynamic keys are a limited resource, we can save them at least for
qdiscs that are not meant to acquire the root lock in the traffic path,
or to carry traffic at all, like:

- clsact
- ingress
- noqueue

Don't register dynamic keys for the above schedulers, so that we hit
MAX_LOCKDEP_KEYS later in our tests.

[1] https://github.com/multipath-tcp/mptcp_net-next/issues/451

Changes in v2:
- change ordering of spin_lock_init() vs. lockdep_register_key()
(Jakub Kicinski)

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://patch.msgid.link/94448f7fa7c4f52d2ce416a4895ec87d456d7417.1770220576.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: imx: fix iMX93 register definitions

When looking at the iMX93 documentation, the definitions in the driver
do not correspond with the documentation, which makes the driver
confusing.

The driver, for example, re-uses a definition for bit 0 for two
different registers, where this bit has completely different purposes.

Fix this by renaming the second register, and adding a definition that
reflects the true purpose of bit 0 in the first register (EQOS enable).

Replace MX93_GPR_ENET_QOS_INTF_MODE_MASK with MX93_GPR_ENET_QOS_ENABLE
and MX93_GPR_ENET_QOS_INTF_SEL_MASK as MX93_GPR_ENET_QOS_INTF_MODE_MASK
is not a register field.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vnaGl-00000007i9f-0ZMw@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: change inet6_sk_rebuild_header() to use inet->cork.fl.u.ip6

TCP v6 spends a good amount of time rebuilding a fresh fl6 at each
transmit in inet6_csk_xmit()/inet6_csk_route_socket().

TCP v4 caches the information in inet->cork.fl.u.ip4 instead.

This patch is a first step converting IPv6 to the same strategy:

Before this patch inet6_sk_rebuild_header() only validated/rebuilt
a dst. Automatic variable @fl6 content was lost.

After this patch inet6_sk_rebuild_header() also initializes
inet->cork.fl.u.ip6, which can be reused in the future.

This makes inet6_sk_rebuild_header() very similar to
inet_sk_rebuild_header().

Also remove the EXPORT_SYMBOL_GPL(), inet6_sk_rebuild_header()
is not called from any module.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260204163035.4123817-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'tcp-remove-net-core-request_sock-c-and-no-longer-inline-__reqsk_free'

Eric Dumazet says:

====================
tcp: remove net/core/request_sock.c and no longer inline __reqsk_free()

After DCCP removal, net/core/request_sock.c makes no more sense.

Move reqsk_queue_alloc() and reqsk_fastopen_remove() to TCP files.

Then put __reqsk_free() out of line to save ~2 Kbytes of text.
====================

Link: https://patch.msgid.link/20260204055147.1682705-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: move __reqsk_free() out of line

Inlining __reqsk_free() is overkill, let's reclaim 2 Kbytes of text.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 2/4 grow/shrink: 2/14 up/down: 225/-2338 (-2113)
Function                                     old     new   delta
__reqsk_free                                   -     114    +114
sock_edemux                                   18      82     +64
inet_csk_listen_start                        233     264     +31
__pfx___reqsk_free                             -      16     +16
__pfx_reqsk_queue_alloc                       16       -     -16
__pfx_reqsk_free                              16       -     -16
reqsk_queue_alloc                             46       -     -46
tcp_req_err                                  272     177     -95
reqsk_fastopen_remove                        348     253     -95
cookie_bpf_check                             157      62     -95
cookie_tcp_reqsk_alloc                       387     290     -97
cookie_v4_check                             1568    1465    -103
reqsk_free                                   105       -    -105
cookie_v6_check                             1519    1412    -107
sock_gen_put                                 187      78    -109
sock_pfree                                   212      82    -130
tcp_try_fastopen                            1818    1683    -135
tcp_v4_rcv                                  3478    3294    -184
reqsk_put                                    306      90    -216
tcp_get_cookie_sock                          551     318    -233
tcp_v6_rcv                                  3404    3141    -263
tcp_conn_request                            2677    2384    -293
Total: Before=24887415, After=24885302, chg -0.01%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260204055147.1682705-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: get rid of net/core/request_sock.c

After DCCP removal, this file was not needed any more.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260204055147.1682705-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: move reqsk_fastopen_remove to net/ipv4/tcp_fastopen.c

This function belongs to TCP stack, not to net/core/request_sock.c

We get rid of the now empty request_sock.c in the following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260204055147.1682705-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

inet: move reqsk_queue_alloc() to net/ipv4/inet_connection_sock.c

Only called once from inet_csk_listen_start(), it can be static.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260204055147.1682705-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-stmmac-rk-final-cleanups-part'

Russell King says:

====================
net: stmmac: rk: final cleanups part

This is the last part of my current dwmac-rk cleanups.
====================

Link: https://patch.msgid.link/aYMN2gZMfLPKuukG@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: rk: rk3506, rk3528 and rk3588 have rmii_mode in clock register

rk3506, rk3528 and rk3588 have the rmii_mode bit in the clock GRF
register rather than the gmac GRF register. Provide a mask for this
field in the clock register, and convert these SoCs to use this.
Add the necessary code in rk_gmac_powerup() to write this field.

This allows us to get rid of these SoCs' set_to_rmii() functions. As
such, we need to mark these SoCs as supporting RMII mode.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Heiko Stuebner <heiko@sntech.de>
Tested-by: Heiko Stuebner <heiko@sntech.de> #px30,rk3328,rk3568,rk3588
Link: https://patch.msgid.link/E1vnYyB-00000007hpF-1dwK@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: rk: use rk_encode_wm16() for clock selection

Use rk_encode_wm16() for RMII clock gating control, and also for the
io_clksel bit used to select the transmit clock between CRU-derived
and IO-derived clock sources.

Both of these were configured via the "set_clock_selection" method in
the SoC specific operations, but there is no requirement to change the
io_clksel except when enabling clocks.

It is also possible that we don't need to ungate the RMII clock if we
are operating in RGMII mode, but this commit makes no change there.

Split up the configuration of these as separate functions, and remove
the set_clock_selection() method. Since these clocking bits are in the
same register that we call the "speed" register, move the logic for
writing that register into rk_write_speed_grf_reg().
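
For readers unfamiliar with the register format, the following sketch shows the write-mask convention used by Rockchip GRF registers: the upper 16 bits of a write select which of the lower 16 bits are updated. The function `encode_wm16_sketch()` is a hypothetical stand-in written for this illustration; the signature and exact behaviour of the in-tree rk_encode_wm16() are not shown in this log and may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Build a 32-bit write value: write-enable mask in bits 31:16, field
 * value shifted to the mask's lowest set bit in bits 15:0. */
static uint32_t encode_wm16_sketch(uint16_t val, uint16_t mask)
{
	unsigned int shift = 0;

	assert(mask != 0);	/* sketch-only precondition */
	while (!((mask >> shift) & 1))
		shift++;
	return ((uint32_t)mask << 16) | (((uint32_t)val << shift) & mask);
}

/* Model of how the hardware applies such a write to a 16-bit field
 * register: only bits enabled by the upper half change. */
static uint16_t grf_apply(uint16_t old, uint32_t write)
{
	uint16_t wmask = (uint16_t)(write >> 16);

	return (uint16_t)((old & ~wmask) | (write & wmask));
}
```

For example, setting a single-bit field at bit 4 encodes as 0x00100010, and clearing it as 0x00100000. Because untouched bits are masked off, no read-modify-write cycle is needed.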

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Heiko Stuebner <heiko@sntech.de>
Tested-by: Heiko Stuebner <heiko@sntech.de> #px30,rk3328,rk3568,rk3588
Link: https://patch.msgid.link/E1vnYy6-00000007hp9-1AJM@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: rk: rk3528: gmac0 only supports RMII

RK3528 gmac0 dtsi contains:

                gmac0: ethernet@ffbd0000 {
                        phy-handle = <&rmii0_phy>;
                        phy-mode = "rmii";

                        mdio0: mdio {
                                rmii0_phy: ethernet-phy@2 {
                                        phy-is-integrated;
                                };
                        };
                };

This follows the same pattern as rk3328, where this gmac instance
only supports RMII. Disable RGMII in phylink's supported_interfaces
mask for this gmac instance.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Heiko Stuebner <heiko@sntech.de>
Link: https://patch.msgid.link/E1vnYy1-00000007hp3-0hKm@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: rk: rk3328: gmac2phy only supports RMII

As detailed in a previous commit ("net: stmmac: rk: convert rk3328 to
use bsp_priv->id") rk3328 gmac2phy only supports RMII, whereas gmac2io
supports both RMII and RGMII. Clear supports_rgmii for gmac2phy.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Heiko Stuebner <heiko@sntech.de>
Tested-by: Heiko Stuebner <heiko@sntech.de> #px30,rk3328 gmac2io,rk3568,rk3588
Link: https://patch.msgid.link/E1vnYxw-00000007hox-0DqH@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: rk: replace empty set_to_rmii() with supports_rmii

Rather than providing a now-empty set_to_rmii() method to indicate
that RMII is supported, switch to setting ops->supports_rmii instead.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Heiko Stuebner <heiko@sntech.de>
Tested-by: Heiko Stuebner <heiko@sntech.de> #px30,rk3328,rk3568,rk3588
Link: https://patch.msgid.link/E1vnYxq-00000007hor-3yXt@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: rk: introduce flags indicating support for RGMII/RMII

Introduce two boolean flags into struct rk_priv_data indicating
whether RGMII and/or RMII is supported for this instance. Use these
to configure the supported_interfaces mask for phylink and to validate
the interface mode. Initialise these from equivalent flags in the
rk_gmac_ops or depending on the presence of the ops->set_to_rgmii and
ops->set_to_rmii methods. Finally, make ops->set_to_* optional.

This will allow us to get rid of empty set_to_rmii() methods.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Heiko Stuebner <heiko@sntech.de>
Tested-by: Heiko Stuebner <heiko@sntech.de> #px30,rk3328,rk3568,rk3588
Link: https://patch.msgid.link/E1vnYxl-00000007hol-3XiH@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: Fix ECMP sibling count mismatch when clearing RTF_ADDRCONF

syzbot reported a kernel BUG in fib6_add_rt2node() when adding an IPv6
route. [0]

Commit f72514b3c569 ("ipv6: clear RA flags when adding a static
route") introduced logic to clear RTF_ADDRCONF from existing routes
when a static route with the same nexthop is added. However, this
causes a problem when the existing route has a gateway.

When RTF_ADDRCONF is cleared from a route that has a gateway, that
route becomes eligible for ECMP, i.e. rt6_qualify_for_ecmp() returns
true. The issue is that this route was never added to the
fib6_siblings list.

This leads to a mismatch between the following counts:

- The sibling count computed by iterating fib6_next chain, which
includes the newly ECMP-eligible route

- The actual siblings in fib6_siblings list, which does not include
that route

When a subsequent ECMP route is added, fib6_add_rt2node() hits
BUG_ON(sibling->fib6_nsiblings != rt->fib6_nsiblings) because the
counts don't match.

Fix this by only clearing RTF_ADDRCONF when the existing route does
not have a gateway. Routes without a gateway cannot qualify for ECMP
anyway (rt6_qualify_for_ecmp() requires fib_nh_gw_family), so clearing
RTF_ADDRCONF on them is safe and matches the original intent of the
commit.
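
The condition can be modelled in a few lines. This is a sketch under simplifying assumptions: `struct route_model`, the flag value and both function names are hypothetical, and the eligibility test keeps only the two conditions discussed above (no RTF_ADDRCONF, gateway present).

```c
#include <assert.h>
#include <stdbool.h>

#define RTF_ADDRCONF_MODEL 0x1u	/* hypothetical flag value */

struct route_model {
	unsigned int flags;
	bool has_gateway;
};

/* Simplified eligibility test: needs a gateway and no ADDRCONF flag. */
static bool qualifies_for_ecmp(const struct route_model *rt)
{
	return rt->has_gateway && !(rt->flags & RTF_ADDRCONF_MODEL);
}

/* Fixed behaviour: clear the flag only on routes without a gateway,
 * so an existing route can never become ECMP-eligible as a side
 * effect. */
static void clear_addrconf_fixed(struct route_model *rt)
{
	assert(rt);
	if (!rt->has_gateway)
		rt->flags &= ~RTF_ADDRCONF_MODEL;
}
```

In this model a gateway route keeps its flag and stays ineligible, while a route without a gateway loses the flag but still cannot qualify, so the sibling count and the sibling list stay in agreement.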

[0]:
kernel BUG at net/ipv6/ip6_fib.c:1217!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 0 UID: 0 PID: 6010 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025
RIP: 0010:fib6_add_rt2node+0x3433/0x3470 net/ipv6/ip6_fib.c:1217
[...]
Call Trace:
<TASK>
fib6_add+0x8da/0x18a0 net/ipv6/ip6_fib.c:1532
__ip6_ins_rt net/ipv6/route.c:1351 [inline]
ip6_route_add+0xde/0x1b0 net/ipv6/route.c:3946
ipv6_route_ioctl+0x35c/0x480 net/ipv6/route.c:4571
inet6_ioctl+0x219/0x280 net/ipv6/af_inet6.c:577
sock_do_ioctl+0xdc/0x300 net/socket.c:1245
sock_ioctl+0x576/0x790 net/socket.c:1366
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:597 [inline]
__se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fixes: f72514b3c569 ("ipv6: clear RA flags when adding a static route")
Reported-by: syzbot+cb809def1baaac68ab92@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=cb809def1baaac68ab92
Tested-by: syzbot+cb809def1baaac68ab92@syzkaller.appspotmail.com
Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20260204095837.1285552-1-syoshida@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'nf-26-02-05' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Florian Westphal says:

====================
netfilter: update for net

This is one last-minute crash fix for nf_tables, from Andrew Fasano:

Logical check is inverted, this makes kernel fail to correctly undo
the transaction, leading to a use-after-free.

* tag 'nf-26-02-05' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: fix inverted genmask check in nft_map_catchall_activate()
====================

Link: https://patch.msgid.link/20260205074450.3187-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: add vlan_get_protocol_offset_inline() helper

skb_protocol() is bloated, and forces slow stack canaries in many
fast paths.

Add vlan_get_protocol_offset_inline() which deals with the non-vlan
common cases.

__vlan_get_protocol_offset() is now out of line.

It returns a vlan_type_depth struct to avoid stack canaries in callers.

struct vlan_type_depth {
       __be16 type;
       u16 depth;
};
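
A userspace sketch of the return-by-value pattern follows. It assumes that a 4-byte struct is returned in a register on common 64-bit ABIs, so callers need no address-taken local for an out-parameter. All names ending in `_model` or `_MODEL` are hypothetical, the type is kept in host byte order rather than __be16, and the parser is a simplification of what the kernel helper does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN_MODEL		14
#define VLAN_HLEN_MODEL		4
#define ETH_P_8021Q_MODEL	0x8100
#define ETH_P_8021AD_MODEL	0x88A8

struct vlan_type_depth_model {
	uint16_t type;	/* host order in this sketch */
	uint16_t depth;	/* offset of the network header */
};

/* Read a big-endian 16-bit value. */
static uint16_t rd16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Walk any VLAN tags and return the encapsulated protocol and header
 * depth by value. Returns {0, 0} for truncated frames. */
static struct vlan_type_depth_model
parse_proto_model(const uint8_t *frame, size_t len)
{
	struct vlan_type_depth_model r = { 0, 0 };
	size_t depth = ETH_HLEN_MODEL;
	uint16_t type;

	assert(frame);
	if (len < depth)
		return r;
	type = rd16(frame + 12);
	while (type == ETH_P_8021Q_MODEL || type == ETH_P_8021AD_MODEL) {
		if (len < depth + VLAN_HLEN_MODEL)
			return r;
		type = rd16(frame + depth + 2);	/* skip the 2-byte TCI */
		depth += VLAN_HLEN_MODEL;
	}
	r.type = type;
	r.depth = (uint16_t)depth;
	return r;
}
```

A caller would write `struct vlan_type_depth_model r = parse_proto_model(buf, len);` and read `r.type` and `r.depth` directly, instead of passing `&depth` into the helper.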

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 0/22 up/down: 0/-6320 (-6320)
Function                                     old     new   delta
vlan_get_protocol_dgram                       61      59      -2
__pfx_skb_protocol                            16       -     -16
__vlan_get_protocol_offset                   307     273     -34
tap_get_user                                1374    1207    -167
ip_md_tunnel_xmit                           1625    1452    -173
tap_sendmsg                                  940     753    -187
netif_skb_features                          1079     866    -213
netem_enqueue                               3017    2800    -217
vlan_parse_protocol                          271      50    -221
tso_start                                    567     344    -223
fq_dequeue                                  1908    1685    -223
skb_network_protocol                         434     205    -229
ip6_tnl_xmit                                2639    2409    -230
br_dev_queue_push_xmit                       474     236    -238
skb_protocol                                 258       -    -258
packet_parse_headers                         621     357    -264
__ip6_tnl_rcv                               1306    1039    -267
skb_csum_hwoffload_help                      515     224    -291
ip_tunnel_xmit                              2635    2339    -296
sch_frag_xmit_hook                          1582    1233    -349
bpf_skb_ecn_set_ce                           868     457    -411
IP6_ECN_decapsulate                         1297     768    -529
ip_tunnel_rcv                               2121    1489    -632
ipip6_rcv                                   2572    1922    -650
Total: Before=24892803, After=24886483, chg -0.03%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260204053023.1622775-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

flow_offload: add const qualifiers to function arguments

Some functions do not modify the pointed-to data, but lack const
qualifiers. Add const qualifiers to the arguments of
flow_rule_match_has_control_flags() and flow_cls_offload_flow_rule().

Signed-off-by: David Yang <mmyangfl@gmail.com>
Link: https://patch.msgid.link/20260204052839.198602-1-mmyangfl@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'dpll-core-improvements-and-ice-e825-c-synce-support'

Ivan Vecera says:

====================
dpll: Core improvements and ice E825-C SyncE support

This series introduces Synchronous Ethernet (SyncE) support for the Intel
E825-C Ethernet controller. Unlike previous generations where DPLL
connections were implicitly assumed, the E825-C architecture relies
on the platform firmware (ACPI) to describe the physical connections
between the Ethernet controller and external DPLLs (such as the ZL3073x).

To accommodate this, the series extends the DPLL subsystem to support
firmware node (fwnode) associations, asynchronous discovery via notifiers,
and dynamic pin management. Additionally, a significant refactor of
the DPLL reference counting logic is included to ensure robustness and
debuggability.

DPLL Core Extensions:
* Firmware Node Association: Pins can now be associated with a struct
  fwnode_handle after allocation via dpll_pin_fwnode_set(). This allows
  drivers to link pin objects with their corresponding DT/ACPI nodes.
* Asynchronous Notifiers: A raw notifier chain is added to the DPLL core.
  This allows the Ethernet driver to subscribe to events and react when
  the platform DPLL driver registers the parent pins, resolving probe
  ordering dependencies.
* Dynamic Indexing: Drivers can now request DPLL_PIN_IDX_UNSPEC to have
  the core automatically allocate a unique pin index.

Reference Counting & Debugging:
* Refactor: The reference counting logic in the core is consolidated.
  Internal list management helpers now automatically handle hold/put
  operations, removing fragile open-coded logic in the registration paths.
* Reference Tracking: A new Kconfig option DPLL_REFCNT_TRACKER is added.
  This allows developers to instrument and debug reference leaks by
  recording stack traces for every get/put operation.

Driver Updates:
* zl3073x: Updated to associate pins with fwnode handles using the new
  setter and support the 'mux' pin type.
* ice: Implements the E825-C specific hardware configuration for SyncE
  (CGU registers). It utilizes the new notifier and fwnode APIs to
  dynamically discover and attach to the platform DPLLs.

Patch Summary:
Patch 1: DPLL Core (fwnode association).
Patch 2: Driver zl3073x (Set fwnode).
Patch 3-4: DPLL Core (Notifiers and dynamic IDs).
Patch 5: Driver zl3073x (Mux type).
Patch 6: DPLL Core (Refcount refactor).
Patch 7-8: Refcount tracking infrastructure and driver updates.
Patch 9: Driver ice (E825-C SyncE logic).
====================

Link: https://patch.msgid.link/20260203174002.705176-1-ivecera@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ice: dpll: Support E825-C SyncE and dynamic pin discovery

Implement SyncE support for the E825-C Ethernet controller using the
DPLL subsystem. Unlike E810, the E825-C architecture relies on platform
firmware (ACPI) to describe connections between the NIC's recovered clock
outputs and external DPLL inputs.

Implement the following mechanisms to support this architecture:

1. Discovery Mechanism: The driver parses the 'dpll-pins' and
   'dpll-pin-names' firmware properties to identify the external DPLL
   pins (parents) corresponding to its RCLK outputs ("rclk0", "rclk1").
   It uses fwnode_dpll_pin_find() to locate these parent pins in the
   DPLL core.

2. Asynchronous Registration: Since the platform DPLL driver (e.g.
   zl3073x) may probe independently of the network driver, utilize
   the DPLL notifier chain. The driver listens for DPLL_PIN_CREATED
   events to detect when the parent MUX pins become available, then
   registers its own Recovered Clock (RCLK) pins as children of those
   parents.

3. Hardware Configuration: Implement the specific register access logic
   for E825-C CGU (Clock Generation Unit) registers (R10, R11). This
   includes configuring the bypass MUXes and clock dividers required to
   drive SyncE signals.

4. Split Initialization: Refactor `ice_dpll_init()` to separate the
   static initialization path of E810 from the dynamic, firmware-driven
   path required for E825-C.

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Co-developed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Co-developed-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Link: https://patch.msgid.link/20260203174002.705176-10-ivecera@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

drivers: Add support for DPLL reference count tracking

Update existing DPLL drivers to utilize the DPLL reference count
tracking infrastructure.

Add dpll_tracker fields to the drivers' internal device and pin
structures. Pass pointers to these trackers when calling
dpll_device_get/put() and dpll_pin_get/put().

This allows developers to inspect the specific references held by this
driver via debugfs when CONFIG_DPLL_REFCNT_TRACKER is enabled, aiding
in the debugging of resource leaks.

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Link: https://patch.msgid.link/20260203174002.705176-9-ivecera@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dpll: Add reference count tracking support

Add support for the REF_TRACKER infrastructure to the DPLL subsystem.

When enabled, this allows developers to track and debug reference counting
leaks or imbalances for dpll_device and dpll_pin objects. It records stack
traces for every get/put operation and exposes this information via
debugfs at:
  /sys/kernel/debug/ref_tracker/dpll_device_*
  /sys/kernel/debug/ref_tracker/dpll_pin_*

The following API changes are made to support this:
1. dpll_device_get() / dpll_device_put() now accept a 'dpll_tracker *'
   (which is a typedef to 'struct ref_tracker *' when enabled, or an empty
   struct otherwise).
2. dpll_pin_get() / dpll_pin_put() and fwnode_dpll_pin_find() similarly
   accept the tracker argument.
3. Internal registration structures now hold a tracker to associate the
   reference held by the registration with the specific owner.

All existing in-tree drivers (ice, mlx5, ptp_ocp, zl3073x) are updated
to pass NULL for the new tracker argument, maintaining current behavior
while enabling future debugging capabilities.

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Co-developed-by: Petr Oros <poros@redhat.com>
Signed-off-by: Petr Oros <poros@redhat.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Link: https://patch.msgid.link/20260203174002.705176-8-ivecera@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dpll: Enhance and consolidate reference counting logic

Refactor the reference counting mechanism for DPLL devices and pins to
improve consistency and prevent potential lifetime issues.

Introduce internal helpers __dpll_{device,pin}_{hold,put}() to
centralize reference management.

Update the internal XArray reference helpers (dpll_xa_ref_*) to
automatically grab a reference to the target object when it is added to
a list, and release it when removed. This ensures that objects linked
internally (e.g., pins referenced by parent pins) are properly kept
alive without relying on the caller to manually manage the count.

Consequently, remove the now redundant manual `refcount_inc/dec` calls
in `dpll_pin_on_pin_{,un}register()`, as ownership is now correctly handled
by the dpll_xa_ref_* functions.

Additionally, ensure that `dpll_device_{,un}register()` takes/releases
a reference to the device, ensuring the device object remains valid for
the duration of its registration.

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Link: https://patch.msgid.link/20260203174002.705176-7-ivecera@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dpll: zl3073x: Add support for mux pin type

Add parsing for the "mux" string in the 'connection-type' pin property
mapping it to DPLL_PIN_TYPE_MUX.

Recognizing this type in the driver allows these pins to be taken as
parent pins for pin-on-pin pins coming from different modules (e.g.
network drivers).

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Link: https://patch.msgid.link/20260203174002.705176-6-ivecera@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dpll: Support dynamic pin index allocation

Allow drivers to register DPLL pins without manually specifying a pin
index.

Currently, drivers must provide a unique pin index when calling
dpll_pin_get(). This works well for hardware-mapped pins but creates
friction for drivers handling virtual pins or those without a strict
hardware indexing scheme.

Introduce DPLL_PIN_IDX_UNSPEC (U32_MAX). When a driver passes this
value as the pin index:
1. The core allocates a unique index using an IDA
2. The allocated index is mapped to a range starting above `INT_MAX`

This separation ensures that dynamically allocated indices never collide
with standard driver-provided hardware indices, which are assumed to be
within the `0` to `INT_MAX` range. The index is automatically freed when
the pin is released in dpll_pin_put().
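
The split between the two index ranges can be modelled in user space
roughly as follows. This is only a sketch: the helper names
dyn_pin_idx() and pin_idx_is_static(), and the exact mapping
(INT_MAX + 1 + id), are assumptions made for illustration and are not
taken from the DPLL core.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Value from the commit message: U32_MAX requests a dynamic index. */
#define DPLL_PIN_IDX_UNSPEC UINT32_MAX

/*
 * Hypothetical helper: map an allocator id (0, 1, 2, ...) into the
 * range above INT_MAX so it cannot collide with a hardware index.
 */
static uint32_t dyn_pin_idx(uint32_t id)
{
	/* Keep the result strictly below DPLL_PIN_IDX_UNSPEC itself. */
	assert(id < DPLL_PIN_IDX_UNSPEC - (uint32_t)INT_MAX - 1u);
	return (uint32_t)INT_MAX + 1u + id;
}

/* Returns 1 when idx lies in the driver-provided range 0..INT_MAX. */
static int pin_idx_is_static(uint32_t idx)
{
	return idx <= (uint32_t)INT_MAX;
}
```

With this mapping the first dynamically allocated pin gets index
2147483648, and no dynamic index is ever equal to DPLL_PIN_IDX_UNSPEC.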

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Link: https://patch.msgid.link/20260203174002.705176-5-ivecera@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dpll: Add notifier chain for dpll events

Currently, the DPLL subsystem reports events (creation, deletion, changes)
to userspace via Netlink. However, there is no mechanism for other kernel
components to be notified of these events directly.

Add a raw notifier chain to the DPLL core protected by dpll_lock. This
allows other kernel subsystems or drivers to register callbacks and
receive notifications when DPLL devices or pins are created, deleted,
or modified.

Define the following:
- Registration helpers: {,un}register_dpll_notifier()
- Event types: DPLL_DEVICE_CREATED, DPLL_PIN_CREATED, etc.
- Context structures: dpll_{device,pin}_notifier_info to pass relevant
data to the listeners.

The notification chain is invoked alongside the existing Netlink event
generation to ensure in-kernel listeners are kept in sync with the
subsystem state.
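
A user-space model of how a listener might consume such a chain is
sketched here. The event names come from this commit message, but their
numeric values, struct notifier and the helper functions are
hypothetical stand-ins, not the kernel notifier API; locking (dpll_lock
in the real code) is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Event names follow the commit message; the values are made up. */
enum dpll_event { DPLL_DEVICE_CREATED = 1, DPLL_PIN_CREATED = 2 };

/* Hypothetical stand-in for the kernel's struct notifier_block. */
struct notifier {
	int (*call)(struct notifier *nb, unsigned long event, void *info);
	struct notifier *next;
};

static struct notifier *chain;	/* head of the listener list */

static void register_notifier(struct notifier *nb)
{
	assert(nb && nb->call);
	nb->next = chain;
	chain = nb;
}

static void unregister_notifier(struct notifier *nb)
{
	struct notifier **p;

	for (p = &chain; *p; p = &(*p)->next) {
		if (*p == nb) {
			*p = nb->next;
			return;
		}
	}
}

/* Invoke every listener; returns how many were called. */
static int call_chain(unsigned long event, void *info)
{
	struct notifier *nb;
	int n = 0;

	for (nb = chain; nb; nb = nb->next, n++)
		nb->call(nb, event, info);
	return n;
}

/* Example listener: count DPLL_PIN_CREATED events, ignore the rest. */
static int pins_seen;

static int pin_listener(struct notifier *nb, unsigned long event, void *info)
{
	(void)nb;
	(void)info;
	if (event == DPLL_PIN_CREATED)
		pins_seen++;
	return 0;
}
```

A consumer that only cares about pins filters on the event type in its
callback, as pin_listener() does, and ignores everything else.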

Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Co-developed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: Petr Oros <poros@redhat.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Link: https://patch.msgid.link/20260203174002.705176-4-ivecera@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dpll: zl3073x: Associate pin with fwnode handle

Associate the registered DPLL pin with its firmware node by calling
dpll_pin_fwnode_set().

This links the created pin object to its corresponding DT/ACPI node
in the DPLL core. Consequently, this enables consumer drivers (such as
network drivers) to locate and request this specific pin using the
fwnode_dpll_pin_find() helper.

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Link: https://patch.msgid.link/20260203174002.705176-3-ivecera@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dpll: Allow associating dpll pin with a firmware node

Extend the DPLL core to support associating a DPLL pin with a firmware
node. This association is required to allow other subsystems (such as
network drivers) to locate and request specific DPLL pins defined in
the Device Tree or ACPI.

* Add a .fwnode field to the struct dpll_pin
* Introduce dpll_pin_fwnode_set() helper to allow the provider driver
to associate a pin with a fwnode after the pin has been allocated
* Introduce fwnode_dpll_pin_find() helper to allow consumers to search
for a registered DPLL pin using its associated fwnode handle
* Ensure the fwnode reference is properly released in dpll_pin_put()

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Link: https://patch.msgid.link/20260203174002.705176-2-ivecera@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/mlx5: Support devlink port state for host PF

Add support for devlink port function state get/set operations for the
host physical function (PF). Until now, mlx5 only allowed state get/set
for subfunction (SF) ports. This change enables an administrator with
eSwitch manager privileges to query or modify the host PF’s function
state, allowing it to be explicitly inactivated or activated. While the
function is inactive, the administrator can modify its attributes, such
as enabling or disabling RoCE.

$ devlink port show pci/0000:03:00.0/196608
pci/0000:03:00.0/196608: type eth netdev eth1 flavour pcipf controller 1 pfnum 0 external true splittable false
  function:
    hw_addr a0:88:c2:45:17:7c state active opstate attached roce enable max_io_eqs 120
$ devlink port function set pci/0000:03:00.0/196608 state inactive
$ devlink port show pci/0000:03:00.0/196608
pci/0000:03:00.0/196608: type eth netdev eth1 flavour pcipf controller 1 pfnum 0 external true splittable false
  function:
    hw_addr a0:88:c2:45:17:7c state inactive opstate detached roce enable max_io_eqs 120

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260203102402.1712218-1-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'move-can-skb-headroom-content-to-skb-extensions'

Oliver Hartkopp says:

====================
move CAN skb headroom content to skb extensions

CAN bus related skbuffs (ETH_P_CAN/ETH_P_CANFD/ETH_P_CANXL) simply contain
CAN frame structs for CAN CC/FD/XL of skb->len length at skb->data. Those
CAN skbs do not have network/mac/transport headers nor other such
references for encapsulated protocols like ethernet/IP protocols.

To store data for CAN specific use-cases all CAN bus related skbuffs are
created with a 16 byte private skb headroom (struct can_skb_priv). Using
the skb headroom and accessing skb->head for this private data led to
several problems in the past likely due to "The struct can_skb_priv
business is highly unconventional for the networking stack." [1]

This patch set aims to remove the unconventional skb headroom usage for CAN
bus related skbuffs and use the common skb extensions instead.

[1] https://lore.kernel.org/linux-can/20260104074222.29e660ac@kernel.org/

- v1: https://patch.msgid.link/20260125201601.5018-1-socketcan@hartkopp.net
- v2: https://lore.kernel.org/linux-can/20260128-can-skb-ext-v2-0-fe64aa152c8a@pengutronix.de/
- v4: https://lore.kernel.org/netdev/20260128-can_skb_ext-v1-0-330f60fd5d7e@hartkopp.net/
- v5: https://patch.msgid.link/20260129-can_skb_ext-v5-0-21252fdc8900@hartkopp.net
- v6: https://patch.msgid.link/20260130-can_skb_ext-v6-0-8fceafab7f26@hartkopp.net
- v7: https://patch.msgid.link/20260131-can_skb_ext-v7-0-dd0f8f84a83d@hartkopp.net

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
====================

Link: https://patch.msgid.link/20260201-can_skb_ext-v8-0-3635d790fe8b@hartkopp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

can: gw: use can_gw_hops instead of sk_buff::csum_start

As CAN skbs don't use IP checksums the skb->csum_start variable was used to
store the can-gw CAN frame time-to-live counter together with
skb->ip_summed set to CHECKSUM_UNNECESSARY.

Remove the 'hack' using the skb->csum_start variable and move the content
to can_skb_ext::can_gw_hops of the CAN skb extensions.

The module parameter 'max_hops' has been reduced to a single byte to fit
can_skb_ext::can_gw_hops as the maximum value to be stored is 6.

Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://patch.msgid.link/20260201-can_skb_ext-v8-6-3635d790fe8b@hartkopp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

can: remove private CAN skb headroom infrastructure

This patch removes struct can_skb_priv which was stored at skb->head and
the can_skb_reserve() helper which was used to shift skb->head.

Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://patch.msgid.link/20260201-can_skb_ext-v8-5-3635d790fe8b@hartkopp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

can: move frame_len to CAN skb extensions

The can_skb_priv::frame_len variable is used to cache a previously
calculated CAN frame length to be passed to BQL queueing disciplines.

Move the can_skb_priv::frame_len content to can_skb_ext::can_framelen.

Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://patch.msgid.link/20260201-can_skb_ext-v8-4-3635d790fe8b@hartkopp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

can: move ifindex to CAN skb extensions

When routing CAN frames over different CAN interfaces the interface index
skb->iif is overwritten with every single hop. To prevent sending a CAN
frame back to its originating (first) incoming CAN interface another
ifindex variable is needed, which was stored in can_skb_priv::ifindex.

Move the can_skb_priv::ifindex content to can_skb_ext::can_iif.

Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://patch.msgid.link/20260201-can_skb_ext-v8-3-3635d790fe8b@hartkopp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

can: add CAN skb extension infrastructure

To remove the private CAN bus skb headroom infrastructure 8 bytes need to
be stored in the skb. The skb extensions are a common pattern and an easy
and efficient way to hold private data travelling along with the skb. We
only need the skb_ext_add() and skb_ext_find() functions to allocate and
access CAN specific content as the skb helpers to copy/clone/free skbs
automatically take care of skb extensions and their final removal.

This patch introduces the complete CAN skb extensions infrastructure:
- add struct can_skb_ext in new file include/net/can.h
- add include/net/can.h in MAINTAINERS
- add SKB_EXT_CAN to skbuff.c and skbuff.h
- select SKB_EXTENSIONS in Kconfig when CONFIG_CAN is enabled
- check for existing CAN skb extensions in can_rcv() in af_can.c
- add CAN skb extensions allocation at every skb_alloc() location
- duplicate the skb extensions if cloning outgoing skbs (framelen/gw_hops)
- introduce can_skb_ext_add() and can_skb_ext_find() helpers
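
For orientation, an 8-byte layout that would hold the values named in
this series could look like the model below. The member names can_iif,
can_framelen and can_gw_hops come from the commit messages of this
series; the member types, the padding byte and the init helper are
assumptions, and the struct is deliberately given a different name from
the real struct can_skb_ext.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model, not the definition in include/net/can.h. */
struct can_skb_ext_model {
	int32_t  can_iif;	/* originating interface index */
	uint16_t can_framelen;	/* cached frame length for BQL */
	uint8_t  can_gw_hops;	/* can-gw hop counter */
	uint8_t  pad;		/* assumed spare byte to reach 8 bytes */
};

/* Hypothetical helper: start with a known ifindex and zeroed counters. */
static void can_ext_init(struct can_skb_ext_model *ext, int32_t ifindex)
{
	assert(ext);
	ext->can_iif = ifindex;
	ext->can_framelen = 0;
	ext->can_gw_hops = 0;
	ext->pad = 0;
}
```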

The patch also corrects an indentation issue in the original code from 2018:
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202602010426.PnGrYAk3-lkp@intel.com/
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://patch.msgid.link/20260201-can_skb_ext-v8-2-3635d790fe8b@hartkopp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

can: use skb hash instead of private variable in headroom

The can_skb_priv::skbcnt variable is used to identify CAN skbs in the RX
path, analogous to skb->hash.

As the skb hash is not filled in CAN skbs move the private skbcnt value to
skb->hash and set skb->sw_hash accordingly. The skb->hash is a value used
for RPS to identify skbs. Use it as intended.

Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://patch.msgid.link/20260201-can_skb_ext-v8-1-3635d790fe8b@hartkopp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

MAINTAINERS: Remove myself from TC maintainers

Recently TC maintainer Jamal intentionally broke a reasonable
use case:
https://lore.kernel.org/netdev/aG10rqwjX6elG1Gx@pop-os.localdomain/

Although I tried my best to help by:
1) Strongly objecting to this breakage from the very beginning
2) Reverting it and offering a much better solution
3) Offering Jamal a video chat on 8 Jul 2025 and 26 Nov 2025

None of them worked.

So it makes no sense for me to continue caring about this subsystem.

Most importantly, intentionally breaking reasonable use cases is
against my morals; I don't want to be ashamed.

Thanks for the opportunity!

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Link: https://patch.msgid.link/20260130212021.46610-1-xiyou.wangcong@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ppp: enable TX scatter-gather

PPP channels using chan->direct_xmit prepend the PPP header to a skb and
call dev_queue_xmit() directly. In this mode the skb does not need to be
linear, but the PPP netdevice currently does not advertise
scatter-gather features, causing unnecessary linearization and
preventing GSO.

Enable NETIF_F_SG and NETIF_F_FRAGLIST on PPP devices. In case a linear
buffer is required (PPP compression, multilink, and channels without
direct_xmit), call skb_linearize() explicitly.

Signed-off-by: Qingfang Deng <dqfext@gmail.com>
Link: https://patch.msgid.link/20260129012902.941-1-dqfext@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netfilter: nf_tables: fix inverted genmask check in nft_map_catchall_activate()

nft_map_catchall_activate() has an inverted element activity check
compared to its non-catchall counterpart nft_mapelem_activate() and
compared to what is logically required.

nft_map_catchall_activate() is called from the abort path to re-activate
catchall map elements that were deactivated during a failed transaction.
It should skip elements that are already active (they don't need
re-activation) and process elements that are inactive (they need to be
restored). Instead, the current code does the opposite: it skips inactive
elements and processes active ones.

Compare the non-catchall activate callback, which is correct:

  nft_mapelem_activate():
    if (nft_set_elem_active(ext, iter->genmask))
        return 0;   /* skip active, process inactive */

With the buggy catchall version:

  nft_map_catchall_activate():
    if (!nft_set_elem_active(ext, genmask))
        continue;   /* skip inactive, process active */

The consequence is that when a DELSET operation is aborted,
nft_setelem_data_activate() is never called for the catchall element.
For NFT_GOTO verdict elements, this means nft_data_hold() is never
called to restore the chain->use reference count. Each abort cycle
permanently decrements chain->use. Once chain->use reaches zero,
DELCHAIN succeeds and frees the chain while catchall verdict elements
still reference it, resulting in a use-after-free.

This is exploitable for local privilege escalation from an unprivileged
user via user namespaces + nftables on distributions that enable
CONFIG_USER_NS and CONFIG_NF_TABLES.

Fix by removing the negation so the check matches nft_mapelem_activate():
skip active elements, process inactive ones.
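
The effect of the corrected check can be modelled in user space as
below. This is a simplified sketch: struct elem, its fields and
catchall_activate() are hypothetical, a plain boolean replaces the
generation mask test, and a counter stands in for the reference that
nft_data_hold() restores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified element. */
struct elem {
	bool active;
	int chain_use;	/* models the restored chain->use reference */
};

/* Fixed logic: skip active elements, re-activate inactive ones. */
static int catchall_activate(struct elem *elems, size_t n)
{
	int restored = 0;
	size_t i;

	assert(elems || n == 0);
	for (i = 0; i < n; i++) {
		if (elems[i].active)
			continue;	/* already active, nothing to restore */
		elems[i].active = true;
		elems[i].chain_use++;	/* stands in for nft_data_hold() */
		restored++;
	}
	return restored;
}
```

Running the function twice on the same array restores each inactive
element exactly once, so the modelled reference count stays balanced
across repeated abort cycles.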

Fixes: 628bd3e49cba ("netfilter: nf_tables: drop map element references from preparation phase")
Signed-off-by: Andrew Fasano <andrew.fasano@nist.gov>
Signed-off-by: Florian Westphal <fw@strlen.de>

net/mlx5e: Extend TC max ratelimit using max_bw_value_msb

The per-TC rate limit was restricted to 255 Gbps due to the 8-bit
max_bw_value field in the QETC register.
This limit is insufficient for newer, higher-bandwidth NICs.

Extend the rate limit by using the full 16-bit max_bw_value field.
This allows the finer 100Mbps granularity to be used for rates up to
~6.5 Tbps, instead of switching to 1Gbps granularity at higher rates.

The extended range is only used when the device advertises support
via the qetcr_qshr_max_bw_val_msb capability bit in the QCAM register.
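
The quoted limits follow from the field width and the unit size: an
8-bit value counting 1 Gbps units tops out at 255 Gbps, while a 16-bit
value counting 100 Mbps units reaches 6553.5 Gbps. The helper below
only reproduces that arithmetic; max_rate_mbps() is a hypothetical name
and is not part of the driver.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest rate, in Mbps, that a field of 'bits' bits can express when
 * each step is worth 'unit_mbps'.
 */
static uint64_t max_rate_mbps(unsigned int bits, uint64_t unit_mbps)
{
	assert(bits > 0 && bits < 64);
	return ((UINT64_C(1) << bits) - 1) * unit_mbps;
}
```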

Signed-off-by: Alexei Lazar <alazar@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260203073021.1710806-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-mlx5e-rx-datapath-enhancements'

Tariq Toukan says:

====================
net/mlx5e: RX datapath enhancements

This series by Dragos introduces multiple RX datapath enhancements to
the mlx5e driver.

First patch adds SW handling for oversized packets in non-linear SKB
mode.

Second patch adds a reclaim mechanism to mitigate memory allocation
failures with memory providers.
====================

Link: https://patch.msgid.link/20260203072130.1710255-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: SHAMPO, Improve allocation recovery

When memory providers are used, there is a disconnect between the
page_pool size and the available memory in the provider. This means
that the page_pool can run out of memory if the user didn't provision
a large enough buffer.

Under these conditions, mlx5 gets stuck trying to allocate new
buffers without being able to release existing buffers. This happens due
to the optimization introduced in commit 4c2a13236807
("net/mlx5e: RX, Defer page release in striding rq for better recycling")
which delays WQE releases to increase the chance of page_pool direct
recycling. The optimization was developed before memory providers
existed and this circumstance was not considered.

This patch unblocks the queue by reclaiming pages from WQEs that can be
freed and doing a one-shot retry. A WQE can be freed when:
1) All its strides have been consumed (WQE is no longer in linked list).
2) The WQE pages/netmems have not been previously released.

This reclaim mechanism is useful for regular pages as well.

Note that provisioning memory that can't fill even one MPWQE (64
4K pages) will still render the queue unusable. Same when
the application doesn't release its buffers for various reasons.
Or a combination of the two: a very small buffer is provisioned,
application releases buffers in bulk, bulk size never reached
=> queue is stuck.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260203072130.1710255-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: RX, Drop oversized packets in non-linear mode

Currently the driver has an inconsistent behaviour between modes when it
comes to oversized packets that are not dropped through the physical MTU
check in HW. This can happen for Multi Host configurations where each
port has a different MTU.

Current behavior:

1) Striding RQ in linear mode drops the packet in SW and counts it
with oversize_pkts_sw_drop.

2) Striding RQ in non-linear mode allows it like a normal packet.

3) Legacy RQ can't receive oversized packets by design:
the RX WQE uses MTU sized packet buffers.

This inconsistency is not a violation of the netdev policy [1]
but it is better to be consistent across modes.

This patch aligns (2) with (1) and (3). One exception is added for
LRO: don't drop the oversized packet if it is an LRO packet.
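
The resulting drop decision can be summarised by the predicate below.
It is a sketch under assumptions: rx_should_drop() is a hypothetical
helper, and comparing the packet length against a per-queue hw_mtu
value is inferred from the description rather than copied from the
driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: should a received packet be dropped in SW? */
static bool rx_should_drop(uint32_t pkt_len, uint32_t hw_mtu, bool is_lro)
{
	assert(hw_mtu > 0);
	if (is_lro)
		return false;	/* LRO packets are exempt from the check */
	return pkt_len > hw_mtu;
}
```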

As now rq->hw_mtu always needs to be updated during the MTU change flow,
drop the reset avoidance optimization from mlx5e_change_mtu().

Extract the CQE LRO segments reading into a helper function as it
is used twice now.

[1] Documentation/networking/netdevices.rst#L205

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260203072130.1710255-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: remove support for lpi_intr_o

The dwmac databook for v3.74a states that lpi_intr_o is a sideband
signal which should be used to ungate the application clock, and this
signal is synchronous to the receive clock. The receive clock can run
at 2.5, 25 or 125MHz depending on the media speed, and can stop under
the control of the link partner. This means that the time it takes to
clear is dependent on the negotiated media speed, and thus can be 8,
40, or 400ns after reading the LPI control and status register.
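
The 8, 40 and 400ns figures match one period of the receive clock at
125, 25 and 2.5MHz respectively. The helper below just reproduces that
conversion; clk_period_ns() is a hypothetical name used for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Period of one clock cycle in nanoseconds, frequency given in kHz. */
static uint32_t clk_period_ns(uint32_t freq_khz)
{
	assert(freq_khz > 0);
	return 1000000u / freq_khz;
}
```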

It has been observed that, with some aggressive link partners, this
clock can stop while lpi_intr_o is still asserted, meaning that the
signal remains asserted for an indefinite period that the local system
has no direct control over.

The LPI interrupts will still be signalled through the main interrupt
path in any case, and this path is not dependent on the receive clock.

Thus, since we do not gate the application clock, and the chances of
adding clock gating in the future are slim due to the clocks being
ill-defined, lpi_intr_o serves no useful purpose. Remove the code which
requests the interrupt, and all associated code.

Reported-by: Ovidiu Panait <ovidiu.panait.rb@renesas.com>
Tested-by: Ovidiu Panait <ovidiu.panait.rb@renesas.com> # Renesas RZ/V2H board
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vnJbt-00000007YYN-28nm@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-stmmac-fix-serdes-power-methods'

Russell King says:

====================
net: stmmac: fix serdes power methods

The stmmac serdes powerup/powerdown methods are not guaranteed to be
called in a balancing fashion, but these are used to call the generic
PHY subsystem's phy_power_up() and phy_power_down() methods which do
require balanced calls.

This series addresses this by making the stmmac serdes methods balanced.
====================

Link: https://patch.msgid.link/aYHHWm5UkD1JVa7D@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: move serdes power methods to stmmac_[open|release]()

Move the SerDes power up and down calls for the non-"after linkup"
case out of __stmmac_open() and __stmmac_release() into the
stmmac_open() and stmmac_release() methods, which means the SerDes
will only change power state on administrative changes or suspend/
resume, not while changing the interface MTU.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vnDDt-00000007XxF-3uUK@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: add missing serdes power down in error paths

The open path is missing cleanup of a successful serdes power up if
stmmac_hw_setup() or stmmac_request_irq() fails.

stmmac_resume() is also missing cleanup of the serdes power up if
stmmac_hw_setup() fails.

Add the missing cleanups.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vnDDo-00000007Xx9-3RZ8@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: add state tracking for legacy serdes power state

Avoid calling the serdes_powerdown() method if we have not had a
preceding successful call to the serdes_powerup() method. This
avoids unbalancing refcounted resources that may be used in
these platform glue serdes methods.
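
The idea can be sketched in user space as a pair of wrappers around the
platform callbacks. Everything here is a model: struct priv, the
serdes_powered flag and the wrapper names are assumptions, not the
identifiers used in stmmac.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down private data; field names are assumptions. */
struct priv {
	int (*serdes_powerup)(void);
	void (*serdes_powerdown)(void);
	bool serdes_powered;
};

static int serdes_power_up(struct priv *p)
{
	int ret;

	assert(p);
	if (!p->serdes_powerup)
		return 0;
	ret = p->serdes_powerup();
	if (ret == 0)
		p->serdes_powered = true;	/* remember the success */
	return ret;
}

static void serdes_power_down(struct priv *p)
{
	assert(p);
	/* Only undo a power-up that actually succeeded. */
	if (p->serdes_powered && p->serdes_powerdown)
		p->serdes_powerdown();
	p->serdes_powered = false;
}

/* Counters and fake callbacks used to exercise the wrappers. */
static int ups, downs;
static int fake_up(void) { ups++; return 0; }
static int fake_up_fail(void) { return -1; }
static void fake_down(void) { downs++; }
```

A failed power-up followed by a power-down, or two power-downs in a
row, never reach the platform powerdown callback more often than the
powerup callback succeeded.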

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vnDDj-00000007Xx3-2xZ0@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: add wrappers for serdes_power[up|down]() methods

Add wrappers for the serdes_power[up|down]() methods and update all
call sites. This will allow us to add state tracking.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vnDDe-00000007Xww-2VUU@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-rds-rds-tcp-protocol-and-extension-improvements'

Allison Henderson says:

====================
net/rds: RDS-TCP protocol and extension improvements

This is subset 3 of the larger RDS-TCP patch series I posted last
Oct. The greater series aims to correct multiple rds-tcp issues that
can cause dropped or out of sequence messages. I've broken it down into
smaller sets to make reviews more manageable.

In this set, we introduce extension headers for byte accounting
and fix several RDS/TCP protocol issues including message preservation
during connection transitions and multipath lane handling.

The entire set can be viewed in the rfc here:
https://lore.kernel.org/netdev/20251022191715.157755-1-achender@kernel.org/
====================

Link: https://patch.msgid.link/20260203055723.1085751-1-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/rds: Trigger rds_send_ping() more than once

Even though a peer may have already received a
non-zero value for "RDS_EXTHDR_NPATHS" from a node in the past,
the current peer may not have.

Therefore it is important to initiate another rds_send_ping()
after a re-connect to any peer:
It is unknown at that time if we're still talking to the same
instance of RDS kernel modules on the other side.

Otherwise, the peer may just operate on a single lane
("c_npaths == 0"), not knowing that more lanes are available.

However, if "c_with_sport_idx" is supported,
we also need to check that the connection we accepted on lane#0
meets the proper source port modulo requirement, as we fan out:

Since the exchange of "RDS_EXTHDR_NPATHS" and "RDS_EXTHDR_SPORT_IDX"
is asynchronous, initially we have no choice but to accept an incoming
connection (via "accept") in the first slot ("cp_index == 0")
for backwards compatibility.

But that very connection may have come from a different lane
with "cp_index != 0", since the peer thought that we already understood
and handled "c_with_sport_idx" properly, as indicated by a previous
exchange before a module was reloaded.

In short:
If a module gets reloaded, we recover from that, but do *not*
allow a downgrade to support fewer lanes.

Downgrades would require us to merge messages from separate lanes,
which is rather tricky with the current RDS design.
Each lane has its own sequence number space and all messages
would need to be re-sequenced as we merge, all while
handling "RDS_FLAG_RETRANSMITTED" and "cp_retrans" properly.

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260203055723.1085751-9-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/rds: Use the first lane until RDS_EXTHDR_NPATHS arrives

Instead of just blocking the sender until "c_npaths" is known
(it gets updated upon the receipt of a MPRDS PONG message),
simply use the first lane (cp_index#0).

But just using the first lane isn't enough.

As soon as we enqueue messages on a different lane, we'd run the risk
of out-of-order delivery of RDS messages.

Earlier messages enqueued on "cp_index == 0" could be delivered later
than more recent messages enqueued on "cp_index > 0", mostly because of
possible head of line blocking issues causing the first lane to be
slower.

To avoid that, we simply take a snapshot of "cp_next_tx_seq" at the
time we're about to fan-out to more lanes.

Then we delay the transmission of messages enqueued on other lanes
with "cp_index > 0" until cp_index#0 has caught up with the delivery of
new messages (from "cp_send_queue") as well as in-flight
messages (from "cp_retrans") that haven't been acknowledged yet
by the receiver.

We also add a new counter "mprds_catchup_tx0_retries" to keep track
of how many times "rds_send_xmit" had to suspend activities,
because it was waiting for the first lane to catch up.

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260203055723.1085751-8-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/rds: Update struct rds_statistics to use u64 instead of uint64_t

Quick clean up to avoid checkpatch errors when adding members to
this struct (Prefer kernel type 'u64' over 'uint64_t').
No functional changes added.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260203055723.1085751-7-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/rds: Clear reconnect pending bit

When canceling the reconnect worker, care must be taken to reset the
reconnect-pending bit. If the reconnect worker has not yet been
scheduled before it is canceled, the reconnect-pending bit will stay
on forever.
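
Why a stale bit is harmful can be seen in a small user-space model,
assuming the pending bit is what guards against queueing the worker
twice. struct path and both helpers are hypothetical and only
illustrate the state handling, not the RDS code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one connection path. */
struct path {
	bool reconnect_pending;	/* models the reconnect-pending bit */
	bool work_queued;	/* models the queued reconnect worker */
};

/* Queue the worker unless one is already pending. */
static bool queue_reconnect(struct path *p)
{
	assert(p);
	if (p->reconnect_pending)
		return false;
	p->reconnect_pending = true;
	p->work_queued = true;
	return true;
}

/* Cancel the worker and, as in the fix, clear the pending bit too. */
static void cancel_reconnect(struct path *p)
{
	assert(p);
	p->work_queued = false;
	p->reconnect_pending = false;
}
```

If cancel_reconnect() left reconnect_pending set, every later call to
queue_reconnect() in this model would return false and the path would
never reconnect.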

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260203055723.1085751-6-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/rds: Kick-start TCP receiver after accept

In cases where the server (the node with the higher IP-address)
in an RDS/TCP connection is overwhelmed it is possible that the
socket that was just accepted is chock-full of messages, up to
the limit of what the socket receive buffer permits.

Subsequently, "rds_tcp_data_ready" won't be called anymore,
because there is no more space to receive additional messages.

Nor was it called prior to the point of calling "rds_tcp_set_callbacks",
because the "sk_data_ready" pointer didn't even point to
"rds_tcp_data_ready" yet.

We fix this by simply kick-starting the receive-worker
for all cases where the socket state is neither
"TCP_CLOSE_WAIT" nor "TCP_CLOSE".

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260203055723.1085751-5-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/rds: rds_tcp_conn_path_shutdown must not discard messages

RDS/TCP differs from RDS/RDMA in that message acknowledgment
is done based on TCP sequence numbers:
As soon as the last byte of a message has been acknowledged by the
TCP stack of a peer, rds_tcp_write_space() goes on to discard
prior messages from the send queue.

Which is fine, for as long as the receiver never throws any messages
away.

The dequeuing of messages in RDS/TCP is done either from the
"sk_data_ready" callback pointing to rds_tcp_data_ready()
(the most common case), or from the receive worker pointing
to rds_tcp_recv_path() which is called for as long as the
connection is "RDS_CONN_UP".

However, as soon as rds_conn_path_drop() is called for whatever reason,
including "DR_USER_RESET", "cp_state" transitions to "RDS_CONN_ERROR",
and rds_tcp_restore_callbacks() ends up restoring the callbacks
and thereby disabling message receipt.

So messages already acknowledged to the sender were dropped.

Furthermore, the "->shutdown" callback was always called
with an invalid parameter ("RCV_SHUTDOWN | SEND_SHUTDOWN == 3"),
instead of the correct pre-increment value ("SHUT_RDWR == 2").
inet_shutdown() returns "-EINVAL" in such cases, rendering
this call a NOOP.
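
To make the off-by-one concrete, here is a small user-space model of
the argument check, written from the description above. The MODEL_*
names are local to this sketch; the numeric values match the kernel
mask bits and the SHUT_* constants of the socket API.

```c
#include <assert.h>
#include <errno.h>

/* Kernel-internal mask bits. */
#define MODEL_RCV_SHUTDOWN  1
#define MODEL_SEND_SHUTDOWN 2
#define MODEL_SHUTDOWN_MASK 3

/* Values expected by the "->shutdown" callback (socket API). */
enum { MODEL_SHUT_RD = 0, MODEL_SHUT_WR = 1, MODEL_SHUT_RDWR = 2 };

static_assert((MODEL_RCV_SHUTDOWN | MODEL_SEND_SHUTDOWN) == MODEL_SHUTDOWN_MASK,
	      "mask must cover both directions");

/*
 * The callback takes the pre-increment value: "how" is incremented
 * first, which maps 0/1/2 onto the 1/2/3 bit mask. Anything outside
 * the mask after the increment is rejected.
 */
static int model_shutdown_check(int how)
{
	how++;
	if ((how & ~MODEL_SHUTDOWN_MASK) || !how)
		return -EINVAL;
	return how; /* resulting shutdown mask */
}
```

Passing the mask value 3 ends up as 4 after the increment, which lies
outside the mask and is rejected, whereas 2 (SHUT_RDWR) maps to 3.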

So we change rds_tcp_conn_path_shutdown() to do the proper
"->shutdown(SHUT_WR)" call in order to signal EOF to the peer
and make it transition to "TCP_CLOSE_WAIT" (RFC 793).

This should make the peer also enter rds_tcp_conn_path_shutdown()
and do the same.

This allows us to dequeue all messages already received
and acknowledged to the peer.
We do so, until we know that the receive queue no longer has data
(skb_queue_empty()) and that we couldn't have any data
in flight anymore, because the socket transitioned to
any of the states "CLOSING", "TIME_WAIT", "CLOSE_WAIT",
"LAST_ACK", or "CLOSE" (RFC 793).

However, if we do just that, we suddenly see duplicate RDS
messages being delivered to the application.
So what gives?

Turns out that with MPRDS and its multitude of backend connections,
retransmitted messages ("RDS_FLAG_RETRANSMITTED") can outrace
the dequeuing of their original counterparts.

And the duplicate check implemented in rds_recv_local() only
discards duplicates if flag "RDS_FLAG_RETRANSMITTED" is set.

Rather curious, because a duplicate is a duplicate; it shouldn't
matter which copy is looked at and delivered first.

To avoid this entire situation, we simply make the sender discard
messages from the send-queue right from within
rds_tcp_conn_path_shutdown(). Just like rds_tcp_write_space() would
have done, were it called in time or still called.

This makes sure that we no longer have messages that we know
the receiver already dequeued sitting in our send-queue,
and therefore avoid the entire "RDS_FLAG_RETRANSMITTED" fiasco.

Now we got rid of the duplicate RDS message delivery, but we
still run into cases where RDS messages are dropped.

This time it is due to the delayed setting of the socket-callbacks
in rds_tcp_accept_one() via either rds_tcp_reset_callbacks()
or rds_tcp_set_callbacks().

By the time rds_tcp_accept_one() gets there, the socket
may already have transitioned into state "TCP_CLOSE_WAIT",
but rds_tcp_state_change() was never called.

Subsequently, "->shutdown(SHUT_WR)" did not happen either.
So the peer ends up getting stuck in state "TCP_FIN_WAIT2".

We fix that by checking for states "TCP_CLOSE_WAIT", "TCP_LAST_ACK",
or "TCP_CLOSE" and drop the freshly accepted socket in that case.

This problem is observable by running "rds-stress --reset"
frequently on either of the two sides of an RDS connection,
or both while other "rds-stress" processes are exchanging data.
Those "rds-stress" processes reported out-of-sequence
errors, with the expected sequence number being smaller
than the one actually received (due to the dropped messages).

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260203055723.1085751-4-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/rds: Encode cp_index in TCP source port

Upon "sendmsg", RDS/TCP selects a backend connection based
on a hash calculated from the source-port ("RDS_MPATH_HASH").

However, "rds_tcp_accept_one" accepts connections
in the order they arrive, which is non-deterministic.

Therefore the mapping of the sender's "cp->cp_index"
to that of the receiver changes if the backend
connections are dropped and reconnected.

However, connection state that's preserved across reconnects
(e.g. "cp_next_rx_seq") relies on that sender<->receiver
mapping to never change.

So we make sure that client and server of the TCP connection
have the exact same "cp->cp_index" across reconnects by
encoding "cp->cp_index" in the lower three bits of the
client's TCP source port.

A new extension "RDS_EXTHDR_SPORT_IDX" is introduced,
that allows the server to tell the difference between
clients that do the "cp->cp_index" encoding, and
legacy clients that pick source ports randomly.
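
The encoding itself boils down to masking the lower three bits of the
port. A minimal sketch, with hypothetical helper and macro names (the
actual patch may structure this differently):

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_SPORT_IDX_BITS 3u
#define MODEL_SPORT_IDX_MASK ((1u << MODEL_SPORT_IDX_BITS) - 1u)

/* Client side: place cp_index in the lower three bits of the port. */
static uint16_t model_sport_encode(uint16_t base_port, unsigned int cp_index)
{
	assert(cp_index <= MODEL_SPORT_IDX_MASK);
	return (uint16_t)((base_port & ~MODEL_SPORT_IDX_MASK) | cp_index);
}

/* Server side: recover cp_index from the peer's source port. */
static unsigned int model_sport_decode(uint16_t sport)
{
	return sport & MODEL_SPORT_IDX_MASK;
}
```

Three bits give eight distinct values, so this scheme can tell apart
at most eight paths per connection.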

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260203055723.1085751-3-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/rds: new extension header: rdma bytes

Introduce a new extension header type RDSV3_EXTHDR_RDMA_BYTES for
an RDMA initiator to communicate RDMA byte counts to its target.
Currently, RDMA operations cannot precisely account for how many bytes a
peer just transferred via RDMA, which limits per-connection statistics
and future policy (e.g., monitoring or rate/cgroup accounting of RDMA
traffic).

In this patch we expand rds_message_add_extension to accept multiple
extensions, and add a new flag to the RDS header: RDS_FLAG_EXTHDR_EXTENSION,
along with a new extension to the RDS header: rds_ext_header_rdma_bytes.

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260203055723.1085751-2-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net_sched: sch_fq: tweak unlikely() hints in fq_dequeue()

After 076433bd78d7 ("net_sched: sch_fq: add fast path
for mostly idle qdisc") we need to remove one unlikely()
because q->internal holds all the fast path packets.

       skb = fq_peek(&q->internal);
       if (unlikely(skb)) {
                q->internal.qlen--;

Calling INET_ECN_set_ce() is very unlikely.

These changes allow fq_dequeue_skb() to be (auto)inlined,
thus making fq_dequeue() faster.

$ scripts/bloat-o-meter -t vmlinux.0 vmlinux
add/remove: 2/2 grow/shrink: 0/1 up/down: 283/-269 (14)
Function                                     old     new   delta
INET_ECN_set_ce                                -     267    +267
__pfx_INET_ECN_set_ce                          -      16     +16
__pfx_fq_dequeue_skb                          16       -     -16
fq_dequeue_skb                               103       -    -103
fq_dequeue                                  1685    1535    -150
Total: Before=24886569, After=24886583, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260203214716.880853-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ovpn: Replace use of system_wq with system_percpu_wq

This patch continues the effort to refactor workqueue APIs, which has begun
with the changes introducing new workqueues and a new alloc_workqueue flag:

   commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
   commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")

The point of the refactoring is to eventually alter the default behavior of
workqueues to become unbound by default so that their workload placement is
optimized by the scheduler.

Before that can happen (after a careful review and conversion of each
individual case), workqueue users must be converted to the better named new
workqueues with no intended behaviour changes:

   system_wq -> system_percpu_wq
   system_unbound_wq -> system_dfl_wq

This way the old obsolete workqueues (system_wq, system_unbound_wq) can be
removed in the future.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Acked-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20251224155006.114824-1-marco.crivellari@suse.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/iucv: clean up iucv kernel-doc warnings

Fix numerous (many) kernel-doc warnings in iucv.[ch]:

- convert function documentation comments to a common (kernel-doc) look,
even for static functions (without "/**")
- use matching parameter and parameter description names
- use better wording in function descriptions (Jakub & AI)
- remove duplicate kernel-doc comments from the header file (Jakub)

Examples:

Warning: include/net/iucv/iucv.h:210 missing initial short description
on line: * iucv_unregister
Warning: include/net/iucv/iucv.h:216 function parameter 'handle' not
described in 'iucv_unregister'
Warning: include/net/iucv/iucv.h:467 function parameter 'answer' not
described in 'iucv_message_send2way'
Warning: net/iucv/iucv.c:727 missing initial short description on line:
* iucv_cleanup_queue

Build-tested with both "make htmldocs" and "make ARCH=s390 defconfig all".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20260203075248.1177869-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: Fix typo from clk_scr_i to clk_csr_i

In include/linux/stmmac.h clk_csr_i is spelled as clk_scr_i by mistake,
so correct the typo.

Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Reviewed-by: Yanteng Si <siyanteng@cqsoftware.com.cn>
Link: https://patch.msgid.link/20260203062658.2156653-1-chenhuacai@loongson.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: split tcp_check_space() in two parts

tcp_check_space() is fat and not inlined.

Move its slow path in (out of line) __tcp_check_space()
and make tcp_check_space() an inline function for better TCP performance.
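
The general pattern, shown on a made-up structure rather than the real
socket code: a tiny inline wrapper tests the common condition, and only
the rare case pays for a function call. The struct, its fields and the
function names are hypothetical; the attribute and builtin are
GCC/Clang extensions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the socket state that matters here. */
struct model_sock {
	bool nospace;         /* a writer is waiting for buffer space */
	unsigned int wakeups; /* how often the slow path ran */
};

/* Slow path: kept out of line so callers stay small. */
static __attribute__((noinline)) void model_check_space_slow(struct model_sock *sk)
{
	assert(sk->nospace); /* only reached when a writer is waiting */
	sk->nospace = false;
	sk->wakeups++;
}

/* Fast path: a single test that the compiler can inline everywhere. */
static inline void model_check_space(struct model_sock *sk)
{
	if (__builtin_expect(sk->nospace, 0))
		model_check_space_slow(sk);
}
```

This is why the bloat-o-meter output below shows the callers growing
slightly: each of them now carries its own copy of the inline test.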

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 2/2 grow/shrink: 4/0 up/down: 708/-582 (126)
Function                                     old     new   delta
__tcp_check_space                              -     521    +521
tcp_rcv_established                         1860    1916     +56
tcp_rcv_state_process                       3342    3384     +42
tcp_event_new_data_sent                      248     286     +38
tcp_data_snd_check                            71     106     +35
__pfx___tcp_check_space                        -      16     +16
__pfx_tcp_check_space                         16       -     -16
tcp_check_space                              566       -    -566
Total: Before=24896373, After=24896499, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260203050932.3522221-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: move tcp_rbtree_insert() to tcp_output.c

tcp_rbtree_insert() is primarily used from tcp_output.c
In tcp_input.c, only (slow path) tcp_collapse() uses it.

Move it to tcp_output.c to allow its (auto)inlining to improve
TCP tx fast path.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 4/1 up/down: 445/-115 (330)
Function                                     old     new   delta
tcp_connect                                 4277    4478    +201
tcp_event_new_data_sent                      162     248     +86
tcp_send_synack                              780     862     +82
tcp_fragment                                1185    1261     +76
tcp_collapse                                1524    1409    -115
Total: Before=24896043, After=24896373, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260203045110.3499713-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: use __skb_push() in __tcp_transmit_skb()

We trust MAX_TCP_HEADER to be large enough.

Using the inlined version of skb_push() trades 8 bytes
of text for better performance of TCP TX fast path.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 1/0 up/down: 8/0 (8)
Function                                     old     new   delta
__tcp_transmit_skb                          3181    3189      +8
Total: Before=24896035, After=24896043, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260203044226.3489941-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'wireless-next-2026-02-04' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Johannes Berg says:

====================
Some more changes, including pulls from drivers:
- ath drivers: small features/cleanups
- rtw drivers: mostly refactoring for rtw89 RTL8922DE support
- mac80211: use hrtimers for CAC to avoid too long delays
- cfg80211/mac80211: some initial UHR (Wi-Fi 8) support

* tag 'wireless-next-2026-02-04' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (59 commits)
  wifi: brcmsmac: phy: Remove unreachable error handling code
  wifi: mac80211: Add eMLSR/eMLMR action frame parsing support
  wifi: mac80211: add initial UHR support
  wifi: cfg80211: add initial UHR support
  wifi: ieee80211: add some initial UHR definitions
  wifi: mac80211: use wiphy_hrtimer_work for CAC timeout
  wifi: mac80211: correct ieee80211-{s1g/eht}.h include guard comments
  wifi: ath12k: clear stale link mapping of ahvif->links_map
  wifi: ath12k: Add support TX hardware queue stats
  wifi: ath12k: Add support RX PDEV stats
  wifi: ath12k: Fix index decrement when array_len is zero
  wifi: ath12k: support OBSS PD configuration for AP mode
  wifi: ath12k: add WMI support for spatial reuse parameter configuration
  dt-bindings: net: wireless: ath11k-pci: deprecate 'firmware-name' property
  wifi: ath11k: add usecase firmware handling based on device compatible
  wifi: ath10k: sdio: add missing lock protection in ath10k_sdio_fw_crashed_dump()
  wifi: ath10k: fix lock protection in ath10k_wmi_event_peer_sta_ps_state_chg()
  wifi: ath10k: snoc: support powering on the device via pwrseq
  wifi: rtw89: pci: warn if SPS OCP happens for RTL8922DE
  wifi: rtw89: pci: restore LDO setting after device resume
  ...
====================

Link: https://patch.msgid.link/20260204121143.181112-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'wireless-2026-02-04' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes Berg says:

====================
Two last-minute iwlwifi fixes:
- cancel mlo_scan_work on disassoc to avoid
   use-after-free/init-after-queue issues
- pause TCM work on suspend to avoid crashing
   the FW (and sometimes the host) on resume
   with traffic

* tag 'wireless-2026-02-04' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: iwlwifi: mvm: pause TCM on fast resume
  wifi: iwlwifi: mld: cancel mlo_scan_start_wk
====================

Link: https://patch.msgid.link/20260204113547.159742-4-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mptcp-misc-features-for-v6-20-7-0'

Matthieu Baerts says:

====================
mptcp: misc. features for v6.20/7.0

This series contains a few independent new features, and small fixes for
net-next:

- Patches 1-2: two small fixes linked to the MPTCP receive buffer that
   are not urgent, requiring code that has been recently changed, and is
   needed for the next patch. Because we are at the end of the cycle, it
   seems easier to send them to net-next, instead of dealing with
   conflicts between net and net-next.

- Patch 3: a refactoring to simplify the code around MPTCP DRS.

- Patch 4: a new trace event for MPTCP to help debugging receive buffer
   auto-tuning issues.

- Patch 5: align internal MPTCP PM structure with NL specs, just to
   manipulate the same thing.

- Patch 6: convert some min_t(int, ...) to min(): cleaner, and to avoid
   future warnings.

- Patch 7: [removed]

- Patch 8: sort all #include in MPTCP Diag tool in the selftests to
   prevent future potential conflicts and ease the reading.

- Patches 9-11: improve the MPTCP Join selftest by waiting for an event
   instead of a "random" sleep.

- Patches 12-14: some small cleanups in the selftests, seen while
   working on the previous patches.

- Patch 15: avoid marking subtests as skipped while still validating
   most checks when executing the last MPTCP selftests on older kernels.
====================

Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-0-31ec8bfc56d1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: join: no SKIP mark for group checks

When executing the last MPTCP selftests on older kernels, this output is
printed:

  # 001 no JOIN
  #       join Rx                             [SKIP]
  #       join Tx                             [SKIP]
  #       fallback                            [SKIP]

In fact, behind each line, a few counters are checked, and likely not
all of them have been skipped because they are not available on
these kernels. Instead, "new" and unsupported counters for these groups
are now ignored, and [ OK ] will be printed instead of [SKIP].

Note that on the MPTCP CI, when validating the dev versions, any
unsupported counter will cause the tests to fail. So this is safe not to
print 'SKIP' for these group checks.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-15-31ec8bfc56d1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: connect cleanup TFO setup

To set up TFO, only the file descriptor is needed, the family is not.

Also, the error can be handled the same way when 'sendto()' or
'connect()' are used. Only the printed error message is different.

This avoids a bit of confusion.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-14-31ec8bfc56d1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: join: avoid declaring i if not used

A few loops were declaring 'i', but this variable was not used.

To avoid confusion, use '_' instead: it is more explicit to mark that
this variable is not needed.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-13-31ec8bfc56d1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: join chk_stale_nr: avoid dup stats

nstat outputs are already printed when calling 'fail_test', no need to
do it again.

While at it, no need to use the dump_stats variable, print the extra
stats directly. And use 'ip -n $ns' instead of 'ip netns exec $ns',
shorter and clearer.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-12-31ec8bfc56d1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: join: userspace: wait for new events

Instead of waiting for a random amount of time (1 second), wait for an
event to be received on the other side.

To do that, when an address is announced (userspace_pm_add_addr), the
ANNOUNCED event is expected. When a new subflow is created
(userspace_pm_add_sf), the SUB_ESTABLISHED event is expected.

With this, the tests can finish quicker.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-11-31ec8bfc56d1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: join: fix wait_mpj helper

It looks like most of the time, this helper was simply waiting a bit
more than one second: the previous MPJoin counter was often already at
the expected value. So at the end, it was just checking 10 times for
the MPJoin counter to change, but it was not happening. For the tests,
that was fine: it was just waiting longer for nothing.

Instead, use 'wait_mpj' with the expected counter: in the tests, the MPJ
counter can easily be predicted. While at it, stop passing the netns as
argument: here the received MPJoin ACK is checked, which happens on the
server side. If later on, this needs to be checked on the client side,
the helper can be adapted for this case, but it is better to avoid
confusion now if it is not needed.

While at it, stop using 'i' for the variable if it is not used.

With this, the tests can finish quicker.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-10-31ec8bfc56d1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: join: wait for estab event instead of MPJ

'wait_mpj' was used just after having created a background connection,
but before creating new subflows. So no MPJ were sent. The intention was
to wait for the connection to be established, which was the same as
doing a simple sleep with a "random" value.

Instead, wait for an "established" event. With this, the tests can
finish quicker.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-9-31ec8bfc56d1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: diag: sort all #include

This file is the only one from this directory not to have all these
header inclusions sorted by type and alphabetical order.

Adapt them, to ease the reading, prevent conflicts during potential
future backport modifying these lines, and also to avoid having UAPI
header inclusions before libc ones, see [1].

Link: https://lore.kernel.org/20260120-uapi-sockaddr-v2-1-63c319111cf6@linutronix.de
Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-8-31ec8bfc56d1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: Change some dubious min_t(int, ...) to min()

There are two:

min_t(int, xxx, mptcp_wnd_end(msk) - msk->snd_nxt);

Both mptcp_wnd_end(msk) and msk->snd_nxt are u64, their difference
(aka the window size) might be limited to 32 bits - but that isn't
knowable from this code.
So checks being added to min_t() detect the potential discard of
significant bits.

Provided the 'avail_size' and return of mptcp_check_allowed_size()
are changed to an unsigned type (size_t matches the type the caller
uses) both min_t() can be changed to min().
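
A small user-space demonstration of the concern, not the MPTCP code
itself. "model_min_t" is a simplified stand-in for the kernel macro
(it evaluates its arguments more than once), and the two helpers are
hypothetical. The out-of-range conversion to int is
implementation-defined in C; the comments assume the usual GCC/Clang
behaviour of keeping the low 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static_assert(sizeof(uint64_t) > sizeof(int),
	      "the example relies on int being narrower than 64 bits");

/* Simplified stand-in for min_t(): cast both sides, then compare. */
#define model_min_t(type, a, b) \
	((type)(a) < (type)(b) ? (type)(a) : (type)(b))

/* Old style: the 64-bit window is cast to int before the comparison. */
static int model_avail_int(int len, uint64_t wnd_end, uint64_t snd_nxt)
{
	return model_min_t(int, len, wnd_end - snd_nxt);
}

/* New style: stay unsigned, so no significant bits are discarded. */
static size_t model_avail_unsigned(size_t len, uint64_t wnd_end,
				   uint64_t snd_nxt)
{
	uint64_t wnd = wnd_end - snd_nxt;

	return len < wnd ? len : (size_t)wnd;
}
```

With a window of exactly 2^32 bytes, the int variant sees a window of
0 on such compilers, while the unsigned variant returns the requested
length.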

Signed-off-by: David Laight <david.laight.linux@gmail.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
[ wrapped too long lines when declaring mptcp_check_allowed_size() ]
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-6-31ec8bfc56d1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: align endpoint flags size with the NL specs

The MPTCP Netlink specs describe the 'flags' as a u32 type. Internally,
a u8 type was used.

Using a u8 is currently fine, because only the first 5 bits are used.
But there is also no reason not to be aligned with the specs and
to stick to a u8, especially because there is a hole of 3 bytes after
it in both mptcp_pm_local and mptcp_pm_addr_entry structures.

Also, setting it to a u32 will allow future flags, just in case.
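
The effect of such a hole can be checked on a toy layout. The two
structs below are hypothetical and much smaller than the real ones;
they only assume that the flags field is followed by a 4-byte aligned
member, as on common ABIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout with a u8 flags field: 3 bytes of padding follow. */
struct model_entry_u8 {
	uint32_t id;
	uint8_t  flags;
	int32_t  ifindex;
};

/* Same layout with a u32 flags field: the padding is simply reused. */
struct model_entry_u32 {
	uint32_t id;
	uint32_t flags;
	int32_t  ifindex;
};

/* Widening the field must not grow the structure. */
static_assert(sizeof(struct model_entry_u8) == sizeof(struct model_entry_u32),
	      "u32 flags fit into the existing hole");
```

On the real structures, a tool such as pahole gives the same kind of
answer without writing any code.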

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-5-31ec8bfc56d1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

trace: mptcp: add mptcp_rcvbuf_grow tracepoint

Similar to tcp, provide a new tracepoint to better understand
mptcp_rcv_space_adjust() behavior, which presents many artifacts.

Note that the used format string is so long that I preferred
to wrap it, contrary to guidance for quoted strings.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-4-31ec8bfc56d1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: consolidate rcv space init

MPTCP uses several calls of the mptcp_rcv_space_init() helper to
initialize the receive space, with a catch-up call in
mptcp_rcv_space_adjust().

Drop all the other strictly not needed invocations and move constant
fields initialization at socket init/reset time.

This removes a bit of complexity from mptcp DRS code. No functional
changes intended.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-3-31ec8bfc56d1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: fix receive space timestamp initialization

MPTCP initializes the receive buffer stamp in mptcp_rcv_space_init(),
using the provided subflow stamp. Such helper is invoked in several
places; for passive sockets, space init happened at clone time.

In such scenario, MPTCP ends up accessing the subflow stamp before
its initialization, leading to quite random timing for the first
receive buffer auto-tune event, as the timestamp for newly created
subflow is not refreshed there.

Fix the issue by moving the stamp initialization out of the mentioned
helper, at the data transfer start, and always using a fresh timestamp.

Fixes: 013e3179dbd2 ("mptcp: fix rcv space initialization")
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-2-31ec8bfc56d1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: do not account for OoO in mptcp_rcvbuf_grow()

MPTCP-level OoOs are physiological when multiple subflows are active
concurrently and will not cause retransmissions nor are caused by
drops.

Accounting for them in mptcp_rcvbuf_grow() causes the rcvbuf to slowly
drift towards tcp_rmem[2].

Remove such accounting. Note that subflows will still account for TCP-level
OoO when the MPTCP-level rcvbuf is propagated.

This also closes a subtle and very unlikely race condition with rcvspace
init: active sockets with user-space holding the msk-level socket lock
could complete such initialization in the receive callback after the
first OoO data reaches the rcvbuf, potentially triggering a divide by
zero Oops.

Fixes: e118cdc34dd1 ("mptcp: rcvbuf auto-tuning improvement")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-1-31ec8bfc56d1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

vmw_vsock: bypass false-positive Wnonnull warning with gcc-16

The gcc-16.0.1 snapshot produces a false-positive warning that turns
into a build failure with CONFIG_WERROR:

In file included from arch/x86/include/asm/string.h:6,
                 from net/vmw_vsock/vmci_transport.c:10:
In function 'vmci_transport_packet_init',
    inlined from '__vmci_transport_send_control_pkt.constprop' at net/vmw_vsock/vmci_transport.c:198:2:
arch/x86/include/asm/string_32.h:150:25: error: argument 2 null where non-null expected because argument 3 is nonzero [-Werror=nonnull]
  150 | #define memcpy(t, f, n) __builtin_memcpy(t, f, n)
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~
net/vmw_vsock/vmci_transport.c:164:17: note: in expansion of macro 'memcpy'
  164 |                 memcpy(&pkt->u.wait, wait, sizeof(pkt->u.wait));
      |                 ^~~~~~
arch/x86/include/asm/string_32.h:150:25: note: in a call to built-in function '__builtin_memcpy'
net/vmw_vsock/vmci_transport.c:164:17: note: in expansion of macro 'memcpy'
  164 |                 memcpy(&pkt->u.wait, wait, sizeof(pkt->u.wait));
      |                 ^~~~~~

This seems relatively harmless, and it is so far the only instance of this
warning I have found. The __vmci_transport_send_control_pkt function
is called either with wait=NULL or with one of the type values that
pass 'wait' into memcpy() here, but not from the same caller.

Replacing the memcpy with a struct assignment is otherwise the same
but avoids the warning.
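
For illustration, the two forms side by side on simplified,
hypothetical types (not the real vmci_transport structures). Both copy
the same bytes; only the memcpy() variant exposes a pointer argument
that the compiler checks for NULL.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, simplified payload and packet types. */
struct model_wait_info {
	unsigned long generation;
	unsigned long offset;
};

struct model_packet {
	int type;
	union {
		unsigned long size;
		struct model_wait_info wait;
	} u;
};

/* Before: copy through memcpy(), which has nonnull pointer arguments. */
static void model_set_wait_memcpy(struct model_packet *pkt,
				  const struct model_wait_info *wait)
{
	assert(pkt && wait);
	memcpy(&pkt->u.wait, wait, sizeof(pkt->u.wait));
}

/* After: plain struct assignment, same result. */
static void model_set_wait_assign(struct model_packet *pkt,
				  const struct model_wait_info *wait)
{
	assert(pkt && wait);
	pkt->u.wait = *wait;
}
```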

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Bryan Tan <bryan-bt.tan@broadcom.com>
Link: https://patch.msgid.link/20260203163406.2636463-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: renesas,rzv2h-gbeth: Document Renesas RZ/G3L RMII{tx,rx} clocks

As per the RZ/G3L Hardware manual, CPG_CLKON_ETH register bits{12,13} are
to control the RMII{tx,rx} clocks. Document the rmii{tx,rx} clocks for
RZ/G3L SoC.

Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://patch.msgid.link/20260203104541.264759-1-biju.das.jz@bp.renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 's32g-use-a-syscon-for-gpr'

Dan Carpenter says:

====================
s32g: Use a syscon for GPR

The s32g devices have a GPR register region which holds a number of
miscellaneous registers.  Currently only the stmmac/dwmac-s32.c uses
anything from there and we just add a line to the device tree to
access that GMAC_0_CTRL_STS register:

                        reg = <0x4033c000 0x2000>, /* gmac IP */
                              <0x4007c004 0x4>;    /* GMAC_0_CTRL_STS */

I have included the whole list of registers below.

We still have to maintain backwards compatibility to this format,
of course, but it would be better to access these registers through a
syscon.  Putting all the registers together is more organized and shows
how the hardware actually is implemented.

Secondly, in some versions of this chipset those registers can only be
accessed via SCMI.  It's relatively straightforward to handle this
by writing a syscon driver and registering it with of_syscon_register_regmap()
but it's complicated to deal with if the registers aren't grouped
together.

Here is the whole list of registers in the GPR region:

Starting from 0x4007C000

0  Software-Triggered Faults (SW_NCF)
4  GMAC Control (GMAC_0_CTRL_STS)
28 CMU Status 1 (CMU_STATUS_REG1)
2C CMUs Status 2 (CMU_STATUS_REG2)
30 FCCU EOUT Override Clear (FCCU_EOUT_OVERRIDE_CLEAR_REG)
38 SRC POR Control (SRC_POR_CTRL_REG)
54 GPR21 (GPR21)
5C GPR23 (GPR23)
60 GPR24 Register (GPR24)
CC Debug Control (DEBUG_CONTROL)
F0 Timestamp Control (TIMESTAMP_CONTROL_REGISTER)
F4 FlexRay OS Tick Input Select (FLEXRAY_OS_TICK_INPUT_SELECT_REG)
FC GPR63 Register (GPR63)

Starting from 0x4007CA00

0  Coherency Enable for PFE Ports (PFE_COH_EN)
4  PFE EMAC Interface Mode (PFE_EMACX_INTF_SEL)
20 PFE EMACX Power Control (PFE_PWR_CTRL)
28 Error Injection on Cortex-M7 AHB and AXI Pipe (CM7_TCM_AHB_SLICE)
2C Error Injection AHBP Gasket Cortex-M7 (ERROR_INJECTION_AHBP_GASKET_CM7)
40 LLCE Subsystem Status (LLCE_STAT)
44 LLCE Power Control (LLCE_CTRL)
48 DDR Urgent Control (DDR_URGENT_CTRL)
4C FTM Global Load Control (FLXTIM_CTRL)
50 FTM LDOK Status (FLXTIM_STAT)
54 Top CMU Status (CMU_STAT)
58 Accelerator NoC No Pending Trans Status (NOC_NOPEND_TRANS)
90 SerDes RD/WD Toggle Control (PCIE_TOGGLE)
94 SerDes Toggle Done Status (PCIE_TOGGLEDONE_STAT)
E0 Generic Control 0 (GENCTRL0)
E4 Generic Control 1 (GENCTRL1)
F0 Generic Status 0 (GENSTAT0)
FC Cortex-M7 AXI Parity Error and AHBP Gasket Error Alarm (CM7_AXI_AHBP_GASKET_ERROR_ALARM)

Starting from 0x4007C800

4  GPR01 Register (GPR01)
30 GPR12 Register (GPR12)
58 GPR22 Register (GPR22)
70 GPR28 Register (GPR28)
74 GPR29 Register (GPR29)

Starting from 0x4007CB00

4  WKUP Pad Pullup/Pulldown Select (WKUP_PUS)
====================

Link: https://patch.msgid.link/cover.1769764941.git.dan.carpenter@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: nxp,s32-dwmac: Use the GPR syscon

The S32 chipsets have a GPR region which holds miscellaneous registers
including the GMAC_0_CTRL_STS register. Originally, this code accessed
that register in a sort of ad-hoc way, but it's cleaner to use a
syscon interface to access these registers.

We still need to maintain the old method of accessing the GMAC register
but using a syscon will let us access other registers more cleanly.

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/3b75e950b2f8faecd1a9fa757e7eb7b42ace838f.1769764941.git.dan.carpenter@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: s32: use a syscon for S32_PHY_INTF_SEL_RGMII

On the s32 chipsets the GMAC_0_CTRL_STS register is in the GPR region.
Originally, accessing this register was done in a sort of ad-hoc way,
but we want to use the syscon interface to do it.

This is a little bit ugly because we have to maintain backwards
compatibility with the old device trees, so we have to support both ways
of accessing this register.
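
As a rough userspace model of supporting both ways, the sketch below
prefers a syscon-style GPR block when one is present and falls back to
the directly mapped register otherwise. The struct, the function name and
the order of preference are assumptions made for illustration; the real
driver works with kernel regmap and ioremap'ed pointers rather than plain
arrays. Only the 0x4 offset comes from the register list in the cover
letter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offset of GMAC_0_CTRL_STS inside the GPR block at 0x4007C000. */
#define GMAC_0_CTRL_STS_OFF 0x4u

static_assert(GMAC_0_CTRL_STS_OFF % sizeof(uint32_t) == 0,
	      "register offset must be 32-bit aligned");

/* Hypothetical model of the two access paths, not the driver's types. */
struct gmac_ctrl_access {
	uint32_t *gpr_block;  /* new DT: base of the whole GPR region, or NULL */
	uint32_t *legacy_reg; /* old DT: GMAC_0_CTRL_STS mapped alone, or NULL */
};

/* Write the control register through whichever path is available. */
int gmac_ctrl_write(struct gmac_ctrl_access *a, uint32_t val)
{
	if (a->gpr_block) {
		/* syscon-style: index into the shared register block */
		a->gpr_block[GMAC_0_CTRL_STS_OFF / sizeof(uint32_t)] = val;
		return 0;
	}
	if (a->legacy_reg) {
		/* legacy: the second 'reg' entry points straight at the register */
		*a->legacy_reg = val;
		return 0;
	}
	return -1; /* neither description was provided */
}
```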

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Jan Petrous (OSS) <jan.petrous@oss.nxp.com>
Link: https://patch.msgid.link/b6b60d03344d070b2b4db7f0f00527f166e594e0.1769764941.git.dan.carpenter@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'stp-rstp-switch-support-for-pru-icssm-ethernet-driver'

Parvathi Pudi says:

====================
STP/RSTP SWITCH support for PRU-ICSSM Ethernet driver

The DUAL-EMAC patch series for Megabit Industrial Communication Sub-system
(ICSSM), which provides the foundational support for Ethernet functionality
over PRU-ICSS on the TI SOCs (AM335x, AM437x, and AM57x), was merged into
net-next recently [1].

This patch series enhances the PRU-ICSSM Ethernet driver to support bridge
(STP/RSTP) SWITCH mode, which has been implemented using the "switchdev"
framework and interacts with the "mstp daemon" for STP and RSTP management
in userspace.

When the SWITCH mode is enabled, forwarding of Ethernet packets using
either the traditional store-and-forward mechanism or via cut-through is
offloaded to the two PRU based Ethernet interfaces available within the
ICSSM. The firmware running on the PRU inspects the bridge port states and
performs necessary checks before forwarding a packet. This improves the
overall system performance and significantly reduces the packet forwarding
latency.

Protocol switching from Dual-EMAC to bridge (STP/RSTP) SWITCH mode can be
done as follows.

Assuming eth2 and eth3 are the two physical ports of the ICSS2 instance:

>> brctl addbr br0
>> ip maddr add 01:80:c2:00:00:00 dev br0
>> ip link set dev br0 address $(cat /sys/class/net/eth2/address)
>> brctl addif br0 eth2
>> brctl addif br0 eth3
>> mstpd
>> brctl stp br0 on
>> mstpctl setforcevers br0 rstp
>> ip link set dev br0 up

To revert back to the default dual EMAC mode, the steps are as follows:

>> ip link set dev br0 down
>> brctl delif br0 eth2
>> brctl delif br0 eth3
>> brctl delbr br0

The patches presented in this series have gone through the patch verification
tools and no warnings or errors are reported.
====================

Link: https://patch.msgid.link/20260130124559.1182780-1-parvathi@couthit.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ti: icssm-prueth: Add support for ICSSM RSTP switch

Add support for RSTP switch mode by enhancing the existing ICSSM dual EMAC
driver with switchdev support.

Enable the PRU-ICSSM to operate in switch mode, with the 2 PRU ports acting
as external ports and the host acting as an internal port. Packets received
from the PRU ports will be forwarded to the host (store and forward mode)
and also to the other PRU port (either using store and forward mode or via
cut-through mode). Packets coming from the host will be transmitted either
from one or both of the PRU ports (depending on the FDB decision).

By default, the dual EMAC firmware will be loaded in the PRU-ICSS
subsystem. To configure the PRU-ICSS to operate as a switch, a different
firmware must be loaded.

Signed-off-by: Roger Quadros <rogerq@ti.com>
Signed-off-by: Andrew F. Davis <afd@ti.com>
Signed-off-by: Basharath Hussain Khaja <basharath@couthit.com>
Signed-off-by: Parvathi Pudi <parvathi@couthit.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260130124559.1182780-4-parvathi@couthit.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ti: icssm-prueth: Add switchdev support for icssm_prueth driver

Add support for offloading the RSTP switch feature to the PRU-ICSS
subsystem by adding switchdev support. PRU-ICSS is capable of operating
in RSTP switch mode with two external ports and one host port.

PRUETH driver and firmware interface support will be added into
icssm_prueth in the subsequent commits.

Signed-off-by: Roger Quadros <rogerq@ti.com>
Signed-off-by: Andrew F. Davis <afd@ti.com>
Signed-off-by: Basharath Hussain Khaja <basharath@couthit.com>
Signed-off-by: Parvathi Pudi <parvathi@couthit.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260130124559.1182780-3-parvathi@couthit.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ti: icssm-prueth: Add helper functions to configure and maintain FDB

Introduce helper functions to configure and maintain Forwarding
Database (FDB) tables to aid with the switch mode feature for PRU-ICSS
ports. The PRU-ICSS FDB is maintained such that it is always in sync with
the Linux bridge driver FDB.

The FDB is used by the driver to determine whether to flood a packet,
received from the user plane, to both ports or direct it to a specific port
using the flags in the FDB table entry.

The FDB is implemented in two main components: the Index table and the
MAC Address table. Adding, deleting, and maintaining entries are handled
by the PRUETH driver. There are two types of entries:

Dynamic: created from the received packets and are subject to aging.
Static: created by the user and these entries never age out.

An 8-bit hash value obtained from the source MAC address is used as the
index into the Index/Hash table. A bucket-based approach is used to
collate source MAC addresses with the same hash value. The Index/Hash table
holds the bucket index (16-bit value) and the number of entries in the
bucket with the same hash value (16-bit value). This table can hold up to
256 entries, with each entry consuming 4 bytes of memory. The bucket index
value points to the MAC address table indicating the start of MAC addresses
having the same hash values.

Each entry in the MAC Address table consists of:
1. 6 bytes of the MAC address,
2. 2-byte aging time, and
3. 1-byte each for port information and flags respectively.

When a new entry is added to the FDB, the hash value is calculated using an
XOR operation on the 6-byte MAC address. The result is used as an index
into the Hash/Index table to check if any entries exist. If no entries are
present, the first available empty slot in the MAC Address table is
allocated to insert this MAC address. If entries with the same hash value
are already present, the new MAC address entry is added to the MAC Address
table in such a way that it ensures all entries are grouped together and
sorted in ascending MAC address order. This approach helps efficiently
manage FDB entries.
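
The insertion scheme described above can be sketched in plain C as
follows. This is an illustration under stated assumptions, not the driver
code: the struct and function names, the field order, the table capacity
and the choice to keep the MAC table densely packed are all hypothetical,
and the hash is assumed to be a byte-wise XOR fold of the six MAC bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FDB_HASH_SIZE 256 /* 8-bit hash gives 256 index entries */
#define FDB_MAC_MAX   256 /* hypothetical MAC table capacity */

struct fdb_index_entry {  /* 4 bytes per entry */
	uint16_t bucket_idx;     /* first MAC table slot of this bucket */
	uint16_t bucket_entries; /* number of MACs sharing this hash */
};

struct fdb_mac_entry {    /* 6 + 2 + 1 + 1 = 10 bytes per entry */
	uint8_t  mac[6];
	uint16_t age;
	uint8_t  port;
	uint8_t  flags;
};

static_assert(sizeof(struct fdb_index_entry) == 4, "index entry is 4 bytes");
static_assert(sizeof(struct fdb_mac_entry) == 10, "MAC entry is 10 bytes");

struct fdb {
	struct fdb_index_entry index[FDB_HASH_SIZE];
	struct fdb_mac_entry   mac[FDB_MAC_MAX];
	uint16_t total; /* used slots; kept contiguous in this sketch */
};

/* Assumed hash: XOR of all six MAC bytes folded into 8 bits. */
uint8_t fdb_hash(const uint8_t mac[6])
{
	uint8_t h = 0;
	for (int i = 0; i < 6; i++)
		h ^= mac[i];
	return h;
}

/* Returns 1 if inserted, 0 if already present, -1 if the table is full. */
int fdb_insert(struct fdb *t, const uint8_t mac[6], uint8_t port, uint8_t flags)
{
	struct fdb_index_entry *ie = &t->index[fdb_hash(mac)];
	uint16_t pos;

	if (t->total >= FDB_MAC_MAX)
		return -1;

	if (ie->bucket_entries == 0) {
		/* New hash value: take the first free slot. */
		pos = t->total;
		ie->bucket_idx = pos;
	} else {
		/* Existing bucket: find the ascending-order position. */
		uint16_t end = ie->bucket_idx + ie->bucket_entries;
		for (pos = ie->bucket_idx; pos < end; pos++) {
			int c = memcmp(mac, t->mac[pos].mac, 6);
			if (c == 0)
				return 0;
			if (c < 0)
				break;
		}
		/* Open a gap and fix up every other bucket that moved. */
		memmove(&t->mac[pos + 1], &t->mac[pos],
			(size_t)(t->total - pos) * sizeof(t->mac[0]));
		for (int i = 0; i < FDB_HASH_SIZE; i++) {
			struct fdb_index_entry *o = &t->index[i];
			if (o != ie && o->bucket_entries && o->bucket_idx >= pos)
				o->bucket_idx++;
		}
	}

	memcpy(t->mac[pos].mac, mac, 6);
	t->mac[pos].age = 0;
	t->mac[pos].port = port;
	t->mac[pos].flags = flags;
	ie->bucket_entries++;
	t->total++;
	return 1;
}
```

Keeping each bucket contiguous and sorted means a lookup only has to scan
bucket_entries slots starting at bucket_idx, at the cost of shifting
entries on insert.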

Signed-off-by: Roger Quadros <rogerq@ti.com>
Signed-off-by: Andrew F. Davis <afd@ti.com>
Signed-off-by: Basharath Hussain Khaja <basharath@couthit.com>
Signed-off-by: Parvathi Pudi <parvathi@couthit.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260130124559.1182780-2-parvathi@couthit.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>