Danielle Ratson [Wed, 21 Jan 2026 11:46:44 +0000 (13:46 +0200)]
selftests: net: Add kernel selftest for RFC 4884
RFC 4884 extended certain ICMP messages with a length attribute that
encodes the length of the "original datagram" field. This is needed so
that new information could be appended to these messages without
applications thinking that it is part of the "original datagram" field.
In version 5.9, the kernel was extended with two new socket options
(SOL_IP/IP_RECVERR_RFC4884 and SOL_IPV6/IPV6_RECVERR_RFC4884) that allow
user space to retrieve this length, which is essentially the offset of the
ICMP Extension Structure at the end of the ICMP message. This is
required by user space applications that need to parse the information
contained in the ICMP Extension Structure. For example, the RFC 5837
extension for tracepath.
Add a selftest that verifies correct handling of the RFC 4884 length
field for both IPv4 and IPv6, with and without extension structures,
and validates that malformed extensions are correctly reported as invalid.
For each address family, the test creates:
- a raw socket used to send locally crafted ICMP error packets to the
loopback address, and
- a datagram socket used to receive the encapsulated original datagram
and associated error metadata from the kernel error queue.
ICMP packets are constructed entirely in user space rather than relying
on kernel-generated errors. This allows the test to exercise invalid
scenarios (such as corrupted checksums and incorrect length fields) and
verify that the SO_EE_RFC4884_FLAG_INVALID flag is set as expected.
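For context, a minimal user-space sketch of the receive side of this
mechanism could look as follows. This is illustrative only, not the
selftest itself: it enables RFC 4884 reporting on an IPv4 datagram socket
and reads the reported offset and validity flag from the error queue. The
fallback define assumes the uapi value of IP_RECVERR_RFC4884; error
handling is omitted.

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/errqueue.h>

#ifndef IP_RECVERR_RFC4884
#define IP_RECVERR_RFC4884 26	/* uapi value, in case libc headers lag */
#endif

/* Enable RFC 4884 reporting on an IPv4 datagram socket and dump the
 * reported offset and validity flag from the error queue. */
static void read_rfc4884_info(int fd)
{
	char data[1024], ctrl[512];
	struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = ctrl, .msg_controllen = sizeof(ctrl),
	};
	struct cmsghdr *cm;
	int one = 1;

	setsockopt(fd, SOL_IP, IP_RECVERR, &one, sizeof(one));
	setsockopt(fd, SOL_IP, IP_RECVERR_RFC4884, &one, sizeof(one));

	if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
		return;

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		struct sock_extended_err *ee;

		if (cm->cmsg_level != SOL_IP || cm->cmsg_type != IP_RECVERR)
			continue;

		ee = (struct sock_extended_err *)CMSG_DATA(cm);
		printf("rfc4884: len=%u invalid=%d\n",
		       (unsigned int)ee->ee_rfc4884.len,
		       !!(ee->ee_rfc4884.flags & SO_EE_RFC4884_FLAG_INVALID));
	}
}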
Output Example:
$ ./icmp_rfc4884
Starting 18 tests from 18 test cases.
RUN rfc4884.ipv4_ext_small_payload.rfc4884 ...
OK rfc4884.ipv4_ext_small_payload.rfc4884
ok 1 rfc4884.ipv4_ext_small_payload.rfc4884
RUN rfc4884.ipv4_ext.rfc4884 ...
OK rfc4884.ipv4_ext.rfc4884
ok 2 rfc4884.ipv4_ext.rfc4884
RUN rfc4884.ipv4_ext_large_payload.rfc4884 ...
OK rfc4884.ipv4_ext_large_payload.rfc4884
ok 3 rfc4884.ipv4_ext_large_payload.rfc4884
RUN rfc4884.ipv4_no_ext_small_payload.rfc4884 ...
OK rfc4884.ipv4_no_ext_small_payload.rfc4884
ok 4 rfc4884.ipv4_no_ext_small_payload.rfc4884
RUN rfc4884.ipv4_no_ext_min_payload.rfc4884 ...
OK rfc4884.ipv4_no_ext_min_payload.rfc4884
ok 5 rfc4884.ipv4_no_ext_min_payload.rfc4884
RUN rfc4884.ipv4_no_ext_large_payload.rfc4884 ...
OK rfc4884.ipv4_no_ext_large_payload.rfc4884
ok 6 rfc4884.ipv4_no_ext_large_payload.rfc4884
RUN rfc4884.ipv4_invalid_ext_checksum.rfc4884 ...
OK rfc4884.ipv4_invalid_ext_checksum.rfc4884
ok 7 rfc4884.ipv4_invalid_ext_checksum.rfc4884
RUN rfc4884.ipv4_invalid_ext_length_small.rfc4884 ...
OK rfc4884.ipv4_invalid_ext_length_small.rfc4884
ok 8 rfc4884.ipv4_invalid_ext_length_small.rfc4884
RUN rfc4884.ipv4_invalid_ext_length_large.rfc4884 ...
OK rfc4884.ipv4_invalid_ext_length_large.rfc4884
ok 9 rfc4884.ipv4_invalid_ext_length_large.rfc4884
RUN rfc4884.ipv6_ext_small_payload.rfc4884 ...
OK rfc4884.ipv6_ext_small_payload.rfc4884
ok 10 rfc4884.ipv6_ext_small_payload.rfc4884
RUN rfc4884.ipv6_ext.rfc4884 ...
OK rfc4884.ipv6_ext.rfc4884
ok 11 rfc4884.ipv6_ext.rfc4884
RUN rfc4884.ipv6_ext_large_payload.rfc4884 ...
OK rfc4884.ipv6_ext_large_payload.rfc4884
ok 12 rfc4884.ipv6_ext_large_payload.rfc4884
RUN rfc4884.ipv6_no_ext_small_payload.rfc4884 ...
OK rfc4884.ipv6_no_ext_small_payload.rfc4884
ok 13 rfc4884.ipv6_no_ext_small_payload.rfc4884
RUN rfc4884.ipv6_no_ext_min_payload.rfc4884 ...
OK rfc4884.ipv6_no_ext_min_payload.rfc4884
ok 14 rfc4884.ipv6_no_ext_min_payload.rfc4884
RUN rfc4884.ipv6_no_ext_large_payload.rfc4884 ...
OK rfc4884.ipv6_no_ext_large_payload.rfc4884
ok 15 rfc4884.ipv6_no_ext_large_payload.rfc4884
RUN rfc4884.ipv6_invalid_ext_checksum.rfc4884 ...
OK rfc4884.ipv6_invalid_ext_checksum.rfc4884
ok 16 rfc4884.ipv6_invalid_ext_checksum.rfc4884
RUN rfc4884.ipv6_invalid_ext_length_small.rfc4884 ...
OK rfc4884.ipv6_invalid_ext_length_small.rfc4884
ok 17 rfc4884.ipv6_invalid_ext_length_small.rfc4884
RUN rfc4884.ipv6_invalid_ext_length_large.rfc4884 ...
OK rfc4884.ipv6_invalid_ext_length_large.rfc4884
ok 18 rfc4884.ipv6_invalid_ext_length_large.rfc4884
PASSED: 18 / 18 tests passed.
Totals: pass:18 fail:0 xfail:0 xpass:0 skip:0 error:0
====================
net: stmmac: dwmac: enforce preamble before SFD for i.MX8MP
This series adds a new phy_device flag PHY_F_KEEP_PREAMBLE_BEFORE_SFD
that allows a MAC driver to request to keep the preamble bytes before
the start frame delimiter (SFD) when receiving frames from the PHY.
This flag is set in the stmmac driver for the i.MX8MP SoC due to errata
ERR050694, which causes the MAC to drop frames received without a
preamble.
The Micrel KSZ9131 PHY supports keeping the preamble before the SFD by
setting an undocumented flag, as confirmed by NXP and Micrel. This
feature has been added to the Micrel PHY driver for the KSZ9131 PHY.
====================
net: stmmac: dwmac-imx: keep preamble before sfd on i.MX8MP
The stmmac implementation used by NXP for the i.MX8MP SoC is subject to
errata ERR050694. According to this errata, when no preamble byte is
transferred before the SFD from the PHY to the MAC, the MAC will discard
the frame.
Setting the PHY_F_KEEP_PREAMBLE_BEFORE_SFD flag instructs PHYs that
support it to keep the preamble byte before the SFD. This ensures that
the MAC successfully receives frames.
As this is an issue in the MAC implementation, only enable the flag for
the i.MX8MP SoC where the errata applies but not for other SoCs using a
working stmmac implementation.
The exact wording of the errata ERR050694 from NXP:
The IEEE 802.3 standard states that, in MII/GMII modes, the byte
preceding the SFD (0xD5), SMD-S (0xE6,0x4C, 0x7F, or 0xB3), or SMD-C
(0x61, 0x52, 0x9E, or 0x2A) byte can be a non-PREAMBLE byte or there can
be no preceding preamble byte. The MAC receiver must successfully
receive a packet without any preamble(0x55) byte preceding the SFD,
SMD-S, or SMD-C byte.
However due to the defect, in configurations where frame preemption is
enabled, when preamble byte does not precede the SFD, SMD-S, or SMD-C
byte, the received packet is discarded by the MAC receiver. This is
because, the start-of-packet detection logic of the MAC receiver
incorrectly checks for a preamble byte.
NXP refers to IEEE 802.3 where, in clause 35.2.3.2.2 Receive case (GMII),
they show two tables: one where the preamble precedes the SFD and one
where it does not. The text says:
The operation of 1000 Mb/s PHYs can result in shrinkage of the preamble
between transmission at the source GMII and reception at the destination
GMII. Table 35-3 depicts the case where no preamble bytes are conveyed
across the GMII. This case may not be possible with a specific PHY, but
illustrates the minimum preamble with which MAC shall be able to
operate. Table 35-4 depicts the case where the entire preamble is
conveyed across the GMII.
This workaround was tested on a Verdin iMX8MP by enforcing 10 MBit/s:
ethtool -s end0 speed 10
Without keeping the preamble, no packets were received. With keeping the
preamble, everything worked as expected.
Signed-off-by: Stefan Eichenberger <stefan.eichenberger@toradex.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260120203905.23805-4-eichest@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: phy: micrel: add option to keep the preamble before sfd for KSZ9131
If the PHY_F_KEEP_PREAMBLE_BEFORE_SFD flag is set in the
phy_device::dev_flags field, the preamble will be kept before the start
frame delimiter (SFD) on the KSZ9131 PHY. This flag is not officially
documented by Micrel. However, information provided by NXP and Micrel
indicates that this flag ensures the PHY sends the full preamble instead
of removing it. The full discussion can be found on the NXP forum:
https://community.nxp.com/t5/i-MX-Processors/iMX8MP-eqos-not-working-for-10base-t/m-p/2151032
Signed-off-by: Stefan Eichenberger <stefan.eichenberger@toradex.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260120203905.23805-3-eichest@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: phy: add a new phy_device flag to keep preamble before sfd
Add a new flag, PHY_F_KEEP_PREAMBLE_BEFORE_SFD, to indicate that the PHY
shall not remove the preamble before the SFD if it supports it. MACs
that do not support receiving frames without a preamble can set this
flag.
Signed-off-by: Stefan Eichenberger <stefan.eichenberger@toradex.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260120203905.23805-2-eichest@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 22 Jan 2026 22:40:38 +0000 (14:40 -0800)]
Merge tag 'nf-next-26-01-22' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
netfilter: updates for net-next
There is an issue with interval matching in nftables rbtree set type:
When userspace sends us set updates, there is a brief window where
false negative lookups may occur from the data plane. Quoting Pablos
original cover letter:
This series addresses this issue by translating the rbtree, which keeps
the intervals in order, into an array for binary search. The array is
published to the packet path through RCU. The idea is to keep using the
rbtree datastructure for the control plane, which needs to deal with
updates, then generate an array from this rbtree for binary search lookups.
Patch #1 allows .remove to be called even when .abort is defined, which is
needed by this new approach. Only pipapo needs to skip .remove for speed.
Patch #2 adds the binary search array approach for interval matching.
Patch #3 updates .get to use the binary search array to find the
(closest or exact) matching interval.
Patch #4 removes seqcount_rwlock_t as it is not needed anymore (new in
this series).
* tag 'nf-next-26-01-22' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nft_set_rbtree: remove seqcount_rwlock_t
netfilter: nft_set_rbtree: use binary search array in get command
netfilter: nft_set_rbtree: translate rbtree to array for binary search
netfilter: nf_tables: add .abort_skip_removal flag for set types
====================
netfilter: nft_set_rbtree: use binary search array in get command
Rework the .get interface to use the binary search array; this needs a
specific lookup function to match on end intervals (<=). Packet path lookup
is slightly different because the match is on a lesser value, not equal (i.e. <).
After this patch, seqcount can be removed in a follow up patch.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
netfilter: nft_set_rbtree: translate rbtree to array for binary search
The rbtree can temporarily store overlapping inactive elements during
the transaction processing, leading to false negative lookups.
To address this issue, this patch adds a .commit function that walks
the rbtree to build an ordered array of intervals. This
conversion compacts the two singleton elements that represent the start
and the end of the interval into a single interval object for space
efficiency.
Binary search is O(log n), similar to rbtree lookup time; therefore,
performance numbers should be similar. An implementation is available
under lib/bsearch.c and include/linux/bsearch.h and is used for this
purpose.
This slightly increases memory consumption for this new array that
stores pointers to the start and the end of the interval.
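As a rough illustration of the lookup idea (not the actual nft_set_rbtree
code; the structure and function names below are made up), a compacted
interval array can be searched with the kernel's bsearch() helper. This
relies on the array holding non-overlapping intervals sorted by start,
which is what the .commit conversion produces:

#include <linux/bsearch.h>
#include <linux/types.h>

/* Hypothetical compacted interval: one object per start/end pair. */
struct iv_entry {
	u32 start;
	u32 end;
};

/* bsearch() comparison: 0 when the key falls inside the interval. */
static int iv_cmp(const void *key, const void *elem)
{
	const u32 k = *(const u32 *)key;
	const struct iv_entry *iv = elem;

	if (k < iv->start)
		return -1;
	if (k > iv->end)
		return 1;
	return 0;
}

/* O(log n) lookup over an array sorted by interval start. */
static const struct iv_entry *iv_lookup(const struct iv_entry *arr,
					size_t n, u32 key)
{
	return bsearch(&key, arr, n, sizeof(*arr), iv_cmp);
}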
With this patch:
# time nft -f 100k-intervals-set.nft
real 0m4.218s
user 0m3.544s
sys 0m0.400s
Without this patch:
# time nft -f 100k-intervals-set.nft
real 0m3.920s
user 0m3.547s
sys 0m0.276s
With this patch, with IPv4 intervals:
baseline rbtree (match on first field only): 15254954pps
Without this patch:
baseline rbtree (match on first field only): 10256119pps
This provides a ~50% improvement in matching intervals from packet path.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
netfilter: nf_tables: add .abort_skip_removal flag for set types
The pipapo set backend is the only user of the .abort interface so far.
To speed up the pipapo abort path, removals are skipped.
The follow-up patch updates the rbtree to build an array of
ordered elements, then use binary search. This needs a new .abort
interface but, unlike pipapo, it also needs to undo/remove elements.
Add a flag and use it from the pipapo set backend.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
net: atp: drop ancient parallel-port Ethernet driver
This driver is old and almost certainly entirely unused. The two other
parallel port Ethernet drivers (de600/de620) were removed by Paul
Gortmaker in commit 168e06ae26dd ("drivers/net: delete old parallel
port de600/de620 drivers"), but this driver remained. Drop it - Paul's
reasoning applies here as well. To quote him:
"The parallel port is largely replaced by USB [...] Let us not pretend
that anyone cares about these drivers anymore, or worse - pretend that
anyone is using them on a modern kernel."
Will Rosenberg [Tue, 20 Jan 2026 15:57:38 +0000 (08:57 -0700)]
cipso: harden use of skb_cow() in cipso_v4_skbuff_setattr()
If skb_cow() is passed a headroom <= -NET_SKB_PAD, it will trigger a
BUG. As a result, callers should avoid passing a negative headroom to
prevent triggering this issue.
This is the same code pattern fixed in Commit 58fc7342b529 ("ipv6:
BUG() in pskb_expand_head() as part of calipso_skbuff_setattr()").
In cipso_v4_skbuff_setattr(), len_delta can become negative, leading to
a negative headroom passed to skb_cow(). However, the BUG is not
triggerable because the condition headroom <= -NET_SKB_PAD cannot be
satisfied due to limits on the IPv4 options header size.
Avoid potential problems in the future by only using skb_cow() to grow
the skb headroom.
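A hedged sketch of the pattern being described (illustrative, with a
hypothetical helper name, not the exact cipso_v4_skbuff_setattr() code):
clamp the delta so skb_cow() is only ever asked to grow headroom.

#include <linux/skbuff.h>

/* Hypothetical helper mirroring the commit's intent: only grow headroom. */
static int cipso_cow_grow_only(struct sk_buff *skb, int len_delta)
{
	/* A negative delta means the header shrinks; never hand a negative
	 * headroom request to skb_cow(). */
	return skb_cow(skb, len_delta > 0 ? len_delta : 0);
}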
Paolo Abeni [Thu, 22 Jan 2026 10:20:34 +0000 (11:20 +0100)]
Merge branch 'a-series-of-minor-optimizations-of-the-bonding-module'
Tonghao Zhang says:
====================
A series of minor optimizations of the bonding module
These patches mainly target the peer notify mechanism of the bonding module.
Including updates of peer notify, lock races, etc. For more information, please
refer to the patch.
Cc: Jay Vosburgh <jv@jvosburgh.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Lunn <andrew+netdev@lunn.ch> Cc: Nikolay Aleksandrov <razor@blackwall.org> Cc: Hangbin Liu <liuhangbin@gmail.com> Cc: Jason Xing <kerneljasonxing@gmail.com>
====================
Tonghao Zhang [Sun, 18 Jan 2026 04:21:14 +0000 (12:21 +0800)]
net: bonding: add the READ_ONCE/WRITE_ONCE for outside lock accessing
Although operations on the variable send_peer_notif are already within
a lock-protected critical section, there are cases where it is accessed
outside the lock. Therefore, READ_ONCE() and WRITE_ONCE() should be
added to it.
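The pattern in question is sketched below with hypothetical helper names
(not the actual bonding code): READ_ONCE() for the lockless readers is
paired with WRITE_ONCE() for the writers so the compiler can neither tear
nor cache the access.

#include <linux/compiler.h>
#include <net/bonding.h>

/* Lockless reader of bond->send_peer_notif. */
static bool peer_notif_pending(const struct bonding *bond)
{
	return READ_ONCE(bond->send_peer_notif) != 0;
}

/* Writer paired with the reader above. */
static void peer_notif_clear(struct bonding *bond)
{
	WRITE_ONCE(bond->send_peer_notif, 0);
}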
Cc: Jay Vosburgh <jv@jvosburgh.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Lunn <andrew+netdev@lunn.ch> Cc: Nikolay Aleksandrov <razor@blackwall.org> Cc: Hangbin Liu <liuhangbin@gmail.com> Cc: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/c1dcc53442f4d0f67beb9e0a3e7a7a6a2c94c47f.1768709239.git.tonghao@bamaicloud.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tonghao Zhang [Sun, 18 Jan 2026 04:21:13 +0000 (12:21 +0800)]
net: bonding: skip the 2nd trylock when first one fail
After the first trylock fails, retrying immediately is
not advised, as there is a high probability of failing
to acquire the lock again. This optimization makes sense.
Cc: Jay Vosburgh <jv@jvosburgh.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Lunn <andrew+netdev@lunn.ch> Cc: Nikolay Aleksandrov <razor@blackwall.org> Cc: Hangbin Liu <liuhangbin@gmail.com> Cc: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/9aba44f02163e8fe8dbaba63ff2df921bc2b114e.1768709239.git.tonghao@bamaicloud.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tonghao Zhang [Sun, 18 Jan 2026 04:21:12 +0000 (12:21 +0800)]
net: bonding: move bond_should_notify_peers into rtnl lock block
This patch tries to avoid possible loss of peer notify events.
In bond_mii_monitor()/bond_activebackup_arp_mon(), when we hold the rtnl lock:
- check send_peer_notif again to avoid unconditionally reducing this value.
- send_peer_notif may have been reset. Therefore, it is necessary to check
whether to send peer notify via bond_should_notify_peers() to avoid the
loss of notification events.
Cc: Jay Vosburgh <jv@jvosburgh.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Lunn <andrew+netdev@lunn.ch> Cc: Nikolay Aleksandrov <razor@blackwall.org> Cc: Hangbin Liu <liuhangbin@gmail.com> Cc: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/78cef328822b94638c97638b89011c507b8bf19e.1768709239.git.tonghao@bamaicloud.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tonghao Zhang [Sun, 18 Jan 2026 04:21:11 +0000 (12:21 +0800)]
net: bonding: use workqueue to make sure peer notify updated in lacp mode
The rtnl lock might already be held, preventing ad_cond_set_peer_notif() from
acquiring the lock and updating send_peer_notif. This patch addresses
the issue by using a workqueue. Since updating send_peer_notif does
not require high real-time performance, such delayed updates are entirely
acceptable.
In fact, all the places that check and use this value are protected
by the rtnl lock, such as:
- read send_peer_notif
- send_peer_notif--
- bond_should_notify_peers
Note that the rtnl lock is still required when accessing bond.params.* to
update send_peer_notif. In lacp mode, resetting send_peer_notif in a
workqueue is a safe, simple and effective way to do this.
Additionally, this patch introduces bond_peer_notify_may_events(), which
is used to check whether an event should be sent. This function will be
used in both patch 1 and 2.
Cc: Jay Vosburgh <jv@jvosburgh.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Lunn <andrew+netdev@lunn.ch> Cc: Nikolay Aleksandrov <razor@blackwall.org> Cc: Hangbin Liu <liuhangbin@gmail.com> Cc: Jason Xing <kerneljasonxing@gmail.com> Suggested-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/f95accb5db0b10ce3ed2f834fc70f716c9abbb9c.1768709239.git.tonghao@bamaicloud.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Thu, 22 Jan 2026 04:23:11 +0000 (20:23 -0800)]
Merge tag 'nf-next-26-01-20' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
netfilter: updates for net-next
1) Speed up nftables transactions after earlier transaction failed.
Due to a (harmless) bug we remained in slow paranoia mode until
a successful transaction completes.
2) Allow generic tracker to resolve clashes, this avoids very rare
packet drops. From Yuto Hamaguchi.
3) Increase the cleanup budget to 64 entries in nf_conncount to reap
more entries in one go, from Fernando Fernandez Mancera.
4) Allow icmp trackers to resolve clashes, this avoids very rare
initial packet drop with test cases that have high-frequency pings.
After this all trackers except tcp and sctp allow clash resolution.
5) Disentangle netfilter headers, don't include nftables/xtables headers
in subsystems that are unrelated.
6) Don't rely on implicit includes coming from nf_conntrack_proto_gre.h.
7) Allow nfnetlink_queue nfq instance struct to get accounted via memcg,
from Scott Mitchell.
8) Reject bogus xt target/match data upfront via netlink policy in
nft_compat interface rather than relying on x_tables API to do it.
9) Fix nf_conncount breakage when trying to limit loopback flows via
prerouting rule, from Fernando Fernandez Mancera.
This is a recent breakage but not seen as urgent enough to rush this
via net tree at this late stage in development cycle.
10) Fix a possible off-by-one when parsing tcp option in xtables tcpmss
match. Also handled via -next due to late stage in development
cycle.
* tag 'nf-next-26-01-20' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: xt_tcpmss: check remaining length before reading optlen
netfilter: nf_conncount: fix tracking of connections from localhost
netfilter: nft_compat: add more restrictions on netlink attributes
netfilter: nfnetlink_queue: nfqnl_instance GFP_ATOMIC -> GFP_KERNEL_ACCOUNT allocation
netfilter: nf_conntrack: don't rely on implicit includes
netfilter: don't include xt and nftables.h in unrelated subsystems
netfilter: nf_conntrack: enable icmp clash support
netfilter: nf_conncount: increase the connection clean up limit to 64
netfilter: nf_conntrack: Add allow_clash to generic protocol handler
netfilter: nf_tables: reset table validation state on abort
====================
====================
Phylink link callback replay helpers for SJA1105 and XPCS
The sja1105 is reducing its direct interaction with the XPCS.
The changes presented here are an older simplification idea, broken out
of a previous patch set to allow for more thorough review.
====================
in order to pick up the second sja1105_set_port_config() and reuse it
for the sja1105_static_config_reload() procedure which involves saving
and restoring MAC and PCS settings.
Now that these settings are restored by phylink itself, the driver no
longer needs to call its own sja1105_set_port_config(), and the
splitting is unnatural. Merge the functions back, which is to say that
the only supported internal code path is to submit the MAC Configuration
Table entry to hardware after phylink has dictated what we should set it
to.
Vladimir Oltean [Mon, 19 Jan 2026 12:19:53 +0000 (14:19 +0200)]
net: dsa: sja1105: let phylink help with the replay of link callbacks
sja1105_static_config_reload() changes major settings in the switch and
it requires a reset. A use case is to change things like Qdiscs (but see
sja1105_reset_reasons[] for full list) while PTP synchronization is
running, and the servo loop must not exit the locked state (s2).
Therefore, stopping and restarting the phylink instances of all ports is
not desirable, because that also stops the phylib state machine, and
retriggers a seconds-long auto-negotiation process that breaks PTP.
Thus, saving and restoring the link management settings is handled
privately by the driver.
The method got progressively more complex as SGMII support got added,
because this is handled through the xpcs phylink_pcs component, to which
we don't have unfettered access. Nonetheless, the switch reset line is
hardwired to also reset the XPCS, creating a situation where it loses
state and needs to be reprogrammed at a moment in time outside phylink's
control.
Although commits 907476c66d73 ("net: dsa: sja1105: call PCS
config/link_up via pcs_ops structure") and 41bf58314b17 ("net: dsa:
sja1105: use phylink_pcs internally") made the sja1105 <-> xpcs
interaction slightly prettier, we still depend heavily on the PCS being
"XPCS-like", because to back up its settings, we read the MII_BMCR
register, through a mdiobus_c45_read() operation, breaking all layering
separation.
With the existence of phylink link callback replay helpers, we can do
away with all this custom code and become even more PCS-agnostic, even
though the reset domain is tightly coupled.
This creates the unique opportunity to simplify away even more code than
just the xpcs handling from sja1105_static_config_reload().
The sja1105_set_port_config() method is also invoked from
sja1105_mac_link_up(). And since that is now called directly by
phylink - we can just remove it from sja1105_static_config_reload().
This makes it possible to re-merge sja1105_set_port_speed() and
sja1105_set_port_config() in a later change.
Note that my only setup with sja1105 where the xpcs is used has the
xpcs on the CPU-facing port (fixed-link). Thus, I cannot test xpcs + PHY.
But the replay procedure walks through all ports, and I did test a
regular RGMII user port + a PHY.
Vladimir Oltean [Mon, 19 Jan 2026 12:19:52 +0000 (14:19 +0200)]
net: phylink: introduce helpers for replaying link callbacks
Some drivers of MAC + tightly integrated PCS (example: SJA1105 + XPCS
covered by same reset domain) need to perform resets at runtime.
The reset is triggered by the MAC driver, and it needs to restore its
and the PCS' registers, all invisible to phylink.
However, there is a desire to simplify the API through which the MAC and
the PCS interact, so this becomes challenging.
Phylink holds all the necessary state to help with this operation, and
can offer two helpers which walk the MAC and PCS drivers again through
the callbacks required during a destructive reset operation. The
procedure is as follows:
Before reset, MAC driver calls phylink_replay_link_begin():
- Triggers phylink mac_link_down() and pcs_link_down() methods
After reset, MAC driver calls phylink_replay_link_end():
- Triggers phylink mac_config() -> pcs_config() -> mac_link_up() ->
pcs_link_up() methods.
MAC and PCS registers are restored with no other custom driver code.
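Based on the procedure described above, a MAC driver with such a
destructive reset might use the helpers roughly as follows. This is a
hedged sketch: the driver structure and reset function are hypothetical,
and the helpers are assumed to take the phylink instance.

#include <linux/phylink.h>

struct my_mac {
	struct phylink *phylink;
	/* ... hardware state ... */
};

/* Hypothetical reset that wipes both MAC and PCS registers. */
int my_mac_hw_reset(struct my_mac *priv);

static int my_mac_destructive_reconfig(struct my_mac *priv)
{
	int err;

	/* Replays mac_link_down()/pcs_link_down() before the reset. */
	phylink_replay_link_begin(priv->phylink);

	err = my_mac_hw_reset(priv);

	/* Replays mac_config() -> pcs_config() -> mac_link_up() ->
	 * pcs_link_up() so MAC and PCS registers are reprogrammed. */
	phylink_replay_link_end(priv->phylink);

	return err;
}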
====================
PHY polarity inversion via generic device tree properties
Using the "rx-polarity" and "tx-polarity" device tree properties
introduced in linux-phy and merged into net-next in
commit 96a2d53f2478 ("Merge tag 'phy_common_properties' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy")
we convert here two existing networking use cases - the EN8811H Ethernet
PHY and the Mediatek LynxI PCS.
Original cover letter:
Polarity inversion (described in patch 4/10) is a feature with at least
4 potential new users waiting for a generic description:
- Horatiu Vultur with the lan966x SerDes
- Daniel Golle with the MaxLinear GSW1xx switches
- Bjørn Mork with the AN8811HB Ethernet PHY
- Me with a custom SJA1105 board, switch which uses the DesignWare XPCS
I became interested in exploring the problem space because I was averse
to the idea of adding vendor-specific device tree properties to describe
a common need.
This set contains an implementation of a generic feature that should
cater to all known needs that were identified during my documentation
phase.
Apart from what is converted here, we also have the following, which I
did not touch:
- "st,px_rx_pol_inv" - its binding is a .txt file and I don't have time
for such a large detour to convert it to dtschema.
- "st,pcie-tx-pol-inv" and "st,sata-tx-pol-inv" - these are defined in a
.txt schema but are not implemented in any driver. My verdict would be
"delete the properties" but again, I would prefer not introducing such
dependency to this series.
====================
Prefer the new "rx-polarity" and "tx-polarity" properties, which in this
case have the advantage that polarity inversion can be specified per
direction (and per protocol, although this isn't useful here).
We use the vendor specific ones as fallback if the standard description
doesn't exist.
Daniel, referring to the Mediatek SDK, clarifies that the combined
SGMII_PN_SWAP_TX_RX register field should be split like this: bit 0 is
TX and bit 1 is RX:
https://lore.kernel.org/linux-phy/aSW--slbJWpXK0nv@makrotopia.org/
Vladimir Oltean [Mon, 19 Jan 2026 09:12:19 +0000 (11:12 +0200)]
net: pcs: pcs-mtk-lynxi: pass SGMIISYS OF node to PCS
The Mediatek LynxI PCS is used from the MT7530 DSA driver (where it does
not have an OF presence) and from mtk_eth_soc, where it does
(Documentation/devicetree/bindings/net/pcs/mediatek,sgmiisys.yaml
informs of a combined clock provider + SGMII PCS "SGMIISYS" syscon
block).
Currently, mtk_eth_soc parses the SGMIISYS OF node for the
"mediatek,pnswap" property and sets a bit in the "flags" argument of
mtk_pcs_lynxi_create() if set.
I'd like to deprecate "mediatek,pnswap" in favour of a property which
takes the current phy-mode into consideration. But this is only known at
mtk_pcs_lynxi_config() time, and not known at mtk_pcs_lynxi_create(),
when the SGMIISYS OF node is parsed.
To achieve that, we must pass the OF node of the PCS, if it exists, to
mtk_pcs_lynxi_create(), and let the PCS take a reference on it and
handle property parsing whenever it wants.
Use the fwnode API which is more general than OF (in case we ever need
to describe the PCS using some other format). This API should be NULL
tolerant, so add no particular tests for the mt7530 case.
Reference the common PHY properties, and update the example to use them.
Note that a PCS subnode exists, and it seems a better container of the
polarity description than the SGMIISYS node that hosts "mediatek,pnswap".
So use that.
====================
airoha: Add the capability to read firmware binary names from dts for Airoha NPU driver
This patch is needed because NPU firmware binaries are board specific since
they depend on the MediaTek WiFi chip used on the board (e.g. MT7996 or
MT7992). This is a preliminary patch to enable MT76 NPU offloading if
the Airoha SoC is equipped with MT7996 (Eagle) WiFi chipset.
====================
Lorenzo Bianconi [Tue, 20 Jan 2026 10:17:18 +0000 (11:17 +0100)]
net: airoha: npu: Add the capability to read firmware names from dts
Introduce the capability to read the firmware binary names from device-tree
using the firmware-name property if available.
This patch is needed because NPU firmware binaries are board specific since
they depend on the MediaTek WiFi chip used on the board (e.g. MT7996 or
MT7992) and the WiFi chip version info is not available in the NPU driver.
This is a preliminary patch to enable MT76 NPU offloading if the Airoha SoC
is equipped with MT7996 (Eagle) WiFi chipset.
Add firmware-name property in order to introduce the capability to
specify the firmware names used for 'RiscV core' and 'Data section'
binaries. This patch is needed because NPU firmware binaries are board
specific since they depend on the MediaTek WiFi chip used on the board
(e.g. MT7996 or MT7992) and the WiFi chip version info is not available
in the NPU driver. This is a preliminary patch to enable MT76 NPU
offloading if the Airoha SoC is equipped with MT7996 (Eagle) WiFi chipset.
====================
netconsole: support automatic target recovery
This patchset introduces a target resume capability to netconsole, allowing
it to recover targets when the underlying low-level interface comes back
online.
The patchset starts by refactoring the netconsole state representation in
order to allow representing deactivated targets (targets that are
disabled because their interface was unregistered).
It then modifies netconsole to handle NETDEV_REGISTER events for such
targets, set up netpoll and force the device UP. Targets are matched with
incoming interfaces depending on how they were bound in netconsole
(by mac or interface name). For this reason, we also attempt resuming
on NETDEV_CHANGENAME.
The patchset includes a selftest that validates netconsole target state
transitions and that the target is functional after being resumed.
====================
Andre Carvalho [Sun, 18 Jan 2026 11:00:27 +0000 (11:00 +0000)]
selftests: netconsole: validate target resume
Introduce a new netconsole selftest to validate that netconsole is able
to resume a deactivated target when the low level interface comes back.
The test sets up the network using netdevsim, creates a netconsole target
and then removes/re-adds netdevsim in order to bring the same interfaces
back. Afterwards, the test validates that the target works as expected.
Targets are created via cmdline parameters to the module to ensure that
we are able to resume targets that were bound by mac and interface name.
Andre Carvalho [Sun, 18 Jan 2026 11:00:26 +0000 (11:00 +0000)]
netconsole: resume previously deactivated target
Attempt to resume a previously deactivated target when the associated
interface comes back (NETDEV_REGISTER) or when it changes name
(NETDEV_CHANGENAME) by calling netpoll_setup on the device.
Depending on how the target was set up (by mac or interface name), the
corresponding field is compared with the device being brought up. Targets
that match the incoming device are scheduled for resume on a workqueue.
Resuming happens on a workqueue as we can't execute netpoll_setup in the
context of the netdev event. A standalone workqueue (as opposed to the
global one) is used to allow for a proper cleanup process during
netconsole module cleanup, as we need to be able to flush all pending
work before traversing the target list, given that targets are temporarily
removed from the list during resume_target.
The target transitions to STATE_DISABLED if resuming it fails, to
avoid retrying the same target indefinitely.
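The flow described above might be sketched like this (hypothetical names
throughout; the real netconsole code keeps per-target state and its own
matching logic, and initializes the work item elsewhere):

#include <linux/netdevice.h>
#include <linux/notifier.h>
#include <linux/workqueue.h>

/* Hypothetical: dedicated workqueue and resume work item. */
static struct workqueue_struct *nc_resume_wq;
static struct work_struct nc_resume_work;

/* Hypothetical: does the device match a deactivated target (MAC or name)? */
bool nc_target_matches(struct net_device *dev);

static int nc_netdev_event(struct notifier_block *nb,
			   unsigned long event, void *ptr)
{
	struct net_device *dev = netdev_notifier_info_to_dev(ptr);

	switch (event) {
	case NETDEV_REGISTER:
	case NETDEV_CHANGENAME:
		/* netpoll_setup() cannot run in this context, so defer the
		 * actual resume to the dedicated workqueue. */
		if (nc_target_matches(dev))
			queue_work(nc_resume_wq, &nc_resume_work);
		break;
	}
	return NOTIFY_DONE;
}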
Andre Carvalho [Sun, 18 Jan 2026 11:00:25 +0000 (11:00 +0000)]
netconsole: introduce helpers for dynamic_netconsole_mutex lock/unlock
This commit introduces two helper functions to perform lock/unlock on
dynamic_netconsole_mutex, providing no-op stub versions when compiled
without CONFIG_NETCONSOLE_DYNAMIC, and refactors existing call sites to
use the new helpers.
This is done following kernel coding style guidelines, in preparation
for an upcoming change. It avoids the need for preprocessor conditionals
in the call site and keeps the logic easier to follow.
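The shape of such helpers could look roughly like this (a sketch of the
pattern; the actual netconsole helpers may differ in naming and placement):

#include <linux/mutex.h>

#ifdef CONFIG_NETCONSOLE_DYNAMIC
static DEFINE_MUTEX(dynamic_netconsole_mutex);

static void dynamic_netconsole_lock(void)
{
	mutex_lock(&dynamic_netconsole_mutex);
}

static void dynamic_netconsole_unlock(void)
{
	mutex_unlock(&dynamic_netconsole_mutex);
}
#else
/* Without dynamic targets there is nothing to protect: no-op stubs keep
 * the call sites free of preprocessor conditionals. */
static void dynamic_netconsole_lock(void) {}
static void dynamic_netconsole_unlock(void) {}
#endif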
Andre Carvalho [Sun, 18 Jan 2026 11:00:24 +0000 (11:00 +0000)]
netconsole: clear dev_name for devices bound by mac
This patch makes sure netconsole clears dev_name for devices bound by mac
in order to allow calling netpoll_setup on targets that have previously
been cleaned up (in order to support resuming deactivated targets).
This is required as netpoll_setup populates dev_name even when devices are
matched via mac address. The cleanup is done inside netconsole as binding
by mac is a netconsole concept.
Breno Leitao [Sun, 18 Jan 2026 11:00:23 +0000 (11:00 +0000)]
netconsole: add STATE_DEACTIVATED to track targets disabled by low level
When the low level interface brings a netconsole target down, record this
using a new STATE_DEACTIVATED state. This allows netconsole to distinguish
between targets explicitly disabled by users and those deactivated due to
interface state changes.
It also enables automatic recovery and re-enabling of targets if the
underlying low-level interfaces come back online.
From a code perspective, anything that is not STATE_ENABLED is disabled.
Devices being (de)enslaved are marked STATE_DISABLED to prevent automatic
resuming, as enslaved interfaces cannot have netconsole enabled.
====================
Add devm_clk_bulk_get_optional_enable() helper and use in AXI Ethernet driver
This patch series introduces a new managed clock framework helper function
and demonstrates its usage in AXI ethernet driver.
Device drivers frequently need to get optional bulk clocks, prepare them,
and enable them during probe, while ensuring automatic cleanup on device
unbind. Currently, this requires three separate operations with manual
cleanup handling.
The new devm_clk_bulk_get_optional_enable() helper combines these
operations into a single managed call, eliminating boilerplate code and
following the established pattern of devm_clk_bulk_get_all_enabled().
====================
Sean Anderson [Fri, 16 Jan 2026 19:27:24 +0000 (00:57 +0530)]
net: xilinx: axienet: Use devres for resource management in probe path
Transition axienet_probe() to managed resource allocation using devm_*
APIs for network device and clock handling, while improving error paths
with dev_err_probe(). This eliminates the need for manual resource
cleanup during probe failures and streamlines the remove() function.
Signed-off-by: Sean Anderson <sean.anderson@linux.dev> Co-developed-by: Suraj Gupta <suraj.gupta2@amd.com> Signed-off-by: Suraj Gupta <suraj.gupta2@amd.com> Link: https://patch.msgid.link/20260116192725.972966-3-suraj.gupta2@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a new managed clock framework helper function that combines getting
optional bulk clocks and enabling them in a single operation.
The devm_clk_bulk_get_optional_enable() function simplifies the common
pattern where drivers need to get optional bulk clocks, prepare and enable
them, and have them automatically disabled/unprepared and freed when the
device is unbound.
This new API follows the established pattern of
devm_clk_bulk_get_all_enabled() and reduces boilerplate code in drivers
that manage multiple optional clocks.
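Usage in a probe path might look roughly like the sketch below. This is an
assumption-laden illustration: it presumes the new helper takes
(dev, num_clks, clks) like devm_clk_bulk_get_optional() and returns 0 or a
negative errno, and the clock names and functions are purely illustrative.

#include <linux/clk.h>
#include <linux/kernel.h>
#include <linux/platform_device.h>

static struct clk_bulk_data my_clks[] = {
	{ .id = "axi" },	/* illustrative clock names */
	{ .id = "ref" },
};

static int my_probe(struct platform_device *pdev)
{
	int ret;

	/* Get, prepare and enable all optional clocks in one managed call;
	 * they are disabled, unprepared and put automatically on unbind. */
	ret = devm_clk_bulk_get_optional_enable(&pdev->dev,
						ARRAY_SIZE(my_clks), my_clks);
	if (ret)
		return dev_err_probe(&pdev->dev, ret, "failed to get clocks\n");

	return 0;
}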
Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Suraj Gupta <suraj.gupta2@amd.com> Reviewed-by: Brian Masney <bmasney@redhat.com> Acked-by: Stephen Boyd <sboyd@kernel.org> Link: https://patch.msgid.link/20260116192725.972966-2-suraj.gupta2@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: remove HIPPI support and RoadRunner HIPPI driver
HIPPI has not been relevant for over two decades. It was rapidly
eclipsed by Fibre Channel, and even when it was new, it was
confined to very high-end hardware. The HIPPI code has only
received tree-wide changes and fixes by inspection in the entire
Git history. Remove HIPPI support and the rrunner HIPPI driver,
and move the former maintainer to the CREDITS file. Keep the
include/uapi/linux/if_hippi.h header because it is used by the TUN
code, and to avoid breaking userspace, however unlikely that may be.
Heiner Kallweit [Sun, 18 Jan 2026 22:16:27 +0000 (23:16 +0100)]
net: phy: simplify PHY fixup registration
Based on the fact that either bus_id-based matching or phy_uid-based
matching is used, the code can be simplified. PHY_ANY_ID and
PHY_ANY_UID are not needed. Ensure that phy_id_compare() is called
only if phy_uid_mask isn't zero, because a zero value would always
result in a match.
In addition change the return value type of phy_needs_fixup() to bool.
Sayantan Nandy [Mon, 19 Jan 2026 07:36:58 +0000 (13:06 +0530)]
net: airoha_eth: increase max MTU to 9220 for DSA jumbo frames
The industry standard jumbo frame MTU is 9216 bytes. When using the DSA
subsystem, a 4-byte tag is added to each Ethernet frame.
Increase AIROHA_MAX_MTU to 9220 bytes (9216 + 4) so that users can set a
standard 9216-byte MTU on DSA ports.
The underlying hardware supports significantly larger frame sizes
(approximately 16K). However, the maximum MTU is limited to 9220 bytes
for now, as this is sufficient to support standard jumbo frames and does
not incur additional memory allocation overhead.
Jakub Kicinski [Mon, 19 Jan 2026 22:41:40 +0000 (14:41 -0800)]
net: add kdoc for napi_consume_skb()
Looks like AI reviewers miss that napi_consume_skb() must have
a real budget passed to it. Let's see if adding a real kdoc will
help them figure this out.
Mingj Ye [Tue, 20 Jan 2026 01:59:49 +0000 (09:59 +0800)]
net: usb: r8152: fix transmit queue timeout
When the TX queue length reaches the threshold, the netdev watchdog
immediately detects a TX queue timeout.
This patch updates the trans_start timestamp of the transmit queue
on every asynchronous USB URB submission along the transmit path,
ensuring that the network watchdog accurately reflects ongoing
transmission activity.
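The idea can be sketched as below (a hypothetical wrapper, not the actual
r8152 code): whenever a TX URB is handed to the USB core, the queue's
trans_start is refreshed so the watchdog sees forward progress.

#include <linux/netdevice.h>
#include <linux/usb.h>

/* Hypothetical wrapper around the driver's async TX URB submission. */
static int my_submit_tx_urb(struct net_device *netdev, struct urb *urb)
{
	int ret = usb_submit_urb(urb, GFP_ATOMIC);

	if (!ret)
		netif_trans_update(netdev);	/* refresh txq trans_start */

	return ret;
}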
Jakub Kicinski [Tue, 20 Jan 2026 18:03:19 +0000 (10:03 -0800)]
selftests: drv-net: fix missing include in ncdevmem
Commit ca9d74eb5f6a ("uapi: add INT_MAX and INT_MIN constants")
recently removed some includes of limits.h in uAPI headers.
ncdevmem.c was depending on them:
ncdevmem.c: In function ‘ethtool_add_flow’:
ncdevmem.c:369:60: error: ‘INT_MAX’ undeclared (first use in this function)
369 | if (endptr == id_start || flow_id < 0 || flow_id > INT_MAX)
| ^~~~~~~
ncdevmem.c:77:1: note: ‘INT_MAX’ is defined in header ‘<limits.h>’; did you forget to ‘#include <limits.h>’?
dt-bindings: net: micrel: Convert micrel-ksz90x1.txt to DT schema
Convert the micrel-ksz90x1.txt to DT schema. Create a separate YAML file
for this PHY series. The old naming of ksz90x1 would be misleading in
this case, so rename it to gigabit, as it contains ksz9xx1 and lan8xxx
gigabit PHYs.
Signed-off-by: Stefan Eichenberger <stefan.eichenberger@toradex.com> Reviewed-by: Rob Herring (Arm) <robh@kernel.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20260116130948.79558-3-eichest@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vadim Fedorenko [Fri, 16 Jan 2026 06:21:21 +0000 (06:21 +0000)]
selftests: drv-net: extend HW timestamp test with ioctl
Extend HW timestamp tests to check that the ioctl interface is not broken
and that configuration setups and requests match the netlink interface.
Some linter warnings are disabled because of ctypes classes.
Eric Dumazet [Fri, 16 Jan 2026 04:13:59 +0000 (04:13 +0000)]
net: split kmalloc_reserve() to allow inlining
kmalloc_reserve() is too big to be inlined.
Put the slow path in a new out-of-line function: kmalloc_pfmemalloc().
Then let kmalloc_reserve() set skb->pfmemalloc only when/if
the slow path is taken.
This makes __alloc_skb() faster :
- kmalloc_reserve() is now automatically inlined by both gcc and clang.
- No more expensive RMW (skb->pfmemalloc = pfmemalloc).
- No more expensive stack canary (for CONFIG_STACKPROTECTOR_STRONG=y).
- Removal of two prefetches that were coming too late for modern cpus.
Text size increase is quite small compared to the cpu savings (~0.7 %)
$ size net/core/skbuff.clang.before.o net/core/skbuff.clang.after.o
text data bss dec hex filename
72507 5897 0 78404 13244 net/core/skbuff.clang.before.o
72681 5897 0 78578 132f2 net/core/skbuff.clang.after.o
Mohsin Bashir [Thu, 15 Jan 2026 00:33:52 +0000 (16:33 -0800)]
eth: fbnic: Remove retry support
The driver retries sensor read requests from firmware, but this is
unnecessary. A functioning firmware should respond to each request
within the timeout period. Remove the retry logic and set the timeout
to the sum of all retry timeouts.
Mohsin Bashir [Thu, 15 Jan 2026 00:33:51 +0000 (16:33 -0800)]
eth: fbnic: Reuse RX mailbox pages
Currently, the RX mailbox frees and reallocates a page for each received
message. Since FW Rx messages are processed synchronously, and nothing
holds these pages (unlike skbs which we hand over to the stack), reuse
the pages and put them back on the Rx ring. Now that we ensure the ring
is always fully populated we don't have to worry about filling it up
after partial population during init, either. Update
fbnic_mbx_process_rx_msgs() to recycle pages after message processing.
Mohsin Bashir [Thu, 15 Jan 2026 00:33:50 +0000 (16:33 -0800)]
eth: fbnic: Allocate all pages for RX mailbox
Now that memory is allocated with GFP_KERNEL, allocation failures
should be extremely rare. Ensure the FW communication ring is
always fully populated with free pages, and hard fail initialization
otherwise. This enables simplifications in next patches.
Mohsin Bashir [Thu, 15 Jan 2026 00:33:49 +0000 (16:33 -0800)]
eth: fbnic: Use GFP_KERNEL to allocate mbx pages
Replace GFP_ATOMIC with GFP_KERNEL for mailbox RX page allocation. Since
the interrupt handler is threaded, GFP_KERNEL is a safe option to reduce
allocation failures.
Also remove __GFP_NOWARN so the kernel reports a warning on allocation
failure to aid debugging.
Jakub Kicinski [Wed, 21 Jan 2026 02:10:04 +0000 (18:10 -0800)]
Merge tag 'net-queue-rx-buf-len-v9' of https://github.com/isilence/linux
Pavel Begunkov says:
====================
Add support for providers with large rx buffer
Many modern NICs support configurable receive buffer lengths, and zcrx and
memory providers can use buffers larger than 4K to improve performance.
When paired with hw-gro larger rx buffer sizes can drastically reduce
the number of buffers traversing the stack and save a lot of processing
time. It also allows to give to users larger contiguous chunks of data.
Single stream benchmarks showed up to ~30% CPU util improvement.
E.g. comparison for 4K vs 32K buffers using a 200Gbit NIC:
This series adds net infrastructure for memory providers configuring
the size and implements it for bnxt. It's an opt-in feature for drivers:
they should advertise support for the parameter in the qops and must check
if the hardware supports the given size. It's limited to memory providers
as it drastically simplifies implementation. It doesn't affect the fast
path zcrx uAPI, and the user exposed parameter is defined in zcrx terms,
which allows it to be flexible and adjusted in the future.
A liburing example can be found at [2]
full branch:
[1] https://github.com/isilence/linux.git zcrx/large-buffers-v8
Liburing example:
[2] https://github.com/isilence/liburing.git zcrx/rx-buf-len
* tag 'net-queue-rx-buf-len-v9' of https://github.com/isilence/linux:
io_uring/zcrx: document area chunking parameter
selftests: iou-zcrx: test large chunk sizes
eth: bnxt: support qcfg provided rx page size
eth: bnxt: adjust the fill level of agg queues with larger buffers
eth: bnxt: store rx buffer size per queue
net: pass queue rx page size from memory provider
net: add bare bone queue configs
net: reduce indent of struct netdev_queue_mgmt_ops members
net: memzero mp params when closing a queue
====================
Florian Westphal [Mon, 19 Jan 2026 11:30:42 +0000 (12:30 +0100)]
netfilter: xt_tcpmss: check remaining length before reading optlen
Quoting reporter:
In net/netfilter/xt_tcpmss.c (lines 53-68), the TCP option parser reads
op[i+1] directly without validating the remaining option length.
If the last byte of the option field is not EOL/NOP (0/1), the code attempts
to index op[i+1]. In the case where i + 1 == optlen, this causes an
out-of-bounds read, accessing memory past the optlen boundary
(either reading beyond the stack buffer _opt or the
following payload).
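The fix boils down to checking the remaining length before dereferencing
the length byte. Below is a hedged sketch of the parsing pattern (not the
exact xt_tcpmss code; the function is illustrative):

#include <net/tcp.h>

/* Walk TCP options in op[0..optlen), bailing out before reading past
 * the end of the option area; returns the MSS value or 0. */
static u16 parse_mss_option(const u8 *op, unsigned int optlen)
{
	unsigned int i = 0;

	while (i < optlen) {
		if (op[i] == TCPOPT_EOL)
			return 0;
		if (op[i] == TCPOPT_NOP) {
			i++;
			continue;
		}
		if (i + 1 >= optlen)	/* no room left for the length byte */
			return 0;
		if (op[i + 1] < 2)	/* malformed option length */
			return 0;
		if (op[i] == TCPOPT_MSS && op[i + 1] == TCPOLEN_MSS &&
		    i + TCPOLEN_MSS <= optlen)
			return (op[i + 2] << 8) | op[i + 3];
		i += op[i + 1];
	}
	return 0;
}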
netfilter: nf_conncount: fix tracking of connections from localhost
Since commit be102eb6a0e7 ("netfilter: nf_conncount: rework API to use
sk_buff directly"), we skip the adding and trigger a GC when the ct is
confirmed. For connections originated from local to local it doesn't
work because the connection is confirmed on POSTROUTING, therefore
tracking on the INPUT hook is always skipped.
In order to fix this, we check whether the skb input ifindex is set to the
loopback ifindex. If it is, then we fall back on a GC plus track operation,
skipping the optimization. This fallback is necessary to avoid
duplicated tracking of a packet train, e.g. 10 UDP datagrams sent in a
burst when initiating the connection.
Tested with xt_connlimit/nft_connlimit and OVS limit and with an HTTP
server and iperf3 in UDP mode.
Fixes: be102eb6a0e7 ("netfilter: nf_conncount: rework API to use sk_buff directly") Reported-by: Michal Slabihoudek <michal.slabihoudek@gooddata.com> Closes: https://lore.kernel.org/netfilter/6989BD9F-8C24-4397-9AD7-4613B28BF0DB@gooddata.com/ Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Florian Westphal <fw@strlen.de>
Florian Westphal [Fri, 19 Aug 2022 14:16:07 +0000 (16:16 +0200)]
netfilter: nft_compat: add more restrictions on netlink attributes
As far as I can see nothing bad can happen when NFTA_TARGET/MATCH_NAME
are too large because this calls x_tables helpers which check for the
length, but it seems better to already reject it during netlink parsing.
The rest of the changes avoid silent u8/u16 truncations.
For _TYPE, it is expected to be only 1 or 0. In the x_tables world, this
variable is set by the kernel: for IPT_SO_GET_REVISION_TARGET it is 1, for
all others it is set to 0.
As older versions of nf_tables permitted any value except 1 to mean 'match',
keep this as-is but sanitize the value for consistency.
Fixes: 0ca743a55991 ("netfilter: nf_tables: add compatibility layer for x_tables") Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Florian Westphal <fw@strlen.de>
Currently, instance_create() uses GFP_ATOMIC because it's called while
holding instances_lock spinlock. This makes allocation more likely to
fail under memory pressure.
Refactor nfqnl_recv_config() to drop RCU lock after instance_lookup()
and peer_portid verification. A socket cannot simultaneously send a
message and close, so the queue owned by the sending socket cannot be
destroyed while processing its CONFIG message. This allows
instance_create() to allocate with GFP_KERNEL_ACCOUNT before taking
the spinlock.
netfilter: nf_conncount: increase the connection clean up limit to 64
After the optimization to only perform one GC per jiffy, a new problem
was introduced. If more than 8 new connections are tracked per jiffy, the
list won't be cleaned up fast enough, possibly reaching the limit
wrongly.
In order to prevent this issue, only skip the GC if it was already
triggered during the same jiffy and the increment is lower than the
clean up limit. In addition, increase the clean up limit to 64
connections to avoid triggering GC too often and do more effective GCs.
This has been tested using an HTTP server and several
performance tools while having nft_connlimit/xt_connlimit or OVS limit
configured.
Output of slowhttptest + OVS limit at 52000 connections:
slow HTTP test status on 340th second:
initializing: 0
pending: 432
connected: 51998
error: 0
closed: 0
service available: YES
Yuto Hamaguchi [Fri, 19 Dec 2025 11:53:51 +0000 (20:53 +0900)]
netfilter: nf_conntrack: Add allow_clash to generic protocol handler
The upstream commit, 71d8c47fc653711c41bc3282e5b0e605b3727956
("netfilter: conntrack: introduce clash resolution on insertion race"),
sets allow_clash=true in the UDP/UDPLITE protocol handler
but does not set it in the generic protocol handler.
As a result, packets composed of connectionless protocols at each layer,
such as UDP over IP-in-IP, still drop packets due to conflicts during conntrack insertion.
To resolve this, this patch sets allow_clash in the nf_conntrack_l4proto_generic.
Florian Westphal [Fri, 28 Nov 2025 11:26:54 +0000 (12:26 +0100)]
netfilter: nf_tables: reset table validation state on abort
If a transaction fails the final validation in the commit hook, the table
validation state is changed to NFT_VALIDATE_DO and a replay of the batch is
performed. Every rule insert will then do a graph validation.
This is much slower, but provides better error reporting to the user
because we can point at the rule that introduces the validation issue.
Without this reset the affected table(s) remain in full validation mode,
i.e. on the next transaction we start in slow mode.
This makes the next transaction after a failed incremental update very slow:
# time iptables-restore < /tmp/ruleset
real 0m0.496s [..]
# time iptables -A CALLEE -j CALLER
iptables v1.8.11 (nf_tables): RULE_APPEND failed (Too many links): rule in chain CALLEE
real 0m0.022s [..]
# time iptables-restore < /tmp/ruleset
real 1m22.355s [..]
After this patch, 2nd iptables-restore is back to ~0.5s.
Fixes: 9a32e9850686 ("netfilter: nf_tables: don't write table validation state without mutex") Signed-off-by: Florian Westphal <fw@strlen.de>
====================
netkit: Support for io_uring zero-copy and AF_XDP
Containers use virtual netdevs to route traffic from a physical netdev
in the host namespace. They do not have access to the physical netdev
in the host and thus can't use memory providers or AF_XDP that require
reconfiguring/restarting queues in the physical netdev.
This patchset adds the concept of queue leasing to virtual netdevs that
allow containers to use memory providers and AF_XDP at native speed.
Leased queues are bound to a real queue in a physical netdev and act
as a proxy.
Memory providers and AF_XDP operations take an ifindex and queue id,
so containers would pass in an ifindex for a virtual netdev and a queue
id of a leased queue, which then gets proxied to the underlying real
queue.
We have implemented support for this concept in netkit and tested the
latter against Nvidia ConnectX-6 (mlx5) as well as Broadcom BCM957504
(bnxt_en) 100G NICs. For more details see the individual patches.
====================
David Wei [Thu, 15 Jan 2026 08:26:03 +0000 (09:26 +0100)]
selftests/net: Add netkit container tests
Add two tests using NetDrvContEnv. One basic test that sets up a netkit
pair, with one end in a netns. Use LOCAL_PREFIX_V6 and nk_forward BPF
program to ping from a remote host to the netkit in netns.
Second is a selftest for netkit queue leasing, using io_uring zero copy
test binary inside of a netns with netkit. This checks that memory
providers can be bound against virtual queues in a netkit within a
netns that are leasing from a physical netdev in the default netns.
Signed-off-by: David Wei <dw@davidwei.uk> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260115082603.219152-17-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
David Wei [Thu, 15 Jan 2026 08:26:02 +0000 (09:26 +0100)]
selftests/net: Make NetDrvContEnv support queue leasing
Add a new parameter `lease` to NetDrvContEnv that sets up queue leasing
in the env.
The NETIF also has some ethtool parameters changed to support memory
provider tests. This is needed in NetDrvContEnv rather than in individual
test cases since the cleanup to restore NETIF can't be done until the
netns in the env is gone.
Signed-off-by: David Wei <dw@davidwei.uk> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260115082603.219152-16-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
David Wei [Thu, 15 Jan 2026 08:26:01 +0000 (09:26 +0100)]
selftests/net: Add env for container based tests
Add an env NetDrvContEnv for container based selftests. This automates
the setup of a netns, netkit pair with one inside the netns, and a BPF
program that forwards skbs from the NETIF host inside the container.
Currently only netkit is used, but other virtual netdevs e.g. veth can
be used too.
Expect netkit container datapath selftests to have a publicly routable
IP prefix to assign to netkit in a container, such that packets will
land on eth0. The BPF skb forward program will then forward such packets
from the host netns to the container netns.
Signed-off-by: David Wei <dw@davidwei.uk> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260115082603.219152-15-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
David Wei [Thu, 15 Jan 2026 08:26:00 +0000 (09:26 +0100)]
selftests/net: Add bpf skb forwarding program
Add nk_forward.bpf.c, a BPF program that forwards skbs matching some IPv6
prefix received on eth0 ifindex to a specified netkit ifindex. This will
be needed by netkit container tests.
Signed-off-by: David Wei <dw@davidwei.uk> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260115082603.219152-14-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Daniel Borkmann [Thu, 15 Jan 2026 08:25:59 +0000 (09:25 +0100)]
netkit: Add xsk support for af_xdp applications
Enable support for AF_XDP applications to operate on a netkit device.
The goal is that AF_XDP applications can natively consume AF_XDP
from network namespaces. The use-case from Cilium side is to support
Kubernetes KubeVirt VMs through QEMU's AF_XDP backend. KubeVirt is a
virtual machine management add-on for Kubernetes which aims to provide
a common ground for virtualization. KubeVirt spawns the VMs inside
Kubernetes Pods which reside in their own network namespace just like
regular Pods.
Raw QEMU AF_XDP backend example with eth0 being a physical device with
16 queues where netkit is bound to the last queue (for multi-queue, an RSS
context can be used if supported by the driver):
# ethtool -X eth0 start 0 equal 15
# ethtool -X eth0 start 15 equal 1 context new
# ethtool --config-ntuple eth0 flow-type ether \
src 00:00:00:00:00:00 \
src-mask ff:ff:ff:ff:ff:ff \
dst $mac dst-mask 00:00:00:00:00:00 \
proto 0 proto-mask 0xffff action 15
[ ... setup BPF/XDP prog on eth0 to steer into shared xsk map ... ]
# ip netns add foo
# ip link add numrxqueues 2 nk type netkit single
# ./pyynl/cli.py --spec ~/netlink/specs/netdev.yaml \
--do queue-create \
--json "{"ifindex": $(ifindex nk), "type": "rx", \
"lease": { "ifindex": $(ifindex eth0), \
"queue": { "type": "rx", "id": 15 } } }"
{'id': 1}
# ip link set nk netns foo
# ip netns exec foo ip link set lo up
# ip netns exec foo ip link set nk up
# ip netns exec foo qemu-system-x86_64 \
-kernel $kernel \
-drive file=${image_name},index=0,media=disk,format=raw \
-append "root=/dev/sda rw console=ttyS0" \
-cpu host \
-m $memory \
-enable-kvm \
-device virtio-net-pci,netdev=net0,mac=$mac \
-netdev af-xdp,ifname=nk,id=net0,mode=native,queues=1,start-queue=1,inhibit=on,map-path=$dir/xsks_map \
-nographic
We have tested the above against a dual-port Nvidia ConnectX-6 (mlx5)
100G NIC with successful network connectivity out of QEMU. An earlier
iteration of this work was presented at LSF/MM/BPF [0] and more
recently at LPC [1].
As a first starting point to connect all things with KubeVirt, bind
mounting the xsk map from Cilium into the VM launcher Pod (which acts
as a regular Kubernetes Pod), while not perfect, is not a big problem
given it is out of reach of the application sitting inside the VM (and
some of the control plane aspects are baked into the launcher Pod
already), so the isolation barrier is still the VM.
Eventually the goal is to have a XDP/XSK redirect extension where
there is no need to have the xsk map, and the BPF program can just
derive the target xsk through the queue where traffic was received
on.
The exposure through netkit is because Cilium should not act as a
proxy handing out xsk sockets. Existing applications expect a netdev
from the kernel side and should not need to be rewritten just to implement
against a CNI's protocol. Also, all the memory should not be accounted
against Cilium but rather against the application Pod itself, which is
consuming AF_XDP. Further, on up/downgrades we expect the data plane to be
completely decoupled from the control plane; if Cilium owned the
sockets, that would be disruptive. Another use case which this opens up,
and which users regularly ask for, would be to have DPDK applications on
top of AF_XDP in regular Kubernetes Pods.